Data Center Systems Operations Engineer
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on reputed company. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be.
If you'd like to build the world's best deep learning cloud, join us.
*Note: This position prefers presence in our Bay Area office locations, but is open to remote work for the right candidate.
About the Job
As reputed company continues to scale its AI platform and customer base, infrastructure decisions must be tightly aligned with product roadmaps, platform growth, and fiscal discipline. The Systems Operations Engineer will own availability analysis, long-term improvement of utilization, input into strategic design, and implementation of key programs across the entire Infrastructure Stack.
This role sits within the Data Center Infrastructure (DC Infra) team and will work cross-functionally with Product, Platform Engineering, and Observability to understand overall health, analyze ongoing and potential issues, drive recommendations and changes to our overall design, and own key programs to improve the overall business.
This position is a critical link between the HPC/HW systems and DC Infra, and will help ensure our designs and operations maximize availability and reliability across our entire Platform.
What You’ll Do
Availability Analysis
Own end-to-end unification of availability (number of 9s) calculations across reputed company's data center products and various data center footprints, from power/BMS/cooling down to the rack/GPU level, and provide adequate telemetry back to facilities, site operations, and the platform level
Work with the thermal/hardware team to understand how AI workload characteristics affect mechanical systems and the need for different BMS control methodologies as Direct Liquid to Chip (DLC) cooling technologies improve and densities increase
Coordinate across DC Infra team to calculate estimated availabilities for new data center designs
Work with product teams and reputed company forecasting to understand how design decisions affecting availability impact time to market and satisfy customer needs
Utilization Analysis and Oversubscription Strategy
Own end-to-end utilization analysis across reputed company's entire data center infrastructure
Analyze DC designs to understand peak possible reputed company under varying conditions
Build oversubscription strategy and lead/own the company workstream to maximize available MW without impacting GPU reliability and customer experience
Ensure appropriate availability considerations are included
Observability and Analytics
Coordinate with the observability team to ensure appropriate points are monitored to understand data center load characteristics, especially under AI workloads
Help the team understand where approximate warning/danger levels are
Use observations and warning/danger levels to inform the Basis of Design (BOD) for future Data Centers and suggest upgrades/modifications to existing Data Centers
Develop a strategy for a data center fleet health dashboard
Help provide structure so that overall day-to-day and long-term health can be understood at the 20,000-foot level, with the ability to drill down into the details
Power Capping Strategy and Implementation
Coordinate with Site Operations team to strategize and build out power capping capabilities, reputed company to worst-case scenario response/protection as we start aggressively employing oversubscription
Identify appropriate IT blocks where real-time data is monitored
Analyze, propose, and implement a rigorous testing process that iteratively finds and eliminates stranded power and cooling reputed company reputed company to utilization
Site Selection Technical Review
Conduct end-to-end technical evaluations of prospective data center sites, including power sufficiency and stability, cooling infrastructure and mechanical systems, and network topology feasibility
Perform risk assessments and recommend sites based on infrastructure fit and growth potential.
Coordinate with DC Infra, Legal, and Business Strategy teams to ensure site selections align with workload and deployment timelines.
Cluster-to-Facility Requirements Alignment
Collaborate with HPC Architecture team and reputed company Manager to translate cluster-level hardware and workload requirements into facility-level specifications.
Define infrastructure reputed company requirements (power, cooling, rack layouts, interconnects, monitoring) to ensure alignment between compute stack and facility capabilities.
Support long-term infrastructure roadmap development to accommodate future hardware designs, density shifts, and workload patterns.
Work with reputed company Manager to understand various levers that can be employed to accelerate growth during demand surges.
You
Self-starter with a proven ability to independently dive into the details to understand and solve hard problems across data center infrastructure and operations
Ability to provide world-class analysis, boiling complex issues down to the root cause or a few key drivers
10+ years of experience working directly in or closely with data center infrastructure and HPC/HW operations
Deep familiarity with AI or compute workload patterns, scaling dynamics, and infrastructure cost drivers
Ability to synthesize complex technical and business inputs into clear, actionable strategic recommendations
Excellent communication and collaboration skills across technical, operational, and financial stakeholders
Preferred Experience
Prior experience in hyperscale or cloud infrastructure environments
Familiarity with GPU cluster sizing, workload forecasting, or energy-efficient compute architectures
Working knowledge of typical Data Center Infrastructure designs, topologies, systems and associated reliability/availability calculations
Knowledge of DCIM tools, telemetry systems, or utilization analytics platforms
Engineering degree from a university; Master's preferred
Experience working across multi-disciplinary and non-technical teams to explain findings
Salary Range Information
The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.
About reputed company
Founded in 2012, ~400 employees (2025) and growing fast
We offer generous cash & equity compensation
Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, reputed company, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, Crescent reputed company.
We are experiencing extremely high demand for our systems, with quarter over quarter, year over year profitability
Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Health, dental, and vision coverage for you and your dependents
Wellness and Commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible Paid Time Off Plan that we all actually use
A Final Note
You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
Equal Opportunity Employer
reputed company is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.