Data Center Systems Operations Engineer
Posted 2025-10-26
Remote, USA
Full Time
Immediate Start
We're here to help the smartest minds on the planet build Superintelligence. The labs pushing the edge? They run on Lambda. Our gear trains and serves their models, our infrastructure scales with them, and we move fast to keep up. If you want to work on massive, world-changing AI deployments with people who love action and hard problems, we're the place to be. If you'd like to build the world's best deep learning cloud, join us.

*Note: This position prefers presence in our Bay Area office locations, but is open to remote presence for the right candidate.

About the Job

As Lambda continues to scale its AI platform and customer base, infrastructure decisions must be tightly aligned with product roadmaps, platform growth, and fiscal discipline. The Systems Operations Engineer will own availability analysis, long-term improvement of utilization, input into strategic design, and implementation of key programs across the entire infrastructure stack. This role sits within the Data Center Infrastructure (DC Infra) team and will work cross-functionally with Product, Platform Engineering, and Observability to understand overall health, analyze ongoing and potential issues, recommend and make changes to our overall design, and own key programs that improve the overall business. This position is a critical link between the HPC/hardware systems and DC Infra, and will help ensure our designs and operations maximize availability and reliability across our entire platform.
What You'll Do

Availability Analysis
- Own end-to-end unification of availability ("number of nines") calculations across Lambda's data center products and footprints, from power/BMS/cooling down to the rack/GPU level, and provide adequate telemetry back to facilities, site operations, and the platform level
- Work with the thermal/hardware team to understand how AI workloads stress mechanical systems, and where different BMS control methodologies are needed as Direct-to-Chip Liquid Cooling (DLC) technologies improve and densities increase
- Coordinate across the DC Infra team to calculate estimated availabilities for new data center designs
- Work with product teams and capacity forecasting to understand how design decisions affecting availability impact time to market and satisfy customer needs

Utilization Analysis and Oversubscription Strategy
- Own end-to-end utilization analysis across Lambda's entire data center infrastructure
- Analyze DC designs to understand peak possible capacity under varying conditions
- Build an oversubscription strategy and lead/own the company workstream to maximize available MW without impacting GPU reliability or customer experience
- Ensure appropriate availability considerations are included

Observability and Analytics
- Coordinate with the observability team to ensure the right points are monitored to understand data center loads and characteristics, especially under AI workloads
- Help the team understand approximate warning and danger levels
- Use observations and warning/danger levels to inform the basis of design (BOD) for future data centers and suggest upgrades or modifications to current data centers
- Develop a strategy for a data center fleet health dashboard
- Help provide structure so that overall day-to-day and long-term health can be understood at a 20,000-foot level, with the ability to drill down into the details

Power Capping Strategy and Implementation
- Coordinate with the Site Operations team to strategize and build out power capping capabilities for worst-case scenario response and protection as we start aggressively employing oversubscription
- Identify appropriate IT blocks where real-time data is monitored
- Analyze, propose, and implement a rigorous testing process that iteratively finds and eliminates stranded power and cooling capacity related to utilization

Site Selection Technical Review
- Conduct end-to-end technical evaluations of prospective data center sites, including power sufficiency and stability, cooling infrastructure and mechanical systems, and network topology feasibility
- Perform risk assessments and recommend sites based on infrastructure fit and growth capacity
- Coordinate with DC Infra, Legal, and Business Strategy teams to ensure site selections align with workload and deployment timelines

Cluster-to-Facility Requirements Alignment
- Collaborate with the HPC Architecture team and Capacity Manager to translate cluster-level hardware and workload requirements into facility-level specifications
- Define infrastructure interface requirements (power, cooling, rack layouts, interconnects, monitoring) to ensure alignment between the compute stack and facility capabilities
- Support long-term infrastructure roadmap development to accommodate future hardware designs, density shifts, and workload patterns
- Work with the Capacity Manager to understand the various levers that can be employed to accelerate growth during demand surges
You
- Self-starter with a proven ability to independently dive into the details to understand and solve hard problems across data center infrastructure and operations
- Ability to provide world-class analysis, boiling complex issues down to the root cause or a few key drivers
- 10+ years of experience working directly in, or closely with, data center infrastructure and HPC/hardware operations
- Deep familiarity with AI or compute workload patterns, scaling dynamics, and infrastructure cost drivers
- Ability to synthesize complex technical and business inputs into clear, actionable strategic recommendations
- Excellent communication and collaboration skills across technical, operational, and financial stakeholders

Preferred Experience
- Prior experience in hyperscale or cloud infrastructure environments
- Familiarity with GPU cluster sizing, workload forecasting, or energy-efficient compute architectures
- Working knowledge of typical data center infrastructure designs, topologies, and systems, and the associated reliability/availability calculations
- Knowledge of DCIM tools, telemetry systems, or utilization analytics platforms
- Engineering degree; Master's preferred
- Experience working across multi-disciplinary and non-technical teams to explain findings

Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda
- Founded in 2012; ~400 employees (2025) and growing fast
- We offer generous cash & equity compensation
- Our investors include Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, US Innovative Technology, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
- We are experiencing extremely high demand for our systems, with quarter-over-quarter, year-over-year profitability
- Our research papers have been accepted into top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
- Health, dental, and vision coverage for you and your dependents
- Wellness and commuter stipends for select roles
- 401k plan with 2% company match (USA employees)
- Flexible Paid Time Off plan that we all actually use

A Final Note:

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.