Lead HPC Infrastructure Engineer

San Francisco, California, USA

We are seeking a highly accomplished engineer to take ownership of the operations and optimization of next-generation NVIDIA GB200 and GB300 GPU clusters. This role sits at the intersection of high-performance computing and AI infrastructure, where precision, automation, and scale meet innovation.

You will shape and maintain the reliability of some of the most advanced computing systems ever built, using Linux, Kubernetes, Terraform, Ansible, and Helm to enable seamless, intelligent operations.

This is a rare opportunity to work on cutting-edge GPU infrastructure, solving complex challenges that push the boundaries of performance and efficiency.

Office Travel: Frequent on-site work is required for this position (2–3 days/week) at our Santa Clara, CA office.

Job responsibilities

  • You will take ownership of mission-critical NVIDIA GB200 and GB300 clusters, ensuring their reliability, performance, and continuous operation.
  • You will act as the first responder and escalation point for operational issues, leading response efforts with calm and technical precision.
  • You will design, develop, and maintain Infrastructure as Code (IaC) solutions that enable automation, diagnostics, and deployment across Slurm and Kubernetes environments.
  • You will proactively analyze system logs, metrics, and telemetry to identify subtle anomalies, anticipate failures, and prevent service degradation.
  • You will perform deep, system-wide diagnostics on Grace Blackwell Superchips and NVLink fabric, driving root cause analysis and continuous improvement.
  • You will document operational knowledge by creating troubleshooting guides, procedures, and runbooks for complex or novel incidents.
  • You will lead and coordinate incident management efforts, collaborating with engineering teams and external partners to restore system stability.
  • You will mentor early-career engineers, promoting a culture of learning, ownership, and operational excellence.
  • You will communicate asynchronously and effectively, providing clear, detailed, and actionable updates to global teams.
  • You will maintain accountability and focus in a 12x7 on-call rotation, ensuring fast, accurate support for cluster operations.

Job qualifications

Technical Skills

  • You bring deep expertise in Linux systems engineering, including kernel-level troubleshooting and performance analysis.
  • You bring hands-on experience with the Slurm workload scheduler and with Kubernetes (K8s) for orchestration and resource allocation.
  • You build automation and Infrastructure as Code with Terraform, Ansible, and Helm, ensuring consistency across large-scale environments.
  • You have advanced scripting proficiency in Python and Bash for automation, data parsing, and diagnostic tooling.
  • You understand GPU compute architecture, NVLink, InfiniBand, and collective communication libraries (MPI, NCCL) at an expert operational level.
  • You have experience supporting frontline HPC operations in national laboratories, cloud providers, or large-scale technology organizations.

Professional Skills

  • You demonstrate strong ownership and accountability in high-stakes, time-sensitive environments.
  • You collaborate effectively across engineering, operations, and partner teams to solve critical challenges.
  • You apply structured problem-solving to diagnose and resolve undocumented or complex failures.
  • You communicate clearly and concisely, translating technical depth into clarity for both technical and non-technical audiences.
  • You work autonomously and asynchronously, managing ambiguity with focus and precision.
  • You mentor and uplift others, fostering continuous learning and a shared culture of operational excellence.

Other things to know

Learning & Development

There is no one-size-fits-all career path at Thoughtworks: how you develop your career is entirely up to you. We balance that autonomy with the strength of our cultivation culture, which means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best, and that extends to empowering our employees in their career journeys.

Job Details

Country: USA
City: San Francisco, California
Date Posted: 10-07-2025
Industry: Information Technology
Employment Type: Regular

About Thoughtworks

Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment to continuous learning to Thoughtworks. Together, let’s be extraordinary.

Salary

Benefits: https://www.thoughtworks.com/en-us/careers/benefits

The annual salary range posted is subject to many factors and may vary depending on experience, geographic location, job responsibilities, performance, skills and/or training.

Salary range: $169,000–$270,000 USD
