We are looking for a curious and driven engineer eager to step into the world of high-performance computing and AI infrastructure. In this role, you’ll gain hands-on experience supporting NVIDIA GPU clusters and automation pipelines that power some of the world’s most advanced AI workloads. Working alongside seasoned engineers, you’ll learn to apply Linux, Kubernetes, Terraform, and Prometheus in real-world environments where precision and scale truly matter.
If you’re passionate about technology that defines the future of computing, this is your chance to grow within a team shaping that frontier.
Office Travel: Frequent on-site work is required for this position (2–3 days/week) at our Santa Clara, CA office.
Job responsibilities
- You will act as the initial responder to monitoring alerts, ensuring timely acknowledgment and preliminary triage of operational issues.
- You will automate operational procedures and diagnostics using established scripting and Infrastructure as Code (IaC) tools, including Bash, Python, Ansible, Terraform, and Helm, under the guidance of senior engineers.
- You will execute foundational diagnostics such as NCCL tests, DCGM (Data Center GPU Manager) diagnostics, fabric diagnostics, and designated test workloads for training and inference, following standard procedures.
- You will apply a proactive and action-oriented mindset, resolving documented issues efficiently and suggesting improvements to runbooks or automation scripts based on recurring patterns.
- You will analyze and interpret diagnostic outputs to assess system health and identify early signs of degradation or instability.
- You will document all operational activities, system status changes, and troubleshooting steps with accuracy, clarity, and timeliness.
- You will use observability tools such as Prometheus and Grafana to analyze logs and metrics, supporting senior engineers in the root cause isolation process.
- You will develop hands-on familiarity with HPC workload management tools, including Slurm and/or Kubernetes.
- You will actively participate in training sessions and knowledge-sharing initiatives to deepen your understanding of the GB200/GB300 architecture and operational best practices.
- You will maintain a high level of discipline, attention to detail, and consistency across all operational tasks.
Job qualifications
Technical Skills
- You have foundational knowledge of Linux operating systems and are comfortable with the Unix command line, including using awk, Bash, and Python for log parsing and basic automation.
- You are familiar with or have exposure to HPC systems, including HPC schedulers (e.g., Slurm) or container orchestration tools (e.g., Kubernetes).
- You are comfortable using observability platforms such as Prometheus and Grafana for log and metric visualization.
- You are familiar with Infrastructure as Code (IaC) concepts and can execute automation using tools like Ansible or Terraform.
- You have familiarity with GPU-based workloads and are eager to deepen your understanding of AI and HPC operations.
Professional Skills
- You demonstrate strong analytical ability and can follow complex procedures while interpreting technical results (e.g., NCCL tests).
- You communicate with clarity and accuracy, producing clear documentation and reports for both peers and senior engineers.
- You collaborate effectively with cross-functional teams, embracing mentorship and continuous feedback.
- You bring curiosity, persistence, and discipline, with a strong desire to learn and grow in advanced HPC operations.
- You work with attention to detail, ensuring consistency and accuracy in every task you undertake.
- You thrive in an environment that values learning, precision, and shared ownership.
Growth Expectation
We value curiosity and a growth mindset. Candidates are expected to bring a strong foundation in Linux and scripting from academic or prior professional experience.
Proficiency in advanced scripting, IaC practices, and observability tooling (e.g., Prometheus, Grafana) may be developed within the first six months through structured on-the-job training and mentorship from senior engineers.
Other things to know
Learning & Development
There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.
Job Details
Country: USA
City: San Francisco, California
Date Posted: 10-07-2025
Industry: Information Technology
Employment Type: Regular
About Thoughtworks
Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment to continuous learning to Thoughtworks. Together, let’s be extraordinary.
Salary
Benefits: https://www.thoughtworks.com/en-us/careers/benefits
The annual salary range posted is subject to many factors and may vary depending on experience, geographic location, job responsibilities, performance, skills and/or training.
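Thank you for your interest in joining Thoughtworks. A member of our recruiting team will review your application as soon as possible.
In the meantime, you can visit our Life as a Consultant page to learn more about the extraordinary impact Thoughtworkers have on clients, the technology industry, and what we create together.
Please note that we value privacy: all information you submit to us through your online application will be kept confidential by Thoughtworks.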