Published: Nov 05, 2025
Trial

NVIDIA DCGM Exporter is an open-source tool that helps teams monitor distributed GPU training at scale. It converts proprietary telemetry from the NVIDIA Data Center GPU Manager (DCGM) into open formats compatible with standard monitoring systems. The Exporter exposes critical real-time metrics, including GPU utilization, temperature, power and ECC error counts, from both GPUs and host servers. This visibility is essential for organizations fine-tuning custom LLMs or running long-duration, GPU-intensive training jobs, where the straggler effect (one slow worker bottlenecking the entire process) can reduce throughput by over 10% and waste up to 45% of allocated GPU hours. Designed for cloud-native, large-scale environments, the DCGM Exporter integrates with Prometheus and Grafana, helping teams confirm that every GPU operates within its expected performance bounds.
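
To make the straggler check concrete, here is a minimal sketch in Python that parses Prometheus-style text output and flags GPUs whose utilization falls well below the fleet median. Several things here are assumptions for illustration, not part of the blip: the sample payload is hand-written rather than captured from a real exporter, the metric name `DCGM_FI_DEV_GPU_UTIL` and the `gpu` and `Hostname` labels reflect the exporter's default configuration as we understand it, and the helper names and the 20-point threshold are hypothetical. Low utilization is only one possible straggler signal, so treat this as a starting point rather than a definitive detector.

```python
import re
from statistics import median

# Hand-written sample in the Prometheus text format (not real exporter output).
# Metric and label names are assumed to match the exporter's defaults.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 96
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b"} 95
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-b"} 61
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 64
"""

# name{labels} value ; comment lines start with '#' and never match.
LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metric(text, name):
    """Return {(hostname, gpu): value} for every sample of one metric."""
    readings = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match or match.group(1) != name:
            continue
        labels = dict(LABEL.findall(match.group(2)))
        key = (labels.get("Hostname", "?"), labels.get("gpu", "?"))
        readings[key] = float(match.group(3))
    return readings


def find_stragglers(readings, max_gap=20.0):
    """Flag GPUs whose value is more than max_gap below the median.

    max_gap is a hypothetical threshold; tune it for the workload.
    """
    if not readings:
        return []
    mid = median(readings.values())
    return sorted(key for key, value in readings.items() if mid - value > max_gap)


util = parse_metric(SAMPLE, "DCGM_FI_DEV_GPU_UTIL")
stragglers = find_stragglers(util)
print(stragglers)
```

In a real deployment the same comparison would more likely live in a Prometheus alerting rule or a Grafana panel than in a script; the sketch only shows the logic that such a rule would encode.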
