
NVIDIA DCGM Exporter

Published: Nov 05, 2025
Nov 2025
Trial

NVIDIA DCGM Exporter is an open-source tool that helps teams monitor distributed GPU training at scale. It converts proprietary telemetry from the NVIDIA Data Center GPU Manager (DCGM) into open formats compatible with standard monitoring systems. The Exporter exposes critical real-time metrics, including GPU utilization, temperature, power and ECC error counts, from both GPUs and host servers. This visibility is essential for organizations fine-tuning custom LLMs or running long-duration, GPU-intensive training jobs, where the straggler effect (one slow worker bottlenecking the entire process) can reduce throughput by over 10% and waste up to 45% of allocated GPU hours. Designed for cloud-native, large-scale environments, the DCGM Exporter integrates with Prometheus and Grafana, helping teams check that every GPU operates within expected performance bounds.
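To make the monitoring idea concrete, here is a minimal Python sketch of how exported metrics might be scanned for a GPU that is out of step with its peers. It parses a small, made-up sample in the Prometheus text format rather than scraping a live exporter. The metric name `DCGM_FI_DEV_GPU_UTIL` and the `gpu` and `Hostname` labels follow the exporter's usual output but should be checked against your deployment; the helper names, the sample values and the 20-point threshold are assumptions for illustration only.

```python
import re
from statistics import median

# Illustrative sample in Prometheus text exposition format. Values are made up.
SAMPLE = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 96
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b"} 61
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-b"} 98
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 64
"""

LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metric(text, metric):
    """Return {(hostname, gpu): value} for one metric name (hypothetical helper)."""
    readings = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = LINE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("Hostname", "?"), labels.get("gpu", "?"))
        readings[key] = float(match.group("value"))
    return readings


def find_outliers(readings, max_gap=20.0):
    """Flag GPUs whose reading differs from the fleet median by more than max_gap.

    The threshold is an assumed value, not a recommendation from the source.
    """
    if not readings:
        return []
    mid = median(readings.values())
    return sorted(k for k, v in readings.items() if abs(v - mid) > max_gap)


util = parse_metric(SAMPLE, "DCGM_FI_DEV_GPU_UTIL")
print(find_outliers(util))
```

An outlier flagged this way is only a hint that a worker may be straggling; in practice such a check would more likely be expressed as a Prometheus alerting rule or a Grafana panel than as a standalone script.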
