Technology Radar
Published: Nov 05, 2025
Nov 2025
Trial
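NVIDIA DCGM Exporter is an open-source tool that helps teams monitor distributed GPU training at scale. It converts proprietary telemetry from the NVIDIA Data Center GPU Manager (DCGM) into open formats that standard monitoring systems can consume. The Exporter exposes key real-time metrics, including GPU utilization, temperature, power consumption and ECC error counts, covering both the GPUs and their host servers. This visibility is essential for organizations that fine-tune custom LLMs or run long, GPU-intensive training jobs. The straggler effect, in which a single slow worker throttles the entire pipeline, can reduce throughput by more than 10% and waste up to 45% of allocated GPU hours. DCGM Exporter is designed for cloud-native, large-scale environments and integrates seamlessly with Prometheus and Grafana, helping ensure that every GPU operates within its optimal performance range.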
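To make the straggler idea concrete, here is a minimal sketch of how utilization exposed in the Prometheus text format might be scanned for lagging GPUs. It is an illustration under stated assumptions, not how DCGM Exporter or Prometheus detects stragglers: the sample payload, the host names, the `parse_metric` and `find_stragglers` helpers and the 0.8 tolerance are all hypothetical, and low utilization relative to the fleet median is only a rough proxy for a slow worker. The metric name `DCGM_FI_DEV_GPU_UTIL` and the `gpu` and `Hostname` labels follow DCGM Exporter's usual naming, but check them against your own deployment.

```python
import re
from statistics import median

# Hypothetical scrape output in the Prometheus text format (values are invented).
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 95
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-b"} 96
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-b"} 58
"""

# Matches lines of the form: name{label="value",...} number
LINE = re.compile(r"^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metric(text, metric):
    """Return {(hostname, gpu): value} for one metric name (hypothetical helper)."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        key = (labels.get("Hostname", ""), labels.get("gpu", ""))
        samples[key] = float(match.group("value"))
    return samples


def find_stragglers(utilization, tolerance=0.8):
    """Flag GPUs whose utilization is below tolerance * fleet median (a heuristic)."""
    if not utilization:
        return []
    cutoff = median(utilization.values()) * tolerance
    return sorted(key for key, value in utilization.items() if value < cutoff)


util = parse_metric(SAMPLE, "DCGM_FI_DEV_GPU_UTIL")
stragglers = find_stragglers(util)
print(stragglers)
```

In a real setup this comparison would more likely live in a Prometheus alerting rule or a Grafana panel than in a script; the sketch only shows the shape of the check.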