Configure GPU monitoring with Container insights and/or Managed Prometheus

Container insights supports monitoring GPU clusters from the following GPU vendors:

- NVIDIA
- AMD

Note

If you're using the NVIDIA DCGM exporter, you can enable GPU monitoring with Managed Prometheus and Managed Grafana. For setup instructions, see Enable GPU monitoring with Nvidia DCGM exporter.

Container insights automatically starts monitoring GPU usage on nodes and on GPU-requesting pods and workloads. It collects the following metrics at 60-second intervals and stores them in the InsightsMetrics table.

Note

After you provision clusters with GPU nodes, ensure that the GPU driver is installed as required by Azure Kubernetes Service (AKS) to run GPU workloads. Container insights collects GPU metrics through GPU driver pods running on the node.
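Once the GPU metrics are being collected, you can retrieve them from the Log Analytics workspace that backs Container insights. The following is a minimal sketch, not part of the official guidance, assuming the azure-monitor-query and azure-identity Python packages and a placeholder workspace ID; it queries one of the metric names listed in the table that follows.

```python
# Minimal sketch: query GPU metrics collected by Container insights from the
# Log Analytics workspace. Assumes azure-monitor-query and azure-identity are
# installed; WORKSPACE_ID is a placeholder for your workspace ID.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<your-log-analytics-workspace-id>"  # placeholder

client = LogsQueryClient(DefaultAzureCredential())

# KQL: average allocatable GPUs per node, bucketed to the 60-second
# collection interval used by Container insights.
query = """
InsightsMetrics
| where Name == "nodeGpuAllocatable"
| summarize avg(Val) by Computer, bin(TimeGenerated, 1m)
"""

response = client.query_workspace(WORKSPACE_ID, query, timespan=timedelta(hours=1))
for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))
```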

| Metric name | Metric dimension (tags) | Description |
| --- | --- | --- |
| containerGpuLimits | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can specify limits as one or more GPUs. It isn't possible to request or limit a fraction of a GPU. |
| containerGpuRequests | container.azm.ms/clusterId, container.azm.ms/clusterName, containerName | Each container can request one or more GPUs. It isn't possible to request or limit a fraction of a GPU. |
| nodeGpuAllocatable | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Number of GPUs in a node that can be used by Kubernetes. |
| nodeGpuCapacity | container.azm.ms/clusterId, container.azm.ms/clusterName, gpuVendor | Total number of GPUs in a node. |
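As the containerGpuLimits and containerGpuRequests rows note, GPU requests and limits must be whole GPUs. The following is a minimal sketch, assuming the official kubernetes Python client, a configured kubeconfig, and a node pool that exposes the NVIDIA device plugin's nvidia.com/gpu extended resource; the pod name and container image are examples only.

```python
# Minimal sketch: define a pod that requests one whole GPU. Fractional values
# such as "0.5" are rejected by Kubernetes for extended resources.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-demo"),  # example name
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda-sample",
                image="nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1",  # example image
                resources=client.V1ResourceRequirements(
                    # Whole GPUs only: requests and limits must be integers.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

After the pod is scheduled, its GPU request and limit are reflected in the containerGpuRequests and containerGpuLimits metrics described in the preceding table.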

GPU performance charts

Container insights includes preconfigured charts for the metrics listed in the preceding table, presented as a GPU workbook for every cluster. For a description of the workbooks available for Container insights, see Workbooks in Container insights.

Next steps