GPU Monitoring: Beyond nvidia-smi
Dec 30, 2025 • 18 min read
Running watch -n 1 nvidia-smi is flying blind. You can see the GPU's current state, but you have no history, no alerting, no correlation with your application's request queue, and no ability to track trends over time. Production AI serving on GPUs requires proper observability: continuous metric collection, historical dashboards, and automated alerting for thermal throttling, memory pressure, and utilization bottlenecks.
1. nvidia-smi: The Baseline Tool
Start here before adding more complex tooling. Key commands to understand your GPU's state:
# Current state snapshot
nvidia-smi
# Monitor continuously (refresh every 1 second)
watch -n 1 nvidia-smi
# Detailed process info
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# Log to CSV for later analysis
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.free,power.draw --format=csv --loop=1 > gpu_log.csv
# Useful fields:
# utilization.gpu - % of time SM cores are busy computing (target: >50% for efficiency)
# utilization.memory - % of time memory interface is busy (can be high even at low gpu util)
# memory.used - Current VRAM consumption
# temperature.gpu - Die temperature (throttle starts ~83°C on most cards)
# power.draw - Watts consumed (compare to TDP for thermal headroom)

2. Key GPU Metrics and What They Mean
| Metric | Good Range | If Outside Range… |
|---|---|---|
| SM Utilization | 60–95% | Below 50%: batch size too small, GPU starved. Above 99%: bottleneck, need more GPUs. |
| Memory Utilization | 70–95% | Below 50%: wasteful allocation. Above 95%: risk of OOM, reduce batch size. |
| Temperature | 40–82°C | Above 83°C: thermal throttling begins, check cooling. Above 90°C: immediate action needed. |
| Power Draw | Near TDP | Far below TDP at high util: memory-bound (data movement limiting). At TDP: compute-bound. |
| PCIe Bandwidth | Varies | Saturated: CPU↔GPU data transfer is bottleneck. Move data preprocessing to GPU. |
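As a rough sketch, the ranges in this table can be turned into a quick triage helper for log analysis. The function name, structure, and exact thresholds are my own illustration of the table above, not a standard tool:

```python
def triage(sm_util, mem_util, temp_c):
    """Map raw GPU readings to the rough diagnoses in the table above.

    sm_util and mem_util are percentages (0-100); temp_c is Celsius.
    Thresholds mirror the table and should be tuned per hardware.
    """
    findings = []
    if sm_util < 50:
        findings.append("SM util low: batch size too small, GPU starved")
    elif sm_util > 99:
        findings.append("SM util saturated: consider more GPUs")
    if mem_util < 50:
        findings.append("Memory util low: wasteful allocation")
    elif mem_util > 95:
        findings.append("Memory util high: OOM risk, reduce batch size")
    if temp_c > 90:
        findings.append("Temp critical: immediate action needed")
    elif temp_c > 83:
        findings.append("Temp high: thermal throttling likely, check cooling")
    return findings or ["Within normal ranges"]

print(triage(sm_util=15, mem_util=96, temp_c=85))
```

Feeding it readings from the `gpu_log.csv` produced earlier gives a per-sample health summary without opening a dashboard.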
3. DCGM Exporter: The Production Standard
NVIDIA's DCGM (Data Center GPU Manager) provides deep telemetry beyond what nvidia-smi exposes: SM occupancy, tensor core utilization, NVLink bandwidth, page retirement events, and more. The DCGM Exporter runs as a container and exposes metrics in Prometheus format:
# Kubernetes DaemonSet — runs on every GPU node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter  # Must match the selector above
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
          ports:
            - containerPort: 9400
              name: http-metrics
          securityContext:
            runAsNonRoot: false
            privileged: true  # Required to access GPU metrics
          resources:
            limits:
              nvidia.com/gpu: 1  # Access GPU hardware
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

4. Key DCGM Metrics for LLM Serving
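The exporter's output is plain Prometheus text format, so it is easy to parse programmatically too. A minimal sketch — the parser and the sample payload are illustrative, and in practice you would fetch the text from http://localhost:9400/metrics:

```python
import re

def parse_dcgm(metrics_text):
    """Parse Prometheus text-format lines such as
    DCGM_FI_DEV_GPU_UTIL{gpu="0",...} 87
    into {metric_name: {gpu_id: value}}.
    """
    out = {}
    for line in metrics_text.splitlines():
        if not line.startswith("DCGM_"):
            continue  # skip # HELP / # TYPE comments and other metric families
        m = re.match(r'(\w+)\{[^}]*gpu="(\d+)"[^}]*\}\s+([0-9.eE+-]+)', line)
        if m:
            name, gpu, value = m.group(1), m.group(2), float(m.group(3))
            out.setdefault(name, {})[gpu] = value
    return out

# Illustrative sample of what the exporter emits (labels trimmed):
sample = '''# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization.
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 87
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-abc"} 71
'''
print(parse_dcgm(sample))
```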
# Scrape DCGM metrics manually (for debugging)
curl http://localhost:9400/metrics | grep DCGM
# Key metrics exposed by DCGM:
DCGM_FI_DEV_GPU_UTIL # SM utilization %
DCGM_FI_DEV_MEM_COPY_UTIL # Memory bus utilization %
DCGM_FI_DEV_ENC_UTIL # Video encoder utilization (not relevant for LLMs)
DCGM_FI_DEV_POWER_USAGE # Current power draw (Watts)
DCGM_FI_DEV_GPU_TEMP # GPU temperature (Celsius)
DCGM_FI_DEV_FB_USED # Framebuffer memory used (MiB)
DCGM_FI_DEV_FB_FREE # Framebuffer memory free (MiB)
DCGM_FI_DEV_NVLINK_BANDWIDTH # NVLink inter-GPU bandwidth (critical for multi-GPU)
# LLM-specific — identify memory vs compute bottleneck:
# High MEM_COPY_UTIL + Low GPU_UTIL = Memory bandwidth bound (common with large batches of small seq lengths)
# High GPU_UTIL + moderate MEM_COPY_UTIL = Compute bound (good — GPU is working hard)
# Low both = GPU is waiting for CPU or network (batching problem)

5. Prometheus + Grafana Setup
# prometheus.yml — add DCGM as a scrape target
scrape_configs:
  - job_name: 'dcgm'
    scrape_interval: 15s
    static_configs:
      - targets: ['dcgm-exporter:9400'] # Or use kubernetes service discovery
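Once Prometheus is scraping, any DCGM metric can be queried over its HTTP API. A hedged sketch that only builds the request URL (it assumes Prometheus at localhost:9090; the helper name is my own):

```python
from urllib.parse import urlencode

def instant_query_url(base, promql):
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

# Average SM utilization across all GPUs Prometheus has scraped
url = instant_query_url("http://localhost:9090", "avg(DCGM_FI_DEV_GPU_UTIL)")
print(url)
# With the stack below running, fetch it with e.g.:
#   import json, urllib.request
#   result = json.load(urllib.request.urlopen(url))["data"]["result"]
```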
# Docker Compose for local GPU monitoring stack
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_password
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
    privileged: true
    ports:
      - "9400:9400"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

6. Prometheus Alerting Rules
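Alert expressions are plain arithmetic over metrics, so their thresholds can be sanity-checked outside Prometheus before you ship them. For example, the 92% framebuffer-pressure condition used by the GPUMemoryNearFull rule reduces to a simple ratio (a standalone Python sketch, not how Prometheus evaluates rules):

```python
def memory_pressure(fb_used_mib, fb_free_mib):
    """Fraction of framebuffer memory in use: used / (used + free)."""
    return fb_used_mib / (fb_used_mib + fb_free_mib)

def near_full(fb_used_mib, fb_free_mib, threshold=0.92):
    """True when a GPUMemoryNearFull-style condition would fire."""
    return memory_pressure(fb_used_mib, fb_free_mib) > threshold

# 76 GiB used of an 80 GiB card -> 95% pressure, above the 92% threshold
print(near_full(77824, 4096))
```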
# alerts/gpu.yml — add to Prometheus configuration
groups:
  - name: GPU Alerts
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature above 83°C: {{ $value }}°C"
          description: "Thermal throttling may begin. Check cooling and airflow."
      - alert: GPUMemoryNearFull
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.92
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory > 92% utilized on {{ $labels.instance }}"
      - alert: GPUUnderUtilized
        expr: DCGM_FI_DEV_GPU_UTIL < 20
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "GPU has been <20% utilized for 10 minutes. Investigate batching."
      - alert: GPUThermalThrottling
        # DCGM_FI_PROF_SM_ACTIVE is reported as a 0-1 ratio; scale to percent before comparing
        expr: (DCGM_FI_PROF_SM_ACTIVE * 100) < (DCGM_FI_DEV_GPU_UTIL * 0.7)
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Suspected thermal throttling: SM active much lower than GPU util"

7. Diagnosing Common GPU Bottlenecks
Use the SM Occupancy vs Memory Usage pattern to identify your bottleneck type:
- High Memory (95%) + Low GPU Util (15%): You're memory-constrained but compute-underutilized. Your batch size is too small — increase it to use the compute cores more efficiently. Each inference call is loading model weights into memory but barely using the tensor cores.
- Low Memory (40%) + High GPU Util (95%): Compute-bound but memory-efficient. Good state for dense computation. May benefit from a faster GPU or multi-GPU setup.
- Both Low (<30%): GPU is idle — waiting for the CPU (preprocessing bottleneck) or network (slow API client). Increase parallelism in your request handling code.
- High PCIe Bandwidth + Low GPU Util: Data is streaming to GPU faster than it can process. Check for unnecessary host↔device memory copies.
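The four patterns above can be collapsed into one diagnostic helper. This is a sketch: the thresholds are the illustrative figures from the bullets, not universal constants, and the function name is my own:

```python
def diagnose(mem_pct, gpu_util_pct, pcie_saturated=False):
    """Classify a GPU bottleneck from memory usage %, SM utilization %,
    and whether PCIe bandwidth is saturated. Thresholds follow the
    patterns described above and should be tuned per workload.
    """
    if pcie_saturated and gpu_util_pct < 50:
        return "PCIe-bound: check for unnecessary host<->device copies"
    if mem_pct >= 80 and gpu_util_pct <= 30:
        return "Memory-heavy, compute-starved: increase batch size"
    if mem_pct <= 50 and gpu_util_pct >= 80:
        return "Compute-bound: consider a faster GPU or multi-GPU"
    if mem_pct < 30 and gpu_util_pct < 30:
        return "Idle: CPU preprocessing or network is the bottleneck"
    return "No clear pattern: inspect per-metric dashboards"

print(diagnose(mem_pct=95, gpu_util_pct=15))
```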
Frequently Asked Questions
What Grafana dashboards should I import?
Import dashboard ID 12239 from Grafana's dashboard library — "NVIDIA DCGM Exporter Dashboard". It pre-builds panels for all key DCGM metrics. Supplement with a custom dashboard tracking your application-level throughput (tokens/second, requests/second) alongside GPU metrics for full correlation.
How does multi-GPU monitoring work?
DCGM tracks each GPU individually via the gpu label in Prometheus. For NVLink-connected multi-GPU setups (like A100 NVLink or H100), also monitor DCGM_FI_DEV_NVLINK_BANDWIDTH — low NVLink bandwidth indicates tensor parallelism is bottlenecked by inter-GPU communication rather than compute.
Conclusion
Production GPU monitoring is the difference between proactively optimizing your AI infrastructure and reacting to customer-reported slowdowns or OOM crashes. DCGM + Prometheus + Grafana gives you the observability stack to understand utilization patterns, catch thermal issues before they throttle performance, and make data-driven decisions about scaling and batching strategies. Set it up before you need it — you'll thank yourself when debugging your first production incident.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.