
GPU Monitoring: Beyond nvidia-smi

Dec 30, 2025 • 18 min read

Running watch -n 1 nvidia-smi is flying blind. You can see the GPU's current state, but you have no history, no alerting, no correlation with your application's request queue, and no ability to track trends over time. Production AI serving on GPUs requires proper observability: continuous metric collection, historical dashboards, and automated alerting for thermal throttling, memory pressure, and utilization bottlenecks.

1. nvidia-smi: The Baseline Tool

Start here before adding more complex tooling. Key commands to understand your GPU's state:

# Current state snapshot
nvidia-smi

# Monitor continuously (refresh every 1 second)
watch -n 1 nvidia-smi

# Detailed process info  
nvidia-smi --query-compute-apps=pid,used_memory --format=csv

# Log to CSV for later analysis
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.free,power.draw \
    --format=csv --loop=1 > gpu_log.csv

# Useful fields:
# utilization.gpu  - % of time SM cores are busy computing (target: >50% for efficiency)
# utilization.memory - % of time memory interface is busy (can be high even at low gpu util)
# memory.used - Current VRAM consumption
# temperature.gpu - Die temperature (throttle starts ~83°C on most cards)
# power.draw - Watts consumed (compare to TDP for thermal headroom)
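
Once gpu_log.csv accumulates, even a quick awk pass surfaces trends that a live nvidia-smi view can't. A minimal sketch, using a fabricated two-row sample in the CSV layout the logging command above produces (a real file will have thousands of rows):

```shell
# Create a tiny sample log (illustrative data only — a real gpu_log.csv
# comes from the nvidia-smi --loop command above)
cat > gpu_log.csv <<'EOF'
timestamp, name, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.used [MiB], memory.free [MiB], power.draw [W]
2025/12/30 10:00:01.000, NVIDIA A100, 61, 82 %, 45 %, 30120 MiB, 10440 MiB, 310.50 W
2025/12/30 10:00:02.000, NVIDIA A100, 62, 88 %, 47 %, 30120 MiB, 10440 MiB, 324.10 W
EOF

# Average SM utilization and peak temperature over the whole log
awk -F', ' 'NR > 1 { gsub(/ %/, "", $4); util += $4; n++; if ($3 + 0 > peak) peak = $3 }
            END { printf "avg util: %.1f%%  peak temp: %dC\n", util / n, peak }' gpu_log.csv
# prints: avg util: 85.0%  peak temp: 62C
```

The same pattern extends to any column in the query list: strip the unit suffix, accumulate, and summarize in the END block.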

2. Key GPU Metrics and What They Mean

| Metric | Good Range | If Outside Range… |
| --- | --- | --- |
| SM Utilization | 60–95% | Below 50%: batch size too small, GPU starved. Above 99%: bottleneck, need more GPUs. |
| Memory Utilization | 70–95% | Below 50%: wasteful allocation. Above 95%: risk of OOM, reduce batch size. |
| Temperature | 40–82°C | Above 83°C: thermal throttling begins, check cooling. Above 90°C: immediate action needed. |
| Power Draw | Near TDP | Far below TDP at high util: memory-bound (data movement limiting). At TDP: compute-bound. |
| PCIe Bandwidth | Varies | Saturated: CPU↔GPU data transfer is bottleneck. Move data preprocessing to GPU. |
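
These thresholds are easy to wire into a quick shell check before reaching for a full monitoring stack. A sketch — the function name and cutoffs are ours, lifted from the table, not a standard tool:

```shell
# Flag out-of-range readings per the thresholds above.
# Args: die temperature (C), SM utilization (%), VRAM used (%)
check_gpu() {
  local temp=$1 util=$2 vram=$3
  [ "$temp" -gt 83 ] && echo "WARN temp ${temp}C: thermal throttling likely, check cooling"
  [ "$util" -lt 50 ] && echo "WARN util ${util}%: GPU starved, batch size may be too small"
  [ "$vram" -gt 95 ] && echo "WARN vram ${vram}%: OOM risk, reduce batch size"
  return 0
}

check_gpu 86 35 97   # hypothetical bad reading: emits all three warnings
```

Feed it values parsed from the nvidia-smi CSV output; a healthy reading prints nothing.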

3. DCGM Exporter: The Production Standard

NVIDIA's DCGM (Data Center GPU Manager) provides deep telemetry beyond what nvidia-smi exposes: SM occupancy, tensor core utilization, NVLink bandwidth, page retirement events, and more. The DCGM Exporter runs as a container and exposes metrics in Prometheus format:

# Kubernetes DaemonSet — runs on every GPU node
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter    # must match the selector above or the DaemonSet is rejected
    spec:
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
        ports:
        - containerPort: 9400
          name: http-metrics
        securityContext:
          runAsNonRoot: false
          privileged: true        # Required to access GPU metrics
        resources:
          limits:
            nvidia.com/gpu: 1     # Access GPU hardware
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
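
For Prometheus to find the exporter through Kubernetes service discovery, a Service typically fronts the DaemonSet pods. A minimal sketch — names mirror the DaemonSet above and assume the pods carry the app: dcgm-exporter label; adjust to your cluster's conventions:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: monitoring
  labels:
    app: dcgm-exporter
spec:
  selector:
    app: dcgm-exporter        # selects the DaemonSet's pods
  ports:
    - name: http-metrics
      port: 9400
      targetPort: 9400
```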

4. Key DCGM Metrics for LLM Serving

# Scrape DCGM metrics manually (for debugging)
curl http://localhost:9400/metrics | grep DCGM

# Key metrics exposed by DCGM:
DCGM_FI_DEV_GPU_UTIL          # SM utilization %
DCGM_FI_DEV_MEM_COPY_UTIL     # Memory bus utilization %
DCGM_FI_DEV_ENC_UTIL          # Video encoder utilization (not relevant for LLMs)
DCGM_FI_DEV_POWER_USAGE       # Current power draw (Watts)
DCGM_FI_DEV_GPU_TEMP          # GPU temperature (Celsius)
DCGM_FI_DEV_FB_USED           # Framebuffer memory used (MiB)
DCGM_FI_DEV_FB_FREE           # Framebuffer memory free (MiB)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL  # Aggregate NVLink bandwidth (critical for multi-GPU)

# LLM-specific — identify memory vs compute bottleneck:
# High MEM_COPY_UTIL + Low GPU_UTIL = Memory bandwidth bound (common during autoregressive decoding at small batch sizes, where every generated token re-reads the model weights)
# High GPU_UTIL + moderate MEM_COPY_UTIL = Compute bound (good — GPU is working hard)
# Low both = GPU is waiting for CPU or network (batching problem)
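
These heuristics translate directly into PromQL once DCGM metrics land in Prometheus. Illustrative queries — the 80/40/10 thresholds are assumptions to tune against your own workload:

```promql
# Likely memory-bandwidth bound: memory bus busy, SMs comparatively idle
avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL[5m]) > 80
  and avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 40

# Likely starved by CPU or network: both quiet
avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 10
  and avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL[5m]) < 10
```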

5. Prometheus + Grafana Setup

# prometheus.yml — add DCGM as a scrape target
scrape_configs:
  - job_name: 'dcgm'
    scrape_interval: 15s
    static_configs:
      - targets: ['dcgm-exporter:9400']  # Or use kubernetes service discovery
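
On Kubernetes, service discovery avoids hard-coding exporter addresses. A sketch using the pod role — the relabeling assumes the exporter pods are labeled app=dcgm-exporter and expose a port named http-metrics:

```yaml
scrape_configs:
  - job_name: 'dcgm'
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only dcgm-exporter pods
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: dcgm-exporter
        action: keep
      # Keep only the metrics port
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: http-metrics
        action: keep
```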

# Docker Compose for local GPU monitoring stack
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your_password
    
  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
    privileged: true
    ports:
      - "9400:9400"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

6. Prometheus Alerting Rules

# alerts/gpu.yml — add to Prometheus configuration
groups:
  - name: GPU Alerts
    rules:
      - alert: GPUHighTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} temperature above 83°C: {{ $value }}°C"
          description: "Thermal throttling may begin. Check cooling and airflow."

      - alert: GPUMemoryNearFull
        expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.92
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory > 92% utilized on {{ $labels.instance }}"

      - alert: GPUUnderUtilized
        expr: DCGM_FI_DEV_GPU_UTIL < 20
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "GPU has been <20% utilized for 10 minutes. Investigate batching."

      - alert: GPUThermalThrottling
        # DCGM_FI_PROF_SM_ACTIVE is a 0–1 ratio (requires DCGM profiling metrics
        # to be enabled); scale to percent before comparing against GPU_UTIL (0–100)
        expr: (DCGM_FI_PROF_SM_ACTIVE * 100) < (DCGM_FI_DEV_GPU_UTIL * 0.7)
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Suspected thermal throttling: SM active much lower than GPU util"

7. Diagnosing Common GPU Bottlenecks

Use the SM Occupancy vs Memory Usage pattern to identify your bottleneck type:

  • High Memory (95%) + Low GPU Util (15%): You're memory-constrained but compute-underutilized. Your batch size is too small — increase it to use the compute cores more efficiently. Each inference call is loading model weights into memory but barely using the tensor cores.
  • Low Memory (40%) + High GPU Util (95%): Compute-bound but memory-efficient. Good state for dense computation. May benefit from a faster GPU or multi-GPU setup.
  • Both Low (<30%): GPU is idle — waiting for the CPU (preprocessing bottleneck) or network (slow API client). Increase parallelism in your request handling code.
  • High PCIe Bandwidth + Low GPU Util: Data is streaming to GPU faster than it can process. Check for unnecessary host↔device memory copies.
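
The four patterns above fold into a quick triage function. A shell sketch (thresholds are illustrative; feed it the memory-used and SM-utilization percentages from nvidia-smi or DCGM):

```shell
# Classify a (memory %, SM util %) reading into the bottleneck patterns above
classify() {
  local mem=$1 util=$2
  if   [ "$mem" -ge 90 ] && [ "$util" -le 30 ]; then echo "memory-constrained: increase batch size"
  elif [ "$mem" -le 50 ] && [ "$util" -ge 90 ]; then echo "compute-bound: consider faster GPU or multi-GPU"
  elif [ "$mem" -le 30 ] && [ "$util" -le 30 ]; then echo "idle: CPU or network bottleneck, raise parallelism"
  else echo "mixed: check PCIe bandwidth and kernel-level profiles"
  fi
}

classify 95 15   # → memory-constrained: increase batch size
classify 40 95   # → compute-bound: consider faster GPU or multi-GPU
```

The same decision tree works as PromQL alert expressions; the shell form is handy for one-off debugging on a box where only nvidia-smi is available.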

Frequently Asked Questions

What Grafana dashboards should I import?

Import dashboard ID 12239 from Grafana's dashboard library — "NVIDIA DCGM Exporter Dashboard". It pre-builds panels for all key DCGM metrics. Supplement with a custom dashboard tracking your application-level throughput (tokens/second, requests/second) alongside GPU metrics for full correlation.

How does multi-GPU monitoring work?

DCGM tracks each GPU individually via the gpu label in Prometheus. For NVLink-connected multi-GPU setups (like A100 NVLink or H100), also monitor DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL — if NVLink bandwidth is pinned near its maximum while SM utilization dips, tensor parallelism is bottlenecked by inter-GPU communication rather than compute.

Conclusion

Production GPU monitoring is the difference between proactively optimizing your AI infrastructure and reacting to customer-reported slowdowns or OOM crashes. DCGM + Prometheus + Grafana gives you the observability stack to understand utilization patterns, catch thermal issues before they throttle performance, and make data-driven decisions about scaling and batching strategies. Set it up before you need it — you'll thank yourself when debugging your first production incident.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o, LangChain, Next.js, Vector DBs, RAG, Vercel AI SDK