Computer Vision: YOLO Models
Dec 29, 2025 • 18 min read
LLMs handle text. GPT-4V, Gemini Vision, Claude handle static images. But what if your agent needs to process video streams in real-time — counting objects, detecting intrusions, reading screens, or monitoring assembly lines at 60fps? YOLO (You Only Look Once) is the industry standard for real-time object detection: a single neural network that processes an image and predicts all bounding boxes simultaneously, achieving 50-200fps on modern GPUs. Understanding YOLO is essential for any AI engineer building agents that interact with the physical world.
1. The YOLO Architecture: Why It's Fast
Before YOLO, object detection used sliding windows: run the classification model on every possible bounding box position at every scale. For a 640x640 image, that's tens of thousands of inference passes per frame — impossibly slow. YOLO's insight: treat detection as a single regression problem. Divide the image into an SxS grid. Each grid cell simultaneously predicts B bounding boxes and C class probabilities. One forward pass, all boxes at once:
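The single-pass formulation can be made concrete with a quick back-of-the-envelope calculation (using the original YOLOv1 output layout; later versions are anchor-free and multi-scale, but the idea is the same):

```python
# Size of the output tensor in the original YOLO formulation:
# an S x S grid where every cell predicts B boxes (x, y, w, h, objectness)
# plus C class probabilities -- all produced in one forward pass.
def yolo_output_size(S: int, B: int, C: int) -> int:
    return S * S * (B * 5 + C)

# YOLOv1's published configuration: 7x7 grid, 2 boxes per cell, 20 VOC classes
print(yolo_output_size(S=7, B=2, C=20))  # 7 * 7 * 30 = 1470 values per image
```

Compare that single tensor with the tens of thousands of separate classifier passes a sliding-window detector needs for the same image.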
| Model | mAP50-95 (COCO) | Speed (A100) | Best For |
|---|---|---|---|
| YOLOv8n (nano) | 37.3 | 1.47ms | Edge/mobile, 8GB GPU memory |
| YOLOv8s (small) | 44.9 | 2.66ms | Raspberry Pi 5, Jetson Nano |
| YOLOv8m (medium) | 50.2 | 5.86ms | Balanced quality/speed |
| YOLOv8x (large) | 53.9 | 14.37ms | Max accuracy, research |
| YOLOv11n | 39.5 | 1.55ms | Updated nano, better accuracy |
| YOLO-World | N/A (open vocab) | ~15ms | Zero-shot detection, no training needed |
2. Real-Time Video Stream Inference

```shell
pip install ultralytics
```

```python
from ultralytics import YOLO
import cv2
from collections import defaultdict

# Load pretrained model (downloads ~6MB .pt file on first run)
model = YOLO("yolov8n.pt")  # nano = fastest
# model = YOLO("yolov8m.pt")  # medium = better accuracy

# Run inference on webcam stream
cap = cv2.VideoCapture(0)  # 0 = default webcam, 1 = second, "rtsp://..." for IP cameras

# Object tracking (follows objects across frames with consistent IDs)
track_history = defaultdict(list)

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        break

    # Run detection AND tracking (BoT-SORT by default; pass tracker="bytetrack.yaml" for ByteTrack)
    results = model.track(
        frame,
        persist=True,       # Maintain tracker state across frames
        conf=0.3,           # Minimum confidence threshold
        iou=0.45,           # IoU threshold for NMS (non-max suppression)
        classes=[0, 2, 7],  # Only detect: 0=person, 2=car, 7=truck (COCO classes)
        verbose=False,
    )

    # Parse results for your application logic
    if results[0].boxes.id is not None:
        boxes = results[0].boxes.xywh.cpu()  # [cx, cy, w, h] format
        track_ids = results[0].boxes.id.int().cpu().tolist()
        confidences = results[0].boxes.conf.cpu()
        class_ids = results[0].boxes.cls.int().cpu().tolist()
        class_names = [model.names[c] for c in class_ids]

        for box, track_id, conf, cls_name in zip(boxes, track_ids, confidences, class_names):
            cx, cy, w, h = box.tolist()
            # Store trajectory for trail visualization
            track_history[track_id].append((cx, cy))
            if len(track_history[track_id]) > 30:  # Keep last 30 positions
                track_history[track_id].pop(0)
            print(f"Object {track_id}: {cls_name} at ({cx:.0f},{cy:.0f}) conf={conf:.2f}")

    # Draw annotated frame with Ultralytics built-in renderer
    annotated_frame = results[0].plot(line_width=2, font_size=0.8)
    cv2.imshow("YOLO Tracking", annotated_frame)

    # Quit on 'q'
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```

3. Fine-Tuning YOLO on Custom Classes
```python
# Train YOLO on your custom dataset in 3 steps
#
# Step 1: Create dataset.yaml
dataset_yaml = """
path: /path/to/dataset   # Root directory
train: images/train      # 80% of your labeled images
val: images/val          # 20% of your labeled images
nc: 3                    # Number of classes
names: ["helmet", "vest", "person"]  # Class names in order
"""

# Step 2: Label your images
# Use Roboflow or Label Studio to annotate in YOLO format:
# Each image gets a .txt file with: class_id cx cy w h (normalized 0-1)

from ultralytics import YOLO

# Step 3: Train
model = YOLO("yolov8m.pt")  # Start from COCO pretrained weights (transfer learning)
results = model.train(
    data="dataset.yaml",
    epochs=100,
    imgsz=640,          # Input resolution
    batch=16,           # Images per batch (reduce if VRAM insufficient)
    device=0,           # GPU device index (0 = first GPU)
    # Augmentation (extremely important for small datasets)
    augment=True,
    mosaic=1.0,         # Mosaic augmentation (combines 4 images)
    mixup=0.1,          # MixUp augmentation
    degrees=15.0,       # Random rotation ±15°
    fliplr=0.5,         # Horizontal flip
    # Learning rate schedule
    lr0=0.01,           # Initial LR
    lrf=0.001,          # Final LR (cosine schedule)
    warmup_epochs=3,    # LR warmup
    project="./runs/train",
    name="custom_detector",
)

# Evaluate on the validation set
metrics = model.val(data="dataset.yaml")
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")

# Export for deployment
model.export(format="onnx")    # For ONNX Runtime
model.export(format="tflite")  # For mobile/embedded
model.export(format="coreml")  # For iOS/Mac
```

4. Instance Segmentation and Pose Estimation
```python
# YOLO also handles segmentation and pose estimation with the same API
from ultralytics import YOLO

# Instance Segmentation (pixel masks for each detected object)
seg_model = YOLO("yolov8m-seg.pt")
results = seg_model("scene.jpg")

for r in results:
    masks = r.masks.data.cpu().numpy()  # [N, H, W] binary masks
    for mask, cls in zip(masks, r.boxes.cls):
        class_name = seg_model.names[int(cls)]
        pixel_count = mask.sum()
        print(f"{class_name}: {pixel_count} pixels")

# Pose Estimation (17 human body keypoints)
pose_model = YOLO("yolov8m-pose.pt")
results = pose_model("person.jpg")
keypoints = results[0].keypoints.data.cpu().numpy()  # [N_persons, 17, 3] (x, y, conf)

# Keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

for person in keypoints:
    for i, (x, y, conf) in enumerate(person):
        if conf > 0.5:
            print(f"{KEYPOINT_NAMES[i]}: ({x:.1f}, {y:.1f})")
```

5. YOLO-World: Zero-Shot Open-Vocabulary Detection
```python
# YOLO-World: detect anything without training -- just describe it in text.
# Traditional YOLO has 80 fixed COCO classes; YOLO-World has an open vocabulary.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8x-worldv2.pt")  # Pretrained on vision-language data

# Set the classes you want to detect (no training required!)
model.set_classes(["forklift", "safety vest", "hard hat", "fire extinguisher"])
results = model.predict("warehouse.jpg", conf=0.3)
# Detects exactly those 4 classes in the image

# Industrial AI use case: safety compliance monitoring
model.set_classes([
    "worker without helmet",  # Detect safety violations!
    "worker without vest",
    "wet floor sign",
    "blocked emergency exit",
])

# UI automation use case (for RPA agents):
model.set_classes(["login button", "username field", "password field", "submit button"])
# The agent can now locate any UI element by description.
# `screenshot` = an image of the current screen (e.g. captured with mss or pyautogui)
results = model.predict(screenshot, conf=0.25)
for box in results[0].boxes:
    print(f"Found: {results[0].names[int(box.cls)]} at {box.xyxy[0].tolist()}")
```

Frequently Asked Questions
How many images do I need to fine-tune YOLO on a custom class?
As a rough guide: 200-500 images per class for simple objects with consistent appearance (a specific product SKU), 500-1000 for objects with significant visual variation (cars, people in different poses), and 1000+ for fine-grained detection (manufacturing defects). YOLO's mosaic augmentation is particularly effective at bootstrapping small datasets — 200 images with aggressive augmentation often outperform 500 images without it. Use Roboflow's free tier to annotate and augment your dataset before training.
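As a concrete companion to this advice: the 80/20 train/val folder layout that `dataset.yaml` expects can be scripted in a few lines. A sketch assuming a flat directory of `.jpg` images (the paths and function name are hypothetical; a real pipeline would also move each image's matching label `.txt`):

```python
# Random 80/20 train/val split for a YOLO dataset -- a sketch; adjust paths.
import random
import shutil
from pathlib import Path

def split_dataset(image_dir: str, out_dir: str, val_fraction: float = 0.2, seed: int = 0):
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)  # deterministic shuffle for reproducibility
    n_val = int(len(images) * val_fraction)
    for split, subset in [("val", images[:n_val]), ("train", images[n_val:])]:
        dest = Path(out_dir) / "images" / split
        dest.mkdir(parents=True, exist_ok=True)
        for img in subset:
            shutil.copy(img, dest / img.name)
    return len(images) - n_val, n_val  # (n_train, n_val)
```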
How do I deploy YOLO at the edge (Raspberry Pi, Jetson)?
Export to NCNN (fastest for ARM CPUs), TensorRT (fastest for Jetson GPUs), or ONNX Runtime (cross-platform). On a Raspberry Pi 5, the NCNN backend achieves roughly 8fps with YOLOv8n at 320x320; on a Jetson Orin NX, TensorRT YOLOv8n reaches roughly 200fps at 640x640. YOLO's nano and small variants were specifically designed for edge deployment — nano runs at useful speeds even on Cortex-A53 CPUs.
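One practical consequence of those edge frame rates: a 30fps camera will outrun an ~8fps model, so the capture loop should drop frames rather than queue them. A small sketch of the arithmetic (`frames_to_skip` is a hypothetical helper):

```python
import math

# If the camera delivers frames faster than the model can process them
# (e.g. 30fps capture vs ~8fps YOLOv8n/NCNN on a Raspberry Pi 5), run
# inference on every (skip+1)-th frame so the loop never falls behind.
def frames_to_skip(camera_fps: float, model_fps: float) -> int:
    if model_fps >= camera_fps:
        return 0  # model keeps up; process every frame
    return math.ceil(camera_fps / model_fps) - 1

print(frames_to_skip(30, 8))   # 3 -> run the model on every 4th frame
print(frames_to_skip(30, 60))  # 0 -> model keeps up with capture
```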
Conclusion
YOLO gives AI agents visual perception capabilities that LLMs can't match for real-time video. The Ultralytics ecosystem (YOLOv8/v11) provides detection, segmentation, tracking, and pose estimation in a single unified API. YOLO-World extends this to open-vocabulary detection where any object can be described in text without training. For production agents that need to monitor physical environments — factory floors, retail spaces, security systems, or screen automation — YOLO is the fastest path from camera input to actionable events.
Vivek
AI Engineer. Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.