
Computer Vision: YOLO Models

Dec 29, 2025 • 18 min read

LLMs handle text. GPT-4V, Gemini Vision, Claude handle static images. But what if your agent needs to process video streams in real-time — counting objects, detecting intrusions, reading screens, or monitoring assembly lines at 60fps? YOLO (You Only Look Once) is the industry standard for real-time object detection: a single neural network that processes an image and predicts all bounding boxes simultaneously, achieving 50-200fps on modern GPUs. Understanding YOLO is essential for any AI engineer building agents that interact with the physical world.

1. The YOLO Architecture: Why It's Fast

Before YOLO, object detection used sliding windows: run the classification model on every possible bounding box position at every scale. For a 640x640 image, that's tens of thousands of inference passes per frame — impossibly slow. YOLO's insight: treat detection as a single regression problem. Divide the image into an SxS grid. Each grid cell simultaneously predicts B bounding boxes and C class probabilities. One forward pass, all boxes at once:

| Model | mAP50-95 (COCO) | Speed (A100) | Best For |
|---|---|---|---|
| YOLOv8n (nano) | 37.3 | 1.47 ms | Edge/mobile, 8 GB GPU memory |
| YOLOv8s (small) | 44.9 | 2.66 ms | Raspberry Pi 5, Jetson Nano |
| YOLOv8m (medium) | 50.2 | 5.86 ms | Balanced quality/speed |
| YOLOv8x (extra-large) | 53.9 | 14.37 ms | Max accuracy, research |
| YOLOv11n | 39.5 | 1.55 ms | Updated nano, better accuracy |
| YOLO-World | N/A (open vocab) | ~15 ms | Zero-shot detection, no training needed |
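The single-pass claim is easy to quantify. A minimal sketch, assuming YOLOv8's anchor-free head, which predicts one box per grid cell at three detection scales (strides 8, 16, and 32):

```python
# How many candidate boxes one YOLOv8 forward pass produces
# for a 640x640 input (anchor-free head, strides 8/16/32).
def num_predictions(img_size=640, strides=(8, 16, 32)):
    # Each stride s yields an (img_size/s) x (img_size/s) grid,
    # with one box prediction per grid cell.
    return sum((img_size // s) ** 2 for s in strides)

print(num_predictions())  # 80^2 + 40^2 + 20^2 = 8400 candidate boxes
```

All 8400 candidates come out of a single forward pass; NMS then prunes overlapping boxes, which is why per-frame latency stays in the low milliseconds.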

2. Real-Time Video Stream Inference

pip install ultralytics

from ultralytics import YOLO
import cv2
import numpy as np
from collections import defaultdict

# Load pretrained model (downloads ~6MB .pt file on first run)
model = YOLO("yolov8n.pt")   # nano = fastest
# model = YOLO("yolov8m.pt") # medium = better accuracy

# Run inference on webcam stream
cap = cv2.VideoCapture(0)  # 0 = default webcam, 1 = second, "rtsp://..." for IP cameras

# Object tracking (follows objects across frames with consistent IDs)
track_history = defaultdict(list)

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        break
    
    # Run detection AND tracking (ByteTrack algorithm for multi-object tracking)
    results = model.track(
        frame,
        persist=True,         # Maintain tracker state across frames
        conf=0.3,             # Minimum confidence threshold
        iou=0.45,             # IoU threshold for NMS (non-max suppression)
        classes=[0, 2, 7],    # Only detect: 0=person, 2=car, 7=truck (COCO classes)
        verbose=False,
    )
    
    # Parse results for your application logic
    if results[0].boxes.id is not None:
        boxes = results[0].boxes.xywh.cpu()          # [cx, cy, w, h] format
        track_ids = results[0].boxes.id.int().cpu().tolist()
        confidences = results[0].boxes.conf.cpu()
        class_ids = results[0].boxes.cls.int().cpu().tolist()
        class_names = [model.names[c] for c in class_ids]
        
        for box, track_id, conf, cls_name in zip(boxes, track_ids, confidences, class_names):
            cx, cy, w, h = box.tolist()
            
            # Store trajectory for trail visualization
            track_history[track_id].append((cx, cy))
            if len(track_history[track_id]) > 30:  # Keep last 30 positions
                track_history[track_id].pop(0)
            
            print(f"Object {track_id}: {cls_name} at ({cx:.0f},{cy:.0f}) conf={conf:.2f}")
    
    # Draw annotated frame with Ultralytics built-in renderer
    annotated_frame = results[0].plot(line_width=2, font_size=0.8)
    cv2.imshow("YOLO Tracking", annotated_frame)
    
    # Quit on 'q'
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
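The track_history trajectories above can drive application logic directly, e.g. counting objects that cross a virtual line. A minimal sketch (count_line_crossings is a hypothetical helper, not an Ultralytics API; `history` mirrors the track_id -> [(cx, cy), ...] dict above):

```python
# Sketch: turning per-frame track centroids into a crossing count.
def count_line_crossings(history, line_y):
    """Count tracks whose centroid moved from above line_y to below it."""
    crossings = 0
    for points in history.values():
        for (_, y0), (_, y1) in zip(points, points[1:]):
            if y0 < line_y <= y1:  # moved downward across the line
                crossings += 1
                break  # count each track at most once
    return crossings

history = {1: [(100, 200), (102, 240), (105, 260)],   # crosses y=250
           2: [(300, 100), (305, 110)]}               # stays above
print(count_line_crossings(history, line_y=250))  # 1
```

The same pattern (compare consecutive centroids against a geometric condition) extends to zone entry/exit and dwell-time alerts.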

3. Fine-Tuning YOLO on Custom Classes

# Train YOLO on your custom dataset in 3 steps
# 
# Step 1: Create dataset.yaml
dataset_yaml = """
path: /path/to/dataset    # Root directory
train: images/train       # 80% of your labeled images
val: images/val           # 20% of your labeled images

nc: 3                     # Number of classes
names: ["helmet", "vest", "person"]  # Class names in order
"""

# Step 2: Label your images
# Use Roboflow or Label Studio to annotate in YOLO format:
# Each image gets a .txt file with: class_id cx cy w h (normalized 0-1)
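If you generate annotations programmatically, converting a pixel-space box into that normalized label line is a few lines of arithmetic. A sketch (to_yolo_label is a hypothetical helper):

```python
# Sketch: convert a pixel-space box (x1, y1, x2, y2) into a YOLO
# label line "class_id cx cy w h" with coordinates normalized to 0-1.
def to_yolo_label(class_id, x1, y1, x2, y2, img_w, img_h):
    cx = (x1 + x2) / 2 / img_w   # box center, normalized
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w        # box size, normalized
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# A 100x200 helmet box with top-left (270, 100) in a 640x480 image:
print(to_yolo_label(0, 270, 100, 370, 300, 640, 480))
# 0 0.500000 0.416667 0.156250 0.416667
```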

from ultralytics import YOLO

# Step 3: Train
model = YOLO("yolov8m.pt")  # Start from COCO pretrained weights (transfer learning)

results = model.train(
    data="dataset.yaml",
    epochs=100,
    imgsz=640,             # Input resolution
    batch=16,              # Images per batch (reduce if VRAM insufficient)
    device=0,              # GPU device index (0 = first GPU)
    
    # Augmentation (extremely important for small datasets)
    mosaic=1.0,            # Mosaic augmentation (combines 4 images)
    mixup=0.1,             # MixUp augmentation
    degrees=15.0,          # Random rotation ±15°
    fliplr=0.5,            # Horizontal flip
    
    # Learning rate schedule
    lr0=0.01,              # Initial LR
    lrf=0.001,             # Final LR factor (final LR = lr0 * lrf)
    cos_lr=True,           # Cosine LR decay (default is linear)
    warmup_epochs=3,       # LR warmup epochs
    
    project="./runs/train",
    name="custom_detector",
)

# Evaluate on test set
metrics = model.val(data="dataset.yaml")
print(f"mAP50: {metrics.box.map50:.3f}")
print(f"mAP50-95: {metrics.box.map:.3f}")

# Export for deployment
model.export(format="onnx")   # For ONNX Runtime
model.export(format="tflite") # For mobile/embedded
model.export(format="coreml") # For iOS/Mac

4. Instance Segmentation and Pose Estimation

# YOLO also handles segmentation and pose in same API:

# Instance Segmentation (pixel masks for each detected object)
seg_model = YOLO("yolov8m-seg.pt")
results = seg_model("scene.jpg")
 
for r in results:
    masks = r.masks.data.cpu().numpy()  # [N, H, W] boolean masks
    for i, (mask, cls) in enumerate(zip(masks, r.boxes.cls)):
        class_name = seg_model.names[int(cls)]
        pixel_count = mask.sum()
        print(f"{class_name}: {pixel_count} pixels")
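Those pixel masks convert naturally into area metrics, e.g. what fraction of the frame an object covers. A sketch with a synthetic mask standing in for a real detection from r.masks.data:

```python
import numpy as np

# Sketch: turning a [H, W] binary segmentation mask into a coverage metric.
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:400] = True  # pretend this region is one detected object

coverage = mask.sum() / mask.size  # fraction of frame covered by the mask
print(f"{coverage:.1%} of the frame")  # 13.0% of the frame
```

With a known camera geometry the same pixel counts convert to real-world area, which is how coverage alerts (spill size, stock level) are typically built.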

# Pose Estimation (17 human body keypoints)
pose_model = YOLO("yolov8m-pose.pt")
results = pose_model("person.jpg")

keypoints = results[0].keypoints.data.cpu().numpy()  # [N_persons, 17, 3] (x, y, conf)
# Keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

for person in keypoints:
    for i, (x, y, conf) in enumerate(person):
        if conf > 0.5:
            print(f"{KEYPOINT_NAMES[i]}: ({x:.1f}, {y:.1f})")
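Raw keypoints become useful once you derive geometry from them, e.g. a joint angle for posture or exercise-form checks. A sketch (joint_angle is a hypothetical helper; pick keypoints by index via KEYPOINT_NAMES above):

```python
import numpy as np

# Sketch: angle at keypoint b (degrees) formed by segments b->a and b->c,
# e.g. the elbow angle from shoulder, elbow, and wrist keypoints.
def joint_angle(a, b, c):
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

print(joint_angle((0, 0), (10, 0), (20, 0)))   # fully extended arm: 180.0
print(joint_angle((0, 0), (10, 0), (10, 10)))  # right angle at elbow: 90.0
```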

5. YOLO-World: Zero-Shot Open-Vocabulary Detection

# YOLO-World: detect ANYTHING without training, just describe it in text
# Traditional YOLO has 80 fixed COCO classes. YOLO-World has unlimited classes.

from ultralytics import YOLOWorld

model = YOLOWorld("yolov8x-worldv2.pt")  # Pre-trained on vision-language data

# Set the classes you want to detect (no training required!)
model.set_classes(["forklift", "safety vest", "hard hat", "fire extinguisher"])

results = model.predict("warehouse.jpg", conf=0.3)
# Detects exactly those 4 classes in the image

# Industrial AI use case: safety compliance monitoring
model.set_classes([
    "worker without helmet",    # Detect safety violations!
    "worker without vest",
    "wet floor sign",
    "blocked emergency exit",
])

# UI automation use case (for RPA agents):
model.set_classes(["login button", "username field", "password field", "submit button"])
# Agent can now locate any UI element by description
results = model.predict(screenshot, conf=0.25)
for box in results[0].boxes:
    print(f"Found: {results[0].names[int(box.cls)]} at {box.xyxy[0].tolist()}")
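For RPA, the agent usually needs a click coordinate rather than a box. A sketch (click_point is a hypothetical helper; the input matches the box.xyxy format above):

```python
# Sketch: turn a detected UI element's [x1, y1, x2, y2] pixel box
# into an integer click target at the box center.
def click_point(xyxy):
    x1, y1, x2, y2 = xyxy
    return (int((x1 + x2) / 2), int((y1 + y2) / 2))

print(click_point([412.0, 308.0, 588.0, 352.0]))  # (500, 330)
```

The center point then feeds directly into a mouse-automation call (e.g. pyautogui.click, if that is your automation layer).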

Frequently Asked Questions

How many images do I need to fine-tune YOLO on a custom class?

As a rough guide: 200-500 images per class for simple objects with consistent appearance (a specific product SKU), 500-1000 for objects with significant visual variation (cars, people in different poses), and 1000+ for fine-grained classification (detecting manufacturing defects). YOLO's mosaic augmentation is particularly effective at bootstrapping small datasets — 200 images with aggressive augmentation often outperforms 500 images without augmentation. Use Roboflow's free tier to annotate and augment your dataset before training.

How do I deploy YOLO at the edge (Raspberry Pi, Jetson)?

Export to NCNN (fastest for ARM CPUs), TensorRT (fastest for Jetson GPUs), or ONNX Runtime (cross-platform). On a Raspberry Pi 5, the NCNN backend (which runs on the ARM cores, not the VideoCore GPU) achieves ~8fps with YOLOv8n at 320x320. On a Jetson Orin NX, TensorRT runs YOLOv8n at ~200fps at 640x640. YOLO's nano and small variants were designed with edge deployment in mind: nano runs at useful speeds even on Cortex-A53 CPUs.

Conclusion

YOLO gives AI agents visual perception capabilities that LLMs can't match for real-time video. The Ultralytics ecosystem (YOLOv8/v11) provides detection, segmentation, tracking, and pose estimation in a single unified API. YOLO-World extends this to open-vocabulary detection where any object can be described in text without training. For production agents that need to monitor physical environments — factory floors, retail spaces, security systems, or screen automation — YOLO is the fastest path from camera input to actionable events.

Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
