HNSW: The Highway System of Vector Search
Dec 30, 2025 • 18 min read
Every time you query a RAG system, a nearest-neighbor search runs over millions of high-dimensional vectors. The naive approach, computing similarity to every vector and returning the top-K, scales as O(N·D), where N is the number of vectors and D is the embedding dimension (often 768-3072). For one million 1536-dimensional embeddings, that's roughly 1.5 billion float multiplications per query; at 60 queries per second, that's 90 billion operations per second, far too expensive to serve at low latency. HNSW (Hierarchical Navigable Small World) graphs solve this with a probabilistic multi-layer structure that achieves roughly O(log N) search complexity at >95% recall.
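To make that baseline concrete, here is a minimal brute-force top-K in NumPy, scaled down to 10,000 vectors so it runs instantly; the arithmetic is the same O(N·D) pattern described above:

```python
import numpy as np

# Brute-force top-K: the O(N*D) baseline that HNSW replaces.
# Scaled down to 10k x 128 so it runs in milliseconds; at 1M x 1536
# the same code does ~1.5 billion multiply-adds per query.
rng = np.random.default_rng(0)
N, D, K = 10_000, 128, 10
db = rng.standard_normal((N, D)).astype(np.float32)
query = rng.standard_normal(D).astype(np.float32)

# Squared L2 distance to every vector, then take the K smallest.
dists = ((db - query) ** 2).sum(axis=1)   # O(N*D) work
top_k = np.argpartition(dists, K)[:K]     # K smallest, unordered
top_k = top_k[np.argsort(dists[top_k])]   # sort those K by distance

print(top_k.shape)  # (10,)
```

Every query pays the full N·D cost; the rest of this post is about avoiding exactly that.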
1. The Core Intuition: Road Networks
The HNSW algorithm is directly analogous to a road network. Imagine you're in Los Angeles and need to reach a specific house in Boston:
- Interstates (top layer): You don't start at the house level; you take I-10 East. Fewer decision points, but each covers massive distance
- Highways (middle layers): As you approach your destination state, transition to state highways with more options and finer granularity
- Local streets (base layer): Only when you're in the right neighborhood do you navigate block-by-block
HNSW builds exactly this structure for vectors. At the top layer, there are only a handful of "hub" vectors with long-range connections. Each layer down adds more vectors and shorter connections. Search starts at the top, greedily descending to the nearest hub, then drops to the next layer for finer search, repeating until reaching the base layer where exact neighbors are found within a small local region.
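The layered descent can be sketched in one dimension, where HNSW's hierarchy degenerates into a skip-list-like structure. This is a didactic toy under simplified assumptions (sorted scalars, stride-halving layers, no graph edges or candidate heap), not the real algorithm:

```python
import numpy as np

# 1-D toy of HNSW's layered greedy descent: layer l keeps every 2**l-th
# value of a sorted array. Search walks greedily within a layer until the
# distance to q stops improving, then drops one layer down and repeats.
rng = np.random.default_rng(1)
values = np.sort(rng.uniform(0, 1000, size=4096))
layers = [values[::2**l] for l in range(8, -1, -1)]  # coarse -> fine

def search(q):
    pos = 0  # index within the current (coarsest) layer
    for i, layer in enumerate(layers):
        moved = True
        while moved:  # greedy: step toward q while distance improves
            moved = False
            if pos + 1 < len(layer) and abs(layer[pos + 1] - q) < abs(layer[pos] - q):
                pos, moved = pos + 1, True
            elif pos > 0 and abs(layer[pos - 1] - q) < abs(layer[pos] - q):
                pos, moved = pos - 1, True
        if i + 1 < len(layers):
            pos *= 2  # same element's index one layer down (stride halves)
    return pos

q = 500.0
idx = search(q)  # index of the value nearest to q
```

Each layer narrows the search region before the next refines it: the interstate-to-local-streets descent, flattened onto a number line.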
2. Parameter Tuning: M, efConstruction, efSearch
import faiss
import numpy as np
# Create HNSW index with Faiss
dim = 1536 # OpenAI text-embedding-3-small dimension
index = faiss.IndexHNSWFlat(
    dim,
    32,  # M: number of bidirectional links per node (Faiss default: 32)
    # Higher M = better recall, more RAM, slower inserts
    # M=16: low quality, M=64: high quality, M=128: academic/benchmark
)
# Set construction-time efConstruction
index.hnsw.efConstruction = 200
# efConstruction: Size of candidate list during index building
# Higher = Better index quality, Slower initial insert
# Rule: efConstruction >= 2*M (so 200 for M=32 is good)
# Build the index by inserting vectors
embeddings = np.random.randn(1_000_000, dim).astype(np.float32)
index.add(embeddings)
# Set query-time efSearch (most tunable parameter for recall/latency tradeoff)
index.hnsw.efSearch = 64
# Higher efSearch = Better recall at query time, Slower queries
# efSearch=16: fast but ~85% recall
# efSearch=64: balanced ~95% recall
# efSearch=256: high accuracy ~99% recall (but 4x slower)
# Query
k = 10 # Return top-10 nearest neighbors
query = np.random.randn(1, dim).astype(np.float32)
distances, indices = index.search(query, k)
print(f"Top-10 indices: {indices[0]}")
print(f"Distances: {distances[0]}")

3. Pre-Filtering vs Post-Filtering: The Metadata Problem
# The challenge: "Find the 10 most similar products to this query,
# but only in the 'Electronics' category, priced under $500"
#
# Option 1: POST-FILTERING (naive, problematic)
k_fetch = 1000 # Fetch 1000 neighbors
distances, indices = index.search(query, k_fetch)
# Then filter by metadata:
results = [idx for idx in indices[0]
           if metadata[idx]['category'] == 'Electronics'
           and metadata[idx]['price'] < 500]
# Problem: if only 0.5% of vectors match the filter (50K of 10M total),
# 1000 fetched neighbors yield only ~5 matches on average (1000 x 0.5%),
# and possibly ZERO when the filter correlates with vector location.
# "Recall death": the real nearest neighbors were filtered out.
# Option 2: PRE-FILTERING (correct semantics, kills the graph)
# Create a separate faiss index per category - works but expensive
# Or use a filtered scan: check metadata before evaluating vectors
# Problem: destroys HNSW graph traversal efficiency - degrades toward O(N)
# Option 3: BITMAP FILTERING (correct AND efficient)
# Modern vector databases (Qdrant, Weaviate, Milvus) implement this:
# 1. Build a bitmap of all matching document IDs from your filter
# 2. During HNSW traversal, skip neighbors not in the bitmap
# 3. The graph traversal continues but ignores non-matching nodes
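The skip-the-node idea in step 2 can be sketched with a plain NumPy k-NN graph. Everything here (the `allowed` bitmap, the exhaustive frontier) is illustrative only; production engines use proper candidate heaps and early termination:

```python
import numpy as np

# Toy filtered graph search: greedy best-first over a k-NN graph where
# nodes outside the `allowed` bitmap can still be TRAVERSED (keeping the
# graph connected) but are never RETURNED as results.
rng = np.random.default_rng(2)
N, D, K_GRAPH = 500, 16, 8
vecs = rng.standard_normal((N, D)).astype(np.float32)
d2 = ((vecs[:, None, :] - vecs[None, :, :]) ** 2).sum(-1)
graph = np.argsort(d2, axis=1)[:, 1:K_GRAPH + 1]  # brute-force k-NN edges
allowed = rng.random(N) < 0.2                     # filter bitmap: ~20% match

def filtered_search(q, k=5):
    dist = lambda i: float(((vecs[i] - q) ** 2).sum())
    visited = {0}
    frontier = [(dist(0), 0)]
    results = []
    while frontier:
        frontier.sort()
        d, node = frontier.pop(0)        # expand the closest candidate
        if allowed[node]:
            results.append((d, node))    # only bitmap hits become results
        for nb in map(int, graph[node]):
            if nb not in visited:
                visited.add(nb)
                frontier.append((dist(nb), nb))
    return [n for _, n in sorted(results)[:k]]

q_vec = rng.standard_normal(D).astype(np.float32)
hits = filtered_search(q_vec)  # k nearest neighbors that pass the filter
```

The key property: filtered-out nodes still serve as stepping stones, so the traversal never dead-ends the way naive pre-filtering does.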
from qdrant_client import QdrantClient, models
client = QdrantClient(host="localhost", port=6333)
# Efficient filtered search in Qdrant (bitmap filtering under the hood)
results = client.search(
collection_name="products",
query_vector=query_embedding,
query_filter=models.Filter(
must=[
models.FieldCondition(
key="category",
match=models.MatchValue(value="Electronics"),
),
models.FieldCondition(
key="price",
range=models.Range(lt=500),
),
]
),
limit=10,
)
# Then combines bitmap with HNSW traversal for efficient filtered ANN search4. HNSW vs IVF: Choosing the Right Index
| Property | HNSW | IVF (Flat/PQ) |
|---|---|---|
| Query Speed (1M vectors) | ~1ms | ~5ms |
| Memory Usage | High (N×D vectors + ~N×M links) | Low (centroids + PQ codes) |
| Insert Speed | Moderate (graph search per insert) | Fast (append to bucket) |
| Recall @10 default | ~96% | ~90% |
| GPU Support | Limited | Excellent (faiss.GpuIndex) |
| Ideal Scale | < 50M vectors | > 50M vectors |
| Filtering Support | Bitmap in modern DBs | Harder (bucket granularity) |
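To put rough numbers on the memory row, a back-of-envelope estimate assuming 50M 1536-dimensional float32 vectors, M=32 with about 2·M int32 link slots per node, and 64-byte PQ codes per vector (real-world overheads vary):

```python
# Rough index sizing under the assumptions stated above.
N, D, M = 50_000_000, 1536, 32

hnsw_vectors = N * D * 4      # raw float32 vectors kept alongside the graph
hnsw_links = N * M * 2 * 4    # ~2*M int32 link slots per node (rough)
ivfpq_codes = N * 64          # 64 one-byte PQ codes per vector

gib = lambda b: b / 2**30
print(f"HNSW-Flat:   ~{gib(hnsw_vectors + hnsw_links):.0f} GiB")  # ~298 GiB
print(f"IVFPQ codes: ~{gib(ivfpq_codes):.0f} GiB")                # ~3 GiB
```

The raw vectors dominate HNSW-Flat's footprint, which is why PQ compression wins by roughly two orders of magnitude at this scale.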
# Hybrid IVF-HNSW: Best of both worlds for 10M+ vectors
# Faiss IVF_HNSW: Use HNSW to navigate the IVF coarse quantizer
import faiss
nlist = 4096 # Number of IVF clusters (sqrt of dataset size as rule of thumb)
M_pq = 64 # PQ compression: 64 subspaces of 1536/64 = 24 dims each
# Build IVF + Product Quantization index: each 6KB float32 vector
# compresses to 64 bytes of PQ codes, roughly 96x smaller
quantizer = faiss.IndexHNSWFlat(dim, 32) # HNSW for coarse search
index = faiss.IndexIVFPQ(quantizer, dim, nlist, M_pq, 8) # 8-bit PQ codes
# Train on a representative sample (required for PQ)
sample = embeddings[:100_000]
index.train(sample)
# Now add all vectors
index.add(embeddings)
index.nprobe = 64 # Search 64 of the 4096 clusters per query (recall/speed tradeoff)
# Save index
faiss.write_index(index, "billion_scale.index")

Frequently Asked Questions
What's the right M value for my use case?
M=16 for high-throughput scenarios where 85-90% recall is acceptable (>1000 QPS). M=32 (the common default) for balanced production RAG (~500 QPS, ~95% recall). M=64 for high-accuracy retrieval where recall matters more than throughput (legal, medical, financial). M=128+ only for benchmarking or research. Doubling M roughly doubles the graph's link memory (though raw vectors still dominate total RAM) while yielding diminishing recall improvements above M=64. Always measure recall on your actual dataset; theoretical improvements don't always translate to practice.
Why does increasing efSearch help recall but not index quality?
efSearch controls how many candidate neighbors are considered during the query-time search through the graph; it's a runtime parameter that trades query latency for recall. efConstruction controls how thoroughly the graph was built; it's a build-time parameter that permanently determines index quality. A poorly built index (low efConstruction) cannot be rescued by a high efSearch: you'd be searching a sparse, low-quality graph more thoroughly but still missing connections that should exist.
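Recall is the number to track while sweeping efSearch. A minimal sketch of computing recall@10 against exact ground truth, with the "approximate" result faked (exact top-10 with two entries corrupted) purely to demonstrate the metric:

```python
import numpy as np

# recall@K = fraction of the true top-K that the ANN result recovered.
rng = np.random.default_rng(3)
N, D, K = 5_000, 64, 10
db = rng.standard_normal((N, D)).astype(np.float32)
q = rng.standard_normal(D).astype(np.float32)

exact = np.argsort(((db - q) ** 2).sum(axis=1))[:K]  # ground-truth top-10
wrong = np.setdiff1d(np.arange(N), exact)[:2]        # two ids not in exact
approx = np.concatenate([exact[:K - 2], wrong])      # a fake ANN result

recall = len(set(exact) & set(approx)) / K
print(recall)  # 0.8
```

In practice you compute `exact` once with a flat index on a held-out query set, then rerun the ANN side at each efSearch value and plot recall against latency.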
Conclusion
HNSW is the algorithmic backbone of modern vector search: Weaviate, Qdrant, Chroma, and Milvus all ship it as a core index, and managed services like Pinecone rely on similar graph-based ANN structures. Understanding M, efConstruction, and efSearch lets you tune the recall-latency-memory tradeoff for your specific workload. For most RAG applications under 50M vectors, HNSW with M=32, efConstruction=200, and efSearch=64 provides excellent performance. Add filtered search via bitmap indexes (supported natively in Qdrant and Weaviate) when your queries need metadata constraints.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.