HNSW: The Highway System of Vector Search
Dec 30, 2025 • 18 min read
Every time you query a RAG system, a nearest-neighbor search runs over millions of high-dimensional vectors. The naive approach, computing similarity to every vector and returning the top-K, scales as O(N·D), where N is the number of vectors and D is the embedding dimension (often 768-3072). For one million 1536-dimensional embeddings, that's roughly 1.5 billion float multiplications per query; at 60 queries per second, that's 90 billion operations per second, far too expensive to serve at low latency. HNSW (Hierarchical Navigable Small World) graphs solve this with a probabilistic multi-layer structure that achieves roughly O(log N) search complexity at >95% recall.
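To make that baseline concrete, here is a minimal brute-force top-K in NumPy, scaled down to 10,000 vectors so it runs instantly; the arithmetic is the same O(N·D) pattern described above:

```python
import numpy as np

# Brute-force top-K: the O(N*D) baseline that HNSW replaces.
# Scaled down to 10k x 128 so it runs in milliseconds; at 1M x 1536
# the same code does ~1.5 billion multiply-adds per query.
rng = np.random.default_rng(0)
N, D, K = 10_000, 128, 10
db = rng.standard_normal((N, D)).astype(np.float32)
query = rng.standard_normal(D).astype(np.float32)

# Squared L2 distance to every vector, then take the K smallest.
dists = ((db - query) ** 2).sum(axis=1)   # O(N*D) work
top_k = np.argpartition(dists, K)[:K]     # K smallest, unordered
top_k = top_k[np.argsort(dists[top_k])]   # sort those K by distance

print(top_k.shape)  # (10,)
```

Every query pays the full N·D cost; the rest of this post is about avoiding exactly that.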
1. The Core Intuition: Road Networks
The HNSW algorithm is directly analogous to a road network. Imagine you're in Los Angeles and need to reach a specific house in Boston:
- Interstates (top layer): You don't start at the house level; you take I-10 East. Fewer decision points, but each covers massive distance
- Highways (middle layers): As you approach your destination state, transition to state highways with more options and finer granularity
- Local streets (base layer): Only when you're in the right neighborhood do you navigate block-by-block
HNSW builds exactly this structure for vectors. At the top layer, there are only a handful of "hub" vectors with long-range connections. Each layer down adds more vectors and shorter connections. Search starts at the top, greedily descending to the nearest hub, then drops to the next layer for finer search, repeating until reaching the base layer where exact neighbors are found within a small local region.
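The layered descent can be sketched in one dimension, where HNSW's hierarchy degenerates into a skip-list-like structure. This is a didactic toy under simplified assumptions (sorted scalars, stride-halving layers, no graph edges or candidate heap), not the real algorithm:

```python
import numpy as np

# 1-D toy of HNSW's layered greedy descent: layer l keeps every 2**l-th
# value of a sorted array. Search walks greedily within a layer until the
# distance to q stops improving, then drops one layer down and repeats.
rng = np.random.default_rng(1)
values = np.sort(rng.uniform(0, 1000, size=4096))
layers = [values[::2**l] for l in range(8, -1, -1)]  # coarse -> fine

def search(q):
    pos = 0  # index within the current (coarsest) layer
    for i, layer in enumerate(layers):
        moved = True
        while moved:  # greedy: step toward q while distance improves
            moved = False
            if pos + 1 < len(layer) and abs(layer[pos + 1] - q) < abs(layer[pos] - q):
                pos, moved = pos + 1, True
            elif pos > 0 and abs(layer[pos - 1] - q) < abs(layer[pos] - q):
                pos, moved = pos - 1, True
        if i + 1 < len(layers):
            pos *= 2  # same element's index one layer down (stride halves)
    return pos

q = 500.0
idx = search(q)  # index of the value nearest to q
```

Each layer narrows the search region before the next refines it: the interstate-to-local-streets descent, flattened onto a number line.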
2. Parameter Tuning: M, efConstruction, efSearch
import faiss
import numpy as np
# Create HNSW index with Faiss
dim = 1536 # OpenAI text-embedding-3-small dimension
index = faiss.IndexHNSWFlat(
    dim,
    32,  # M: number of bidirectional links per node (Faiss default: 32)
    # Higher M = better recall, more RAM, slower inserts
    # M=16: low quality, M=64: high quality, M=128: academic/benchmark
)
# Set construction-time efConstruction
index.hnsw.efConstruction = 200
# efConstruction: Size of candidate list during index building
# Higher = Better index quality, Slower initial insert
# Rule: efConstruction >= 2*M (so 200 for M=32 is good)
# Build the index by inserting vectors
embeddings = np.random.randn(1_000_000, dim).astype(np.float32)
index.add(embeddings)
# Set query-time efSearch (most tunable parameter for recall/latency tradeoff)
index.hnsw.efSearch = 64
# Higher efSearch = Better recall at query time, Slower queries
# efSearch=16: fast but ~85% recall
# efSearch=64: balanced ~95% recall
# efSearch=256: high accuracy ~99% recall (but 4x slower)
# Query
k = 10 # Return top-10 nearest neighbors
query = np.random.randn(1, dim).astype(np.float32)
distances, indices = index.search(query, k)
print(f"Top-10 indices: {indices[0]}")
print(f"Distances: {distances[0]}")

3. Pre-Filtering vs Post-Filtering: The Metadata Problem
# The challenge: "Find the 10 most similar products to this query,
# but only in the 'Electronics' category, priced under $500"
#
# Option 1: POST-FILTERING (naive, problematic)
k_fetch = 1000 # Fetch 1000 neighbors
distances, indices = index.search(query, k_fetch)
# Then filter by metadata:
results = [idx for idx in indices[0]
           if metadata[idx]['category'] == 'Electronics'
           and metadata[idx]['price'] < 500]
# Problem: if only 0.5% of vectors match the filter (50K of 10M total),
# 1000 fetched neighbors yield only ~5 matches on average (1000 x 0.5%),
# and possibly ZERO when the filter correlates with vector location.
# "Recall death": the real nearest neighbors were filtered out.
# Option 2: PRE-FILTERING (correct semantics, kills the graph)
# Create a separate faiss index per category - works but expensive
# Or use a filtered scan: check metadata before evaluating vectors
# Problem: destroys HNSW graph traversal efficiency - degrades toward O(N)
# Option 3: BITMAP FILTERING (correct AND efficient)
# Modern vector databases (Qdrant, Weaviate, Milvus) implement this:
# 1. Build a bitmap of all matching document IDs from your filter
# 2. During HNSW traversal, skip neighbors not in the bitmap
# 3. The graph traversal continues but ignores non-matching nodes
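The skip-the-node idea in step 2 can be sketched with a plain NumPy k-NN graph. Everything here (the `allowed` bitmap, the exhaustive frontier) is illustrative only; production engines use proper candidate heaps and early termination:

```python
import numpy as np

# Toy filtered graph search: greedy best-first over a k-NN graph where
# nodes outside the `allowed` bitmap can still be TRAVERSED (keeping the
# graph connected) but are never RETURNED as results.
rng = np.random.default_rng(2)
N, D, K_GRAPH = 500, 16, 8
vecs = rng.standard_normal((N, D)).astype(np.float32)
d2 = ((vecs[:, None, :] - vecs[None, :, :]) ** 2).sum(-1)
graph = np.argsort(d2, axis=1)[:, 1:K_GRAPH + 1]  # brute-force k-NN edges
allowed = rng.random(N) < 0.2                     # filter bitmap: ~20% match

def filtered_search(q, k=5):
    dist = lambda i: float(((vecs[i] - q) ** 2).sum())
    visited = {0}
    frontier = [(dist(0), 0)]
    results = []
    while frontier:
        frontier.sort()
        d, node = frontier.pop(0)        # expand the closest candidate
        if allowed[node]:
            results.append((d, node))    # only bitmap hits become results
        for nb in map(int, graph[node]):
            if nb not in visited:
                visited.add(nb)
                frontier.append((dist(nb), nb))
    return [n for _, n in sorted(results)[:k]]

q_vec = rng.standard_normal(D).astype(np.float32)
hits = filtered_search(q_vec)  # k nearest neighbors that pass the filter
```

The key property: filtered-out nodes still serve as stepping stones, so the traversal never dead-ends the way naive pre-filtering does.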
from qdrant_client import QdrantClient, models
client = QdrantClient(host="localhost", port=6333)
# Efficient filtered search in Qdrant (bitmap filtering under the hood)
results = client.search(
collection_name="products",
query_vector=query_embedding,
query_filter=models.Filter(
must=[
models.FieldCondition(
key="category",
match=models.MatchValue(value="Electronics"),
),
models.FieldCondition(
key="price",
range=models.Range(lt=500),
),
]
),
limit=10,
)
# Then combines bitmap with HNSW traversal for efficient filtered ANN search4. HNSW vs IVF: Choosing the Right Index
| Property | HNSW | IVF (Flat/PQ) |
|---|---|---|
| Query Speed (1M vectors) | ~1ms | ~5ms |
| Memory Usage | High (N×D vectors + ~N×M links) | Low (centroids + PQ codes) |
| Insert Speed | Moderate (graph search per insert) | Fast (append to bucket) |
| Recall @10 default | ~96% | ~90% |
| GPU Support | Limited | Excellent (faiss.GpuIndex) |
| Ideal Scale | < 50M vectors | > 50M vectors |
| Filtering Support | Bitmap in modern DBs | Harder (bucket granularity) |
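To put rough numbers on the memory row, a back-of-envelope estimate assuming 50M 1536-dimensional float32 vectors, M=32 with about 2·M int32 link slots per node, and 64-byte PQ codes per vector (real-world overheads vary):

```python
# Rough index sizing under the assumptions stated above.
N, D, M = 50_000_000, 1536, 32

hnsw_vectors = N * D * 4      # raw float32 vectors kept alongside the graph
hnsw_links = N * M * 2 * 4    # ~2*M int32 link slots per node (rough)
ivfpq_codes = N * 64          # 64 one-byte PQ codes per vector

gib = lambda b: b / 2**30
print(f"HNSW-Flat:   ~{gib(hnsw_vectors + hnsw_links):.0f} GiB")  # ~298 GiB
print(f"IVFPQ codes: ~{gib(ivfpq_codes):.0f} GiB")                # ~3 GiB
```

The raw vectors dominate HNSW-Flat's footprint, which is why PQ compression wins by roughly two orders of magnitude at this scale.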
# Hybrid IVF-HNSW: Best of both worlds for 10M+ vectors
# Faiss IVF_HNSW: Use HNSW to navigate the IVF coarse quantizer
import faiss
nlist = 4096 # Number of IVF clusters (sqrt of dataset size as rule of thumb)
M_pq = 64 # PQ compression: 64 subspaces of 1536/64 = 24 dims each
# Build IVF + Product Quantization index: each 6KB float32 vector
# compresses to 64 bytes of PQ codes, roughly 96x smaller
quantizer = faiss.IndexHNSWFlat(dim, 32) # HNSW for coarse search
index = faiss.IndexIVFPQ(quantizer, dim, nlist, M_pq, 8) # 8-bit PQ codes
# Train on a representative sample (required for PQ)
sample = embeddings[:100_000]
index.train(sample)
# Now add all vectors
index.add(embeddings)
index.nprobe = 64 # Search 64 of the 4096 clusters per query (recall/speed tradeoff)
# Save index
faiss.write_index(index, "billion_scale.index")

Frequently Asked Questions
What's the right M value for my use case?
M=16 for high-throughput scenarios where 85-90% recall is acceptable (>1000 QPS). M=32 (the common default) for balanced production RAG (~500 QPS, ~95% recall). M=64 for high-accuracy retrieval where recall matters more than throughput (legal, medical, financial). M=128+ only for benchmarking or research. Doubling M roughly doubles the graph's link memory (though raw vectors still dominate total RAM) while yielding diminishing recall improvements above M=64. Always measure recall on your actual dataset; theoretical improvements don't always translate to practice.
Why does increasing efSearch help recall but not index quality?
efSearch controls how many candidate neighbors are considered during the query-time search through the graph; it's a runtime parameter that trades query latency for recall. efConstruction controls how thoroughly the graph was built; it's a build-time parameter that permanently determines index quality. A poorly built index (low efConstruction) cannot be rescued by a high efSearch: you'd be searching a sparse, low-quality graph more thoroughly but still missing connections that should exist.
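Recall is the number to track while sweeping efSearch. A minimal sketch of computing recall@10 against exact ground truth, with the "approximate" result faked (exact top-10 with two entries corrupted) purely to demonstrate the metric:

```python
import numpy as np

# recall@K = fraction of the true top-K that the ANN result recovered.
rng = np.random.default_rng(3)
N, D, K = 5_000, 64, 10
db = rng.standard_normal((N, D)).astype(np.float32)
q = rng.standard_normal(D).astype(np.float32)

exact = np.argsort(((db - q) ** 2).sum(axis=1))[:K]  # ground-truth top-10
wrong = np.setdiff1d(np.arange(N), exact)[:2]        # two ids not in exact
approx = np.concatenate([exact[:K - 2], wrong])      # a fake ANN result

recall = len(set(exact) & set(approx)) / K
print(recall)  # 0.8
```

In practice you compute `exact` once with a flat index on a held-out query set, then rerun the ANN side at each efSearch value and plot recall against latency.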
Conclusion
HNSW is the algorithmic backbone of modern vector search: Weaviate, Qdrant, Chroma, and Milvus all ship it as a core index, and managed services like Pinecone rely on similar graph-based ANN structures. Understanding M, efConstruction, and efSearch lets you tune the recall-latency-memory tradeoff for your specific workload. For most RAG applications under 50M vectors, HNSW with M=32, efConstruction=200, and efSearch=64 provides excellent performance. Add filtered search via bitmap indexes (supported natively in Qdrant and Weaviate) when your queries need metadata constraints.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.