Vector Databases 101
Dec 29, 2025 • 15 min read
The biggest limitation of LLMs is the context window: you cannot feed 10,000 PDFs into a single prompt. The standard solution is RAG (Retrieval-Augmented Generation) backed by a vector database.
1. What is an Embedding?
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. OpenAI's text-embedding-3-small model converts any sentence into a 1536-dimensional vector.
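The "meaning as numbers" idea becomes concrete once you compare two vectors. The three-dimensional vectors below are made up purely for illustration; a real pipeline would call the OpenAI embeddings API and get 1536 numbers back per text:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction (same meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones from text-embedding-3-small have 1536)
king = [0.9, 0.8, 0.1]
queen = [0.88, 0.82, 0.15]
banana = [0.1, 0.2, 0.95]

print(cosine_similarity(king, queen))   # close to 1.0: similar meaning
print(cosine_similarity(king, banana))  # much lower: unrelated meaning
```

This similarity score is exactly what a vector database computes at query time, just over millions of stored vectors instead of three.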
2. The Competition: Pinecone vs Chroma vs Milvus
Pinecone
Managed SaaS. Easiest to start. Expensive at scale.
Chroma
Open source. Runs embedded or in Docker. Great for development.
Milvus
Enterprise scale. Built for billion-scale vectors.
3. Code: Metadata Filtering
Advanced RAG isn't just about vectors; it's about Hybrid Search with metadata filters.
// Search for "revenue" but ONLY in 2024 documents
const results = await index.query({
  vector: queryEmbedding,
  filter: {
    year: { $eq: 2024 },
    department: { $in: ["sales", "marketing"] }
  },
  topK: 5
});
4. The HNSW Algorithm (How It's So Fast)
How do you search 1 million vectors in 10ms? You don't scan them all. Vector DBs use HNSW (Hierarchical Navigable Small World) graphs to navigate the "neighborhood" of your query vector in logarithmic time.
Think of it like a GPS navigation system. Instead of checking every road in the country, the GPS starts at the highway level, narrows to the city level, then the street level. HNSW does the same thing with vector space: it builds a multi-layer graph where upper layers are "highways" connecting distant regions, and lower layers are "side streets" for local precision.
The result: a query over millions of vectors returns in roughly 10ms on a single CPU core, comparable to a B-tree index lookup in a relational database. This is what makes real-time semantic search practical at production scale.
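The greedy "get closer at every hop" descent can be sketched in a few lines. This is a toy, single-layer version with hand-picked points and edges; a real HNSW index builds its multi-layer neighbor graph automatically:

```python
import math

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance between two points

# Hand-built neighbor graph over 2-D points (a real HNSW constructs this itself)
points = {0: (0, 0), 1: (5, 0), 2: (9, 1), 3: (10, 10), 4: (9.5, 9.0)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def greedy_search(query, entry=0):
    """Hop to whichever neighbor is closer to the query; stop at a local minimum."""
    current = entry
    while True:
        best = min(neighbors[current], key=lambda n: dist(points[n], query))
        if dist(points[best], query) >= dist(points[current], query):
            return current  # no neighbor is closer: this is the nearest node found
        current = best

print(greedy_search((9, 9)))  # walks 0 -> 1 -> 2 -> 3 -> 4 and returns 4
```

Each hop discards most of the graph, which is why the search cost grows logarithmically rather than linearly with the number of vectors.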
5. Building Your First RAG Pipeline End-to-End
Let's build a complete RAG system step by step. This example uses ChromaDB (free, local) and OpenAI embeddings:
Step 1: Chunk Your Documents
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF or text file
with open("company_handbook.txt") as f:
    raw_text = f.read()

# Chunk into ~500-character pieces with a 50-character overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # Characters per chunk (not tokens)
    chunk_overlap=50,    # Overlap to preserve context across chunks
    separators=["\n\n", "\n", " ", ""]  # Try paragraphs, then lines, then words
)
chunks = splitter.split_text(raw_text)
print(f"Created {len(chunks)} chunks")
Step 2: Embed and Store
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

client = chromadb.PersistentClient(path="./db")  # Persists to disk

embedding_fn = OpenAIEmbeddingFunction(
    api_key="your-openai-key",
    model_name="text-embedding-3-small"  # $0.02 per 1M tokens
)

collection = client.create_collection(
    name="handbook",
    embedding_function=embedding_fn
)

# Add all chunks (ChromaDB handles embedding automatically)
collection.add(
    documents=chunks,
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)
print(f"Stored {len(chunks)} chunks in vector DB")
Step 3: Retrieve and Generate
from openai import OpenAI

openai = OpenAI(api_key="your-key")

def ask_rag(question: str) -> str:
    # 1. Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=5  # Get top 5 most similar chunks
    )
    context = "\n---\n".join(results["documents"][0])

    # 2. Generate an answer grounded in the retrieved context
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only using the context below. Say 'I don't know' if not covered.\n\nContext:\n" + context},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

print(ask_rag("What is the vacation policy?"))
6. Chunking Strategies Explained
How you split your documents dramatically affects RAG quality. The wrong chunking strategy is the #1 cause of poor retrieval:
| Strategy | Chunk Size | Best For |
|---|---|---|
| Fixed-size | 200-500 tokens | Quick prototypes, consistent document types |
| Recursive Character | 500-1000 tokens | Most use cases; respects sentence/paragraph boundaries |
| Semantic | Variable | High-quality Q&A where each chunk = one idea |
| Document-based | Whole doc | Summarization tasks, not retrieval |
| Parent-child | Small child, large parent | Best precision + context balance |
Pro tip: Use the Parent-Child strategy for best results: search on small child chunks (200 tokens, high precision), but retrieve the full parent paragraph (800 tokens) for context. LangChain's ParentDocumentRetriever implements this automatically.
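A minimal sketch of the parent-child idea, with sizes in characters and a naive keyword match standing in for embedding search (LangChain's ParentDocumentRetriever does this against a real vector store):

```python
def make_parent_child_index(text: str, parent_size: int = 200, child_size: int = 50):
    """Split text into large parents, then each parent into small children."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    child_index = {}  # child chunk text -> index of its parent
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            child_index[parent[j:j + child_size]] = p_idx
    return parents, child_index

def retrieve(query: str, parents, child_index):
    """Match on small, precise children; return the full parent for context."""
    for child, p_idx in child_index.items():
        if query.lower() in child.lower():
            return parents[p_idx]
    return None

doc = "Employees accrue vacation monthly. " * 20
parents, child_index = make_parent_child_index(doc)
hit = retrieve("vacation", parents, child_index)
print(hit is not None)  # True: matched a small child, returned its large parent
```

The search precision comes from the small children; the answer quality comes from handing the LLM the larger parent.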
7. Pinecone vs ChromaDB vs Milvus: Deep Comparison
Pinecone: Best for Production SaaS
Pinecone is a fully managed serverless vector database. You don't touch any infrastructure. It scales automatically from 0 to billions of vectors, has built-in replication, and offers a generous free tier (1 index, 100k vectors). The downside: cost at scale. At 10M vectors, expect $70-200/month depending on dimensions and queries.
Use when: Building a customer-facing product, don't want to manage infrastructure, need 99.99% uptime SLA.
ChromaDB: Best for Development and Small Scale
ChromaDB runs embedded (in-process, no separate server) or as a local Docker container. It's the fastest way to get started: five lines of code and you're storing vectors. Persistent client mode saves to disk. It's not designed for >10M vectors in production, but it's perfect for prototypes, internal tools, and datasets under 1M vectors.
Use when: Prototyping, local development, datasets under 1M vectors, need zero infrastructure overhead.
Milvus: Best for Large-Scale Enterprise
Milvus is purpose-built for billion-scale vector search in distributed environments. It supports multiple index types (HNSW, IVF_FLAT, IVF_SQ8), GPU acceleration, and horizontal sharding. It's significantly more complex to operate but provides unmatched throughput (millions of QPS). Managed version is available via Zilliz Cloud.
Use when: Storing >50M vectors, need GPU-accelerated search, require enterprise features like RBAC and audit logs.
Troubleshooting RAG Quality
Problem: "Retriever finds irrelevant chunks"
Fix 1: Increase chunk overlap to maintain context across boundaries.
Fix 2: Make sure your distance metric matches your embedding model. For normalized embeddings (like OpenAI's), cosine similarity and dot product rank results identically, so you can use the cheaper dot product.
Fix 3: Try a better embedding model β text-embedding-3-large outperforms text-embedding-3-small by ~10% on retrieval benchmarks.
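The cosine/dot-product point in Fix 2 is easy to verify: on unit-length (normalized) vectors the two scores are mathematically identical, so dot product simply skips the redundant normalization:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# On unit vectors the two scores agree exactly
print(abs(cosine(a, b) - dot(a, b)) < 1e-12)  # True
```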
Problem: "LLM answer doesn't match retrieved text"
Fix: Check faithfulness using Ragas evaluation. Often the LLM is ignoring your context and answering from pre-training knowledge. Add explicit instruction: "You MUST only use information from the provided context. Do NOT use your training data."
Problem: "Context window fills up with too many chunks"
Fix: Reduce n_results from 10 to 3-5, or add a reranker (Cohere Rerank, BGE-Reranker) that scores retrieved chunks by relevance and keeps only the top 3. This dramatically reduces context length while improving quality.
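The rerank-then-trim step can be sketched as follows. The scoring function here is a deliberately crude stand-in; in practice you would call a cross-encoder such as Cohere Rerank or BGE-Reranker for the scores:

```python
def keyword_overlap_score(query: str, chunk: str) -> float:
    """Stand-in relevance score: fraction of query words present in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)

def rerank(query: str, chunks: list[str], keep: int = 3) -> list[str]:
    """Score every retrieved chunk, keep only the top `keep` for the prompt."""
    scored = sorted(chunks, key=lambda c: keyword_overlap_score(query, c), reverse=True)
    return scored[:keep]

retrieved = [
    "Our return policy allows refunds within 30 days.",
    "The cafeteria opens at 8am.",
    "Refunds and returns are processed by the policy team.",
    "Parking passes are issued monthly.",
    "Policy updates are announced quarterly.",
]
top = rerank("return policy refunds", retrieved, keep=3)
print(len(top))  # 3 chunks survive; the cafeteria chunk is dropped
```

The shape is the same regardless of scorer: retrieve generously (10+ candidates), score precisely, and pass only the best few into the context window.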
Frequently Asked Questions
Do I always need a vector database? Can't I just search with SQL LIKE?
SQL LIKE and full-text search (Elasticsearch) find keyword matches. Vector search finds semantic matches. If a user asks "What's the return policy?", SQL will miss documents that say "refund procedure" or "exchange terms". Vector search understands these are the same concept. Use vector search for Q&A and SQL for structured filtering (dates, categories, prices).
How much does it cost to embed 100,000 pages?
Using text-embedding-3-small at $0.02 per 1M tokens: 100,000 pages × 500 tokens/page = 50M tokens = ~$1.00 total. Embedding is essentially free. The ongoing cost is vector storage and query costs at your vector DB provider, typically $5-70/month depending on scale.
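The arithmetic above in a few lines (the price and tokens-per-page figure are the ones quoted in this article; check OpenAI's current pricing page before budgeting):

```python
PAGES = 100_000
TOKENS_PER_PAGE = 500        # rough average for a text-heavy page
PRICE_PER_1M_TOKENS = 0.02   # text-embedding-3-small, USD (as quoted above)

total_tokens = PAGES * TOKENS_PER_PAGE                 # 50,000,000 tokens
cost = total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS  # price scales per 1M tokens
print(f"${cost:.2f}")  # $1.00
```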
Can I update documents after they're embedded?
Yes. Delete the old chunks by ID and add new ones. ChromaDB supports collection.delete(ids=[...]) and Pinecone supports index.delete(ids=[...]). For frequently changing documents, design your IDs to include a hash of the content so you can detect changes efficiently.
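A sketch of the content-hash ID scheme using only Python's standard library (the `source:position:hash` format is illustrative; any stable, collision-resistant hash works):

```python
import hashlib

def chunk_id(source: str, position: int, content: str) -> str:
    """ID embeds a content hash, so a chunk only needs re-embedding when its text changes."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    return f"{source}:{position}:{digest}"

old_id = chunk_id("handbook", 0, "Vacation: 15 days per year.")
new_id = chunk_id("handbook", 0, "Vacation: 20 days per year.")

# Same source and position, different hash: delete old_id, embed and add new_id
print(old_id != new_id)  # True
```

On re-ingestion, compare the freshly computed IDs against what's already stored: unchanged chunks are skipped, and only modified ones pay the delete-and-re-embed cost.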
Conclusion
Vector Databases act as the hippocampus (long-term memory) for your AI agent. Without one, your agent is amnesic: it can only reason about what fits in its context window. With a well-designed RAG pipeline, your agent can access virtually unlimited knowledge at query time, with sub-100ms retrieval latency, at pennies per thousand queries. Master this pattern and you've unlocked the most impactful technique in modern AI engineering.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.