Vector Databases 101
Dec 29, 2025 • 15 min read
The biggest limitation of LLMs is the context window: you cannot feed 10,000 PDFs into a single prompt. The standard solution is RAG (Retrieval-Augmented Generation) backed by a vector database.
1. What is an Embedding?
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. OpenAI's text-embedding-3-small model converts any sentence into a 1536-dimensional vector.
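The "meaning as numbers" idea becomes concrete once you compare two vectors. The three-dimensional vectors below are made up purely for illustration; a real pipeline would call the OpenAI embeddings API and get 1536 numbers back per text:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction (same meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones from text-embedding-3-small have 1536)
king = [0.9, 0.8, 0.1]
queen = [0.88, 0.82, 0.15]
banana = [0.1, 0.2, 0.95]

print(cosine_similarity(king, queen))   # close to 1.0: similar meaning
print(cosine_similarity(king, banana))  # much lower: unrelated meaning
```

This similarity score is exactly what a vector database computes at query time, just over millions of stored vectors instead of three.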
2. The Competition: Pinecone vs Chroma vs Milvus
Pinecone
Managed SaaS. Easiest to start. Expensive at scale.
Chroma
Open source. Runs embedded or in Docker. Great for development.
Milvus
Enterprise scale. Built for billion-scale vectors.
3. Code: Metadata Filtering
Advanced RAG isn't just about vectors; it's about Hybrid Search with metadata filters.
// Search for "revenue" but ONLY in 2024 documents
const results = await index.query({
  vector: queryEmbedding,
  filter: {
    year: { $eq: 2024 },
    department: { $in: ["sales", "marketing"] }
  },
  topK: 5
});
4. The HNSW Algorithm (How It's So Fast)
How do you search 1 million vectors in 10ms? You don't scan them all. Vector DBs use HNSW (Hierarchical Navigable Small World) graphs to navigate the "neighborhood" of your query vector in logarithmic time.
Think of it like a GPS navigation system. Instead of checking every road in the country, the GPS starts at the highway level, narrows to the city level, then the street level. HNSW does the same thing with vector space: it builds a multi-layer graph where upper layers are "highways" connecting distant regions, and lower layers are "side streets" for local precision.
The result: a query over millions of vectors returns in roughly 10ms on a single CPU core, comparable to a B-tree index lookup in a relational database. This is what makes real-time semantic search practical at production scale.
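The greedy "get closer at every hop" descent can be sketched in a few lines. This is a toy, single-layer version with hand-picked points and edges; a real HNSW index builds its multi-layer neighbor graph automatically:

```python
import math

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance between two points

# Hand-built neighbor graph over 2-D points (a real HNSW constructs this itself)
points = {0: (0, 0), 1: (5, 0), 2: (9, 1), 3: (10, 10), 4: (9.5, 9.0)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

def greedy_search(query, entry=0):
    """Hop to whichever neighbor is closer to the query; stop at a local minimum."""
    current = entry
    while True:
        best = min(neighbors[current], key=lambda n: dist(points[n], query))
        if dist(points[best], query) >= dist(points[current], query):
            return current  # no neighbor is closer: this is the nearest node found
        current = best

print(greedy_search((9, 9)))  # walks 0 -> 1 -> 2 -> 3 -> 4 and returns 4
```

Each hop discards most of the graph, which is why the search cost grows logarithmically rather than linearly with the number of vectors.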
5. Building Your First RAG Pipeline End-to-End
Let's build a complete RAG system step by step. This example uses ChromaDB (free, local) and OpenAI embeddings:
Step 1: Chunk Your Documents
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a PDF or text file
with open("company_handbook.txt") as f:
    raw_text = f.read()

# Chunk into ~500-character pieces with a 50-character overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # Characters per chunk (not tokens)
    chunk_overlap=50,    # Overlap to preserve context across chunks
    separators=["\n\n", "\n", " ", ""]  # Try paragraphs, then lines, then words
)
chunks = splitter.split_text(raw_text)
print(f"Created {len(chunks)} chunks")
Step 2: Embed and Store
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

client = chromadb.PersistentClient(path="./db")  # Persists to disk

embedding_fn = OpenAIEmbeddingFunction(
    api_key="your-openai-key",
    model_name="text-embedding-3-small"  # $0.02 per 1M tokens
)

collection = client.create_collection(
    name="handbook",
    embedding_function=embedding_fn
)

# Add all chunks (ChromaDB handles embedding automatically)
collection.add(
    documents=chunks,
    ids=[f"chunk_{i}" for i in range(len(chunks))]
)
print(f"Stored {len(chunks)} chunks in vector DB")
Step 3: Retrieve and Generate
from openai import OpenAI

openai = OpenAI(api_key="your-key")

def ask_rag(question: str) -> str:
    # 1. Retrieve relevant chunks
    results = collection.query(
        query_texts=[question],
        n_results=5  # Get top 5 most similar chunks
    )
    context = "\n---\n".join(results["documents"][0])

    # 2. Generate an answer grounded in the retrieved context
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only using the context below. Say 'I don't know' if not covered.\n\nContext:\n" + context},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content

print(ask_rag("What is the vacation policy?"))
6. Chunking Strategies Explained
How you split your documents dramatically affects RAG quality. The wrong chunking strategy is the #1 cause of poor retrieval:
| Strategy | Chunk Size | Best For |
|---|---|---|
| Fixed-size | 200-500 tokens | Quick prototypes, consistent document types |
| Recursive Character | 500-1000 tokens | Most use cases; respects sentence/paragraph boundaries |
| Semantic | Variable | High-quality Q&A where each chunk = one idea |
| Document-based | Whole doc | Summarization tasks, not retrieval |
| Parent-child | Small child, large parent | Best precision + context balance |
Pro tip: Use the Parent-Child strategy for best results: search on small child chunks (200 tokens, high precision), but retrieve the full parent paragraph (800 tokens) for context. LangChain's ParentDocumentRetriever implements this automatically.
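A minimal sketch of the parent-child idea, with sizes in characters and a naive keyword match standing in for embedding search (LangChain's ParentDocumentRetriever does this against a real vector store):

```python
def make_parent_child_index(text: str, parent_size: int = 200, child_size: int = 50):
    """Split text into large parents, then each parent into small children."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    child_index = {}  # child chunk text -> index of its parent
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            child_index[parent[j:j + child_size]] = p_idx
    return parents, child_index

def retrieve(query: str, parents, child_index):
    """Match on small, precise children; return the full parent for context."""
    for child, p_idx in child_index.items():
        if query.lower() in child.lower():
            return parents[p_idx]
    return None

doc = "Employees accrue vacation monthly. " * 20
parents, child_index = make_parent_child_index(doc)
hit = retrieve("vacation", parents, child_index)
print(hit is not None)  # True: matched a small child, returned its large parent
```

The search precision comes from the small children; the answer quality comes from handing the LLM the larger parent.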
7. Pinecone vs ChromaDB vs Milvus: Deep Comparison
Pinecone: Best for Production SaaS
Pinecone is a fully managed serverless vector database. You don't touch any infrastructure. It scales automatically from 0 to billions of vectors, has built-in replication, and offers a generous free tier (1 index, 100k vectors). The downside: cost at scale. At 10M vectors, expect $70-200/month depending on dimensions and queries.
Use when: Building a customer-facing product, don't want to manage infrastructure, need 99.99% uptime SLA.
ChromaDB: Best for Development and Small Scale
ChromaDB runs embedded (in-process, no separate server) or as a local Docker container. It's the fastest way to get started: five lines of code and you're storing vectors. Persistent client mode saves to disk. It's not designed for >10M vectors in production, but it's perfect for prototypes, internal tools, and datasets under 1M vectors.
Use when: Prototyping, local development, datasets under 1M vectors, need zero infrastructure overhead.
Milvus: Best for Large-Scale Enterprise
Milvus is purpose-built for billion-scale vector search in distributed environments. It supports multiple index types (HNSW, IVF_FLAT, IVF_SQ8), GPU acceleration, and horizontal sharding. It's significantly more complex to operate but provides unmatched throughput (millions of QPS). Managed version is available via Zilliz Cloud.
Use when: Storing >50M vectors, need GPU-accelerated search, require enterprise features like RBAC and audit logs.
Troubleshooting RAG Quality
Problem: "Retriever finds irrelevant chunks"
Fix 1: Increase chunk overlap to maintain context across boundaries.
Fix 2: Make sure your distance metric matches your embedding model. For normalized embeddings (like OpenAI's), cosine similarity and dot product rank results identically, so you can use the cheaper dot product.
Fix 3: Try a better embedding model β text-embedding-3-large outperforms text-embedding-3-small by ~10% on retrieval benchmarks.
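The cosine/dot-product point in Fix 2 is easy to verify: on unit-length (normalized) vectors the two scores are mathematically identical, so dot product simply skips the redundant normalization:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# On unit vectors the two scores agree exactly
print(abs(cosine(a, b) - dot(a, b)) < 1e-12)  # True
```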
Problem: "LLM answer doesn't match retrieved text"
Fix: Check faithfulness using Ragas evaluation. Often the LLM is ignoring your context and answering from pre-training knowledge. Add explicit instruction: "You MUST only use information from the provided context. Do NOT use your training data."
Problem: "Context window fills up with too many chunks"
Fix: Reduce n_results from 10 to 3-5, or add a reranker (Cohere Rerank, BGE-Reranker) that scores retrieved chunks by relevance and keeps only the top 3. This dramatically reduces context length while improving quality.
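The rerank-then-trim step can be sketched as follows. The scoring function here is a deliberately crude stand-in; in practice you would call a cross-encoder such as Cohere Rerank or BGE-Reranker for the scores:

```python
def keyword_overlap_score(query: str, chunk: str) -> float:
    """Stand-in relevance score: fraction of query words present in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words)

def rerank(query: str, chunks: list[str], keep: int = 3) -> list[str]:
    """Score every retrieved chunk, keep only the top `keep` for the prompt."""
    scored = sorted(chunks, key=lambda c: keyword_overlap_score(query, c), reverse=True)
    return scored[:keep]

retrieved = [
    "Our return policy allows refunds within 30 days.",
    "The cafeteria opens at 8am.",
    "Refunds and returns are processed by the policy team.",
    "Parking passes are issued monthly.",
    "Policy updates are announced quarterly.",
]
top = rerank("return policy refunds", retrieved, keep=3)
print(len(top))  # 3 chunks survive; the cafeteria chunk is dropped
```

The shape is the same regardless of scorer: retrieve generously (10+ candidates), score precisely, and pass only the best few into the context window.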
Frequently Asked Questions
Do I always need a vector database? Can't I just search with SQL LIKE?
SQL LIKE and full-text search (Elasticsearch) find keyword matches. Vector search finds semantic matches. If a user asks "What's the return policy?", SQL will miss documents that say "refund procedure" or "exchange terms". Vector search understands these are the same concept. Use vector search for Q&A and SQL for structured filtering (dates, categories, prices).
How much does it cost to embed 100,000 pages?
Using text-embedding-3-small at $0.02 per 1M tokens: 100,000 pages × 500 tokens/page = 50M tokens = ~$1.00 total. Embedding is essentially free. The ongoing cost is vector storage and query costs at your vector DB provider, typically $5-70/month depending on scale.
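The arithmetic above in a few lines (the price and tokens-per-page figure are the ones quoted in this article; check OpenAI's current pricing page before budgeting):

```python
PAGES = 100_000
TOKENS_PER_PAGE = 500        # rough average for a text-heavy page
PRICE_PER_1M_TOKENS = 0.02   # text-embedding-3-small, USD (as quoted above)

total_tokens = PAGES * TOKENS_PER_PAGE                 # 50,000,000 tokens
cost = total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS  # price scales per 1M tokens
print(f"${cost:.2f}")  # $1.00
```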
Can I update documents after they're embedded?
Yes. Delete the old chunks by ID and add new ones. ChromaDB supports collection.delete(ids=[...]) and Pinecone supports index.delete(ids=[...]). For frequently changing documents, design your IDs to include a hash of the content so you can detect changes efficiently.
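A sketch of the content-hash ID scheme using only Python's standard library (the `source:position:hash` format is illustrative; any stable, collision-resistant hash works):

```python
import hashlib

def chunk_id(source: str, position: int, content: str) -> str:
    """ID embeds a content hash, so a chunk only needs re-embedding when its text changes."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    return f"{source}:{position}:{digest}"

old_id = chunk_id("handbook", 0, "Vacation: 15 days per year.")
new_id = chunk_id("handbook", 0, "Vacation: 20 days per year.")

# Same source and position, different hash: delete old_id, embed and add new_id
print(old_id != new_id)  # True
```

On re-ingestion, compare the freshly computed IDs against what's already stored: unchanged chunks are skipped, and only modified ones pay the delete-and-re-embed cost.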
Conclusion
Vector Databases act as the hippocampus (long-term memory) for your AI agent. Without one, your agent is amnesic: it can only reason about what fits in its context window. With a well-designed RAG pipeline, your agent can access virtually unlimited knowledge at query time, with sub-100ms retrieval latency, at pennies per thousand queries. Master this pattern and you've unlocked the most impactful technique in modern AI engineering.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.