Project: Chat with PDF
Dec 30, 2025 • 25 min read
"Chat with PDF" is the canonical starter project for AI Engineering: not because it's trivial, but because it forces you to learn all the hard problems at once: document parsing, semantic chunking, vector embedding, similarity search, context window management, and grounded generation. Every production RAG system (enterprise search, legal document analysis, medical record querying) is a more sophisticated version of this same architecture.
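Before the full stack, the core retrieval idea fits in a few lines: rank stored chunk vectors by cosine similarity against the query vector and keep the best matches. A stdlib-only sketch, with made-up 2-D vectors standing in for real embedding-model output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(chunk_vecs, key=lambda c: cosine(query_vec, chunk_vecs[c]), reverse=True)
    return ranked[:k]

chunks = {"refunds": [0.9, 0.1], "shipping": [0.1, 0.9], "pricing": [0.7, 0.4]}
print(top_k([1.0, 0.0], chunks))  # ['refunds', 'pricing']
```

Everything else in this project (parsing, chunking, persistence, prompting) exists to feed this ranking step good inputs.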
Tech Stack
FastAPI (backend API) • LlamaIndex (ingestion + retrieval) • ChromaDB (vector store) • OpenAI text-embedding-3-small + GPT-4o-mini • PyMuPDF (PDF parsing)
1. The Full Architecture
# INGESTION PIPELINE (runs once when PDF uploaded)
#
# ┌──────────────┐    ┌────────────────┐    ┌──────────────────┐    ┌──────────────┐
# │ PDF File     │───>│ PyMuPDF Parse  │───>│ Chunk (512 tok)  │───>│ OpenAI Embed │
# └──────────────┘    └────────────────┘    └──────────────────┘    └──────┬───────┘
#                                                                         │
#                                                           ┌─────────────▼────────┐
#                                                           │    ChromaDB Store    │
#                                                           │ {chunk_text, vector} │
#                                                           └──────────────────────┘
#
# QUERY PIPELINE (runs on every user question)
#
# ┌──────────────┐    ┌───────────────┐    ┌──────────────────┐    ┌──────────────┐
# │ User Question│───>│ OpenAI Embed  │───>│ ChromaDB Search  │───>│ Top-K Chunks │
# └──────────────┘    └───────────────┘    │   (cosine sim)   │    └──────┬───────┘
#                                          └──────────────────┘          │
# ┌─────────────────────────────────────────────────────────────┐        │
# │ GPT-4o-mini: "Answer this question using ONLY the context:" │<───────┘
# └─────────────────────────────────────────────────────────────┘

2. Backend: FastAPI + LlamaIndex Ingestion
pip install fastapi uvicorn llama-index llama-index-vector-stores-chroma \
llama-index-embeddings-openai python-multipart pymupdf chromadb
# main.py
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb
import os
import shutil
from pathlib import Path

app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

# Configure LlamaIndex (without an explicit llm, LlamaIndex falls back to its
# default model -- set GPT-4o-mini to match the architecture above)
Settings.llm = OpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small", api_key=os.getenv("OPENAI_API_KEY"))
Settings.node_parser = SentenceSplitter(
    chunk_size=512,    # 512 tokens per chunk -- sweet spot for most PDFs
    chunk_overlap=50,  # 50-token overlap prevents context loss at boundaries
)
# In-memory index store (keyed by document ID)
indices: dict[str, VectorStoreIndex] = {}
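A quick aside on what `chunk_size` and `chunk_overlap` actually do before we wire up the endpoints. This is a stdlib-only illustration (the `chunk_with_overlap` helper is mine, not LlamaIndex API): whitespace-split words stand in for real tokenizer tokens, and the real SentenceSplitter additionally avoids breaking mid-sentence.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking: each chunk repeats the last `overlap` tokens
    of the previous one, so facts at a boundary land whole in some chunk."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # advance 462 tokens per chunk at the defaults
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1200))
print(len(chunk_with_overlap(doc)))  # 3 chunks: starts at token 0, 462, 924
```

The overlap is why a sentence that straddles a chunk boundary is still retrievable: it appears intact at the start of the next chunk.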
@app.post("/upload/{doc_id}")
async def upload_pdf(doc_id: str, file: UploadFile = File(...)):
    """Upload and index a PDF document."""
    if not file.filename.endswith('.pdf'):
        raise HTTPException(400, "Only PDF files are supported")

    # Save PDF temporarily
    upload_dir = Path(f"uploads/{doc_id}")
    upload_dir.mkdir(parents=True, exist_ok=True)
    file_path = upload_dir / file.filename
    with open(file_path, "wb") as f:
        f.write(await file.read())

    try:
        # Load and parse PDF (LlamaIndex handles multi-page PDFs automatically)
        documents = SimpleDirectoryReader(str(upload_dir)).load_data()

        # Create persistent ChromaDB collection per document
        chroma_client = chromadb.PersistentClient(path=f"./chroma/{doc_id}")
        collection = chroma_client.get_or_create_collection(f"doc_{doc_id}")
        vector_store = ChromaVectorStore(chroma_collection=collection)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)

        # Build vector index (embeds all chunks, stores in ChromaDB)
        index = VectorStoreIndex.from_documents(
            documents,
            storage_context=storage_context,
            show_progress=True,
        )

        # Cache in memory for fast repeated queries
        indices[doc_id] = index
        return {
            "status": "indexed",
            "doc_id": doc_id,
            "num_pages": len(documents),
            "filename": file.filename,
        }
    finally:
        shutil.rmtree(upload_dir, ignore_errors=True)
@app.get("/query/{doc_id}")
async def query_document(doc_id: str, q: str, top_k: int = 4):
    """Query an indexed document with a natural language question."""
    if doc_id not in indices:
        # Try loading from disk if not in memory
        try:
            chroma_client = chromadb.PersistentClient(path=f"./chroma/{doc_id}")
            collection = chroma_client.get_collection(f"doc_{doc_id}")
            vector_store = ChromaVectorStore(chroma_collection=collection)
            indices[doc_id] = VectorStoreIndex.from_vector_store(vector_store)
        except Exception:
            raise HTTPException(404, f"Document {doc_id} not found. Please upload first.")

    # Configure retrieval: top_k best matching chunks
    query_engine = indices[doc_id].as_query_engine(
        similarity_top_k=top_k,
        response_mode="compact",  # Synthesizes across multiple chunks
        streaming=False,
    )
    response = query_engine.query(q)
    return {
        "answer": str(response),
        "sources": [
            {
                "text": node.node.get_content()[:400],
                "score": round(node.score, 3) if node.score is not None else None,
                "page": node.node.metadata.get("page_label", "unknown"),
            }
            for node in response.source_nodes
        ],
    }

3. Retrieval Quality Improvements
| Technique | When to Use | Implementation |
|---|---|---|
| Chunk size tuning | Default 512 fails: too short misses context, too long confuses the LLM | Test: 256, 512, 1024 tokens; measure answer quality |
| Hybrid BM25+Vector | Users search for exact keywords (SKU-123, Section 4.2) | LlamaIndex BM25Retriever + VectorRetriever; merge results |
| Parent-Child chunking | Questions need broader context than single chunk provides | Store small (128 tok) child chunks, retrieve parent (512 tok) for context |
| Reranking | Top vector results are similar but off-topic (cosine sim misses relevance) | Pass top-20 to cross-encoder (Cohere Rerank, bge-reranker-v2) |
| Metadata filtering | PDFs with sections, chapters, or date ranges | Store page_number, section, date in metadata; filter before vector search |
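To make the hybrid row concrete, here is one common way to merge BM25 and vector result lists: Reciprocal Rank Fusion (RRF). This is a minimal sketch, not LlamaIndex API; the chunk ids and `k=60` smoothing constant are illustrative.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank + 1) per item,
    so items ranked highly by BOTH retrievers float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["c3", "c1", "c7"]    # exact-keyword matches first
vector = ["c1", "c5", "c3"]  # semantic matches first
print(rrf_merge([bm25, vector]))  # ['c1', 'c3', 'c5', 'c7']
```

Chunks appearing in both lists (`c1`, `c3`) outrank chunks that only one retriever found, which is exactly the behavior you want for "SKU-123"-style queries.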
Frequently Asked Questions
How do I handle multi-document queries ("Compare Section 3 of Doc A with Doc B")?
Use a multi-index retriever with a router. Create separate LlamaIndex indices per document (as shown above), then use RouterQueryEngine with summary descriptions of each document. The router LLM reads the question and selects which index(es) to query. For comparative questions, enable multi_select=True on the router: it queries both indices and synthesizes across results. Store document metadata (filename, upload date, tags) separately and inject it into the context: "From Financial_Report_Q4.pdf (uploaded Jan 15, page 23): ..." This attribution is critical; users need to know which document the answer came from.
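The routing step itself can be sketched in a few lines. In the real RouterQueryEngine an LLM reads each index's summary description and picks; in this toy stand-in (the `route` function and summaries are entirely hypothetical) a keyword-overlap score plays the router's role, just to show the selection mechanics:

```python
def route(question: str, summaries: dict[str, str], multi_select: bool = False) -> list[str]:
    """Pick which index(es) to query by scoring each document summary
    against the question. A real router uses an LLM for this scoring."""
    q_words = set(question.lower().split())
    scored = {doc: len(q_words & set(desc.lower().split())) for doc, desc in summaries.items()}
    ranked = sorted(scored, key=scored.get, reverse=True)
    return ranked[:2] if multi_select else ranked[:1]  # multi_select: query several indices

summaries = {
    "doc_a": "quarterly financial report revenue expenses",
    "doc_b": "employee handbook policies vacation",
}
print(route("compare revenue in the financial report", summaries, multi_select=True))
```

The answers from each selected index are then synthesized by the LLM into one comparative response.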
Vector search returns irrelevant chunks β what's wrong?
Three common causes: (1) Chunk boundaries mid-sentence: use SentenceSplitter (not TokenTextSplitter) to break at sentence boundaries; semantic coherence within chunks dramatically improves embedding quality. (2) Missing context in chunks: add "Contextual Retrieval": prepend each chunk with a document-level summary: "From the Q4 earnings report: {chunk_text}". This embeds global context into each local chunk's embedding. (3) Wrong embedding model: text-embedding-3-small is the minimum viable choice. For technical documents, consider domain-specific models (BGE-large, E5-mistral) which outperform OpenAI embeddings on specialized corpora by 5-15% on BEIR benchmarks.
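Cause (2) is the easiest to implement: transform each chunk before it is embedded. A minimal sketch (the `contextualize` helper is illustrative; in a real pipeline the summary would come from a one-time LLM call per document):

```python
def contextualize(chunks: list[str], summary: str) -> list[str]:
    """Prepend a document-level summary to each chunk, so the chunk's
    embedding carries global context, not just its local sentences."""
    return [f"From {summary}: {chunk}" for chunk in chunks]

chunks = ["Revenue grew 12% QoQ.", "Headcount flat at 85."]
print(contextualize(chunks, "the Q4 earnings report")[0])
# From the Q4 earnings report: Revenue grew 12% QoQ.
```

You embed the contextualized strings but can still store and display the original chunk text to the user.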
"Retrieval quality determines answer quality. The LLM can only work with what you give it."
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.