Project: Chat with PDF
Dec 30, 2025 • 25 min read
"Chat with PDF" is the canonical starter project for AI Engineering: not because it's trivial, but because it forces you to learn all the hard problems at once: document parsing, semantic chunking, vector embedding, similarity search, context window management, and grounded generation. Every production RAG system (enterprise search, legal document analysis, medical record querying) is a more sophisticated version of this same architecture.
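Before the full stack, the core retrieval idea fits in a few lines: rank stored chunk vectors by cosine similarity against the query vector and keep the best matches. A stdlib-only sketch, with made-up 2-D vectors standing in for real embedding-model output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(chunk_vecs, key=lambda c: cosine(query_vec, chunk_vecs[c]), reverse=True)
    return ranked[:k]

chunks = {"refunds": [0.9, 0.1], "shipping": [0.1, 0.9], "pricing": [0.7, 0.4]}
print(top_k([1.0, 0.0], chunks))  # ['refunds', 'pricing']
```

Everything else in this project (parsing, chunking, persistence, prompting) exists to feed this ranking step good inputs.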
Tech Stack
FastAPI (backend API) • LlamaIndex (ingestion + retrieval) • ChromaDB (vector store) • OpenAI text-embedding-3-small + GPT-4o-mini • PyMuPDF (PDF parsing)
1. The Full Architecture
# INGESTION PIPELINE (runs once when PDF uploaded)
#
# ┌──────────────┐    ┌────────────────┐    ┌──────────────────┐    ┌──────────────┐
# │ PDF File     │───>│ PyMuPDF Parse  │───>│ Chunk (512 tok)  │───>│ OpenAI Embed │
# └──────────────┘    └────────────────┘    └──────────────────┘    └──────┬───────┘
#                                                                         │
#                                                           ┌─────────────▼────────┐
#                                                           │    ChromaDB Store    │
#                                                           │ {chunk_text, vector} │
#                                                           └──────────────────────┘
#
# QUERY PIPELINE (runs on every user question)
#
# ┌──────────────┐    ┌───────────────┐    ┌──────────────────┐    ┌──────────────┐
# │ User Question│───>│ OpenAI Embed  │───>│ ChromaDB Search  │───>│ Top-K Chunks │
# └──────────────┘    └───────────────┘    │   (cosine sim)   │    └──────┬───────┘
#                                          └──────────────────┘          │
# ┌─────────────────────────────────────────────────────────────┐        │
# │ GPT-4o-mini: "Answer this question using ONLY the context:" │<───────┘
# └─────────────────────────────────────────────────────────────┘

2. Backend: FastAPI + LlamaIndex Ingestion
pip install fastapi uvicorn llama-index llama-index-vector-stores-chroma \
llama-index-embeddings-openai python-multipart pymupdf chromadb
# main.py
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb
import os
import shutil
from pathlib import Path

app = FastAPI()
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

# Configure LlamaIndex (without an explicit llm, LlamaIndex falls back to its
# default model -- set GPT-4o-mini to match the architecture above)
Settings.llm = OpenAI(model="gpt-4o-mini", api_key=os.getenv("OPENAI_API_KEY"))
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small", api_key=os.getenv("OPENAI_API_KEY"))
Settings.node_parser = SentenceSplitter(
    chunk_size=512,    # 512 tokens per chunk -- sweet spot for most PDFs
    chunk_overlap=50,  # 50-token overlap prevents context loss at boundaries
)
# In-memory index store (keyed by document ID)
indices: dict[str, VectorStoreIndex] = {}
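A quick aside on what `chunk_size` and `chunk_overlap` actually do before we wire up the endpoints. This is a stdlib-only illustration (the `chunk_with_overlap` helper is mine, not LlamaIndex API): whitespace-split words stand in for real tokenizer tokens, and the real SentenceSplitter additionally avoids breaking mid-sentence.

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking: each chunk repeats the last `overlap` tokens
    of the previous one, so facts at a boundary land whole in some chunk."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap  # advance 462 tokens per chunk at the defaults
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

doc = " ".join(f"tok{i}" for i in range(1200))
print(len(chunk_with_overlap(doc)))  # 3 chunks: starts at token 0, 462, 924
```

The overlap is why a sentence that straddles a chunk boundary is still retrievable: it appears intact at the start of the next chunk.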
@app.post("/upload/{doc_id}")
async def upload_pdf(doc_id: str, file: UploadFile = File(...)):
    """Upload and index a PDF document."""
    if not file.filename.endswith('.pdf'):
        raise HTTPException(400, "Only PDF files are supported")

    # Save PDF temporarily
    upload_dir = Path(f"uploads/{doc_id}")
    upload_dir.mkdir(parents=True, exist_ok=True)
    file_path = upload_dir / file.filename
    with open(file_path, "wb") as f:
        f.write(await file.read())

    try:
        # Load and parse PDF (LlamaIndex handles multi-page PDFs automatically)
        documents = SimpleDirectoryReader(str(upload_dir)).load_data()

        # Create persistent ChromaDB collection per document
        chroma_client = chromadb.PersistentClient(path=f"./chroma/{doc_id}")
        collection = chroma_client.get_or_create_collection(f"doc_{doc_id}")
        vector_store = ChromaVectorStore(chroma_collection=collection)
        storage_context = StorageContext.from_defaults(vector_store=vector_store)

        # Build vector index (embeds all chunks, stores in ChromaDB)
        index = VectorStoreIndex.from_documents(
            documents,
            storage_context=storage_context,
            show_progress=True,
        )

        # Cache in memory for fast repeated queries
        indices[doc_id] = index
        return {
            "status": "indexed",
            "doc_id": doc_id,
            "num_pages": len(documents),
            "filename": file.filename,
        }
    finally:
        shutil.rmtree(upload_dir, ignore_errors=True)
@app.get("/query/{doc_id}")
async def query_document(doc_id: str, q: str, top_k: int = 4):
    """Query an indexed document with a natural language question."""
    if doc_id not in indices:
        # Try loading from disk if not in memory
        try:
            chroma_client = chromadb.PersistentClient(path=f"./chroma/{doc_id}")
            collection = chroma_client.get_collection(f"doc_{doc_id}")
            vector_store = ChromaVectorStore(chroma_collection=collection)
            indices[doc_id] = VectorStoreIndex.from_vector_store(vector_store)
        except Exception:
            raise HTTPException(404, f"Document {doc_id} not found. Please upload first.")

    # Configure retrieval: top_k best matching chunks
    query_engine = indices[doc_id].as_query_engine(
        similarity_top_k=top_k,
        response_mode="compact",  # Synthesizes across multiple chunks
        streaming=False,
    )
    response = query_engine.query(q)
    return {
        "answer": str(response),
        "sources": [
            {
                "text": node.node.get_content()[:400],
                "score": round(node.score, 3) if node.score is not None else None,
                "page": node.node.metadata.get("page_label", "unknown"),
            }
            for node in response.source_nodes
        ],
    }

3. Retrieval Quality Improvements
| Technique | When to Use | Implementation |
|---|---|---|
| Chunk size tuning | Default 512 fails: too short misses context, too long confuses the LLM | Test: 256, 512, 1024 tokens; measure answer quality |
| Hybrid BM25+Vector | Users search for exact keywords (SKU-123, Section 4.2) | LlamaIndex BM25Retriever + VectorRetriever; merge results |
| Parent-Child chunking | Questions need broader context than single chunk provides | Store small (128 tok) child chunks, retrieve parent (512 tok) for context |
| Reranking | Top vector results are similar but off-topic (cosine sim misses relevance) | Pass top-20 to cross-encoder (Cohere Rerank, bge-reranker-v2) |
| Metadata filtering | PDFs with sections, chapters, or date ranges | Store page_number, section, date in metadata; filter before vector search |
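To make the hybrid row concrete, here is one common way to merge BM25 and vector result lists: Reciprocal Rank Fusion (RRF). This is a minimal sketch, not LlamaIndex API; the chunk ids and `k=60` smoothing constant are illustrative.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank + 1) per item,
    so items ranked highly by BOTH retrievers float to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["c3", "c1", "c7"]    # exact-keyword matches first
vector = ["c1", "c5", "c3"]  # semantic matches first
print(rrf_merge([bm25, vector]))  # ['c1', 'c3', 'c5', 'c7']
```

Chunks appearing in both lists (`c1`, `c3`) outrank chunks that only one retriever found, which is exactly the behavior you want for "SKU-123"-style queries.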
Frequently Asked Questions
How do I handle multi-document queries ("Compare Section 3 of Doc A with Doc B")?
Use a multi-index retriever with a router. Create separate LlamaIndex indices per document (as shown above), then use RouterQueryEngine with summary descriptions of each document. The router LLM reads the question and selects which index(es) to query. For comparative questions, enable multi_select=True on the router: it queries both indices and synthesizes across results. Store document metadata (filename, upload date, tags) separately and inject it into the context: "From Financial_Report_Q4.pdf (uploaded Jan 15, page 23): ..." This attribution is critical; users need to know which document the answer came from.
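The routing step itself can be sketched in a few lines. In the real RouterQueryEngine an LLM reads each index's summary description and picks; in this toy stand-in (the `route` function and summaries are entirely hypothetical) a keyword-overlap score plays the router's role, just to show the selection mechanics:

```python
def route(question: str, summaries: dict[str, str], multi_select: bool = False) -> list[str]:
    """Pick which index(es) to query by scoring each document summary
    against the question. A real router uses an LLM for this scoring."""
    q_words = set(question.lower().split())
    scored = {doc: len(q_words & set(desc.lower().split())) for doc, desc in summaries.items()}
    ranked = sorted(scored, key=scored.get, reverse=True)
    return ranked[:2] if multi_select else ranked[:1]  # multi_select: query several indices

summaries = {
    "doc_a": "quarterly financial report revenue expenses",
    "doc_b": "employee handbook policies vacation",
}
print(route("compare revenue in the financial report", summaries, multi_select=True))
```

The answers from each selected index are then synthesized by the LLM into one comparative response.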
Vector search returns irrelevant chunks β what's wrong?
Three common causes: (1) Chunk boundaries mid-sentence: use SentenceSplitter (not TokenTextSplitter) to break at sentence boundaries; semantic coherence within chunks dramatically improves embedding quality. (2) Missing context in chunks: add "Contextual Retrieval": prepend each chunk with a document-level summary: "From the Q4 earnings report: {chunk_text}". This embeds global context into each local chunk's embedding. (3) Wrong embedding model: text-embedding-3-small is the minimum viable choice. For technical documents, consider domain-specific models (BGE-large, E5-mistral) which outperform OpenAI embeddings on specialized corpora by 5-15% on BEIR benchmarks.
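Cause (2) is the easiest to implement: transform each chunk before it is embedded. A minimal sketch (the `contextualize` helper is illustrative; in a real pipeline the summary would come from a one-time LLM call per document):

```python
def contextualize(chunks: list[str], summary: str) -> list[str]:
    """Prepend a document-level summary to each chunk, so the chunk's
    embedding carries global context, not just its local sentences."""
    return [f"From {summary}: {chunk}" for chunk in chunks]

chunks = ["Revenue grew 12% QoQ.", "Headcount flat at 85."]
print(contextualize(chunks, "the Q4 earnings report")[0])
# From the Q4 earnings report: Revenue grew 12% QoQ.
```

You embed the contextualized strings but can still store and display the original chunk text to the user.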
"Retrieval quality determines answer quality. The LLM can only work with what you give it."
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning: no fluff, just working code and real-world context.