PrivateGPT: Offline RAG
Dec 30, 2025 • 20 min read
Enterprises want "Chat with my documents" — but healthcare providers can't send patient records to OpenAI. Law firms can't upload privileged client communications to any cloud API. Financial institutions face GDPR and SOC 2 restrictions that make cloud LLM use complex. The solution: a fully offline RAG pipeline where every component runs locally. Documents never leave the machine. The entire ingest-embed-retrieve-generate loop runs on-premise, with surprisingly good quality.
1. Why Offline RAG Is Now Viable
Two years ago, local language models were toys. Today:
- Llama 3.1 8B running via Ollama on a MacBook M3 Pro achieves quality comparable to GPT-3.5 on document Q&A tasks
- nomic-embed-text (local embedding model) produces embeddings competitive with OpenAI's text-embedding-ada-002 on most benchmarks
- ChromaDB runs entirely in-process with SQLite, no server required
- Modern M-series Macs can run 7-8B parameter models at 30-50 tokens/second — fast enough for production use
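Those throughput numbers map directly onto user-facing latency. A quick back-of-the-envelope sketch (the 300-token answer length is an assumption for illustration):

```python
# Back-of-the-envelope response latency at local decode speeds.
# Assumption: a typical RAG answer runs ~300 tokens; real lengths vary.
def answer_latency_seconds(answer_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate an answer at a given decode speed."""
    return answer_tokens / tokens_per_second

print(f"{answer_latency_seconds(300, 30):.1f}s at 30 tok/s")  # 10.0s at 30 tok/s
print(f"{answer_latency_seconds(300, 50):.1f}s at 50 tok/s")  # 6.0s at 50 tok/s
```

So even at the low end of that range, answers land in about ten seconds — slower than a cloud API, but well within interactive use.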
2. The Local Stack
| Component | Tool | Cloud Replacement |
|---|---|---|
| Document parsing | Unstructured.io (local) | AWS Textract, Azure Form Recognizer |
| Text embeddings | nomic-embed-text via Ollama | OpenAI text-embedding-3-small |
| Vector storage | ChromaDB (SQLite mode) | Pinecone, Weaviate, pgvector |
| LLM generation | Llama 3.1 8B via Ollama | GPT-4o, Claude 3.5 Sonnet |
| Orchestration | LangChain (any environment) | Same — LangChain works offline |
3. Step 1: Install Ollama and Pull Models
# macOS: Install Ollama
brew install ollama
# Linux:
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the models we'll use
ollama pull llama3.1:8b # 4.7 GB — local LLM
ollama pull nomic-embed-text # 274 MB — local embedding model
# Verify — both should appear in the list
ollama list
# Quick test
ollama run llama3.1:8b "Hello, are you running locally?"
# → Response in ~2s on M3 Pro
# Ollama exposes an OpenAI-compatible API at http://localhost:11434
# You can use the standard OpenAI Python client pointed at localhost!
from openai import OpenAI
local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = local_client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is RAG?"}],
)
print(response.choices[0].message.content)  # Runs 100% locally

4. Step 2: Document Ingestion Pipeline
pip install langchain langchain-community chromadb unstructured[pdf] pypdf
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from pathlib import Path
# Initialize LOCAL embedding model (runs via Ollama, no API key needed)
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)
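Retrieval will later rank chunks by comparing the query's embedding against each stored chunk embedding. A toy sketch of that comparison with hand-made 3-dimensional vectors (real nomic-embed-text vectors have hundreds of dimensions, and Chroma actually defaults to L2 distance, but the intuition is the same):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: ~1.0 = same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query = [1.0, 0.2, 0.0]
chunk_on_topic = [0.9, 0.3, 0.1]   # similar direction -> high score
chunk_off_topic = [0.0, 0.1, 1.0]  # different direction -> low score
print(round(cosine_similarity(query, chunk_on_topic), 3))
print(round(cosine_similarity(query, chunk_off_topic), 3))
```

The on-topic chunk scores near 1.0 and the off-topic one near 0.0; the vector store does exactly this ranking, just over thousands of stored chunks at once.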
def ingest_documents(docs_directory: str, db_path: str = "./chroma_db"):
    """
    Ingest all PDFs from a directory into a local ChromaDB.
    Everything runs locally — no data leaves the machine.
    """
    # Load all PDFs from the directory
    loader = DirectoryLoader(
        docs_directory,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True,
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} document pages")

    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " "],  # Try paragraph breaks first
    )
    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")

    # Embed and store locally in ChromaDB
    vectordb = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=db_path,  # SQLite file — stays local
        collection_name="private_docs",
    )
    vectordb.persist()  # Chroma 0.4+ persists automatically; kept for older versions
    print(f"Stored {len(chunks)} embeddings in {db_path}")
    return vectordb
# Run ingestion
vectordb = ingest_documents("./confidential_documents/")
# Progress: Loaded 847 pages → Created 2,341 chunks → Stored locally

5. Step 3: RAG Query Pipeline
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Initialize local LLM
llm = Ollama(
    model="llama3.1:8b",
    base_url="http://localhost:11434",
    temperature=0.1,  # Low temperature for factual Q&A
    num_ctx=4096,     # Context window size
    num_predict=512,  # Max response length
)
# Load existing ChromaDB (from previous ingestion)
vectordb = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="private_docs",
)
# Custom prompt template — tells Llama to be grounded in retrieved context
RAG_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a precise document assistant. Answer ONLY based on the provided context.
If the answer is not in the context, say "I cannot find this information in the provided documents."

Context:
{context}

Question: {question}

Answer:""",
)
# Create the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
    chain_type="stuff",  # "stuff" puts all retrieved chunks into one prompt
    chain_type_kwargs={"prompt": RAG_PROMPT},
    return_source_documents=True,
)
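One sanity check worth doing: with k=5 chunks of 512 characters each plus a 512-token response budget, does everything fit in num_ctx=4096? A rough estimate using the common ~4 characters-per-token heuristic (an approximation; real tokenization varies):

```python
# Rough context-window budget for the RAG chain above.
CHARS_PER_TOKEN = 4    # heuristic; real tokenizers vary
retrieved = 5          # retriever k
chunk_chars = 512      # splitter chunk_size
prompt_overhead = 100  # template + question, rough guess
response_budget = 512  # num_predict

context_tokens = retrieved * chunk_chars // CHARS_PER_TOKEN
total_tokens = context_tokens + prompt_overhead + response_budget
print(f"~{total_tokens} of 4096 tokens used")  # ~1252 of 4096 tokens used
```

There is plenty of headroom, so k could be raised well above 5 before the context window becomes the constraint.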
def query_private_docs(question: str) -> dict:
    result = qa_chain.invoke({"query": question})
    return {
        "answer": result["result"],
        "sources": [
            {"page": doc.metadata.get("page", "?"), "file": doc.metadata.get("source", "?")}
            for doc in result["source_documents"]
        ],
    }
# Test it
answer = query_private_docs("What is our data retention policy for patient records?")
print(answer["answer"])
print("Sources:", answer["sources"])

6. Performance Optimization
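Quantization trades model size (and memory bandwidth) for quality. A rough sketch of file sizes for an 8B-parameter model at different bit widths — the bits-per-weight figures are approximations, since real GGUF quantizations like Q4_K_M mix bit widths per layer:

```python
# Approximate model file size: parameters * bits-per-weight / 8 bytes.
# Real GGUF files add metadata and mix quant types, so actual sizes
# differ somewhat from these back-of-envelope numbers.
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{model_size_gb(8, bits):.1f} GB")
# Q4_K_M lands near the 4.7 GB pull size quoted earlier.
```

This is why Q4 quantizations are the default for CPU inference: roughly a quarter of the memory traffic of FP16 for a modest quality loss.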
# For production on a dedicated server with a GPU:
# Ollama offloads model layers to the GPU automatically when one is
# detected. If the full model doesn't fit in VRAM, the layer count can
# be tuned per request via the native API's "num_gpu" option,
# e.g. {"options": {"num_gpu": 32}}
ollama serve
# Model quantization for CPU-only environments (no GPU):
# Q4_K_M: Good quality/speed tradeoff for CPU (~8 tok/sec on modern CPU)
# Q8_0: Higher quality, slower (~4 tok/sec)
ollama pull llama3.1:8b-instruct-q4_K_M # Recommended for CPU
Performance benchmarks:

| Hardware | Model | Speed |
|---|---|---|
| MacBook M3 Pro | Llama 3.1 8B | 45 tok/sec (Metal GPU) |
| Ryzen 7 5800X (CPU) | Llama 3.1 8B Q4 | 8 tok/sec |
| RTX 4090 (GPU) | Llama 3.1 70B | 35 tok/sec |
| NVIDIA A100 | Llama 3.1 70B | 80 tok/sec |

Frequently Asked Questions
Is the quality comparable to GPT-4o for document Q&A?
For factual document retrieval (legal documents, HR policies, technical documentation), Llama 3.1 8B achieves 85-90% of GPT-4o's answer quality on structured Q&A tasks. The gap is larger for complex multi-hop reasoning or summarization of very long documents. For most enterprise "chat with your docs" use cases, the quality is sufficient and the compliance benefit outweighs the small quality delta.
How do I handle very large document collections (10,000+ PDFs)?
ChromaDB's SQLite mode handles millions of vectors on a single machine. For very large collections, switch to ChromaDB's client-server mode (runs as a separate process) or use pgvector in a local PostgreSQL instance. The ingestion pipeline will take hours for large collections — run it as a background job with progress tracking.
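For the background-job approach, a minimal batching helper is one way to add checkpoints and progress reporting — a sketch with hypothetical names (the per-batch ingest call is assumed, modeled on the ingest_documents function above):

```python
from pathlib import Path
from typing import Iterator

def batched_pdf_paths(docs_dir: str, batch_size: int = 100) -> Iterator[list[Path]]:
    """Yield PDF paths in fixed-size batches so a large ingestion job
    can checkpoint and report progress between batches."""
    paths = sorted(Path(docs_dir).rglob("*.pdf"))
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]

# Usage sketch: ingest each batch, logging progress as you go.
# for i, batch in enumerate(batched_pdf_paths("./confidential_documents/")):
#     print(f"Batch {i}: {len(batch)} files")
#     ingest_batch(batch)  # hypothetical per-batch ingest function
```

Sorting the paths makes the batch order deterministic, so a crashed job can resume from the last completed batch index.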
Conclusion
The offline RAG stack (Ollama + ChromaDB + LangChain) delivers enterprise-grade document intelligence with zero external API calls. Every document, embedding, and generated response stays within your infrastructure. For healthcare, legal, financial, and government use cases where data sovereignty is non-negotiable, this architecture enables AI-powered document Q&A while fully satisfying compliance requirements.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.