
PrivateGPT: Offline RAG

Dec 30, 2025 • 20 min read

Enterprises want "Chat with my documents" — but healthcare providers can't send patient records to OpenAI. Law firms can't upload privileged client communications to any cloud API. Financial institutions face GDPR and SOC 2 restrictions that make cloud LLM use complex. The solution: a fully offline RAG pipeline where every component runs locally. Documents never leave the machine. The entire ingest-embed-retrieve-generate loop runs on-premise, with surprisingly good quality.

1. Why Offline RAG Is Now Viable

Two years ago, local language models were toys. Today:

  • Llama 3.1 8B running via Ollama on a MacBook M3 Pro achieves quality comparable to GPT-3.5 on document Q&A tasks
  • nomic-embed-text (local embedding model) produces embeddings competitive with OpenAI's text-embedding-ada-002 on most benchmarks
  • ChromaDB runs entirely in-process with SQLite, no server required
  • Modern M-series Macs can run 7-8B parameter models at 30-50 tokens/second — fast enough for production use

2. The Local Stack

Component        | Tool                         | Cloud Replacement
Document parsing | Unstructured.io (local)      | AWS Textract, Azure Form Recognizer
Text embeddings  | nomic-embed-text via Ollama  | OpenAI text-embedding-3-small
Vector storage   | ChromaDB (SQLite mode)       | Pinecone, Weaviate, pgvector
LLM generation   | Llama 3.1 8B via Ollama      | GPT-4o, Claude 3.5 Sonnet
Orchestration    | LangChain (works offline)    | Same library, cloud or local

3. Step 1: Install Ollama and Pull Models

# macOS: Install Ollama
brew install ollama

# Linux: 
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the models we'll use
ollama pull llama3.1:8b               # 4.7 GB — local LLM
ollama pull nomic-embed-text          # 274 MB — local embedding model

# Verify — both should appear in the list
ollama list

# Quick test
ollama run llama3.1:8b "Hello, are you running locally?"
# → Response in ~2s on M3 Pro

# Ollama exposes an OpenAI-compatible API at http://localhost:11434
# You can use the standard OpenAI Python client pointed at localhost!
from openai import OpenAI
local_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = local_client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is RAG?"}]
)
print(response.choices[0].message.content)  # Runs 100% locally
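Every snippet in this post assumes the Ollama server is up. A small reachability check can fail fast with a clear message instead of a confusing connection error deep inside LangChain. This is a hypothetical helper (the `ollama_available` name is mine), using only the standard library:

```python
import json
import urllib.request


def ollama_available(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if a local Ollama server answers on its /api/tags endpoint."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=timeout) as resp:
            # /api/tags lists locally pulled models as {"models": [...]}
            return "models" in json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, or non-JSON body: treat as "not running"
        return False
```

Call it once at startup and tell the user to run `ollama serve` (or `ollama pull` the missing model) before the pipeline starts throwing retries.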

4. Step 2: Document Ingestion Pipeline

pip install langchain langchain-community chromadb "unstructured[pdf]" pypdf  # quote the extra for zsh

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from pathlib import Path

# Initialize LOCAL embedding model (runs via Ollama, no API key needed)
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434",
)

def ingest_documents(docs_directory: str, db_path: str = "./chroma_db"):
    """
    Ingest all PDFs from a directory into a local ChromaDB.
    Everything runs locally — no data leaves the machine.
    """
    # Load all PDFs from directory
    loader = DirectoryLoader(
        docs_directory,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True,
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} document pages")
    
    # Split into chunks
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " ", ""],  # Paragraph breaks first; "" is the character-level last resort
    )
    chunks = splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")
    
    # Embed and store locally in ChromaDB
    vectordb = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=db_path,  # SQLite file — stays local
        collection_name="private_docs",
    )
    vectordb.persist()  # No-op on recent Chroma releases, which persist automatically
    print(f"Stored {len(chunks)} embeddings in {db_path}")
    return vectordb

# Run ingestion
vectordb = ingest_documents("./confidential_documents/")
# Progress: Loaded 847 pages → Created 2,341 chunks → Stored locally
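The chunk_size/chunk_overlap arithmetic above is easy to misjudge. As a rough mental model (ignoring the separator-aware splitting LangChain actually does), a fixed-window chunker with overlap looks like this; a hypothetical sketch, not LangChain's implementation:

```python
def fixed_window_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap. LangChain's splitter is smarter:
    it prefers paragraph/sentence boundaries before falling back to smaller units."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each window advances chunk_size - overlap chars
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = fixed_window_chunks("x" * 1200, chunk_size=512, overlap=50)
# Each step advances 462 chars, so 1200 chars yield 3 chunks
```

The overlap matters for retrieval: a sentence that straddles a chunk boundary still appears whole in at least one chunk, so its embedding stays meaningful.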

5. Step 3: RAG Query Pipeline

from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize local LLM
llm = Ollama(
    model="llama3.1:8b",
    base_url="http://localhost:11434",
    temperature=0.1,           # Low temperature for factual Q&A
    num_ctx=4096,              # Context window size
    num_predict=512,           # Max response length
)

# Load existing ChromaDB (from previous ingestion)
vectordb = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="private_docs",
)

# Custom prompt template — tells Llama to be grounded in retrieved context
RAG_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a precise document assistant. Answer ONLY based on the provided context.
If the answer is not in the context, say "I cannot find this information in the provided documents."

Context:
{context}

Question: {question}

Answer:"""
)

# Create the RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 5}),
    chain_type="stuff",
    chain_type_kwargs={"prompt": RAG_PROMPT},
    return_source_documents=True,
)

def query_private_docs(question: str) -> dict:
    result = qa_chain.invoke({"query": question})
    return {
        "answer": result["result"],
        "sources": [
            {"page": doc.metadata.get("page", "?"), "file": doc.metadata.get("source", "?")}
            for doc in result["source_documents"]
        ]
    }

# Test it
answer = query_private_docs("What is our data retention policy for patient records?")
print(answer["answer"])
print("Sources:", answer["sources"])
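Because query_private_docs returns one source entry per retrieved chunk, the same page of the same file can appear several times when multiple chunks from it are retrieved. A small formatting helper (hypothetical; the name and output format are mine) dedupes and produces a readable citation line:

```python
def format_sources(sources: list[dict]) -> str:
    """Collapse duplicate (file, page) pairs while preserving retrieval order."""
    seen = []
    for src in sources:
        key = (src.get("file", "?"), src.get("page", "?"))
        if key not in seen:
            seen.append(key)
    return "; ".join(f"{file} (p. {page})" for file, page in seen)


example = [
    {"file": "retention_policy.pdf", "page": 3},
    {"file": "retention_policy.pdf", "page": 3},   # duplicate chunk, same page
    {"file": "hipaa_handbook.pdf", "page": 12},
]
print(format_sources(example))
# → retention_policy.pdf (p. 3); hipaa_handbook.pdf (p. 12)
```

For compliance-sensitive deployments, surfacing these citations in the UI is what lets users verify an answer against the underlying document rather than trusting the model.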

6. Performance Optimization

# For production on a dedicated server with a GPU:
# Ollama detects GPUs automatically and offloads as many layers as fit in VRAM.
# To cap the offload manually, set the num_gpu option per request (adjust for your VRAM):
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "hi", "options": {"num_gpu": 32}}'

# Model quantization for CPU-only environments (no GPU):
# Q4_K_M: Good quality/speed tradeoff for CPU (~8 tok/sec on modern CPU)
# Q8_0:   Higher quality, slower (~4 tok/sec)
ollama pull llama3.1:8b-instruct-q4_K_M  # Recommended for CPU

# Performance benchmark:
# Hardware         | Model            | Speed
# MacBook M3 Pro   | Llama 3.1 8B     | 45 tok/sec (Metal GPU)
# Ryzen 7 5800X    | Llama 3.1 8B Q4  | 8 tok/sec (CPU)
# RTX 4090         | Llama 3.1 70B    | 35 tok/sec
# NVIDIA A100      | Llama 3.1 70B    | 80 tok/sec
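Rather than trusting a benchmark table, you can measure throughput on your own hardware: each (non-streaming) Ollama /api/generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds), and tokens/second is simply their ratio. A sketch of the arithmetic:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_count (generated tokens) and eval_duration (ns)
    in each /api/generate response; throughput is their ratio."""
    return eval_count / (eval_duration_ns / 1e9)


# Example numbers: 450 tokens generated in 10 seconds of eval time
print(round(tokens_per_second(450, 10_000_000_000), 1))  # → 45.0
```

Log this per request in production: a sudden drop usually means the model fell back from GPU to CPU (e.g. VRAM pressure from another process).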

Frequently Asked Questions

Is the quality comparable to GPT-4o for document Q&A?

For factual document retrieval (legal documents, HR policies, technical documentation), Llama 3.1 8B achieves 85-90% of GPT-4o's answer quality on structured Q&A tasks. The gap is larger for complex multi-hop reasoning or summarization of very long documents. For most enterprise "chat with your docs" use cases, the quality is sufficient and the compliance benefit outweighs the small quality delta.

How do I handle very large document collections (10,000+ PDFs)?

ChromaDB's SQLite mode handles millions of vectors on a single machine. For very large collections, switch to ChromaDB's client-server mode (runs as a separate process) or use pgvector in a local PostgreSQL instance. The ingestion pipeline will take hours for large collections — run it as a background job with progress tracking.
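For capacity planning at that scale, a back-of-the-envelope estimate helps: nomic-embed-text produces 768-dimensional float32 vectors, so the raw vector payload is roughly vectors × 768 × 4 bytes, with ChromaDB adding index and metadata overhead on top. A quick sketch, where the chunks-per-PDF figure is an assumption (it varies widely with document length):

```python
def estimated_vector_storage_gb(n_chunks: int, dims: int = 768, bytes_per_float: int = 4) -> float:
    """Raw float32 vector payload only; real ChromaDB usage is higher once
    HNSW index structures and stored chunk text/metadata are included."""
    return n_chunks * dims * bytes_per_float / 1e9


# Assumption: ~30 chunks per PDF
n_chunks = 10_000 * 30
print(f"{estimated_vector_storage_gb(n_chunks):.2f} GB raw vectors")
```

Even at 300,000 chunks the raw vectors are under a gigabyte, which is why a single machine comfortably handles collections of this size; RAM for the HNSW index, not disk, is usually the first constraint.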

Conclusion

The offline RAG stack (Ollama + ChromaDB + LangChain) delivers enterprise-grade document intelligence with zero external API calls. Every document, embedding, and generated response stays within your infrastructure. For healthcare, legal, financial, and government use cases where data sovereignty is non-negotiable, this architecture enables AI-powered document Q&A while fully satisfying compliance requirements.


Written by

Vivek

AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK