# Running Your Own AI Assistant on Local Hardware
A fully private, locally-running AI assistant — one that works offline, never sends your data to any third party, and costs nothing per query after the initial hardware investment — is achievable today on consumer hardware. This guide walks you through the complete setup: choosing your hardware, picking the right models, setting up inference, adding a chat interface, and layering in RAG for personal knowledge management.
## Hardware Requirements
| Hardware | Best Model | Speed | Cost |
|---|---|---|---|
| Mac Studio M2 Ultra (192GB) | Llama 3.1 70B | ~8 tok/s | $3,999 |
| Mac Mini M4 Pro (64GB) | Llama 3.1 8B / Qwen 2.5 32B | ~20 tok/s | $1,399 |
| RTX 4090 (24GB VRAM) | Llama 3.1 8B (full) / 70B (Q4) | ~60 tok/s | $1,800 |
| RTX 3080 (10GB VRAM) | Mistral 7B / Llama 3.1 8B Q4 | ~30 tok/s | $650 (used) |
| Any modern CPU (no GPU) | Llama 3.1 8B Q4 (llama.cpp) | ~5-8 tok/s | Existing hardware |
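The memory figures behind the table follow from a back-of-envelope rule: a model's footprint is roughly its parameter count times the bytes per weight, plus runtime overhead for the KV cache and buffers. A minimal sketch (the 20% overhead factor and the 4.5-bits-per-weight figure for Q4_K_M are rough assumptions, not vendor specs):

```python
# Rough memory estimate: params x bits-per-weight / 8, plus ~20% overhead
# for KV cache and runtime buffers at moderate context lengths (assumption).
def estimate_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params @ 8 bits ~ 1 GB
    return round(weights_gb * 1.2, 1)                  # +20% overhead estimate

# FP16 = 16 bits/weight; Q4_K_M quantization ~ 4.5 bits/weight with metadata
print(estimate_memory_gb(8, 16))    # Llama 3.1 8B at FP16: ~19 GB
print(estimate_memory_gb(8, 4.5))   # Llama 3.1 8B at Q4: ~5-6 GB, fits a 10GB card
print(estimate_memory_gb(70, 4.5))  # Llama 3.1 70B at Q4: ~47 GB, needs Mac Studio-class RAM
```

This is why the 70B model in the table needs the 192GB Mac Studio (or aggressive quantization on a 24GB GPU), while an 8B model at Q4 runs comfortably on a 10GB RTX 3080.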
## Step 1: Install Ollama (Easiest Path)
```bash
# Install Ollama (Mac / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model -- it downloads automatically
ollama run llama3.1:8b
# First run: downloads ~4.7GB of model weights
# Subsequent runs: instant (cached locally)

# You now have a full chat interface in your terminal.
# Your data never leaves your machine.

# List the models installed on your machine:
ollama list

# Other recommended models:
ollama pull qwen2.5:14b          # Excellent coding + reasoning
ollama pull mistral:7b           # Fast and well-rounded
ollama pull deepseek-coder:6.7b  # Best for pure code tasks
ollama pull nomic-embed-text     # For RAG embeddings (no GPU needed)
```
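Beyond the terminal chat, Ollama also serves a native REST API on port 11434 (`/api/generate` for one-shot completions). A standard-library-only sketch of calling it from Python -- the endpoint and JSON fields follow Ollama's documented API; `ask_ollama` is a hypothetical helper name:

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str) -> bytes:
    # stream=False asks Ollama to return a single JSON object
    # instead of a stream of newline-delimited chunks.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    # Requires a running local Ollama instance (ollama serve).
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=build_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With Ollama running locally:
# print(ask_ollama("Why is the sky blue?"))
```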
## Step 2: Add a Chat Interface (Open WebUI)
```bash
# Open WebUI: ChatGPT-like interface for your local models
# Requires Docker Desktop installed
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Open http://localhost:3000 in your browser
# Create a local admin account (no external auth required)
# Select any Ollama model from the dropdown -- full ChatGPT-like experience

# Features:
# - Multi-model conversations
# - File upload for document Q&A (PDF, DOCX, etc.)
# - Image understanding (with vision models like llava)
# - Voice input via Whisper (local, no cloud)
# - Chat history saved locally to SQLite
```
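If you prefer declarative setup, the same container can be described as a `docker-compose.yml` fragment (a sketch mirroring the `docker run` flags above, nothing more):

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"   # lets the container reach Ollama on the host
    volumes:
      - open-webui:/app/backend/data          # chat history survives container rebuilds
    restart: always

volumes:
  open-webui:
```

Start it with `docker compose up -d`; the named volume keeps your chat history across upgrades.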
## Step 3: Add RAG for Personal Knowledge
```python
# pip install langchain chromadb langchain-community

import os

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# Load your personal notes/documents (expand ~ explicitly -- the loader won't)
loader = DirectoryLoader(
    os.path.expanduser("~/Documents/notes/"), glob="**/*.md", loader_cls=TextLoader
)
docs = loader.load()

# Chunk and embed locally (nomic-embed-text via Ollama)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./local_knowledge_db",
)

# Query your knowledge base with a local LLM
llm = Ollama(model="llama3.1:8b")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

# Ask questions about YOUR documents -- entirely locally
result = qa_chain.invoke("What did I write about the team roadmap last quarter?")
print(result["result"])

# 100% private: your notes, embeddings, and LLM responses
# never touch any external server.
```
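To build intuition for the `chunk_size=500, chunk_overlap=50` settings, here is a deliberately simplified character-window splitter. The real `RecursiveCharacterTextSplitter` is smarter (it prefers paragraph and sentence boundaries before falling back to raw characters), but the overlap mechanics are the same:

```python
# Simplified illustration of chunk_size / chunk_overlap: each chunk starts
# (chunk_size - chunk_overlap) characters after the previous one, so the
# last `chunk_overlap` characters of one chunk reappear at the start of the next.
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

chunks = split_with_overlap("a" * 1200, chunk_size=500, chunk_overlap=50)
print(len(chunks))      # 3 overlapping chunks cover 1200 characters
print(len(chunks[0]))   # 500
```

The overlap exists so that a sentence falling on a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing boundary-straddling facts.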
## Step 4: OpenAI-Compatible Local API
Ollama exposes an OpenAI-compatible API at `http://localhost:11434/v1`. This means you can point any existing application that uses the OpenAI SDK directly at your local Ollama instance, changing nothing but the base URL and API key:
```python
from openai import OpenAI

# Point the OpenAI SDK at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",  # Any model you've pulled
    messages=[{"role": "user", "content": "Explain gradient descent simply."}],
)
print(response.choices[0].message.content)

# Works exactly like the OpenAI API -- just local and free.
# Existing LangChain, LlamaIndex, and other frameworks that support
# OpenAI-compatible endpoints work with zero modification.
```
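Streaming works through the same endpoint: pass `stream=True` and the SDK yields incremental deltas, so tokens appear as the model generates them. A sketch (assumes the `openai` package installed and a local Ollama instance with the model pulled; `join_deltas` and `stream_completion` are hypothetical helper names):

```python
def join_deltas(deltas) -> str:
    # Streamed chunks can carry None for role-only or final deltas; skip those.
    return "".join(d for d in deltas if d)

def stream_completion(prompt: str, model: str = "llama3.1:8b") -> str:
    from openai import OpenAI  # pip install openai
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # server sends incremental token deltas
    )
    deltas = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # tokens appear as generated
        deltas.append(delta)
    return join_deltas(deltas)

# With Ollama running locally:
# stream_completion("Explain gradient descent in one sentence.")
```

Streaming matters more locally than in the cloud: at 5-20 tok/s on modest hardware, seeing tokens immediately makes the assistant feel responsive even when a full answer takes half a minute.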
## Advanced: Bypassing High-Level Tools with llama.cpp
Ollama is a friendly wrapper, but power users eventually work directly with the underlying llama.cpp project for finer control over quantization and context-window scaling (RoPE scaling).
```bash
# 1. Download a pre-quantized GGUF model from Hugging Face
wget https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

# 2. Build llama.cpp (Metal is enabled by default on Apple Silicon)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# 3. Serve an OpenAI-compatible API locally with strict context limits
#    -c   = context window (8192 tokens)
#    -ngl = layers to offload to the GPU (99 pushes all possible layers)
./build/bin/llama-server -m ../Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
    -c 8192 \
    -ngl 99 \
    --port 8080

# 4. Smoke-test the server's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```

## Conclusion
Setting up a full, private, local AI assistant stack takes under an hour in 2026: Ollama for model management, Open WebUI for the chat interface, ChromaDB for local RAG, and the OpenAI-compatible API for integrations. The result is a system that runs on your hardware, costs nothing per query, works offline, and exposes no data to third parties. For personal productivity, sensitive professional work, and cost-sensitive applications, this stack is production-ready today.