# Running Your Own AI Assistant on Local Hardware
A fully private, locally-running AI assistant — one that works offline, never sends your data to any third party, and costs nothing per query after the initial hardware investment — is achievable today on consumer hardware. This guide walks you through the complete setup: choosing your hardware, picking the right models, setting up inference, adding a chat interface, and layering in RAG for personal knowledge management.
## Hardware Requirements
| Hardware | Best Model | Speed | Cost |
|---|---|---|---|
| Mac Studio M2 Ultra (192GB) | Llama 3.1 70B | ~8 tok/s | $3,999 |
| Mac Mini M4 Pro (64GB) | Llama 3.1 8B / Qwen 2.5 32B | ~20 tok/s | $1,399 |
| RTX 4090 (24GB VRAM) | Llama 3.1 8B (full) / 70B (Q4) | ~60 tok/s | $1,800 |
| RTX 3080 (10GB VRAM) | Mistral 7B / Llama 3.1 8B Q4 | ~30 tok/s | $650 (used) |
| Any modern CPU (no GPU) | Llama 3.1 8B Q4 (llama.cpp) | ~5-8 tok/s | Existing hardware |
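The memory figures behind the table follow from a back-of-envelope rule: a model's footprint is roughly its parameter count times the bytes per weight, plus runtime overhead for the KV cache and buffers. A minimal sketch (the 20% overhead factor and the 4.5-bits-per-weight figure for Q4_K_M are rough assumptions, not vendor specs):

```python
# Rough memory estimate: params x bits-per-weight / 8, plus ~20% overhead
# for KV cache and runtime buffers at moderate context lengths (assumption).
def estimate_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params @ 8 bits ~ 1 GB
    return round(weights_gb * 1.2, 1)                  # +20% overhead estimate

# FP16 = 16 bits/weight; Q4_K_M quantization ~ 4.5 bits/weight with metadata
print(estimate_memory_gb(8, 16))    # Llama 3.1 8B at FP16: ~19 GB
print(estimate_memory_gb(8, 4.5))   # Llama 3.1 8B at Q4: ~5-6 GB, fits a 10GB card
print(estimate_memory_gb(70, 4.5))  # Llama 3.1 70B at Q4: ~47 GB, needs Mac Studio-class RAM
```

This is why the 70B model in the table needs the 192GB Mac Studio (or aggressive quantization on a 24GB GPU), while an 8B model at Q4 runs comfortably on a 10GB RTX 3080.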
## Step 1: Install Ollama (Easiest Path)
```bash
# Install Ollama (Mac / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model -- it downloads automatically
ollama run llama3.1:8b
# First run: downloads ~4.7GB of model weights
# Subsequent runs: instant (cached locally)

# You now have a full chat interface in your terminal.
# Your data never leaves your machine.

# List the models installed on your machine:
ollama list

# Other recommended models:
ollama pull qwen2.5:14b          # Excellent coding + reasoning
ollama pull mistral:7b           # Fast and well-rounded
ollama pull deepseek-coder:6.7b  # Best for pure code tasks
ollama pull nomic-embed-text     # For RAG embeddings (no GPU needed)
```
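Beyond the terminal chat, Ollama also serves a native REST API on port 11434 (`/api/generate` for one-shot completions). A standard-library-only sketch of calling it from Python -- the endpoint and JSON fields follow Ollama's documented API; `ask_ollama` is a hypothetical helper name:

```python
import json
import urllib.request

def build_generate_payload(model: str, prompt: str) -> bytes:
    # stream=False asks Ollama to return a single JSON object
    # instead of a stream of newline-delimited chunks.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(prompt: str, model: str = "llama3.1:8b") -> str:
    # Requires a running local Ollama instance (ollama serve).
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=build_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With Ollama running locally:
# print(ask_ollama("Why is the sky blue?"))
```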
## Step 2: Add a Chat Interface (Open WebUI)
```bash
# Open WebUI: ChatGPT-like interface for your local models
# Requires Docker Desktop installed
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

# Open http://localhost:3000 in your browser
# Create a local admin account (no external auth required)
# Select any Ollama model from the dropdown -- full ChatGPT-like experience

# Features:
# - Multi-model conversations
# - File upload for document Q&A (PDF, DOCX, etc.)
# - Image understanding (with vision models like llava)
# - Voice input via Whisper (local, no cloud)
# - Chat history saved locally to SQLite
```
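If you prefer declarative setup, the same container can be described as a `docker-compose.yml` fragment (a sketch mirroring the `docker run` flags above, nothing more):

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"   # lets the container reach Ollama on the host
    volumes:
      - open-webui:/app/backend/data          # chat history survives container rebuilds
    restart: always

volumes:
  open-webui:
```

Start it with `docker compose up -d`; the named volume keeps your chat history across upgrades.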
## Step 3: Add RAG for Personal Knowledge
```python
# pip install langchain chromadb langchain-community

import os

from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# Load your personal notes/documents (expand ~ explicitly -- the loader won't)
loader = DirectoryLoader(
    os.path.expanduser("~/Documents/notes/"), glob="**/*.md", loader_cls=TextLoader
)
docs = loader.load()

# Chunk and embed locally (nomic-embed-text via Ollama)
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./local_knowledge_db",
)

# Query your knowledge base with a local LLM
llm = Ollama(model="llama3.1:8b")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)

# Ask questions about YOUR documents -- entirely locally
result = qa_chain.invoke("What did I write about the team roadmap last quarter?")
print(result["result"])

# 100% private: your notes, embeddings, and LLM responses
# never touch any external server.
```
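To build intuition for the `chunk_size=500, chunk_overlap=50` settings, here is a deliberately simplified character-window splitter. The real `RecursiveCharacterTextSplitter` is smarter (it prefers paragraph and sentence boundaries before falling back to raw characters), but the overlap mechanics are the same:

```python
# Simplified illustration of chunk_size / chunk_overlap: each chunk starts
# (chunk_size - chunk_overlap) characters after the previous one, so the
# last `chunk_overlap` characters of one chunk reappear at the start of the next.
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

chunks = split_with_overlap("a" * 1200, chunk_size=500, chunk_overlap=50)
print(len(chunks))      # 3 overlapping chunks cover 1200 characters
print(len(chunks[0]))   # 500
```

The overlap exists so that a sentence falling on a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing boundary-straddling facts.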
## Step 4: OpenAI-Compatible Local API
Ollama exposes an OpenAI-compatible API at `http://localhost:11434/v1`. This means you can point any existing application that uses the OpenAI SDK directly at your local Ollama instance, changing nothing but the base URL and API key:
```python
from openai import OpenAI

# Point the OpenAI SDK at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.1:8b",  # Any model you've pulled
    messages=[{"role": "user", "content": "Explain gradient descent simply."}],
)
print(response.choices[0].message.content)

# Works exactly like the OpenAI API -- just local and free.
# Existing LangChain, LlamaIndex, and other frameworks that support
# OpenAI-compatible endpoints work with zero modification.
```
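Streaming works through the same endpoint: pass `stream=True` and the SDK yields incremental deltas, so tokens appear as the model generates them. A sketch (assumes the `openai` package installed and a local Ollama instance with the model pulled; `join_deltas` and `stream_completion` are hypothetical helper names):

```python
def join_deltas(deltas) -> str:
    # Streamed chunks can carry None for role-only or final deltas; skip those.
    return "".join(d for d in deltas if d)

def stream_completion(prompt: str, model: str = "llama3.1:8b") -> str:
    from openai import OpenAI  # pip install openai
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # server sends incremental token deltas
    )
    deltas = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # tokens appear as generated
        deltas.append(delta)
    return join_deltas(deltas)

# With Ollama running locally:
# stream_completion("Explain gradient descent in one sentence.")
```

Streaming matters more locally than in the cloud: at 5-20 tok/s on modest hardware, seeing tokens immediately makes the assistant feel responsive even when a full answer takes half a minute.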
## Advanced: Bypassing High-Level Tools with llama.cpp
Ollama is a friendly wrapper, but power users eventually work directly with the underlying llama.cpp project for finer control over quantization and context-window scaling (RoPE scaling).
```bash
# 1. Download a pre-quantized GGUF model from Hugging Face
wget https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

# 2. Build llama.cpp (Metal is enabled by default on Apple Silicon)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# 3. Serve an OpenAI-compatible API locally with strict context limits
#    -c   = context window (8192 tokens)
#    -ngl = layers to offload to the GPU (99 pushes all possible layers)
./build/bin/llama-server -m ../Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
    -c 8192 \
    -ngl 99 \
    --port 8080

# 4. Smoke-test the server's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```

## Conclusion
Setting up a full, private, local AI assistant stack takes under an hour in 2026: Ollama for model management, Open WebUI for the chat interface, ChromaDB for local RAG, and the OpenAI-compatible API for integrations. The result is a system that runs on your hardware, costs nothing per query, works offline, and exposes no data to third parties. For personal productivity, sensitive professional work, and cost-sensitive applications, this stack is production-ready today.