Local AI: Running LLMs with Ollama
Dec 29, 2025 • 20 min read
Cloud APIs are powerful but come with three fundamental constraints: cost at scale, data privacy concerns, and internet dependency. Ollama solves all three by being the "Docker for LLMs" — a tool that makes pulling, running, and serving open-source language models as simple as a single terminal command. This guide covers everything from first install to building production-grade local AI applications.
1. Why Ollama Instead of Raw llama.cpp?
llama.cpp is powerful but requires manual compilation, model-format conversion, and GGUF file management. Ollama abstracts all of that complexity away:
Without Ollama
- Compile llama.cpp from source
- Download GGUF manually from HuggingFace
- Write server startup scripts
- Manage model files yourself
- Configure GPU offload flags
With Ollama
- curl | sh to install
- ollama pull llama3.1 to download a model
- ollama serve (auto-restarts)
- Automatic model storage/versioning
- Auto-detects GPU (Metal/CUDA)
2. Installation
# Mac and Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download installer from https://ollama.com
# Verify installation
ollama --version # Should print: ollama version 0.x.x
# Pull and run your first model (downloads ~4.7GB)
ollama run llama3.1
# >>> Enter your prompt here
# The model loads and you get an interactive chat

On Apple Silicon (M1/M2/M3), Ollama automatically uses Metal GPU acceleration. On Linux with NVIDIA GPUs, it uses CUDA. On CPU-only machines, it still works, but inference is slower.
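Once the server is up, you can sanity-check it from code before wiring it into an app. This is a minimal sketch using only the standard library and Ollama's GET /api/tags endpoint, which lists locally pulled models:

```python
import json
import urllib.request
import urllib.error

def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """List locally pulled model names via the /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]

if __name__ == "__main__":
    if ollama_is_running():
        print("Ollama is up. Models:", installed_models())
    else:
        print("Ollama is not reachable on localhost:11434")
```

This is handy in app startup code: fail fast with a clear message instead of a confusing connection error on the first chat request.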
3. The Model Library
Ollama's model library covers the most popular open-source models:
| Model | Size | VRAM (Q4) | Best For |
|---|---|---|---|
| llama3.2 | 3B | 2 GB | Edge devices, fast responses |
| llama3.1 | 8B | 5 GB | General use, best quality/size ratio |
| llama3.1:70b | 70B | 40 GB | GPT-4 class quality, high-end hardware |
| mistral | 7B | 4.1 GB | European data regulations, strong reasoning |
| gemma2 | 9B | 5.5 GB | Google architecture, strong at coding |
| deepseek-coder-v2 | 16B | 9 GB | Best for code generation |
| qwen2.5 | 7B | 4.7 GB | Multilingual, strong Chinese support |
| phi3.5 | 3.8B | 2.2 GB | Microsoft model, excellent at math |
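As a rough sizing helper, the Q4 VRAM column above can be turned into a small lookup table and used to pick the largest model that fits your GPU. The numbers are the approximate figures from the table, not exact requirements:

```python
from typing import Optional

# Approximate Q4 VRAM needs in GB, taken from the table above (illustrative)
MODEL_VRAM_GB = {
    "llama3.2": 2.0,
    "mistral": 4.1,
    "qwen2.5": 4.7,
    "llama3.1": 5.0,
    "gemma2": 5.5,
    "deepseek-coder-v2": 9.0,
    "llama3.1:70b": 40.0,
}

def largest_fitting_model(vram_gb: float) -> Optional[str]:
    """Pick the biggest model whose Q4 quantization fits in the given VRAM."""
    candidates = [(gb, name) for name, gb in MODEL_VRAM_GB.items() if gb <= vram_gb]
    return max(candidates)[1] if candidates else None
```

For example, an 8 GB GPU lands on gemma2, while a 48 GB workstation can take llama3.1:70b. Leave a couple of GB of headroom for the context window, which grows with num_ctx.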
4. The REST API
Ollama runs a local HTTP server on port 11434. You can hit it directly or use the OpenAI-compatible endpoint:
Native Ollama API
// JavaScript — Native Ollama API
const response = await fetch('http://localhost:11434/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'llama3.1',
messages: [{ role: 'user', content: 'Explain neural networks in simple terms' }],
stream: false // true for streaming responses
})
});
const data = await response.json();
console.log(data.message.content);

OpenAI-Compatible API (Drop-In Replacement)
from openai import OpenAI
# Point to local Ollama instead of OpenAI servers
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Required by SDK but not validated locally
)
# Use EXACTLY the same code as you would with GPT-4o
response = client.chat.completions.create(
model='llama3.1',
messages=[
{'role': 'system', 'content': 'You are a helpful coding assistant.'},
{'role': 'user', 'content': 'Write a Python function to sort a list of dicts by a key'}
],
temperature=0.7,
max_tokens=1000
)
print(response.choices[0].message.content)

5. Streaming Responses
import ollama
# Python streaming with the official Ollama library
stream = ollama.chat(
model='llama3.1',
messages=[{'role': 'user', 'content': 'Write a short story about a robot'}],
stream=True,
)
# Print each token as it arrives
for chunk in stream:
print(chunk['message']['content'], end='', flush=True)
# Or use async for FastAPI / async applications:
async def stream_response(prompt: str):
async for chunk in await ollama.AsyncClient().chat(
model='llama3.1',
messages=[{'role': 'user', 'content': prompt}],
stream=True,
):
        yield chunk['message']['content']

6. Custom Modelfiles: Creating Specialized Assistants
A Modelfile is like a Dockerfile for LLMs. You can create custom models with specific system prompts, temperature settings, and base models:
# Modelfile
FROM llama3.1
# Set system prompt
SYSTEM """
You are a senior Python engineer with 15 years of experience.
You write clean, typed, well-documented code.
Always include type hints and docstrings.
Never produce untested pseudocode.
"""
# Configure inference parameters
# (Modelfiles only support full-line comments, so notes go on their own lines)
# Lower temperature = more deterministic output
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
# Context window size
PARAMETER num_ctx 8192
# Build and run your custom model
# ollama create python-expert -f Modelfile
# ollama run python-expert

7. Production Use Cases
Privacy-Sensitive Document Processing
Legal firms, healthcare providers, and financial institutions process documents with PII (patient records, contracts, financial statements) that cannot legally be sent to third-party APIs. Ollama running on-premises processes the documents with no data leaving the organization.
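A minimal sketch of that pattern, hitting the native /api/chat endpoint with only the standard library. The system prompt and model choice are illustrative, not prescribed:

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # native chat endpoint

def build_summary_request(document: str, model: str = "llama3.1") -> dict:
    """Build a /api/chat payload asking the local model to summarize a
    sensitive document. Nothing in this flow leaves the machine."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Summarize the document. Do not repeat names, "
                        "ID numbers, or other personally identifying details."},
            {"role": "user", "content": document},
        ],
        "stream": False,
    }

def summarize_locally(document: str, model: str = "llama3.1") -> str:
    """POST the request to the local Ollama server and return the summary."""
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=json.dumps(build_summary_request(document, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Because the endpoint is localhost, compliance review only has to cover the machine itself, not a third-party data processor.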
Offline Edge Deployments
Industrial IoT, field equipment, and offline kiosk applications need AI that works without internet. Ollama embedded in a local device provides AI features even in air-gapped environments.
Development & Testing Without API Costs
Run hundreds of automated integration tests against a local Llama model for free. Switch to GPT-4o only for production. A typical developer saves $50-200/month in API costs during development.
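A sketch of such a test suite using only the standard library: it probes the local port first so the tests skip cleanly, instead of failing, when Ollama isn't running. The prompt and model are illustrative:

```python
import json
import socket
import unittest
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"

def server_available(host: str = "127.0.0.1", port: int = 11434,
                     timeout: float = 0.5) -> bool:
    """Cheap TCP probe so the suite skips instead of failing when offline."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class LocalModelTests(unittest.TestCase):
    @unittest.skipUnless(server_available(), "local Ollama is not running")
    def test_model_can_answer_basic_math(self):
        payload = {
            "model": "llama3.1",
            "messages": [{"role": "user",
                          "content": "What is 2 + 2? Reply with just the number."}],
            "stream": False,
        }
        req = urllib.request.Request(
            OLLAMA_CHAT,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            answer = json.load(resp)["message"]["content"]
        self.assertIn("4", answer)
```

Run it with python -m unittest (file name is up to you). Because local inference is free, you can afford to run assertions like this on every commit.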
Custom Fine-Tuned Models
Fine-tune a Llama model on your proprietary data using QLoRA, convert to GGUF, and serve it with Ollama. Your fine-tuned company knowledge base assistant, running locally, with no data sent externally.
8. Integrating with LangChain
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage
# Chat model — drop-in for ChatOpenAI
llm = ChatOllama(model="llama3.1", temperature=0.7)
response = llm.invoke([
SystemMessage(content="You are a helpful assistant."),
HumanMessage(content="What is RAG?")
])
print(response.content)
# Embeddings for RAG — no OpenAI embedding costs!
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("What is semantic search?")

Frequently Asked Questions
How does quality compare to GPT-4o?
Llama 3.1 70B approaches GPT-4o on many benchmarks, and the 7-8B models are roughly on par with GPT-3.5-turbo. For most standard tasks (summarization, Q&A, coding), the gap is small. For complex multi-step reasoning, GPT-4o still leads.
Can I run Ollama on a server for team access?
Yes. Start Ollama on the server with OLLAMA_HOST=0.0.0.0 ollama serve and point your team's apps at http://your-server:11434. If the server is reachable from outside your VPN, put Nginx with basic auth in front of it. A single RTX 4090 server can handle dozens of concurrent inference requests.
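One way to wire clients to a shared server is to resolve the host from an environment variable. Note that OLLAMA_SERVER below is a hypothetical variable name for your own deployment, not something Ollama itself reads:

```python
import os

def ollama_base_url() -> str:
    """Resolve the shared Ollama server from the environment.

    OLLAMA_SERVER is a hypothetical deployment-specific variable;
    when it is unset we fall back to a local instance.
    """
    host = os.environ.get("OLLAMA_SERVER", "localhost")
    return f"http://{host}:11434"
```

Clients can then target f"{ollama_base_url()}/v1" with the OpenAI-compatible SDK, or hit the native endpoints directly, without hardcoding the server address in application code.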
How do I switch between local and cloud models seamlessly?
Use an abstraction function that reads from an environment variable: MODEL_PROVIDER=ollama uses local, MODEL_PROVIDER=openai uses GPT-4o. Both use the same OpenAI-compatible SDK interface. This lets you develop locally for free and switch to cloud for production without code changes.
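A minimal sketch of that abstraction; the env-var names and model choices are illustrative, not prescribed by either SDK:

```python
import os

def client_config() -> dict:
    """Pick provider settings based on MODEL_PROVIDER.

    Both branches work with the same OpenAI-compatible client, so
    switching providers is purely a configuration change.
    """
    if os.environ.get("MODEL_PROVIDER", "ollama") == "openai":
        return {
            "base_url": "https://api.openai.com/v1",
            "api_key": os.environ["OPENAI_API_KEY"],
            "model": "gpt-4o",
        }
    return {
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",  # required by the SDK, ignored by Ollama
        "model": "llama3.1",
    }
```

At call sites: cfg = client_config(), then OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"]) and pass cfg["model"] to each request. No other code changes between local and cloud.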
Conclusion
Ollama has fundamentally changed local AI development. What required expertise in GPU programming and C++ compilation two years ago now takes five minutes. For privacy-sensitive applications, cost-conscious development, and edge deployments, Ollama's combination of simplicity, the OpenAI-compatible API, and excellent model support makes it the definitive choice for local AI inference in 2025.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.