Local AI: Running LLMs with Ollama
Dec 29, 2025 • 20 min read
Cloud APIs are powerful but come with three fundamental constraints: cost at scale, data privacy concerns, and internet dependency. Ollama solves all three by being the "Docker for LLMs" — a tool that makes pulling, running, and serving open-source language models as simple as a single terminal command. This guide covers everything from first install to building production-grade local AI applications.
1. Why Ollama Instead of Raw llama.cpp?
llama.cpp is powerful but requires manual compilation, model-format conversion, and GGUF file management. Ollama abstracts all of that complexity away:
Without Ollama
- Compile llama.cpp from source
- Download GGUF manually from HuggingFace
- Write server startup scripts
- Manage model files yourself
- Configure GPU offload flags
With Ollama
- curl | sh to install
- ollama pull llama3.1 to download a model
- ollama serve (auto-restarts)
- Automatic model storage/versioning
- Auto-detects GPU (Metal/CUDA)
2. Installation
# Mac and Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download installer from https://ollama.com
# Verify installation
ollama --version # Should print: ollama version 0.x.x
# Pull and run your first model (downloads ~4.7GB)
ollama run llama3.1
# >>> Enter your prompt here
# The model loads and you get an interactive chat

On Apple Silicon (M1/M2/M3), Ollama automatically uses Metal GPU acceleration. On Linux with NVIDIA GPUs, it uses CUDA. On CPU-only machines, it still works, but inference is slower.
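Once the server is up, you can sanity-check it from code before wiring it into an app. This is a minimal sketch using only the standard library and Ollama's GET /api/tags endpoint, which lists locally pulled models:

```python
import json
import urllib.request
import urllib.error

def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """List locally pulled model names via the /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]

if __name__ == "__main__":
    if ollama_is_running():
        print("Ollama is up. Models:", installed_models())
    else:
        print("Ollama is not reachable on localhost:11434")
```

This is handy in app startup code: fail fast with a clear message instead of a confusing connection error on the first chat request.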
3. The Model Library
Ollama's model library covers the most popular open-source models:
| Model | Size | VRAM (Q4) | Best For |
|---|---|---|---|
| llama3.2 | 3B | 2 GB | Edge devices, fast responses |
| llama3.1 | 8B | 5 GB | General use, best quality/size ratio |
| llama3.1:70b | 70B | 40 GB | GPT-4 class quality, high-end hardware |
| mistral | 7B | 4.1 GB | European data regulations, strong reasoning |
| gemma2 | 9B | 5.5 GB | Google architecture, strong at coding |
| deepseek-coder-v2 | 16B | 9 GB | Best for code generation |
| qwen2.5 | 7B | 4.7 GB | Multilingual, strong Chinese support |
| phi3.5 | 3.8B | 2.2 GB | Microsoft model, excellent at math |
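As a rough sizing helper, the Q4 VRAM column above can be turned into a small lookup table and used to pick the largest model that fits your GPU. The numbers are the approximate figures from the table, not exact requirements:

```python
from typing import Optional

# Approximate Q4 VRAM needs in GB, taken from the table above (illustrative)
MODEL_VRAM_GB = {
    "llama3.2": 2.0,
    "mistral": 4.1,
    "qwen2.5": 4.7,
    "llama3.1": 5.0,
    "gemma2": 5.5,
    "deepseek-coder-v2": 9.0,
    "llama3.1:70b": 40.0,
}

def largest_fitting_model(vram_gb: float) -> Optional[str]:
    """Pick the biggest model whose Q4 quantization fits in the given VRAM."""
    candidates = [(gb, name) for name, gb in MODEL_VRAM_GB.items() if gb <= vram_gb]
    return max(candidates)[1] if candidates else None
```

For example, an 8 GB GPU lands on gemma2, while a 48 GB workstation can take llama3.1:70b. Leave a couple of GB of headroom for the context window, which grows with num_ctx.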
4. The REST API
Ollama runs a local HTTP server on port 11434. You can hit it directly or use the OpenAI-compatible endpoint:
Native Ollama API
// JavaScript — Native Ollama API
const response = await fetch('http://localhost:11434/api/chat', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'llama3.1',
messages: [{ role: 'user', content: 'Explain neural networks in simple terms' }],
stream: false // true for streaming responses
})
});
const data = await response.json();
console.log(data.message.content);

OpenAI-Compatible API (Drop-In Replacement)
from openai import OpenAI
# Point to local Ollama instead of OpenAI servers
client = OpenAI(
base_url='http://localhost:11434/v1',
api_key='ollama' # Required by SDK but not validated locally
)
# Use EXACTLY the same code as you would with GPT-4o
response = client.chat.completions.create(
model='llama3.1',
messages=[
{'role': 'system', 'content': 'You are a helpful coding assistant.'},
{'role': 'user', 'content': 'Write a Python function to sort a list of dicts by a key'}
],
temperature=0.7,
max_tokens=1000
)
print(response.choices[0].message.content)

5. Streaming Responses
import ollama
# Python streaming with the official Ollama library
stream = ollama.chat(
model='llama3.1',
messages=[{'role': 'user', 'content': 'Write a short story about a robot'}],
stream=True,
)
# Print each token as it arrives
for chunk in stream:
print(chunk['message']['content'], end='', flush=True)
# Or use async for FastAPI / async applications:
async def stream_response(prompt: str):
async for chunk in await ollama.AsyncClient().chat(
model='llama3.1',
messages=[{'role': 'user', 'content': prompt}],
stream=True,
):
        yield chunk['message']['content']

6. Custom Modelfiles: Creating Specialized Assistants
A Modelfile is like a Dockerfile for LLMs. You can create custom models with specific system prompts, temperature settings, and base models:
# Modelfile
FROM llama3.1
# Set system prompt
SYSTEM """
You are a senior Python engineer with 15 years of experience.
You write clean, typed, well-documented code.
Always include type hints and docstrings.
Never produce untested pseudocode.
"""
# Configure inference parameters
# (Modelfiles only support full-line comments, so notes go on their own lines)
# Lower temperature = more deterministic output
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
# Context window size
PARAMETER num_ctx 8192
# Build and run your custom model
# ollama create python-expert -f Modelfile
# ollama run python-expert

7. Production Use Cases
Privacy-Sensitive Document Processing
Legal firms, healthcare providers, and financial institutions process documents with PII (patient records, contracts, financial statements) that cannot legally be sent to third-party APIs. Ollama running on-premises processes the documents with no data leaving the organization.
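A minimal sketch of that pattern, hitting the native /api/chat endpoint with only the standard library. The system prompt and model choice are illustrative, not prescribed:

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # native chat endpoint

def build_summary_request(document: str, model: str = "llama3.1") -> dict:
    """Build a /api/chat payload asking the local model to summarize a
    sensitive document. Nothing in this flow leaves the machine."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Summarize the document. Do not repeat names, "
                        "ID numbers, or other personally identifying details."},
            {"role": "user", "content": document},
        ],
        "stream": False,
    }

def summarize_locally(document: str, model: str = "llama3.1") -> str:
    """POST the request to the local Ollama server and return the summary."""
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=json.dumps(build_summary_request(document, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Because the endpoint is localhost, compliance review only has to cover the machine itself, not a third-party data processor.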
Offline Edge Deployments
Industrial IoT, field equipment, and offline kiosk applications need AI that works without internet. Ollama embedded in a local device provides AI features even in air-gapped environments.
Development & Testing Without API Costs
Run hundreds of automated integration tests against a local Llama model for free. Switch to GPT-4o only for production. A typical developer saves $50-200/month in API costs during development.
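A sketch of such a test suite using only the standard library: it probes the local port first so the tests skip cleanly, instead of failing, when Ollama isn't running. The prompt and model are illustrative:

```python
import json
import socket
import unittest
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"

def server_available(host: str = "127.0.0.1", port: int = 11434,
                     timeout: float = 0.5) -> bool:
    """Cheap TCP probe so the suite skips instead of failing when offline."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class LocalModelTests(unittest.TestCase):
    @unittest.skipUnless(server_available(), "local Ollama is not running")
    def test_model_can_answer_basic_math(self):
        payload = {
            "model": "llama3.1",
            "messages": [{"role": "user",
                          "content": "What is 2 + 2? Reply with just the number."}],
            "stream": False,
        }
        req = urllib.request.Request(
            OLLAMA_CHAT,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            answer = json.load(resp)["message"]["content"]
        self.assertIn("4", answer)
```

Run it with python -m unittest (file name is up to you). Because local inference is free, you can afford to run assertions like this on every commit.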
Custom Fine-Tuned Models
Fine-tune a Llama model on your proprietary data using QLoRA, convert to GGUF, and serve it with Ollama. Your fine-tuned company knowledge base assistant, running locally, with no data sent externally.
8. Integrating with LangChain
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage
# Chat model — drop-in for ChatOpenAI
llm = ChatOllama(model="llama3.1", temperature=0.7)
response = llm.invoke([
SystemMessage(content="You are a helpful assistant."),
HumanMessage(content="What is RAG?")
])
print(response.content)
# Embeddings for RAG — no OpenAI embedding costs!
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vector = embeddings.embed_query("What is semantic search?")

Frequently Asked Questions
How does quality compare to GPT-4o?
Llama 3.1 70B approaches GPT-4o on many benchmarks, and the 7-8B models are roughly on par with GPT-3.5-turbo. For most standard tasks (summarization, Q&A, coding), the gap is small. For complex multi-step reasoning, GPT-4o still leads.
Can I run Ollama on a server for team access?
Yes. Start Ollama on the server with OLLAMA_HOST=0.0.0.0 ollama serve and point your team's apps at http://your-server:11434. If the server is reachable from outside your VPN, put Nginx with basic auth in front of it. A single RTX 4090 server can handle dozens of concurrent inference requests.
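One way to wire clients to a shared server is to resolve the host from an environment variable. Note that OLLAMA_SERVER below is a hypothetical variable name for your own deployment, not something Ollama itself reads:

```python
import os

def ollama_base_url() -> str:
    """Resolve the shared Ollama server from the environment.

    OLLAMA_SERVER is a hypothetical deployment-specific variable;
    when it is unset we fall back to a local instance.
    """
    host = os.environ.get("OLLAMA_SERVER", "localhost")
    return f"http://{host}:11434"
```

Clients can then target f"{ollama_base_url()}/v1" with the OpenAI-compatible SDK, or hit the native endpoints directly, without hardcoding the server address in application code.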
How do I switch between local and cloud models seamlessly?
Use an abstraction function that reads from an environment variable: MODEL_PROVIDER=ollama uses local, MODEL_PROVIDER=openai uses GPT-4o. Both use the same OpenAI-compatible SDK interface. This lets you develop locally for free and switch to cloud for production without code changes.
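A minimal sketch of that abstraction; the env-var names and model choices are illustrative, not prescribed by either SDK:

```python
import os

def client_config() -> dict:
    """Pick provider settings based on MODEL_PROVIDER.

    Both branches work with the same OpenAI-compatible client, so
    switching providers is purely a configuration change.
    """
    if os.environ.get("MODEL_PROVIDER", "ollama") == "openai":
        return {
            "base_url": "https://api.openai.com/v1",
            "api_key": os.environ["OPENAI_API_KEY"],
            "model": "gpt-4o",
        }
    return {
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",  # required by the SDK, ignored by Ollama
        "model": "llama3.1",
    }
```

At call sites: cfg = client_config(), then OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"]) and pass cfg["model"] to each request. No other code changes between local and cloud.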
Conclusion
Ollama has fundamentally changed local AI development. What required expertise in GPU programming and C++ compilation two years ago now takes five minutes. For privacy-sensitive applications, cost-conscious development, and edge deployments, Ollama's combination of simplicity, the OpenAI-compatible API, and excellent model support makes it the definitive choice for local AI inference in 2025.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.