How to Build Apps Using Mistral 7B and Mixtral
The first time I ran Mistral 7B locally, it took me about 45 minutes to get it working. I wasted 20 of those minutes because I didn't understand the difference between the base model, the instruct model, and how the prompt format differs between them. The second time, I had it serving a streaming REST API in under 5 minutes using Ollama. This tutorial is designed to get you to that second-time experience immediately.
Option A: Local Deployment with Ollama (Recommended for Dev)
Ollama is the fastest way to get Mistral running locally. It handles model downloads, quantization selection, and serving a local OpenAI-compatible REST endpoint automatically.
# Install Ollama (macOS)
brew install ollama
# Start the Ollama server first (leave it running in its own terminal)
ollama serve
# In a second terminal: pull and run Mistral 7B (Q4 quantized, ~4.1GB)
ollama run mistral
# Or pull Mixtral 8x7B (Q4 quantized, ~26GB - needs 32GB+ RAM)
ollama pull mixtral
# Ollama now listens on localhost:11434 (native API under /api, OpenAI-compatible API under /v1)
# Test the native generate endpoint:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain the Sliding Window Attention mechanism",
  "stream": false
}'
Instruct vs Base Model
Always use the mistral (instruct) model for application development, never the bare base model. The base model is pre-trained only to predict the next token in raw text, so it tends to continue your prompt rather than answer it. The instruct model has been fine-tuned to follow conversational instructions using Mistral's specific instruct prompt format.
The Mistral Instruct Prompt Format
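Mistral's instruct models expect each user turn to be wrapped in a specific format. Ollama and the Mistral API apply this template for you when you go through their normal endpoints; you only need to build it by hand when you feed raw text straight to the weights (for example with Ollama's raw mode or a bare inference library). In that case, skipping the format noticeably degrades the output. The format uses special tokens: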
# Mistral Instruct Format (v0.1/v0.2/v0.3)
# [INST]  = start of instruction (user turn)
# [/INST] = end of instruction
# These versions have no dedicated system token: the system text goes INSIDE the first [INST]
prompt = """[INST] You are a helpful AI navigator assistant for a shipping company. User question: What is the ETA for shipment #AX-4421? [/INST]"""
# Multi-turn conversation format (each completed assistant reply is closed with </s>):
multi_turn = """[INST] What is photosynthesis? [/INST] Photosynthesis is the process by which plants use sunlight...</s>[INST] How does chlorophyll relate to this? [/INST]"""
# Note: the <<SYS>> ... <</SYS>> wrapper is Llama 2's convention, not Mistral's.
# Mistral 7B v0.3 extended the vocabulary and added function-calling tokens; it did not add a <<SYS>> wrapper.
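The wrapping rules above can be captured in a small helper. What follows is a minimal sketch, not Mistral's official tokenizer logic: `build_mistral_prompt` is a hypothetical name, and the exact whitespace around the tags differs slightly between tokenizer versions. For production use, prefer the chat template shipped with the model (for example via the `mistral-common` package or Hugging Face's `apply_chat_template`).

```python
from typing import Optional

BOS, EOS = "<s>", "</s>"


def build_mistral_prompt(turns: list[tuple[str, Optional[str]]],
                         system: Optional[str] = None) -> str:
    """Render (user, assistant) turns into the [INST] format.

    The assistant part of the last turn should be None: that is the
    reply the model is being asked to generate.
    """
    parts = [BOS]
    for i, (user, assistant) in enumerate(turns):
        # No dedicated system token here: prepend the system text to the first user turn.
        if i == 0 and system:
            user = f"{system}\n\n{user}"
        parts.append(f"[INST] {user} [/INST]")
        if assistant is not None:
            # Completed assistant replies are closed with the end-of-sequence token.
            parts.append(f" {assistant}{EOS}")
    return "".join(parts)


prompt = build_mistral_prompt(
    [("What is photosynthesis?", "Photosynthesis is how plants turn light into energy."),
     ("How does chlorophyll relate to this?", None)],
    system="You are a concise biology tutor.",
)
print(prompt)
```

If the runtime you pass this string to already prepends the beginning-of-sequence token during tokenization, drop the leading `<s>` to avoid doubling it.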
Option B: Building with Mistral API (Production)
For production deployments where you want managed infrastructure without self-hosting, Mistral's official API (La Plateforme, at console.mistral.ai) provides access to their full model lineup. Install the SDK (the examples below assume version 1.0 or later, which introduced the Mistral client class):
pip install mistralai
import os
from mistralai import Mistral
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
# Standard non-streaming request
response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        {"role": "system", "content": "You are a senior Python architect. Be concise."},
        {"role": "user", "content": "Explain asyncio event loops in 200 words."},
    ],
)
print(response.choices[0].message.content)
# Model identifiers at the time of writing (names change and older ones get deprecated; check Mistral's model docs):
# "open-mistral-7b" -- Cheap, fast, self-hostable
# "open-mixtral-8x7b" -- Powerful, self-hostable
# "open-mixtral-8x22b" -- Most capable open model
# "mistral-small-latest" -- Efficient commercial
# "mistral-large-latest" -- Frontier commercial
# "codestral-latest" -- Code-specialized
Streaming for Real-Time UX
As with any production LLM API, you should stream responses in user-facing applications so that text appears as it is generated. Mistral's SDK supports streaming natively:
import sys
# Streaming mode
with client.chat.stream(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": "Write me a FastAPI app skeleton."}]
) as stream:
    for chunk in stream:
        delta = chunk.data.choices[0].delta.content
        if delta:
            sys.stdout.write(delta)
            sys.stdout.flush()  # Forces immediate output
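In a real application you usually also need the complete reply once streaming finishes, for instance to append it to the chat history. Here is a minimal sketch of that pattern, assuming the deltas arrive as an iterable of optional strings. `relay_stream` and the fake delta list are hypothetical stand-ins so the snippet runs without an API key.

```python
import io
import sys
from typing import Iterable, Optional, TextIO


def relay_stream(deltas: Iterable[Optional[str]], out: TextIO = sys.stdout) -> str:
    """Write each text delta to `out` as it arrives and return the full reply."""
    pieces = []
    for delta in deltas:
        if not delta:  # Skip None or empty deltas (e.g. role-only or final chunks).
            continue
        out.write(delta)
        out.flush()  # Push the partial text to the user immediately.
        pieces.append(delta)
    return "".join(pieces)


# Stand-in for: (chunk.data.choices[0].delta.content for chunk in stream)
fake_deltas = ["from fastapi ", None, "import FastAPI", ""]
buffer = io.StringIO()
full_reply = relay_stream(fake_deltas, out=buffer)
```

With the real SDK, the generator expression in the comment would replace `fake_deltas`, and `out` would stay at its default of `sys.stdout`.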
Option C: Using Mistral via OpenAI-Compatible Interface
A powerful architectural trick: Mistral's chat completions endpoint is OpenAI-compatible. If you use the OpenAI SDK with a custom base_url, you can swap between GPT-4o and Mistral by changing configuration (base URL, API key, and model name) instead of application code. Provider-specific features may still differ, so test anything beyond basic chat calls:
import os
from openai import OpenAI
# Point the OpenAI SDK at Mistral's API endpoint
client = OpenAI(
    api_key=os.environ["MISTRAL_API_KEY"],
    base_url="https://api.mistral.ai/v1"  # Mistral's OpenAI-compatible endpoint
)
# Standard OpenAI SDK call -- same shape as a call to OpenAI
response = client.chat.completions.create(
    model="mistral-large-latest",  # Just change the model string
    messages=[{"role": "user", "content": "Hello!"}]
)
This pattern is critical for building model-agnostic applications. Define your LLM_PROVIDER and LLM_MODEL as environment variables. Switching between OpenAI and Mistral (or Ollama local) for A/B testing or cost optimization becomes a zero-code deployment config change.
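One way to wire this up is a small resolver that turns LLM_PROVIDER and LLM_MODEL into client settings. This is a sketch under stated assumptions rather than a prescribed layout: `PROVIDERS`, `resolve_llm_config`, and `make_client` are hypothetical names, and the placeholder key for Ollama is there only because the OpenAI SDK requires a non-empty string.

```python
import os
from typing import Mapping

# Base URL of each provider's OpenAI-compatible endpoint, plus the name of the
# environment variable holding its API key (None = no real key needed).
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "key_var": "OPENAI_API_KEY"},
    "mistral": {"base_url": "https://api.mistral.ai/v1", "key_var": "MISTRAL_API_KEY"},
    "ollama": {"base_url": "http://localhost:11434/v1", "key_var": None},
}


def resolve_llm_config(env: Mapping[str, str] = os.environ) -> dict:
    """Map LLM_PROVIDER / LLM_MODEL to the settings an OpenAI SDK client needs."""
    provider = env.get("LLM_PROVIDER", "mistral").lower()
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}")
    spec = PROVIDERS[provider]
    # Ollama ignores the key, but the SDK still needs some non-empty value.
    api_key = env[spec["key_var"]] if spec["key_var"] else "ollama"
    return {"base_url": spec["base_url"], "api_key": api_key, "model": env["LLM_MODEL"]}


def make_client(config: dict):
    """Build an OpenAI SDK client for the resolved provider."""
    from openai import OpenAI  # Imported lazily so the config logic works without the SDK.
    return OpenAI(api_key=config["api_key"], base_url=config["base_url"])


config = resolve_llm_config({"LLM_PROVIDER": "ollama", "LLM_MODEL": "mistral"})
print(config["base_url"], config["model"])
```

Application code then calls `make_client(config).chat.completions.create(model=config["model"], ...)` and never mentions a provider by name. Model identifiers are provider-specific, so LLM_MODEL has to change together with LLM_PROVIDER.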
Building a Streaming Next.js App with Mistral
Combining the Vercel AI SDK (see our Claude API guide) with Mistral's OpenAI-compatible endpoint is the fastest path to a production streaming chat application:
// app/api/chat/route.js
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';
// Point Vercel AI SDK at Mistral's OpenAI-compatible endpoint
const mistral = createOpenAI({
  apiKey: process.env.MISTRAL_API_KEY,
  baseURL: 'https://api.mistral.ai/v1',
});
export const maxDuration = 60;
export async function POST(req) {
  const { messages } = await req.json();
  const result = await streamText({
    model: mistral('mistral-large-latest'),
    system: "You are a helpful AI assistant.",
    messages,
  });
  return result.toDataStreamResponse();
}
Conclusion
Building with Mistral is approachable once you understand the three deployment paths: Ollama for local dev, the Mistral API for managed production, and the OpenAI-compatible interface for model-agnostic architectures. The combination of open weights, strong performance, and API compatibility makes Mistral a practical choice for engineers who want real deployment flexibility without sacrificing quality.