
How to Build Apps Using Mistral 7B and Mixtral

The first time I ran Mistral 7B locally, it took me about 45 minutes to get it working. I wasted 20 of those minutes because I didn't understand the difference between the base model, the instruct model, and how the prompt format differs between them. The second time, I had it serving a streaming REST API in under 5 minutes using Ollama. This tutorial is designed to get you to that second-time experience immediately.


Option A: Local Deployment with Ollama (Recommended for Dev)

Ollama is the fastest way to get Mistral running locally. It handles model downloads, quantization selection, and local serving automatically, exposing both its native REST API and an OpenAI-compatible endpoint.

# Install Ollama (macOS)
brew install ollama

# Pull and run Mistral 7B (Q4 quantized, ~4.1GB)
ollama run mistral

# Or pull Mixtral 8x7B (Q4 quantized, ~26GB - needs 32GB+ RAM)
ollama pull mixtral

# Start Ollama as a background server
ollama serve

# Ollama now serves its native API at localhost:11434
# (an OpenAI-compatible API is also available under /v1)
# Test the native endpoint:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain the Sliding Window Attention mechanism",
  "stream": false
}'
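
Because the /v1 routes mirror the OpenAI API shape, the same request also works in chat-completions form. A quick sketch, assuming a reasonably recent Ollama build with the mistral model already pulled:

# OpenAI-compatible equivalent of the request above
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "mistral",
  "messages": [{"role": "user", "content": "Explain the Sliding Window Attention mechanism"}],
  "stream": false
}'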

Instruct vs Base Model

Always use the mistral (instruct) model for application development, never the bare base model. The base model is pre-trained to predict next tokens in raw text. The instruct model has been fine-tuned to follow conversational instructions using Mistral's specific instruct prompt format.

The Mistral Instruct Prompt Format

Mistral uses a very specific prompt wrapping format. If you send raw text without this format to the instruct model, it will perform terribly. The format uses special tokens:

# Mistral Instruct Format (v0.1/v0.2/v0.3)
# [INST] = start of instruction (user turn)
# [/INST] = end of instruction
# No <|system|> token in older Mistral — system goes INSIDE the first [INST]

prompt = """[INST] You are a helpful AI navigator assistant for a shipping company.

User question: What is the ETA for shipment #AX-4421? [/INST]"""

# Multi-turn conversation format (close each assistant reply with </s>):
multi_turn = """<s>[INST] What is photosynthesis? [/INST]
Photosynthesis is the process by which plants use sunlight...</s>
[INST] How does chlorophyll relate to this? [/INST]"""

# Note: Mistral instruct models define no dedicated system token; put your
# system instructions at the top of the first [INST] block, as shown above.
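
You rarely need to assemble these strings by hand. If you load the model through Hugging Face transformers instead of Ollama, the tokenizer's built-in chat template produces the [INST] wrapping for you. A minimal sketch, assuming the transformers package is installed and you have access to the mistralai/Mistral-7B-Instruct-v0.2 repo:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "Photosynthesis is the process by which plants use sunlight..."},
    {"role": "user", "content": "How does chlorophyll relate to this?"},
]

# Render the conversation using the model's own [INST] ... [/INST] template
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)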

Option B: Building with Mistral API (Production)

For production deployments where you want managed infrastructure without self-hosting, Mistral's official API (La Plateforme, at console.mistral.ai) provides access to their full model lineup. Install the SDK: pip install mistralai

import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Standard non-streaming request
response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        {"role": "system", "content": "You are a senior Python architect. Be concise."},
        {"role": "user", "content": "Explain asyncio event loops in 200 words."}
    ]
)
print(response.choices[0].message.content)

# Available models:
# "open-mistral-7b"        -- Cheap, fast, self-hostable
# "open-mixtral-8x7b"      -- Powerful, self-hostable
# "open-mixtral-8x22b"     -- Most capable open model
# "mistral-small-latest"   -- Efficient commercial
# "mistral-large-latest"   -- Frontier commercial
# "codestral-latest"       -- Code-specialized

Streaming for Real-Time UX

As with any production LLM API, you should stream responses in user-facing applications so tokens appear as they are generated. Mistral's SDK supports streaming natively:

import sys

# Streaming mode
with client.chat.stream(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": "Write me a FastAPI app skeleton."}]
) as stream:
    for chunk in stream:
        delta = chunk.data.choices[0].delta.content
        if delta:
            sys.stdout.write(delta)
            sys.stdout.flush()  # Forces immediate output
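
To get that stream in front of a browser, you can relay it over Server-Sent Events. Below is a minimal sketch using FastAPI; the /chat route, request model, and app wiring are illustrative assumptions, not part of the Mistral SDK:

import os

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from mistralai import Mistral
from pydantic import BaseModel

app = FastAPI()
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

class ChatRequest(BaseModel):
    prompt: str  # hypothetical request shape: a single user prompt

@app.post("/chat")
def chat(req: ChatRequest):
    def event_stream():
        # Same streaming pattern as above, yielded as SSE frames
        with client.chat.stream(
            model="mistral-large-latest",
            messages=[{"role": "user", "content": req.prompt}],
        ) as stream:
            for chunk in stream:
                delta = chunk.data.choices[0].delta.content
                if delta:
                    yield f"data: {delta}\n\n"  # one SSE frame per token chunk
    return StreamingResponse(event_stream(), media_type="text/event-stream")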

Option C: Using Mistral via OpenAI-Compatible Interface

The most useful architectural trick: Mistral's API is OpenAI-compatible. If you use the OpenAI SDK with a custom base_url, you can swap between GPT-4o and Mistral with a single configuration change and zero code modifications:

import os

from openai import OpenAI

# Point the OpenAI SDK at Mistral's API endpoint
client = OpenAI(
    api_key=os.environ["MISTRAL_API_KEY"],
    base_url="https://api.mistral.ai/v1"  # Mistral's OpenAI-compatible endpoint
)

# Standard OpenAI SDK call -- works identically!
response = client.chat.completions.create(
    model="mistral-large-latest",  # Just change the model string
    messages=[{"role": "user", "content": "Hello!"}]
)

This pattern is critical for building model-agnostic applications. Define your LLM_PROVIDER and LLM_MODEL as environment variables. Switching between OpenAI and Mistral (or Ollama local) for A/B testing or cost optimization becomes a zero-code deployment config change.
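
A minimal sketch of that pattern is below; the LLM_API_KEY, LLM_BASE_URL, and LLM_MODEL variable names are illustrative conventions, not a standard:

import os

from openai import OpenAI

# Provider selection lives entirely in deployment config:
#   LLM_BASE_URL = https://api.openai.com/v1   (OpenAI)
#                  https://api.mistral.ai/v1   (Mistral)
#                  http://localhost:11434/v1   (Ollama local)
client = OpenAI(
    api_key=os.environ["LLM_API_KEY"],
    base_url=os.environ.get("LLM_BASE_URL", "https://api.mistral.ai/v1"),
)

response = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "mistral-large-latest"),
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)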


Building a Streaming Next.js App with Mistral

Combining the Vercel AI SDK (see our Claude API guide) with Mistral's OpenAI-compatible endpoint is the fastest path to a production streaming chat application:

// app/api/chat/route.js
import { createOpenAI } from '@ai-sdk/openai';
import { streamText } from 'ai';

// Point Vercel AI SDK at Mistral's OpenAI-compatible endpoint
const mistral = createOpenAI({
  apiKey: process.env.MISTRAL_API_KEY,
  baseURL: 'https://api.mistral.ai/v1',
});

export const maxDuration = 60;

export async function POST(req) {
  const { messages } = await req.json();
  
  const result = await streamText({
    model: mistral('mistral-large-latest'),
    system: "You are a helpful AI assistant.",
    messages,
  });

  return result.toDataStreamResponse();
}
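
On the client side, the matching chat UI is a few lines with the AI SDK's React hook. A minimal sketch follows; it assumes the @ai-sdk/react package (older AI SDK versions export useChat from ai/react) and that the route above is served at /api/chat, the hook's default endpoint:

// app/page.js
'use client';

import { useChat } from '@ai-sdk/react';

export default function Chat() {
  // useChat posts to /api/chat, streams tokens back, and manages message state
  const { messages, input, handleInputChange, handleSubmit } = useChat();

  return (
    <div>
      {messages.map((m) => (
        <p key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </p>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} placeholder="Ask Mistral..." />
      </form>
    </div>
  );
}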

Conclusion

Building with Mistral is remarkably approachable once you understand the three deployment paths: Ollama for local dev, Mistral API for managed production, and the OpenAI-compatible interface for model-agnostic architectures. The combination of open weights, strong performance, and API compatibility makes Mistral the most practical choice for engineers who want real deployment flexibility without sacrificing quality.


Written by Vivek, AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.
