FinOps: Don't Bankrupt Your Startup
Dec 30, 2025 • 20 min read
The most common mistake AI startups make: they get OpenAI working in their prototype and assume the cost is negligible. Then they launch, get traction, and discover they're spending $40,000/month on GPT-4o for queries that a $0.10/1M token model handles equally well. LLM FinOps is the discipline of routing the right queries to the cheapest model that can answer them effectively — typically reducing costs by 60-80% without users noticing any quality difference.
1. Token Pricing Math
| Model | Input ($/1M tk) | Output ($/1M tk) | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, code generation, vision |
| GPT-4o mini | $0.15 | $0.60 | Simple Q&A, classification, summarization |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Long documents, coding, nuanced instructions |
| Claude 3 Haiku | $0.25 | $1.25 | Fast, cheap tasks — ideal default for many apps |
| Llama 3.1 70B (via Groq) | $0.59 | $0.79 | Open-source, no data sent to OpenAI/Anthropic |
| Llama 3.1 8B (via Groq) | $0.05 | $0.08 | Ultra-cheap, surprisingly capable for many tasks |
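The table translates into per-query cost with simple arithmetic: a weighted sum of input and output tokens. A quick comparison helper (prices hardcoded from the table above; the function name and dict keys are illustrative, not any provider's API):

```python
# Prices from the table above, in $ per 1M tokens: (input, output)
PRICING = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "llama-3.1-70b": (0.59, 0.79),
    "llama-3.1-8b": (0.05, 0.08),
}

def per_query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request for the given model."""
    input_price, output_price = PRICING[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# The same 5,250-in / 400-out RAG query across the tiers:
for model in PRICING:
    print(f"{model}: ${per_query_cost(model, 5250, 400):.4f}")
```

Running this makes the spread obvious: the same query costs two orders of magnitude more on the top tier than on the bottom one.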
```python
# Token cost calculator — RAG is expensive because of long context
query_tokens = 50           # User's question
context_tokens = 5000       # 10 retrieved chunks × 500 tokens each
system_prompt_tokens = 200  # Your instructions
total_input_tokens = query_tokens + context_tokens + system_prompt_tokens  # 5,250
output_tokens = 400         # LLM response

# Cost with GPT-4o:
cost_gpt4o = (5250 * 2.50 / 1_000_000) + (400 * 10.00 / 1_000_000)
# = $0.0131 + $0.0040 = $0.0171 per query
# At 100k queries/month = $1,713/month

# Same query with Claude 3 Haiku:
cost_haiku = (5250 * 0.25 / 1_000_000) + (400 * 1.25 / 1_000_000)
# = $0.0013 + $0.0005 = $0.0018 per query
# At 100k queries/month = $181/month

# Savings: ~89% cheaper, with nearly identical quality for most Q&A tasks
```
2. The Model Cascade (Intelligent Router)
```python
import re
from openai import OpenAI
from anthropic import Anthropic
from groq import Groq

openai_client = OpenAI()
anthropic_client = Anthropic()
groq_client = Groq()

COMPLEXITY_TRIGGERS = {
    "hard": [
        r"\bstep[- ]by[- ]step\b",
        r"\bcomplex\b|\banalyz|\barchitect",  # prefix match catches analyze/analysis
        r"\bwrite\b.*\bcode\b|\bdebug|\bimplement",
        r"\bcompare\b.*\bvs\b|\bpros\b.*\bcons\b",
    ],
    "coding": [r"\b(python|javascript|sql|typescript|bash)\b"],
}

def classify_query(query: str) -> str:
    query_lower = query.lower()
    # Very short queries → always the cheap model
    if len(query.split()) <= 8:
        return "cheap"
    for pattern in COMPLEXITY_TRIGGERS["coding"]:
        if re.search(pattern, query_lower):
            return "code"  # GPT-4o is notably better at code
    for pattern in COMPLEXITY_TRIGGERS["hard"]:
        if re.search(pattern, query_lower):
            return "hard"
    return "standard"  # Default to the mid-tier model
```
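Before wiring the classifier into a router, it is worth sanity-checking the payoff. With an assumed routing mix (these fractions are illustrative, not measured — measure your own traffic), the blended per-query cost for the same 5,250-in / 400-out query falls well below the all-GPT-4o baseline:

```python
# Per-query costs for a 5,250-token-in / 400-token-out request,
# derived from the pricing table earlier in this article
COST = {"cheap": 0.0003, "standard": 0.0018, "hard": 0.0171}
# Hypothetical routing mix: most traffic is simple
MIX = {"cheap": 0.4, "standard": 0.4, "hard": 0.2}

blended = sum(MIX[tier] * COST[tier] for tier in MIX)
print(f"Blended cost/query: ${blended:.4f}")
print(f"Saving vs all-GPT-4o: {1 - blended / 0.0171:.0%}")
```

Even with 20% of traffic still going to the expensive tier, the blended cost lands around a quarter of the all-GPT-4o figure.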
```python
async def smart_llm(prompt: str, system: str = "") -> str:
    complexity = classify_query(prompt)
    if complexity == "cheap":
        # Llama 3.1 8B via Groq — nearly free
        response = groq_client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    elif complexity == "standard":
        # Claude 3 Haiku — great value
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    else:
        # GPT-4o (or Claude 3.5 Sonnet) — reserved for hard/code tasks
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```
3. Semantic Caching (Cache Hits = Free)
```python
import redis
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

redis_client = redis.Redis(host="localhost", port=6379, db=0)

def get_cache_key(query_embedding: list[float], threshold: float = 0.95) -> str | None:
    """Find a semantically similar cached query."""
    # Scan cached query embeddings (fine for small caches; use a vector index at scale)
    cached_keys = redis_client.keys("embedding:*")
    for key in cached_keys[:100]:  # Cap the scan at 100 entries
        cached_embedding = np.frombuffer(redis_client.hget(key, "embedding"), dtype=np.float32)
        similarity = cosine_similarity([query_embedding], [cached_embedding])[0][0]
        if similarity > threshold:
            return key.decode().replace("embedding:", "response:")
    return None
```
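What a cache hit is worth, in numbers: each hit replaces a full LLM call with an embedding lookup that costs a fraction of a cent. A back-of-envelope model (the hit rate is an assumption; the per-query costs come from the earlier examples):

```python
# Illustrative economics of the semantic cache
llm_cost_per_query = 0.0018      # Claude 3 Haiku, from the earlier RAG example
embed_cost_per_query = 0.000001  # ~50 tokens at $0.02/1M (text-embedding-3-small)
hit_rate = 0.30                  # assumed: 30% of queries are near-duplicates

# Every query pays for an embedding; only misses pay for the LLM
effective_cost = embed_cost_per_query + (1 - hit_rate) * llm_cost_per_query
print(f"Effective cost/query: ${effective_cost:.6f}")
print(f"Saving from caching: {1 - effective_cost / llm_cost_per_query:.0%}")
```

The embedding overhead is negligible, so the saving tracks the hit rate almost one-to-one: a 30% hit rate cuts the bill by roughly 30%.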
```python
async def cached_llm_call(query: str) -> str:
    # 1. Embed the query (cheap — $0.02 per 1M tokens with text-embedding-3-small)
    embedding_response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    )
    query_embedding = embedding_response.data[0].embedding
    # 2. Check the semantic cache
    cache_key = get_cache_key(query_embedding)
    if cache_key:
        cached_response = redis_client.get(cache_key)
        if cached_response:
            return cached_response.decode()  # Cache hit — free!
    # 3. Cache miss — call the LLM
    response = await smart_llm(query)
    # 4. Store in the cache (TTL: 24 hours, on both embedding and response)
    embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()
    pipe = redis_client.pipeline()
    pipe.hset(f"embedding:{query}", mapping={"embedding": embedding_bytes})
    pipe.expire(f"embedding:{query}", 86400)
    pipe.setex(f"response:{query}", 86400, response)
    pipe.execute()
    return response
```
4. Context Compression for RAG
```python
# RAG context can run 5,000-15,000 tokens. Compress it first.
from llmlingua import PromptCompressor

# LLMLingua-2: open-source prompt compression from Microsoft Research
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

def compress_context(chunks: list[str], query: str, target_ratio: float = 0.4) -> str:
    """Compress RAG context to ~40% of its original length."""
    full_context = "\n\n".join(chunks)
    compressed = compressor.compress_prompt(
        context=full_context,
        instruction=query,
        # Word count is a rough stand-in for token count
        target_token=int(len(full_context.split()) * target_ratio),
    )
    return compressed["compressed_prompt"]

# Savings example:
# Original: 5,000 context tokens at GPT-4o input pricing = $0.0125
# Compressed to 40% of original: 2,000 tokens = $0.0050
# Savings: 60% on input tokens, with minimal quality degradation
```
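The Batch API call below consumes a JSONL file uploaded beforehand, where each line is one request object. A sketch of producing that file (the custom_id values and prompts are illustrative; the request shape follows OpenAI's documented batch format):

```python
import json

# One request per line, in the Batch API's JSONL format
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Summarize doc A", "Summarize doc B"])
]

jsonl = "\n".join(json.dumps(r) for r in requests)
# Upload this text with files.create(purpose="batch"),
# then pass the returned file id as input_file_id
```

The custom_id is what lets you match each response back to its request, since batch output ordering is not guaranteed.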
```python
# Also: use the OpenAI Batch API for non-real-time workloads (50% discount)
batch_response = openai_client.batches.create(
    input_file_id="file-xxx",  # JSONL file with all requests, uploaded via the Files API
    endpoint="/v1/chat/completions",
    completion_window="24h",  # 50% cheaper, completed within 24 hours
)
```
5. Cost Monitoring Dashboard
```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Track spend per user/feature for chargeback and optimization
@dataclass
class LLMUsage:
    model: str
    prompt_tokens: int
    completion_tokens: int
    user_id: str
    feature: str  # "rag_query", "chat", "summarize", etc.

    def cost_usd(self) -> float:
        # (input $/1M tokens, output $/1M tokens)
        PRICES = {
            "gpt-4o": (2.50, 10.00),
            "gpt-4o-mini": (0.15, 0.60),
            "claude-3-haiku-20240307": (0.25, 1.25),
        }
        input_price, output_price = PRICES.get(self.model, (1.0, 3.0))
        return (self.prompt_tokens * input_price + self.completion_tokens * output_price) / 1_000_000

# Log every request to your analytics DB
async def log_usage(usage: LLMUsage, db):
    await db.llm_costs.insert({
        "model": usage.model,
        "cost_usd": usage.cost_usd(),
        "user_id": usage.user_id,
        "feature": usage.feature,
        "timestamp": datetime.now(timezone.utc),
    })
```
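The weekly cost breakdown reduces to a group-by over the logged rows. A pure-Python sketch (the row shape matches log_usage above; the numbers are made up):

```python
from collections import defaultdict

# Rows as written by log_usage (values illustrative)
rows = [
    {"feature": "rag_query", "user_id": "u1", "cost_usd": 0.0171},
    {"feature": "rag_query", "user_id": "u2", "cost_usd": 0.0018},
    {"feature": "chat",      "user_id": "u1", "cost_usd": 0.0018},
]

# Total spend per feature, highest first
by_feature = defaultdict(float)
for row in rows:
    by_feature[row["feature"]] += row["cost_usd"]

for feature, total in sorted(by_feature.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: ${total:.4f}")
```

In production this is a one-line GROUP BY in your analytics database; the point is that per-feature and per-user attribution is trivial once every request is logged.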
Query this table weekly to see which users and features cost the most.
Frequently Asked Questions
How do I know when routing to a cheaper model hurts quality?
Run your test suite against both models: take 200 representative queries, get answers from both GPT-4o and Claude 3 Haiku, then use an LLM-as-Judge evaluation to score quality. For most simple Q&A and summarization tasks, Haiku scores within 5-8% of GPT-4o. Route complex analytical tasks to the better model and simple tasks to the cheaper one.
What's the most impactful first thing to optimize?
RAG context compression delivers the highest ROI with the least engineering effort. Reducing retrieved chunks from 10 to 5, or using LLMLingua to compress context to 40% of original size, cuts input token costs by 40-60% immediately. Do this before building a complex routing system.
Conclusion
LLM FinOps is not about sacrificing quality — it's about matching query complexity to model capability. Simple questions don't need GPT-4o. Repeated queries don't need any model (cache them). The semantic caching + model cascade pattern typically reduces costs by 70-85% at scale, turning a $40k/month LLM bill into $6k without users noticing any degradation.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.