FinOps: Don't Bankrupt Your Startup
Dec 30, 2025 • 20 min read
The most common mistake AI startups make: they get OpenAI working in their prototype and assume the cost is negligible. Then they launch, get traction, and discover they're spending $40,000/month on GPT-4o for queries that a $0.10/1M token model handles equally well. LLM FinOps is the discipline of routing the right queries to the cheapest model that can answer them effectively — typically reducing costs by 60-80% without users noticing any quality difference.
1. Token Pricing Math
| Model | Input ($/1M tk) | Output ($/1M tk) | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, code generation, vision |
| GPT-4o mini | $0.15 | $0.60 | Simple Q&A, classification, summarization |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Long documents, coding, nuanced instructions |
| Claude 3 Haiku | $0.25 | $1.25 | Fast, cheap tasks — ideal default for many apps |
| Llama 3.1 70B (via Groq) | $0.59 | $0.79 | Open-source, no data sent to OpenAI/Anthropic |
| Llama 3.1 8B (via Groq) | $0.05 | $0.08 | Ultra-cheap, surprisingly capable for many tasks |
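The table translates into per-query cost with simple arithmetic: a weighted sum of input and output tokens. A quick comparison helper (prices hardcoded from the table above; the function name and dict keys are illustrative, not any provider's API):

```python
# Prices from the table above, in $ per 1M tokens: (input, output)
PRICING = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "llama-3.1-70b": (0.59, 0.79),
    "llama-3.1-8b": (0.05, 0.08),
}

def per_query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request for the given model."""
    input_price, output_price = PRICING[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# The same 5,250-in / 400-out RAG query across the tiers:
for model in PRICING:
    print(f"{model}: ${per_query_cost(model, 5250, 400):.4f}")
```

Running this makes the spread obvious: the same query costs two orders of magnitude more on the top tier than on the bottom one.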
```python
# Token cost calculator — RAG is expensive because of long context
query_tokens = 50           # User's question
context_tokens = 5000       # 10 retrieved chunks × 500 tokens each
system_prompt_tokens = 200  # Your instructions
total_input_tokens = query_tokens + context_tokens + system_prompt_tokens  # 5,250
output_tokens = 400         # LLM response

# Cost with GPT-4o:
cost_gpt4o = (5250 * 2.50 / 1_000_000) + (400 * 10.00 / 1_000_000)
# = $0.0131 + $0.0040 = $0.0171 per query
# At 100k queries/month = $1,713/month

# Same query with Claude 3 Haiku:
cost_haiku = (5250 * 0.25 / 1_000_000) + (400 * 1.25 / 1_000_000)
# = $0.0013 + $0.0005 = $0.0018 per query
# At 100k queries/month = $181/month

# Savings: ~89% cheaper, with nearly identical quality for most Q&A tasks
```
2. The Model Cascade (Intelligent Router)
```python
import re
from openai import OpenAI
from anthropic import Anthropic
from groq import Groq

openai_client = OpenAI()
anthropic_client = Anthropic()
groq_client = Groq()

COMPLEXITY_TRIGGERS = {
    "hard": [
        r"\bstep[- ]by[- ]step\b",
        r"\bcomplex\b|\banalyz|\barchitect",  # prefix match catches analyze/analysis
        r"\bwrite\b.*\bcode\b|\bdebug|\bimplement",
        r"\bcompare\b.*\bvs\b|\bpros\b.*\bcons\b",
    ],
    "coding": [r"\b(python|javascript|sql|typescript|bash)\b"],
}

def classify_query(query: str) -> str:
    query_lower = query.lower()
    # Very short queries → always the cheap model
    if len(query.split()) <= 8:
        return "cheap"
    for pattern in COMPLEXITY_TRIGGERS["coding"]:
        if re.search(pattern, query_lower):
            return "code"  # GPT-4o is notably better at code
    for pattern in COMPLEXITY_TRIGGERS["hard"]:
        if re.search(pattern, query_lower):
            return "hard"
    return "standard"  # Default to the mid-tier model
```
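Before wiring the classifier into a router, it is worth sanity-checking the payoff. With an assumed routing mix (these fractions are illustrative, not measured — measure your own traffic), the blended per-query cost for the same 5,250-in / 400-out query falls well below the all-GPT-4o baseline:

```python
# Per-query costs for a 5,250-token-in / 400-token-out request,
# derived from the pricing table earlier in this article
COST = {"cheap": 0.0003, "standard": 0.0018, "hard": 0.0171}
# Hypothetical routing mix: most traffic is simple
MIX = {"cheap": 0.4, "standard": 0.4, "hard": 0.2}

blended = sum(MIX[tier] * COST[tier] for tier in MIX)
print(f"Blended cost/query: ${blended:.4f}")
print(f"Saving vs all-GPT-4o: {1 - blended / 0.0171:.0%}")
```

Even with 20% of traffic still going to the expensive tier, the blended cost lands around a quarter of the all-GPT-4o figure.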
```python
async def smart_llm(prompt: str, system: str = "") -> str:
    complexity = classify_query(prompt)
    if complexity == "cheap":
        # Llama 3.1 8B via Groq — nearly free
        response = groq_client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    elif complexity == "standard":
        # Claude 3 Haiku — great value
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    else:
        # GPT-4o (or Claude 3.5 Sonnet) — reserved for hard/code tasks
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```
3. Semantic Caching (Cache Hits = Free)
```python
import redis
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

redis_client = redis.Redis(host="localhost", port=6379, db=0)

def get_cache_key(query_embedding: list[float], threshold: float = 0.95) -> str | None:
    """Find a semantically similar cached query."""
    # Scan cached query embeddings (fine for small caches; use a vector index at scale)
    cached_keys = redis_client.keys("embedding:*")
    for key in cached_keys[:100]:  # Cap the scan at 100 entries
        cached_embedding = np.frombuffer(redis_client.hget(key, "embedding"), dtype=np.float32)
        similarity = cosine_similarity([query_embedding], [cached_embedding])[0][0]
        if similarity > threshold:
            return key.decode().replace("embedding:", "response:")
    return None
```
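What a cache hit is worth, in numbers: each hit replaces a full LLM call with an embedding lookup that costs a fraction of a cent. A back-of-envelope model (the hit rate is an assumption; the per-query costs come from the earlier examples):

```python
# Illustrative economics of the semantic cache
llm_cost_per_query = 0.0018      # Claude 3 Haiku, from the earlier RAG example
embed_cost_per_query = 0.000001  # ~50 tokens at $0.02/1M (text-embedding-3-small)
hit_rate = 0.30                  # assumed: 30% of queries are near-duplicates

# Every query pays for an embedding; only misses pay for the LLM
effective_cost = embed_cost_per_query + (1 - hit_rate) * llm_cost_per_query
print(f"Effective cost/query: ${effective_cost:.6f}")
print(f"Saving from caching: {1 - effective_cost / llm_cost_per_query:.0%}")
```

The embedding overhead is negligible, so the saving tracks the hit rate almost one-to-one: a 30% hit rate cuts the bill by roughly 30%.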
```python
async def cached_llm_call(query: str) -> str:
    # 1. Embed the query (cheap — $0.02 per 1M tokens with text-embedding-3-small)
    embedding_response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    )
    query_embedding = embedding_response.data[0].embedding
    # 2. Check the semantic cache
    cache_key = get_cache_key(query_embedding)
    if cache_key:
        cached_response = redis_client.get(cache_key)
        if cached_response:
            return cached_response.decode()  # Cache hit — free!
    # 3. Cache miss — call the LLM
    response = await smart_llm(query)
    # 4. Store in the cache (TTL: 24 hours, on both embedding and response)
    embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()
    pipe = redis_client.pipeline()
    pipe.hset(f"embedding:{query}", mapping={"embedding": embedding_bytes})
    pipe.expire(f"embedding:{query}", 86400)
    pipe.setex(f"response:{query}", 86400, response)
    pipe.execute()
    return response
```
4. Context Compression for RAG
```python
# RAG context can run 5,000-15,000 tokens. Compress it first.
from llmlingua import PromptCompressor

# LLMLingua-2: open-source prompt compression from Microsoft Research
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

def compress_context(chunks: list[str], query: str, target_ratio: float = 0.4) -> str:
    """Compress RAG context to ~40% of its original length."""
    full_context = "\n\n".join(chunks)
    compressed = compressor.compress_prompt(
        context=full_context,
        instruction=query,
        # Word count is a rough stand-in for token count
        target_token=int(len(full_context.split()) * target_ratio),
    )
    return compressed["compressed_prompt"]

# Savings example:
# Original: 5,000 context tokens at GPT-4o input pricing = $0.0125
# Compressed to 40% of original: 2,000 tokens = $0.0050
# Savings: 60% on input tokens, with minimal quality degradation
```
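The Batch API call below consumes a JSONL file uploaded beforehand, where each line is one request object. A sketch of producing that file (the custom_id values and prompts are illustrative; the request shape follows OpenAI's documented batch format):

```python
import json

# One request per line, in the Batch API's JSONL format
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    for i, prompt in enumerate(["Summarize doc A", "Summarize doc B"])
]

jsonl = "\n".join(json.dumps(r) for r in requests)
# Upload this text with files.create(purpose="batch"),
# then pass the returned file id as input_file_id
```

The custom_id is what lets you match each response back to its request, since batch output ordering is not guaranteed.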
```python
# Also: use the OpenAI Batch API for non-real-time workloads (50% discount)
batch_response = openai_client.batches.create(
    input_file_id="file-xxx",  # JSONL file with all requests, uploaded via the Files API
    endpoint="/v1/chat/completions",
    completion_window="24h",  # 50% cheaper, completed within 24 hours
)
```
5. Cost Monitoring Dashboard
```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Track spend per user/feature for chargeback and optimization
@dataclass
class LLMUsage:
    model: str
    prompt_tokens: int
    completion_tokens: int
    user_id: str
    feature: str  # "rag_query", "chat", "summarize", etc.

    def cost_usd(self) -> float:
        # (input $/1M tokens, output $/1M tokens)
        PRICES = {
            "gpt-4o": (2.50, 10.00),
            "gpt-4o-mini": (0.15, 0.60),
            "claude-3-haiku-20240307": (0.25, 1.25),
        }
        input_price, output_price = PRICES.get(self.model, (1.0, 3.0))
        return (self.prompt_tokens * input_price + self.completion_tokens * output_price) / 1_000_000

# Log every request to your analytics DB
async def log_usage(usage: LLMUsage, db):
    await db.llm_costs.insert({
        "model": usage.model,
        "cost_usd": usage.cost_usd(),
        "user_id": usage.user_id,
        "feature": usage.feature,
        "timestamp": datetime.now(timezone.utc),
    })
```
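The weekly cost breakdown reduces to a group-by over the logged rows. A pure-Python sketch (the row shape matches log_usage above; the numbers are made up):

```python
from collections import defaultdict

# Rows as written by log_usage (values illustrative)
rows = [
    {"feature": "rag_query", "user_id": "u1", "cost_usd": 0.0171},
    {"feature": "rag_query", "user_id": "u2", "cost_usd": 0.0018},
    {"feature": "chat",      "user_id": "u1", "cost_usd": 0.0018},
]

# Total spend per feature, highest first
by_feature = defaultdict(float)
for row in rows:
    by_feature[row["feature"]] += row["cost_usd"]

for feature, total in sorted(by_feature.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: ${total:.4f}")
```

In production this is a one-line GROUP BY in your analytics database; the point is that per-feature and per-user attribution is trivial once every request is logged.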
Query this table weekly to see which users and features cost the most.
Frequently Asked Questions
How do I know when routing to a cheaper model hurts quality?
Run your test suite against both models: take 200 representative queries, get answers from both GPT-4o and Claude 3 Haiku, then use an LLM-as-Judge evaluation to score quality. For most simple Q&A and summarization tasks, Haiku scores within 5-8% of GPT-4o. Route complex analytical tasks to the better model and simple tasks to the cheaper one.
What's the most impactful first thing to optimize?
RAG context compression delivers the highest ROI with the least engineering effort. Reducing retrieved chunks from 10 to 5, or using LLMLingua to compress context to 40% of original size, cuts input token costs by 40-60% immediately. Do this before building a complex routing system.
Conclusion
LLM FinOps is not about sacrificing quality — it's about matching query complexity to model capability. Simple questions don't need GPT-4o. Repeated queries don't need any model (cache them). The semantic caching + model cascade pattern typically reduces costs by 70-85% at scale, turning a $40k/month LLM bill into $6k without users noticing any degradation.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.