
DSPy: Stop Prompting, Start Programming

Jan 1, 2026 • 20 min read

Hand-crafted prompt engineering has a fundamental flaw: the prompts you write for GPT-4 break on Claude, the carefully tuned few-shot examples for your RAG pipeline degrade when you add a reranking step, and every model upgrade is a painful manual re-tuning exercise. DSPy (Declarative Self-improving Python) from Stanford treats prompts as machine code — something the compiler generates optimally, not something humans write by hand. You define the pipeline's structure and metrics; DSPy's optimizers automatically discover the best prompts and few-shot examples through systematic evaluation.

1. The Problem: Prompt Brittleness

Complex AI pipelines chain multiple LLM calls together. Each link in the chain has a hand-written prompt, and a change to any one component can ripple through the whole chain:

  • Model switching: GPT-4 prompts rarely transfer to Llama-3 without degradation — different training data, RLHF, tokenization
  • Sensitivity to phrasing: "You are an expert assistant" vs "You are a knowledgeable helper" can cause 5-15% accuracy differences on structured tasks
  • Few-shot interference: Adding an example that helps on math questions might hurt history questions in the same pipeline
  • Stage interdependence: Optimizing the retrieval step may require re-optimizing the synthesis step — cascading manual work

2. DSPy Core Concepts

| Concept | Analogy | What It Does |
|---|---|---|
| Signature | Function interface | Declares input/output fields with types and descriptions — not HOW to do it |
| Module | Neural network layer | A callable unit (ChainOfThought, ReAct, Predict) that wraps a Signature |
| Program | Neural network | Composed modules forming a complete pipeline (RAG, agent, classifier) |
| Optimizer | Gradient descent | Automatically finds best prompts/examples to maximize a metric on training data |
| Metric | Loss function | Evaluates program output quality — drives the compilation process |
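To make the Signature/Module split concrete before the real API appears below, here is a rough pure-Python analogy (this is not DSPy's actual implementation; the class names are mine): a signature only declares fields and an instruction, while a module is the callable that fills them in.

```python
from dataclasses import dataclass

# Analogy only: a "signature" declares WHAT goes in and out,
# without saying HOW the transformation happens.
@dataclass
class QASignature:
    instruction: str = "Answer the question concisely."
    input_fields: tuple = ("question",)
    output_fields: tuple = ("answer",)

# Analogy only: a "module" is a callable that implements a signature.
# In DSPy the implementation is an LM call; here it is a stub.
class AnswerModule:
    def __init__(self, signature):
        self.signature = signature

    def __call__(self, **inputs):
        assert set(inputs) == set(self.signature.input_fields)
        # A real module would prompt an LM here; we return a placeholder.
        return {field: "<llm output>" for field in self.signature.output_fields}

qa = AnswerModule(QASignature())
result = qa(question="What is the capital of France?")
print(result)  # {'answer': '<llm output>'}
```

The point of the separation: an optimizer can rewrite how a module prompts the model without touching the declared interface.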

3. Building and Compiling a RAG Pipeline

pip install dspy-ai

import dspy

# 1. Configure the LM and Retriever
lm = dspy.LM(model="openai/gpt-4o-mini", max_tokens=1000)
rm = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")
dspy.configure(lm=lm, rm=rm)

# 2. Define Signatures (WHAT, not HOW)
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query for answering a factoid question."""
    context: list[str] = dspy.InputField(desc="Context from previous search results")
    question: str = dspy.InputField()
    query: str = dspy.OutputField()

class GenerateAnswer(dspy.Signature):
    """Answer the question using the provided context. Be concise and factual."""
    context: list[str] = dspy.InputField(desc="May contain relevant facts from Wikipedia")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="A short answer (1-5 words) based on the context")

# 3. Build the Module (multi-hop RAG with improved retrieval)
class MultiHopRAG(dspy.Module):
    def __init__(self, num_passages=3, num_hops=2):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        # ChainOfThought: automatically generates reasoning before the output field
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(num_hops)]
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.num_hops = num_hops

    def forward(self, question):
        context = []
        
        # Multi-hop: iteratively refine the search query based on what we've found
        for hop in range(self.num_hops):
            query = self.generate_query[hop](context=context, question=question).query
            new_passages = self.retrieve(query).passages
            context = list(dict.fromkeys(context + new_passages))  # Deduplicate
        
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

rag = MultiHopRAG()

# 4. Define Your Training Data (just 10-20 examples needed!)
trainset = [
    dspy.Example(
        question="What is the capital of France?",
        answer="Paris"
    ).with_inputs("question"),
    dspy.Example(
        question="Who invented the telephone?",
        answer="Alexander Graham Bell"
    ).with_inputs("question"),
    # ... add 10-50 examples
]

# 5. Define a Metric
def answer_exact_match(example, prediction, trace=None):
    """Returns True if the predicted answer exactly matches the gold answer (after normalization)."""
    return dspy.evaluate.answer_exact_match(example, prediction)

# 6. Compile! (The "Magic")
# BootstrapFewShot: runs RAG on trainset, finds successful executions,
# uses those as few-shot demonstrations for future calls
from dspy.teleprompt import BootstrapFewShot

teleprompter = BootstrapFewShot(
    metric=answer_exact_match,
    max_bootstrapped_demos=4,   # Max demonstrations to inject per step
    max_labeled_demos=4,
    max_rounds=1,               # How many optimization iterations
)

# This takes 5-30 min depending on trainset size and model speed
compiled_rag = teleprompter.compile(rag, trainset=trainset)

# 7. Use the compiled (optimized) pipeline
result = compiled_rag(question="What year was the Eiffel Tower completed?")
print(result.answer)  # "1889"
print(result.context)  # Retrieved passages that informed the answer
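One detail worth noting in the forward pass above: `list(dict.fromkeys(...))` is the standard order-preserving deduplication idiom, since dict keys are unique and preserve insertion order in Python 3.7+. In isolation:

```python
# Order-preserving deduplication: dict keys are unique and,
# since Python 3.7, keep insertion order.
old_context = ["passage A", "passage B"]
new_passages = ["passage B", "passage C"]

merged = list(dict.fromkeys(old_context + new_passages))
print(merged)  # ['passage A', 'passage B', 'passage C']
```

A plain `set` would also deduplicate, but it would scramble passage order, which matters when earlier hops retrieved the most relevant context.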

4. Advanced Optimizer: MIPROv2 for Instruction Optimization

from dspy.teleprompt import MIPROv2

# MIPROv2: a more advanced optimizer that also optimizes the INSTRUCTIONS
# (not just the few-shot examples, as BootstrapFewShot does).
# It uses a proposer LLM to generate candidate instructions and
# Bayesian optimization to efficiently search the candidate space.

optimizer = MIPROv2(
    metric=answer_exact_match,
    auto="medium",  # "light" (fastest) / "medium" / "heavy" (most thorough)
    # num_candidates: how many instruction candidates to propose
    # num_trials: number of Bayesian optimization trials
)

# Compile with a separate development set for validation,
# built the same way as trainset
devset = [
    dspy.Example(
        question="What is the largest planet in the solar system?",
        answer="Jupiter"
    ).with_inputs("question"),
    # ... 20-50 held-out examples
]

compiled_rag_v2 = optimizer.compile(
    rag,
    trainset=trainset,
    valset=devset,
    requires_permission_to_run=False,
)

# Compare performance
from dspy.evaluate import Evaluate

evaluate = Evaluate(
    devset=devset,
    metric=answer_exact_match,
    num_threads=4,            # Parallel evaluation
    display_progress=True,
)

baseline_score = evaluate(rag)     # Uncompiled: e.g., 62%
compiled_score = evaluate(compiled_rag_v2)  # Compiled: e.g., 78%
print(f"Improvement: +{compiled_score - baseline_score:.1f} points")
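Exact match can be unforgiving for free-text answers ("The Eiffel Tower" vs "eiffel tower"). A common alternative is a normalized-match metric; the sketch below is plain Python with names of my own choosing, following SQuAD-style normalization (lowercase, strip punctuation and articles):

```python
import re
import string
from types import SimpleNamespace

def normalize_text(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def normalized_match(example, prediction, trace=None):
    """Drop-in shape for a DSPy metric: compares normalized answer
    strings instead of raw strings."""
    return normalize_text(example.answer) == normalize_text(prediction.answer)

# Quick check with stand-in objects
# (a real run receives dspy.Example / dspy.Prediction)
gold = SimpleNamespace(answer="The Eiffel Tower")
pred = SimpleNamespace(answer="eiffel tower!")
print(normalized_match(gold, pred))  # True
```

Because the metric drives compilation, loosening it this way changes which bootstrapped demonstrations count as "successful", so it is worth eyeballing a few borderline cases before optimizing against it.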

Frequently Asked Questions

How many training examples does DSPy need to be effective?

BootstrapFewShot can work with as few as 10-20 labeled examples (question + answer pairs). It doesn't need thousands of examples like traditional ML fine-tuning — it's searching for good demonstration examples to include in prompts, not updating model weights. MIPRO typically benefits from 50-100 training examples for reliable instruction optimization. The key is that examples must be representative of your actual use case and have verifiable ground truth answers. Production teams often start with 20 hand-labeled examples, discover performance improvements, then scale to 100+.
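Once you scale past the first ~20 examples, hold a portion out for validation rather than optimizing and evaluating on the same data. A minimal split sketch in plain Python (the 80/20 ratio and the seed are arbitrary choices, not DSPy requirements):

```python
import random

# Raw labeled data: question/answer pairs with verified ground truth.
raw = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(100)]

rng = random.Random(42)   # fixed seed => reproducible split
rng.shuffle(raw)

split = int(0.8 * len(raw))
train_rows, dev_rows = raw[:split], raw[split:]

# In a real pipeline, each row then becomes
#   dspy.Example(**row).with_inputs("question")
print(len(train_rows), len(dev_rows))  # 80 20
```

The fixed seed matters: if the split shifts between compilation runs, score differences may reflect the data shuffle rather than the optimizer.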

When should I use DSPy vs. just prompt engineering?

Use DSPy when you have: a multi-step pipeline where optimizing one step affects others, a need to support multiple LLM backends (switch between providers without re-prompting), a measurable quality metric you can express programmatically, and a dataset of at least 10-20 examples. Stick with hand-crafted prompts when: you're building a simple single-step LLM call, you don't have ground truth examples to optimize against, or you need maximum control over the exact prompt text for compliance/audit reasons. DSPy's compilation overhead (time + API cost) is justified for complex pipelines where systematic optimization would otherwise require days of manual prompt tuning.

Conclusion

DSPy represents a fundamental shift in how AI engineers think about LLM-powered applications. By separating the pipeline structure (Python) from the prompt content (compiled artifacts), DSPy makes AI pipelines as modular and maintainable as traditional software. Change the LLM, recompile, and get automatically optimized prompts for the new model. Add a pipeline step, recompile the whole program, and get prompts that are globally optimized across all steps. For complex, metric-driven AI applications that need to run reliably across different models and over time, DSPy is a significant improvement over manual prompt engineering.

Written by Vivek, AI Engineer

Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.

Tags: GPT-4o · LangChain · Next.js · Vector DBs · RAG · Vercel AI SDK