LLM builders still spend alarming amounts of time hand-crafting prompts. It's fragile, unscalable, and—once you need chains, memory, or error recovery—downright painful. DSPy flips that workflow on its head: write declarative signatures, then let an optimizer learn the best prompts and examples. Think ORM versus raw SQL, but for language models.
Here's the same QA task in both paradigms, so you can see how each approach handles different contexts and questions.
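As a rough illustration (the prompt template, the `manual_qa` helper, and the example strings are mine, not DSPy's), the hand-crafted version owns every word of the prompt, while the DSPy version only declares the input and output fields and leaves the wording to the framework:

```python
import dspy

# Paradigm 1: hand-crafted prompt (you own, and maintain, every word)
PROMPT = """Answer the question using only the context.
Context: {context}
Question: {question}
Answer:"""

def manual_qa(lm, context, question):
    # lm is any callable that maps a prompt string to a completion
    return lm(PROMPT.format(context=context, question=question))

# Paradigm 2: DSPy signature (declare inputs/outputs; the prompt becomes learnable)
# Assumes an LM has been configured, e.g. dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
qa = dspy.Predict("context, question -> answer")
pred = qa(context="DSPy compiles declarative pipelines into prompts.",
          question="What does DSPy compile?")
print(pred.answer)
```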
A signature is metadata. Because DSPy can see the structure of a task's inputs and outputs, it can assemble few-shot demonstrations, tune instructions, and check candidate prompts against your metric automatically. That is exactly what its optimizers, such as BootstrapFewShot, MIPRO, and BootstrapFewShotWithRandomSearch, do.
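To make that concrete, here is a minimal class-based signature. The field names and descriptions are illustrative, but `dspy.Signature`, `dspy.InputField`, and `dspy.OutputField` are the actual building blocks:

```python
import dspy

class ContextQA(dspy.Signature):
    """Answer the question using the provided context."""
    context = dspy.InputField(desc="passages that may contain the answer")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short, factual answer")

# Because the fields are declared rather than baked into a prompt string,
# an optimizer can inspect them and attach instructions or demonstrations.
generate_answer = dspy.ChainOfThought(ContextQA)
```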
Traditional prompt engineering doesn't scale. You manually tune prompts for one model, then start over when switching to another. DSPy's optimizers solve this by treating prompts as learnable parameters:
import dspy

# Define your pipeline once (assumes dspy.configure(lm=..., rm=...) has been called)
class MultiHopQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # One hop: form a search query, retrieve passages, then answer from them
        query = self.generate_query(context=[], question=question).search_query
        context = self.retrieve(query).passages
        return self.generate_answer(context=context, question=question)

# Optimize for any model or dataset (exact_match: any metric function, e.g. dspy.evaluate.answer_exact_match)
optimizer = dspy.BootstrapFewShot(metric=exact_match)
compiled_qa = optimizer.compile(MultiHopQA(), trainset=train_examples)
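A compiled program is then called like any other DSPy module; the question below is just an illustration:

```python
result = compiled_qa(question="Which city hosted the 1900 Summer Olympics?")
print(result.answer)
```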
This means you can define your pipeline once and then optimize it for different models, datasets, or performance metrics without rewriting prompts manually.
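For instance, here is a rough sketch of recompiling the same pipeline against two different models, reusing the optimizer and training set from above; the model identifiers are illustrative and assume API access through DSPy's LM interface:

```python
import dspy

# Compile against one model...
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
qa_small = optimizer.compile(MultiHopQA(), trainset=train_examples)

# ...then recompile for another model without touching the pipeline code.
dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet-20241022"))
qa_large = optimizer.compile(MultiHopQA(), trainset=train_examples)
```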
I've been experimenting with DSPy for document analysis tasks, and the results have been encouraging. The ability to chain multiple reasoning steps while maintaining optimization across the entire pipeline has proven particularly valuable. Instead of manually tuning each step in isolation, DSPy allows optimization of the entire flow end-to-end.
One area where DSPy shines is in complex reasoning tasks that require multiple steps. Traditional approaches often struggle with maintaining consistency across a pipeline when individual prompts are optimized in isolation. DSPy's holistic optimization approach addresses this by considering the entire workflow.
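In practice, that means scoring the whole pipeline with a single metric rather than eyeballing each step. Here is a minimal sketch using DSPy's built-in evaluator, assuming a held-out `dev_examples` set and the same `exact_match` metric used at compile time:

```python
from dspy.evaluate import Evaluate

# Score the compiled pipeline end-to-end on a held-out dev set
evaluate = Evaluate(devset=dev_examples, metric=exact_match,
                    num_threads=4, display_progress=True)
score = evaluate(compiled_qa)  # average metric across dev_examples
```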
"If the task fits in one prompt, keep it in one prompt."
For simple, one-off queries, a well-crafted prompt is often faster to implement and debug. DSPy's value emerges when you need multi-step chains, consistency across a pipeline, portability across models, or optimization against a measurable end-to-end metric.
That said, the learning curve is steeper than traditional prompting, and the optimization process can be computationally expensive for complex pipelines and large datasets.
As foundation models become more capable and ubiquitous, the tooling around them needs to evolve beyond simple prompting. DSPy represents an important step in this direction, offering a more systematic approach to building with language models.
Just as we moved from writing raw SQL to using ORMs in traditional software development, DSPy suggests a similar evolution in AI application development. As LLMs inch toward commodity status, leverage will come from tooling that reasons about prompts instead of forcing humans to craft them manually.
While it's still early days, DSPy points toward a future where building with foundation models is less about crafting the perfect prompt and more about defining clear objectives and letting the system optimize the details.
I'd be curious to hear about your experiences with other systematic approaches to foundation model programming.