Language-model pipelines began life as collections of personal tricks. A team shipped a feature only after someone produced the right paragraph of instructions, and that paragraph lived in a notebook or a mind. The knowledge was fragile, undocumented, and always under revision. The same model update that improved one part of the product dissolved carefully tuned behavior somewhere else. Nobody could say, in plain words, what the system actually believed about its task; the intent was smeared across lines of improvised text. You could change a single noun, merge two product features, or move to a different model provider and watch months of tacit knowledge evaporate overnight.
I remember the scramble when OpenAI announced in 2023 that the old text-davinci-003 completion model would be retired and teams needed to migrate to gpt-3.5-turbo-instruct. We had months of notice, yet the next few weeks were consumed by rewriting prompts and hoping our carefully tuned flows would survive the switch. The experience made it obvious how little of our understanding was written down anywhere.
When Prompts Break
DSPy steps in right at that fault line. The real problem is that we never keep a solid record of what each part of the pipeline is meant to do, so we keep rewriting prompts to fill the gap. DSPy's answer is to put the expectation in code first, then let the tooling gather the wording that proves the system meets it. Structure and phrasing are split on purpose.
That stance sounds academic until you try to audit a production assistant without it. Someone asks why a refund workflow suddenly refuses valid claims. The usual answer is a shrug followed by a treasure hunt through version histories of prompts. With DSPy the search begins in the program’s declaration: here is the rule we thought we were enforcing, here is the supporting text the compiler trusted last week. The investigation gains a fixed anchor instead of floating through folklore.
Write the Contract, Teach the Words
Once you split expectation and phrasing, the rest is straightforward. The program you write defines the inputs, the outputs, and how the modules hand data to one another. That file is the record of the promise. The compiler then hunts for prompts, examples, or weights that make the promise pass whatever metric you choose. Optimizers like BootstrapRS, MIPROv2, GEPA, and BootstrapFinetune just run controlled experiments: try a candidate prompt, score it, keep what works, drop the rest. It's the same iteration we used to grind through by hand, only sped up and made repeatable.
The signature becomes the plain statement: given a claim and a policy snippet, return an answer that cites the right clause. The pieces the optimizer keeps—tightened instructions, synthetic examples, maybe a small finetune—are evidence that the statement still holds for the data we care about. If results drift, you leave the statement alone and go gather better evidence until the metric turns green again.
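To make that concrete, here is a minimal sketch of how such a statement might look as a DSPy signature; the field names and the choice of module are illustrative, not taken from any particular project:

```python
import dspy

class ClaimAnswer(dspy.Signature):
    """Given a claim and a policy snippet, answer and cite the clause that applies."""

    claim = dspy.InputField(desc="the customer's claim")
    policy_snippet = dspy.InputField(desc="the relevant excerpt from the policy")
    answer = dspy.OutputField(desc="decision that cites the supporting clause")

# The module is the program: the signature states the promise; the prompt
# text that satisfies it is left for the compiler to find.
answer_claim = dspy.ChainOfThought(ClaimAnswer)
```

Calling `answer_claim(claim=..., policy_snippet=...)` returns a prediction whose `answer` field gets checked against the metric; the exact wording that makes it pass is the compiler's concern, not the file's.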
Because the metric is explicit, every optimization run leaves a paper trail. Today’s prompt ships with its score and the seed examples that backed it. Whoever inherits the pipeline can see exactly what "good" meant when it went out the door. In a world where so much knowledge is oral, having that list in one place is surprisingly useful.
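A sketch of what that trail can look like in code, assuming a placeholder metric, a tiny labeled `trainset` (a real run needs far more examples), and an example model configured through LiteLLM:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # example model, not a recommendation

# Placeholder metric: did the prediction cite the clause we labeled?
def exact_clause_match(example, prediction, trace=None):
    return example.clause_id in prediction.answer

# Compact stand-in for the signature sketched above.
program = dspy.ChainOfThought("claim, policy_snippet -> answer")

trainset = [
    dspy.Example(
        claim="The parcel arrived damaged.",
        policy_snippet="Clause 4.2: damaged goods are refundable within 30 days.",
        clause_id="4.2",
        answer="Refund approved under clause 4.2.",
    ).with_inputs("claim", "policy_snippet"),
    # ... more labeled examples
]

optimizer = dspy.MIPROv2(metric=exact_clause_match, auto="light")
compiled = optimizer.compile(program, trainset=trainset)

# The compiled program carries the instructions and demos the optimizer kept;
# save it next to its score so the next owner sees what "good" meant.
compiled.save("claims_pipeline_v1.json")
```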
Swapping Models Without Panic
Model choice would be a constant source of drift if DSPy allowed it. Instead the framework sits on LiteLLM, so OpenAI, Anthropic, Databricks, Google, local LLaMA builds, and custom deployments all speak through the same interface. The promise in code stays put while the backend model can change. If a different engine can provide better evidence, the compiler re-runs and adapts without a rewrite.
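Concretely, the switch is a one-line configuration change rather than a rewrite; the model identifiers below are examples only:

```python
import dspy

# Same program either way; only the backend identifier changes.
program = dspy.ChainOfThought("claim, policy_snippet -> answer")

# Today: an OpenAI model, addressed through LiteLLM's provider/model strings.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# After a provider switch: point at another backend, re-run the optimizer,
# and compare metrics instead of rewriting the program.
dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet-20240620"))
```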
That neutrality is less about vendor games than it is about keeping context. When procurement forces a switch, the core program does not flinch. Rerun the optimizer, compare the metrics, and decide whether the new engine keeps the promise. If it doesn’t, the shortfall shows up in numbers, not guesswork. The question becomes, "what extra evidence do we need?" instead of, "who remembers how we coaxed the old model?"
Let It Rehearse
There is nothing mystical behind the label "self-improving." The pipeline does not wake up; it practices. When performance slips or data shifts, you measure again and allow the compiler to renegotiate the prompts. The intent remains in the source file, unaltered. Improvement is the ongoing act of collecting better supporting text, not an act of invention.
In practical terms that means rehearsal is cheap. You can schedule regular optimization runs the same way you schedule regression tests. Each run asks the same question: "given our belief, does the current evidence still convince us?" If the answer becomes no, the remediation is procedural—add data, adjust the metric, rerun—rather than an emergency writing session.
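A minimal sketch of such a scheduled check, reusing the placeholder metric and the saved artifact from the earlier sketch, with a small invented `devset` standing in for the real held-out data:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # example model

def exact_clause_match(example, prediction, trace=None):
    return example.clause_id in prediction.answer

# Reload the artifact shipped last time (path from the earlier sketch).
program = dspy.ChainOfThought("claim, policy_snippet -> answer")
program.load("claims_pipeline_v1.json")

# devset: held-out, labeled examples, refreshed as the data shifts.
devset = [
    dspy.Example(
        claim="The parcel arrived damaged.",
        policy_snippet="Clause 4.2: damaged goods are refundable within 30 days.",
        clause_id="4.2",
    ).with_inputs("claim", "policy_snippet"),
    # ... more held-out examples
]

evaluate = dspy.Evaluate(devset=devset, metric=exact_clause_match,
                         num_threads=8, display_progress=True)
score = evaluate(program)
# A drop below the agreed threshold triggers the procedural fix:
# add data, adjust the metric, re-run the optimizer.
```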
Humans stay in the loop where judgment matters. A compliance team can inspect the candidate prompts the optimizer proposes and veto wordings that violate tone or policy. Those reviews fold back into the evidence base, making the next rehearsal faster. The practice cycle respects institutional knowledge instead of erasing it.
What Stays Core
The surrounding ecosystem has grown—retrieval modules, observability hooks, guardrails—but the center stays put. Pipelines remain programs. Programs state what they expect. Compilers search for the words any compatible model will honor. Everything else is tooling.
Community contributions show this in motion. The Qdrant team maintains dspy-qdrant, a drop-in retriever that lets DSPy programs fetch context from a vector database without touching the surrounding modules [1]. Research groups have open-sourced whole pipelines like IReRa, which breaks infer-retrieve-rank logic into DSPy modules that others can reuse [2]. Each addition snaps into the existing structure instead of redefining it, which is exactly the rule the framework tries to enforce.
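For a sense of how little the surrounding program changes, here is a rough sketch of wiring in the Qdrant retriever; the import path and the `QdrantRM` constructor arguments are assumptions based on my reading of the package's documentation, and the collection name is invented:

```python
import dspy
from dspy_qdrant import QdrantRM          # assumed import path for the dspy-qdrant package
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Assumed constructor: collection name, client, and how many passages to fetch.
retriever = QdrantRM(qdrant_collection_name="policies", qdrant_client=client, k=3)
dspy.configure(rm=retriever)

# The rest of the program is untouched; dspy.Retrieve now pulls from Qdrant.
fetch_context = dspy.Retrieve(k=3)
passages = fetch_context("damaged goods refund window").passages
```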
That stability invites reuse. Teams can share entire pipelines as living documents: here is the belief, here is the evidence, here are the hooks for observation. Consumers of that work do not inherit a mysterious blob of prompts; they inherit a scaffold that explains itself. When a pipeline fails in a new environment, the debugging path is already mapped.
Ship the Promise
DSPy exists so that the core expectation behind a pipeline does not evaporate after the next data change, teammate handoff, or model release. Write the expectation down, let the system gather supporting evidence, and keep both side by side. When you ship it, you are shipping the intent and the maintenance recipe. That’s the whole point.
There is some humility in that. Instead of celebrating a clever prompt, you keep a clear statement of purpose and an automated check that the purpose still holds. The language model is just a tool collecting evidence for the expectation you already wrote. Keeping that relationship visible is what makes DSPy feel steady in a field that often runs on rumor.
So the practice continues: write down the expectation, let the compiler gather the words, measure the outcome, repeat. Nothing about that cycle is flashy, but it keeps the pipeline honest.
Qdrant, "DSPy-Qdrant: Qdrant powered custom retriever module for DSPy" https://github.com/qdrant/dspy-qdrant
Tom Or Bgu et al., "xmc.dspy: Infer-Retrieve-Rank programs built on DSPy" https://github.com/KarelDO/xmc.dspy