CASE STUDY
An AI enrichment system for sales intelligence — where a single pipeline tried to be good at sixteen different tasks, and wasn't good at any of them.
A sales intelligence company had built an AI-powered enrichment pipeline. Users submit a list of companies and a natural-language request — find competitors, identify top challenges, figure out if a company is B2B or B2C, summarize recent executive changes — and the AI decides which data fields to pull, which external sources to search, and how to synthesize everything into a result.
This is a hard problem. Not because any single task is hard, but because the system has to figure out which task it's even being asked to do, then execute a different reasoning chain for each one.
It wasn't working.
Even binary classifications — "Is this company SaaS?" — gave different answers on repeated runs. Multi-step tasks like identifying pain points and matching them to product capabilities were worse: sometimes the system searched the right sources but extracted the wrong things, sometimes it skipped sources entirely, sometimes it returned confident-sounding analysis that fell apart on inspection.
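One way to surface that kind of instability is to run the same classification repeatedly and measure agreement. A minimal sketch; the flaky classifier below is a stand-in for the real pipeline call, not the company's actual system:

```python
import itertools
from collections import Counter

def consistency(classify, item, runs=5):
    """Run the same binary classification several times and report the
    majority answer plus how often it appeared. 1.0 means fully stable."""
    answers = [classify(item) for _ in range(runs)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)

# Stand-in for the real LLM call: a deliberately flaky classifier
# that flips its answer one run in five.
_flaky = itertools.cycle([True, True, False, True, True])
def is_saas(company):
    return next(_flaky)

answer, agreement = consistency(is_saas, "Acme Corp")
# With the stub above: answer is True, agreement is 0.8
```

Anything below 1.0 agreement on a yes/no question is a failure you can now count instead of eyeball.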
But the real problem wasn't the outputs. It was that nobody could tell when the system was working and when it wasn't. Quality assessment was someone manually eyeballing results. So failures only surfaced when a user complained. And every time the team improved one task type, they'd unknowingly break two others.
Without measurement, improvement is just whack-a-mole.
The first thing was to get concrete about what "correct" means — not in general, but per task type. For competitor identification, correct means using the right combination of company data and web sources, and returning companies that actually share market segments. For a news summary, correct means fetching recent articles, filtering for relevance, and citing sources. Different tasks, different criteria, different weights for what matters most versus what's nice to have.
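Written down, per-task criteria become a weighted rubric. The names and weights below are illustrative, not the company's real schema; heavier weights mark what matters most, lighter ones what's nice to have:

```python
# Hypothetical per-task rubrics. Each criterion maps to a weight,
# and the weights for a task sum to 1.0.
RUBRICS = {
    "competitor_identification": {
        "used_company_data_and_web": 0.4,
        "shares_market_segment": 0.4,
        "no_invented_companies": 0.2,
    },
    "news_summary": {
        "articles_are_recent": 0.4,
        "filtered_for_relevance": 0.3,
        "sources_cited": 0.3,
    },
}

def score(task, checks):
    """checks maps criterion name -> bool; returns a weighted score in [0, 1]."""
    return sum(w for name, w in RUBRICS[task].items() if checks.get(name))

s = score("news_summary", {"articles_are_recent": True, "sources_cited": True})
# recent (0.4) + cited (0.3), relevance check failed -> 0.7
```

The point is less the arithmetic than the act of writing the criteria out per task, which is what exposed that there were sixteen of them.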
Once you write the criteria out, you see it immediately: these aren't variations of one task. They're sixteen different reasoning patterns, and a single prompt can't hold all of them at once.
BASELINE: 40% pass rate
BETTER PROMPTS: 80%, hit a ceiling
DECOMPOSED: 95%, same models
We ran the existing pipeline against the full evaluation suite. About 40% pass rate. This isn't because the model is dumb — it's because you're asking one prompt to be good at sixteen things simultaneously, and that's not how LLMs work. They're good at narrow, well-defined tasks. They're unreliable at broad, ambiguous ones.
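A pass rate like that comes from a harness along these lines. Everything here is a toy stand-in; the real suite graded each of the sixteen task types against its own rubric:

```python
def pass_rate(pipeline, cases, threshold=0.8):
    """Fraction of cases whose graded output clears the threshold.
    `pipeline` and each `grade` function are stand-ins for real calls."""
    passed = sum(grade(pipeline(inp)) >= threshold for inp, grade in cases)
    return passed / len(cases)

# Two toy cases: a binary classification and a looser analysis task.
cases = [
    ("is Acme Corp SaaS?", lambda out: 1.0 if out == "yes" else 0.0),
    ("top challenges for Acme Corp?", lambda out: 0.5),  # grades below threshold
]
rate = pass_rate(lambda request: "yes", cases)
# one pass, one fail -> 0.5
```

Once this exists, every change to the pipeline gets a number instead of an opinion.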
So we tried the obvious thing first: better prompts, better pre-processing, keep the same architecture. Multiple iterations against the test suite. Got to about 80%. The remaining 20% wouldn't budge. These weren't prompt failures — they were architectural failures. The system needed to make sequential decisions (which fields matter, is internal data sufficient or do we need external search, how to synthesize), and a single pass can't reliably chain those decisions together.
The fix is task decomposition. Break the pipeline into five or six discrete steps — field selection, data sufficiency check, search strategy, synthesis, output formatting. Each step does one thing. Each step is independently testable. Each step is simple enough that an LLM can do it reliably.
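In code, the shape is a handful of small functions chained explicitly rather than one large prompt. Every function below is a stub standing in for an LLM call; the step names mirror the text, the implementations are invented for illustration:

```python
def select_fields(request):
    # Step 1: decide which data fields the request needs.
    return ["news"] if "recent" in request else ["industry", "description"]

def is_sufficient(fields, internal_data):
    # Step 2: is internal data enough, or do we need external search?
    return all(f in internal_data for f in fields)

def plan_searches(fields):
    # Step 3: one targeted query per field we still need.
    return {f: f"web search for {f}" for f in fields}

def synthesize(request, data):
    # Step 4: combine everything into an answer.
    return f"answer to {request!r} using {sorted(data)}"

def run(request, internal_data):
    fields = select_fields(request)
    if is_sufficient(fields, internal_data):
        data = {f: internal_data[f] for f in fields}
    else:
        data = plan_searches(fields)
    return synthesize(request, data)

result = run("is Acme Corp B2B?", {"industry": "logistics", "description": "..."})
```

Because each step is a plain function with one job, each gets its own tests, and a regression in field selection can no longer masquerade as a synthesis bug.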
The decomposed pipeline hit 95% on the same evaluation suite. Same models, same data, same tasks. The only difference is the structure of the problem you're giving the model.
The last piece was transferring this to the team: a few smart engineers with no prior experience building AI systems. They needed to internalize the loop: define criteria, measure, change one thing, measure again. Not because it's complicated, but because the instinct with AI is to tweak and eyeball and hope. The discipline is to treat it like engineering.
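The loop is almost embarrassingly small to write down, which is the point. A sketch of the discipline, not a framework; `evaluate` stands in for a full run of the suite, and the canned scores below just echo the numbers from this case study:

```python
def improve(baseline, changes, evaluate):
    """Apply one change at a time, keep it only if the measured score
    improves, and record every measurement."""
    current, best = baseline, evaluate(baseline)
    history = [(baseline, best)]
    for change in changes:
        candidate = change(current)
        score = evaluate(candidate)
        history.append((candidate, score))
        if score > best:
            current, best = candidate, score
    return current, best, history

# Toy example: variants are labeled strings, scores are canned.
scores = {"v0": 0.40, "v0+prompts": 0.80, "v0+decomposed": 0.95}
final, best, history = improve(
    "v0",
    [lambda v: "v0+prompts", lambda v: "v0+decomposed"],
    lambda v: scores[v],
)
```

The history list is the part teams skip: without it, you can't tell which change broke the two task types you weren't looking at.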
Toward the end of the engagement, the company made a strategic decision to expose their enrichment capabilities directly to clients through an MCP integration, rather than routing everything through the AI planning layer. The specific pipeline we'd built together was superseded.
This is the part that matters: the evaluation framework and the decomposition thinking carried straight into what the team built next. The methodology wasn't coupled to the system. The system was one application of it. When the product direction changed, the team already knew how to define task-specific success criteria, measure systematically, and decompose complex AI workflows into reliable steps.
The pipeline changed. The engineering discipline didn't.
If your AI system works in demos but not reliably in production, let's talk.