CASE STUDY

When the Model Can't Reason

Building a reliable AI system for personal financial documents — where privacy means everything runs on-device, and edge models can't do what you need them to do.

A fintech company was building a mobile app for personal financial document management — privately storing, searching, and working with sensitive financial documents on-device. The key architectural constraint: financial data couldn't leave the device.

Privacy wasn't a feature. It was the product.

That meant edge models. And edge models can't reason.

What was actually wrong

The team had already tried. They'd taken an edge model, given it the tasks — understand what the user wants, figure out which documents are relevant, extract the right information — and the results were bad. Not "needs prompt tuning" bad. Fundamentally bad.

The model couldn't reliably distinguish between "find my latest tax return" and "how much did I pay in taxes last year." One is retrieval. The other requires reading a document and extracting a number. An edge model treats them the same because it doesn't have the capacity to reason about the difference.

Tool use was the other gap. Frontier models can learn to call functions — search an index, open a document, extract a field. Edge models can't do this reliably. You can get them to emit keywords that trigger tools, but that's brittle. It works in demos. It breaks in production.

The core problem: the team needed frontier-model capabilities in an environment where only edge-model compute was available.

What we did

The first step was understanding the actual shape of user requests. Not "what could users ask" — that's infinite. "What will users ask" — that's finite. We mapped out the realistic combinations: request types crossed with document types crossed with the operations needed to fulfill them. Hundreds of distinct combinations, but hundreds is a number you can work with.
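The mapping can be sketched in miniature. The category names below are hypothetical stand-ins (the real taxonomy came from studying actual user requests), but the structure is the point: crossing a few finite axes yields a finite, enumerable label set.

```python
from itertools import product

# Hypothetical axes -- the real taxonomy was mapped from observed requests.
REQUEST_TYPES = ["retrieve", "extract_value", "summarize"]
DOCUMENT_TYPES = ["tax_return", "bank_statement", "invoice"]
OPERATIONS = ["open", "read_field", "search"]

# Each combination becomes one use-case label a classifier can target.
use_cases = [
    f"{req}:{doc}:{op}"
    for req, doc, op in product(REQUEST_TYPES, DOCUMENT_TYPES, OPERATIONS)
]

print(len(use_cases))  # 27 here; hundreds in the real system, but still finite
```

Three small axes already give 27 distinct labels; the real system's hundreds are the same idea at scale.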

Once you see the problem as classification rather than reasoning, everything changes.

We used frontier models to generate thousands of training examples across these combinations — synthetic user inputs mapped to their correct use case category. Then we validated the generated data: additional frontier models as cross-checks, plus manual review at critical control points where misclassification would cause the worst user experience.
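The validation step can be sketched as a triage rule. The function name, label strings, and agreement policy below are illustrative assumptions; in the real pipeline the validator labels come from independent frontier-model calls, and the manual-review queue concentrates on the control points where a wrong label would hurt most.

```python
def triage(example: dict, validator_labels: list[str]) -> str:
    """Accept a synthetic example only when every independent validator
    agrees with its intended label; anything contested goes to a human."""
    if all(label == example["label"] for label in validator_labels):
        return "accept"
    return "manual_review"

example = {
    "text": "how much did I pay in taxes last year",
    "label": "extract:tax_paid",  # the label the generator intended
}

# Two validators agree with the intended label: keep the example.
triage(example, ["extract:tax_paid", "extract:tax_paid"])
# One validator disagrees: route to manual review instead of training on it.
triage(example, ["extract:tax_paid", "retrieve:tax_return"])
```

Requiring unanimous agreement trades some yield for label quality, which is the right trade when the downstream model is small and has no capacity to recover from noisy supervision.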

We trained a lightweight transformer — a modern BERT-class model — to classify incoming requests. This model doesn't reason. It doesn't need to. It just answers one question: what is the user trying to do? And it answers it fast, on-device, with high accuracy.
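As a toy stand-in for that classifier, here is a bag-of-words scorer over two hypothetical labels. The real model is a fine-tuned BERT-class transformer, not word counting; what carries over is the interface: text in, a (label, confidence) pair out, with no generation and no reasoning.

```python
from collections import Counter

# Hypothetical labeled examples -- the real training set was thousands of
# validated synthetic inputs per use case.
TRAINING = {
    "retrieve:tax_return": [
        "find my latest tax return",
        "show me my tax return",
    ],
    "extract:tax_paid": [
        "how much did I pay in taxes last year",
    ],
}

# Per-label word counts, standing in for learned model weights.
VOCAB = {
    label: Counter(word for ex in examples for word in ex.lower().split())
    for label, examples in TRAINING.items()
}

def classify(text: str) -> tuple[str, float]:
    """Return the best-matching use-case label and a crude confidence."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words)
              for label, counts in VOCAB.items()}
    total = sum(scores.values()) or 1  # avoid division by zero
    best = max(scores, key=scores.get)
    return best, scores[best] / total
```

Note what this buys: "find my latest tax return" and "how much did I pay in taxes last year" land in different buckets by surface features alone, which is exactly the distinction the raw edge model couldn't make.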

Once you know exactly what the user wants, the edge model's job gets radically simpler. Instead of "understand this request and figure out what to do," it's "here's a specific task type, here are the instructions for this exact scenario, execute." A narrow, well-defined prompt that an edge model can handle reliably.
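That narrowing can be sketched as a prompt table keyed by the classified use case. The labels and instruction text are hypothetical; the principle is that the edge model never sees an open-ended request, only a fixed instruction for one known scenario plus the relevant context.

```python
# Hypothetical per-use-case instructions. Once the classifier has named
# the task, the edge model's prompt is fully specified in advance.
PROMPTS = {
    "extract:tax_return:total_paid": (
        "You are given one tax return. Find the total tax paid "
        "and return only its numeric value."
    ),
    "retrieve:tax_return:latest": (
        "From the file list below, return only the file name of the "
        "most recent tax return."
    ),
}

def build_prompt(use_case: str, context: str) -> str:
    # No open-ended reasoning: the instruction is constant per use case,
    # and only the retrieved context varies.
    return f"{PROMPTS[use_case]}\n\n{context}"
```

The edge model's job collapses from "understand and plan" to "follow one concrete instruction," which is the regime where small models hold up.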

LAYER 1: Heuristics
Obvious cases handled by pattern matching. No model needed.

LAYER 2: Classifier
Ambiguous cases classified fast and accurately, on-device.

LAYER 3: Edge LLM
Language generation with specific, matched instructions.

And when confidence is low at any layer, the system falls back gracefully rather than guessing.
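The three layers and the fallback rule can be sketched as a small router. The heuristic pattern, the 0.85 threshold, and the label names are assumptions for illustration; the real system tunes these per deployment.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 0.85  # assumed value; tuned per deployment in practice

@dataclass
class Decision:
    layer: str
    use_case: Optional[str]  # None means "fall back, don't guess"

def route(text: str, classify: Callable[[str], tuple[str, float]]) -> Decision:
    # Layer 1: heuristics catch the obvious cases. No model needed.
    if text.strip().lower().startswith("find my"):
        return Decision("heuristics", "retrieve")
    # Layer 2: the on-device classifier handles the ambiguous cases.
    use_case, confidence = classify(text)
    if confidence >= CONFIDENCE_THRESHOLD:
        return Decision("classifier", use_case)
    # Low confidence: degrade gracefully rather than guess.
    # (Layer 3, the edge LLM, only ever runs on a confidently routed task.)
    return Decision("fallback", None)
```

The ordering matters: each layer is cheaper and more reliable than the one below it, so the expensive, least reliable component (the LLM) only sees work the upper layers have already made unambiguous.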

What changed

The system went from unreliable — failing at basic request understanding — to handling roughly 80% of requests at production quality, with graceful degradation on the remainder. No data leaves the device. No frontier model in the loop at runtime. The compute cost per request is negligible.

The deeper insight is architectural. The team's original approach was "take a hard problem and give it to a model." The approach that worked was "make the problem easier before the model sees it." Classification, heuristics, and decomposition did the heavy lifting. The LLM became the last mile, not the whole journey.

Don't ask the model to reason. Change the problem so it doesn't have to.

This is a pattern that generalizes well beyond edge deployment. Any time you're working with a model that isn't quite capable enough for the full task — whether because of size constraints, cost, latency, or reliability requirements — the move is the same: shrink the problem before the model sees it.

Facing a similar problem?

If your AI system works in demos but not reliably in production, let's talk.