CASE STUDY
How a financial research firm went from unreliable AI-generated survey analysis to a production system that analysts use daily — by changing the abstraction, not the model.
A financial research firm runs large-scale surveys — thousands of respondents, dozens of questions, recurring across multiple business clients. The goal is always some version of the same thing: figure out which financial products to sell to which audience segments. That means slicing the data by demographics, cross-tabulating responses, comparing sentiment across cohorts, spotting patterns that aren't obvious in a flat summary.
They tried doing this with AI. ChatGPT, Claude, Gemini, NotebookLM. Every approach failed the same way.
When you give a large spreadsheet to an LLM, it does one of two things. It tries to read it as text — which means it's processing thousands of rows as a token sequence, losing track of structure, hallucinating numbers, giving different answers to the same question on different runs. Or it writes a Python script with Pandas — which gets closer, but Pandas code generated by an LLM is brittle. It silently drops edge cases, handles missing values inconsistently, and falls apart on anything beyond basic aggregation.
A spreadsheet full of survey responses isn't text. It's structured data. LLMs are bad at reasoning over structured data when it's fed to them as unstructured text. You can prompt-engineer around this for simple questions like "what percentage of respondents chose option A." You can't prompt-engineer your way to reliable cross-tabulations across demographic segments with conditional filtering.
That's not a language problem. It's a query problem.
The reframe was simple: treat the spreadsheet as what it actually is — a database. We converted the survey data into DuckDB, an analytical database engine designed for exactly this kind of work. Column-oriented, built-in OLAP functions, fast aggregations across large datasets. Instead of asking an LLM to reason about data, we ask it to write SQL.
The model no longer needs to hold thousands of rows in context. It doesn't need to count, average, or cross-tabulate — the database does that. The model just needs to translate a natural language question into a structured query. That's a dramatically simpler task, and it's one LLMs are genuinely good at.
The agent is built to iterate. It reads the database schema, writes a query, inspects the result, decides if it answered the question or needs another pass. For complex questions it runs several queries in parallel — segment breakdowns across multiple dimensions — then reasons over all the results together before deciding if it needs a follow-up round. This is where agentic architecture earns its cost: the system can decompose a complex analytical question into a sequence of precise operations, each one verifiable.
Prompt engineering on top of this targeted the translation layer — making sure the agent understood survey-specific patterns. How to handle "top 3" questions. How to normalize free-text responses. How to interpret Likert scales. How to handle missing data without silently skewing results.
For testing, we built synthetic survey datasets — purpose-built to have known distributions and known correct answers. You can't test an analytical system against real data where you don't know the right answer. You need ground truth. So we manufactured it: controlled response patterns where every cross-tabulation has a verifiable result. The agent either gets the number right or it doesn't.
BEFORE: ~50% accuracy
AFTER: 95% accuracy
The system is in production and used daily across multiple client engagements. Analysts ask complex questions in natural language — segmentation, cross-tabulation, conditional filtering, trend comparison — and get reliable, verifiable answers backed by actual SQL queries they can inspect.
The team's original instinct was "we need a smarter model." The answer was a better abstraction. The model didn't get smarter. The problem got restructured so that a model of the same capability could solve it reliably. SQL gave the LLM something that Pandas and raw text never could — a precise, deterministic execution layer where the model only has to get the intent right, and the database handles the math.
Don't give the model a harder problem than it needs to solve.
If your AI system works in demos but not reliably in production, let's talk.