CASE STUDY
How a SaaS company built a natural-language BI system business users can actually trust — by teaching the model what the data means, not just what it looks like.
A SaaS company wanted to let business users ask questions about their data in plain English (revenue by quarter, pipeline conversion rates, rep performance against quota, churn by cohort) and get answers they could trust the way they trust a report from their BI team. The data lived across two systems, HubSpot (CRM) and Sage (ERP), with different schemas, different conventions, and different definitions of the same words.
They had a working prototype. The LLM could read the schema and write SQL. The problem was you couldn't trust the answers.
The system had three failure modes, and they all came from the same root cause: the model had the schema but not the business context.
FAILURE 1
The model picks a plausible reading and runs with it. Calendar-year gross bookings when the user meant fiscal-year net revenue.
FAILURE 2
Ask "break down revenue by quarter" twice and get Q1/Q2/Q3/Q4 one time and 2025-01/2025-04/2025-07/2025-10 the next.
FAILURE 3
"How are we doing on retention?" is ambiguous. An analyst would ask for clarification. The model just picks assumptions. Confidently. Silently.
"What were our sales last quarter?" sounds simple, but the company prorates subscription revenue, their fiscal year starts in April, and returns are handled differently depending on product line. The model doesn't know any of this.
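A query that sums gross bookings by calendar quarter and a query that sums prorated net revenue by fiscal quarter are both valid SQL. Only one is correct.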
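To make the calendar-versus-fiscal gap concrete, here is a minimal sketch of how the same date lands in different quarters once the fiscal year starts in April. The function names are illustrative, and the labelling rule (a fiscal year is named after the calendar year in which it ends) is an assumption; the case study does not say which convention the company uses.

```python
from datetime import date

FISCAL_YEAR_START_MONTH = 4  # April, as stated in the case study


def calendar_quarter(d: date) -> tuple[int, int]:
    """Return (calendar_year, quarter) for a date."""
    return d.year, (d.month - 1) // 3 + 1


def fiscal_quarter(d: date, start_month: int = FISCAL_YEAR_START_MONTH) -> tuple[int, int]:
    """Return (fiscal_year, quarter) for a date.

    Assumption: a fiscal year is labelled by the calendar year in which it ends.
    """
    # Months elapsed since the fiscal year began, in the range 0..11.
    offset = (d.month - start_month) % 12
    quarter = offset // 3 + 1
    # Dates on or after the start month belong to the fiscal year ending next calendar year.
    rolls_over = start_month > 1 and d.month >= start_month
    return d.year + (1 if rolls_over else 0), quarter


def label(fiscal_year: int, quarter: int) -> str:
    """Format a fiscal quarter in the 'Q1 FY25' style."""
    return f"Q{quarter} FY{fiscal_year % 100:02d}"


feb = date(2025, 2, 15)
print(calendar_quarter(feb))        # (2025, 1)
print(label(*fiscal_quarter(feb)))  # Q4 FY25
```

A model that only sees a date column has no way to know which of these two groupings the business expects.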
Prompt engineering was the obvious first move: encode the business rules (fiscal year definitions, revenue calculation methods, return handling) directly into the prompt. This helped, but it was fragile. Change one line in a prompt that contains dozens of business rules and you fix revenue calculations but break quota aggregations. Every improvement is a potential regression.
So before going further, we built the evaluation layer. Hundreds of test cases — natural language questions paired with verified SQL and known-correct results. Some reverse-engineered from existing dashboards. Some sourced from stakeholders: "this is the question I ask every Monday, and this is what the right answer looks like." The test suite had to cover enough ground that any prompt change could be measured against it immediately: are we improving overall, or just trading one failure for another?
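A minimal sketch of what such an evaluation layer could look like, under stated assumptions: the test database is an in-memory SQLite table, the `bookings` schema is invented for illustration, and `generate_sql` is a hypothetical stand-in for the real model call (here, two lookup tables that play the role of two prompt versions). It reports overall accuracy and, separately, which previously passing questions a change has broken.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_rows: list  # known-correct result rows, as tuples


def run_suite(conn: sqlite3.Connection, cases: list, generate_sql: Callable[[str], str]) -> dict:
    """Map each question to True/False depending on whether results match."""
    results = {}
    for case in cases:
        try:
            rows = conn.execute(generate_sql(case.question)).fetchall()
        except sqlite3.Error:
            results[case.question] = False  # invalid SQL counts as a failure
            continue
        # Compare order-insensitively; key=repr keeps sorting safe for mixed types.
        results[case.question] = sorted(rows, key=repr) == sorted(case.expected_rows, key=repr)
    return results


def accuracy(results: dict) -> float:
    return sum(results.values()) / len(results)


def regressions(baseline: dict, candidate: dict) -> list:
    """Questions that passed under the baseline but fail under the candidate."""
    return [q for q, ok in baseline.items() if ok and not candidate.get(q, False)]


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (region TEXT, gross INTEGER, net INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?, ?, ?)",
                 [("EMEA", 120, 100), ("AMER", 200, 180)])

cases = [
    EvalCase("net revenue by region", [("AMER", 180), ("EMEA", 100)]),
    EvalCase("total net revenue", [(280,)]),
]

# Stand-ins for the SQL produced by two prompt versions.
prompt_v1 = {
    "net revenue by region": "SELECT region, SUM(gross) FROM bookings GROUP BY region",
    "total net revenue": "SELECT SUM(net) FROM bookings",
}
prompt_v2 = {
    "net revenue by region": "SELECT region, SUM(net) FROM bookings GROUP BY region",
    "total net revenue": "SELECT SUM(gross) FROM bookings",
}

baseline = run_suite(conn, cases, prompt_v1.__getitem__)
candidate = run_suite(conn, cases, prompt_v2.__getitem__)
print(accuracy(baseline), accuracy(candidate))  # 0.5 0.5
print(regressions(baseline, candidate))         # ['total net revenue']
```

In this toy run the headline accuracy is unchanged while one question flips from passing to failing, which is the "trading one failure for another" case that a single aggregate number would hide.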
With measurement in place, the architecture problem became clear. The model needed business context, but you can't put hundreds of business rules and conventions into a single prompt. The solution was dynamic few-shot examples. We built a retrieval layer over a golden set of hundreds of verified question-SQL pairs — real questions with correct, analyst-approved queries that encode all the implicit knowledge: how the company formats quarters, how it calculates net revenue, which joins are correct across HubSpot and Sage, how to handle nulls in pipeline stage data.
The model doesn't need to know every business rule. It needs to see how similar questions were answered correctly, and extrapolate.
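The retrieval idea can be sketched in a few lines. The case study does not say how similarity was computed, so this example swaps in plain token-overlap (Jaccard) scoring as a stand-in; a production system would more likely use embeddings. `GoldenExample`, `retrieve`, `build_prompt`, and the SQL strings and table names below are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    question: str
    sql: str  # analyst-approved query for this question


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a simple stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(question: str, golden_set: list, k: int = 3) -> list:
    """Return the k golden examples most similar to the user's question."""
    ranked = sorted(golden_set, key=lambda ex: similarity(question, ex.question), reverse=True)
    return ranked[:k]


def build_prompt(question: str, examples: list, schema: str) -> str:
    """Assemble a few-shot prompt: schema, retrieved examples, then the new question."""
    parts = [f"Schema:\n{schema}", ""]
    for ex in examples:
        parts += [f"Question: {ex.question}", f"SQL: {ex.sql}", ""]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)


# Hypothetical golden set; table and column names are invented.
golden_set = [
    GoldenExample("What was net revenue by quarter last fiscal year?",
                  "SELECT fiscal_quarter, SUM(net_amount) FROM revenue GROUP BY fiscal_quarter"),
    GoldenExample("How many deals did each rep close this quarter?",
                  "SELECT rep_id, COUNT(*) FROM deals WHERE stage = 'closed_won' GROUP BY rep_id"),
    GoldenExample("What is churn by signup cohort?",
                  "SELECT cohort, AVG(churned) FROM accounts GROUP BY cohort"),
]

question = "Show net revenue by quarter"
examples = retrieve(question, golden_set, k=2)
print(build_prompt(question, examples, schema="revenue(fiscal_quarter, net_amount)"))
```

Because the prompt is rebuilt per question, only the conventions relevant to that question are shown to the model, which keeps the prompt small even as the golden set grows.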
When a user asks a question, the system retrieves the most relevant examples from the golden set and includes them in the prompt. This addressed all three problems at once. The examples enforce consistency: if every golden example formats quarters as Q1 FY25, the model follows suit. They encode domain logic: the model sees how revenue was calculated in similar queries and infers the pattern. And for ambiguity, we built a separate detection step with its own example set, so the model learns when a question is too vague to answer and asks for clarification instead of guessing.
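One way to picture the separate ambiguity step is as a gate that runs before any SQL is generated. The sketch below is an assumption about the control flow only: `detect_ambiguity` and `generate_sql` are hypothetical callables standing in for the two model calls, and the toy detector is a hard-coded lookup, not the example-driven classifier the case study describes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass(frozen=True)
class Clarification:
    follow_up: str  # question to send back to the user


@dataclass(frozen=True)
class Answer:
    sql: str


def answer_question(
    question: str,
    detect_ambiguity: Callable[[str], Optional[str]],
    generate_sql: Callable[[str], str],
) -> Union[Clarification, Answer]:
    """Ask for clarification when the detector flags the question; otherwise generate SQL."""
    follow_up = detect_ambiguity(question)
    if follow_up is not None:
        # Stop here: no SQL is generated for a question that is too vague.
        return Clarification(follow_up)
    return Answer(generate_sql(question))


# Toy stand-ins for the two model calls (hypothetical).
VAGUE = {
    "how are we doing on retention?":
        "Do you mean logo retention or net revenue retention, and over what period?",
}


def toy_detector(question: str) -> Optional[str]:
    return VAGUE.get(question.strip().lower())


def toy_generator(question: str) -> str:
    return "SELECT 1  -- placeholder for model-generated SQL"


print(answer_question("How are we doing on retention?", toy_detector, toy_generator))
print(answer_question("What was net revenue in Q1 FY25?", toy_detector, toy_generator))
```

Keeping the gate as its own step means it can be measured and tuned with its own test cases, independently of SQL generation.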
BEFORE
75% accuracy
AFTER
97% accuracy
The system is in staging, serving real business queries across CRM and ERP data. Users ask questions in natural language and get results that match what their BI team would produce — same definitions, same conventions, same formatting.
The golden example set is the real asset. It's a machine-readable encoding of how this specific company thinks about its data. Every time someone verifies a new query, the set grows, and the system gets more accurate. It's not a static system that degrades. It's a flywheel that improves as people use it.
A schema tells the model what the data looks like. Examples teach it what the data means.
If your AI system works in demos but not reliably in production, let's talk.