CASE STUDY

When the Evaluation Criteria Don't Exist Yet

A production studio asked whether AI could evaluate microdrama content in a domain with no rubric, no framework, and human experts who judged on instinct. Building the evaluation criteria was the project.

Most case studies describe a common starting point: a system exists, it's unreliable, and the first step is building an evaluation framework to measure what "good" means. But what happens when there's no system and no framework? When even the human experts evaluate on instinct rather than criteria?

A production studio needed to evaluate microdrama content: short-form series of sixty to eighty episodes, each two minutes long, produced at volume. They needed to judge script quality before production, episode quality after filming, and whether the hook episode would convert free viewers to paying subscribers. They were doing all of this manually, based on experience and gut feel.

The question wasn't "fix our AI." It was "can AI do this at all?"

Building judgment from scratch

The first problem was defining what "good" means in a domain that doesn't have a rubric. Microdrama isn't film, isn't television, isn't TikTok. It has its own conventions — pacing structures, cliffhanger mechanics, emotional escalation patterns across dozens of episodes, specific rules about when internal monologue works versus external dialogue. None of this was written down in a form an AI system could use.

We used frontier models to reverse-engineer the genre's implicit rules. Analyze successful series. Identify the structural patterns. Cross-reference against industry knowledge about what drives retention and conversion in short-form serialized content. The output was a systematic evaluation framework — roughly twenty dimensions covering narrative structure, pacing, hook placement, character dynamics, emotional arc, and episode-level production quality.
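As an illustration only, here is a minimal sketch of how a framework like this might be represented in code. The six dimension names come from the list above; the questions, the weights, the 1-to-5 scale, and names such as `Dimension` and `overall_score` are assumptions made for this example, not the studio's actual rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    question: str        # what the reviewer (human or model) is asked to judge
    weight: float = 1.0  # hypothetical weighting, not taken from the project


# Hypothetical subset of the framework: only the six areas named in the text.
RUBRIC = [
    Dimension("narrative_structure", "Does each episode advance the central conflict?"),
    Dimension("pacing", "Does tension build without stalling inside the episode?"),
    Dimension("hook_placement", "Does the episode end on an unresolved beat?", weight=2.0),
    Dimension("character_dynamics", "Are the relationships clear and in motion?"),
    Dimension("emotional_arc", "Does emotional intensity escalate across episodes?"),
    Dimension("production_quality", "Do lighting, camera work, and delivery support the scene?"),
]


def overall_score(scores: dict[str, int], rubric: list[Dimension] = RUBRIC) -> float:
    """Weighted mean of per-dimension scores on an assumed 1-5 scale."""
    missing = [d.name for d in rubric if d.name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for d in rubric:
        if not 1 <= scores[d.name] <= 5:
            raise ValueError(f"score for {d.name} must be between 1 and 5")
    total_weight = sum(d.weight for d in rubric)
    return sum(scores[d.name] * d.weight for d in rubric) / total_weight
```

Writing each dimension as an explicit question is the design choice that matters here: it turns an instinct ("this episode drags") into something a model or a second reviewer can be asked to score consistently.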

This framework didn't exist before the project. Building it was the project.

From scripts to video

Script analysis was the first application. Feed a script through the framework and get structured feedback: where the pacing drops, which episode works as the conversion hook, where dialogue should shift from external to internal monologue, what can be improved before a single frame is shot.
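To make "structured feedback" concrete, here is a hedged sketch of what per-episode output could look like and how two of the questions above (where pacing drops, which episode is the conversion hook) might be answered from it. The `EpisodeFeedback` fields, the 1-to-5 scale, and the `min_drop` threshold are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EpisodeFeedback:
    episode: int
    pacing: int         # assumed 1-5 scale
    hook_strength: int  # assumed 1-5 scale
    notes: str = ""


def pacing_drops(feedback: list[EpisodeFeedback], min_drop: int = 2) -> list[int]:
    """Episodes whose pacing score falls by at least min_drop versus the previous episode."""
    ordered = sorted(feedback, key=lambda f: f.episode)
    return [
        cur.episode
        for prev, cur in zip(ordered, ordered[1:])
        if prev.pacing - cur.pacing >= min_drop
    ]


def conversion_hook_candidate(feedback: list[EpisodeFeedback]) -> int:
    """Episode with the highest hook strength; ties go to the earliest episode."""
    return min(feedback, key=lambda f: (-f.hook_strength, f.episode)).episode
```

The scores themselves would come from the model's evaluation against the framework; this sketch only shows the shape of the data once it is structured rather than free text.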

Video analysis was harder. We fed full two-minute episodes into Gemini's multimodal model and evaluated against the same framework plus production-specific dimensions — lighting, camera work, actor delivery, adherence to the original script.

In one scene, a character slaps someone across the face — and Gemini's safety filters silently omitted it from the analysis. We only caught the gap because we could cross-reference the model's scene-by-scene breakdown against the original script.

The fix was a multi-pass pipeline: initial analysis, deviation detection against the script, then targeted re-analysis of flagged gaps. This is the kind of reliability problem that stays invisible if you evaluate by eyeballing outputs.
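A minimal sketch of the three passes, under stated assumptions: script beats and the model's scene breakdown are both plain strings, a beat counts as covered when enough of its words appear in some reported beat, and `analyze` and `reanalyze` are hypothetical callables standing in for the model calls. The word-overlap matching is a stand-in for illustration, not the matching method used in the project.

```python
import re
from typing import Callable


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))


def find_missing_beats(
    script_beats: list[str],
    reported_beats: list[str],
    min_overlap: float = 0.7,  # assumed threshold; would need tuning on real data
) -> list[str]:
    """Script beats with no sufficiently similar beat in the model's breakdown.

    Similarity is the fraction of the script beat's words found in a reported beat.
    """
    missing = []
    for beat in script_beats:
        words = _tokens(beat)
        if not words:
            continue  # nothing to match against
        best = max(
            (len(words & _tokens(r)) / len(words) for r in reported_beats),
            default=0.0,
        )
        if best < min_overlap:
            missing.append(beat)
    return missing


def run_pipeline(
    script_beats: list[str],
    analyze: Callable[[], list[str]],
    reanalyze: Callable[[str], str],
) -> tuple[list[str], list[str], dict[str, str]]:
    """Pass 1: full analysis. Pass 2: deviation detection. Pass 3: targeted re-analysis."""
    reported = analyze()
    gaps = find_missing_beats(script_beats, reported)
    followups = {beat: reanalyze(beat) for beat in gaps}
    return reported, gaps, followups
```

The important property is that the script acts as ground truth: a beat the model silently skipped shows up as a gap in the second pass instead of depending on a reviewer to notice its absence.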

Why this matters as methodology

This was a proof of concept, completed in a week. It didn't go to production. But the studio confirmed the analysis matched and extended their own expert judgment — the framework caught what their best reviewers would catch, and surfaced things they hadn't noticed.

The point isn't the specific application. It's that Evidence-Based AI Engineering works even when the evaluation criteria don't exist yet. The methodology's first step — define what "good" means, concretely, measurably — still applies. It just means the framework itself becomes the primary deliverable, and the AI system is built to embody it.

The judgment isn't subjective. It's just undocumented.
