Define the rubric
Encode the judgment your team already trusts. Versioned, reviewable, and the same on every run.
→For the teams who stopped trusting the eval script. Codify the rubric, score every release against it, and ship the result with proof attached.
Your judges. Your weights. Versioned.
Same rubric. Different model. Same standard.
Scorecards and reviewer notes follow the release.
Codify the rubric. Run every release against it. Ship with the proof.
Encode the judgment your team already trusts. Versioned, reviewable, and the same on every run.
→Each release passes through the same rubric. Drift, regression, and bias show up before the gate.
→Scorecards, reviewer notes, and decisions stay with the release. Sealed when it ships.
Rubric Studio Cloud turns prompt context, model outputs, and expert judgment into approved criteria, criterion-level grading, evidence capture, scorecards, and regression memory. Four named steps. One evaluation record.
The model reads the prompt, retrieved context, and prior failures. It proposes criteria and warnings. Nothing activates yet.
A named approver reads the draft, edits the criteria, sets the weights, and signs the version. The history is permanent.
Reviewers grade each criterion with the rubric and the evidence visible. Blocker reasons are first-class fields, not free-text notes.
Each grade contributes to failure breakdowns, regression bank entries, and the model scorecard the release team carries to approval.
A rubric that holds up under release pressure is not a Likert scale. It is a weighted contract with ground-truth anchors, a calibration loop, and a memory of the cases that have already escaped.
Each criterion has a weight, a passing threshold, and a fail-state contract. A release that wins on grounding but loses on disclosure is not an averaged-out pass — it is a hold.
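A minimal sketch of that contract in Python. The criterion names, weights, and thresholds are invented for illustration, not the studio's schema; the point is that a per-criterion floor forces a hold even when the weighted total looks like a pass.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float      # contribution to the weighted total
    threshold: float   # per-criterion floor; falling below it is a hold

def evaluate(criteria: list[Criterion], grades: dict[str, float]) -> tuple[float, list[str]]:
    """Return the weighted total and the criteria that trip a hold."""
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * grades[c.name] for c in criteria) / total_weight
    holds = [c.name for c in criteria if grades[c.name] < c.threshold]
    return weighted, holds

# Illustrative rubric: the release scores well overall but misses the
# disclosure floor, so the result is a hold, not an averaged-out pass.
rubric = [
    Criterion("grounding", weight=0.4, threshold=0.7),
    Criterion("resolution", weight=0.4, threshold=0.7),
    Criterion("disclosure", weight=0.2, threshold=0.9),
]
score, holds = evaluate(rubric, {"grounding": 0.95, "resolution": 0.9, "disclosure": 0.6})
print(f"weighted score: {score:.2f}")    # ~0.86: looks like a pass
print(f"hold on: {holds or 'nothing'}")  # hold on: ['disclosure']
```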
We anchor each scoring run with a small set of ground-truth cases the team has agreed on. Drift in the score against the anchors is itself a signal.
Each run produces a side-by-side: the current production model and the candidate, scored on the same cases against the same criteria. The deltas are reviewable.
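A rough sketch of how that side-by-side might be computed, assuming per-criterion scores already exist for both models on the shared anchor cases. All names and numbers below are illustrative.

```python
# Per-criterion scores on the same anchor cases, production vs. candidate.
# The values are invented; the shape of the comparison is the point.
production = {"grounding": 0.91, "resolution": 0.84, "disclosure": 0.93}
candidate  = {"grounding": 0.94, "resolution": 0.88, "disclosure": 0.79}

DRIFT_TOLERANCE = 0.05  # how far an anchor score may move before it is flagged

print(f"{'criterion':<12} {'prod':>6} {'cand':>6} {'delta':>7}")
for name in production:
    delta = candidate[name] - production[name]
    flag = "  <- review" if abs(delta) > DRIFT_TOLERANCE else ""
    print(f"{name:<12} {production[name]:>6.2f} {candidate[name]:>6.2f} {delta:>+7.2f}{flag}")
```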
AI judges run first for scale, with calibration scores held against a human reviewer panel. Disagreement above threshold routes to expert review with the case attached.
The cases the rubric missed get added to the next release's required set. Misses do not get to escape twice.
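One possible shape for that memory: a hypothetical regression bank where every escaped case is appended once and read back as a required check on every later run. The file name and field names are assumptions, not the studio's format.

```python
import json
from pathlib import Path

BANK = Path("regression_bank.jsonl")  # hypothetical on-disk location

def record_escape(case_id: str, criterion: str, release: str) -> None:
    """Append an escaped case; it becomes a required check from now on."""
    entry = {"case_id": case_id, "criterion": criterion, "escaped_in": release}
    with BANK.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def required_cases() -> list[dict]:
    """Every prior escape is part of the next release's required set."""
    if not BANK.exists():
        return []
    return [json.loads(line) for line in BANK.read_text().splitlines() if line]

record_escape("case-4117", criterion="disclosure", release="2024-11-rc2")
for case in required_cases():
    print(f"must pass: {case['case_id']} ({case['criterion']}, escaped in {case['escaped_in']})")
```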
Every run leaves something the team can act on — and something the next release has to clear.
One read on what passed, what failed, and what needs another look.
Cases the rubric flags get routed to the right reviewer with the rubric reading attached.
Every escaped case becomes a repeatable check the next release must pass.
What changed. Who approved it. What the rubric said at the time.
Rubric, reviewer notes, and verdict — ready when someone asks.
Every score in the studio is tied to the exact prompt, retrieval, tool call, answer, judge reading, and human override that produced it. The reviewer never has to ask “what did the model see?” — it is attached.
The customer or test case the run is grounded on.
→Retrieved context, policy lookups, and prior case memory.
→External calls — account, policy, knowledge base, action.
→The candidate's reply, attached to the path that produced it.
→AI judge scoring each criterion with calibration to humans.
→Reviewer accepts, edits, or holds the verdict with reason.
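A rough sketch of the shape of that record, assuming a flat structure. The class and field names below are illustrative; the studio's actual schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvidenceRecord:
    """Everything a reviewer needs to answer 'what did the model see?'."""
    case_id: str                      # the customer or test case the run is grounded on
    prompt: str                       # the exact prompt as sent
    retrieved_context: list[str]      # retrieval, policy lookups, prior case memory
    tool_calls: list[dict]            # external calls: account, policy, knowledge base, action
    answer: str                       # the candidate's reply, tied to the path that produced it
    judge_scores: dict[str, float]    # AI judge reading per criterion
    human_override: Optional[str] = None    # reviewer accept / edit / hold, with reason
    reviewer_notes: list[str] = field(default_factory=list)
```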
Customer asks for a refund exception after the policy window. The candidate offered partial credit but missed the disclosure copy the policy requires. AI judges scored 82/100 with escalation clarity below threshold. The reviewer holds the release until the disclosure copy is fixed. Forty-one cases route to the regression bank in the same step.
The judges run in parallel for scale. Each one is calibrated against a human panel on a rolling sample. When the judges disagree above threshold — or when either one falls outside its calibration band — the case routes to a reviewer with the rubric, the case, and the AI readings attached.
Reads the candidate's answer against the policy and grades grounding, citation accuracy, and contradiction risk.
Reads the candidate's answer against the customer's stated need and grades resolution, tone, and downstream effect.
A human reviewer sees both AI judge readings and the case. They accept, edit, or hold. Their verdict signs the release packet.
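A minimal sketch of the routing rule described above, assuming one score per judge and a calibration check computed elsewhere. The threshold and case identifiers are invented for illustration.

```python
DISAGREEMENT_THRESHOLD = 0.15  # max gap between the two judges before a human looks

def route(case_id: str,
          policy_judge: float,
          outcome_judge: float,
          calibration_ok: bool) -> str:
    """Decide whether the AI verdict stands or the case goes to a reviewer."""
    disagreement = abs(policy_judge - outcome_judge)
    if not calibration_ok or disagreement > DISAGREEMENT_THRESHOLD:
        # The case, the rubric, and both AI readings travel with the routing.
        return f"{case_id}: expert review (disagreement={disagreement:.2f})"
    return f"{case_id}: AI verdict accepted (disagreement={disagreement:.2f})"

print(route("case-4117", policy_judge=0.62, outcome_judge=0.88, calibration_ok=True))
print(route("case-2093", policy_judge=0.81, outcome_judge=0.84, calibration_ok=True))
```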
Test the run. Review the hard cases. Recruit the right specialist. Remember the misses. Approve what's right.
“We replaced four eval scripts and a Slack thread with one rubric and a scorecard. The release meeting takes twenty minutes now. The hold-or-ship decision is already on the page.”
Often, yes — but we can also wrap it. Most teams keep their existing test sets and bring them under one rubric. The deltas are read across all of them.
Each judge has a rolling calibration sample scored by a human panel. If a judge drifts outside its calibration band, runs that depend on it are flagged and the band is re-fit.
Every rubric is versioned and signed. When a release surfaces a missing criterion, the rubric gets a new version, the criterion is added, and the prior decisions stay tied to the version that produced them.
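A small illustration of that pinning, assuming signed rubric versions are immutable and each decision references a version rather than "the rubric". The identifiers are invented.

```python
# Signed rubric versions are immutable; decisions reference the version,
# so later revisions never rewrite the history of earlier calls.
rubric_versions = {
    "v3": {"criteria": ["grounding", "resolution"], "signed_by": "approver-01"},
    "v4": {"criteria": ["grounding", "resolution", "disclosure"], "signed_by": "approver-01"},
}

decisions = [
    {"release": "2024-10-rc1", "rubric_version": "v3", "verdict": "ship"},
    {"release": "2024-11-rc2", "rubric_version": "v4", "verdict": "hold"},
]

for d in decisions:
    v = rubric_versions[d["rubric_version"]]
    print(f"{d['release']}: {d['verdict']} under rubric {d['rubric_version']} "
          f"({', '.join(v['criteria'])}), signed by {v['signed_by']}")
```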
Yes. The studio is model-agnostic. We have run it against frontier APIs, on-prem checkpoints, and customer-fine-tuned models in the same release gate.
Every issue. Every reviewer. One screen.
Every escaped failure becomes a gate the next release cannot cross. See the page →
The record builds as the work is done. See the page →
Bring the rubric your team already trusts. We'll make it the bar every release has to clear. See the page →