Define the rubric
Encode the judgment your team already trusts. Versioned, reviewable, and the same on every run.
→For the teams who stopped trusting the eval script. Codify the rubric, score every release against it, and ship the result with proof attached.
Your judges. Your weights. Versioned.
Same rubric. Different model. Same standard.
Scorecards and reviewer notes follow the release.
Codify the rubric. Run every release against it. Ship with the proof.
Encode the judgment your team already trusts. Versioned, reviewable, and the same on every run.
→Each release passes through the same rubric. Drift, regression, and bias show up before the gate.
→Scorecards, reviewer notes, and decisions stay with the release. Sealed when it ships.
Rubric Studio Cloud turns prompt context, model outputs, and expert judgment into approved criteria, criterion-level grading, evidence capture, scorecards, and regression memory. Four named steps. One evaluation record.
The model reads the prompt, retrieved context, and prior failures. It proposes criteria and warnings. Nothing activates yet.
A named approver reads the draft, edits the criteria, sets the weights, and signs the version. The history is permanent.
Reviewers grade each criterion with the rubric and the evidence visible. Blocker reasons are first-class fields, not free-text notes.
Each grade contributes to failure breakdowns, regression bank entries, and the model scorecard the release team carries to approval.
A rubric that holds up under release pressure is not a Likert scale. It is a weighted contract with ground-truth anchors, a calibration loop, and a memory of the cases that have already escaped.
Each criterion has a weight, a passing threshold, and a fail-state contract. A release that wins on grounding but loses on disclosure is not an averaged-out pass — it is a hold.
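A minimal sketch of that contract in Python. The criterion names, weights, and thresholds are invented for illustration, not the studio's schema; the point is that a per-criterion floor forces a hold even when the weighted total looks like a pass.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float      # contribution to the weighted total
    threshold: float   # per-criterion floor; falling below it is a hold

def evaluate(criteria: list[Criterion], grades: dict[str, float]) -> tuple[float, list[str]]:
    """Return the weighted total and the criteria that trip a hold."""
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * grades[c.name] for c in criteria) / total_weight
    holds = [c.name for c in criteria if grades[c.name] < c.threshold]
    return weighted, holds

# Illustrative rubric: the release scores well overall but misses the
# disclosure floor, so the result is a hold, not an averaged-out pass.
rubric = [
    Criterion("grounding", weight=0.4, threshold=0.7),
    Criterion("resolution", weight=0.4, threshold=0.7),
    Criterion("disclosure", weight=0.2, threshold=0.9),
]
score, holds = evaluate(rubric, {"grounding": 0.95, "resolution": 0.9, "disclosure": 0.6})
print(f"weighted score: {score:.2f}")    # ~0.86: looks like a pass
print(f"hold on: {holds or 'nothing'}")  # hold on: ['disclosure']
```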
We anchor each scoring run with a small set of ground-truth cases the team has agreed on. Drift in the score against the anchors is itself a signal.
Each run produces a side-by-side: the current production model and the candidate, scored on the same cases against the same criteria. The deltas are reviewable.
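A rough sketch of how that side-by-side might be computed, assuming per-criterion scores already exist for both models on the shared anchor cases. All names and numbers below are illustrative.

```python
# Per-criterion scores on the same anchor cases, production vs. candidate.
# The values are invented; the shape of the comparison is the point.
production = {"grounding": 0.91, "resolution": 0.84, "disclosure": 0.93}
candidate  = {"grounding": 0.94, "resolution": 0.88, "disclosure": 0.79}

DRIFT_TOLERANCE = 0.05  # how far an anchor score may move before it is flagged

print(f"{'criterion':<12} {'prod':>6} {'cand':>6} {'delta':>7}")
for name in production:
    delta = candidate[name] - production[name]
    flag = "  <- review" if abs(delta) > DRIFT_TOLERANCE else ""
    print(f"{name:<12} {production[name]:>6.2f} {candidate[name]:>6.2f} {delta:>+7.2f}{flag}")
```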
AI judges run first for scale, with calibration scores held against a human reviewer panel. Disagreement above threshold routes to expert review with the case attached.
The cases the rubric missed get added to the next release's required set. Misses do not get to escape twice.
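One possible shape for that memory: a hypothetical regression bank where every escaped case is appended once and read back as a required check on every later run. The file name and field names are assumptions, not the studio's format.

```python
import json
from pathlib import Path

BANK = Path("regression_bank.jsonl")  # hypothetical on-disk location

def record_escape(case_id: str, criterion: str, release: str) -> None:
    """Append an escaped case; it becomes a required check from now on."""
    entry = {"case_id": case_id, "criterion": criterion, "escaped_in": release}
    with BANK.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def required_cases() -> list[dict]:
    """Every prior escape is part of the next release's required set."""
    if not BANK.exists():
        return []
    return [json.loads(line) for line in BANK.read_text().splitlines() if line]

record_escape("case-4117", criterion="disclosure", release="2024-11-rc2")
for case in required_cases():
    print(f"must pass: {case['case_id']} ({case['criterion']}, escaped in {case['escaped_in']})")
```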
Every run leaves something the team can act on — and something the next release has to clear.
One read on what passed, what failed, and what needs another look.
Cases the rubric flags get routed to the right reviewer with the rubric reading attached.
Every escaped case becomes a repeatable check the next release must pass.
What changed. Who approved it. What the rubric said at the time.
Rubric, reviewer notes, and verdict — ready when someone asks.
Every score in the studio is tied to the exact prompt, retrieval, tool call, answer, judge reading, and human override that produced it. The reviewer never has to ask “what did the model see?” — it is attached.
The customer or test case the run is grounded on.
→Retrieved context, policy lookups, and prior case memory.
→External calls — account, policy, knowledge base, action.
→The candidate's reply, attached to the path that produced it.
→AI judge scoring each criterion with calibration to humans.
→Reviewer accepts, edits, or holds the verdict with reason.
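A rough sketch of the shape of that record, assuming a flat structure. The class and field names below are illustrative; the studio's actual schema is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvidenceRecord:
    """Everything a reviewer needs to answer 'what did the model see?'."""
    case_id: str                      # the customer or test case the run is grounded on
    prompt: str                       # the exact prompt as sent
    retrieved_context: list[str]      # retrieval, policy lookups, prior case memory
    tool_calls: list[dict]            # external calls: account, policy, knowledge base, action
    answer: str                       # the candidate's reply, tied to the path that produced it
    judge_scores: dict[str, float]    # AI judge reading per criterion
    human_override: Optional[str] = None    # reviewer accept / edit / hold, with reason
    reviewer_notes: list[str] = field(default_factory=list)
```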
Customer asks for a refund exception after the policy window. The candidate offered partial credit but missed the disclosure copy the policy requires. AI judges scored 82/100 with escalation clarity below threshold. The reviewer holds the release until the disclosure copy is fixed. Forty-one cases route to the regression bank in the same step.
The judges run in parallel for scale. Each one is calibrated against a human panel on a rolling sample. When the judges disagree above threshold — or when either one falls outside its calibration band — the case routes to a reviewer with the rubric, the case, and the AI readings attached.
Reads the candidate's answer against the policy and grades grounding, citation accuracy, and contradiction risk.
Reads the candidate's answer against the customer's stated need and grades resolution, tone, and downstream effect.
A human reviewer sees both AI judge readings and the case. They accept, edit, or hold. Their verdict signs the release packet.
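A minimal sketch of the routing rule described above, assuming one score per judge and a calibration check computed elsewhere. The threshold and case identifiers are invented for illustration.

```python
DISAGREEMENT_THRESHOLD = 0.15  # max gap between the two judges before a human looks

def route(case_id: str,
          policy_judge: float,
          outcome_judge: float,
          calibration_ok: bool) -> str:
    """Decide whether the AI verdict stands or the case goes to a reviewer."""
    disagreement = abs(policy_judge - outcome_judge)
    if not calibration_ok or disagreement > DISAGREEMENT_THRESHOLD:
        # The case, the rubric, and both AI readings travel with the routing.
        return f"{case_id}: expert review (disagreement={disagreement:.2f})"
    return f"{case_id}: AI verdict accepted (disagreement={disagreement:.2f})"

print(route("case-4117", policy_judge=0.62, outcome_judge=0.88, calibration_ok=True))
print(route("case-2093", policy_judge=0.81, outcome_judge=0.84, calibration_ok=True))
```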
Test the run. Review the hard cases. Recruit the right specialist. Remember the misses. Approve what's right.
“We replaced four eval scripts and a Slack thread with one rubric and a scorecard. The release meeting takes twenty minutes now. The hold-or-ship decision is already on the page.”
Often, yes — but we can also wrap it. Most teams keep their existing test sets and bring them under one rubric. The deltas are read across all of them.
Each judge has a rolling calibration sample scored by a human panel. If a judge drifts outside its calibration band, runs that depend on it are flagged and the band is re-fit.
Every rubric is versioned and signed. When a release surfaces a missing criterion, the rubric gets a new version, the criterion is added, and the prior decisions stay tied to the version that produced them.
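A small illustration of that pinning, assuming signed rubric versions are immutable and each decision references a version rather than "the rubric". The identifiers are invented.

```python
# Signed rubric versions are immutable; decisions reference the version,
# so later revisions never rewrite the history of earlier calls.
rubric_versions = {
    "v3": {"criteria": ["grounding", "resolution"], "signed_by": "approver-01"},
    "v4": {"criteria": ["grounding", "resolution", "disclosure"], "signed_by": "approver-01"},
}

decisions = [
    {"release": "2024-10-rc1", "rubric_version": "v3", "verdict": "ship"},
    {"release": "2024-11-rc2", "rubric_version": "v4", "verdict": "hold"},
]

for d in decisions:
    v = rubric_versions[d["rubric_version"]]
    print(f"{d['release']}: {d['verdict']} under rubric {d['rubric_version']} "
          f"({', '.join(v['criteria'])}), signed by {v['signed_by']}")
```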
Yes. The studio is model-agnostic. We have run it against frontier APIs, on-prem checkpoints, and customer-fine-tuned models in the same release gate.
Every issue. Every reviewer. One screen.
Every escaped failure becomes a gate the next release cannot cross. See the page →
The record builds as the work is done. See the page →
Bring the rubric your team already trusts. We'll make it the bar every release has to clear. See the page →