Grade your agents like you test your code.
Write an eval once — a string assertion or an LLM judge with your rubric — and run it against any trace. Bundle evals into a gate and a regressing change fails the branch-protection check before it merges.
01
Two graders, zero boilerplate
Start with output-contains: pass when the agent's final text contains (or never contains) a string — free and instant, the obvious first check. When you need judgment, hand the run to an LLM judge with a rubric you write. Either way you get a pass/fail, a 0–1 score, and a one-sentence reason on every run.
02
A judge you can trust to be strict
The judge sees the original task, the agent's output, and your rubric — then returns a structured verdict, not a vibe. It defaults to Claude Haiku 4.5 to keep grading cheap, won't invent criteria you didn't specify, and errs toward failing when a rubric is ambiguous. Judge runs honor your team budget and respect BYOK + PII redaction.
03
Bundle into a gate, block the regression
Group the evals that matter into a named gate with a slug, point it at a canvas and a test input, and the gate runs the agent then grades the resulting trace. The gate passes only when every selected eval passes — one red check and the whole gate is red.
04
Wire it into CI in one step
Drop the generated GitHub Action into your repo, add a scoped token secret, and every pull request POSTs to your gate's run endpoint. Add the check to branch protection and a regression blocks merge — with the verdict, cost, and a deep link back to the exact trace in the run summary.
2 · contains + judge
Grader kinds
Claude Haiku 4.5
Default judge
1 GitHub Action
CI hook
Part of one platform
Evals works hand in hand with Observability.
Speed plus trust — prove your agents got better this week.
Evals is one piece of Stackon, the observability-first workspace for teams running Claude and Codex. Start free and instrument your first run today.