Stackon
observability · regression gating

Grade your agents like you test your code.

Write an eval once — a string assertion or an LLM judge with your rubric — and run it against any trace. Bundle evals into a gate and a regressing change fails the branch-protection check before it merges.

gate · checkout-agent · passed
[contains] output-contains · ticket-id: pass
[contains] never-contains · secrets: pass
[judge] judge · tone rubric: 0.94, pass
[judge] judge · cites a source: 0.88, pass
gate passed: 4 / 4
stackon / eval-gate · $0.0021 · 1.8s

Output-contains & LLM-judge graders
Gates that block a PR from CI
Pass-rate history on every eval

01

Two graders, zero boilerplate

Start with output-contains: pass when the agent's final text contains (or never contains) a string — free and instant, the obvious first check. When you need judgment, hand the run to an LLM judge with a rubric you write. Either way you get a pass/fail, a 0–1 score, and a one-sentence reason on every run.
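To make the contains check concrete, here is a minimal Python sketch of a grader that returns the three fields described above. It is an illustration only, not Stackon's implementation: the `Verdict` type, the `output_contains` function, and the `must_contain` flag are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical result shape: pass/fail, a 0-1 score, a one-sentence reason."""
    passed: bool
    score: float
    reason: str


def output_contains(final_text: str, needle: str, must_contain: bool = True) -> Verdict:
    """Pass when `needle` is present, or absent when must_contain is False."""
    found = needle in final_text
    passed = found == must_contain
    verb = "contains" if found else "does not contain"
    # A string check is binary, so the score is simply 1.0 or 0.0.
    return Verdict(passed, 1.0 if passed else 0.0, f"Output {verb} {needle!r}.")


# Example inputs are made up for illustration.
ticket_check = output_contains("Refund filed as TICKET-4821.", "TICKET-")
secret_check = output_contains("Refund filed as TICKET-4821.", "sk-", must_contain=False)
```

Both example checks pass: the ticket marker is present and the secret prefix is absent.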

02

A judge you can trust to be strict

The judge sees the original task, the agent's output, and your rubric, then returns a structured verdict, not a vibe. It defaults to Claude Haiku 4.5 to keep grading cheap, won't invent criteria you didn't specify, and errs toward failing when a rubric is ambiguous. Judge runs honor your team budget and respect your bring-your-own-key (BYOK) and PII redaction settings.
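One way to picture a strict, structured judge is the Python sketch below. It is not Stackon's judge: the prompt wording, the JSON field names, and the `call_model` parameter (any function that sends a prompt to a model and returns its text reply) are assumptions made for illustration. The point it shows is failing closed: a reply that is not a well-formed verdict counts as a fail.

```python
import json
from typing import Callable

# Hypothetical prompt; the real rubric text is whatever you write.
PROMPT = """You are a strict grader. Judge ONLY against the rubric below.
Do not add criteria. If the rubric is ambiguous, fail the run.

Task:
{task}

Agent output:
{output}

Rubric:
{rubric}

Reply with JSON only: {{"pass": true|false, "score": <0..1>, "reason": "<one sentence>"}}"""


def judge(task: str, output: str, rubric: str, call_model: Callable[[str], str]) -> dict:
    """Ask the model for a verdict and validate it before trusting it."""
    raw = call_model(PROMPT.format(task=task, output=output, rubric=rubric))
    try:
        data = json.loads(raw)
        passed = data["pass"]
        score = float(data["score"])
        reason = str(data["reason"])
        if not isinstance(passed, bool) or not 0.0 <= score <= 1.0:
            raise ValueError("verdict out of range")
    except (ValueError, KeyError, TypeError):
        # Fail closed: free-form or malformed replies never count as a pass.
        return {"pass": False, "score": 0.0, "reason": "Judge reply was not a valid verdict."}
    return {"pass": passed, "score": score, "reason": reason}
```

Passing the model call in as a function keeps the sketch independent of any provider SDK and makes it easy to test with a stub.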

03

Bundle into a gate, block the regression

Group the evals that matter into a named gate with a slug, point it at a canvas and a test input, and the gate runs the agent then grades the resulting trace. The gate passes only when every selected eval passes — one red check and the whole gate is red.
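The all-or-nothing rule can be sketched in a few lines of Python. This is a simplified illustration, not Stackon's gate runner: it grades a final text directly rather than running an agent on a canvas, and `run_gate`, the `Grader` type, and the result fields are hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical grader shape: takes the agent's final text, returns pass/fail.
Grader = Callable[[str], bool]


def run_gate(slug: str, final_text: str, evals: Dict[str, Grader]) -> dict:
    """Run every eval in the gate; the gate passes only if all of them pass."""
    results = {name: grader(final_text) for name, grader in evals.items()}
    passed_count = sum(results.values())
    return {
        "gate": slug,
        # Assumption in this sketch: a gate with no evals does not pass.
        "passed": bool(results) and passed_count == len(results),
        "summary": f"{passed_count} / {len(results)}",
        "results": results,
    }


# Example evals and input are made up for illustration.
evals = {
    "ticket-id": lambda text: "TICKET-" in text,
    "secrets": lambda text: "sk-" not in text,
}
report = run_gate("checkout-agent", "Filed TICKET-1 with key sk-abc", evals)
```

Here `report["passed"]` is false even though one of the two evals passed, because a single failing eval fails the gate.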

canvas · pr-review · running
Planner: done
Coder: live
Reviewer: queued
agent.run · 3 spans · 2 / 3 nodes · streaming

04

Wire it into CI in one step

Drop the generated GitHub Action into your repo, add a scoped token secret, and every pull request POSTs to your gate's run endpoint. Add the check to branch protection and a regression blocks merge — with the verdict, cost, and a deep link back to the exact trace in the run summary.
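For a sense of what a CI step like this does, here is a rough Python sketch using only the standard library. It is not the generated GitHub Action, and every specific in it is an assumption: the Bearer-token header, the JSON payload, and the `passed` and `trace_url` response fields are placeholders, so check your gate's actual request and response format before relying on any of them.

```python
import json
import urllib.request


def build_request(run_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build the POST to the gate's run endpoint (auth scheme is an assumption)."""
    return urllib.request.Request(
        run_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


def exit_code(verdict: dict) -> int:
    """Map a verdict to a process exit code; anything but an explicit pass fails."""
    return 0 if verdict.get("passed") is True else 1


def run_check(run_url: str, token: str, payload: dict) -> int:
    """Send the request, print the trace link if present, return the exit code."""
    with urllib.request.urlopen(build_request(run_url, token, payload), timeout=60) as resp:
        verdict = json.load(resp)
    # `trace_url` is a hypothetical field name for the deep link to the trace.
    print(verdict.get("trace_url", "(no trace link in response)"))
    return exit_code(verdict)
```

In a CI job you would read the endpoint URL and token from secrets and finish with `sys.exit(run_check(...))`. A non-zero exit marks the check as failed, which is what lets branch protection block the merge.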

trace · run_8c4f · ok · 742ms · $0.0053
agent.plan: 742ms
tools.search_code: 86ms
llm.complete_refactor: 612ms
tools.edit_file: 78ms
evals.no_regression: 54ms
agent · llm · tool · eval
5 spans · 3,007 tok

Grader kinds: 2 (contains + judge)

Default judge: Claude Haiku 4.5

CI hook: 1 GitHub Action

Speed plus trust — prove your agents got better this week.

Evals is one piece of Stackon, the observability-first workspace for teams running Claude and Codex. Start free and instrument your first run today.