observability · regression gating

Grade your agents like you test your code.

Write an eval once — a string assertion or an LLM judge with your rubric — and run it against any trace. Bundle evals into a gate and a regressing change fails the branch-protection check before it merges.


gate · checkout-agent · passed

output-contains · ticket-id · pass

never-contains · secrets · pass

judge · tone rubric · 0.94 · pass

judge · cites a source · 0.88 · pass

gate passed · 4 / 4

stackon / eval-gate · $0.0021 · 1.8s

Output-contains & LLM-judge graders · Gates that block a PR from CI · Pass-rate history on every eval

Two graders, zero boilerplate

Start with output-contains: pass when the agent's final text contains (or never contains) a string — free and instant, the obvious first check. When you need judgment, hand the run to an LLM judge with a rubric you write. Either way you get a pass/fail, a 0–1 score, and a one-sentence reason on every run.
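As a rough illustration of the output-contains check described above, here is a minimal Python sketch. The `Verdict` type and the `output_contains` function are hypothetical names chosen for this example, not the product's API; the only behavior taken from the text is the pass/fail result, the 0–1 score, and the one-sentence reason.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical result shape: pass/fail, a 0-1 score, a one-sentence reason."""
    passed: bool
    score: float
    reason: str


def output_contains(final_text: str, needle: str, must_contain: bool = True) -> Verdict:
    """Pass when `needle` is present, or absent when must_contain is False."""
    found = needle in final_text
    passed = found == must_contain
    verb = "contains" if found else "does not contain"
    # A string check is binary, so the score is simply 1.0 or 0.0.
    return Verdict(passed, 1.0 if passed else 0.0, f"Output {verb} {needle!r}.")
```

Calling `output_contains(text, "sk-", must_contain=False)` would model the never-contains variant, for example to check that no secret-like prefix leaks into the final text.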


A judge you can trust to be strict

The judge sees the original task, the agent's output, and your rubric, then returns a structured verdict, not a vibe. It defaults to Claude Haiku 4.5 to keep grading cheap, won't invent criteria you didn't specify, and errs toward failing when a rubric is ambiguous. Judge runs honor your team budget and respect your BYOK and PII redaction settings.
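To make "a structured verdict" concrete, here is a hedged Python sketch of how a judge prompt could be assembled and how a reply could be validated so that anything malformed fails closed. The prompt wording, the JSON field names (`pass`, `score`, `reason`), and both function names are assumptions for illustration; the actual judge format is not documented here, and no model is called.

```python
import json


def build_judge_prompt(task: str, output: str, rubric: str) -> str:
    """Assemble the three inputs the judge sees: task, output, and rubric."""
    return (
        "Grade the agent output against the rubric only. "
        "Do not add criteria. If the rubric is ambiguous, fail.\n\n"
        f"TASK:\n{task}\n\nOUTPUT:\n{output}\n\nRUBRIC:\n{rubric}\n\n"
        'Reply with JSON: {"pass": bool, "score": 0-1, "reason": "one sentence"}'
    )


def parse_verdict(raw: str) -> dict:
    """Validate a judge reply; any malformed reply becomes a failing verdict."""
    fail = {"pass": False, "score": 0.0, "reason": "Judge reply was not a valid verdict."}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fail
    if not isinstance(data, dict):
        return fail
    passed, score, reason = data.get("pass"), data.get("score"), data.get("reason")
    # bool is a subclass of int in Python, so reject it explicitly as a score.
    if not isinstance(passed, bool) or isinstance(score, bool):
        return fail
    if not isinstance(score, (int, float)) or not isinstance(reason, str):
        return fail
    if not 0.0 <= score <= 1.0:
        return fail
    return {"pass": passed, "score": float(score), "reason": reason}
```

Failing closed on an unparseable reply mirrors the stated design choice of erring toward failure when something is ambiguous.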


Bundle into a gate, block the regression

Group the evals that matter into a named gate with a slug and point it at a canvas and a test input. The gate runs the agent, then grades the resulting trace. It passes only when every selected eval passes: one red check and the whole gate is red.
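The all-or-nothing rule can be sketched in a few lines of Python. The function names and the mapping of eval name to pass/fail are hypothetical, and treating a gate with no evals as failing is an assumption of this sketch rather than documented behavior.

```python
def gate_passed(results: dict[str, bool]) -> bool:
    """A gate passes only when every selected eval passed.

    Assumption: a gate with no evals is treated as failing.
    """
    return bool(results) and all(results.values())


def summarize(results: dict[str, bool]) -> str:
    """Render a one-line summary such as 'gate passed 4 / 4'."""
    passed = sum(results.values())  # True counts as 1, False as 0
    status = "passed" if gate_passed(results) else "failed"
    return f"gate {status} {passed} / {len(results)}"
```

With four passing evals, `summarize` yields `gate passed 4 / 4`; flipping any one of them to `False` yields `gate failed 3 / 4`.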

canvas · pr-review · running

Planner · done

Coder · live

Reviewer · queued

agent.run · 3 spans · 2 / 3 nodes · streaming

Wire it into CI in one step

Drop the generated GitHub Action into your repo, add a scoped token secret, and every pull request POSTs to your gate's run endpoint. Add the check to branch protection and a regression blocks merge — with the verdict, cost, and a deep link back to the exact trace in the run summary.
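For readers who want to see roughly what such a CI step does, here is a Python sketch of the POST and the exit-code decision, using only the standard library. The URL path, the bearer-token header, the request payload, and the `passed` response field are all assumptions made for illustration; the generated GitHub Action defines the real request, so treat this as a stand-in and not as the product's API.

```python
import json
import urllib.request


def build_gate_request(base_url: str, slug: str, token: str, commit_sha: str) -> urllib.request.Request:
    """Build the POST that triggers a gate run (hypothetical endpoint and payload)."""
    body = json.dumps({"commit": commit_sha}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/gates/{slug}/run",  # hypothetical path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # the scoped token from the repo secret
            "Content-Type": "application/json",
        },
    )


def run_gate(req: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON verdict."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def exit_code_for(verdict: dict) -> int:
    """Return 0 only on an explicit pass, so a missing field fails the check."""
    return 0 if verdict.get("passed") is True else 1
```

A CI job would call `sys.exit(exit_code_for(run_gate(req)))`, so a failing gate produces a non-zero exit status and a red check that branch protection can enforce.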

trace · run_8c4f · ok · 742ms · $0.0053

agent.plan · 742ms

tools.search_code · 86ms

llm.complete_refactor · 612ms

tools.edit_file · 78ms

evals.no_regression · 54ms

agent · llm · tool · eval · 5 spans · 3,007 tok

Grader kinds: 2 (contains + judge)

Default judge: Claude Haiku 4.5

1 GitHub Action