To add evaluations to an Agentuity project, set up a workflow that runs normal app code, captures a stable output, checks it, and records enough logs or traces to explain failures. Use deterministic assertions for structural checks, LLM-as-judge checks for qualitative behavior, OpenTelemetry traces to inspect failures, and hosted eval tools when you need datasets, reviewers, or reports.
```bash
npm install @agentuity/aigateway zod @agentuity/telemetry
```

The v2 @agentuity/evals package and platform collection model are migration context, not the current eval surface. Keep evals in your app repo as scripts, tests, traces, and optional external tooling.
When to Use This Path
Use this path when the behavior under test is app-owned: a framework route, retrieval function, queue worker, sandbox command, or coding-agent session. Agentuity gives you service clients, AI Gateway auth, telemetry, sandbox execution, and session inspection. You still bring the test runner, fixtures, judge rubric, hosted eval tool, and pass/fail policy.
| Need | Use |
|---|---|
| deterministic contracts | your test runner with schema validation |
| qualitative answer grading | an LLM-as-judge script or hosted eval tool |
| model access through project credentials | AIGatewayClient with a catalog model ID |
| command or generated-code checks | Sandbox |
| failure investigation | Logging, Tracing, and Sessions & Debugging |
Shape Evals Like Tests
| Part | Keep Stable |
|---|---|
| input fixture | user prompt, retrieved context, tool responses, headers, or test data |
| app invocation | the framework route, service function, agent entrypoint, or sandbox command under test |
| output contract | JSON fields, report path, status enum, citations, changed files, or answer text |
| check | deterministic assertion, schema validation, judge rubric, or hosted eval score |
| observability | run id, fixture id, trace id, model name, duration, pass/fail, and failure reason |
That shape keeps evals close to the code they protect. A CI job can run the same script as a local developer, and a failed run can point back to logs, spans, and the exact output that broke the contract.
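The five parts in the table can be sketched as a small, dependency-free harness. This is an illustration under assumptions, not an Agentuity API: `EvalCase`, `EvalRecord`, `runEvalCase`, and the stub `answerQuestion` are hypothetical names, and the stub stands in for whatever route or service function your app owns.

```typescript
// Hypothetical harness types for illustration; none of these names come from an Agentuity package.
interface EvalCase<Input, Output> {
  readonly fixtureId: string;
  readonly input: Input;
  // The app code under test: a route handler, service function, or agent entrypoint.
  readonly invoke: (input: Input) => Promise<Output>;
  // Returns null when the output satisfies the contract, otherwise a failure reason.
  readonly check: (output: Output) => string | null;
}

interface EvalRecord {
  readonly runId: string;
  readonly fixtureId: string;
  readonly durationMs: number;
  readonly passed: boolean;
  readonly failureReason: string | null;
}

async function runEvalCase<Input, Output>(
  runId: string,
  evalCase: EvalCase<Input, Output>,
): Promise<EvalRecord> {
  const startedAt = Date.now();
  let failureReason: string | null;
  try {
    const output = await evalCase.invoke(evalCase.input);
    failureReason = evalCase.check(output);
  } catch (error) {
    // A thrown error is recorded as a failed eval rather than crashing the run.
    failureReason = error instanceof Error ? error.message : String(error);
  }
  return {
    runId,
    fixtureId: evalCase.fixtureId,
    durationMs: Date.now() - startedAt,
    passed: failureReason === null,
    failureReason,
  };
}

// Stub standing in for real app code (assumption for this sketch).
async function answerQuestion(question: string): Promise<{ answer: string }> {
  return { answer: `For "${question}": run project import first, then deploy.` };
}

const importBeforeDeploy: EvalCase<string, { answer: string }> = {
  fixtureId: 'import-before-deploy',
  input: 'How do I deploy an existing app?',
  invoke: answerQuestion,
  check: (output) =>
    output.answer.includes('import') ? null : 'answer does not mention import',
};
```

Calling `runEvalCase('local-1', importBeforeDeploy)` resolves to a record carrying the run id, fixture id, duration, and pass/fail result, which is the minimum a log line or span needs to explain a failure.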
Define the Output Contract
Define the output contract before adding a judge. It is the part your app, CI job, or reviewer can inspect without reading model prose.
```json
{
  "status": "passed",
  "answer": "For an existing app, run project import before deploy.",
  "sources": ["docs/src/web/content/get-started/import-existing-app.mdx"],
  "checks": [
    {
      "name": "mentions import before deploy",
      "passed": true
    }
  ],
  "blockers": []
}
```

For coding-agent tasks, include the expected report path, verification command, changed files, and blocker format in the task prompt. See Building Coding Agents for a fuller output-contract pattern.
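A contract like the JSON example above can be checked deterministically before any judge runs. The following is a minimal, dependency-free sketch; in a project that already installs zod, a schema would do the same job. `validateEvalOutput` is a hypothetical helper, and the status values other than `passed` are assumptions, since the example only shows that one.

```typescript
// Assumed status values; the example contract only shows "passed".
const ALLOWED_STATUSES: readonly string[] = ['passed', 'failed', 'blocked'];

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === 'string');
}

// Returns a list of contract violations; an empty list means the output is valid.
function validateEvalOutput(value: unknown): string[] {
  if (typeof value !== 'object' || value === null) {
    return ['output must be a JSON object'];
  }
  const record = value as Record<string, unknown>;
  const problems: string[] = [];

  if (typeof record.status !== 'string' || !ALLOWED_STATUSES.includes(record.status)) {
    problems.push('status must be one of: ' + ALLOWED_STATUSES.join(', '));
  }
  if (typeof record.answer !== 'string' || record.answer.length === 0) {
    problems.push('answer must be a non-empty string');
  }
  if (!isStringArray(record.sources)) {
    problems.push('sources must be an array of strings');
  }
  if (!isStringArray(record.blockers)) {
    problems.push('blockers must be an array of strings');
  }

  const checks = record.checks;
  if (!Array.isArray(checks)) {
    problems.push('checks must be an array');
  } else {
    checks.forEach((check: unknown, index: number) => {
      // Treat non-object entries as empty so the field checks below report them.
      const entry = (typeof check === 'object' && check !== null ? check : {}) as Record<string, unknown>;
      if (typeof entry.name !== 'string' || typeof entry.passed !== 'boolean') {
        problems.push(`checks[${index}] must have a string name and a boolean passed`);
      }
    });
  }
  return problems;
}
```

Returning every violation, rather than throwing on the first one, gives a failed CI run a complete failure reason in a single pass.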
Run an LLM-as-Judge Check
This script grades an answer against a rubric. Drop it into a Bun or Node script that has Agentuity project credentials or runs under agentuity dev.
Because this uses AIGatewayClient, AGENTUITY_SDK_KEY is enough for Gateway auth. Use provider keys only when the eval is testing provider SDK behavior.
```typescript
import { AIGatewayClient } from '@agentuity/aigateway';
import { z } from 'zod';
import { logger } from '@agentuity/telemetry';

const gateway = new AIGatewayClient();
const JUDGE_MODEL = 'anthropic/claude-opus-4-8';

// Shape the judge must return; anything else fails parsing.
const judgmentSchema = z.object({
  score: z.number().min(0).max(1),
  passed: z.boolean(),
  reason: z.string(),
});

async function judgeAnswer(input: {
  readonly question: string;
  readonly answer: string;
  readonly rubric: string;
}): Promise<z.infer<typeof judgmentSchema>> {
  const { data } = await gateway.completeStructured({
    model: JUDGE_MODEL,
    messages: [
      {
        role: 'user',
        content: `Grade the answer from 0 to 1.
Question:
${input.question}
Answer:
${input.answer}
Rubric:
${input.rubric}
Return a passing score only when the answer satisfies the rubric.`,
      },
    ],
    response_schema: { name: 'judgment', schema: judgmentSchema },
  });

  // Validate the judge output before trusting it.
  return judgmentSchema.parse(data);
}

const judgment = await judgeAnswer({
  question: 'How do I deploy an imported Agentuity app?',
  answer: 'Run agentuity project import first, then run agentuity deploy from the app directory.',
  rubric: 'Mentions import before deploy and avoids claiming deploy imports automatically.',
});

logger.info('eval complete', judgment);
```

Trace the Run
Wrap eval runs with spans when you need to inspect the app call, model choice, tool behavior, assertion, and judge result later. Prefer attributes such as eval.run_id, eval.fixture_id, eval.contract_version, eval.passed, and eval.failure_reason; avoid raw prompts, secrets, and personal data in span attributes. See Tracing for OpenTelemetry setup and exporter configuration.
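One way to keep span attributes limited to the identifiers listed above is to build them from a typed record that has no field for prompts or answers. This is a sketch under assumptions: `EvalSpanInput` and `buildEvalSpanAttributes` are hypothetical names, and the 200-character cap on the failure reason is an arbitrary choice for illustration.

```typescript
type SpanAttributeValue = string | number | boolean;

// Hypothetical input shape; fields mirror the attribute names suggested above.
// There is deliberately no field for prompts, answers, or user data.
interface EvalSpanInput {
  readonly runId: string;
  readonly fixtureId: string;
  readonly contractVersion: string;
  readonly passed: boolean;
  readonly failureReason?: string;
}

// Assumed cap so a long failure message cannot carry a whole prompt into a span.
const MAX_REASON_LENGTH = 200;

function buildEvalSpanAttributes(input: EvalSpanInput): Record<string, SpanAttributeValue> {
  const attributes: Record<string, SpanAttributeValue> = {
    'eval.run_id': input.runId,
    'eval.fixture_id': input.fixtureId,
    'eval.contract_version': input.contractVersion,
    'eval.passed': input.passed,
  };
  // Only failed runs carry a reason, truncated to the cap.
  if (!input.passed && input.failureReason !== undefined) {
    attributes['eval.failure_reason'] = input.failureReason.slice(0, MAX_REASON_LENGTH);
  }
  return attributes;
}
```

With the OpenTelemetry JavaScript API, the returned record can be passed to `span.setAttributes(...)` on the span that wraps the eval run.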
Choose the Eval Runner
Start with the runner that gives the check enough context. Hosted tools help with datasets, experiment comparison, reviewers, and reports. Unit and integration tests are better for deterministic checks such as JSON shape, latency budget, retrieval count, or exact routing behavior.
| Check | Good fit |
|---|---|
| JSON shape or required fields | unit test plus schema |
| Tool routing | integration test with mocks or fixtures |
| Prompt and model regression suite | OpenAI Evals API |
| Grounding against sources | LLM-as-judge |
| App-function regression suite with reports | Braintrust Evals or your eval runner |
| Debugging one bad request | OpenTelemetry traces and logs |
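The deterministic rows in the table need no model call. As a rough sketch, a unit or integration test could compare one observed request against fixed expectations for routing, latency budget, and retrieval count. `RequestObservation`, `RequestExpectation`, and `checkObservation` are hypothetical names for illustration; how you capture the observation depends on your app and test runner.

```typescript
// Hypothetical observation captured from one request in a test.
interface RequestObservation {
  readonly route: string;
  readonly latencyMs: number;
  readonly retrievedCount: number;
}

// Hypothetical expectation stored alongside the fixture.
interface RequestExpectation {
  readonly route: string;
  readonly maxLatencyMs: number;
  readonly minRetrieved: number;
  readonly maxRetrieved: number;
}

// Returns one message per violated expectation; an empty list means the check passed.
function checkObservation(
  observed: RequestObservation,
  expected: RequestExpectation,
): string[] {
  const failures: string[] = [];
  if (observed.route !== expected.route) {
    failures.push(`routed to ${observed.route}, expected ${expected.route}`);
  }
  if (observed.latencyMs > expected.maxLatencyMs) {
    failures.push(`latency ${observed.latencyMs}ms exceeds budget ${expected.maxLatencyMs}ms`);
  }
  if (
    observed.retrievedCount < expected.minRetrieved ||
    observed.retrievedCount > expected.maxRetrieved
  ) {
    failures.push(
      `retrieved ${observed.retrievedCount} items, expected ${expected.minRetrieved} to ${expected.maxRetrieved}`,
    );
  }
  return failures;
}
```

Because the function is pure, the same check can run in a local test, a CI job, or as a pre-filter before a more expensive judge call.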
Next Steps
- LLM as a Judge: score model output with a rubric in code
- OpenAI Evals API: run OpenAI-hosted evals when that service owns the prompt regression suite
- Braintrust Evals: run app-function evals with reports
- AI Gateway API Reference: inspect model catalog and completion request fields
- Sessions API Reference: inspect generated REST details for session records