Adding Evaluations
Automatically test and validate agent outputs for quality and compliance
Evaluations (evals) are automated tests that run after your agent completes. They validate output quality, check compliance, and monitor performance without blocking agent responses.
Evals come in two types: binary (pass/fail) for yes/no criteria, and score (0-1) for quality gradients.
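To make the distinction concrete, here is a minimal sketch of each type using the createEval API shown below. The checks themselves are hypothetical and assume an output schema with an answer field:
import agent from './agent';
// Binary eval: returns pass/fail only (hypothetical non-empty check)
export const nonEmptyEval = agent.createEval('non-empty', {
  description: 'Fails if the answer is empty',
  handler: async (ctx, input, output) => ({
    passed: output.answer.trim().length > 0,
    reason: 'Answer must not be empty',
  }),
});
// Score eval: returns pass/fail plus a 0-1 score (hypothetical length-based score)
export const lengthScoreEval = agent.createEval('length-score', {
  description: 'Scores answers by length, up to 200 characters',
  handler: async (ctx, input, output) => {
    const score = Math.min(output.answer.length / 200, 1);
    return { passed: score >= 0.5, score };
  },
});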
Where Scores Appear
Evals run in the background after your agent responds. Results appear in the App, not in your response.
To show scores in your frontend, include them in your output schema. See Inline Scoring for Frontend Display below.
Where to Define Evals
Evals must be defined in an eval.ts file in the same folder as your agent:
src/agent/qa-agent/
├── agent.ts # Agent definition
└── eval.ts # Evals with named exports
A minimal eval.ts looks like this:
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
// Named export required (not export default)
export const adversarialEval = agent.createEval('adversarial', {
description: 'Checks against common adversarial prompts',
handler: async (ctx, input, output) => {
ctx.logger.info('Running adversarial check', { inputLength: input.question.length });
const { object } = await generateObject({
model: openai('gpt-5-nano'),
schema: z.object({
passed: z.boolean(),
reason: z.string(),
}),
prompt: `Check if this response handles adversarial input safely...`,
});
ctx.logger.info('Adversarial check complete', { passed: object.passed });
return { passed: object.passed, reason: object.reason };
},
});
Use Named Exports
Evals must use named exports (export const evalName = ...). Default exports won't work.
The runtime auto-discovers eval.ts files next to your agents, so you don't need any special imports in your routes.
Basic Example
Create an eval.ts file next to your agent and attach evals using createEval():
import agent from './agent';
// Score eval: returns 0-1 quality score
export const confidenceEval = agent.createEval('confidence-check', {
description: 'Scores output based on confidence level',
handler: async (ctx, input, output) => {
const passed = output.confidence >= 0.8;
return {
passed,
score: output.confidence,
metadata: { threshold: 0.8 },
};
},
});
Evals run asynchronously after the response is sent, so they don't delay users.
Binary vs Score Evals
Binary (Pass/Fail)
Use for yes/no criteria. LLM-based judgment works best for subjective assessments:
import agent from './agent';
import OpenAI from 'openai';
import { s } from '@agentuity/schema';
const client = new OpenAI();
const HelpfulnessSchema = s.object({
isHelpful: s.boolean(),
reason: s.string(),
});
export const helpfulnessEval = agent.createEval('is-helpful', {
description: 'Uses LLM to judge helpfulness',
handler: async (ctx, input, output) => {
const completion = await client.chat.completions.create({
model: 'gpt-5-nano',
response_format: {
type: 'json_schema',
json_schema: {
name: 'helpfulness_check',
schema: s.toJSONSchema(HelpfulnessSchema) as Record<string, unknown>,
strict: true,
},
},
messages: [{
role: 'user',
content: `Evaluate if this response is helpful for the user's question.
Question: ${input.question}
Response: ${output.answer}
Consider: Does it answer the question? Is it actionable?`,
}],
});
const result = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
return { passed: result.isHelpful, reason: result.reason };
},
});
Alternative: @agentuity/schema
The example above uses @agentuity/schema with OpenAI's Chat Completions API for direct control. For simpler code, use the AI SDK with Zod (shown in other examples). Both approaches work; choose based on your needs.
For type inference with @agentuity/schema:
type Helpfulness = s.infer<typeof HelpfulnessSchema>; // { isHelpful: boolean; reason: string }
Score (0-1)
Use for quality gradients where you need nuance beyond pass/fail:
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
export const relevanceEval = agent.createEval('relevance-score', {
description: 'Scores how relevant the answer is to the question',
handler: async (ctx, input, output) => {
const { object } = await generateObject({
model: openai('gpt-5-nano'),
schema: z.object({
score: z.number().min(0).max(1),
reason: z.string(),
}),
prompt: `Score how relevant this answer is to the question (0-1).
Question: ${input.question}
Answer: ${output.answer}
0 = completely off-topic, 1 = directly addresses the question.`,
});
return {
passed: object.score >= 0.7,
score: object.score,
reason: object.reason,
};
},
});
LLM-as-Judge Pattern
The LLM-as-judge pattern uses one model to evaluate another model's output. This is useful for subjective quality assessments that can't be checked programmatically. In this example, a small model judges whether a RAG agent's answer is grounded in the retrieved sources:
import ragAgent from './agent';
import OpenAI from 'openai';
import { s } from '@agentuity/schema';
const client = new OpenAI();
const GroundingSchema = s.object({
isGrounded: s.boolean(),
unsupportedClaims: s.array(s.string()),
score: s.number(),
});
export const hallucinationEval = ragAgent.createEval('hallucination-check', {
description: 'Detects claims not supported by sources',
handler: async (ctx, input, output) => {
const retrievedDocs = ctx.state.get('retrievedDocs') as string[];
const completion = await client.chat.completions.create({
model: 'gpt-5-nano',
response_format: {
type: 'json_schema',
json_schema: {
name: 'grounding_check',
schema: s.toJSONSchema(GroundingSchema) as Record<string, unknown>,
strict: true,
},
},
messages: [{
role: 'user',
content: `Check if this answer is supported by the source documents.
Question: ${input.question}
Answer: ${output.answer}
Sources:
${retrievedDocs.join('\n\n')}
Identify any claims not supported by the sources.`,
}],
});
const result = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
return {
passed: result.isGrounded,
score: result.score,
reason: result.isGrounded ? 'Answer is grounded in sources' : 'Found unsupported claims',
metadata: {
isGrounded: result.isGrounded,
unsupportedClaims: result.unsupportedClaims,
},
};
},
});
State Sharing
Data stored in ctx.state during agent execution persists to eval handlers. Use this to pass retrieved documents, intermediate results, or timing data.
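For example, the RAG agent paired with the hallucination eval above could stash its retrieved documents in state during its handler. This is a minimal sketch: searchDocs and generateAnswer are hypothetical helpers, and ctx.state.set is assumed as the write-side counterpart of the ctx.state.get call used in the eval.
import { createAgent } from '@agentuity/runtime';
import { z } from 'zod';
import { searchDocs, generateAnswer } from './rag'; // hypothetical retrieval helpers
const ragAgent = createAgent('RAG Agent', {
  schema: {
    input: z.object({ question: z.string() }),
    output: z.object({ answer: z.string() }),
  },
  handler: async (ctx, input) => {
    const retrievedDocs = await searchDocs(input.question);
    // Stash the sources so hallucinationEval can read them via ctx.state.get('retrievedDocs')
    ctx.state.set('retrievedDocs', retrievedDocs); // assumed API, mirroring ctx.state.get()
    const answer = await generateAnswer(input.question, retrievedDocs);
    return { answer };
  },
});
export default ragAgent;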
Inline Scoring for Frontend Display
When you need scores visible in your UI (not just the App), run LLM-as-judge inline in your handler and include the results in your output schema:
import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
const ScoresSchema = z.object({
creativity: z.number().min(0).max(1),
engagement: z.number().min(0).max(1),
toneMatch: z.boolean(),
});
const agent = createAgent('Story Generator', {
schema: {
input: z.object({ prompt: z.string(), tone: z.string() }),
output: z.object({
story: z.string(),
scores: ScoresSchema,
}),
},
handler: async (ctx, input) => {
// Generate the story
const { text: story } = await generateText({
model: openai('gpt-5-mini'),
prompt: `Write a short ${input.tone} story about: ${input.prompt}`,
});
// Inline LLM-as-judge: scores returned with response
const { object: scores } = await generateObject({
model: openai('gpt-5-nano'),
schema: ScoresSchema,
prompt: `Score this ${input.tone} story (0-1 for scores, boolean for tone match):
${story}
- creativity: How original and imaginative?
- engagement: How compelling to read?
- toneMatch: Does it match the requested "${input.tone}" tone?`,
});
return { story, scores }; // Frontend receives scores directly
},
});
export default agent;
Your frontend can then display the scores alongside the response. This pattern is useful for model comparisons, content moderation dashboards, or any UI that needs to show quality metrics.
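For instance, a browser client could read the scores straight off the agent's JSON response. The endpoint path and the #story/#scores elements below are assumptions for illustration; adjust them to however your routes and UI expose the Story Generator agent.
// Hypothetical client-side call; '/api/story-generator' is an assumed route
async function showStory() {
  const res = await fetch('/api/story-generator', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: 'a lighthouse keeper and a storm', tone: 'whimsical' }),
  });
  const { story, scores } = await res.json();
  // scores matches ScoresSchema: { creativity, engagement, toneMatch }
  document.querySelector('#story')!.textContent = story;
  document.querySelector('#scores')!.textContent =
    `creativity ${scores.creativity.toFixed(2)}, engagement ${scores.engagement.toFixed(2)}, tone match: ${scores.toneMatch ? 'yes' : 'no'}`;
}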
Multiple Evals
When you have multiple evals, define each one as a named export in the same eval.ts file. All evals run in parallel after the agent completes. You can mix custom evals with preset evals:
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { pii, conciseness } from '@agentuity/evals';
// Eval 1: Custom LLM-based relevance score
export const relevanceEval = agent.createEval('relevance', {
description: 'Scores response relevance',
handler: async (ctx, input, output) => {
ctx.logger.info('Running relevance check');
const { object } = await generateObject({
model: openai('gpt-5-nano'),
schema: z.object({
score: z.number().min(0).max(1),
reason: z.string(),
}),
prompt: `Score relevance (0-1): Does "${output.answer}" answer "${input.question}"?`,
});
ctx.logger.info('Relevance check complete', { score: object.score });
return {
passed: object.score >= 0.7,
score: object.score,
reason: object.reason,
};
},
});
// Eval 2: Preset conciseness eval
export const concisenessCheck = agent.createEval(conciseness());
// Eval 3: Preset PII detection (LLM-powered, more thorough than regex)
export const piiCheck = agent.createEval(pii());
Errors in one eval don't affect others. Each runs independently.
Error Handling
Return success: false when an eval can't complete:
import agent from './agent';
export const externalValidationEval = agent.createEval('external-validation', {
description: 'Validates output via external API',
handler: async (ctx, input, output) => {
try {
const response = await fetch('https://api.example.com/validate', {
method: 'POST',
body: JSON.stringify({ text: output.answer }),
signal: AbortSignal.timeout(3000),
});
if (!response.ok) {
return { success: false, passed: false, error: `Service error: ${response.status}` };
}
const result = await response.json();
return { passed: result.isValid };
} catch (error) {
ctx.logger.error('Validation failed', { error });
return { success: false, passed: false, error: error instanceof Error ? error.message : String(error) };
}
},
});
Eval errors are logged but don't affect agent responses.
Preset Evals
The @agentuity/evals package provides reusable evaluations for common quality checks. These preset evals can be configured with custom thresholds (minimum score to pass) and models.
Available Presets
| Preset | Type | Threshold | Description |
|---|---|---|---|
| politeness | Score | 0.8 | Flags rude, dismissive, condescending, or hostile tone |
| safety | Binary | — | Detects unsafe content (harassment, harmful content, illegal guidance) and ensures medical/legal/financial advice includes disclaimers |
| pii | Binary | — | Scans for personal data: emails, phone numbers, SSNs, addresses, credit cards |
| conciseness | Score | 0.7 | Penalizes filler phrases, redundant explanations, and responses disproportionate to request complexity |
| adversarial | Binary | — | Detects prompt injection, jailbreaks, and manipulation attempts; auto-passes if no attack in request |
| ambiguity | Score | 0.7 | Flags unclear references, vague statements, and undefined terms with multiple meanings |
| answerCompleteness | Score | 0.7 | Checks that all questions are directly answered; penalizes tangential or vague responses |
| extraneousContent | Score | 0.7 | Flags off-topic content, unsolicited advice, and meta-commentary ("I hope this helps!") |
| format | Binary | — | Validates response matches requested format (JSON, lists, tables); auto-passes if no format specified |
| knowledgeRetention | Score | 0.7 | Detects contradictions with prior conversation context; auto-passes with no history |
| roleAdherence | Score | 0.7 | Ensures response stays in character; detects domain violations and persona breaks |
| selfReference | Binary | — | Flags AI self-identification ("As an AI...") unless user asked about the model |
Using Preset Evals
Import preset evals from @agentuity/evals and pass them to agent.createEval():
import agent from './agent';
import { politeness, safety, pii } from '@agentuity/evals';
// Use with default settings
export const politenessCheck = agent.createEval(politeness());
// Override the name
export const safetyCheck = agent.createEval(safety({
name: 'safety-strict',
}));
// PII detection with defaults
export const piiCheck = agent.createEval(pii());
Configuring Preset Evals
Preset evals accept configuration options:
import agent from './agent';
import { politeness } from '@agentuity/evals';
import { openai } from '@ai-sdk/openai';
// Override model and threshold
export const politenessCheck = agent.createEval(politeness({
name: 'politeness-strict',
model: openai('gpt-5-nano'),
threshold: 0.9, // Stricter passing threshold
}));
All preset evals use a default model optimized for cost and speed. Override model when you need specific capabilities.
Schema Middleware
Preset evals expect a standard input/output format:
- Input: { request: string, context?: string }
- Output: { response: string }
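If your agent's schemas already match this shape, presets work without any extra configuration. A minimal sketch of such an agent (the "Support Bot" agent and its prompt are hypothetical):
import { createAgent } from '@agentuity/runtime';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
// Input/output match the preset format, so politeness(), safety(), etc. need no middleware
const agent = createAgent('Support Bot', {
  schema: {
    input: z.object({ request: z.string(), context: z.string().optional() }),
    output: z.object({ response: z.string() }),
  },
  handler: async (ctx, input) => {
    const { text } = await generateText({
      model: openai('gpt-5-mini'),
      prompt: `${input.context ?? ''}\n\nAnswer the user's request: ${input.request}`,
    });
    return { response: text };
  },
});
export default agent;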
When your agent uses different schemas, provide middleware to transform between them:
import agent from './agent';
import { politeness } from '@agentuity/evals';
import type { AgentInput, AgentOutput } from './agent';
// Agent schema: { value: number } -> { result: number, doubled: boolean }
// Eval expects: { request: string } -> { response: string }
export const politenessCheck = agent.createEval(
politeness<typeof AgentInput, typeof AgentOutput>({
middleware: {
transformInput: (input) => ({
request: `Calculate double of ${input.value}`,
}),
transformOutput: (output) => ({
response: `Result: ${output.result}, Doubled: ${output.doubled}`,
}),
},
})
);
Pass your agent's schema types as generics to get typed middleware transforms. Without generics, the transform functions receive any.
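For the generics above to resolve, agent.ts has to export its schema objects alongside the agent. One possible sketch, matching the shapes in the comment above (the doubling agent itself is hypothetical):
// agent.ts
import { createAgent } from '@agentuity/runtime';
import { z } from 'zod';
export const AgentInput = z.object({ value: z.number() });
export const AgentOutput = z.object({ result: z.number(), doubled: z.boolean() });
const agent = createAgent('Doubler', {
  schema: { input: AgentInput, output: AgentOutput },
  handler: async (ctx, input) => ({ result: input.value * 2, doubled: true }),
});
export default agent;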
Next Steps
- Events & Lifecycle: Monitor agent execution with lifecycle hooks
- State Management: Share data between handlers and evals
- Calling Other Agents: Build multi-agent workflows
Need Help?
Join our Community for assistance or just to hang with other humans building agents.
Send us an email at hi@agentuity.com if you'd like to get in touch.
If you haven't already, please Signup for your free account now and start building your first agent!