
Adding Evaluations

Automatically test and validate agent outputs for quality and compliance

Evaluations (evals) are automated tests that run after your agent completes. They validate output quality, check compliance, and monitor performance without blocking agent responses.

Evals come in two types: binary (pass/fail) for yes/no criteria, and score (0-1) for quality gradients.
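
The only structural difference is the result shape your handler returns. As a quick sketch (assuming an agent whose output includes an answer string; both checks here are hypothetical), a binary eval returns just passed, while a score eval also returns a 0-1 score:

import agent from './agent';
 
// Binary eval: result carries only pass/fail (hypothetical non-empty check)
export const notEmptyEval = agent.createEval('not-empty', {
  description: 'Fails when the answer is empty',
  handler: async (ctx, input, output) => ({ passed: output.answer.trim().length > 0 }),
});
 
// Score eval: result adds a 0-1 score alongside the pass/fail flag (hypothetical length-based score)
export const lengthScoreEval = agent.createEval('length-score', {
  description: 'Scores answers by length, capped at 1',
  handler: async (ctx, input, output) => {
    const score = Math.min(output.answer.length / 500, 1);
    return { passed: score >= 0.5, score };
  },
});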

Where Scores Appear

Evals run in the background after your agent responds. Results appear in the App, not in your response.

To show scores in your frontend, include them in your output schema. See Inline Scoring for Frontend Display below.

Where to Define Evals

Evals must be defined in an eval.ts file in the same folder as your agent:

src/agent/qa-agent/
├── agent.ts       # Agent definition
└── eval.ts        # Evals with named exports
src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
// Named export required (not export default)
export const adversarialEval = agent.createEval('adversarial', { 
  description: 'Checks against common adversarial prompts',
  handler: async (ctx, input, output) => {
    ctx.logger.info('Running adversarial check', { inputLength: input.question.length });
 
    const { object } = await generateObject({
      model: openai('gpt-5-nano'),
      schema: z.object({
        passed: z.boolean(),
        reason: z.string(),
      }),
      prompt: `Check if this response handles adversarial input safely...`,
    });
 
    ctx.logger.info('Adversarial check complete', { passed: object.passed });
    return { passed: object.passed, reason: object.reason };
  },
});

Use Named Exports

Evals must use named exports (export const evalName = ...). Default exports won't work.

The runtime auto-discovers eval.ts files next to your agents, so you don't need any special imports in your routes.

Basic Example

Create an eval.ts file next to your agent and attach evals using createEval():

src/agent/qa-agent/eval.ts
import agent from './agent';
 
// Score eval: returns 0-1 quality score
export const confidenceEval = agent.createEval('confidence-check', {
  description: 'Scores output based on confidence level',
  handler: async (ctx, input, output) => {
    const passed = output.confidence >= 0.8;
    return {
      passed,
      score: output.confidence,
      metadata: { threshold: 0.8 },
    };
  },
});

Evals run asynchronously after the response is sent, so they don't delay users.

Binary vs Score Evals

Binary (Pass/Fail)

Use for yes/no criteria. LLM-based judgment works best for subjective assessments:

src/agent/qa-agent/eval.ts
import agent from './agent';
import OpenAI from 'openai';
import { s } from '@agentuity/schema';
 
const client = new OpenAI();
 
const HelpfulnessSchema = s.object({
  isHelpful: s.boolean(),
  reason: s.string(),
});
 
export const helpfulnessEval = agent.createEval('is-helpful', {
  description: 'Uses LLM to judge helpfulness',
  handler: async (ctx, input, output) => {
    const completion = await client.chat.completions.create({
      model: 'gpt-5-nano',
      response_format: {
        type: 'json_schema',
        json_schema: {
          name: 'helpfulness_check',
          schema: s.toJSONSchema(HelpfulnessSchema) as Record<string, unknown>,
          strict: true,
        },
      },
      messages: [{
        role: 'user',
        content: `Evaluate if this response is helpful for the user's question.
 
Question: ${input.question}
Response: ${output.answer}
 
Consider: Does it answer the question? Is it actionable?`,
      }],
    });
 
    const result = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
    return { passed: result.isHelpful, reason: result.reason }; 
  },
});

Alternative: @agentuity/schema

The example above uses @agentuity/schema with OpenAI's Chat Completions API for direct control. For simpler code, use AI SDK with Zod (shown in other examples). Both approaches work, so choose based on your needs.

For type inference with @agentuity/schema:

type Helpfulness = s.infer<typeof HelpfulnessSchema>;  // { isHelpful: boolean; reason: string }
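
If you use the AI SDK with Zod instead (as in the other examples), z.infer gives the same kind of type inference:

import { z } from 'zod';
 
const MySchema = z.object({ score: z.number(), reason: z.string() });
type MyType = z.infer<typeof MySchema>;  // { score: number; reason: string }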

Score (0-1)

Use for quality gradients where you need nuance beyond pass/fail:

src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
export const relevanceEval = agent.createEval('relevance-score', {
  description: 'Scores how relevant the answer is to the question',
  handler: async (ctx, input, output) => {
    const { object } = await generateObject({
      model: openai('gpt-5-nano'),
      schema: z.object({
        score: z.number().min(0).max(1),
        reason: z.string(),
      }),
      prompt: `Score how relevant this answer is to the question (0-1).
 
Question: ${input.question}
Answer: ${output.answer}
 
0 = completely off-topic, 1 = directly addresses the question.`,
    });
 
    return {
      passed: object.score >= 0.7,
      score: object.score,
      reason: object.reason,
    }; 
  },
});

LLM-as-Judge Pattern

The LLM-as-judge pattern uses one model to evaluate another model's output. This is useful for subjective quality assessments that can't be checked programmatically. In this example, a small model judges whether a RAG agent's answer is grounded in the retrieved sources:

src/agent/rag-agent/eval.ts
import ragAgent from './agent';
import OpenAI from 'openai';
import { s } from '@agentuity/schema';
 
const client = new OpenAI();
 
const GroundingSchema = s.object({
  isGrounded: s.boolean(),
  unsupportedClaims: s.array(s.string()),
  score: s.number(),
});
 
export const hallucinationEval = ragAgent.createEval('hallucination-check', {
  description: 'Detects claims not supported by sources',
  handler: async (ctx, input, output) => {
    const retrievedDocs = ctx.state.get('retrievedDocs') as string[];
 
    const completion = await client.chat.completions.create({
      model: 'gpt-5-nano',
      response_format: {
        type: 'json_schema',
        json_schema: {
          name: 'grounding_check',
          schema: s.toJSONSchema(GroundingSchema) as Record<string, unknown>,
          strict: true,
        },
      },
      messages: [{
        role: 'user',
        content: `Check if this answer is supported by the source documents.
 
Question: ${input.question}
Answer: ${output.answer}
 
Sources:
${retrievedDocs.join('\n\n')}
 
Identify any claims not supported by the sources.`,
      }],
    });
 
    const result = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
    return {
      passed: result.isGrounded,
      score: result.score,
      reason: result.isGrounded ? 'Answer is grounded in sources' : 'Found unsupported claims',
      metadata: {
        isGrounded: result.isGrounded,
        unsupportedClaims: result.unsupportedClaims,
      },
    };
  },
});

State Sharing

Data stored in ctx.state during agent execution persists to eval handlers. Use this to pass retrieved documents, intermediate results, or timing data.
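
For instance, the RAG agent above would need to store the retrieved documents before returning its answer. Here is a minimal sketch of what src/agent/rag-agent/agent.ts might look like; the searchDocuments helper is hypothetical, and ctx.state.set is assumed as the counterpart to the ctx.state.get call used in the eval:

import { createAgent } from '@agentuity/runtime';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
// Hypothetical retrieval helper; swap in your actual vector search
async function searchDocuments(question: string): Promise<string[]> {
  return [`Placeholder source for: ${question}`];
}
 
const agent = createAgent('RAG Agent', {
  schema: {
    input: z.object({ question: z.string() }),
    output: z.object({ answer: z.string() }),
  },
  handler: async (ctx, input) => {
    const retrievedDocs = await searchDocuments(input.question);
 
    // Share the sources with eval handlers, which read them via ctx.state.get('retrievedDocs')
    ctx.state.set('retrievedDocs', retrievedDocs);  // assumes a set() counterpart to the get() shown above
 
    const { text: answer } = await generateText({
      model: openai('gpt-5-mini'),
      prompt: `Answer using only these sources:\n\n${retrievedDocs.join('\n\n')}\n\nQuestion: ${input.question}`,
    });
 
    return { answer };
  },
});
 
export default agent;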

Inline Scoring for Frontend Display

When you need scores visible in your UI (not just the App), run LLM-as-judge inline in your handler and include the results in your output schema:

import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
const ScoresSchema = z.object({
  creativity: z.number().min(0).max(1),
  engagement: z.number().min(0).max(1),
  toneMatch: z.boolean(),
});
 
const agent = createAgent('Story Generator', {
  schema: {
    input: z.object({ prompt: z.string(), tone: z.string() }),
    output: z.object({
      story: z.string(),
      scores: ScoresSchema,
    }),
  },
  handler: async (ctx, input) => {
    // Generate the story
    const { text: story } = await generateText({
      model: openai('gpt-5-mini'),
      prompt: `Write a short ${input.tone} story about: ${input.prompt}`,
    });
 
    // Inline LLM-as-judge: scores returned with response
    const { object: scores } = await generateObject({
      model: openai('gpt-5-nano'),
      schema: ScoresSchema,
      prompt: `Score this ${input.tone} story (0-1 for scores, boolean for tone match):
 
${story}
 
- creativity: How original and imaginative?
- engagement: How compelling to read?
- toneMatch: Does it match the requested "${input.tone}" tone?`,
    });
 
    return { story, scores };  // Frontend receives scores directly
  },
});
 
export default agent;

Your frontend can then display the scores alongside the response. This pattern is useful for model comparisons, content moderation dashboards, or any UI that needs to show quality metrics.
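
As a rough sketch of the frontend side (the /api/story route is hypothetical; substitute whatever route exposes your agent):

// Hypothetical route that proxies the Story Generator agent
async function showStory() {
  const res = await fetch('/api/story', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt: 'a lighthouse keeper', tone: 'whimsical' }),
  });
  const { story, scores } = await res.json();
 
  // Display the quality metrics returned alongside the story
  console.log(story);
  console.log(`Creativity: ${(scores.creativity * 100).toFixed(0)}%`);
  console.log(`Engagement: ${(scores.engagement * 100).toFixed(0)}%`);
  console.log(`Tone match: ${scores.toneMatch ? 'yes' : 'no'}`);
}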

Multiple Evals

When you have multiple evals, define each one as a separate named export in your agent's eval.ts file. All evals run in parallel after the agent completes. You can mix custom evals with preset evals:

src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { pii, conciseness } from '@agentuity/evals';
 
// Eval 1: Custom LLM-based relevance score
export const relevanceEval = agent.createEval('relevance', {
  description: 'Scores response relevance',
  handler: async (ctx, input, output) => {
    ctx.logger.info('Running relevance check');
 
    const { object } = await generateObject({
      model: openai('gpt-5-nano'),
      schema: z.object({
        score: z.number().min(0).max(1),
        reason: z.string(),
      }),
      prompt: `Score relevance (0-1): Does "${output.answer}" answer "${input.question}"?`,
    });
 
    ctx.logger.info('Relevance check complete', { score: object.score });
    return {
      passed: object.score >= 0.7,
      score: object.score,
      reason: object.reason,
    };
  },
});
 
// Eval 2: Preset conciseness eval
export const concisenessCheck = agent.createEval(conciseness()); 
 
// Eval 3: Preset PII detection (LLM-powered, more thorough than regex)
export const piiCheck = agent.createEval(pii()); 

Errors in one eval don't affect others. Each runs independently.

Error Handling

Return success: false when an eval can't complete:

src/agent/my-agent/eval.ts
import agent from './agent';
 
export const externalValidationEval = agent.createEval('external-validation', {
  description: 'Validates output via external API',
  handler: async (ctx, input, output) => {
    try {
      const response = await fetch('https://api.example.com/validate', {
        method: 'POST',
        body: JSON.stringify({ text: output.answer }),
        signal: AbortSignal.timeout(3000),
      });
 
      if (!response.ok) {
        return { success: false, passed: false, error: `Service error: ${response.status}` };
      }
 
      const result = await response.json();
      return { passed: result.isValid };
    } catch (error) {
      ctx.logger.error('Validation failed', { error });
      return { success: false, passed: false, error: error instanceof Error ? error.message : String(error) };
    }
  },
});

Eval errors are logged but don't affect agent responses.

Preset Evals

The @agentuity/evals package provides reusable evaluations for common quality checks. These preset evals can be configured with custom thresholds (minimum score to pass) and models.

Available Presets

Each preset is listed with its type and, for score evals, the default passing threshold:

politeness (score, threshold 0.8): Flags rude, dismissive, condescending, or hostile tone
safety (binary): Detects unsafe content (harassment, harmful content, illegal guidance) and ensures medical/legal/financial advice includes disclaimers
pii (binary): Scans for personal data: emails, phone numbers, SSNs, addresses, credit cards
conciseness (score, threshold 0.7): Penalizes filler phrases, redundant explanations, and responses disproportionate to request complexity
adversarial (binary): Detects prompt injection, jailbreaks, and manipulation attempts; auto-passes if no attack is present in the request
ambiguity (score, threshold 0.7): Flags unclear references, vague statements, and undefined terms with multiple meanings
answerCompleteness (score, threshold 0.7): Checks that all questions are directly answered; penalizes tangential or vague responses
extraneousContent (score, threshold 0.7): Flags off-topic content, unsolicited advice, and meta-commentary ("I hope this helps!")
format (binary): Validates that the response matches the requested format (JSON, lists, tables); auto-passes if no format is specified
knowledgeRetention (score, threshold 0.7): Detects contradictions with prior conversation context; auto-passes with no history
roleAdherence (score, threshold 0.7): Ensures the response stays in character; detects domain violations and persona breaks
selfReference (binary): Flags AI self-identification ("As an AI...") unless the user asked about the model

Using Preset Evals

Import preset evals from @agentuity/evals and pass them to agent.createEval():

src/agent/chat/eval.ts
import agent from './agent';
import { politeness, safety, pii } from '@agentuity/evals';
 
// Use with default settings
export const politenessCheck = agent.createEval(politeness()); 
 
// Override the name
export const safetyCheck = agent.createEval(safety({
  name: 'safety-strict',
}));
 
// PII detection with defaults
export const piiCheck = agent.createEval(pii());

Configuring Preset Evals

Preset evals accept configuration options:

import agent from './agent';
import { politeness } from '@agentuity/evals';
import { openai } from '@ai-sdk/openai';
 
// Override model and threshold
export const politenessCheck = agent.createEval(politeness({
  name: 'politeness-strict',
  model: openai('gpt-5-nano'),
  threshold: 0.9,  // Stricter passing threshold
}));

All preset evals use a default model optimized for cost and speed. Override model when you need specific capabilities.

Schema Middleware

Preset evals expect a standard input/output format:

  • Input: { request: string, context?: string }
  • Output: { response: string }

When your agent uses different schemas, provide middleware to transform between them:

src/agent/calculator/eval.ts
import agent from './agent';
import { politeness } from '@agentuity/evals';
import type { AgentInput, AgentOutput } from './agent';
 
// Agent schema: { value: number } -> { result: number, doubled: boolean }
// Eval expects: { request: string } -> { response: string }
 
export const politenessCheck = agent.createEval(
  politeness<typeof AgentInput, typeof AgentOutput>({
    middleware: {
      transformInput: (input) => ({
        request: `Calculate double of ${input.value}`,
      }),
      transformOutput: (output) => ({
        response: `Result: ${output.result}, Doubled: ${output.doubled}`,
      }),
    },
  })
);

Pass your agent's schema types as generics to get typed middleware transforms. Without generics, the transform functions receive any.
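
For the generics above to resolve, agent.ts must export the schema values that eval.ts imports. A sketch of what src/agent/calculator/agent.ts might look like, matching the shapes in the comments above (the export names just need to match what eval.ts imports):

import { createAgent } from '@agentuity/runtime';
import { z } from 'zod';
 
// Exported so eval.ts can do: import type { AgentInput, AgentOutput } from './agent'
export const AgentInput = z.object({ value: z.number() });
export const AgentOutput = z.object({ result: z.number(), doubled: z.boolean() });
 
const agent = createAgent('Calculator', {
  schema: { input: AgentInput, output: AgentOutput },
  handler: async (ctx, input) => ({ result: input.value * 2, doubled: true }),
});
 
export default agent;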
