Adding Evaluations

Automatically test and validate agent outputs for quality and compliance

Evaluations (evals) are automated tests that run after your agent completes. They validate output quality, check compliance, and monitor performance without blocking agent responses.

Why Evals?

Most evaluation tools test the LLM: did the model respond appropriately? That's fine for chatbots, but agents aren't single LLM calls. They're entire runs with multiple model calls, tool executions, and orchestration working together.

Agent failures can happen anywhere in the run—a tool call that returned bad data, a state bug that corrupted context, and more. Testing just the LLM response misses most of this.

Agentuity evals test the whole run—every tool call, state change, and orchestration step. They run on every session in production, so you catch issues with real traffic.

The result:

  • Full-run evaluation: Test the entire agent execution, not just LLM responses
  • Production monitoring: Once configured, evals run automatically on every session
  • Async by default: Evals don't block responses, so users aren't waiting
  • Preset library: Common checks (PII, safety, hallucination) available out of the box

Evals come in two types: binary (pass/fail) for yes/no criteria, and score (0-1) for quality gradients.
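As a rough sketch, the two kinds differ only in the result shape the handler returns; the shapes below mirror the examples later on this page:

// Binary eval result: pass/fail, optionally with an explanation
const binaryResult = { passed: true, reason: 'No policy violations found' };
 
// Score eval result: a 0-1 quality score alongside the pass/fail decision
const scoreResult = { passed: true, score: 0.92, metadata: { threshold: 0.8 } };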

Where to Define Evals

Evals must be defined in an eval.ts file in the same folder as your agent:

src/agent/qa-agent/
├── agent.ts       # Agent definition
└── eval.ts        # Evals with named exports
src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
// Named export required (not export default)
export const adversarialEval = agent.createEval('adversarial', { 
  description: 'Checks against common adversarial prompts',
  handler: async (ctx, input, output) => {
    ctx.logger.info('Running adversarial check', { inputLength: input.question.length });
 
    const { object } = await generateObject({
      model: openai('gpt-5.4-nano'),
      schema: z.object({
        passed: z.boolean(),
        reason: z.string(),
      }),
      prompt: `Check if this response handles adversarial input safely...`,
    });
 
    ctx.logger.info('Adversarial check complete', { passed: object.passed });
    return { passed: object.passed, reason: object.reason };
  },
});

Basic Example

Create an eval.ts file next to your agent and attach evals using createEval():

src/agent/qa-agent/eval.ts
import agent from './agent';
 
// Score eval: returns 0-1 quality score
export const confidenceEval = agent.createEval('confidence-check', {
  description: 'Scores output based on confidence level',
  handler: async (ctx, input, output) => {
    const passed = output.confidence >= 0.8;
    return {
      passed,
      score: output.confidence,
      metadata: { threshold: 0.8 },
    };
  },
});

Evals run asynchronously after the response is sent, so they don't delay users.

Binary vs Score Evals

Binary (Pass/Fail)

Use for yes/no criteria. LLM-based judgment works best for subjective assessments:

src/agent/qa-agent/eval.ts
import agent from './agent';
import OpenAI from 'openai';
import { s } from '@agentuity/schema';
 
const client = new OpenAI();
 
const HelpfulnessSchema = s.object({
  isHelpful: s.boolean(),
  reason: s.string(),
});
 
export const helpfulnessEval = agent.createEval('is-helpful', {
  description: 'Uses LLM to judge helpfulness',
  handler: async (ctx, input, output) => {
    const completion = await client.chat.completions.create({
      model: 'gpt-5.4-nano',
      response_format: {
        type: 'json_schema',
        json_schema: {
          name: 'helpfulness_check',
          schema: s.toJSONSchema(HelpfulnessSchema) as Record<string, unknown>,
          strict: true,
        },
      },
      messages: [{
        role: 'user',
        content: `Evaluate if this response is helpful for the user's question.
 
Question: ${input.question}
Response: ${output.answer}
 
Consider: Does it answer the question? Is it actionable?`,
      }],
    });
 
    const result = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
    return { passed: result.isHelpful, reason: result.reason }; 
  },
});

Score (0-1)

Use for quality gradients where you need nuance beyond pass/fail:

src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
export const relevanceEval = agent.createEval('relevance-score', {
  description: 'Scores how relevant the answer is to the question',
  handler: async (ctx, input, output) => {
    const { object } = await generateObject({
      model: openai('gpt-5.4-nano'),
      schema: z.object({
        score: z.number().min(0).max(1),
        reason: z.string(),
      }),
      prompt: `Score how relevant this answer is to the question (0-1).
 
Question: ${input.question}
Answer: ${output.answer}
 
0 = completely off-topic, 1 = directly addresses the question.`,
    });
 
    return {
      passed: object.score >= 0.7,
      score: object.score,
      reason: object.reason,
    }; 
  },
});

LLM-as-Judge Pattern

The LLM-as-judge pattern uses one model to evaluate another model's output. This is useful for subjective quality assessments that can't be checked programmatically. In this example, a small model judges whether a RAG agent's answer is grounded in the retrieved sources:

src/agent/rag-agent/eval.ts
import ragAgent from './agent';
import OpenAI from 'openai';
import { s } from '@agentuity/schema';
 
const client = new OpenAI();
 
const GroundingSchema = s.object({
  isGrounded: s.boolean(),
  unsupportedClaims: s.array(s.string()),
  score: s.number(),
});
 
export const hallucinationEval = ragAgent.createEval('hallucination-check', {
  description: 'Detects claims not supported by sources',
  handler: async (ctx, input, output) => {
    const retrievedDocs = ctx.state.get('retrievedDocs') as string[];
 
    const completion = await client.chat.completions.create({
      model: 'gpt-5.4-nano',
      response_format: {
        type: 'json_schema',
        json_schema: {
          name: 'grounding_check',
          schema: s.toJSONSchema(GroundingSchema) as Record<string, unknown>,
          strict: true,
        },
      },
      messages: [{
        role: 'user',
        content: `Check if this answer is supported by the source documents.
 
Question: ${input.question}
Answer: ${output.answer}
 
Sources:
${retrievedDocs.join('\n\n')}
 
Identify any claims not supported by the sources.`,
      }],
    });
 
    const result = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
    return {
      passed: result.isGrounded,
      score: result.score,
      reason: result.isGrounded ? 'Answer is grounded in sources' : 'Found unsupported claims',
      metadata: {
        isGrounded: result.isGrounded,
        unsupportedClaims: result.unsupportedClaims,
      },
    };
  },
});

Inline Scoring for Frontend Display

When you need scores visible in your own UI (not just in the Agentuity app), run LLM-as-judge inline in your handler and include the results in your output schema:

import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
const ScoresSchema = z.object({
  creativity: z.number().min(0).max(1),
  engagement: z.number().min(0).max(1),
  toneMatch: z.boolean(),
});
 
const agent = createAgent('Story Generator', {
  schema: {
    input: z.object({ prompt: z.string(), tone: z.string() }),
    output: z.object({
      story: z.string(),
      scores: ScoresSchema,
    }),
  },
  handler: async (ctx, input) => {
    // Generate the story
    const { text: story } = await generateText({
      model: openai('gpt-5-mini'),
      prompt: `Write a short ${input.tone} story about: ${input.prompt}`,
    });
 
    // Inline LLM-as-judge: scores returned with response
    const { object: scores } = await generateObject({
      model: openai('gpt-5.4-nano'),
      schema: ScoresSchema,
      prompt: `Score this ${input.tone} story (0-1 for scores, boolean for tone match):
 
${story}
 
- creativity: How original and imaginative?
- engagement: How compelling to read?
- toneMatch: Does it match the requested "${input.tone}" tone?`,
    });
 
    return { story, scores };  // Frontend receives scores directly
  },
});
 
export default agent;

Your frontend can then display the scores alongside the response. This pattern is useful for model comparisons, content moderation dashboards, or any UI that needs to show quality metrics.
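For instance, a client could read the scores straight off the agent's output. This is only a sketch: the endpoint path and response shape below are assumptions for illustration, not part of the Agentuity API, so adapt them to however your agent is exposed.

// Hypothetical client-side usage: the endpoint path and response shape are
// assumptions for illustration, not part of the Agentuity API.
type StoryOutput = {
  story: string;
  scores: { creativity: number; engagement: number; toneMatch: boolean };
};
 
async function renderStory(prompt: string, tone: string) {
  const res = await fetch('/api/story-generator', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, tone }),
  });
  const output = (await res.json()) as StoryOutput;
 
  // Show the inline judge scores alongside the generated story
  console.log(output.story);
  console.log(`creativity: ${output.scores.creativity.toFixed(2)}`);
  console.log(`engagement: ${output.scores.engagement.toFixed(2)}`);
  console.log(`tone match: ${output.scores.toneMatch ? 'yes' : 'no'}`);
}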

Multiple Evals

When you have multiple evals, define them in a separate eval.ts file. All evals run in parallel after the agent completes. You can mix custom evals with preset evals:

src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { pii, conciseness } from '@agentuity/evals';
 
// Eval 1: Custom LLM-based relevance score
export const relevanceEval = agent.createEval('relevance', {
  description: 'Scores response relevance',
  handler: async (ctx, input, output) => {
    ctx.logger.info('Running relevance check');
 
    const { object } = await generateObject({
      model: openai('gpt-5.4-nano'),
      schema: z.object({
        score: z.number().min(0).max(1),
        reason: z.string(),
      }),
      prompt: `Score relevance (0-1): Does "${output.answer}" answer "${input.question}"?`,
    });
 
    ctx.logger.info('Relevance check complete', { score: object.score });
    return {
      passed: object.score >= 0.7,
      score: object.score,
      reason: object.reason,
    };
  },
});
 
// Eval 2: Preset conciseness eval
export const concisenessCheck = agent.createEval(conciseness()); 
 
// Eval 3: Preset PII detection (LLM-powered, more thorough than regex)
export const piiCheck = agent.createEval(pii()); 

Errors in one eval don't affect others. Each runs independently.

Error Handling

Throw when an eval can't complete. The runtime records the error without affecting agent responses:

src/agent/my-agent/eval.ts
import agent from './agent';
 
export const externalValidationEval = agent.createEval('external-validation', {
  description: 'Validates output via external API',
  handler: async (ctx, input, output) => {
    try {
      const response = await fetch('https://api.example.com/validate', {
        method: 'POST',
        body: JSON.stringify({ text: output.answer }),
        signal: AbortSignal.timeout(3000),
      });
 
      if (!response.ok) {
        throw new Error(`Service error: ${response.status}`);
      }
 
      const result = await response.json();
      return { passed: result.isValid };
    } catch (error) {
      ctx.logger.error('Validation failed', { error });
      throw error;
    }
  },
});

Eval errors are logged but don't affect agent responses.

Preset Evals

The @agentuity/evals package provides reusable evaluations for common quality checks. These preset evals can be configured with custom thresholds (minimum score to pass) and models.

Available Presets

  • politeness (score, threshold 0.8): Flags rude, dismissive, condescending, or hostile tone
  • safety (binary): Detects unsafe content (harassment, harmful content, illegal guidance) and ensures medical/legal/financial advice includes disclaimers
  • pii (binary): Scans for personal data (emails, phone numbers, SSNs, addresses, credit cards)
  • conciseness (score, threshold 0.7): Penalizes filler phrases, redundant explanations, and responses disproportionate to request complexity
  • adversarial (binary): Detects prompt injection, jailbreaks, and manipulation attempts; auto-passes if no attack in request
  • ambiguity (score, threshold 0.7): Flags unclear references, vague statements, and undefined terms with multiple meanings
  • answerCompleteness (score, threshold 0.7): Checks that all questions are directly answered; penalizes tangential or vague responses
  • extraneousContent (score, threshold 0.7): Flags off-topic content, unsolicited advice, and meta-commentary ("I hope this helps!")
  • format (binary): Validates response matches requested format (JSON, lists, tables); auto-passes if no format specified
  • knowledgeRetention (score, threshold 0.7): Detects contradictions with prior conversation context; auto-passes with no history
  • roleAdherence (score, threshold 0.7): Ensures response stays in character; detects domain violations and persona breaks
  • selfReference (binary): Flags AI self-identification ("As an AI...") unless user asked about the model

Using Preset Evals

Import preset evals from @agentuity/evals and pass them to agent.createEval():

src/agent/chat/eval.ts
import agent from './agent';
import { politeness, safety, pii } from '@agentuity/evals';
 
// Use with default settings
export const politenessCheck = agent.createEval(politeness()); 
 
// Override the name
export const safetyCheck = agent.createEval(safety({
  name: 'safety-strict',
}));
 
// PII detection with defaults
export const piiCheck = agent.createEval(pii());

Configuring Preset Evals

Preset evals accept configuration options:

import agent from './agent';
import { politeness } from '@agentuity/evals';
import { openai } from '@ai-sdk/openai';
 
// Override model and threshold
export const politenessCheck = agent.createEval(politeness({
  name: 'politeness-strict',
  model: openai('gpt-5.4-nano'),
  threshold: 0.9,  // Stricter passing threshold
}));

All preset evals use a default model optimized for cost and speed. Override the model when you need specific capabilities.

Lifecycle Hooks

Preset evals support onStart and onComplete hooks for custom logic around eval execution:

import agent from './agent';
import { politeness } from '@agentuity/evals';
 
export const politenessCheck = agent.createEval(politeness({
  onStart: async (ctx, input, output) => {
    ctx.logger.info('Starting politeness eval', {
      inputLength: input.request?.length,
    });
  },
  onComplete: async (ctx, result) => {
    // Track results in external monitoring
    if (!result.passed) {
      ctx.logger.warn('Politeness check failed', {
        score: result.score,
        reason: result.reason,
      });
    }
  },
}));

Use cases for lifecycle hooks:

  • Log eval execution for debugging
  • Send results to external monitoring systems
  • Track eval performance metrics
  • Trigger alerts on failures

Schema Middleware

Preset evals expect a standard input/output format:

  • Input: { request: string, context?: string }
  • Output: { response: string }

When your agent uses different schemas, provide middleware to transform between them:

src/agent/calculator/eval.ts
import agent, { AgentInput, AgentOutput } from './agent';
import { politeness } from '@agentuity/evals';
 
// Agent schema: { value: number } -> { result: number, doubled: boolean }
// Eval expects: { request: string } -> { response: string }
 
export const politenessCheck = agent.createEval(
  politeness<typeof AgentInput, typeof AgentOutput>({
    middleware: {
      transformInput: (input) => ({
        request: `Calculate double of ${input.value}`,
      }),
      transformOutput: (output) => ({
        response: `Result: ${output.result}, Doubled: ${output.doubled}`,
      }),
    },
  })
);

Pass your agent's schema types as generics to get typed middleware transforms. Without generics, the transform functions receive any.

Next Steps