LLM as a Judge

Use LLMs to evaluate and score agent outputs for quality, safety, and compliance

The LLM-as-judge pattern uses one model to evaluate another model's output. This is useful for subjective quality assessments that can't be checked programmatically.

When to Use This Pattern

| Use Case | Good Fit? |
| --- | --- |
| Checking factual accuracy against sources | Yes |
| Evaluating tone and politeness | Yes |
| Detecting hallucinations in RAG | Yes |
| Comparing response quality | Yes |
| Validating JSON structure | No (use schema validation) |
| Checking string length | No (use code) |

Use LLM-as-judge when the evaluation requires understanding context, nuance, or subjective criteria.

Inline vs Evals

LLM-as-judge can be used in two contexts:

| Context | When to Use | Blocks Response? |
| --- | --- | --- |
| Inline (in handler) | Scores needed in UI, model comparison, user-facing quality metrics | Yes |
| Evals (in eval.ts) | Background quality monitoring, compliance checks, production observability | No |

Inline example: A model arena that shows users which response "won" needs scores returned with the response.

Eval example: Checking all responses for PII or hallucinations in production without slowing down users.

Basic Pattern

Use a fast model to judge outputs. Groq provides low-latency inference for judge calls:

import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
 
const JudgmentSchema = z.object({
  score: z.number().min(0).max(1).describe('Quality score from 0 to 1'),
  passed: z.boolean().describe('Whether the response meets quality threshold'),
  reason: z.string().describe('Brief explanation of the judgment'),
});
 
const agent = createAgent('QualityChecker', {
  schema: {
    input: z.object({ question: z.string() }),
    output: z.object({
      answer: z.string(),
      judgment: JudgmentSchema,
    }),
  },
  handler: async (ctx, input) => {
    // Generate answer with primary model
    const { text: answer } = await generateText({
      model: openai('gpt-5-mini'),
      prompt: input.question,
    });
 
    // Judge with fast model
    const { object: judgment } = await generateObject({
      model: groq('openai/gpt-oss-120b'),
      schema: JudgmentSchema,
      prompt: `Evaluate this response on a 0-1 scale.
 
Question: ${input.question}
Response: ${answer}
 
Consider:
- Does it directly answer the question?
- Is the information accurate?
- Is it clear and easy to understand?
 
Score 0.7+ to pass.`,
    });
 
    return { answer, judgment };
  },
});
 
export default agent;
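
For reference, a successful call to this agent returns a payload shaped like the following (the values are illustrative, not real output):

{
  "answer": "TypeScript is a statically typed superset of JavaScript...",
  "judgment": {
    "score": 0.85,
    "passed": true,
    "reason": "Directly answers the question with accurate, clear information."
  }
}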

Model Comparison Arena

Compare responses from multiple providers and declare a winner. This pattern is useful for benchmarking, A/B testing, or letting users see quality differences:

import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
 
const ArenaJudgment = z.object({
  winner: z.enum(['openai', 'anthropic']),
  reasoning: z.string(),
  scores: z.object({
    openai: z.number().min(0).max(1),
    anthropic: z.number().min(0).max(1),
  }),
});
 
const arenaAgent = createAgent('ModelArena', {
  schema: {
    input: z.object({
      prompt: z.string(),
      tone: z.enum(['whimsical', 'suspenseful', 'comedic']),
    }),
    output: z.object({
      results: z.array(z.object({
        provider: z.string(),
        story: z.string(),
        generationMs: z.number(),
      })),
      judgment: ArenaJudgment,
    }),
  },
  handler: async (ctx, input) => {
    // Generate from both models in parallel
    const [openaiResult, anthropicResult] = await Promise.all([
      generateWithTiming(openai('gpt-5-nano'), input.prompt, input.tone),
      generateWithTiming(anthropic('claude-haiku-4-5'), input.prompt, input.tone),
    ]);
 
    const results = [
      { provider: 'openai', ...openaiResult },
      { provider: 'anthropic', ...anthropicResult },
    ];
 
    // Judge with fast model
    const { object: judgment } = await generateObject({
      model: groq('openai/gpt-oss-120b'),
      schema: ArenaJudgment,
      prompt: `Compare these ${input.tone} stories and pick a winner.
 
PROMPT: "${input.prompt}"
 
--- OpenAI ---
${openaiResult.story}
 
--- Anthropic ---
${anthropicResult.story}
 
Score each on creativity, engagement, and tone match (0-1).
Declare a winner with reasoning.`,
    });
 
    ctx.logger.info('Arena complete', { winner: judgment.winner });
    return { results, judgment };
  },
});
 
async function generateWithTiming(
  model: Parameters<typeof generateText>[0]['model'],
  prompt: string,
  tone: string
) {
  const start = Date.now();
  const { text } = await generateText({
    model,
    system: `Write a short ${tone} story (max 200 words) with a beginning, middle, and end.`,
    prompt,
  });
  return { story: text, generationMs: Date.now() - start };
}
 
export default arenaAgent;

Grounding Check for RAG

Detect hallucinations by checking if claims are supported by retrieved sources:

import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
 
const GroundingJudgment = z.object({
  isGrounded: z.boolean().describe('Whether all claims are supported by sources'),
  score: z.number().min(0).max(1).describe('Percentage of claims supported'),
  unsupportedClaims: z.array(z.string()).describe('Claims not found in sources'),
  reason: z.string(),
});
 
interface DocumentMetadata {
  content: string;
  title?: string;
}
 
const ragAgent = createAgent('RAGWithGrounding', {
  schema: {
    input: z.object({ question: z.string() }),
    output: z.object({
      answer: z.string(),
      grounding: GroundingJudgment,
      sources: z.array(z.string()),
    }),
  },
  handler: async (ctx, input) => {
    // Retrieve relevant documents
    const results = await ctx.vector.search<DocumentMetadata>('knowledge-base', {
      query: input.question,
      limit: 3,
    });
 
    const sources = results.map(r => r.metadata?.content || '').filter(Boolean);
 
    // Generate answer
    const { text: answer } = await generateText({
      model: openai('gpt-5-mini'),
      prompt: `Answer based on these sources:\n\n${sources.join('\n\n')}\n\nQuestion: ${input.question}`,
    });
 
    // Check grounding with fast model
    const { object: grounding } = await generateObject({
      model: groq('openai/gpt-oss-120b'),
      schema: GroundingJudgment,
      prompt: `Check if this answer is supported by the sources.
 
Answer: ${answer}
 
Sources:
${sources.map((s, i) => `[${i + 1}] ${s}`).join('\n\n')}
 
Are all factual claims in the answer supported by the sources?
List any unsupported claims.`,
    });
 
    if (!grounding.isGrounded) {
      ctx.logger.warn('Hallucination detected', {
        claims: grounding.unsupportedClaims,
      });
    }
 
    return { answer, grounding, sources: results.map(r => r.key) };
  },
});
 
export default ragAgent;

Using in Evals

For background quality monitoring, use LLM-as-judge in your eval files. Evals run asynchronously after the response is sent, so they don't slow down users:

src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
 
const HelpfulnessJudgment = z.object({
  scores: z.object({
    helpfulness: z.object({
      score: z.number().min(0).max(1),
      reason: z.string(),
    }),
  }),
  checks: z.object({
    answersQuestion: z.object({
      passed: z.boolean(),
      reason: z.string(),
    }),
    actionable: z.object({
      passed: z.boolean(),
      reason: z.string(),
    }),
  }),
});
 
export const helpfulnessEval = agent.createEval('helpfulness', {
  description: 'Judges if responses are helpful and actionable',
  handler: async (ctx, input, output) => {
    const { object } = await generateObject({
      model: groq('openai/gpt-oss-120b'),
      schema: HelpfulnessJudgment,
      prompt: `Evaluate this response.
 
Question: ${input.question}
Response: ${output.answer}
 
SCORE:
- helpfulness: How useful is this response? (0 = useless, 1 = extremely helpful)
 
CHECKS:
- answersQuestion: Does it directly answer what was asked?
- actionable: Can the user act on this information?`,
    });
 
    // Combine score and checks into eval result
    const allChecksPassed = object.checks.answersQuestion.passed && object.checks.actionable.passed;
 
    return {
      passed: allChecksPassed && object.scores.helpfulness.score >= 0.7,
      score: object.scores.helpfulness.score,
      reason: object.scores.helpfulness.reason,
      metadata: {
        checks: object.checks,
      },
    };
  },
});

See Adding Evaluations for more patterns, including preset evals.

Structuring Judge Prompts

Good judge prompts have clear structure:

const judgePrompt = `You are evaluating a ${taskType} response.
 
CONTEXT:
${context}
 
RESPONSE TO EVALUATE:
${response}
 
SCORING CRITERIA (0.0-1.0):
- criterion1: What specifically to look for
- criterion2: What specifically to look for
 
CHECKS (pass/fail):
- check1: Specific yes/no question
- check2: Specific yes/no question
 
Evaluate each criterion, then provide an overall assessment.`;

Tips:

  • Be explicit about the scale (0-1, not 1-10)
  • Define what each score level means (see the rubric sketch after this list)
  • Make checks binary questions with clear answers
  • Include the original context/prompt for reference
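
As an illustration, here is a minimal rubric that anchors each score band and phrases checks as binary questions. The bands and check names are examples, not a standard:

// Illustrative rubric; the score bands and checks are examples, not a standard.
const rubricPrompt = `Rate the response from 0.0 to 1.0.
 
SCORE LEVELS:
- 0.0-0.2: Off-topic, incorrect, or harmful
- 0.3-0.5: Partially relevant but incomplete or unclear
- 0.6-0.8: Answers the question with minor gaps
- 0.9-1.0: Complete, accurate, and clearly written
 
CHECKS (answer yes or no):
- onTopic: Does the response address the question asked?
- complete: Does it cover every part of the question?
 
RESPONSE TO EVALUATE:
${response}`;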

Best Practices

  • Use fast models for judging: Groq, gpt-5-nano, or claude-haiku-4-5 work well
  • Separate scores from checks: Scores for gradients, checks for pass/fail
  • Structure prompts clearly: List criteria explicitly with descriptions
  • Log low scores: Track failures for debugging and model improvement (sketched after this list)
  • Consider latency: Use inline judging only when users need to see results
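
For instance, low-score logging can reuse the judgment object from the basic pattern above. The 0.5 threshold is an arbitrary example value:

// Inside a handler, after the judge call from the basic pattern.
// The 0.5 threshold is an arbitrary example; tune it to your quality bar.
if (judgment.score < 0.5) {
  ctx.logger.warn('Low quality score', {
    question: input.question,
    score: judgment.score,
    reason: judgment.reason,
  });
}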

Cost Optimization

LLM-as-judge adds model calls. Optimize by:

  • Using the fastest capable model (Groq for speed, nano models for cost)
  • Running evals asynchronously (default in Agentuity)
  • Sampling a percentage of requests in high-volume scenarios (see the sketch after this list)
  • Batching judgments when evaluating multiple items
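
As a sketch, sampling can be a random gate at the top of an eval handler. The SAMPLE_RATE constant and the skip-result shape below are illustrative assumptions, not part of the Agentuity API:

import agent from './agent';
import { generateObject } from 'ai';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
 
// Hypothetical sampling rate; tune for your traffic volume.
const SAMPLE_RATE = 0.1;
 
const SampledJudgment = z.object({
  score: z.number().min(0).max(1),
  reason: z.string(),
});
 
export const sampledQualityEval = agent.createEval('sampled-quality', {
  description: 'Judges roughly 10% of responses to control cost',
  handler: async (ctx, input, output) => {
    // Skip most requests. The skip-result shape is an assumption;
    // check how your eval dashboard treats these runs.
    if (Math.random() >= SAMPLE_RATE) {
      return { passed: true, score: 1, reason: 'Skipped (not sampled)' };
    }
 
    const { object } = await generateObject({
      model: groq('openai/gpt-oss-120b'),
      schema: SampledJudgment,
      prompt: `Score this response from 0 to 1 for overall quality.
 
Question: ${input.question}
Response: ${output.answer}`,
    });
 
    return {
      passed: object.score >= 0.7,
      score: object.score,
      reason: object.reason,
    };
  },
});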

Next Steps