Adding Evaluations

Automatically test and validate agent outputs for quality and compliance

Evaluations (evals) are automated tests that run after your agent completes. They validate output quality, check compliance, and monitor performance without blocking agent responses.

Why Evals?

Most evaluation tools test the LLM: did the model respond appropriately? That's fine for chatbots, but agents aren't single LLM calls. They're entire runs with multiple model calls, tool executions, and orchestration working together.

Agent failures can happen anywhere in the run—a tool call that returned bad data, a state bug that corrupted context, and more. Testing just the LLM response misses most of this.

Agentuity evals test the whole run—every tool call, state change, and orchestration step. They run on every session in production, so you catch issues with real traffic.

The result:

  • Full-run evaluation: Test the entire agent execution, not just LLM responses
  • Production monitoring: Once configured, evals run automatically on every session
  • Async by default: Evals don't block responses, so users aren't waiting
  • Preset library: Common checks (PII, safety, hallucination) available out of the box

Evals come in two types: binary (pass/fail) for yes/no criteria, and score (0-1) for quality gradients.
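As a rough sketch, the two kinds differ only in the result shape the handler returns; the shapes below mirror the examples later on this page:

// Binary eval result: pass/fail, optionally with an explanation
const binaryResult = { passed: true, reason: 'No policy violations found' };
 
// Score eval result: a 0-1 quality score alongside the pass/fail decision
const scoreResult = { passed: true, score: 0.92, metadata: { threshold: 0.8 } };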

Where to Define Evals

Evals must be defined in an eval.ts file in the same folder as your agent:

src/agent/qa-agent/
├── agent.ts       # Agent definition
└── eval.ts        # Evals with named exports
src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
// Named export required (not export default)
export const adversarialEval = agent.createEval('adversarial', { 
  description: 'Checks against common adversarial prompts',
  handler: async (ctx, input, output) => {
    ctx.logger.info('Running adversarial check', { inputLength: input.question.length });
 
    const { object } = await generateObject({
      model: openai('gpt-5.4-nano'),
      schema: z.object({
        passed: z.boolean(),
        reason: z.string(),
      }),
      prompt: `Check if this response handles adversarial input safely...`,
    });
 
    ctx.logger.info('Adversarial check complete', { passed: object.passed });
    return { passed: object.passed, reason: object.reason };
  },
});

Basic Example

Create an eval.ts file next to your agent and attach evals using createEval():

src/agent/qa-agent/eval.ts
import agent from './agent';
 
// Score eval: returns 0-1 quality score
export const confidenceEval = agent.createEval('confidence-check', {
  description: 'Scores output based on confidence level',
  handler: async (ctx, input, output) => {
    const passed = output.confidence >= 0.8;
    return {
      passed,
      score: output.confidence,
      metadata: { threshold: 0.8 },
    };
  },
});

Evals run asynchronously after the response is sent, so they don't delay users.

Binary vs Score Evals

Binary (Pass/Fail)

Use for yes/no criteria. LLM-based judgment works best for subjective assessments:

src/agent/qa-agent/eval.ts
import agent from './agent';
import OpenAI from 'openai';
import { s } from '@agentuity/schema';
 
const client = new OpenAI();
 
const HelpfulnessSchema = s.object({
  isHelpful: s.boolean(),
  reason: s.string(),
});
 
export const helpfulnessEval = agent.createEval('is-helpful', {
  description: 'Uses LLM to judge helpfulness',
  handler: async (ctx, input, output) => {
    const completion = await client.chat.completions.create({
      model: 'gpt-5.4-nano',
      response_format: {
        type: 'json_schema',
        json_schema: {
          name: 'helpfulness_check',
          schema: s.toJSONSchema(HelpfulnessSchema) as Record<string, unknown>,
          strict: true,
        },
      },
      messages: [{
        role: 'user',
        content: `Evaluate if this response is helpful for the user's question.
 
Question: ${input.question}
Response: ${output.answer}
 
Consider: Does it answer the question? Is it actionable?`,
      }],
    });
 
    const result = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
    return { passed: result.isHelpful, reason: result.reason }; 
  },
});

Score (0-1)

Use for quality gradients where you need nuance beyond pass/fail:

src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
export const relevanceEval = agent.createEval('relevance-score', {
  description: 'Scores how relevant the answer is to the question',
  handler: async (ctx, input, output) => {
    const { object } = await generateObject({
      model: openai('gpt-5.4-nano'),
      schema: z.object({
        score: z.number().min(0).max(1),
        reason: z.string(),
      }),
      prompt: `Score how relevant this answer is to the question (0-1).
 
Question: ${input.question}
Answer: ${output.answer}
 
0 = completely off-topic, 1 = directly addresses the question.`,
    });
 
    return {
      passed: object.score >= 0.7,
      score: object.score,
      reason: object.reason,
    }; 
  },
});

LLM-as-Judge Pattern

The LLM-as-judge pattern uses one model to evaluate another model's output. This is useful for subjective quality assessments that can't be checked programmatically. In this example, a small model judges whether a RAG agent's answer is grounded in the retrieved sources:

src/agent/rag-agent/eval.ts
import ragAgent from './agent';
import OpenAI from 'openai';
import { s } from '@agentuity/schema';
 
const client = new OpenAI();
 
const GroundingSchema = s.object({
  isGrounded: s.boolean(),
  unsupportedClaims: s.array(s.string()),
  score: s.number(),
});
 
export const hallucinationEval = ragAgent.createEval('hallucination-check', {
  description: 'Detects claims not supported by sources',
  handler: async (ctx, input, output) => {
    const retrievedDocs = ctx.state.get('retrievedDocs') as string[];
 
    const completion = await client.chat.completions.create({
      model: 'gpt-5.4-nano',
      response_format: {
        type: 'json_schema',
        json_schema: {
          name: 'grounding_check',
          schema: s.toJSONSchema(GroundingSchema) as Record<string, unknown>,
          strict: true,
        },
      },
      messages: [{
        role: 'user',
        content: `Check if this answer is supported by the source documents.
 
Question: ${input.question}
Answer: ${output.answer}
 
Sources:
${retrievedDocs.join('\n\n')}
 
Identify any claims not supported by the sources.`,
      }],
    });
 
    const result = JSON.parse(completion.choices[0]?.message?.content ?? '{}');
    return {
      passed: result.isGrounded,
      score: result.score,
      reason: result.isGrounded ? 'Answer is grounded in sources' : 'Found unsupported claims',
      metadata: {
        isGrounded: result.isGrounded,
        unsupportedClaims: result.unsupportedClaims,
      },
    };
  },
});

Inline Scoring for Frontend Display

When you need scores visible in your own UI (not just in the Agentuity app), run LLM-as-judge inline in your handler and include the results in your output schema:

import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
 
const ScoresSchema = z.object({
  creativity: z.number().min(0).max(1),
  engagement: z.number().min(0).max(1),
  toneMatch: z.boolean(),
});
 
const agent = createAgent('Story Generator', {
  schema: {
    input: z.object({ prompt: z.string(), tone: z.string() }),
    output: z.object({
      story: z.string(),
      scores: ScoresSchema,
    }),
  },
  handler: async (ctx, input) => {
    // Generate the story
    const { text: story } = await generateText({
      model: openai('gpt-5-mini'),
      prompt: `Write a short ${input.tone} story about: ${input.prompt}`,
    });
 
    // Inline LLM-as-judge: scores returned with response
    const { object: scores } = await generateObject({
      model: openai('gpt-5.4-nano'),
      schema: ScoresSchema,
      prompt: `Score this ${input.tone} story (0-1 for scores, boolean for tone match):
 
${story}
 
- creativity: How original and imaginative?
- engagement: How compelling to read?
- toneMatch: Does it match the requested "${input.tone}" tone?`,
    });
 
    return { story, scores };  // Frontend receives scores directly
  },
});
 
export default agent;

Your frontend can then display the scores alongside the response. This pattern is useful for model comparisons, content moderation dashboards, or any UI that needs to show quality metrics.
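For instance, a client could read the scores straight off the agent's output. This is only a sketch: the endpoint path and response shape below are assumptions for illustration, not part of the Agentuity API, so adapt them to however your agent is exposed.

// Hypothetical client-side usage: the endpoint path and response shape are
// assumptions for illustration, not part of the Agentuity API.
type StoryOutput = {
  story: string;
  scores: { creativity: number; engagement: number; toneMatch: boolean };
};
 
async function renderStory(prompt: string, tone: string) {
  const res = await fetch('/api/story-generator', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, tone }),
  });
  const output = (await res.json()) as StoryOutput;
 
  // Show the inline judge scores alongside the generated story
  console.log(output.story);
  console.log(`creativity: ${output.scores.creativity.toFixed(2)}`);
  console.log(`engagement: ${output.scores.engagement.toFixed(2)}`);
  console.log(`tone match: ${output.scores.toneMatch ? 'yes' : 'no'}`);
}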

Multiple Evals

When you have multiple evals, define them in a separate eval.ts file. All evals run in parallel after the agent completes. You can mix custom evals with preset evals:

src/agent/qa-agent/eval.ts
import agent from './agent';
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { pii, conciseness } from '@agentuity/evals';
 
// Eval 1: Custom LLM-based relevance score
export const relevanceEval = agent.createEval('relevance', {
  description: 'Scores response relevance',
  handler: async (ctx, input, output) => {
    ctx.logger.info('Running relevance check');
 
    const { object } = await generateObject({
      model: openai('gpt-5.4-nano'),
      schema: z.object({
        score: z.number().min(0).max(1),
        reason: z.string(),
      }),
      prompt: `Score relevance (0-1): Does "${output.answer}" answer "${input.question}"?`,
    });
 
    ctx.logger.info('Relevance check complete', { score: object.score });
    return {
      passed: object.score >= 0.7,
      score: object.score,
      reason: object.reason,
    };
  },
});
 
// Eval 2: Preset conciseness eval
export const concisenessCheck = agent.createEval(conciseness()); 
 
// Eval 3: Preset PII detection (LLM-powered, more thorough than regex)
export const piiCheck = agent.createEval(pii()); 

Errors in one eval don't affect others. Each runs independently.

Error Handling

Throw when an eval can't complete. The runtime records the error without affecting agent responses:

src/agent/my-agent/eval.ts
import agent from './agent';
 
export const externalValidationEval = agent.createEval('external-validation', {
  description: 'Validates output via external API',
  handler: async (ctx, input, output) => {
    try {
      const response = await fetch('https://api.example.com/validate', {
        method: 'POST',
        body: JSON.stringify({ text: output.answer }),
        signal: AbortSignal.timeout(3000),
      });
 
      if (!response.ok) {
        throw new Error(`Service error: ${response.status}`);
      }
 
      const result = await response.json();
      return { passed: result.isValid };
    } catch (error) {
      ctx.logger.error('Validation failed', { error });
      throw error;
    }
  },
});

Eval errors are logged but don't affect agent responses.

Preset Evals

The @agentuity/evals package provides reusable evaluations for common quality checks. These preset evals can be configured with custom thresholds (minimum score to pass) and models.

Available Presets

  • politeness (score, threshold 0.8): Flags rude, dismissive, condescending, or hostile tone
  • safety (binary): Detects unsafe content (harassment, harmful content, illegal guidance) and ensures medical/legal/financial advice includes disclaimers
  • pii (binary): Scans for personal data (emails, phone numbers, SSNs, addresses, credit cards)
  • conciseness (score, threshold 0.7): Penalizes filler phrases, redundant explanations, and responses disproportionate to request complexity
  • adversarial (binary): Detects prompt injection, jailbreaks, and manipulation attempts; auto-passes if no attack in request
  • ambiguity (score, threshold 0.7): Flags unclear references, vague statements, and undefined terms with multiple meanings
  • answerCompleteness (score, threshold 0.7): Checks that all questions are directly answered; penalizes tangential or vague responses
  • extraneousContent (score, threshold 0.7): Flags off-topic content, unsolicited advice, and meta-commentary ("I hope this helps!")
  • format (binary): Validates response matches requested format (JSON, lists, tables); auto-passes if no format specified
  • knowledgeRetention (score, threshold 0.7): Detects contradictions with prior conversation context; auto-passes with no history
  • roleAdherence (score, threshold 0.7): Ensures response stays in character; detects domain violations and persona breaks
  • selfReference (binary): Flags AI self-identification ("As an AI...") unless user asked about the model

Using Preset Evals

Import preset evals from @agentuity/evals and pass them to agent.createEval():

src/agent/chat/eval.ts
import agent from './agent';
import { politeness, safety, pii } from '@agentuity/evals';
 
// Use with default settings
export const politenessCheck = agent.createEval(politeness()); 
 
// Override the name
export const safetyCheck = agent.createEval(safety({
  name: 'safety-strict',
}));
 
// PII detection with defaults
export const piiCheck = agent.createEval(pii());

Configuring Preset Evals

Preset evals accept configuration options:

import agent from './agent';
import { politeness } from '@agentuity/evals';
import { openai } from '@ai-sdk/openai';
 
// Override model and threshold
export const politenessCheck = agent.createEval(politeness({
  name: 'politeness-strict',
  model: openai('gpt-5.4-nano'),
  threshold: 0.9,  // Stricter passing threshold
}));

All preset evals use a default model optimized for cost and speed. Override the model when you need specific capabilities.

Lifecycle Hooks

Preset evals support onStart and onComplete hooks for custom logic around eval execution:

import agent from './agent';
import { politeness } from '@agentuity/evals';
 
export const politenessCheck = agent.createEval(politeness({
  onStart: async (ctx, input, output) => {
    ctx.logger.info('Starting politeness eval', {
      inputLength: input.request?.length,
    });
  },
  onComplete: async (ctx, result) => {
    // Track results in external monitoring
    if (!result.passed) {
      ctx.logger.warn('Politeness check failed', {
        score: result.score,
        reason: result.reason,
      });
    }
  },
}));

Use cases for lifecycle hooks:

  • Log eval execution for debugging
  • Send results to external monitoring systems
  • Track eval performance metrics
  • Trigger alerts on failures

Schema Middleware

Preset evals expect a standard input/output format:

  • Input: { request: string, context?: string }
  • Output: { response: string }

When your agent uses different schemas, provide middleware to transform between them:

src/agent/calculator/eval.ts
import agent, { AgentInput, AgentOutput } from './agent';
import { politeness } from '@agentuity/evals';
 
// Agent schema: { value: number } -> { result: number, doubled: boolean }
// Eval expects: { request: string } -> { response: string }
 
export const politenessCheck = agent.createEval(
  politeness<typeof AgentInput, typeof AgentOutput>({
    middleware: {
      transformInput: (input) => ({
        request: `Calculate double of ${input.value}`,
      }),
      transformOutput: (output) => ({
        response: `Result: ${output.result}, Doubled: ${output.doubled}`,
      }),
    },
  })
);

Pass your agent's schema types as generics to get typed middleware transforms. Without generics, the transform functions receive any.

Next Steps