The LLM-as-judge pattern uses one model to evaluate another model's output. This is useful for subjective quality assessments that can't be checked programmatically.
When to Use This Pattern
| Use Case | Good Fit? |
|---|---|
| Checking factual accuracy against sources | Yes |
| Evaluating tone and politeness | Yes |
| Detecting hallucinations in RAG | Yes |
| Comparing response quality | Yes |
| Validating JSON structure | No (use schema validation) |
| Checking string length | No (use code) |
Use LLM-as-judge when the evaluation requires understanding context, nuance, or subjective criteria.
Inline vs Background Checks
LLM-as-judge can be used in two contexts:
| Context | When to Use | Blocks Response? |
|---|---|---|
| Inline (in handler) | Scores returned with the response, model comparison, or request-time gating | Yes |
Background (waitUntil()) | Quality monitoring, compliance checks, or review flows after the response is sent | No |
Inline example: A model arena that shows users which response "won" needs scores returned with the response.
Background example: Checking responses for PII or hallucinations without adding latency to the main response.
Run LLM-as-judge inline when the score is part of the response. Use waitUntil() when the result is for monitoring, review, or follow-up work.
Basic Pattern
Use a fast model to judge outputs. Groq provides low-latency inference for judge calls:
import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
const JudgmentSchema = z.object({
score: z.number().min(0).max(1).describe('Quality score from 0 to 1'),
passed: z.boolean().describe('Whether the response meets quality threshold'),
reason: z.string().describe('Brief explanation of the judgment'),
});
const agent = createAgent('QualityChecker', {
schema: {
input: z.object({ question: z.string() }),
output: z.object({
answer: z.string(),
judgment: JudgmentSchema,
}),
},
handler: async (ctx, input) => {
// Generate answer with primary model
const { text: answer } = await generateText({
model: openai('gpt-5-mini'),
prompt: input.question,
});
// Judge with fast model
const { object: judgment } = await generateObject({
model: groq('openai/gpt-oss-120b'),
schema: JudgmentSchema,
prompt: `Evaluate this response on a 0-1 scale.
Question: ${input.question}
Response: ${answer}
Consider:
- Does it directly answer the question?
- Is the information accurate?
- Is it clear and easy to understand?
Score 0.7+ to pass.`,
});
return { answer, judgment };
},
});
export default agent;Model Comparison Arena
Compare responses from multiple providers and declare a winner. This pattern is useful for benchmarking, A/B testing, or letting users see quality differences:
import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { anthropic } from '@ai-sdk/anthropic';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
const ArenaJudgment = z.object({
winner: z.enum(['openai', 'anthropic']),
reasoning: z.string(),
scores: z.object({
openai: z.number().min(0).max(1),
anthropic: z.number().min(0).max(1),
}),
});
const arenaAgent = createAgent('ModelArena', {
schema: {
input: z.object({
prompt: z.string(),
tone: z.enum(['whimsical', 'suspenseful', 'comedic']),
}),
output: z.object({
results: z.array(z.object({
provider: z.string(),
story: z.string(),
generationMs: z.number(),
})),
judgment: ArenaJudgment,
}),
},
handler: async (ctx, input) => {
// Generate from both models in parallel
const [openaiResult, anthropicResult] = await Promise.all([
generateWithTiming(openai('gpt-5.4-nano'), input.prompt, input.tone),
generateWithTiming(anthropic('claude-haiku-4-5'), input.prompt, input.tone),
]);
const results = [
{ provider: 'openai', ...openaiResult },
{ provider: 'anthropic', ...anthropicResult },
];
// Judge with fast model
const { object: judgment } = await generateObject({
model: groq('openai/gpt-oss-120b'),
schema: ArenaJudgment,
prompt: `Compare these ${input.tone} stories and pick a winner.
PROMPT: "${input.prompt}"
--- OpenAI ---
${openaiResult.story}
--- Anthropic ---
${anthropicResult.story}
Score each on creativity, engagement, and tone match (0-1).
Declare a winner with reasoning.`,
});
ctx.logger.info('Arena complete', { winner: judgment.winner });
return { results, judgment };
},
});
async function generateWithTiming(
model: Parameters<typeof generateText>[0]['model'],
prompt: string,
tone: string
) {
const start = Date.now();
const { text } = await generateText({
model,
system: `Write a short ${tone} story (max 200 words) with a beginning, middle, and end.`,
prompt,
});
return { story: text, generationMs: Date.now() - start };
}
export default arenaAgent;The SDK Explorer includes a working Model Arena demo that demonstrates this pattern in more detail.
Grounding Check for RAG
Detect hallucinations by checking if claims are supported by retrieved sources:
import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
const GroundingJudgment = z.object({
isGrounded: z.boolean().describe('Whether all claims are supported by sources'),
score: z.number().min(0).max(1).describe('Percentage of claims supported'),
unsupportedClaims: z.array(z.string()).describe('Claims not found in sources'),
reason: z.string(),
});
interface DocumentMetadata extends Record<string, unknown> {
content: string;
title: string | undefined;
}
const ragAgent = createAgent('RAGWithGrounding', {
schema: {
input: z.object({ question: z.string() }),
output: z.object({
answer: z.string(),
grounding: GroundingJudgment,
sources: z.array(z.string()),
}),
},
handler: async (ctx, input) => {
// Retrieve relevant documents
const results = await ctx.vector.search<DocumentMetadata>('knowledge-base', {
query: input.question,
limit: 3,
});
const sources = results.map(r => r.metadata?.content || '').filter(Boolean);
// Generate answer
const { text: answer } = await generateText({
model: openai('gpt-5-mini'),
prompt: `Answer based on these sources:\n\n${sources.join('\n\n')}\n\nQuestion: ${input.question}`,
});
// Check grounding with fast model
const { object: grounding } = await generateObject({
model: groq('openai/gpt-oss-120b'),
schema: GroundingJudgment,
prompt: `Check if this answer is supported by the sources.
Answer: ${answer}
Sources:
${sources.map((s, i) => `[${i + 1}] ${s}`).join('\n\n')}
Are all factual claims in the answer supported by the sources?
List any unsupported claims.`,
});
if (!grounding.isGrounded) {
ctx.logger.warn('Hallucination detected', {
claims: grounding.unsupportedClaims,
});
}
return { answer, grounding, sources: results.map(r => r.key) };
},
});
export default ragAgent;Run in the Background
For background quality monitoring, queue the judge with ctx.waitUntil() so the response returns immediately:
import { createAgent } from '@agentuity/runtime';
import { generateText, generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { groq } from '@ai-sdk/groq';
import { z } from 'zod';
const HelpfulnessJudgment = z.object({
scores: z.object({
helpfulness: z.object({
score: z.number().min(0).max(1),
reason: z.string(),
}),
}),
checks: z.object({
answersQuestion: z.object({
passed: z.boolean(),
reason: z.string(),
}),
actionable: z.object({
passed: z.boolean(),
reason: z.string(),
}),
}),
});
const agent = createAgent('QAAgent', {
schema: {
input: z.object({ question: z.string() }),
output: z.object({ answer: z.string() }),
},
handler: async (ctx, input) => {
const { text: answer } = await generateText({
model: openai('gpt-5-mini'),
prompt: input.question,
});
ctx.waitUntil(async () => {
const { object } = await generateObject({
model: groq('openai/gpt-oss-120b'),
schema: HelpfulnessJudgment,
prompt: `Evaluate this response.
Question: ${input.question}
Response: ${answer}
SCORE:
- helpfulness: How useful is this response? (0 = useless, 1 = extremely helpful)
CHECKS:
- answersQuestion: Does it directly answer what was asked?
- actionable: Can someone act on this information?`,
});
ctx.logger.info('Judge completed', {
helpfulness: object.scores.helpfulness.score,
answersQuestion: object.checks.answersQuestion.passed,
actionable: object.checks.actionable.passed,
});
});
return { answer };
},
});
export default agent;Structuring Judge Prompts
Good judge prompts have clear structure:
const judgePrompt = `You are evaluating a ${taskType} response.
CONTEXT:
${context}
RESPONSE TO EVALUATE:
${response}
SCORING CRITERIA (0.0-1.0):
- criterion1: What specifically to look for
- criterion2: What specifically to look for
CHECKS (pass/fail):
- check1: Specific yes/no question
- check2: Specific yes/no question
Evaluate each criterion, then provide an overall assessment.`;Tips:
- Be explicit about the scale (0-1, not 1-10)
- Define what each score level means
- Make checks binary questions with clear answers
- Include the original context/prompt for reference
Best Practices
- Use lower-latency judge models: examples here use Groq
openai/gpt-oss-120b, OpenAIgpt-5.4-nano, and Anthropicclaude-haiku-4-5 - Separate scores from checks: Scores for gradients, checks for pass/fail
- Structure prompts clearly: List criteria explicitly with descriptions
- Log low scores: Track failures for debugging and model improvement
- Consider latency: Use inline judging only when users need to see results
Cost Optimization
LLM-as-judge adds model calls. Optimize by:
- Using the fastest capable model for the rubric, such as Groq
openai/gpt-oss-120bor a smaller provider model - Running the judge asynchronously when the request path does not need a score
- Sampling a percentage of requests in high-volume scenarios
- Batching judgments when evaluating multiple items
See the Code Runner example for parallel sandbox execution with LLM-as-judge checks across TypeScript and Python.
Next Steps
- Using the AI SDK: Model selection and configuration
- Vector Storage: Build RAG systems to ground responses