Agentuity Documentation

When an Agentuity app needs a prompt and model regression suite, keep the prompt text in app code and let the OpenAI Evals API run the dataset and graders. This fits checks where the behavior under test is the prompt, model choice, and output contract.

npm install openai zod @agentuity/telemetry

This creates OpenAI resources

The script below uploads a JSONL file and creates an eval run through the OpenAI API. Keep OPENAI_API_KEY and OPENAI_EVAL_MODEL out of committed files.

Share the Task Prompt

Put the prompt and output vocabulary in the same server module your app uses. The eval script can import those constants without importing the framework route.

typescript: src/lib/ticket-classifier.ts

export const TICKET_CLASSIFIER_INSTRUCTIONS =
  'Categorize the support ticket into one of Hardware, Software, or Other. Respond with only the label.';
 
export type TicketLabel = 'Hardware' | 'Software' | 'Other';
 
export function normalizeTicketLabel(value: string): TicketLabel | undefined {
  const normalized = value.trim().toLowerCase();
 
  if (normalized === 'hardware') {
    return 'Hardware';
  }
 
  if (normalized === 'software') {
    return 'Software';
  }
 
  if (normalized === 'other') {
    return 'Other';
  }
 
  return undefined;
}
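The app-side normalizer is more lenient than a literal string comparison, which matters when reading eval results. The sketch below contrasts the two, assuming the eq operation used later in the eval behaves as an exact comparison. The helpers matchesExactly and matchesAfterNormalizing are hypothetical names, and the compact normalizeTicketLabel is a stand-in so the block runs on its own; in the app you would import the real one from src/lib/ticket-classifier.ts.

```typescript
// Stand-in for the module above so this sketch is self-contained.
type TicketLabel = 'Hardware' | 'Software' | 'Other';

const LABELS: readonly TicketLabel[] = ['Hardware', 'Software', 'Other'];

function normalizeTicketLabel(value: string): TicketLabel | undefined {
  const normalized = value.trim().toLowerCase();
  return LABELS.find((label) => label.toLowerCase() === normalized);
}

// Hypothetical helper: literal comparison, with no trimming or case folding.
function matchesExactly(output: string, expected: TicketLabel): boolean {
  return output === expected;
}

// Hypothetical helper: comparison after the app's own normalization.
function matchesAfterNormalizing(output: string, expected: TicketLabel): boolean {
  return normalizeTicketLabel(output) === expected;
}

// Print how each candidate output fares under both comparisons.
const candidateOutputs = ['Hardware', ' hardware\n', 'Hardware.', 'Printer'];

for (const output of candidateOutputs) {
  console.log(JSON.stringify(output), {
    exact: matchesExactly(output, 'Hardware'),
    normalized: matchesAfterNormalizing(output, 'Hardware'),
  });
}
```

An output such as " hardware" followed by a newline would be accepted by the app's normalizer but would not pass a literal comparison, so a failing row in the eval does not always mean the app would have misclassified the ticket.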

Write the Dataset

The Evals API file upload expects JSONL rows whose item object matches the eval's item_schema.

json: src/evals/tickets.jsonl

{ "item": { "ticket_text": "My monitor will not turn on.", "correct_label": "Hardware" } }
{ "item": { "ticket_text": "The app crashes after I update the package.", "correct_label": "Software" } }
{ "item": { "ticket_text": "Where should I eat lunch?", "correct_label": "Other" } }
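Because a malformed row only surfaces after upload, it can help to check the file locally first. The following is a minimal sketch, not part of the OpenAI SDK: parseTicketDataset and isTicketLabel are hypothetical helpers that check each JSONL line for an item object with a non-empty ticket_text and a correct_label from the three-label vocabulary.

```typescript
type TicketLabel = 'Hardware' | 'Software' | 'Other';

interface TicketRow {
  ticket_text: string;
  correct_label: TicketLabel;
}

const LABELS: readonly string[] = ['Hardware', 'Software', 'Other'];

function isTicketLabel(value: unknown): value is TicketLabel {
  return typeof value === 'string' && LABELS.includes(value);
}

// Hypothetical helper: parses JSONL text and throws on the first bad row.
function parseTicketDataset(jsonl: string): TicketRow[] {
  const rows: TicketRow[] = [];

  jsonl.split('\n').forEach((line, index) => {
    if (line.trim() === '') return; // tolerate blank lines such as a trailing newline

    const lineNumber = index + 1;
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      throw new Error(`Line ${lineNumber}: not valid JSON`);
    }

    const item = (parsed as { item?: Record<string, unknown> } | null)?.item;
    if (typeof item !== 'object' || item === null) {
      throw new Error(`Line ${lineNumber}: missing "item" object`);
    }

    const text = item.ticket_text;
    const label = item.correct_label;
    if (typeof text !== 'string' || text.trim() === '') {
      throw new Error(`Line ${lineNumber}: "ticket_text" must be a non-empty string`);
    }
    if (!isTicketLabel(label)) {
      throw new Error(`Line ${lineNumber}: "correct_label" is not a known label`);
    }

    rows.push({ ticket_text: text, correct_label: label });
  });

  return rows;
}

// Example with two rows from the dataset above.
const sample = [
  '{ "item": { "ticket_text": "My monitor will not turn on.", "correct_label": "Hardware" } }',
  '{ "item": { "ticket_text": "Where should I eat lunch?", "correct_label": "Other" } }',
].join('\n');

console.log(parseTicketDataset(sample));
```

In the eval script, the same function could be run on the file contents (for example, read with fs.readFileSync) before the upload call.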

Create the Eval Run

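Create the eval, upload the dataset, then start a run that uses the Responses API data source. An eval only needs to be created once and can back many runs; the script below, as written, creates a new eval on every invocation. The string_check grader compares the model output with the human label in each row using the eq operation.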

typescript: src/evals/openai-ticket-eval.ts

import fs from 'node:fs';
import OpenAI from 'openai';
import { logger } from '@agentuity/telemetry';
import { TICKET_CLASSIFIER_INSTRUCTIONS } from '../lib/ticket-classifier';
 
async function createTicketEval(openai: OpenAI): Promise<string> {
  const evalObject = await openai.evals.create({
    name: 'Agentuity support ticket classifier',
    data_source_config: {
      type: 'custom',
      item_schema: {
        type: 'object',
        properties: {
          ticket_text: { type: 'string' },
          correct_label: { type: 'string' },
        },
        required: ['ticket_text', 'correct_label'],
      },
      include_sample_schema: true,
    },
    testing_criteria: [
      {
        type: 'string_check',
        name: 'Matches expected label',
        input: '{{ sample.output_text }}',
        operation: 'eq',
        reference: '{{ item.correct_label }}',
      },
    ],
  });
 
  return evalObject.id;
}
 
async function uploadTicketDataset(openai: OpenAI, filePath: string): Promise<string> {
  const file = await openai.files.create({
    file: fs.createReadStream(filePath),
    purpose: 'evals',
  });
 
  return file.id;
}
 
async function createTicketEvalRun(
  openai: OpenAI,
  evalId: string,
  fileId: string,
  model: string
): Promise<string> {
  const run = await openai.evals.runs.create(evalId, {
    name: `Agentuity support ticket classifier: ${model}`,
    data_source: {
      type: 'responses',
      model,
      input_messages: {
        type: 'template',
        template: [
          { role: 'developer', content: TICKET_CLASSIFIER_INSTRUCTIONS },
          { role: 'user', content: '{{ item.ticket_text }}' },
        ],
      },
      source: { type: 'file_id', id: fileId },
    },
  });
 
  return run.id;
}
 
const apiKey = process.env.OPENAI_API_KEY;
const model = process.env.OPENAI_EVAL_MODEL;
 
if (!apiKey || !model) {
  throw new Error('Set OPENAI_API_KEY and OPENAI_EVAL_MODEL.');
}
 
const openai = new OpenAI({ apiKey });
const evalId = await createTicketEval(openai);
const fileId = await uploadTicketDataset(openai, 'src/evals/tickets.jsonl');
const runId = await createTicketEvalRun(openai, evalId, fileId, model);
 
logger.info('openai eval run created', { evalId, fileId, runId });
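The script above calls createTicketEval each time it runs. If you would rather reuse one eval across runs, one option is to store the eval ID and skip creation when it is present. The sketch below assumes a hypothetical getOrCreateEvalId helper and a hypothetical OPENAI_EVAL_ID environment variable; neither is part of the OpenAI SDK or Agentuity.

```typescript
// Hypothetical helper: returns a stored eval ID when one is configured and
// only calls `create` when none is available.
async function getOrCreateEvalId(
  existingId: string | undefined,
  create: () => Promise<string>
): Promise<string> {
  const trimmed = existingId?.trim();
  if (trimmed) {
    return trimmed;
  }

  return create();
}

// Possible use in the script above (OPENAI_EVAL_ID is an assumed variable name):
//
// const evalId = await getOrCreateEvalId(process.env.OPENAI_EVAL_ID, () =>
//   createTicketEval(openai)
// );
```

Passing the creation step as a callback keeps the helper independent of the SDK client. If the item schema or grader changes, a stored ID would still point at the old definition, so clear it when the eval configuration changes.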

Run the script from your app root:

OPENAI_EVAL_MODEL=gpt-4.1 bun run src/evals/openai-ticket-eval.ts

Check the Run Status

Eval runs execute asynchronously. Poll the run or open the report_url returned by the API.

const run = await openai.evals.runs.retrieve(runId, { eval_id: evalId });
 
logger.info('openai eval run status', {
  status: run.status,
  resultCounts: run.result_counts,
  reportUrl: run.report_url,
});
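A single retrieve call only shows the status at that moment. The sketch below illustrates one way to poll until the run finishes and then turn result_counts into a pass rate for a CI gate. waitForRun, passRate, and the RunSnapshot type are hypothetical names, and the set of terminal statuses is an assumption to confirm against the current API reference.

```typescript
// Only the fields this sketch reads from a retrieved run.
interface RunSnapshot {
  status: string;
  result_counts: { total: number; passed: number; failed: number; errored: number };
}

// Assumed terminal statuses; check the API reference before relying on them.
const TERMINAL_STATUSES = new Set(['completed', 'failed', 'canceled']);

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls fetchRun until the status is terminal or the
// deadline has passed, waiting intervalMs between attempts.
async function waitForRun(
  fetchRun: () => Promise<RunSnapshot>,
  options: { intervalMs?: number; timeoutMs?: number } = {}
): Promise<RunSnapshot> {
  const { intervalMs = 5_000, timeoutMs = 10 * 60_000 } = options;
  const deadline = Date.now() + timeoutMs;

  for (;;) {
    const run = await fetchRun();
    if (TERMINAL_STATUSES.has(run.status)) {
      return run;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Run is still "${run.status}" after ${timeoutMs} ms`);
    }
    await sleep(intervalMs);
  }
}

// Hypothetical helper: fraction of rows that passed; 0 when nothing was graded.
function passRate(run: RunSnapshot): number {
  const { total, passed } = run.result_counts;
  return total === 0 ? 0 : passed / total;
}

// Possible use with the client from the script above (the 0.9 threshold is
// an arbitrary example value):
//
// const finished = await waitForRun(() =>
//   openai.evals.runs.retrieve(runId, { eval_id: evalId })
// );
// if (finished.status !== 'completed' || passRate(finished) < 0.9) {
//   process.exitCode = 1;
// }
```

Taking the fetch step as a callback keeps the loop testable without network access and leaves the choice of interval and timeout to the caller.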

Keep Agentuity Beside It

The Evals API runs the model task from the eval configuration on OpenAI's side. It does not call your deployed Agentuity route or exercise Agentuity service clients, so that coverage has to come from a separate test that runs your app code.

Use Evals and Testing or Braintrust Evals when the test needs to call app functions, routes, retrieval code, service clients, or a full workflow. Use Tracing when a failed eval needs spans around the app-owned work.

Next Steps

Evals and Testing: choose the eval shape for app-owned workflows
LLM as a Judge: score model output with a rubric in code
AI Gateway: route provider SDK calls through Agentuity when the app uses project credentials
