Overview

AI Judges use powerful LLMs to grade your LLM completions against predefined criteria. This approach enables scalable, consistent, and cost-effective evaluation with little to no dependency on human annotation and its associated labor cost.

How AI Judges Work

AI Judges evaluate completions by comparing them against criteria you define. The judge model analyzes the completion and provides a binary pass/fail result along with a justification for the decision.
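
As a rough illustration of the output described above, the sketch below models a verdict as a pass/fail flag plus a justification. The `JudgeVerdict` class, the `parse_verdict` helper, and the JSON reply shape (`grade`, `justification`) are assumptions made for this example, not the product's actual response format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical container for a single judge decision."""
    passed: bool          # binary pass/fail result
    justification: str    # the judge's explanation for the decision


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse a judge reply, ASSUMED to be JSON such as
    {"grade": "pass", "justification": "..."}."""
    data = json.loads(raw)
    grade = str(data["grade"]).strip().lower()
    if grade not in ("pass", "fail"):
        # Reject anything that is not a clean binary grade.
        raise ValueError(f"unexpected grade: {grade!r}")
    return JudgeVerdict(
        passed=(grade == "pass"),
        justification=str(data["justification"]),
    )


verdict = parse_verdict(
    '{"grade": "fail", "justification": "The answer drifts off-topic."}'
)
print(verdict.passed, "-", verdict.justification)
```

Rejecting any grade other than "pass" or "fail" keeps malformed judge replies from being silently counted as failures.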

Defining an AI Judge

Basic Criteria Definition

The simplest way to define an AI Judge is to provide a clear criterion. This should be a concise statement that clearly defines what constitutes a passing completion.
The AI Judge Grader is scaffolded by a prompt that explains the binary classification task and wraps your custom criterion. Your criterion should focus only on the expected behavior of the judged model; no explanation of the task is required.
Example criteria:
  • “The response must directly answer the user’s question without going off-topic”
  • “If the response includes math calculations, it must always provide a breakdown of the calculation”
  • “The completion must first have tags with its reasoning inside before providing a final answer”
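
To make the scaffolding idea concrete, here is a minimal sketch of how a fixed prompt might wrap a user-supplied criterion. The template wording and the `build_judge_prompt` function are hypothetical; the real scaffold prompt is not shown in this document. The point is only that the task explanation lives in the scaffold, so the criterion itself can stay short.

```python
# Hypothetical scaffold: it explains the binary task so the criterion need not.
SCAFFOLD_TEMPLATE = """You are grading a model completion.
Decide whether the completion satisfies the criterion below.
Answer with "pass" or "fail", followed by a short justification.

Criterion:
{criterion}

Prompt:
{prompt}

Completion:
{completion}
"""


def build_judge_prompt(criterion: str, prompt: str, completion: str) -> str:
    """Wrap a custom criterion and one interaction in the scaffold."""
    return SCAFFOLD_TEMPLATE.format(
        criterion=criterion.strip(),
        prompt=prompt,
        completion=completion,
    )


judge_prompt = build_judge_prompt(
    criterion="The response must directly answer the user's question "
              "without going off-topic",
    prompt="What is the boiling point of water at sea level?",
    completion="Water boils at 100 degrees Celsius at sea level.",
)
print(judge_prompt)
```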

Adding Examples (Few-Shot Prompting)

To improve the accuracy and consistency of your AI Judge, you can provide examples, each consisting of an interaction (input prompt and completion), a pass/fail grade, and a justification. These examples help the judge understand your specific evaluation standards.

Example criterion: “If the response includes math calculations, it must always provide a breakdown of the calculation”

Figure: Examples of shots given to an AI Judge during the grader definition.
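
A set of shots for the math-breakdown criterion might look like the sketch below. The `Shot` class, the `format_shots` helper, and the three sample interactions are invented for illustration; they show one plausible way to pair each interaction with a grade and justification, and they include a case where the criterion does not apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shot:
    """Hypothetical few-shot example: one interaction plus its grade."""
    prompt: str
    completion: str
    passed: bool
    justification: str


# Illustrative shots (invented data) for the math-breakdown criterion.
SHOTS = [
    Shot(
        prompt="What is 12 * 15?",
        completion="12 * 15 = 12 * 10 + 12 * 5 = 120 + 60 = 180.",
        passed=True,
        justification="The completion shows the intermediate steps.",
    ),
    Shot(
        prompt="What is 12 * 15?",
        completion="180.",
        passed=False,
        justification="Only the result is given, with no breakdown.",
    ),
    Shot(
        prompt="What is the capital of France?",
        completion="Paris.",
        passed=True,
        justification="No calculation is involved, so the criterion "
                      "does not apply.",
    ),
]


def format_shots(shots: list[Shot]) -> str:
    """Render shots as plain text that could be added to a judge prompt."""
    blocks = []
    for i, shot in enumerate(shots, start=1):
        grade = "pass" if shot.passed else "fail"
        blocks.append(
            f"Example {i}\n"
            f"Prompt: {shot.prompt}\n"
            f"Completion: {shot.completion}\n"
            f"Grade: {grade}\n"
            f"Justification: {shot.justification}"
        )
    return "\n\n".join(blocks)


print(format_shots(SHOTS))
```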

Best Practices

Writing Clear Criteria

  • Keep criteria concise and unambiguous
  • Use specific, measurable language
  • Avoid subjective terms that could lead to inconsistent grading
  • Focus on one main evaluation aspect per judge; keep each criterion atomic

Providing Effective Examples

  • Include diverse examples that cover edge cases
  • Ensure examples clearly demonstrate pass/fail scenarios
  • Provide detailed justifications that explain the reasoning
  • Use examples that are representative of your actual use case

Validation

  • Test your AI Judge on a small set of examples first
  • Compare AI Judge results with human judgments when possible
  • Refine criteria and examples based on initial results
  • Monitor for bias or inconsistency in grading
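
One way to compare AI Judge results with human judgments is to compute raw agreement and Cohen's kappa, which corrects agreement for chance, over a small labeled set. The sketch below assumes both sets of grades are available as parallel lists of booleans (True for pass). The `agreement_report` function and the sample labels are hypothetical.

```python
def agreement_report(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare judge grades with human grades on the same completions."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    n = len(judge)
    pairs = list(zip(judge, human))
    both_pass = sum(1 for j, h in pairs if j and h)
    both_fail = sum(1 for j, h in pairs if not j and not h)
    judge_only = sum(1 for j, h in pairs if j and not h)   # judge too lenient
    human_only = sum(1 for j, h in pairs if not j and h)   # judge too strict

    observed = (both_pass + both_fail) / n
    p_judge = (both_pass + judge_only) / n   # judge pass rate
    p_human = (both_pass + human_only) / n   # human pass rate
    # Agreement expected by chance, given each grader's pass rate.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {
        "agreement": observed,
        "kappa": kappa,
        "judge_too_lenient": judge_only / n,
        "judge_too_strict": human_only / n,
    }


# Invented labels for eight completions.
judge_grades = [True, True, False, False, True, False, True, True]
human_grades = [True, False, False, False, True, False, True, True]
print(agreement_report(judge_grades, human_grades))
```

Reporting the lenient and strict disagreement rates separately helps reveal a systematic bias in one direction, which a single agreement number would hide.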