AI Judges

Overview

AI Judges use powerful LLMs to grade your LLM completions against predefined criteria. This approach enables scalable, consistent, and cost-effective evaluation with little to no dependence on human annotation and its labor cost.

How AI Judges Work

AI Judges evaluate completions by comparing them against criteria you define. The judge model analyzes the completion and provides a binary pass/fail result along with a justification for the decision.
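
As an illustration of the result shape described above, the following minimal sketch models a verdict as a pass/fail flag plus a justification. The `JudgeVerdict` class, the `parse_verdict` helper, and the JSON reply format with `pass` and `justification` keys are assumptions made for this example, not the platform's actual API or wire format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical container for one judge decision."""
    passed: bool
    justification: str


def parse_verdict(raw_reply: str) -> JudgeVerdict:
    """Parse a judge reply, assumed here to be a JSON object such as
    {"pass": true, "justification": "..."}."""
    data = json.loads(raw_reply)
    passed = data["pass"]
    # Reject anything that is not a strict boolean so that a string like
    # "false" is never silently treated as a pass.
    if not isinstance(passed, bool):
        raise ValueError("'pass' must be a JSON boolean")
    return JudgeVerdict(passed=passed, justification=str(data["justification"]))


verdict = parse_verdict('{"pass": false, "justification": "The answer drifts off-topic."}')
print(verdict.passed, "-", verdict.justification)
```

Keeping the justification next to the grade makes it easier to audit individual decisions later.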

Defining an AI Judge

Basic Criteria Definition

The simplest way to define an AI Judge is to provide a clear criterion. This should be a concise statement that clearly defines what constitutes a passing completion.

The AI Judge Grader is scaffolded by a prompt that explains the binary classification task and wraps your custom criterion. Your criterion should focus only on the expected behavior of the judged model; no explanation of the task is required.

Example criteria:

“The response must directly answer the user’s question without going off-topic”
“If the response includes math calculations, it must always provide a breakdown of the calculation”
“The completion must first include tags containing its reasoning before providing a final answer”
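
To picture how a scaffolding prompt can wrap a criterion, here is a rough sketch. The `SCAFFOLD` template text and the `build_judge_prompt` function are hypothetical; the real scaffold used by the AI Judge Grader is not shown in this page and may differ. The point is only that the task explanation lives in the scaffold, so the criterion can stay short.

```python
# Hypothetical scaffold: explains the binary task once, then slots in
# the user-supplied criterion and the interaction to grade.
SCAFFOLD = (
    "You are grading a model completion. Decide whether it satisfies the "
    "criterion below and answer PASS or FAIL with a short justification.\n\n"
    "Criterion: {criterion}\n\n"
    "User prompt:\n{prompt}\n\n"
    "Completion to grade:\n{completion}\n"
)


def build_judge_prompt(criterion: str, prompt: str, completion: str) -> str:
    """Wrap a criterion and one interaction in the scaffold."""
    criterion = criterion.strip()
    if not criterion:
        raise ValueError("criterion must not be empty")
    return SCAFFOLD.format(criterion=criterion, prompt=prompt, completion=completion)


print(build_judge_prompt(
    "The response must directly answer the user's question without going off-topic",
    "What is the capital of France?",
    "Paris is the capital of France.",
))
```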

Adding Examples (Few-Shot Prompting)

To improve the accuracy and consistency of your AI Judge, you can provide examples, each consisting of an interaction (input prompt and completion), a pass/fail grade, and a justification. This helps the judge understand your specific evaluation standards and improves its performance.

Example criterion: If the response includes math calculations, it must always provide a breakdown of the calculation
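
For the math-calculation criterion, a pair of few-shot examples might be represented as in the sketch below. The `JudgeExample` class, the `render_examples` helper, the text layout, and the two sample interactions are all illustrative assumptions; the platform's own example format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeExample:
    """Hypothetical few-shot example: an interaction plus its grade."""
    prompt: str
    completion: str
    passed: bool
    justification: str


def render_examples(examples: list[JudgeExample]) -> str:
    """Format examples as plain text that could precede the item to grade."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        grade = "PASS" if ex.passed else "FAIL"
        blocks.append(
            f"Example {i}\n"
            f"Prompt: {ex.prompt}\n"
            f"Completion: {ex.completion}\n"
            f"Grade: {grade}\n"
            f"Justification: {ex.justification}"
        )
    return "\n\n".join(blocks)


# One passing and one failing example for the same prompt, so the judge
# sees exactly where the boundary lies.
EXAMPLES = [
    JudgeExample(
        prompt="What is 15% of 80?",
        completion="15% of 80 is 0.15 * 80 = 12.",
        passed=True,
        justification="The completion shows the multiplication that produces the answer.",
    ),
    JudgeExample(
        prompt="What is 15% of 80?",
        completion="12",
        passed=False,
        justification="The completion gives the result without showing how it was calculated.",
    ),
]

print(render_examples(EXAMPLES))
```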

Best Practices

Writing Clear Criteria

Keep criteria concise and unambiguous
Use specific, measurable language
Avoid subjective terms that could lead to inconsistent grading
Focus on one main evaluation aspect per judge; keep each criterion atomic

Providing Effective Examples

Include diverse examples that cover edge cases
Ensure examples clearly demonstrate pass/fail scenarios
Provide detailed justifications that explain the reasoning
Use examples that are representative of your actual use case

Validation

Test your AI Judge on a small set of examples first
Compare AI Judge results with human judgments when possible
Refine criteria and examples based on initial results
Monitor for bias or inconsistency in grading
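
One way to compare AI Judge results with human judgments is to measure how often they agree on the same items. The sketch below computes the raw agreement rate and Cohen's kappa, which corrects for agreement expected by chance. The function names and the sample labels are made up for illustration and are not part of the platform.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of items where the judge and the human gave the same grade."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary pass/fail labels."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    p_judge = sum(judge) / n  # share of items the judge passed
    p_human = sum(human) / n  # share of items the human passed
    # Chance agreement: both say pass, or both say fail, independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters used a single identical label everywhere; kappa is
        # formally undefined, so this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only (True = pass).
judge_labels = [True, True, False, False]
human_labels = [True, False, False, False]

print("agreement:", agreement_rate(judge_labels, human_labels))
print("kappa:", cohens_kappa(judge_labels, human_labels))
```

Items where the two labels differ are good candidates for review: they often reveal an ambiguous criterion or a missing few-shot example.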
