AI Judges leverage powerful LLMs to grade your LLM completions based on predefined criteria.
This approach enables scalable, consistent, and cost-effective evaluation with little to no dependence on human annotation and its labor cost.
AI Judges evaluate completions by comparing them against criteria you define. The judge model analyzes the completion and provides a binary pass/fail result along with a justification for the decision.
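For instance, a judge's verdict can be thought of as a small structured record. The sketch below is purely illustrative; the field names `passed` and `justification` are assumptions, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """Shape of an AI Judge's output (illustrative field names)."""
    passed: bool        # binary pass/fail result
    justification: str  # the judge's explanation for the decision

verdict = JudgeVerdict(
    passed=False,
    justification="The response drifts into unrelated topics instead of "
                  "answering the user's question.",
)
```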
The simplest way to define an AI Judge is to provide a single clear criterion: a concise statement of what constitutes a passing completion.
The AI Judge Grader is scaffolded by a prompt that explains the binary classification task and wraps your custom criterion (see the sketch after the example criteria below). Your criterion should focus only on the expected behavior of the judged model; no explanation of the task itself is required.
Example criteria:
- “The response must directly answer the user’s question without going off-topic”
- “If the response includes math calculations, it must always provide a breakdown of the calculation”
- “The completion must first include its reasoning inside tags before providing a final answer”
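To make the scaffolding concrete, here is a minimal sketch of how a criterion might be wrapped into a binary classification prompt and parsed into a verdict. The prompt wording, the `call_judge_model` placeholder, and the `grade` helper are all assumptions for illustration, not the platform's actual scaffold.

```python
CRITERION = (
    "If the response includes math calculations, it must always "
    "provide a breakdown of the calculation"
)

# Illustrative scaffold: the real platform prompt wording may differ.
JUDGE_PROMPT_TEMPLATE = """\
You are grading an LLM completion against a single criterion.

Criterion: {criterion}

Input prompt:
{prompt}

Completion to grade:
{completion}

Answer with PASS or FAIL on the first line, then a one-sentence justification."""

def call_judge_model(judge_input: str) -> str:
    """Placeholder for a real judge-model call via your provider's SDK."""
    return "PASS\n80 * 0.15 = 12 is shown, so the breakdown is present."

def grade(prompt: str, completion: str) -> tuple[bool, str]:
    """Wrap the criterion in the scaffold and parse the binary verdict."""
    judge_input = JUDGE_PROMPT_TEMPLATE.format(
        criterion=CRITERION, prompt=prompt, completion=completion
    )
    raw = call_judge_model(judge_input)
    label, _, justification = raw.partition("\n")
    return label.strip().upper() == "PASS", justification.strip()

passed, why = grade("What is 15% of 80?", "15% of 80 is 12 (80 * 0.15 = 12).")
print(passed, why)
```

The key design point is that you only author the `Criterion:` line; the surrounding task explanation is supplied by the scaffold, which is why your criterion should not restate it.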
To improve the accuracy and consistency of your AI Judge, you can provide example interactions (input prompt and completion) together with their pass/fail grades and justifications. This helps the judge understand your specific evaluation standards and improves its performance.
Example criterion: “If the response includes math calculations, it must always provide a breakdown of the calculation”
Example shots provided to an AI Judge during grader definition
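Such a shot could be expressed as a record bundling the interaction, its grade, and a justification, as in the sketch below for the math-breakdown criterion above. The field names and example data are illustrative assumptions, not a fixed schema.

```python
# Illustrative few-shot examples; field names are assumptions.
few_shot_examples = [
    {
        "prompt": "What is 15% of 80?",
        "completion": "15% of 80 is 12, because 80 * 0.15 = 12.",
        "grade": "pass",
        "justification": "The math result comes with a breakdown of the calculation.",
    },
    {
        "prompt": "What is 15% of 80?",
        "completion": "12.",
        "grade": "fail",
        "justification": "A math result is given with no breakdown of how it was computed.",
    },
]
```

Pairing a passing and a failing shot on the same input, as above, is a simple way to show the judge exactly where your pass/fail boundary lies.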