Graders evaluate LLM completions and produce metric scores. Use them as reward signals for training or to assess model performance during evaluation.

Create an AI judge

AI judges use an LLM to grade completions based on a criterion you define:
adaptive.graders.create.binary_judge(
    key="helpful-judge",
    criteria="The response directly answers the user's question without going off-topic",
    judge_model="llama-3.1-8b-instruct",
    feedback_key="helpfulness",
)
Parameter      Type   Required   Description
key            str    Yes        Unique identifier
criteria       str    Yes        What constitutes a pass (natural language)
judge_model    str    Yes        Model to use as judge
feedback_key   str    Yes        Feedback key to write scores to
The judge returns PASS/FAIL for each completion along with reasoning.
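Conceptually, a binary judge renders its prompt, queries the judge model, and parses a PASS/FAIL verdict from the reply. A minimal sketch of that flow in plain Python (the `call_model` stub stands in for a real LLM call, and the parsing logic is illustrative, not Adaptive's actual implementation):

```python
def parse_verdict(raw: str) -> tuple[bool, str]:
    """Split a judge reply into (passed, reasoning).

    Assumes the judge emits PASS or FAIL on the first line,
    followed by free-text reasoning.
    """
    first, _, rest = raw.strip().partition("\n")
    passed = first.strip().upper().startswith("PASS")
    return passed, rest.strip()

def judge_completion(criteria: str, completion: str, call_model) -> dict:
    """Render a judge prompt, call the model, return a score + reasoning."""
    prompt = (
        f"You are a judge. Evaluate based on: {criteria}.\n"
        "Answer PASS or FAIL on the first line, then explain.\n\n"
        f"Response to evaluate:\n{completion}"
    )
    passed, reasoning = parse_verdict(call_model(prompt))
    return {"score": 1.0 if passed else 0.0, "reasoning": reasoning}

# Stubbed model call for demonstration:
result = judge_completion(
    "The response directly answers the user's question",
    "Paris is the capital of France.",
    call_model=lambda p: "PASS\nThe answer is direct and on-topic.",
)
print(result["score"])  # 1.0
```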

Prompt templates for AI judges

AI judges use Handlebars templates for their prompts. Template variables give you access to the conversation context, the completion, and metadata.
Basic syntax:
{{{completion}}}            — insert without HTML escaping (always use for text)
{{#if metadata.domain}}     — conditional block
...
{{else}}
...
{{/if}}     
{{#each turns}}             — loop over a list (list of turn objects below)
Turn {{@index}}:
{{role}}: {{content}}
{{/each}}
Example template:
System: "You are a judge. Evaluate based on: {{criteria}}.
         Output this JSON schema: {{output_schema}}"

User: "Context:\n{{context_str_without_last_user}}
       Question:\n{{last_user_turn_content}}
       Response:\n{{completion}}"
Variable                          Description
completion                        The assistant's completion being evaluated
last_user_turn_content            Content of the final user turn
context_str                       Full conversation context as a formatted string
context_str_without_last_user     Context excluding the final user turn
turns                             All turns as a list of {role, content} dicts
context_turns                     All turns except the completion
context_turns_without_last_user   Context turns without the last user turn
metadata                          Thread metadata dict
output_schema                     Expected output JSON schema
template_variables                Custom variables passed at judge initialization
Use triple braces ({{{var}}}) for variables that may contain HTML entities or special characters.
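The difference between `{{var}}` and `{{{var}}}` is HTML escaping. A toy Python sketch of that behavior (a flat-variable renderer for illustration only, not a full Handlebars implementation):

```python
import html
import re

def render(template: str, variables: dict) -> str:
    """Toy renderer: {{{var}}} inserts raw text, {{var}} HTML-escapes it."""
    def sub(match: re.Match) -> str:
        raw, name = match.group(1), match.group(2)
        value = str(variables.get(name, ""))
        return value if raw else html.escape(value)
    # Group 1 captures the optional third brace (triple-stash form).
    return re.sub(r"\{\{(\{)?\s*(\w+)\s*\}?\}\}", sub, template)

completion = 'He said "x < y" & left.'
print(render("Raw: {{{completion}}}", {"completion": completion}))
# Raw: He said "x < y" & left.
print(render("Escaped: {{completion}}", {"completion": completion}))
# Escaped: He said &quot;x &lt; y&quot; &amp; left.
```

This is why `{{{completion}}}` is recommended for model text: double braces would turn quotes and angle brackets into HTML entities before the judge ever sees them.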

Grader types

Type                Method                       Use when
AI judge            create.binary_judge()        Criteria can be expressed in natural language
Pre-built           create.prebuilt_judge()      RAG evaluation (faithfulness, relevancy)
External endpoint   create.external_endpoint()   Scoring requires an external system
Custom              create.custom()              Python logic in recipes

Pre-built graders

For RAG applications, use pre-built graders optimized by Adaptive:
adaptive.graders.create.prebuilt_judge(
    key="rag-faithfulness",
    type="FAITHFULNESS",
    judge_model="llama-3.1-8b-instruct",
)
  • Faithfulness: Does the completion adhere to provided context?
  • Context Relevancy: Is the retrieved context relevant to the query?
  • Answer Relevancy: Does the completion answer the question?
Faithfulness breaks the completion into atomic claims and checks each against the context:
score = claims supported by context / total claims
Pass context as document turns in the input messages. Each retrieved chunk should be a separate turn.
Sample:
{"role": "document", "content": "Tim Berners-Lee created the first website."}
{"role": "document", "content": "Tim Berners-Lee invented the world wide web."}
{"role": "user", "content": "Who published the first website?"}
Completion: “Tim Berners-Lee published the first website in August 1990.”
Score: 0.5 (first claim supported, date claim unsupported)
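The arithmetic behind that score is straightforward; in practice the judge model extracts the claims and decides support, so here those verdicts are given as inputs:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of the completion's atomic claims supported by context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Two claims in the sample completion:
#   1. "Tim Berners-Lee published the first website"  -> supported
#   2. "... in August 1990"                           -> not in the context
print(faithfulness_score([True, False]))  # 0.5
```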
Context Relevancy checks if retrieved chunks are relevant to the query:
score = relevant chunks / total chunks

Answer Relevancy checks if the completion addresses the question:
score = relevant claims / total claims
Extra information not requested by the user lowers the score.
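Both relevancy metrics are the same kind of ratio, applied to chunks (context relevancy) or claims (answer relevancy); the individual relevance judgments come from the judge model. A sketch:

```python
def relevancy_score(judgments: list[bool]) -> float:
    """relevant items / total items, over retrieved chunks or claims."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)

# Context relevancy: 2 of 3 retrieved chunks judged relevant to the query.
print(round(relevancy_score([True, True, False]), 3))  # 0.667

# Answer relevancy: adding an unrequested claim lowers the score.
print(relevancy_score([True, True]))                   # 1.0
print(round(relevancy_score([True, True, False]), 3))  # 0.667
```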
For reward servers and custom graders, see Reward Servers and Custom Recipes. See the SDK Reference for all grader methods.