Graders evaluate LLM completions and return quantitative feedback. Use their scores as reward signals during training or to assess model performance during evaluation.

Create an AI judge

AI judges use an LLM to grade completions based on a criterion you define:
adaptive.graders.create.binary_judge(
    key="helpful-judge",
    criteria="The response directly answers the user's question without going off-topic",
    judge_model="llama-3.1-8b-instruct",
    feedback_key="helpfulness",
)
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| key | str | Yes | Unique identifier |
| criteria | str | Yes | What constitutes a pass (natural language) |
| judge_model | str | Yes | Model to use as judge |
| feedback_key | str | Yes | Feedback key to write scores to |
The judge returns PASS/FAIL for each completion along with reasoning.
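For reward-signal use, the binary verdict maps naturally to a scalar. A minimal sketch, assuming you have the verdict as a string (the SDK's actual response shape is documented in the SDK Reference):
def verdict_to_reward(verdict: str) -> float:
    # Map the judge's binary verdict to a scalar reward for training.
    return 1.0 if verdict.strip().upper() == "PASS" else 0.0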

Grader types

| Type | Method | Use when |
| --- | --- | --- |
| AI judge | create.binary_judge() | Criteria can be expressed in natural language |
| Pre-built | create.prebuilt_judge() | RAG evaluation (faithfulness, relevancy) |
| External endpoint | create.external_endpoint() | Scoring requires an external system |
| Custom | create.custom() | Python logic in recipes |
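The core of a custom grader is plain Python scoring logic. A minimal sketch of such logic (illustrative only; keyword_coverage is a hypothetical helper, and create.custom()'s actual signature is in the SDK Reference):
def keyword_coverage(completion: str, required: list[str]) -> float:
    # Score a completion by the fraction of required keywords it mentions.
    text = completion.lower()
    hits = sum(1 for kw in required if kw.lower() in text)
    return hits / len(required) if required else 0.0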

Pre-built graders

For RAG applications, use pre-built graders optimized by Adaptive:
adaptive.graders.create.prebuilt_judge(
    key="rag-faithfulness",
    type="FAITHFULNESS",
    judge_model="llama-3.1-8b-instruct",
)
  • Faithfulness: Does the completion adhere to provided context?
  • Context Relevancy: Is the retrieved context relevant to the query?
  • Answer Relevancy: Does the completion answer the question?
Faithfulness breaks the completion into atomic claims and checks each against the context:
score = claims supported by context / total claims
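A minimal sketch of that arithmetic; the judge model performs the claim extraction and support checks, so they are assumed inputs here:
def faithfulness(claim_supported: list[bool]) -> float:
    # Fraction of atomic claims supported by the retrieved context.
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0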
Pass context as document turns in the input messages. Each retrieved chunk should be a separate turn. Sample:
{"role": "document", "content": "Tim Berners-Lee created the first website."}
{"role": "document", "content": "Tim Berners-Lee invented the world wide web."}
{"role": "user", "content": "Who published the first website?"}
Completion: “Tim Berners-Lee published the first website in August 1990.”
Score: 0.5 (first claim supported, date claim unsupported)
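With the faithfulness sketch above, the two atomic claims in this completion score as:
# "published the first website" is supported; "in August 1990" is not.
faithfulness([True, False])  # 0.5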
Context Relevancy checks if retrieved chunks are relevant to the query:
score = relevant chunks / total chunks

Answer Relevancy checks if the completion addresses the question:
score = relevant claims / total claims
Extra information not requested by the user lowers the score.
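Both relevancy scores are the same kind of ratio as faithfulness; a minimal sketch, with the per-item relevance judgments assumed to come from the judge model:
def relevancy(relevant_flags: list[bool]) -> float:
    # Share of items (retrieved chunks or answer claims) judged relevant.
    return sum(relevant_flags) / len(relevant_flags) if relevant_flags else 0.0

# Context relevancy example: 2 of 3 retrieved chunks relevant -> ~0.67
relevancy([True, True, False])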
For reward servers and custom graders, see Reward Servers and Custom Recipes. See the SDK Reference for all grader methods.