Graders evaluate LLM completions and return quantitative feedback. Use their scores as reward signals during training or to assess model performance during evaluation.

Create an AI judge

AI judges use an LLM to grade completions based on a criterion you define:
adaptive.graders.create.binary_judge(
    key="helpful-judge",
    criteria="The response directly answers the user's question without going off-topic",
    judge_model="llama-3.1-8b-instruct",
    feedback_key="helpfulness",
)
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| key | str | Yes | Unique identifier |
| criteria | str | Yes | What constitutes a pass (natural language) |
| judge_model | str | Yes | Model to use as judge |
| feedback_key | str | Yes | Feedback key to write scores to |
The judge returns PASS/FAIL for each completion along with reasoning.
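For reward-signal use, the binary verdict maps naturally to a scalar. A minimal sketch, assuming you have the verdict as a string (the SDK's actual response shape is documented in the SDK Reference):
def verdict_to_reward(verdict: str) -> float:
    # Map the judge's binary verdict to a scalar reward for training.
    return 1.0 if verdict.strip().upper() == "PASS" else 0.0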

Grader types

| Type | Method | Use when |
| --- | --- | --- |
| AI judge | create.binary_judge() | Criteria can be expressed in natural language |
| Pre-built | create.prebuilt_judge() | RAG evaluation (faithfulness, relevancy) |
| External endpoint | create.external_endpoint() | Scoring requires an external system |
| Custom | create.custom() | Python logic in recipes |
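The core of a custom grader is plain Python scoring logic. A minimal sketch of such logic (illustrative only; keyword_coverage is a hypothetical helper, and create.custom()'s actual signature is in the SDK Reference):
def keyword_coverage(completion: str, required: list[str]) -> float:
    # Score a completion by the fraction of required keywords it mentions.
    text = completion.lower()
    hits = sum(1 for kw in required if kw.lower() in text)
    return hits / len(required) if required else 0.0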

Pre-built graders

For RAG applications, use pre-built graders optimized by Adaptive:
adaptive.graders.create.prebuilt_judge(
    key="rag-faithfulness",
    type="FAITHFULNESS",
    judge_model="llama-3.1-8b-instruct",
)
  • Faithfulness: Does the completion adhere to provided context?
  • Context Relevancy: Is the retrieved context relevant to the query?
  • Answer Relevancy: Does the completion answer the question?
Faithfulness breaks the completion into atomic claims and checks each against the context:
score = claims supported by context / total claims
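A minimal sketch of that arithmetic; the judge model performs the claim extraction and support checks, so they are assumed inputs here:
def faithfulness(claim_supported: list[bool]) -> float:
    # Fraction of atomic claims supported by the retrieved context.
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0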
Pass context as document turns in the input messages. Each retrieved chunk should be a separate turn. Sample:
{"role": "document", "content": "Tim Berners-Lee created the first website."}
{"role": "document", "content": "Tim Berners-Lee invented the world wide web."}
{"role": "user", "content": "Who published the first website?"}
Completion: “Tim Berners-Lee published the first website in August 1990.”
Score: 0.5 (first claim supported, date claim unsupported)
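With the faithfulness sketch above, the two atomic claims in this completion score as:
# "published the first website" is supported; "in August 1990" is not.
faithfulness([True, False])  # 0.5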
Context Relevancy checks if retrieved chunks are relevant to the query:
score = relevant chunks / total chunks

Answer Relevancy checks if the completion addresses the question:
score = relevant claims / total claims
Extra information not requested by the user lowers the score.
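Both relevancy scores are the same kind of ratio as faithfulness; a minimal sketch, with the per-item relevance judgments assumed to come from the judge model:
def relevancy(relevant_flags: list[bool]) -> float:
    # Share of items (retrieved chunks or answer claims) judged relevant.
    return sum(relevant_flags) / len(relevant_flags) if relevant_flags else 0.0

# Context relevancy example: 2 of 3 retrieved chunks relevant -> ~0.67
relevancy([True, True, False])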
For reward servers and custom graders, see Reward Servers and Custom Recipes. See the SDK Reference for all grader methods.