Graders evaluate LLM completions and produce metric scores. Use them as reward signals during training or as evaluation criteria for runs. Adaptive supports five grader types — pick one based on what your scoring rule looks like.
| Type | When to use | Create with |
| --- | --- | --- |
| Function | Correctness is structural or rule-based — string match, regex, JSON shape | create.function() / New Grader → Function |
| AI judge | Pass/fail criteria can be expressed in natural language | create.binary_judge() / New Grader → AI Judge |
| Pre-built | RAG evaluation: faithfulness, context relevancy, answer relevancy | create.prebuilt_judge() / New Grader → Pre-built |
| External endpoint | Scoring requires an external service or model you host | create.external_endpoint() / New Grader → External |
| Custom | Python logic baked into a custom recipe | create.custom() |
New in v0.14: function graders are first-class objects. Define a Python grade function, validate it in a sandbox, and reuse it across recipes — no need to edit recipe code to swap scoring rules.

Create a function grader

Function graders run an async def grade(thread: StringThread) -> float in a sandboxed Python environment, called on every completion during training or evaluation. Use them when correctness is structural or rule-based — string matching, regex, JSON shape — and an LLM judge would add noise at every step.
from adaptive_harmony import StringThread

async def grade(thread: StringThread) -> float:
    completion = thread.completion() or ""
    expected = thread.metadata.get("expected", "")
    return 1.0 if completion.strip() == expected.strip() else 0.0

adaptive.graders.create.function(
    key="exact-match",
    fn=grade,
    description="Strict equality between completion and expected answer",
)
Pass the function itself, not its source as a string. The SDK reads grade’s source from the file or notebook cell where it’s defined, validates it in the sandbox, and stores it under the project. The function gets renamed to grade server-side, so the name in your code doesn’t matter.

The grade function contract

| Rule | Detail |
| --- | --- |
| Must be async | Sync def grade(...) is rejected at validation. |
| Exactly one parameter | A StringThread. The parameter name is convention only. |
| Returns numeric | int or float. Other types fail validation. |
| Source size | 64 KiB max. |
| Top-level only | The function must live at module top level — not nested in another function or class. |
The validator emits warnings (not errors) if the parameter isn’t annotated StringThread or the return isn’t annotated float.
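For a concrete sense of what gets rejected, a definition like the sketch below fails the contract (it is synchronous and returns a string), whereas the exact-match example above satisfies it:
# Rejected at validation: synchronous, and returns a string instead of int or float.
def grade(thread):
    completion = thread.completion() or ""
    return "pass" if completion else "fail"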

Reading the thread

Inside grade, the StringThread exposes the conversation and any per-sample metadata you attached to your dataset:
async def grade(thread: StringThread) -> float:
    completion = thread.completion()    # last assistant turn, or None
    turns = thread.get_turns()           # list of (role, content) tuples
    history = thread.messages()          # all turns except final assistant
    metadata = thread.metadata           # dataset row metadata dict
    ...
thread.metadata is the per-sample metadata you set when building the dataset — put rubric fields, expected outputs, or labels there.
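For example, a rubric-driven grader can read required and forbidden phrases from that metadata. This is a sketch; the required_phrases and forbidden_phrases keys are hypothetical and should match whatever fields you attached to your dataset rows:
from adaptive_harmony import StringThread

async def grade(thread: StringThread) -> float:
    completion = (thread.completion() or "").lower()
    # Hypothetical per-sample rubric fields set when the dataset was built.
    required = thread.metadata.get("required_phrases", [])
    forbidden = thread.metadata.get("forbidden_phrases", [])
    if any(p.lower() in completion for p in forbidden):
        return 0.0
    if not required:
        return 1.0
    hits = sum(1 for p in required if p.lower() in completion)
    return hits / len(required)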

Validation

graders.create.function(...) validates before it persists. If validation fails, the SDK raises ValueError with the failed check name and message — nothing is saved.
The sandbox runs five checks in order and stops at the first failure:
  1. Syntax — compile() succeeds.
  2. Structure — a top-level async def grade(...) exists.
  3. Signature — exactly one parameter; warnings emitted for missing or unexpected annotations.
  4. Execution — the module body executes and grade is callable.
  5. Test run — grade is invoked with a thread and must return a numeric value.
The SDK supplies a hardcoded mock for the test run: a thread containing ("user", "What is 2+2?") and ("assistant", "4"), with metadata = {}. The check confirms your function runs and returns a number — not that it produces a sensible score on real data. To test against real samples, use the test payload panel in the UI.
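Since nothing is persisted on a failed check, it can be convenient to catch the error while iterating on a grader. A minimal sketch; the exact message wording is not guaranteed:
try:
    adaptive.graders.create.function(
        key="exact-match",
        fn=grade,
        description="Strict equality between completion and expected answer",
    )
except ValueError as err:
    # The error names the failed check; the grader was not saved.
    print(f"Grader validation failed: {err}")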

Examples

# Regex match
import re

async def grade(thread: StringThread) -> float:
    completion = thread.completion() or ""
    pattern = thread.metadata.get("pattern", "")
    return 1.0 if re.search(pattern, completion) else 0.0
# Strict JSON shape
import json

async def grade(thread: StringThread) -> float:
    completion = thread.completion() or ""
    try:
        parsed = json.loads(completion)
        return 1.0 if {"name", "email", "age"}.issubset(parsed.keys()) else 0.0
    except Exception:
        return 0.0
# Multi-keyword scoring with partial credit
async def grade(thread: StringThread) -> float:
    completion = (thread.completion() or "").lower()
    keywords = thread.metadata.get("keywords", [])
    if not keywords:
        return 0.0
    hits = sum(1 for k in keywords if k.lower() in completion)
    return hits / len(keywords)
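One more pattern that fits the sandbox well is numeric answer checking with a tolerance, using only the standard library. This is a sketch; the expected_number and tolerance metadata keys are hypothetical and should match whatever your dataset rows carry:
# Numeric answer with tolerance
import re

from adaptive_harmony import StringThread

async def grade(thread: StringThread) -> float:
    completion = thread.completion() or ""
    expected = thread.metadata.get("expected_number")   # hypothetical metadata key
    tolerance = thread.metadata.get("tolerance", 1e-6)  # hypothetical metadata key
    # Take the first number that appears in the completion.
    match = re.search(r"-?\d+(?:\.\d+)?", completion)
    if expected is None or match is None:
        return 0.0
    return 1.0 if abs(float(match.group()) - float(expected)) <= tolerance else 0.0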

Sandbox

The grade function runs in an isolated Python sandbox (nsjail). What’s available:
  • The Python standard library (re, json, math, string, etc.)
  • StringThread, importable from adaptive_harmony
What’s not:
  • Network calls (requests, httpx, urllib.request)
  • Filesystem writes outside the sandbox
  • Third-party packages (numpy, pydantic, etc.)
If your scoring depends on a network call or external library, use an external endpoint grader instead.
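If you were about to reach for a third-party fuzzy-matching package, the standard library often covers the need inside the sandbox. A sketch using difflib for similarity scoring (the expected metadata key is an assumption about your dataset):
import difflib

from adaptive_harmony import StringThread

async def grade(thread: StringThread) -> float:
    completion = (thread.completion() or "").strip().lower()
    expected = str(thread.metadata.get("expected", "")).strip().lower()  # hypothetical metadata key
    if not expected:
        return 0.0
    # SequenceMatcher ratio is in [0, 1]; 1.0 means identical strings.
    return difflib.SequenceMatcher(None, completion, expected).ratio()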

Manage existing graders

adaptive.graders.list()                                       # all graders in project
adaptive.graders.get(grader_key="exact-match")                # one grader
adaptive.graders.delete(grader_key="exact-match")             # remove
adaptive.graders.lock(grader_key="exact-match", locked=True)  # prevent edits
The SDK has no in-place update for function graders. To change the implementation, delete and recreate the grader, or edit it through the UI.
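A sketch of that delete-and-recreate flow using only the calls shown above, assuming grade now holds the updated implementation:
adaptive.graders.delete(grader_key="exact-match")
adaptive.graders.create.function(
    key="exact-match",
    fn=grade,
    description="Strict equality between completion and expected answer",
)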

Create an AI judge

AI judges use an LLM to grade completions based on a criterion you define:
adaptive.graders.create.binary_judge(
    key="helpful-judge",
    criteria="The response directly answers the user's question without going off-topic",
    judge_model="llama-3.1-8b-instruct",
    feedback_key="helpfulness",
)
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| key | str | Yes | Unique identifier |
| criteria | str | Yes | What constitutes a pass (natural language) |
| judge_model | str | Yes | Model to use as judge |
| feedback_key | str | Yes | Feedback key to write scores to |
The judge returns PASS/FAIL for each completion along with reasoning.

Prompt templates for AI judges

AI judges use Handlebars templates for their prompts. Template variables give you access to the conversation context, completion, and metadata.

Basic syntax:
{{{completion}}}            — insert without HTML escaping (always use for text)
{{#if metadata.domain}}     — conditional block
...
{{else}}
...
{{/if}}     
{{#each turns}}             — loop over a list (list of turn objects below)
Turn {{@index}}:
{{role}}: {{content}}
{{/each}}
Example template:
System: "You are a judge. Evaluate based on: {{criteria}}.
         Output this JSON schema: {{output_schema}}"

User: "Context:\n{{context_str_without_last_user}}
       Question:\n{{last_user_turn_content}}
       Response:\n{{completion}}"
| Variable | Description |
| --- | --- |
| completion | The assistant’s completion being evaluated |
| last_user_turn_content | Content of the final user turn |
| context_str | Full conversation context as a formatted string |
| context_str_without_last_user | Context excluding the final user turn |
| turns | All turns as a list of {role, content} dicts |
| context_turns | All turns except the completion |
| context_turns_without_last_user | Context turns without the last user turn |
| metadata | Thread metadata dict |
| output_schema | Expected output JSON schema |
| template_variables | Custom variables passed at judge initialization |
Use triple braces ({{{var}}}) for variables that may contain HTML entities or special characters.
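For instance, a template can branch on thread metadata. The sketch below combines the conditional block and triple-brace syntax shown above, using variables from the existing example and the table; how the template is attached to a judge depends on your judge configuration:
System: "You are a judge. Evaluate based on: {{criteria}}.
         {{#if metadata.domain}}The conversation is about {{metadata.domain}}.{{/if}}
         Output this JSON schema: {{output_schema}}"

User: "Question:\n{{last_user_turn_content}}
       Response:\n{{{completion}}}"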

Pre-built graders

For RAG applications, use pre-built graders optimized by Adaptive:
adaptive.graders.create.prebuilt_judge(
    key="rag-faithfulness",
    type="FAITHFULNESS",
    judge_model="llama-3.1-8b-instruct",
)
  • Faithfulness: Does the completion adhere to provided context?
  • Context Relevancy: Is the retrieved context relevant to the query?
  • Answer Relevancy: Does the completion answer the question?
Faithfulness breaks the completion into atomic claims and checks each against the context:
score = claims supported by context / total claims
Pass context as document turns in the input messages. Each retrieved chunk should be a separate turn. Sample:
{"role": "document", "content": "Tim Berners-Lee created the first website."}
{"role": "document", "content": "Tim Berners-Lee invented the world wide web."}
{"role": "user", "content": "Who published the first website?"}
Completion: “Tim Berners-Lee published the first website in August 1990.”
Score: 0.5 (first claim supported, date claim unsupported)
Context Relevancy checks if retrieved chunks are relevant to the query:
score = relevant chunks / total chunks

Answer Relevancy checks if the completion addresses the question:
score = relevant claims / total claims
Extra information not requested by the user lowers the score.
For reward servers and custom graders, see Reward Servers and Custom Recipes. See SDK Reference for all grader methods.