Graders evaluate LLM completions and produce metric scores. Use them as reward signals during training or as evaluation criteria for runs. Adaptive supports five grader types — pick one based on what your scoring rule looks like.
| Type | When to use | Create with |
| --- | --- | --- |
| Function | Correctness is structural or rule-based — string match, regex, JSON shape | create.function() / New Grader → Function |
| AI judge | Pass/fail criteria can be expressed in natural language | create.binary_judge() / New Grader → AI Judge |
| Pre-built | RAG evaluation: faithfulness, context relevancy, answer relevancy | create.prebuilt_judge() / New Grader → Pre-built |
| External endpoint | Scoring requires an external service or model you host | create.external_endpoint() / New Grader → External |
| Custom | Python logic baked into a custom recipe | create.custom() |
New in v0.14: function graders are first-class objects. Define a Python grade function, validate it in a sandbox, and reuse it across recipes — no need to edit recipe code to swap scoring rules.

Create a function grader

Function graders run an async def grade(thread: StringThread) -> float in a sandboxed Python environment, called on every completion during training or evaluation. Use them when correctness is structural or rule-based — string matching, regex, JSON shape — and an LLM judge would add noise at every step.
from adaptive_harmony import StringThread

async def grade(thread: StringThread) -> float:
    completion = thread.completion() or ""
    expected = thread.metadata.get("expected", "")
    return 1.0 if completion.strip() == expected.strip() else 0.0

adaptive.graders.create.function(
    key="exact-match",
    fn=grade,
    description="Strict equality between completion and expected answer",
)
Pass the function itself, not its source as a string. The SDK reads grade’s source from the file or notebook cell where it’s defined, validates it in the sandbox, and stores it under the project. The function gets renamed to grade server-side, so the name in your code doesn’t matter.

The grade function contract

| Rule | Detail |
| --- | --- |
| Must be async | Sync def grade(...) is rejected at validation. |
| Exactly one parameter | A StringThread. The parameter name is convention only. |
| Returns numeric | int or float. Other types fail validation. |
| Source size | 64 KiB max. |
| Top-level only | The function must live at module top level — not nested in another function or class. |
The validator emits warnings (not errors) if the parameter isn’t annotated StringThread or the return isn’t annotated float.
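For a concrete sense of what gets rejected, a definition like the sketch below fails the contract (it is synchronous and returns a string), whereas the exact-match example above satisfies it:
# Rejected at validation: synchronous, and returns a string instead of int or float.
def grade(thread):
    completion = thread.completion() or ""
    return "pass" if completion else "fail"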

Reading the thread

Inside grade, the StringThread exposes the conversation and any per-sample metadata you attached to your dataset:
async def grade(thread: StringThread) -> float:
    completion = thread.completion()    # last assistant turn, or None
    turns = thread.get_turns()           # list of (role, content) tuples
    history = thread.messages()          # all turns except final assistant
    metadata = thread.metadata           # dataset row metadata dict
    ...
thread.metadata is the per-sample metadata you set when building the dataset — put rubric fields, expected outputs, or labels there.
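For example, a rubric-driven grader can read required and forbidden phrases from that metadata. This is a sketch; the required_phrases and forbidden_phrases keys are hypothetical and should match whatever fields you attached to your dataset rows:
from adaptive_harmony import StringThread

async def grade(thread: StringThread) -> float:
    completion = (thread.completion() or "").lower()
    # Hypothetical per-sample rubric fields set when the dataset was built.
    required = thread.metadata.get("required_phrases", [])
    forbidden = thread.metadata.get("forbidden_phrases", [])
    if any(p.lower() in completion for p in forbidden):
        return 0.0
    if not required:
        return 1.0
    hits = sum(1 for p in required if p.lower() in completion)
    return hits / len(required)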

Validation

graders.create.function(...) validates before it persists. If validation fails, the SDK raises ValueError with the failed check name and message — nothing is saved.
The sandbox runs five checks in order and stops at the first failure:
  1. Syntax — compile() succeeds.
  2. Structure — a top-level async def grade(...) exists.
  3. Signature — exactly one parameter; warnings emitted for missing or unexpected annotations.
  4. Execution — the module body executes and grade is callable.
  5. Test run — grade is invoked with a thread and must return a numeric value.
The SDK supplies a hardcoded mock for the test run: a thread containing ("user", "What is 2+2?") and ("assistant", "4"), with metadata = {}. The check confirms your function runs and returns a number — not that it produces a sensible score on real data. To test against real samples, use the test payload panel in the UI.
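Since nothing is persisted on a failed check, it can be convenient to catch the error while iterating on a grader. A minimal sketch; the exact message wording is not guaranteed:
try:
    adaptive.graders.create.function(
        key="exact-match",
        fn=grade,
        description="Strict equality between completion and expected answer",
    )
except ValueError as err:
    # The error names the failed check; the grader was not saved.
    print(f"Grader validation failed: {err}")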

Examples

# Regex match
import re

async def grade(thread: StringThread) -> float:
    completion = thread.completion() or ""
    pattern = thread.metadata.get("pattern", "")
    return 1.0 if re.search(pattern, completion) else 0.0
# Strict JSON shape
import json

async def grade(thread: StringThread) -> float:
    completion = thread.completion() or ""
    try:
        parsed = json.loads(completion)
        return 1.0 if {"name", "email", "age"}.issubset(parsed.keys()) else 0.0
    except Exception:
        return 0.0
# Multi-keyword scoring with partial credit
async def grade(thread: StringThread) -> float:
    completion = (thread.completion() or "").lower()
    keywords = thread.metadata.get("keywords", [])
    if not keywords:
        return 0.0
    hits = sum(1 for k in keywords if k.lower() in completion)
    return hits / len(keywords)
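One more pattern that fits the sandbox well is numeric answer checking with a tolerance, using only the standard library. This is a sketch; the expected_number and tolerance metadata keys are hypothetical and should match whatever your dataset rows carry:
# Numeric answer with tolerance
import re

from adaptive_harmony import StringThread

async def grade(thread: StringThread) -> float:
    completion = thread.completion() or ""
    expected = thread.metadata.get("expected_number")   # hypothetical metadata key
    tolerance = thread.metadata.get("tolerance", 1e-6)  # hypothetical metadata key
    # Take the first number that appears in the completion.
    match = re.search(r"-?\d+(?:\.\d+)?", completion)
    if expected is None or match is None:
        return 0.0
    return 1.0 if abs(float(match.group()) - float(expected)) <= tolerance else 0.0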

Sandbox

The grade function runs in an isolated Python sandbox (nsjail). What’s available:
  • The Python standard library (re, json, math, string, etc.)
  • StringThread, importable from adaptive_harmony
What’s not:
  • Network calls (requests, httpx, urllib.request)
  • Filesystem writes outside the sandbox
  • Third-party packages (numpy, pydantic, etc.)
If your scoring depends on a network call or external library, use an external endpoint grader instead.
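If you were about to reach for a third-party fuzzy-matching package, the standard library often covers the need inside the sandbox. A sketch using difflib for similarity scoring (the expected metadata key is an assumption about your dataset):
import difflib

from adaptive_harmony import StringThread

async def grade(thread: StringThread) -> float:
    completion = (thread.completion() or "").strip().lower()
    expected = str(thread.metadata.get("expected", "")).strip().lower()  # hypothetical metadata key
    if not expected:
        return 0.0
    # SequenceMatcher ratio is in [0, 1]; 1.0 means identical strings.
    return difflib.SequenceMatcher(None, completion, expected).ratio()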

Manage existing graders

adaptive.graders.list()                                       # all graders in project
adaptive.graders.get(grader_key="exact-match")                # one grader
adaptive.graders.delete(grader_key="exact-match")             # remove
adaptive.graders.lock(grader_key="exact-match", locked=True)  # prevent edits
The SDK has no in-place update for function graders. To change the implementation, delete and recreate the grader, or edit it through the UI.
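A sketch of that delete-and-recreate flow using only the calls shown above, assuming grade now holds the updated implementation:
adaptive.graders.delete(grader_key="exact-match")
adaptive.graders.create.function(
    key="exact-match",
    fn=grade,
    description="Strict equality between completion and expected answer",
)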

Create an AI judge

AI judges use an LLM to grade completions based on a criterion you define:
adaptive.graders.create.binary_judge(
    key="helpful-judge",
    criteria="The response directly answers the user's question without going off-topic",
    judge_model="llama-3.1-8b-instruct",
    feedback_key="helpfulness",
)
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| key | str | Yes | Unique identifier |
| criteria | str | Yes | What constitutes a pass (natural language) |
| judge_model | str | Yes | Model to use as judge |
| feedback_key | str | Yes | Feedback key to write scores to |
The judge returns PASS/FAIL for each completion along with reasoning.

Prompt templates for AI judges

AI judges use Handlebars templates for their prompts. Template variables give you access to the conversation context, completion, and metadata.

Basic syntax:
{{{completion}}}            — insert without HTML escaping (always use for text)
{{#if metadata.domain}}     — conditional block
...
{{else}}
...
{{/if}}     
{{#each turns}}             — loop over a list (list of turn objects below)
Turn {{@index}}:
{{role}}: {{content}}
{{/each}}
Example template:
System: "You are a judge. Evaluate based on: {{criteria}}.
         Output this JSON schema: {{output_schema}}"

User: "Context:\n{{context_str_without_last_user}}
       Question:\n{{last_user_turn_content}}
       Response:\n{{completion}}"
| Variable | Description |
| --- | --- |
| completion | The assistant’s completion being evaluated |
| last_user_turn_content | Content of the final user turn |
| context_str | Full conversation context as a formatted string |
| context_str_without_last_user | Context excluding the final user turn |
| turns | All turns as a list of {role, content} dicts |
| context_turns | All turns except the completion |
| context_turns_without_last_user | Context turns without the last user turn |
| metadata | Thread metadata dict |
| output_schema | Expected output JSON schema |
| template_variables | Custom variables passed at judge initialization |
Use triple braces ({{{var}}}) for variables that may contain HTML entities or special characters.
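For instance, a template can branch on thread metadata. The sketch below combines the conditional block and triple-brace syntax shown above, using variables from the existing example and the table; how the template is attached to a judge depends on your judge configuration:
System: "You are a judge. Evaluate based on: {{criteria}}.
         {{#if metadata.domain}}The conversation is about {{metadata.domain}}.{{/if}}
         Output this JSON schema: {{output_schema}}"

User: "Question:\n{{last_user_turn_content}}
       Response:\n{{{completion}}}"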

Pre-built graders

For RAG applications, use pre-built graders optimized by Adaptive:
adaptive.graders.create.prebuilt_judge(
    key="rag-faithfulness",
    type="FAITHFULNESS",
    judge_model="llama-3.1-8b-instruct",
)
  • Faithfulness: Does the completion adhere to provided context?
  • Context Relevancy: Is the retrieved context relevant to the query?
  • Answer Relevancy: Does the completion answer the question?
Faithfulness breaks the completion into atomic claims and checks each against the context:
score = claims supported by context / total claims
Pass context as document turns in the input messages. Each retrieved chunk should be a separate turn. Sample:
{"role": "document", "content": "Tim Berners-Lee created the first website."}
{"role": "document", "content": "Tim Berners-Lee invented the world wide web."}
{"role": "user", "content": "Who published the first website?"}
Completion: “Tim Berners-Lee published the first website in August 1990.”
Score: 0.5 (first claim supported, date claim unsupported)
Context Relevancy checks if retrieved chunks are relevant to the query:
score = relevant chunks / total chunks

Answer Relevancy checks if the completion addresses the question:
score = relevant claims / total claims
Extra information not requested by the user lowers the score.
For reward servers and custom graders, see Reward Servers and Custom Recipes. See SDK Reference for all grader methods.