
# Graders

> Score completions for training and evaluation

Graders evaluate LLM completions and produce metric scores. Use them as reward signals during training or as evaluation criteria for runs. Adaptive supports five grader types — pick one based on what your scoring rule looks like.

| Type                  | When to use                                                               | Create with                                              |
| --------------------- | ------------------------------------------------------------------------- | -------------------------------------------------------- |
| **Function**          | Correctness is structural or rule-based — string match, regex, JSON shape | `create.function()` / **New Grader → Function**          |
| **AI judge**          | Pass/fail criteria can be expressed in natural language                   | `create.binary_judge()` / **New Grader → AI Judge**      |
| **Pre-built**         | RAG evaluation: faithfulness, context relevancy, answer relevancy         | `create.prebuilt_judge()` / **New Grader → Pre-built**   |
| **External endpoint** | Scoring requires an external service or model you host                    | `create.external_endpoint()` / **New Grader → External** |
| **Custom**            | Python logic baked into a [custom recipe](/v0.14/harmony/overview)        | `create.custom()`                                        |

**New in v0.14:** function graders are first-class objects. Define a Python `grade` function, validate it in a sandbox, and reuse it across recipes — no need to edit recipe code to swap scoring rules.

<Tabs>
  <Tab title="SDK" icon="code">
    ## Create a function grader

    Function graders run an `async def grade(thread: StringThread) -> float` in a sandboxed Python environment, called on every completion during training or evaluation. Use them when correctness is structural or rule-based — string matching, regex, JSON shape — and an LLM judge would add noise at every step.

    ```python theme={null}
    from adaptive_harmony import StringThread

    async def grade(thread: StringThread) -> float:
        completion = thread.completion() or ""
        expected = thread.metadata.get("expected", "")
        return 1.0 if completion.strip() == expected.strip() else 0.0

    adaptive.graders.create.function(
        key="exact-match",
        fn=grade,
        description="Strict equality between completion and expected answer",
    )
    ```

    Pass the function itself, not its source as a string. The SDK reads `grade`'s source from the file or notebook cell where it's defined, validates it in the sandbox, and stores it under the project. The function gets renamed to `grade` server-side, so the name in your code doesn't matter.

    ### The grade function contract

    | Rule                  | Detail                                                                                |
    | --------------------- | ------------------------------------------------------------------------------------- |
    | Must be `async`       | Sync `def grade(...)` is rejected at validation.                                      |
    | Exactly one parameter | A `StringThread`. The parameter name is convention only.                              |
    | Returns numeric       | `int` or `float`. Other types fail validation.                                        |
    | Source size           | 64 KiB max.                                                                           |
    | Top-level only        | The function must live at module top level — not nested in another function or class. |

    The validator emits warnings (not errors) if the parameter isn't annotated `StringThread` or the return isn't annotated `float`.

    ### Reading the thread

    Inside `grade`, the `StringThread` exposes the conversation and any per-sample metadata you attached to your dataset:

    ```python theme={null}
    async def grade(thread: StringThread) -> float:
        completion = thread.completion()    # last assistant turn, or None
        turns = thread.get_turns()           # list of (role, content) tuples
        history = thread.messages()          # all turns except final assistant
        metadata = thread.metadata           # dataset row metadata dict
        ...
    ```

    `thread.metadata` is the per-sample metadata you set when building the dataset — put rubric fields, expected outputs, or labels there.
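
    For quick local iteration, you can exercise a grader against a small stand-in object before registering it. The sketch below assumes only the accessors listed above. `FakeThread` is a hypothetical test double, not the real `StringThread`, and the `expected` and `tolerance` metadata keys are illustrative names.

    ```python
    import asyncio


    class FakeThread:
        """Hypothetical stand-in for StringThread, for local tests only."""

        def __init__(self, turns, metadata=None):
            self._turns = list(turns)        # list of (role, content) tuples
            self.metadata = metadata or {}   # per-sample metadata dict

        def get_turns(self):
            return list(self._turns)

        def completion(self):
            # Last assistant turn, or None if the thread does not end with one.
            if self._turns and self._turns[-1][0] == "assistant":
                return self._turns[-1][1]
            return None


    async def grade(thread) -> float:
        # 1.0 when the completion parses as a number within `tolerance` of
        # metadata["expected"]; both keys are assumptions made for this sketch.
        completion = thread.completion() or ""
        expected = thread.metadata.get("expected")
        tolerance = thread.metadata.get("tolerance", 1e-6)
        if expected is None:
            return 0.0
        try:
            close = abs(float(completion.strip()) - float(expected)) <= tolerance
        except (TypeError, ValueError):
            return 0.0
        return 1.0 if close else 0.0


    sample = FakeThread(
        [("user", "What is 10 / 4?"), ("assistant", "2.5")],
        metadata={"expected": 2.5},
    )
    score = asyncio.run(grade(sample))
    ```

    Because the grader falls back to `0.0` when metadata is missing or the completion is not numeric, it also returns a number for threads that carry no metadata at all.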

    ### Validation

    `graders.create.function(...)` validates before it persists. If validation fails, the SDK raises `ValueError` with the failed check name and message — nothing is saved.

    <Accordion title="Validation steps">
      The sandbox runs five checks in order and stops at the first failure:

      1. **Syntax** — `compile()` succeeds.
      2. **Structure** — a top-level `async def grade(...)` exists.
      3. **Signature** — exactly one parameter; warnings emitted for missing or unexpected annotations.
      4. **Execution** — the module body executes and `grade` is callable.
      5. **Test run** — `grade` is invoked with a thread and must return a numeric value.

      The SDK supplies a hardcoded mock for the test run: a thread containing `("user", "What is 2+2?")` and `("assistant", "4")`, with `metadata = {}`. The check confirms your function runs and returns a number — not that it produces a sensible score on real data. To test against real samples, use the test payload panel in the UI.
    </Accordion>
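
    The five checks can be approximated locally to catch obvious failures before calling the SDK. This is a rough sketch built on the standard library, not the sandbox's actual validator: `precheck` and `MockThread` are hypothetical names, the returned check names are illustrative, and sources that import `adaptive_harmony` need that package installed to pass the execution step.

    ```python
    import ast
    import asyncio


    class MockThread:
        """Hypothetical stand-in mirroring the mock thread described above."""

        metadata = {}

        def get_turns(self):
            return [("user", "What is 2+2?"), ("assistant", "4")]

        def completion(self):
            return "4"


    def precheck(source: str) -> str:
        """Return "ok" or the name of the first check that fails."""
        # 1. Syntax
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return "syntax"
        # 2. Structure: a top-level `async def grade`
        funcs = [node for node in tree.body
                 if isinstance(node, ast.AsyncFunctionDef) and node.name == "grade"]
        if not funcs:
            return "structure"
        # 3. Signature: exactly one parameter
        args = funcs[-1].args
        count = len(args.posonlyargs) + len(args.args) + len(args.kwonlyargs)
        if count != 1 or args.vararg or args.kwarg:
            return "signature"
        # 4. Execution: the module body runs and defines a callable `grade`
        namespace = {}
        try:
            exec(compile(tree, "<grader>", "exec"), namespace)
        except Exception:
            return "execution"
        if not callable(namespace.get("grade")):
            return "execution"
        # 5. Test run against the mock thread (metadata is empty)
        try:
            result = asyncio.run(namespace["grade"](MockThread()))
        except Exception:
            return "test_run"
        return "ok" if isinstance(result, (int, float)) else "test_run"


    good = "async def grade(thread):\n    return 1.0 if thread.completion() == '4' else 0.0\n"
    sync = "def grade(thread):\n    return 1.0\n"
    strict = "async def grade(thread):\n    return float(thread.metadata['expected'])\n"

    results = {name: precheck(src)
               for name, src in [("good", good), ("sync", sync), ("strict", strict)]}
    ```

    The `strict` case is the one to watch for: the mock thread has empty metadata, so a grader that indexes `thread.metadata["expected"]` directly raises during the test run. Reading metadata with `.get(...)` and a fallback score avoids that.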

    ### Examples

    ```python theme={null}
    # Regex match
    import re

    from adaptive_harmony import StringThread

    async def grade(thread: StringThread) -> float:
        completion = thread.completion() or ""
        pattern = thread.metadata.get("pattern")
        if not pattern:
            return 0.0  # an empty pattern would match every completion
        return 1.0 if re.search(pattern, completion) else 0.0
    ```

    ```python theme={null}
    # Strict JSON shape
    import json

    from adaptive_harmony import StringThread

    async def grade(thread: StringThread) -> float:
        completion = thread.completion() or ""
        try:
            parsed = json.loads(completion)
        except ValueError:
            return 0.0  # not valid JSON
        # Only a JSON object can carry the required keys
        if not isinstance(parsed, dict):
            return 0.0
        return 1.0 if {"name", "email", "age"}.issubset(parsed) else 0.0
    ```

    ```python theme={null}
    # Multi-keyword scoring with partial credit
    from adaptive_harmony import StringThread

    async def grade(thread: StringThread) -> float:
        completion = (thread.completion() or "").lower()
        keywords = thread.metadata.get("keywords", [])
        if not keywords:
            return 0.0
        hits = sum(1 for k in keywords if k.lower() in completion)
        return hits / len(keywords)
    ```

    ### Sandbox

    The grade function runs in an isolated Python sandbox (nsjail). What's available:

    * The Python standard library (`re`, `json`, `math`, `string`, etc.)
    * `StringThread`, importable from `adaptive_harmony`

    What's not:

    * Network calls (`requests`, `httpx`, `urllib.request`)
    * Filesystem writes outside the sandbox
    * Third-party packages (`numpy`, `pydantic`, etc.)

    If your scoring depends on a network call or external library, use an [external endpoint grader](/v0.14/advanced/reward-servers) instead.
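
    A quick static scan can flag third-party imports before you submit a grader. The sketch below is an illustration, not part of the SDK: `non_sandbox_imports` is a hypothetical helper, and it relies on `sys.stdlib_module_names`, which requires Python 3.10 or later.

    ```python
    import ast
    import sys

    # Importable in the sandbox in addition to the standard library (see the list above).
    ALLOWED_EXTRA = {"adaptive_harmony"}


    def non_sandbox_imports(source: str) -> set[str]:
        """Return imported top-level modules that are neither stdlib nor allowed."""
        found = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                found.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                found.add(node.module.split(".")[0])
        return found - set(sys.stdlib_module_names) - ALLOWED_EXTRA


    source = (
        "import re\n"
        "import numpy as np\n"
        "from adaptive_harmony import StringThread\n"
        "from pydantic import BaseModel\n"
    )
    flagged = sorted(non_sandbox_imports(source))
    ```

    The scan only looks at import statements. Standard-library networking modules such as `urllib.request` pass it, even though network calls are unavailable in the sandbox.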

    ### Manage existing graders

    ```python theme={null}
    adaptive.graders.list()                                       # all graders in project
    adaptive.graders.get(grader_key="exact-match")                # one grader
    adaptive.graders.delete(grader_key="exact-match")             # remove
    adaptive.graders.lock(grader_key="exact-match", locked=True)  # prevent edits
    ```

    The SDK has no in-place update for function graders. To change the implementation, delete and recreate the grader, or edit it through the UI.
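
    The delete-and-recreate flow can be wrapped in a small helper. This sketch uses only the calls shown on this page; `replace_function_grader` is a hypothetical name, and `adaptive` is assumed to be an already initialized client.

    ```python
    def replace_function_grader(adaptive, key, fn, description):
        """Delete the grader stored under `key`, then recreate it from `fn`."""
        # Remove the existing grader so the key is free again.
        adaptive.graders.delete(grader_key=key)
        # Recreate it; this validates `fn` before anything is saved.
        return adaptive.graders.create.function(
            key=key,
            fn=fn,
            description=description,
        )
    ```

    The two steps are not atomic: if the new function fails validation, the old grader has already been deleted. Testing the new implementation first, for example in the UI, reduces that risk.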

    ## Create an AI judge

    AI judges use an LLM to grade completions based on a criterion you define:

    ```python theme={null}
    adaptive.graders.create.binary_judge(
        key="helpful-judge",
        criteria="The response directly answers the user's question without going off-topic",
        judge_model="llama-3.1-8b-instruct",
        feedback_key="helpfulness",
    )
    ```

    | Parameter      | Type | Required | Description                                |
    | -------------- | ---- | -------- | ------------------------------------------ |
    | `key`          | str  | Yes      | Unique identifier                          |
    | `criteria`     | str  | Yes      | What constitutes a pass (natural language) |
    | `judge_model`  | str  | Yes      | Model to use as judge                      |
    | `feedback_key` | str  | Yes      | Feedback key to write scores to            |

    The judge returns PASS/FAIL for each completion along with reasoning.

    ## Prompt templates for AI judges

    AI judges use [Handlebars](https://handlebarsjs.com/) templates for their prompts. Template variables give you access to the conversation context, completion, and metadata.

    **Basic syntax:**

    ```handlebars theme={null}
    {{! Triple braces insert a value without HTML escaping; always use them for text }}
    {{{completion}}}

    {{! Conditional block }}
    {{#if metadata.domain}}
    ...
    {{else}}
    ...
    {{/if}}

    {{! Loop over a list, such as the turn objects in turns }}
    {{#each turns}}
    Turn {{@index}}:
    {{role}}: {{{content}}}
    {{/each}}
    ```

    **Example template:**

    ```
    System: "You are a judge. Evaluate based on: {{{criteria}}}.
             Output this JSON schema: {{{output_schema}}}"

    User: "Context:\n{{{context_str_without_last_user}}}
           Question:\n{{{last_user_turn_content}}}
           Response:\n{{{completion}}}"
    ```

    <Accordion title="All template variables">
      | Variable                          | Description                                     |
      | --------------------------------- | ----------------------------------------------- |
      | `completion`                      | The assistant's completion being evaluated      |
      | `last_user_turn_content`          | Content of the final user turn                  |
      | `context_str`                     | Full conversation context as a formatted string |
      | `context_str_without_last_user`   | Context excluding the final user turn           |
      | `turns`                           | All turns as a list of `{role, content}` dicts  |
      | `context_turns`                   | All turns except the completion                 |
      | `context_turns_without_last_user` | Context turns without the last user turn        |
      | `metadata`                        | Thread metadata dict                            |
      | `output_schema`                   | Expected output JSON schema                     |
      | `template_variables`              | Custom variables passed at judge initialization |

      Double braces (`{{var}}`) HTML-escape the inserted value, so use triple braces (`{{{var}}}`) for any variable that holds free text, such as completions and turn content.
    </Accordion>
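
    To see why triple braces matter, consider what HTML escaping does to a completion. The sketch below uses Python's `html.escape` as an approximation of Handlebars escaping (Handlebars escapes a slightly larger character set); it does not call the template engine itself.

    ```python
    import html

    completion = 'Use <b>bold</b> & "quotes"'

    # Roughly what a double-brace {{completion}} would put in the judge prompt.
    escaped = html.escape(completion)

    # A triple-brace {{{completion}}} inserts the text unchanged.
    raw = completion
    ```

    Here `escaped` becomes `Use &lt;b&gt;bold&lt;/b&gt; &amp; &quot;quotes&quot;`, so the judge would grade entity-encoded text rather than what the model actually wrote.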

    ## Pre-built graders

    For RAG applications, use pre-built graders optimized by Adaptive:

    ```python theme={null}
    adaptive.graders.create.prebuilt_judge(
        key="rag-faithfulness",
        type="FAITHFULNESS",
        judge_model="llama-3.1-8b-instruct",
    )
    ```

    * **Faithfulness**: Does the completion adhere to the provided context?
    * **Context Relevancy**: Is the retrieved context relevant to the query?
    * **Answer Relevancy**: Does the completion answer the question?

    <Accordion title="How pre-built graders work">
      **Faithfulness** breaks the completion into atomic claims and checks each against the context:

      ```
      score = claims supported by context / total claims
      ```

      Pass context as `document` turns in the input messages. Each retrieved chunk should be a separate turn.

      **Sample:**

      ```json theme={null}
      {"role": "document", "content": "Tim Berners-Lee created the first website."}
      {"role": "document", "content": "Tim Berners-Lee invented the world wide web."}
      {"role": "user", "content": "Who published the first website?"}
      ```

      Completion: "Tim Berners-Lee published the first website in August 1990."

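      Score: 0.5. The completion makes two claims: that Tim Berners-Lee published the first website (supported by the documents) and that this happened in August 1990 (not supported), so 1 of 2 claims is supported.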

      ***

      **Context Relevancy** checks if retrieved chunks are relevant to the query:

      ```
      score = relevant chunks / total chunks
      ```

      ***

      **Answer Relevancy** checks if the completion addresses the question:

      ```
      score = relevant claims / total claims
      ```

      Extra information not requested by the user lowers the score.
    </Accordion>
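
    All three pre-built scores are ratios of positive verdicts to total items. The sketch below shows only that arithmetic: the verdict lists are written by hand here, whereas in practice the judge model extracts the claims or chunks and decides each verdict. `ratio_score` is a hypothetical helper, and returning `0.0` for an empty list is an assumption.

    ```python
    def ratio_score(verdicts: list[bool]) -> float:
        """Fraction of True verdicts; 0.0 for an empty list (an assumption)."""
        return sum(verdicts) / len(verdicts) if verdicts else 0.0


    # Faithfulness: two claims, one supported by the context.
    faithfulness = ratio_score([True, False])

    # Context relevancy: three retrieved chunks, two relevant (illustrative values).
    context_relevancy = ratio_score([True, True, False])
    ```

    With these inputs `faithfulness` is `0.5` and `context_relevancy` is two thirds.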

    For external endpoint graders (reward servers) and custom graders, see [Reward Servers](/v0.14/advanced/reward-servers) and [Custom Recipes](/v0.14/harmony/overview).

    See [SDK Reference](/v0.14/reference/sdk) for all grader methods.
  </Tab>

  <Tab title="UI" icon="mouse-pointer">
    ## Create a grader

    Navigate to your project and open the **Graders** tab. Click **New Grader** and pick a type.

    ### Function grader

    Pick **Function** to write a Python `grade` function directly in the editor. Use the **Test Payload** panel to run validation against a sample from your dataset before saving — the panel surfaces errors and per-sample scores so you can iterate without leaving the page.

    ### AI judge

    For AI judges, provide a natural language criterion that defines what a passing completion looks like.

    <Frame caption="Define a criterion for AI judge graders">
      <img src="https://mintcdn.com/adaptiveml/PADjThkrE4-_Dc39/static/ai_judge_shot_light.png?fit=max&auto=format&n=PADjThkrE4-_Dc39&q=85&s=541c8551915bb7d822d9221249f3e06d" className="block dark:hidden" width="638" height="382" data-path="static/ai_judge_shot_light.png" />

      <img src="https://mintcdn.com/adaptiveml/PADjThkrE4-_Dc39/static/ai_judge_shot_dark.png?fit=max&auto=format&n=PADjThkrE4-_Dc39&q=85&s=32c69b44a3b76b07f4995863ec8967b2" className="hidden dark:block" width="638" height="382" data-path="static/ai_judge_shot_dark.png" />
    </Frame>

    Add examples to improve judge accuracy. Include both passing and failing examples with justifications.

    ## Use in recipes

    After creating a grader, select it when configuring a training or evaluation recipe. The grader scores each completion: in training the scores act as the reward signal, and in evaluation they are the criteria reported for the run.
  </Tab>
</Tabs>
