# Custom recipe graders

> How to write custom recipe graders

Graders evaluate generated completions against any custom heuristic or criterion. Graders in `adaptive_harmony` are designed to be reusable for both evaluation and training: during training they provide a reward, and during evaluation they provide a numerical score with an optional reason.

## Define your own recipe grader

To create a custom recipe grader, inherit from the `BaseGrader` class and implement the `grade` method. `__init__` is not mandatory; implement it only if your grader instances need custom state.

The following is a simple example of a grader that uses an LLM as a judge:

```python theme={null}
from adaptive_harmony import StringThread, HarmonyClient, InferenceModel
from adaptive_harmony.graders import BaseGrader, Grade

class MyCustomGrader(BaseGrader):
    def __init__(self, grader_key: str, model: InferenceModel):
        """
        Initialize the grader with parameters like model keys, API endpoints, or other configuration.
        """
        super().__init__(grader_key)
        self.grader_key = grader_key
        self.model = model
    
    async def grade(self, sample: StringThread) -> Grade:
        """
        The main function that returns the grade. 
        Must return a `Grade` object.
        """
        completion = sample.last_content()
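        # You could judge the sample with the spawned LLM here, for example
        # (`format_completion_into_judge_prompt` is a placeholder for your own helper):
        # judge_thread: StringThread = format_completion_into_judge_prompt(completion)
        # judge_output = await self.model.generate(judge_thread)
        score = 1.0
        reason = "Custom scoring logic, which could come from an LLM"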
        return Grade(value=score, grader_key=self.grader_key, reasoning=reason)
```

<Info>
  A string `grader_key` is mandatory for any recipe grader you build. It is the logical name displayed in the UI when reporting evaluation scores for a grader that lives in your recipe; [Create an evaluation recipe](/v0.14/harmony/eval) explains how it is used.
</Info>

## Quick implementation from a simple function

Often, you might want to implement a grader as a simple Python function. One example is checking a substring of the generated completion against a ground truth label, which is the case if you asked your model to predict a category in a structured way, for example `<category_A>predicted_category</category_A>`.

Such a function requires no state. To implement this kind of grader without the boilerplate, you can use the `BaseGrader.from_function()` method. The only requirement is that the function is asynchronous.

Below is an example where:

1. a completion (which we expect to be a valid JSON object) is validated for adherence against a desired Pydantic Model, which includes reasoning and a category label prediction
2. the predicted category label is compared against the ground truth, which is attached in the sample's metadata
3. a reward is then returned as follows: `-1.0` if the model did not respect the output format, `0.0` if the format was respected but the predicted category was wrong, and `1.0` if both the format and the predicted label were correct

```python theme={null}
from typing import Literal
from pydantic import BaseModel

from adaptive_harmony import StringThread
from adaptive_harmony.graders import BaseGrader
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException

class OutputExpectation(BaseModel):
    reasoning: str
    category: Literal["A", "B", "C", "D", "E"]

async def grade_fn(sample: StringThread) -> float:
    try:
        structured_output = pydantic_parse(sample.last_content(), OutputExpectation)
    except OutputParserException:
        return -1.0
    
    ground_truth = sample.metadata.get("category")

    if structured_output.category == ground_truth:
        return 1.0
    else:
        return 0.0

# Creates a grader with default logging
grader = BaseGrader.from_function("label-grader", grade_fn)
```
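The tag-based check mentioned at the start of this section could be sketched as follows. This is a minimal, library-free illustration: the function name `grade_tagged_category` and the regular expression are assumptions, and it takes plain strings so it can run on its own. In a real grader you would read the completion from `sample.last_content()` and the label from `sample.metadata`, as in the example above.

```python
import asyncio
import re

# Assumed tag format, taken from the example in the text: <category_A>label</category_A>
CATEGORY_PATTERN = re.compile(r"<category_A>(.*?)</category_A>", re.DOTALL)


async def grade_tagged_category(completion: str, ground_truth: str) -> float:
    """Return -1.0 if the tag is missing, 0.0 for a wrong label, 1.0 for a match."""
    match = CATEGORY_PATTERN.search(completion)
    if match is None:
        # The model did not respect the output format
        return -1.0
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth else 0.0


# The function is async, so it is driven with asyncio when called outside a recipe
result = asyncio.run(grade_tagged_category("<category_A>B</category_A>", "B"))
print(result)  # 1.0
```

The reward scheme mirrors the one used in the Pydantic example: a format violation is penalized more heavily than a wrong label.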

## <span id="grade-object">Grade object</span>

The `Grade` object contains:

* **`value`**: The numerical score (float)
* **`grader_key`**: Identifier for proper logging in the app for evaluations
* **`reasoning`**: Optional field to provide a reasoning to back the score, which will be displayed as interaction metadata in the app if you save this grade in an `EvaluationArtifact`

```python theme={null}
from adaptive_harmony import Grade

grade = Grade(
    value=0.95,
    grader_key="my-custom-grader",
    reasoning="Response demonstrates excellent factual accuracy and relevance"
)
```

## Combine grades

All [trainer classes](/v0.14/harmony/training#trainers) take a single grader as the reward function for training, because the underlying RL algorithms require a single scalar to optimize towards. To optimize on multiple graders in a single training run, you must combine their rewards into a single float reward. The `CombinedGrader` does this for you, and optionally lets you weight the contribution of each grader.

```python Use the CombinedGrader theme={null}
from adaptive_harmony.graders.combined_grader import CombinedGrader

combined_grader = CombinedGrader(
    grader_key="combined",
    graders=[grader_1, grader_2, grader_3],
    weights=[1.0, 2.0, 1.0]
)

# Use the combined grader
grade = await combined_grader.grade(thread)
print(grade)  # mean grade from all graders
```
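To build intuition for how weights affect the combined reward, the sketch below computes a normalized weighted mean in plain Python. The helper `weighted_mean` is hypothetical, and the assumption that weights are normalized by their sum is ours; check the `CombinedGrader` implementation for its exact behavior.

```python
def weighted_mean(values: list[float], weights: list[float]) -> float:
    """Combine per-grader rewards into one scalar (assumed: weights normalized by their sum)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


rewards = [1.0, 0.5, 0.0]   # one reward per grader
weights = [1.0, 2.0, 1.0]   # the second grader counts twice as much
combined = weighted_mean(rewards, weights)
print(combined)  # (1.0*1 + 0.5*2 + 0.0*1) / 4 = 0.5
```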

## Logging and aggregation

Graders provide built-in logging to track metrics across all graded samples. This is useful for monitoring average rewards during RL training or analyzing grader behavior during evaluation.

### How logging works

**Step 1: Log data in `grade()`**

Use `self.add_log(dict)` to record information for each sample:

```python theme={null}
async def grade(self, sample: StringThread) -> Grade:
    prediction = parse_output(sample.last_content())
    score = compute_score(prediction)

    # Log data for this sample
    self.add_log({"score": score, "category": prediction.category})

    return Grade(value=score, grader_key=self.grader_key)
```

**Step 2: Aggregate logs with `get_logs()`**

Call `get_logs()` to get aggregated statistics. The default implementation computes the mean, std, min, max, and count of the values logged under the `"score"` key:

```python theme={null}
# After grading multiple samples
logs = grader.get_logs(clear=True)  # clear=True resets log buffer
print(logs)
# {
#   "score/mean": 0.85,
#   "score/std": 0.12,
#   "score/min": 0.5,
#   "score/max": 1.0,
#   "score/count": 100
# }
```
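The aggregation described above can be approximated with the standard library. This is a sketch of the idea, not the library's implementation: the function name `aggregate_scores` is hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def aggregate_scores(logs: list[dict]) -> dict[str, float]:
    """Summarize the values logged under "score"; entries without a score are skipped."""
    scores = [float(log["score"]) for log in logs if "score" in log]
    if not scores:
        return {}
    return {
        "score/mean": statistics.fmean(scores),
        "score/std": statistics.pstdev(scores),  # population std: an assumption
        "score/min": min(scores),
        "score/max": max(scores),
        "score/count": len(scores),
    }


logs = [
    {"score": 1.0},
    {"score": 0.0},
    {"score": 1.0, "category": "A"},  # extra keys do not affect score statistics
    {"error": "parse failure"},       # no "score" key, so it is ignored here
]
stats = aggregate_scores(logs)
print(stats["score/count"])  # 3
```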

**Understanding `clear=True`:**

When `clear=True`, the grader's internal log buffer (`self._logs`) is reset after computing aggregations. This ensures the next `get_logs()` call only aggregates logs added **since the last call**.

```python theme={null}
# Training loop example
for epoch in range(3):
    for batch in batches:
        for sample in batch:
            grade = await grader.grade(sample)  # Logs accumulate

    # Get stats for this epoch only
    epoch_logs = grader.get_logs(clear=True)  # Aggregates + clears buffer
    logger.log_metrics(epoch_logs, step=epoch)

    # Next epoch starts with empty buffer
```

Without `clear=True`, logs accumulate indefinitely, causing `get_logs()` to aggregate over **all samples ever graded**. Use `clear=True` when you want epoch-level or batch-level statistics.
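The difference between the two modes is easiest to see on a toy buffer. `ToyLogBuffer` below is a hypothetical stand-in written for illustration; it is not the library's class and only mimics the `add_log` / `get_logs(clear=...)` behavior described above.

```python
class ToyLogBuffer:
    """Hypothetical stand-in for a grader's log buffer (illustration only)."""

    def __init__(self) -> None:
        self._logs: list[dict] = []

    def add_log(self, log: dict) -> None:
        self._logs.append(log)

    def get_logs(self, clear: bool = False) -> dict[str, float]:
        scores = [log["score"] for log in self._logs if "score" in log]
        stats: dict[str, float] = {"score/count": len(scores)}
        if scores:
            stats["score/mean"] = sum(scores) / len(scores)
        if clear:
            # Reset the buffer so the next call only sees newer logs
            self._logs.clear()
        return stats


buffer = ToyLogBuffer()
buffer.add_log({"score": 1.0})
buffer.add_log({"score": 0.0})
first = buffer.get_logs(clear=True)    # aggregates 2 samples, then empties the buffer

buffer.add_log({"score": 1.0})
second = buffer.get_logs(clear=False)  # only the 1 sample added since the clear

buffer.add_log({"score": 0.0})
third = buffer.get_logs(clear=False)   # buffer was not cleared, so this covers 2 samples

print(first["score/count"], second["score/count"], third["score/count"])  # 2 1 2
```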

### Custom logging with tables

Override `get_logs()` to add custom metrics or create tables for detailed inspection. Tables are rendered as interactive HTML in W\&B/MLflow.

**Example: Log malformed outputs in a table**

```python theme={null}
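from adaptive_harmony import StringThread
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException
from adaptive_harmony.graders import BaseGrader, Grade
from adaptive_harmony.logging_table import Table

# MySchema is a placeholder for your own Pydantic model with a `category` field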

class MyGrader(BaseGrader):
    async def grade(self, sample: StringThread) -> Grade:
        try:
            prediction = pydantic_parse(sample.last_content(), MySchema)
            score = 1.0 if prediction.category == sample.metadata["category"] else 0.0

            self.add_log({
                "score": score,
                "valid_format": True,
                "completion": sample.last_content(),
                "predicted": prediction.category,
                "ground_truth": sample.metadata["category"]
            })
        except OutputParserException:
            score = -1.0
            self.add_log({
                "score": score,
                "valid_format": False,
                "completion": sample.last_content(),
                "predicted": None,
                "ground_truth": sample.metadata["category"]
            })

        return Grade(value=score, grader_key=self.grader_key)

    def get_logs(self, clear: bool = False) -> dict[str, float | Table]:
        # Get default statistics (mean, std, etc.)
        logs = super().get_logs(clear=False)

        # Filter logs for malformed samples
        malformed = [log for log in self._logs if not log.get("valid_format", True)]

        # Add custom metrics
        logs["score/malformed_count"] = len(malformed)
        logs["score/malformed_rate"] = len(malformed) / len(self._logs) if self._logs else 0

        # Create a table of malformed samples
        if malformed:
            malformed_table = (
                Table()
                .add_column("Completion", [log["completion"][:100] + "..." for log in malformed])
                .add_column("Ground Truth", [log["ground_truth"] for log in malformed])
            )
            logs["score/malformed_samples"] = malformed_table

        if clear:
            self.clear_logs()

        return logs
```

### Creating tables

The `Table` class provides a fluent API for building tables:

```python theme={null}
from adaptive_harmony.logging_table import Table

# Build table column-by-column
table = (
    Table()
    .add_column("Category", ["A", "B", "C"])
    .add_column("Score", [0.9, 0.7, 0.85])
    .add_column("Reason", ["Good", "Needs work", "Excellent"])
)

# Or build row-by-row
table = Table(initial_headers=["Category", "Score", "Reason"])
table.add_row(["A", 0.9, "Good"])
table.add_row(["B", 0.7, "Needs work"])
```

Tables support strings and floats. They render as styled HTML in experiment tracking dashboards.

### Helper: `get_sample_tables()`

The base class provides `get_sample_tables()` to quickly create tables of scored and failed samples:

```python theme={null}
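from adaptive_harmony import StringThread
from adaptive_harmony.graders import BaseGrader, Grade
from adaptive_harmony.graders.utils import SuccessJudgeLog, FailedJudgeLog
from adaptive_harmony.logging_table import Table

# compute_score is a placeholder for your own scoring logic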

class MyGrader(BaseGrader):
    async def grade(self, sample: StringThread) -> Grade:
        try:
            score = compute_score(sample)
            self.add_log({
                "score": score,
                "prompt": sample.get_turns()[0].content,
                "reasoning": "Some reasoning"
            })
            return Grade(value=score, grader_key=self.grader_key)
        except Exception as e:
            self.add_log({
                "error": str(e),
                "prompt": sample.get_turns()[0].content
            })
            return Grade(value=0.0, grader_key=self.grader_key)

    def get_logs(self, clear: bool = False) -> dict[str, float | Table]:
        logs = super().get_logs(clear=False)

        # Separate successful and failed samples
        successful = [log for log in self._logs if "score" in log and "error" not in log]
        failed = [log for log in self._logs if "error" in log]

        # Use helper to create tables
        table_logs = self.get_sample_tables(successful, failed)
        logs.update(table_logs)

        if clear:
            self.clear_logs()

        return logs
```

This creates:

* `score/scored_samples` - Table with Prompt, Reasoning, Score columns
* `score/unscored_samples` - Table with Prompt, Error columns (if failures exist)
* `score/scored_samples_count` - Count of successful samples
* `score/unscored_samples_count` - Count of failed samples

**Log format required:**

* Successful logs: `{"score": float, "prompt": str, "reasoning": str (optional)}`
* Failed logs: `{"error": str, "prompt": str (optional)}`
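The two log shapes can be written down as typed dictionaries and separated with a small helper. The sketch below is an illustration only: `ScoredLog`, `UnscoredLog`, and `split_logs` are hypothetical names and are not the library's `SuccessJudgeLog` / `FailedJudgeLog` types.

```python
from typing import TypedDict


class ScoredLog(TypedDict, total=False):
    """Hypothetical shape of a successful log entry."""
    score: float
    prompt: str
    reasoning: str  # optional


class UnscoredLog(TypedDict, total=False):
    """Hypothetical shape of a failed log entry."""
    error: str
    prompt: str  # optional


def split_logs(logs: list[dict]) -> tuple[list[ScoredLog], list[UnscoredLog]]:
    """Separate logs into scored and unscored entries, rejecting anything else."""
    scored: list[ScoredLog] = []
    unscored: list[UnscoredLog] = []
    for log in logs:
        if "error" in log:
            unscored.append(log)
        elif "score" in log and "prompt" in log:
            scored.append(log)
        else:
            raise ValueError(f"log matches neither format: {log!r}")
    return scored, unscored


logs = [
    {"score": 0.9, "prompt": "Classify this ticket", "reasoning": "Correct label"},
    {"error": "could not parse output", "prompt": "Classify this ticket"},
    {"score": 0.0, "prompt": "Classify that ticket"},
]
scored, unscored = split_logs(logs)
print(len(scored), len(unscored))  # 2 1
```

Validating the shape before building tables surfaces malformed log entries early instead of producing tables with missing columns.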

### When to use custom logging

**Use default logging when:**

* You only need aggregated statistics (mean, std, etc.)
* Logging `{"score": value}` is sufficient

**Use `get_sample_tables()` when:**

* You want quick tables of scored/failed samples
* Your logs match the expected format (prompt, reasoning, score, error)

**Override `get_logs()` when:**

* You want custom metrics (e.g., false positive rate, category-wise accuracy)
* You need tables with custom columns or groupings
* You want to track additional aggregations beyond score statistics
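As an example of a custom metric, category-wise accuracy can be computed from the per-sample logs before adding it to the dictionary returned by an overridden `get_logs()`. The sketch below is standalone and makes assumptions: the helper name `category_accuracy` and the `accuracy/` key prefix are hypothetical, and the log keys `predicted` and `ground_truth` follow the malformed-output example earlier on this page.

```python
from collections import defaultdict


def category_accuracy(logs: list[dict]) -> dict[str, float]:
    """Compute accuracy per ground-truth category from per-sample logs."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for log in logs:
        truth = log.get("ground_truth")
        if truth is None:
            continue  # skip logs that carry no label
        total[truth] += 1
        # A malformed output logs predicted=None, which counts as incorrect
        if log.get("predicted") == truth:
            correct[truth] += 1
    return {f"accuracy/{category}": correct[category] / total[category] for category in total}


logs = [
    {"predicted": "A", "ground_truth": "A"},
    {"predicted": "B", "ground_truth": "A"},
    {"predicted": None, "ground_truth": "A"},  # malformed output
    {"predicted": "B", "ground_truth": "B"},
]
metrics = category_accuracy(logs)
print(metrics["accuracy/B"])  # 1.0
```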
