Graders are systems that evaluate generated completions against any custom heuristic or criterion. Graders in adaptive_harmony are designed to be reusable for both evaluation and training: they provide a reward for training, or a numerical score with an optional reason for evaluation.

Define your own Recipe Grader

To create a custom recipe grader, you must inherit from the BaseGrader class and implement the grade method. Implementing __init__ is not mandatory; add it only if your grader instances need custom state. The following is a simple example of a grader that uses an LLM as a judge:
from adaptive_harmony import StringThread, InferenceModel
from adaptive_harmony.graders import BaseGrader, Grade

class MyCustomGrader(BaseGrader):
    def __init__(self, grader_key: str, model: InferenceModel):
        """
        Initialize the grader with parameters like model keys, API endpoints, or other configuration.
        """
        super().__init__(grader_key)
        self.grader_key = grader_key
        self.model = model
    
    async def grade(self, sample: StringThread) -> Grade:
        """
        The main function that returns the grade. 
        Must return a `Grade` object.
        """
        completion = sample.last_content()
        # you could judge the sample with the spawned LLM here, e.g.:
        # judge_thread: StringThread = format_completion_into_judge_prompt(completion)
        # self.model.generate(judge_thread)
        score = 1.0
        reason = "Custom scoring logic which could come from LLM"
        return Grade(value=score, grader_key=self.grader_key, reasoning=reason)
A plain string grader_key is mandatory for any recipe grader you build; you will see why when you read Create an evaluation recipe. The grader_key is the logical name displayed in the UI to report evaluation scores for a grader that lives in your recipe.
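For instance, the grader above could be instantiated as follows. This is only a sketch: judge_model stands for an InferenceModel you have obtained elsewhere, and the grader_key value is an arbitrary example.
# Hypothetical instantiation; judge_model is an InferenceModel obtained elsewhere
my_grader = MyCustomGrader(grader_key="llm-judge", model=judge_model)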

Quick implementation from a simple function

Often, you might want to implement a grader as a simple Python function. An example would be checking a substring in the generated completion against a ground-truth label (for instance, if you asked your model to predict a category in a structured way, such as <category_A>predicted_category</category_A>). Such a function requires no state. To make it easier to implement a grader like this without the boilerplate overhead, you can use the BaseGrader.from_function() method. The only requirement is that the function is asynchronous. Below is an example where:
  1. a completion (expected to be a valid JSON object) is validated against a desired Pydantic model, which includes reasoning and a category label prediction
  2. the predicted category label is compared against the ground truth attached in the sample’s metadata
  3. a reward is returned: -1.0 if the model did not respect the output format, 0.0 if the format was respected but the predicted category was wrong, and 1.0 if both the format and the predicted label were correct
from typing import Literal
from pydantic import BaseModel

from adaptive_harmony import StringThread
from adaptive_harmony.graders import BaseGrader
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException

class OutputExpectation(BaseModel):
    reasoning: str
    category: Literal["A", "B", "C", "D", "E"]

async def grade_fn(sample: StringThread) -> float:
    try:
        structured_output = pydantic_parse(sample.last_content(), OutputExpectation)
    except OutputParserException:
        return -1.0
    
    ground_truth = sample.metadata.get("category")

    if structured_output.category == ground_truth:
        return 1.0
    else:
        return 0.0

# Creates a grader with default logging
grader = BaseGrader.from_function("label-grader", grade_fn)
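The object returned by from_function behaves like any other grader. A minimal usage sketch, assuming sample is a StringThread whose metadata carries the ground-truth category:
# Hypothetical usage; sample is a StringThread with the ground-truth label in its metadata
grade = await grader.grade(sample)
print(grade.value)  # 1.0, 0.0 or -1.0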

Grade object

The Grade object contains:
  • value: The numerical score (float)
  • grader_key: Identifier used to log the grade under the correct grader in the app for evaluations
  • reasoning: Optional field providing the reasoning behind the score; it is displayed as interaction metadata in the app if you save the grade in an EvaluationArtifact
from adaptive_harmony import Grade

grade = Grade(
    value=0.95,
    grader_key="my-custom-grader",
    reasoning="Response demonstrates excellent factual accuracy and relevance"
)

Combine grades

All trainer classes take a single Grader as the reward function for training, because the underlying RL algorithms require a single scalar to optimize towards. When optimizing on multiple graders in a single training run, you must combine their rewards into a single float reward. You can do this with the CombinedGrader, optionally weighting the contribution of each grader as desired.
Use the CombinedGrader
from adaptive_harmony.graders.combined_grader import CombinedGrader

combined_grader = CombinedGrader(
    grader_key="combined",
    graders=[grader_1, grader_2, grader_3],
    weights=[1.0, 2.0, 1.0]
)

# Use the combined grader
grade = await combined_grader.grade(thread)  # thread: a StringThread to grade
print(grade)  # combined grade from all graders
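If you assume CombinedGrader computes a weighted average of the individual grades (a sketch of the expected behaviour, not a statement about the actual implementation), the weights above would combine grades like this:
# Illustration only: weighted average with weights [1.0, 2.0, 1.0]
values = [0.5, 1.0, 0.0]   # hypothetical grades from grader_1, grader_2, grader_3
weights = [1.0, 2.0, 1.0]
combined = sum(w * v for w, v in zip(weights, values)) / sum(weights)
print(combined)  # 0.625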

Logging and Aggregation

Graders provide built-in logging to track metrics across all graded samples. This is useful for monitoring average rewards during RL training or analyzing grader behavior during evaluation.

How Logging Works

Step 1: Log data in grade()
Use self.add_log(dict) to record information for each sample:
async def grade(self, sample: StringThread) -> Grade:
    prediction = parse_output(sample.last_content())
    score = compute_score(prediction)

    # Log data for this sample
    self.add_log({"score": score, "category": prediction.category})

    return Grade(value=score, grader_key=self.grader_key)
Step 2: Aggregate logs with get_logs()
Call self.get_logs() to get aggregated statistics. The default implementation computes mean, std, min, max, and count for any "score" keys logged:
# After grading multiple samples
logs = grader.get_logs(clear=True)  # clear=True resets log buffer
print(logs)
# {
#   "score/mean": 0.85,
#   "score/std": 0.12,
#   "score/min": 0.5,
#   "score/max": 1.0,
#   "score/count": 100
# }
Understanding clear=True
When clear=True, the grader’s internal log buffer (self._logs) is reset after computing aggregations. This ensures the next get_logs() call only aggregates logs added since the last call.
# Training loop example
for epoch in range(3):
    for batch in batches:
        for sample in batch:
            grade = await grader.grade(sample)  # Logs accumulate

    # Get stats for this epoch only
    epoch_logs = grader.get_logs(clear=True)  # Aggregates + clears buffer
    logger.log_metrics(epoch_logs, step=epoch)

    # Next epoch starts with empty buffer
Without clear=True, logs accumulate indefinitely, causing get_logs() to aggregate over all samples ever graded. Use clear=True when you want epoch-level or batch-level statistics.
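As a quick illustration of these semantics:
grader.get_logs()             # aggregates over every sample graded so far; buffer is kept
grader.get_logs(clear=True)   # aggregates, then resets the internal buffer
grader.get_logs(clear=True)   # now covers only samples graded since the previous call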

Custom Logging with Tables

Override get_logs() to add custom metrics or create tables for detailed inspection. Tables are rendered as interactive HTML in W&B/MLflow.
Example: Log malformed outputs in a table
from adaptive_harmony import StringThread, Grade
from adaptive_harmony.graders import BaseGrader
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException
from adaptive_harmony.logging_table import Table

# MySchema is assumed to be a Pydantic model with a `category` field, defined elsewhere

class MyGrader(BaseGrader):
    async def grade(self, sample: StringThread) -> Grade:
        try:
            prediction = pydantic_parse(sample.last_content(), MySchema)
            score = 1.0 if prediction.category == sample.metadata["category"] else 0.0

            self.add_log({
                "score": score,
                "valid_format": True,
                "completion": sample.last_content(),
                "predicted": prediction.category,
                "ground_truth": sample.metadata["category"]
            })
        except OutputParserException:
            score = -1.0
            self.add_log({
                "score": score,
                "valid_format": False,
                "completion": sample.last_content(),
                "predicted": None,
                "ground_truth": sample.metadata["category"]
            })

        return Grade(value=score, grader_key=self.grader_key)

    def get_logs(self, clear: bool = False) -> dict[str, float | Table]:
        # Get default statistics (mean, std, etc.)
        logs = super().get_logs(clear=False)

        # Filter logs for malformed samples
        malformed = [log for log in self._logs if not log.get("valid_format", True)]

        # Add custom metrics
        logs["score/malformed_count"] = len(malformed)
        logs["score/malformed_rate"] = len(malformed) / len(self._logs) if self._logs else 0

        # Create a table of malformed samples
        if malformed:
            malformed_table = (
                Table()
                .add_column("Completion", [log["completion"][:100] + "..." for log in malformed])
                .add_column("Ground Truth", [log["ground_truth"] for log in malformed])
            )
            logs["score/malformed_samples"] = malformed_table

        if clear:
            self.clear_logs()

        return logs

Creating Tables

The Table class provides a fluent API for building tables:
from adaptive_harmony.logging_table import Table

# Build table column-by-column
table = (
    Table()
    .add_column("Category", ["A", "B", "C"])
    .add_column("Score", [0.9, 0.7, 0.85])
    .add_column("Reason", ["Good", "Needs work", "Excellent"])
)

# Or build row-by-row
table = Table(initial_headers=["Category", "Score", "Reason"])
table.add_row(["A", 0.9, "Good"])
table.add_row(["B", 0.7, "Needs work"])
Tables support strings and floats. They render as styled HTML in experiment tracking dashboards.

Helper: get_sample_tables()

The base class provides get_sample_tables() to quickly create tables of scored and failed samples:
from adaptive_harmony import StringThread, Grade
from adaptive_harmony.graders import BaseGrader
from adaptive_harmony.graders.utils import SuccessJudgeLog, FailedJudgeLog
from adaptive_harmony.logging_table import Table

# compute_score is a placeholder for your own scoring logic, defined elsewhere

class MyGrader(BaseGrader):
    async def grade(self, sample: StringThread) -> Grade:
        try:
            score = compute_score(sample)
            self.add_log({
                "score": score,
                "prompt": sample.get_turns()[0].content,
                "reasoning": "Some reasoning"
            })
            return Grade(value=score, grader_key=self.grader_key)
        except Exception as e:
            self.add_log({
                "error": str(e),
                "prompt": sample.get_turns()[0].content
            })
            return Grade(value=0.0, grader_key=self.grader_key)

    def get_logs(self, clear: bool = False) -> dict[str, float | Table]:
        logs = super().get_logs(clear=False)

        # Separate successful and failed samples
        successful = [log for log in self._logs if "score" in log and "error" not in log]
        failed = [log for log in self._logs if "error" in log]

        # Use helper to create tables
        table_logs = self.get_sample_tables(successful, failed)
        logs.update(table_logs)

        if clear:
            self.clear_logs()

        return logs
This creates:
  • score/scored_samples - Table with Prompt, Reasoning, Score columns
  • score/unscored_samples - Table with Prompt, Error columns (if failures exist)
  • score/scored_samples_count - Count of successful samples
  • score/unscored_samples_count - Count of failed samples
Log format required (see the sketch after this list):
  • Successful logs: {"score": float, "prompt": str, "reasoning": str (optional)}
  • Failed logs: {"error": str, "prompt": str (optional)}
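A minimal sketch of log entries matching these formats (all values are illustrative):
# Illustrative log entries matching the formats above
successful_log = {"score": 1.0, "prompt": "Classify this ticket...", "reasoning": "Matched the ground truth label"}
failed_log = {"error": "OutputParserException: could not parse completion", "prompt": "Classify this ticket..."}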

When to Use Custom Logging

Use default logging when:
  • You only need aggregated statistics (mean, std, etc.)
  • Logging {"score": value} is sufficient
Use get_sample_tables() when:
  • You want quick tables of scored/failed samples
  • Your logs match the expected format (prompt, reasoning, score, error)
Override get_logs() when:
  • You want custom metrics (e.g., false positive rate, category-wise accuracy)
  • You need tables with custom columns or groupings
  • You want to track additional aggregations beyond score statistics