Graders are systems that evaluate generated completions against custom heuristics or criteria. Graders in adaptive_harmony are designed to be reusable for both evaluations and training: they provide a reward for training, or a numerical score with an optional reason for evaluation.

Define your own Recipe Grader

To create a custom recipe grader, you must inherit from the Grader class and implement 4 compulsory methods: __init__, grade, setup and teardown. The following is a simple example of a Grader that uses an LLM as a judge, and it illustrates what is expected of each of the 4 methods:
from adaptive_harmony import StringThread, HarmonyClient
from adaptive_harmony.graders import Grader, Grade 

class MyCustomGrader(Grader):
    def __init__(self, grader_key: str, model_key: str, client: HarmonyClient):
        """
        Initialize the grader with parameters like model keys, API endpoints, or other configuration.
        """
        super().__init__(grader_key)
        self.grader_key = grader_key
        self.model_key = model_key
        self._client = client
        self.model_is_spawned = False
    
    async def grade(self, sample: StringThread) -> Grade:
        """
        The main function that returns the grade. 
        Must return a `Grade` object.
        """
        if not self.model_is_spawned:
            raise RuntimeError("Model not initialized, run grader.setup() before grading")
        completion = sample.last_content()
        # you could judge the sample with the spawned LLM here
        # self.model.generate(sample)
        score = 1
        reason = "Custom scoring logic"
        return Grade(value=score, grader_key=self.grader_key, reason=reason)
    
    async def setup(self) -> None:
        """
        Should be called before using the grader in a recipe.
        Useful for loading LLMs or setting up tools to avoid reloading at every `.grade()` call.
        If your grader does not require a setup method, `setup` should be implemented as a no-op. 
        """
        self.model = await self._client.model(path=self.model_key).spawn_inference(name=self.grader_key)
        self.model_is_spawned = True
    
    async def teardown(self) -> None:
        """
        Called at the end of a recipe, when the grader is no longer used, to clean up resources.
        """
        await self.model.dealloc()
A simple string grader_key is mandatory for any recipe grader you build; you will understand why when you read Create an evaluation recipe. The grader_key is the logical name that is displayed in the UI to report evaluation scores for a grader that lives in your recipe.
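For reference, a typical lifecycle of this grader inside a recipe looks roughly like the sketch below (a minimal sketch: the key values are placeholders, and it assumes you already have a HarmonyClient named client and a StringThread named thread):
# Illustrative lifecycle sketch; names and key values are placeholders
grader = MyCustomGrader(grader_key="llm-judge", model_key="judge-model-path", client=client)

await grader.setup()                 # spawn the judge model once
grade = await grader.grade(thread)   # grade a single StringThread sample
print(grade.value, grade.reason)
await grader.teardown()              # free the spawned model when the recipe is done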

Quick implementation from a simple function

Often, you might want to implement a grader as a simple Python function; an example of this would be checking a substring in the generated completion against some ground truth label (this would be the case if you’d asked your model to predict some category in a structured way, for example <category_A>predicted_category</category_A>). A function such as this requires no state, setup or teardown. To make it easier to implement such a Grader without the boilerplate overhead, you can use the Grader.from_function() method. The only requirement is that the function is asynchronous. Below is an example where:
  1. a completion (which we expect to be a valid JSON object) is validated for adherence against a desired Pydantic Model, which includes reasoning and a category label prediction
  2. the predicted category label is compared against the ground truth, which is attached in the sample’s metadata
  3. a reward is then returned as follows: -1.0 if the model did not respect the output format, 0.0 if the format was respected but the predicted category was wrong, and 1.0 if both the format and the predicted label were correct
from typing import Literal
from pydantic import BaseModel

from adaptive_harmony import StringThread
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException
from adaptive_harmony.graders import Grader

class OutputExpectation(BaseModel):
    reasoning: str
    category: Literal["A", "B", "C", "D", "E"]

async def grade_fn(sample: StringThread) -> float:
    try:
        structured_output = pydantic_parse(sample.last_content(), OutputExpectation)
    except OutputParserException:
        return -1.0
    
    ground_truth = sample.metadata.get("category")

    if structured_output.category == ground_truth:
        return 1.0
    else:
        return 0.0

# Creates a grader with default logging and no setup/teardown
grader = Grader.from_function(grade_fn, grader_key="label-grader")
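
Such a grader needs no explicit setup or teardown; you can use it directly. A minimal sketch, assuming sample is a StringThread whose metadata already contains the "category" ground truth:
# Minimal usage sketch; `sample` is assumed to carry the ground truth in its metadata
grade = await grader.grade(sample)   # sample.metadata must contain "category"
print(grade.value)                   # -1.0, 0.0 or 1.0 depending on format and label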

Grade object

The Grade object contains:
  • value: The numerical score (float)
  • grader_key: Identifier for proper logging in the app for evaluations
  • reason: Optional field providing reasoning to back the score, which will be displayed as interaction metadata in the app if you save this grade in an EvaluationArtifact
from adaptive_harmony import Grade

grade = Grade(
    value=0.95,
    grader_key="my-custom-grader",
    reason="Response demonstrates excellent factual accuracy and relevance"
)

Combine grades

All trainer classes take a single Grader as the reward function for training, because the underlying RL algorithms require a single scalar to optimize towards. When optimizing on multiple graders in a single training run, you must combine their rewards into a single float reward. You can achieve this with the CombinedGrader, which also lets you optionally weight the contributions of the different graders.
Use the CombinedGrader
from adaptive_harmony.graders import CombinedGrader

combined_grader = CombinedGrader(
    graders=[grader_1, grader_2, grader_3],
    weights=[1.0, 2.0, 1.0],
)

# Use the combined grader
grade = await combined_grader.grade(thread)
print(grade)  # weighted mean of the grades from all graders

Instrument Graders with logs

As you’ve seen by this point, the .grade method is asynchronous and grades a single sample. However, it’s still a common need to get aggregated scores across all samples, most often for logging purposes; for example, to get the average reward obtained on all samples for a single RL step. The base Grader implements logging logic that lets you log a score for every graded sample and get aggregated statistics (mean, std, min, max, count) over all scores processed so far by the grader. You can get the aggregates using the .get_logs() method on your grader; you can optionally pass clear=True in this call to wipe all accumulated logs and reset the aggregation. The default get_logs() requires that you log your grade value under a score key. Use add_log in your .grade() method to add a new log dict with the desired fields (include score if you want to use the default .get_logs() implementation), and override the get_logs method if custom logging behavior is required. Here is an example of a get_logs method that logs the same statistics as the default logger, but also logs a table of completions that did not follow a given desired format.
from adaptive_harmony.logging_table import Table


def get_logs(self, clear: bool = False) -> dict[str, float | Table]:
    # Only clear logs at the end if clear is True
    logs = super().get_logs(clear=False)  # also include the default statistics

    def get_samples_table(logs: list[dict]) -> Table:
        return (
            Table()
            .add_column("Messages", [log.get("messages", "") for log in logs])
            .add_column("Prompt", [log.get("user_message", "") for log in logs])
            .add_column("Completion", [log.get("completion", "") for log in logs])
            .add_column("Score", [float(log.get("score", 0)) for log in logs])
            .add_column("Ground Truth", [log.get("ground_truth", "") for log in logs])
        )

    wrong_format_samples = [log for log in self._logs if log.get("good_format") is False]
    logs["score/wrong_format_samples_count"] = len(wrong_format_samples)
    # log the badly formatted completions as a table (the key name is arbitrary)
    logs["score/wrong_format_samples"] = get_samples_table(wrong_format_samples)

    if clear:
        self.clear_logs()

    return logs
To match this get_logs method, you would log the following keys when calling add_log in your grader’s grade method:
from adaptive_harmony.core.utils import stringify_thread

async def grade(self, sample: StringThread) -> Grade:
    ...
    score = 1
    self.add_log(
        {
            "score": score,
            "messages": stringify_thread(sample),
            "completion": completion,
            "good_format": True,
            "ground_truth": ground_truth,
        }
    )
    return Grade(value=score, grader_key=self.grader_key)
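
A rough sketch of how this fits into a per-step loop, assuming batch is an iterable of StringThread samples (the loop structure below is illustrative, not part of the Grader API):
# Illustrative sketch: grade a batch of samples, then flush aggregated logs once per step
for sample in batch:
    await grader.grade(sample)           # each call appends a log entry via add_log

step_logs = grader.get_logs(clear=True)  # aggregated statistics (and custom table), then reset
print(step_logs)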