Graders in adaptive_harmony are designed to be reusable for both evaluations and training: they provide a reward for training, or a numerical score with an optional reason for evaluation.
Define your own Recipe Grader
To create a custom recipe grader, you must inherit from the BaseGrader class and implement the grade method. __init__ is not mandatory; implement it only if you need custom state in your grader instances.
The following is a simple example of a Grader that uses an LLM as a judge:
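A minimal sketch of such a grader is shown below. The exact grade() signature, the sample fields, the judge-model client, and the import path are assumptions made for illustration; adapt them to your codebase.

```python
from adaptive_harmony.graders import BaseGrader, Grade  # import path is an assumption


class LLMJudgeGrader(BaseGrader):
    """Asks a judge model to rate a completion between 0 and 1."""

    grader_key = "llm_judge"  # logical name reported in the UI

    def __init__(self, judge_model):
        super().__init__()
        self.judge_model = judge_model  # hypothetical async judge client

    async def grade(self, sample) -> Grade:
        # The sample object and its fields are assumptions for illustration
        prompt = (
            "Rate the following answer between 0 and 1 for helpfulness.\n"
            f"Answer:\n{sample.completion}\n"
            "Reply with a single float."
        )
        raw = await self.judge_model.generate(prompt)  # hypothetical call
        score = float(raw.strip())
        return Grade(
            value=score,
            grader_key=self.grader_key,
            reasoning=f"Judge score: {score}",
        )
```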
A simple string grader_key is mandatory for any recipe grader you build; you will understand why when you read Create an evaluation recipe. The grader_key is the logical name that is displayed in the UI to report evaluation scores for a grader that lives in your recipe.
Quick implementation from a simple function
Oftentimes, you might want to implement a grader as a simple Python function; an example of this would be checking for a substring in the generated completion against some ground-truth label (this would be the case if you’d asked your model to predict some category in a structured way, for example <category_A>predicted_category</category_A>).
A function such as the one described above would require no state. To make it easier to implement such a Grader without the boilerplate overhead, you can use the BaseGrader.from_function() method. The only requirement is that the function is asynchronous.
Below is an example where:
- a completion (which we expect to be a valid JSON object) is validated for adherence against a desired Pydantic Model, which includes reasoning and a category label prediction
- the predicted category label is compared against the ground truth, which is attached in the sample’s metadata
- a reward is then returned as follows: -1.0 if the model did not respect the output format, 0.0 if the format was respected but the predicted category was wrong, and 1.0 if both the format and the predicted label were correct
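A sketch of this pattern follows. The function signature accepted by BaseGrader.from_function(), the sample fields, and the grader_key keyword argument are assumptions made for illustration.

```python
from pydantic import BaseModel, ValidationError

from adaptive_harmony.graders import BaseGrader  # import path is an assumption


class CategoryPrediction(BaseModel):
    """Expected JSON structure of the completion."""

    reasoning: str
    category: str


async def category_reward(sample) -> float:
    """-1.0 on format violation, 0.0 on wrong category, 1.0 on a correct prediction."""
    try:
        prediction = CategoryPrediction.model_validate_json(sample.completion)
    except ValidationError:
        return -1.0  # completion does not follow the expected schema
    ground_truth = sample.metadata["category"]  # hypothetical metadata key
    return 1.0 if prediction.category == ground_truth else 0.0


# Passing grader_key here is an assumption; a grader_key is mandatory for recipe graders.
grader = BaseGrader.from_function(category_reward, grader_key="category_match")
```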
Grade object
The Grade object contains:
- value: The numerical score (float)
- grader_key: Identifier for proper logging in the app for evaluations
- reasoning: Optional field to provide a reasoning to back the score, which will be displayed as interaction metadata in the app if you save this grade in an EvaluationArtifact
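For instance, a grader might return a grade like the following (the values shown are purely illustrative):

```python
Grade(
    value=0.75,
    grader_key="llm_judge",
    reasoning="Answer is mostly correct but misses one edge case.",
)
```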
Combine grades
All trainer classes take a single Grader as the reward function for training. This is because the underlying RL algorithms require a single scalar to optimize towards. When optimizing on multiple graders in a single training run, you must combine their rewards into a single float reward. You can trivially achieve this by using the CombinedGrader. You can also (optionally) weigh the contributions of different graders as desired.
Use the CombinedGrader
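A possible usage sketch follows; the CombinedGrader constructor arguments shown here (graders, weights) are assumptions based on the description above, not a confirmed signature.

```python
from adaptive_harmony.graders import CombinedGrader  # import path is an assumption

# format_grader and llm_judge_grader are graders defined elsewhere in your recipe.
# Combine them into a single scalar reward; the weights argument reflects the
# optional per-grader weighting mentioned above.
combined_grader = CombinedGrader(
    graders=[format_grader, llm_judge_grader],
    weights=[0.3, 0.7],
)
```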
Logging and Aggregation
Graders provide built-in logging to track metrics across all graded samples. This is useful for monitoring average rewards during RL training or analyzing grader behavior during evaluation.
How Logging Works
Step 1: Log data in grade()
Use self.add_log(dict) to record information for each sample:
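For example, inside your grader's grade() method (the logged fields and sample attributes here are assumptions):

```python
# Inside a BaseGrader subclass:
async def grade(self, sample) -> Grade:
    score = await self.compute_score(sample)  # hypothetical scoring helper
    # Record one log entry per graded sample
    self.add_log({"score": score, "prompt": sample.prompt})
    return Grade(value=score, grader_key=self.grader_key)
```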
Step 2: Aggregate with get_logs()
Call self.get_logs() to get aggregated statistics. The default implementation computes mean, std, min, max, and count for any "score" keys logged:
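For example (the exact key names in the returned dictionary are assumptions):

```python
logs = grader.get_logs(clear=True)
# Possible shape of the aggregated statistics:
# {
#     "score/mean": 0.62,
#     "score/std": 0.21,
#     "score/min": 0.0,
#     "score/max": 1.0,
#     "score/count": 128,
# }
```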
The clear=True parameter:
When clear=True, the grader’s internal log buffer (self._logs) is reset after computing aggregations. This ensures the next get_logs() call only aggregates logs added since the last call. Without clear=True, logs accumulate indefinitely, causing get_logs() to aggregate over all samples ever graded. Use clear=True when you want epoch-level or batch-level statistics.
Custom Logging with Tables
Override get_logs() to add custom metrics or create tables for detailed inspection. Tables are rendered as interactive HTML in W&B/MLflow.
Example: Log malformed outputs in a table
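One way this could look is sketched below; the Table construction calls and the handling of the internal log buffer are assumptions, not the library's exact API.

```python
from adaptive_harmony.logging import Table  # import path is an assumption


class StrictJsonGrader(BaseGrader):
    grader_key = "strict_json"

    def get_logs(self, clear: bool = True) -> dict:
        # Snapshot the buffer before the base implementation potentially clears it
        entries = list(self._logs)
        aggregated = super().get_logs(clear=clear)  # default score statistics

        # Collect samples whose completion failed to parse into a table
        malformed = Table(columns=["Prompt", "Error"])  # Table API is an assumption
        for entry in entries:
            if "error" in entry:
                malformed.add_row(entry.get("prompt", ""), entry["error"])

        aggregated["score/malformed_outputs"] = malformed
        return aggregated
```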
Creating Tables
The Table class provides a fluent API for building tables:
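A fluent-style sketch is shown below; the method names on Table are assumptions chosen for illustration.

```python
# Build a small table by chaining calls (illustrative method names).
table = (
    Table(columns=["Prompt", "Score", "Reasoning"])
    .add_row("What is 2 + 2?", 1.0, "Correct answer")
    .add_row("Name the capital of France.", 0.0, "Answered 'Lyon'")
)
```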
Helper: get_sample_tables()
The base class provides get_sample_tables() to quickly create tables of scored and failed samples:
- score/scored_samples - Table with Prompt, Reasoning, Score columns
- score/unscored_samples - Table with Prompt, Error columns (if failures exist)
- score/scored_samples_count - Count of successful samples
- score/unscored_samples_count - Count of failed samples
It expects log entries in the following formats:
- Successful logs: {"score": float, "prompt": str, "reasoning": str (optional)}
- Failed logs: {"error": str, "prompt": str (optional)}
When to Use Custom Logging
Use default logging when:
- You only need aggregated statistics (mean, std, etc.)
- Logging {"score": value} is sufficient

Use get_sample_tables() when:
- You want quick tables of scored/failed samples
- Your logs match the expected format (prompt, reasoning, score, error)

Override get_logs() when:
- You want custom metrics (e.g., false positive rate, category-wise accuracy)
- You need tables with custom columns or groupings
- You want to track additional aggregations beyond score statistics

