Graders in adaptive_harmony are designed to be reusable for both evaluations and training: they provide a reward for training, or a numerical score with an optional reason for evaluation.
Define your own Recipe Grader
To create a custom recipe grader, you must inherit from the Grader class and implement 4 compulsory methods: __init__, grade, setup and teardown.
Following is a simple example of a Grader that uses an LLM as a judge, which illustrates what is expected of each of the 4 methods:
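This sketch is illustrative rather than the verbatim adaptive_harmony API: the import path, the grade() signature (a sample exposing prompt and completion attributes), and the use of an OpenAI client as the judge backend are all assumptions.

```python
# Illustrative sketch only: the import path, the grade() signature (a sample
# with .prompt / .completion), and the OpenAI judge backend are assumptions.
from openai import AsyncOpenAI

from adaptive_harmony.graders import Grade, Grader  # hypothetical import path


class LLMJudgeGrader(Grader):
    def __init__(self, judge_model: str = "gpt-4o-mini"):
        # __init__: configure the grader; a string grader_key is mandatory.
        super().__init__(grader_key="llm_judge")
        self.judge_model = judge_model
        self.client: AsyncOpenAI | None = None

    async def setup(self) -> None:
        # setup: acquire expensive resources once, before any sample is graded.
        self.client = AsyncOpenAI()

    async def grade(self, sample) -> Grade:
        # grade: score a single sample by asking the judge model for a 0-10
        # rating, normalized to [0, 1].
        response = await self.client.chat.completions.create(
            model=self.judge_model,
            messages=[{
                "role": "user",
                "content": (
                    "Rate this answer from 0 to 10. Reply with the number only.\n"
                    f"Question: {sample.prompt}\nAnswer: {sample.completion}"
                ),
            }],
        )
        rating = response.choices[0].message.content.strip()
        return Grade(
            value=float(rating) / 10.0,
            grader_key=self.grader_key,
            reason=f"Judge rated the answer {rating}/10",
        )

    async def teardown(self) -> None:
        # teardown: release resources once grading is finished.
        await self.client.close()
```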
A simple string grader_key is mandatory for any recipe grader you build; you will understand why when you read Create an evaluation recipe. The grader_key is the logical name that is displayed in the UI to report evaluation scores for a grader that lives in your recipe.
Quick implementation from a simple function
Oftentimes, you might want to implement a grader as a simple Python function; an example of this would be checking a substring in the generated completion against some ground truth label (this would be the case if you’d asked your model to predict some category in a structured way, for example <category_A>predicted_category</category_A>).
A function such as the one described above would require no state, setup or teardown. To make it easier to implement such a Grader without the boilerplate overhead, you can use the Grader.from_function() method. The only requirement is that the function is asynchronous.
Below is an example where:
- a completion (which we expect to be a valid JSON object) is validated against a desired Pydantic model, which includes reasoning and a category label prediction
- the predicted category label is compared against the ground truth, which is attached in the sample’s metadata
- a reward is then returned as follows: -1.0 if the model did not respect the output format, 0.0 if the format was respected but the predicted category was wrong, and 1.0 if both the format and the predicted label were correct
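A sketch of what this could look like follows; the import path, the from_function signature (here taking a grader_key keyword), and the sample’s completion/metadata attributes are assumptions, while the reward scheme matches the list above.

```python
# Sketch, assuming from_function accepts a grader_key and the sample exposes
# .completion and .metadata; exact signatures may differ.
from pydantic import BaseModel, ValidationError

from adaptive_harmony.graders import Grader  # hypothetical import path


class CategoryPrediction(BaseModel):
    reasoning: str
    category: str


async def grade_category(sample) -> float:
    # Validate that the completion is a JSON object matching the Pydantic model.
    try:
        prediction = CategoryPrediction.model_validate_json(sample.completion)
    except ValidationError:
        return -1.0  # output format not respected
    # Compare the predicted label against the ground truth in the metadata.
    if prediction.category == sample.metadata["category"]:
        return 1.0  # correct format and correct label
    return 0.0  # correct format, wrong label


category_grader = Grader.from_function(grade_category, grader_key="category_match")
```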
Grade object
The Grade object contains:
- value: The numerical score (float)
- grader_key: Identifier for proper logging in the app for evaluations
- reason: Optional field to provide reasoning that backs the score, which will be displayed as interaction metadata in the app if you save this grade in an EvaluationArtifact
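For instance, a grader might return a grade like this (assuming keyword-argument construction; the import path is hypothetical):

```python
from adaptive_harmony.graders import Grade  # hypothetical import path

grade = Grade(
    value=1.0,  # the numerical score, used as the reward or evaluation score
    grader_key="category_match",  # identifier reported in the app
    reason="Predicted category matches the ground truth label",  # optional
)
```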
Combine grades
All trainer classes take a single Grader as the reward function for training, because the underlying RL algorithms require a single scalar to optimize towards. When optimizing on multiple graders in a single training run, you must therefore combine their rewards into a single float reward. You can trivially achieve this by using the CombinedGrader, which also lets you (optionally) weigh the contributions of different graders as desired.
Use the CombinedGrader
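A sketch of combining the two graders from the earlier examples; the constructor arguments (graders, weights, grader_key) are assumptions about the CombinedGrader API.

```python
from adaptive_harmony.graders import CombinedGrader  # hypothetical import path

# Combine two graders into a single scalar reward; weights are optional.
combined = CombinedGrader(
    graders=[category_grader, LLMJudgeGrader()],
    weights=[0.7, 0.3],  # weigh each grader's contribution (hypothetical kwarg)
    grader_key="combined_reward",
)
```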
Instrument Graders with logs
As you’ve seen by now, the .grade method is asynchronous and grades a single sample. However, it’s still a common need to get aggregated scores across all the samples, most often for logging purposes. For example, this would be useful to get the average reward obtained on all samples for a single RL step.
The base Grader class implements logging logic that lets you log a score for every graded sample, and get aggregated statistics (mean, std, min, max, count) over the scores of all samples processed so far by the grader. You can get the aggregates using the .get_logs() method on your grader; you can optionally pass clear=True in this call to wipe all accumulated logs and reset the aggregation. The default get_logs() requires that you log your grade value under a score key.
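For example, at the end of an RL step you might read and reset the aggregates like this (the exact shape of the returned dict is an assumption):

```python
# Read the aggregated statistics accumulated so far, then reset them.
logs = grader.get_logs(clear=True)
print(logs)  # e.g. mean/std/min/max/count of every "score" logged so far
```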
Use add_log in your .grade() method to add a new log dict with the desired fields (include score if you want to use the default .get_logs() implementation), and override the get_logs method if custom logging behavior is required.
Here is an example of a get_logs method that logs the same statistics as the default logger, but also logs a table of completions that did not follow a given desired format.
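This sketch assumes the base class keeps the dicts passed to add_log in an internal list (called self._logs here, a hypothetical name):

```python
import statistics

from adaptive_harmony.graders import Grader  # hypothetical import path


class FormatAwareGrader(Grader):
    # ... __init__, setup, grade and teardown omitted for brevity ...

    def get_logs(self, clear: bool = False) -> dict:
        logs = self._logs  # hypothetical internal buffer of add_log dicts
        scores = [log["score"] for log in logs]
        aggregates = {
            "score/mean": statistics.fmean(scores) if scores else 0.0,
            "score/std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "score/min": min(scores, default=0.0),
            "score/max": max(scores, default=0.0),
            "score/count": len(scores),
        }
        # Additionally collect a table of completions that broke the format.
        aggregates["bad_format_completions"] = [
            log["completion"] for log in logs if not log["followed_format"]
        ]
        if clear:
            logs.clear()
        return aggregates
```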
To use this custom get_logs method, you would log the following keys when calling add_log in your grader’s grade method (a sketch follows; followed_format is a hypothetical flag computed during grading):
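```python
# Inside .grade(), log the fields the custom get_logs above expects.
self.add_log({
    "score": grade.value,                # feeds the aggregate statistics
    "completion": sample.completion,     # shown in the bad-format table
    "followed_format": followed_format,  # hypothetical bool computed in grade()
})
```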