How to write custom recipe graders
Recipe graders in `adaptive_harmony` are designed to be reusable for both evaluations and training - providing a reward for training, or a numerical score with an optional reason for evaluation. To write a custom grader, subclass the `Grader` class and implement 4 compulsory methods: `__init__`, `grade`, `setup` and `teardown`.
The following is a simple example of a `Grader` that uses an LLM as a judge, and illustrates what is expected of each of the 4 methods:
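Below is a minimal sketch of what such a grader could look like. The four methods and the `Grade` fields (`value`, `grader_key`, `reason`) follow this guide, but the import path, the method signatures, and the judge client (`spawn_judge` and its `ask`/`close` calls) are hypothetical assumptions, not the library's confirmed API:

```python
from adaptive_harmony import Grade, Grader  # assumed import path


class LLMJudgeGrader(Grader):
    def __init__(self, judge_model: str):
        # __init__: store configuration only; defer expensive work to setup().
        super().__init__()
        self.judge_model = judge_model
        self.judge = None

    async def setup(self):
        # setup: acquire the resources needed for grading, e.g. spin up or
        # connect to the judge model. `spawn_judge` is a hypothetical helper.
        self.judge = await spawn_judge(self.judge_model)

    async def grade(self, sample) -> Grade:
        # grade: score a single sample asynchronously. The `sample.completion`
        # field and the judge client API used here are assumptions.
        verdict = await self.judge.ask(
            f"Rate this completion from 0 to 1 and justify your rating:\n"
            f"{sample.completion}"
        )
        return Grade(
            value=float(verdict.score),    # numerical score
            grader_key="llm_judge",        # identifier for logging in the app
            reason=verdict.justification,  # optional reasoning behind the score
        )

    async def teardown(self):
        # teardown: release whatever setup() acquired.
        await self.judge.close()
```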
Often, though, grading logic is a simple deterministic function - for example, one that checks whether the model wrapped its predicted category in the expected XML tags (e.g. `<category_A>predicted_category</category_A>`) and compares it to the ground-truth label.
A function such as the one described above would require no state, setup or teardown. To make it easier to implement such a `Grader` without the boilerplate overhead, you can use the `Grader.from_function()` method. The only requirement is that the function is asynchronous.
Below is an example where the grade is `-1.0` if the model did not respect the output format, `0.0` if the format was respected but the predicted category was wrong, and `1.0` if both the format and the predicted label were correct. The returned `Grade` object contains:

- `value`: The numerical score (float)
- `grader_key`: Identifier for proper logging in the app for evaluations
- `reason`: Optional field to provide a reasoning to back the score, which will be displayed as interaction metadata in the app if you save this grade in an `EvaluationArtifact`
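A sketch of such a function, wrapped with `Grader.from_function()`. The import path, the `sample` field names, and the exact tag-matching logic are assumptions made for illustration:

```python
import re

from adaptive_harmony import Grade, Grader  # assumed import path

# Matches e.g. <category_A>...</category_A>; the exact expected format is assumed.
CATEGORY_RE = re.compile(r"<category_(\w+)>(.+?)</category_\1>", re.DOTALL)


async def grade_category(sample) -> Grade:
    # `sample.completion` (model output) and `sample.label` (gold category)
    # are assumed field names.
    match = CATEGORY_RE.search(sample.completion)
    if match is None:
        # Model did not respect the expected XML output format.
        return Grade(value=-1.0, grader_key="category", reason="Output format not respected")
    predicted = match.group(2).strip()
    if predicted != sample.label:
        # Format respected, but the predicted category is wrong.
        return Grade(value=0.0, grader_key="category", reason=f"Wrong category: {predicted!r}")
    # Both the format and the predicted label are correct.
    return Grade(value=1.0, grader_key="category", reason="Correct category")


# Wrap the async function into a Grader without subclassing boilerplate.
category_grader = Grader.from_function(grade_category)
```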
If you need to run several graders together, you can combine them into a single `CombinedGrader`. You can also (optionally) weigh the contributions of different graders as desired.
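As a sketch only - the constructor and weighting argument names below are assumptions, not the documented API:

```python
# Combine the category grader with the LLM judge from earlier, weighing
# the judge's score more heavily. Argument names are assumptions.
combined = CombinedGrader(
    graders=[category_grader, LLMJudgeGrader(judge_model="judge-v1")],
    weights=[0.3, 0.7],
)
```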
The `.grade` method is asynchronous and grades a single sample. However, it's still a common need to get aggregated scores across all the samples, most often for logging purposes. For example, this would be useful to get the average reward obtained on all samples for a single RL step.
Logging logic is implemented in the base `Grader` class: it lets you log a score for every graded sample, and get aggregated statistics (mean, std, min, max, count) over the scores of all samples processed so far by the grader. You can get the aggregates using the `.get_logs()` method on your grader; you can optionally pass `clear=True` in this call to wipe all logs accumulated thus far and reset the aggregation. The default `get_logs()` requires that you log your grade value under a `score` key.
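For example, in an RL training loop you might flush the aggregates once per step. A minimal sketch, assuming `grader` has already graded a batch of samples:

```python
# Retrieve aggregated statistics (mean, std, min, max, count) of all
# "score" values logged so far, then reset the accumulation.
step_logs = grader.get_logs(clear=True)
print(step_logs)  # the exact dictionary layout is an assumption
```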
Use `add_log` in your `.grade()` method to add a new log dict with desired fields (include `score` if you want to use the default `.get_logs()` implementation), and override the `get_logs` method if custom logging behavior is required.
Here is an example of a `get_logs` method that logs the same statistics as the default logger, but also logs a table of completions that did not follow a given desired format.
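The sketch below illustrates the idea; how the base class exposes accumulated logs (here assumed to be a `self.logs` list of dicts populated by `add_log`) and the aggregate key names are assumptions:

```python
import statistics

from adaptive_harmony import Grade, Grader  # assumed import path


class FormatAwareGrader(Grader):
    # __init__ / setup / teardown omitted for brevity.

    async def grade(self, sample) -> Grade:
        grade = await self._compute_grade(sample)  # hypothetical scoring helper
        # Log the fields that get_logs() below will aggregate. `score` matches
        # the default implementation; `format_ok` and `completion` are custom.
        self.add_log({
            "score": grade.value,
            "format_ok": grade.value >= 0.0,  # -1.0 signals a format violation
            "completion": sample.completion,
        })
        return grade

    def get_logs(self, clear: bool = False):
        # Assumed: log dicts accumulated via add_log are reachable as `self.logs`.
        logs = list(self.logs)
        if not logs:
            return {}
        scores = [log["score"] for log in logs]
        stats = {
            # Same aggregated statistics as the default logger.
            "score/mean": statistics.fmean(scores),
            "score/std": statistics.pstdev(scores),
            "score/min": min(scores),
            "score/max": max(scores),
            "score/count": len(scores),
            # A table of completions that did not follow the desired format.
            "format_violations": [
                log["completion"] for log in logs if not log["format_ok"]
            ],
        }
        if clear:
            self.logs.clear()
        return stats
```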
To use this `get_logs` method, you would log the corresponding keys (in the sketch above, `score`, `format_ok` and `completion`) when calling `add_log` in your grader's `grade` method.