Grader description

This example shows how to build a custom grader for training on classification accuracy against a ground truth label. It is common to require JSON-formatted output from an LLM, where one or more keys in the output hold a predicted label from a fixed set of possible labels. For example, suppose we ask an LLM to analyze a call transcript and extract information.

Example prompt:
Analyze the support call, reason about the sentiment of the customer, and whether they wanted to escalate to a human.

Provide your output in the following JSON format, nothing else:
{
  "reason": str,
  "sentiment: Literal["angry", "sad", "neutral", "happy", "thrilled"],
  "escalation": bool
}
Example output:
{
  "reason": "The customer was really angry and asked to speak to a human.",
  "sentiment": "angry",
  "escalation": true
}
The grader we will build compares the predicted values for sentiment and escalation against the ground truth stored in each sample's metadata. The same grader works for any number of labels in the output and any expected JSON format.

Pseudocode

This is the pseudocode for our grader:
  • Validate the text completion (expected to be JSON) against a Pydantic model
  • If the completion does not comply with the expected format, return a -1.0 reward
  • Otherwise, compare the predicted category against the ground truth label
    • Return a binary score (1.0 for correct, 0.0 for incorrect)

Implementation

How label matching works

The grader uses the same key (label_key) to extract labels from two places:
  1. From the Pydantic output: Uses getattr(prediction, label_key) to get the predicted label
  2. From sample metadata: Uses sample.metadata.get(label_key) to get the ground truth
Example with label_key="sentiment":
  • Predicted: prediction.sentiment → "angry"
  • Ground truth: sample.metadata["sentiment"] → "angry"
  • Score: 1.0 (match!)
This design keeps the grader simple and ensures the field names are consistent across your Pydantic schema and dataset metadata.
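
For illustration, here is a minimal standalone sketch of that symmetric lookup (SimpleNamespace stands in for the parsed Pydantic object, and the values are hypothetical; the real grader code follows below):

from types import SimpleNamespace

# Hypothetical parsed output (stand-in for the Pydantic object) and sample metadata
prediction = SimpleNamespace(sentiment="angry", escalation=True)
metadata = {"sentiment": "angry", "escalation": True}

label_key = "sentiment"
predicted_label = getattr(prediction, label_key)    # "angry", read from the parsed output
groundtruth_label = metadata.get(label_key)         # "angry", read from the sample metadata
print(1.0 if predicted_label == groundtruth_label else 0.0)  # 1.0 (match)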

Why -1.0 for malformed outputs in training?

During training with RL, we use -1.0 for malformed outputs instead of 0.0 because:
  1. Stronger penalty signal: -1.0 tells the RL algorithm that malformed output is worse than just getting the wrong category. This encourages the model to always produce valid JSON.
  2. Distinguishes failure modes:
    • -1.0 = “You broke the format”
    • 0.0 = “You followed the format but chose the wrong category”
    • 1.0 = “Perfect!”
  3. Faster convergence: The stronger signal helps the model learn to produce structured output early in training, before fine-tuning category predictions.
During evaluation, we use 0.0 because we only care about binary correctness (right or wrong), not degrees of failure.
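
A compact sketch of that score mapping (the function name and signature here are illustrative only, not part of the grader API):

def score_for(parsed_ok: bool, correct: bool, eval_mode: bool) -> float:
    # Malformed output: -1.0 during training, 0.0 during evaluation
    if not parsed_ok:
        return 0.0 if eval_mode else -1.0
    # Well-formed output: binary correctness
    return 1.0 if correct else 0.0

The full grader below applies the same mapping inside grade():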
from typing import Literal
from pydantic import BaseModel, Field

from adaptive_harmony import StringThread, Grade
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException
from adaptive_harmony.graders import BaseGrader


class CallAnalysisFormat(BaseModel):
    """Schema for call analysis output."""

    reason: str = Field(description="Reasoning behind customer sentiment and escalation prediction")
    sentiment: Literal["angry", "sad", "neutral", "happy", "thrilled"] = Field(description="Customer sentiment")
    escalation: bool = Field(description="Whether the customer escalated the conversation to a human")


class LabelClassificationGrader(BaseGrader):
    """
    Grader for label classification accuracy.

    Compares a predicted label/category against the ground truth label.
    The predicted label is extracted from the Pydantic output using the same key
    as the ground truth label in metadata.
    """

    def __init__(self, grader_key: str, label_key: str, eval_mode: bool, expected_format: type[BaseModel]):
        """
        Args:
            grader_key: Unique identifier for this grader
            label_key: Key to find both the ground truth in metadata AND the predicted label in the Pydantic output
            eval_mode: If True, score malformed outputs as 0.0. If False (training), score as -1.0
            expected_format: Pydantic model class used to validate and parse the completion
        """
        super().__init__(grader_key)
        self.label_key = label_key
        self.eval_mode = eval_mode
        self.expected_format = expected_format

    async def grade(self, sample: StringThread) -> Grade:
        # Try to parse the model's structured output
        try:
            prediction = pydantic_parse(sample.last_content(), self.expected_format)
        except OutputParserException:
            # Model output was not valid JSON or didn't match the schema
            # During training: return -1.0 to penalize malformed outputs
            # During evaluation: return 0.0 (treat as incorrect, but don't skew the average score)
            score = 0.0 if self.eval_mode else -1.0
            self.add_log({"score": score})
            return Grade(
                value=score,
                grader_key=self.grader_key,
                reasoning="JSON parse error",
            )
        except Exception as e:
            # Unexpected error - fail loudly
            raise Exception(f"Error parsing jailbreak output: {e}")

        # Extract predicted label from Pydantic output using the same key as metadata
        predicted_label = getattr(prediction, self.label_key)
        # Get ground truth from sample metadata
        groundtruth_label = sample.metadata.get(self.label_key)

        # Compare prediction to ground truth
        score = 1.0 if predicted_label == groundtruth_label else 0.0

        # Log for experiment tracking
        self.add_log({"score": score})

        # Return grade with reasoning for debugging
        reason = f"Predicted: {predicted_label}\nGround truth: {groundtruth_label}"
        return Grade(value=score, grader_key=self.grader_key, reasoning=reason)

Test cases

We create 3 simple test cases:
  • correct completion: right format and labels
  • incorrect completion: right format, wrong labels
  • malformed completion: label values are not supported (likely a hallucination); could also be invalid JSON, like a missing curly brace
from adaptive_harmony.core.structured_output import render_schema

instructions = f"""Analyze the support call, reason about the sentiment of the customer, and whether they wanted to escalate to a human.
Your output must comply with the following format, valid JSON output only:
{render_schema(CallAnalysisFormat)}"""

call_content = "Enough of this!! I am mad, I want to speak to a human"

groundtruth_labels = {"sentiment": "angry", "escalation": True}

# label is correct
correct_completion = CallAnalysisFormat(
    reason="It is clear that the content indicates anger and an escalation", sentiment="angry", escalation=True
).model_dump_json(indent=2)

# labels are incorrect
incorrect_completion = CallAnalysisFormat(
    reason="The customer sounded very happe", sentiment="happy", escalation=False
).model_dump_json(indent=2)

# malformed completion; predicted label does not exist in list of possible labels, escalation is not a bool
malformed_completion = """{
  "reason": "this is the reason",
  "sentiment": "very-very-happy",
  "escalation": "YES"
}"""

correct_thread = StringThread(
    turns=[("system", instructions), ("user", call_content), ("assistant", correct_completion)],
    metadata=groundtruth_labels,
)

incorrect_thread = StringThread(
    turns=[("system", instructions), ("user", call_content), ("assistant", incorrect_completion)],
    metadata=groundtruth_labels,
)

malformed_thread = StringThread(
    turns=[("system", instructions), ("user", call_content), ("assistant", malformed_completion)],
    metadata=groundtruth_labels,
)

print(correct_thread)
print(incorrect_thread)
print(malformed_thread)

Validation

Run tests on the mock samples
grader = LabelClassificationGrader(
    grader_key="sentiment-grader",
    label_key="sentiment",
    expected_format=CallAnalysisFormat,
    eval_mode=False
)

correct_grade = await grader.grade(correct_thread)
incorrect_grade = await grader.grade(incorrect_thread)
malformed_grade = await grader.grade(malformed_thread)


assert correct_grade.value == 1.0
print(correct_grade)
print("✅ Correct thread got reward of 1\n")
assert incorrect_grade.value == 0.0
print(incorrect_grade)
print("✅ Correct thread got reward of 0\n")
assert malformed_grade.value == -1.0
print(malformed_grade)
print("✅ Correct thread got reward of -1")

from pprint import pprint 

pprint(grader.get_logs())
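
To grade many threads at once, you can await the grader concurrently on each sample; a small sketch using standard asyncio, with the three mock threads above standing in for a real dataset (this assumes the grader is safe to call concurrently):

import asyncio

# The mock threads built above stand in for a real dataset of samples
threads = [correct_thread, incorrect_thread, malformed_thread]

# Grade all threads concurrently; each call returns a Grade object
grades = await asyncio.gather(*(grader.grade(t) for t in threads))
print([g.value for g in grades])  # [1.0, 0.0, -1.0]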

Key takeaways

  1. Always validate structured output - Use Pydantic schemas and handle parse errors gracefully with pydantic_parse
  2. Different penalties for training vs. eval - Penalize format errors more heavily during training
  3. Log scores for tracking - Use self.add_log() to send metrics to training loggers
  4. Include reasoning in grades - Helps debug model behavior during development