# Label classification accuracy

> Build a custom grader that scores label classification accuracy against a ground truth label

## Grader description

This example shows how to build a custom grader that rewards classification accuracy against a ground truth label.
It is common to require JSON-formatted output from an LLM, where one or more keys hold a label predicted from a fixed set of possible values. For example, we might ask an LLM to analyze a call transcript and extract information:

**Example prompt:**

```
Analyze the support call, reason about the sentiment of the customer, and whether they wanted to escalate to a human.

Provide your output in the following JSON format, nothing else:
{
  "reason": str,
  "sentiment": Literal["angry", "sad", "neutral", "happy", "thrilled"],
  "escalation": bool
}
```

**Example output:**

```json theme={null}
{
  "reason": "The customer was really angry and asked to speak to a human.",
  "sentiment": "angry",
  "escalation": true
}
```

The grader we will build compares the value predicted for a given key, such as `sentiment` or `escalation`, against the ground truth stored in each sample's metadata. Each grader instance scores one key, so you can create one instance per label; the same approach works for any number of labels in the output and any expected JSON format.

## Pseudocode

This is the pseudocode for our grader:

* Validate a text completion in JSON format against a Pydantic model
* If the completion does not comply with the expected format, return a `-1.0` reward during training (`0.0` during evaluation)
* Compare the predicted label against the ground truth label
  * Return a binary score (`1.0` for correct, `0.0` for incorrect)
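
The steps above can be sketched without any dependencies. This is only an illustration: `score_completion` and `ALLOWED_SENTIMENTS` are hypothetical names, and the manual checks stand in for the Pydantic validation used in the actual grader below.

```python
import json

# Assumption: the allowed labels mirror the `sentiment` field of the example prompt
ALLOWED_SENTIMENTS = ("angry", "sad", "neutral", "happy", "thrilled")


def score_completion(completion: str, groundtruth: str, eval_mode: bool = False) -> float:
    """Hypothetical helper: score the `sentiment` label of one completion."""
    # Malformed outputs score -1.0 in training and 0.0 in evaluation
    malformed_score = 0.0 if eval_mode else -1.0
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return malformed_score  # not valid JSON
    # Stand-in for schema validation: a JSON object with a supported label
    if not isinstance(parsed, dict) or parsed.get("sentiment") not in ALLOWED_SENTIMENTS:
        return malformed_score
    # Binary score against the ground truth label
    return 1.0 if parsed["sentiment"] == groundtruth else 0.0


print(score_completion('{"sentiment": "angry"}', "angry"))            # 1.0
print(score_completion('{"sentiment": "happy"}', "angry"))            # 0.0
print(score_completion('{"sentiment": "very-very-happy"}', "angry"))  # -1.0
```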

## Implementation

### How label matching works

The grader uses the same key (`label_key`) to extract labels from two places:

1. From the Pydantic output: Uses `getattr(prediction, label_key)` to get the predicted label
2. From sample metadata: Uses `sample.metadata.get(label_key)` to get the ground truth

Example with `label_key="sentiment"`:

* Predicted: `prediction.sentiment` → `"angry"`
* Ground truth: `sample.metadata["sentiment"]` → `"angry"`
* Score: `1.0` (match!)

This design keeps the grader simple and ensures the field names are consistent across
your Pydantic schema and dataset metadata.
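
As a rough illustration of this lookup, the sketch below uses a plain dataclass as a stand-in for the parsed Pydantic output and a dict as the sample metadata. `Prediction` and `match_label` are hypothetical names, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Stand-in for a parsed Pydantic output (assumption for this sketch)."""

    reason: str
    sentiment: str
    escalation: bool


def match_label(prediction: Prediction, metadata: dict, label_key: str) -> float:
    # The same key is used on both sides of the comparison
    predicted = getattr(prediction, label_key)
    groundtruth = metadata.get(label_key)  # None if the key is missing
    return 1.0 if predicted == groundtruth else 0.0


prediction = Prediction(reason="Customer asked for a human.", sentiment="angry", escalation=False)
metadata = {"sentiment": "angry", "escalation": True}

print(match_label(prediction, metadata, "sentiment"))   # 1.0
print(match_label(prediction, metadata, "escalation"))  # 0.0
```

Note that `metadata.get` returns `None` when the key is absent, so a sample with missing ground truth silently scores `0.0`; it is worth checking that every sample carries the expected keys.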

### Why -1.0 for malformed outputs in training?

During training with RL, we use `-1.0` for malformed outputs instead of `0.0` because:

1. Stronger penalty signal: `-1.0` tells the RL algorithm that malformed output is worse than just getting the wrong category. This encourages the model to always produce valid JSON.

2. Distinguishes failure modes:
   * `-1.0` = "You broke the format"
   * `0.0` = "You followed the format but chose the wrong category"
   * `1.0` = "Perfect!"

3. Faster convergence: The stronger signal helps the model learn to produce structured output early in training, before fine-tuning category predictions.

During evaluation, we use `0.0` because we only care about binary correctness (right or wrong), not degrees of failure.
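
To make the difference concrete, here is a small sketch of how the two scoring modes change the mean score of the same batch. `batch_mean` and the outcome names are hypothetical and exist only for this illustration.

```python
def batch_mean(outcomes: list[str], eval_mode: bool) -> float:
    """Hypothetical helper: average score over a batch of graded outcomes."""
    # Malformed outputs score 0.0 in evaluation and -1.0 in training
    malformed = 0.0 if eval_mode else -1.0
    values = {"correct": 1.0, "incorrect": 0.0, "malformed": malformed}
    return sum(values[outcome] for outcome in outcomes) / len(outcomes)


outcomes = ["correct", "correct", "incorrect", "malformed"]

print(batch_mean(outcomes, eval_mode=True))   # 0.5
print(batch_mean(outcomes, eval_mode=False))  # 0.25
```

With one malformed output in four, the evaluation mean stays at `0.5`, which is the plain accuracy, while the training mean drops to `0.25`.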

```python theme={null}
from typing import Literal
from pydantic import BaseModel, Field

from adaptive_harmony import StringThread, Grade
from adaptive_harmony.core.structured_output import pydantic_parse, OutputParserException
from adaptive_harmony.graders import BaseGrader


class CallAnalysisFormat(BaseModel):
    """Schema for call analysis output."""

    reason: str = Field(description="Reasoning behind customer sentiment and escalation prediction")
    sentiment: Literal["angry", "sad", "neutral", "happy", "thrilled"] = Field(description="Customer sentiment")
    escalation: bool = Field(description="Whether the customer escalated the conversation to a human")


class LabelClassificationGrader(BaseGrader):
    """
    Grader for label classification accuracy.

    Compares a predicted label against the ground truth label.
    The predicted label is extracted from the Pydantic output using the same key
    as the ground truth label in metadata.
    """

    def __init__(self, grader_key: str, label_key: str, eval_mode: bool, expected_format: type[BaseModel]):
        """
        Args:
            grader_key: Unique identifier for this grader
            label_key: Key to find both the ground truth in metadata AND the predicted label in the Pydantic output
            eval_mode: If True, score malformed outputs as 0.0. If False (training), score as -1.0
            expected_format: Pydantic model class that the completion is validated against
        """
        super().__init__(grader_key)
        self.label_key = label_key
        self.eval_mode = eval_mode
        self.expected_format = expected_format

    async def grade(self, sample: StringThread) -> Grade:
        # Try to parse the model's structured output
        try:
            prediction = pydantic_parse(sample.last_content(), self.expected_format)
        except OutputParserException:
            # Model output was not valid JSON or didn't match the schema
            # During training: return -1.0 to penalize malformed outputs
            # During evaluation: return 0.0 (treat as incorrect, but don't skew the average score)
            score = 0.0 if self.eval_mode else -1.0
            self.add_log({"score": score})
            return Grade(
                value=score,
                grader_key=self.grader_key,
                reasoning="JSON parse error",
            )
        except Exception as e:
            # Unexpected error - fail loudly, keeping the original traceback
            raise RuntimeError(f"Error parsing structured output: {e}") from e

        # Extract predicted label from Pydantic output using the same key as metadata
        predicted_label = getattr(prediction, self.label_key)
        # Get ground truth from sample metadata
        groundtruth_label = sample.metadata.get(self.label_key)

        # Compare prediction to ground truth
        score = 1.0 if predicted_label == groundtruth_label else 0.0

        # Log for experiment tracking
        self.add_log({"score": score})

        # Return grade with reasoning for debugging
        reason = f"Predicted: {predicted_label}\nGround truth: {groundtruth_label}"
        return Grade(value=score, grader_key=self.grader_key, reasoning=reason)
```

### Test cases

We create 3 simple test cases:

* correct completion: right format and labels
* incorrect completion: right format, wrong labels
* malformed completion: label values are not supported, likely a hallucination; could also be incorrect JSON, like a missing curly brace

```python theme={null}
from adaptive_harmony.core.structured_output import render_schema

instructions = f"""Analyze the support call, reason about the sentiment of the customer, and whether they wanted to escalate to a human.
Your output must comply with the following format, valid JSON output only:
{render_schema(CallAnalysisFormat)}"""

call_content = "Enough of this!! I am mad, I want to speak to a human"

groundtruth_labels = {"sentiment": "angry", "escalation": True}

# labels are correct
correct_completion = CallAnalysisFormat(
    reason="It is clear that the content indicates anger and an escalation", sentiment="angry", escalation=True
).model_dump_json(indent=2)

# labels are incorrect
incorrect_completion = CallAnalysisFormat(
    reason="The customer sounded very happy", sentiment="happy", escalation=False
).model_dump_json(indent=2)

# malformed completion; predicted label does not exist in list of possible labels, escalation is not a bool
malformed_completion = """{
  "reason": "this is the reason",
  "sentiment": "very-very-happy",
  "escalation": "YES"
}"""

correct_thread = StringThread(
    turns=[("system", instructions), ("user", call_content), ("assistant", correct_completion)],
    metadata=groundtruth_labels,
)

incorrect_thread = StringThread(
    turns=[("system", instructions), ("user", call_content), ("assistant", incorrect_completion)],
    metadata=groundtruth_labels,
)

malformed_thread = StringThread(
    turns=[("system", instructions), ("user", call_content), ("assistant", malformed_completion)],
    metadata=groundtruth_labels,
)

print("🟢 Correct thread:")
print(correct_thread)
print("🔴 Incorrect thread:")
print(incorrect_thread)
print("⚠️ Malformed thread:")
print(malformed_thread)
```

### Validation

Run the grader on the mock samples. The snippet uses top-level `await`, so run it in a notebook or another async context:

```python theme={null}
grader = LabelClassificationGrader(
    grader_key="sentiment-grader",
    label_key="sentiment",
    expected_format=CallAnalysisFormat,
    eval_mode=False
)

correct_grade = await grader.grade(correct_thread)
incorrect_grade = await grader.grade(incorrect_thread)
malformed_grade = await grader.grade(malformed_thread)


assert correct_grade.value == 1.0
print(correct_grade)
print("✅ Correct prediction → reward = 1\n")
assert incorrect_grade.value == 0.0
print(incorrect_grade)
print("❌ Wrong prediction → reward = 0\n")
assert malformed_grade.value == -1.0
print(malformed_grade)
print("⚠️ Malformed output → reward = -1")
```

```python theme={null}
from pprint import pprint

print("📊 Grader logs:")
pprint(grader.get_logs())
```

## Key takeaways

1. **Always validate structured output** - Use Pydantic schemas and handle parse errors gracefully with `pydantic_parse`
2. **Different penalties for training vs. eval** - Penalize format errors more heavily during training
3. **Log scores for tracking** - Use `self.add_log()` to send metrics to training loggers
4. **Include reasoning in grades** - Helps debug model behavior during development
