# Binary LLM judge grader

> Use BinaryJudgeGrader to have an LLM evaluate completions as PASS or FAIL against custom criteria.

## What BinaryJudgeGrader does

BinaryJudgeGrader evaluates a model completion using an LLM judge and returns either **PASS (1.0)** or **FAIL (0.0)**.

Unlike rule-based graders that check for patterns or structures, BinaryJudgeGrader:

1. Takes a `criteria` string describing what makes a good completion
2. Sends the completion to a judge LLM with the criteria
3. Receives a `BinaryJudgeOutput` from the judge, containing `reasoning` and a `score` (PASS/FAIL)
4. Returns 1.0 for PASS, 0.0 for FAIL

This enables semantic evaluation—checking for qualities like "clarity", "completeness", or "consistency"—without writing custom judge prompts.
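
The four steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `JudgeOutput`, `grade_binary`, and `toy_judge` are hypothetical names, and the toy judge is a keyword check standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Literal

# Hypothetical stand-in for the structured output described above;
# the real BinaryJudgeOutput class may be defined differently.
@dataclass
class JudgeOutput:
    reasoning: str
    score: Literal["PASS", "FAIL"]

# A "judge" here is any callable mapping (criteria, completion) to a JudgeOutput.
Judge = Callable[[str, str], JudgeOutput]

def grade_binary(judge: Judge, criteria: str, completion: str) -> tuple[float, str]:
    """Run the judge and map its verdict to 1.0 (PASS) or 0.0 (FAIL)."""
    output = judge(criteria, completion)
    value = 1.0 if output.score == "PASS" else 0.0
    return value, output.reasoning

# Toy judge for illustration only: passes completions that contain an example.
def toy_judge(criteria: str, completion: str) -> JudgeOutput:
    if "for example" in completion.lower():
        return JudgeOutput("Includes a concrete example.", "PASS")
    return JudgeOutput("No concrete example found.", "FAIL")

value, reasoning = grade_binary(
    toy_judge,
    "The response must include at least one concrete example.",
    "Caching helps. For example, memoizing a pure function avoids recomputation.",
)
print(value, reasoning)
```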

## When to use BinaryJudgeGrader

* **Semantic quality checks**: You need LLM judgment on aspects rules can't capture ("Is this response helpful?")
* **Binary decisions are sufficient**: You only care about PASS vs. FAIL, not a numeric scale (use `RangeJudgeGrader` for scoring on a range)
* **Quick setup**: You want a working judge grader without writing custom judge prompts (use `TemplatedPromptJudgeGrader` for full control)
* **Training and evaluation**: You want one grader that works both as a training signal and for offline evaluation

## Configuration

Create a BinaryJudgeGrader by providing:

* **`grader_key`**: Unique identifier for this grader (used in logging/aggregation)
* **`model`**: An inference model spawned for judging (must be an LLM capable of structured output)
* **`criteria`**: A string describing what makes a completion PASS vs. FAIL

### Basic usage

```python
from adaptive_harmony.graders import BinaryJudgeGrader

grader = BinaryJudgeGrader(
    grader_key="quality-check",
    model=judge_model,
    criteria="The response must be factually accurate, well-structured, and directly address the user's request.",
)

# Use the grader
grade = await grader.grade(thread)
# grade.value is 1.0 (PASS) or 0.0 (FAIL)
# grade.reasoning contains the judge's explanation
```

## How the judge prompt works internally

BinaryJudgeGrader constructs a judge prompt with these components:

1. **System prompt**: Instructs the judge LLM to evaluate the completion against the criteria
2. **Completion**: The model's output to be evaluated
3. **Output format**: Judge must return JSON with two fields:
   * `reasoning`: Explanation of why the completion PASSES or FAILS
   * `score`: One of `"PASS"`, `"FAIL"`, or `"NA"` (not applicable)

If the judge fails to produce valid structured output, BinaryJudgeGrader retries automatically.

Example judge evaluation:

```json
{
  "reasoning": "The response directly answers the question with multiple specific examples and clear explanations.",
  "score": "PASS"
}
```
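
To make the validate-and-retry behavior concrete, here is a rough sketch using only the standard library. The helper names (`parse_judge_output`, `judge_with_retries`) and the attempt limit are assumptions for illustration; the page does not specify how many times the grader retries or how it validates output internally.

```python
import json
from typing import Callable

VALID_SCORES = ("PASS", "FAIL", "NA")

def parse_judge_output(raw: str) -> dict:
    """Parse the judge's raw text and check the two expected fields."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("missing or non-string 'reasoning'")
    if data.get("score") not in VALID_SCORES:
        raise ValueError(f"unexpected score: {data.get('score')!r}")
    return data

def judge_with_retries(call_judge: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call the judge until its output validates, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_judge_output(call_judge())
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"no valid judge output after {max_attempts} attempts") from last_error

# Simulated judge: the first reply is malformed, the second is valid.
replies = iter([
    "not json",
    '{"reasoning": "Answers the question with examples.", "score": "PASS"}',
])
result = judge_with_retries(lambda: next(replies))
print(result["score"])
```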

## Choosing good criteria

Criteria are critical—they directly shape judge behavior. Use these principles:

1. **Be specific and measurable** — Avoid vague language like "be good". Instead:
   * Bad: "The response should be helpful"
   * Good: "The response must address the user's request with at least one concrete example"

2. **Include both positive and negative indicators** — Help the judge distinguish PASS from FAIL:
   * "PASS: Contains verifiable facts and cites sources. FAIL: Contains unsupported claims or hallucinations."

3. **Keep criteria focused** — One dimension per grader. Use `CombinedGrader` to aggregate multiple graders for multi-dimensional evaluation:
   * One grader checks "Is this factually accurate?"
   * Another checks "Does this follow the requested format?"
   * Combine them to get an overall quality score
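
As a rough illustration of the last point, the sketch below aggregates two binary grades with a weighted mean. This page does not describe how `CombinedGrader` aggregates, so the weighted mean and the `combine_grades` helper are assumptions chosen only to show the idea.

```python
def combine_grades(grades: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension grades (each 0.0 or 1.0)."""
    total_weight = sum(weights[key] for key in grades)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(grades[key] * weights[key] for key in grades) / total_weight

# One grade per focused grader, keyed by a hypothetical grader_key.
grades = {"factuality": 1.0, "format": 0.0}
weights = {"factuality": 3.0, "format": 1.0}

overall = combine_grades(grades, weights)
print(overall)  # 0.75
```

Keeping each grader to one dimension means a low combined score can be traced back to the specific criterion that failed.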

### Example criteria of increasing specificity

```python
# Example 1: Generic (not recommended)
criteria_generic = "The response is good."

# Example 2: More specific
criteria_specific = (
    "The response must directly answer the user's question without introducing unrelated topics."
)

# Example 3: Highly specific with PASS/FAIL conditions
criteria_detailed = (
    "PASS: The response directly answers the user's question with clear reasoning, "
    "includes at least one relevant example or supporting detail, and contains no factual errors. "
    "FAIL: The response is vague, goes off-topic, lacks supporting details, or contains factual errors."
)

# Example 4: Domain-specific (customer support)
criteria_support = (
    "PASS: The response acknowledges the customer's problem, provides a clear solution, "
    "and offers follow-up support. FAIL: The response dismisses the problem, provides no solution, "
    "or is unclear."
)

# Example 5: Domain-specific (content summarization)
criteria_summary = (
    "PASS: The summary captures the main points, is concise (2-3 sentences), "
    "and uses language from the original text. FAIL: The summary is too long, "
    "omits key points, or introduces information not in the original."
)
```
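
If you maintain many criteria in this PASS/FAIL shape, a small helper can keep them consistent. The `build_criteria` function below is a hypothetical convenience, not part of the library; it only assembles the string you would pass as `criteria`.

```python
def build_criteria(pass_conditions: list[str], fail_conditions: list[str]) -> str:
    """Join PASS and FAIL conditions into a single criteria string."""
    if not pass_conditions or not fail_conditions:
        raise ValueError("provide at least one PASS and one FAIL condition")
    pass_part = "PASS: The response " + " and ".join(pass_conditions) + "."
    fail_part = "FAIL: The response " + " or ".join(fail_conditions) + "."
    return f"{pass_part} {fail_part}"

criteria = build_criteria(
    pass_conditions=["acknowledges the customer's problem", "provides a clear solution"],
    fail_conditions=["dismisses the problem", "provides no solution"],
)
print(criteria)
```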

## Using BinaryJudgeGrader in a recipe

In a training recipe, spawn a judge model and create the grader:

```python
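from adaptive_harmony.runtime import InputConfig, recipe_main
from adaptive_harmony.runtime.context import RecipeContext
from adaptive_harmony.graders import BinaryJudgeGrader
# Assumption: Dataset and Model are the recipe parameter types and live in
# adaptive_harmony.parameters; verify the import path against your SDK version.
from adaptive_harmony.parameters import Dataset, Model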

class MyConfig(InputConfig):
    train_dataset: Dataset
    model_to_train: Model
    judge_model: Model

@recipe_main
async def main(config: MyConfig, ctx: RecipeContext):
    # Spawn the judge model with spawn_inference
    judge_builder = await config.judge_model.to_builder(ctx, tp=1)
    judge = await judge_builder.spawn_inference("judge")
    
    # Create the grader
    grader = BinaryJudgeGrader(
        grader_key="factuality",
        model=judge,
        criteria="The response contains only factually verifiable statements and cites sources when needed.",
    )
    
    # Use the grader in training...
    # grade = await grader.grade(sample)
```
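
Because `grade` is awaited, several samples can be graded concurrently. The sketch below shows one way to do that and compute a pass rate. `StubGrader` and `StubGrade` are hypothetical stand-ins so the example runs without a judge model; the word-count rule replaces the LLM verdict.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class StubGrade:
    value: float
    reasoning: str

class StubGrader:
    """Hypothetical stand-in exposing an async grade() method."""

    async def grade(self, sample: str) -> StubGrade:
        await asyncio.sleep(0)  # yield control, as a real judge call would while waiting
        passed = len(sample.split()) >= 5
        return StubGrade(1.0 if passed else 0.0, "detailed enough" if passed else "too short")

async def pass_rate(grader, samples: list[str]) -> float:
    """Grade all samples concurrently and return the fraction that passed."""
    if not samples:
        raise ValueError("samples must not be empty")
    grades = await asyncio.gather(*(grader.grade(s) for s in samples))
    return sum(g.value for g in grades) / len(grades)

samples = ["ML is cool.", "Machine learning finds patterns in data.", "Yes."]
rate = asyncio.run(pass_rate(StubGrader(), samples))
print(f"pass rate: {rate:.2f}")
```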

## Using few-shot examples (optional)

For more consistent judging, you can provide few-shot examples via `BinaryJudgeShot`. Each shot pairs a thread with the expected score and reasoning, teaching the judge your standards for PASS vs. FAIL.

```python
from adaptive_harmony.graders.binary_judge import BinaryJudgeShot
from adaptive_harmony import StringThread

# Example 1: Good response — should PASS
good_example = StringThread([
    ("user", "What is machine learning?"),
    ("assistant", "Machine learning is a subset of AI where systems learn patterns from data "
     "without explicit programming. For example, recommendation systems use ML to suggest "
     "products based on user history.")
])

# Example 2: Poor response — should FAIL
poor_example = StringThread([
    ("user", "What is machine learning?"),
    ("assistant", "ML is cool.")
])

shots = [
    BinaryJudgeShot(thread=good_example, score="PASS", reasoning="Clear definition with a concrete example"),
    BinaryJudgeShot(thread=poor_example, score="FAIL", reasoning="Too vague, no examples or detail"),
]
```
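
One common way few-shot examples reach a judge is as prior turns in its conversation: each shot becomes an input followed by the expected structured answer. The sketch below illustrates that layout in plain Python. It is an assumption about the general technique, not a description of how `BinaryJudgeGrader` builds its prompt, and `render_thread` and `build_judge_messages` are hypothetical helpers.

```python
import json

Turn = tuple[str, str]  # (role, content)

def render_thread(turns: list[Turn]) -> str:
    """Flatten a conversation into text the judge can read."""
    return "\n".join(f"{role}: {content}" for role, content in turns)

def build_judge_messages(
    criteria: str,
    shots: list[tuple[list[Turn], str, str]],  # (thread, score, reasoning)
    target: list[Turn],
) -> list[Turn]:
    """System prompt, then one user/assistant pair per shot, then the target thread."""
    messages = [("system", f"Evaluate the completion against these criteria:\n{criteria}")]
    for turns, score, reasoning in shots:
        messages.append(("user", render_thread(turns)))
        messages.append(("assistant", json.dumps({"reasoning": reasoning, "score": score})))
    messages.append(("user", render_thread(target)))
    return messages

question = ("user", "What is machine learning?")
messages = build_judge_messages(
    criteria="The answer must define the term and give a concrete example.",
    shots=[
        ([question, ("assistant", "ML learns patterns from data, e.g. recommendations.")],
         "PASS", "Clear definition with a concrete example"),
        ([question, ("assistant", "ML is cool.")],
         "FAIL", "Too vague, no examples or detail"),
    ],
    target=[question, ("assistant", "It is a kind of AI.")],
)
print(len(messages))  # 6
```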

### Validation

You can check the `Grade` object structure and the few-shot setup without a live judge model. In a real recipe, calling `grader.grade(thread)` sends the thread to the judge LLM and returns a `Grade` like the ones simulated here.

```python
from adaptive_harmony import Grade

# Simulate the Grade objects that BinaryJudgeGrader.grade() returns
pass_grade = Grade(
    value=1.0,
    grader_key="quality-check",
    reasoning="The response directly addresses the question with clear examples and accurate information.",
)

fail_grade = Grade(
    value=0.0,
    grader_key="quality-check",
    reasoning="The response is vague and does not provide concrete examples as required.",
)

# Validate structure
assert pass_grade.value == 1.0
assert fail_grade.value == 0.0
assert pass_grade.grader_key == "quality-check"
assert len(pass_grade.reasoning) > 0

print(f"✅ PASS grade: value={pass_grade.value}, reasoning=\"{pass_grade.reasoning}\"")
print(f"❌ FAIL grade: value={fail_grade.value}, reasoning=\"{fail_grade.reasoning}\"")

# Validate few-shot examples are well-formed
for shot in shots:
    assert shot.score in ("PASS", "FAIL"), f"Invalid score: {shot.score}"
    assert len(shot.thread.last_content()) > 0, "Shot thread must have content"
print(f"\n✅ All {len(shots)} few-shot examples validated")
```

## Key takeaways

1. **BinaryJudgeGrader is the simplest LLM judge** — Just provide a judge model and criteria string
2. **Write specific, measurable criteria** — Vague criteria lead to inconsistent judging; be explicit about PASS/FAIL conditions
3. **Use for semantic quality checks** — When rules can't capture what makes a completion good (e.g., "helpful", "clear", "complete")
4. **Optional: Add few-shot examples** — Improves consistency by showing the judge what PASS and FAIL look like
5. **For finer-grained evaluation**, see:
   * `RangeJudgeGrader` — Score on a numeric scale (e.g., 1-5) instead of binary PASS/FAIL
   * `TemplatedPromptJudgeGrader` — Write custom judge prompts with templates
   * `CombinedGrader` — Combine multiple graders for multi-dimensional evaluation
