What BinaryJudgeGrader does

BinaryJudgeGrader evaluates a model completion using an LLM judge and returns either PASS (1.0) or FAIL (0.0). Unlike rule-based graders that check for patterns or structures, BinaryJudgeGrader:
  1. Takes a criteria string describing what makes a good completion
  2. Sends the completion to a judge LLM with the criteria
  3. Judge outputs BinaryJudgeOutput with reasoning and score (PASS/FAIL)
  4. Grader returns 1.0 for PASS, 0.0 for FAIL
This enables semantic evaluation—checking for qualities like “clarity”, “completeness”, or “consistency”—without writing custom judge prompts.
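
The last two steps amount to a fixed mapping from the judge's structured output to a float. The stand-in class below is a rough illustration of that mapping, not the library's actual BinaryJudgeOutput definition:

from dataclasses import dataclass

@dataclass
class JudgeOutputSketch:  # illustrative stand-in for BinaryJudgeOutput
    reasoning: str
    score: str  # "PASS" or "FAIL" (see the output format section below)

def score_to_value(output: JudgeOutputSketch) -> float:
    # PASS maps to 1.0; anything else maps to 0.0 in this sketch
    return 1.0 if output.score == "PASS" else 0.0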

When to use BinaryJudgeGrader

  • Semantic quality checks: You need LLM judgment on aspects rules can’t capture (“Is this response helpful?”)
  • Binary decisions are sufficient: You only care about PASS vs. FAIL, not a numeric scale (use RangeJudgeGrader for scoring on a range)
  • Quick setup: You want a working judge grader without writing custom judge prompts (use TemplatedPromptJudgeGrader for full control)
  • Training & evaluation: Works for both training and offline evaluation

Configuration

Create a BinaryJudgeGrader by providing:
  • grader_key: Unique identifier for this grader (used in logging/aggregation)
  • model: An inference model spawned for judging (must be an LLM capable of structured output)
  • criteria: A string describing what makes a completion PASS vs. FAIL

Basic usage

from adaptive_harmony.graders import BinaryJudgeGrader

grader = BinaryJudgeGrader(
    grader_key="quality-check",
    model=judge_model,
    criteria="The response must be factually accurate, well-structured, and directly address the user's request.",
)

# Use the grader
grade = await grader.grade(thread)
# grade.value is 1.0 (PASS) or 0.0 (FAIL)
# grade.reasoning contains the judge's explanation
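
Here, thread is a StringThread: the conversation containing the completion to evaluate. Constructing one looks like this (the same constructor appears in the few-shot section below; the example conversation is ours):

from adaptive_harmony import StringThread

thread = StringThread([
    ("user", "What is overfitting?"),
    ("assistant", "Overfitting is when a model memorizes its training data instead of "
     "learning general patterns, so it performs poorly on new data."),
])

grade = await grader.grade(thread)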

How the judge prompt works internally

BinaryJudgeGrader constructs a judge prompt with these components:
  1. System prompt: Instructs the judge LLM to evaluate the completion against the criteria
  2. Completion: The model’s output to be evaluated
  3. Output format: Judge must return JSON with two fields:
    • reasoning: Explanation of why the completion PASSES or FAILS
    • score: Either "PASS", "FAIL", or "NA" (not applicable)
If the judge fails to produce valid structured output, BinaryJudgeGrader retries automatically.

Example judge evaluation:
{
  "reasoning": "The response directly answers the question with multiple specific examples and clear explanations.",
  "score": "PASS"
}
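
The retry behavior can be pictured as a simple loop. The sketch below illustrates the pattern only; judge_fn, max_retries, and the validation checks are stand-ins, not the library's internals:

import json

async def call_judge_with_retries(judge_fn, prompt: str, max_retries: int = 3) -> dict:
    """Illustrative retry loop: re-ask the judge until it returns valid JSON."""
    for _ in range(max_retries):
        raw = await judge_fn(prompt)  # judge_fn stands in for the judge LLM call
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        if isinstance(parsed, dict) and parsed.get("score") in ("PASS", "FAIL", "NA"):
            return parsed
    raise ValueError(f"No valid judge output after {max_retries} attempts")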

Choosing good criteria

Criteria are critical—they directly shape judge behavior. Use these principles:
  1. Be specific and measurable — Avoid vague language like “be good”. Instead:
    • Bad: “The response should be helpful”
    • Good: “The response must address the user’s request with at least one concrete example”
  2. Include both positive and negative indicators — Help the judge distinguish PASS from FAIL:
    • “PASS: Contains verifiable facts and cites sources. FAIL: Contains unsupported claims or hallucinations.”
  3. Keep criteria focused — One dimension per grader. Use CombinedGrader to aggregate multiple graders for multi-dimensional evaluation:
    • One grader checks “Is this factually accurate?”
    • Another checks “Does this follow the requested format?”
    • Combine them to get an overall quality score (see the sketch after the examples below)

Example criteria of increasing specificity:

# Example 1: Generic (not recommended)
criteria_generic = "The response is good."

# Example 2: More specific
criteria_specific = (
    "The response must directly answer the user's question without introducing unrelated topics."
)

# Example 3: Highly specific with PASS/FAIL conditions
criteria_detailed = (
    "PASS: The response directly answers the user's question with clear reasoning, "
    "includes at least one relevant example or supporting detail, and contains no factual errors. "
    "FAIL: The response is vague, goes off-topic, lacks supporting details, or contains factual errors."
)

# Example 4: Domain-specific (customer support)
criteria_support = (
    "PASS: The response acknowledges the customer's problem, provides a clear solution, "
    "and offers follow-up support. FAIL: The response dismisses the problem, provides no solution, "
    "or is unclear."
)

# Example 5: Domain-specific (content summarization)
criteria_summary = (
    "PASS: The summary captures the main points, is concise (2-3 sentences), "
    "and uses language from the original text. FAIL: The summary is too long, "
    "omits key points, or introduces information not in the original."
)
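
Following the third principle above, focused graders can be aggregated for multi-dimensional evaluation. A sketch of per-dimension graders (judge_model as in Basic usage); the CombinedGrader call is commented out because its exact import path and constructor signature are not covered on this page:

from adaptive_harmony.graders import BinaryJudgeGrader

# Two focused graders, one dimension each
factuality_grader = BinaryJudgeGrader(
    grader_key="factuality",
    model=judge_model,
    criteria="PASS: Every claim is factually accurate. FAIL: Any claim is wrong or unsupported.",
)
format_grader = BinaryJudgeGrader(
    grader_key="format",
    model=judge_model,
    criteria="PASS: The response follows the requested format. FAIL: It deviates from the format.",
)

# Aggregation (arguments are assumptions; see the CombinedGrader docs):
# combined = CombinedGrader(graders=[factuality_grader, format_grader])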

Using BinaryJudgeGrader in a recipe

In a training recipe, spawn a judge model and create the grader (the Dataset and Model types used below come from adaptive_harmony; their imports are omitted in this snippet):

from adaptive_harmony.runtime import InputConfig, recipe_main
from adaptive_harmony.runtime.context import RecipeContext
from adaptive_harmony.graders import BinaryJudgeGrader

class MyConfig(InputConfig):
    train_dataset: Dataset
    model_to_train: Model
    judge_model: Model

@recipe_main
async def main(config: MyConfig, ctx: RecipeContext):
    # Spawn the judge model with spawn_inference
    judge_builder = await config.judge_model.to_builder(ctx, tp=1)
    judge = await judge_builder.spawn_inference("judge")
    
    # Create the grader
    grader = BinaryJudgeGrader(
        grader_key="factuality",
        model=judge,
        criteria="The response contains only factually verifiable statements and cites sources when needed.",
    )
    
    # Use the grader in training...
    # grade = await grader.grade(sample)
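
From there, a typical pattern is to grade each completion and aggregate results. A minimal sketch using only the documented grade() call; the thread list stands in for your dataset, and the concurrency choice is ours, not the library's:

import asyncio

async def grade_all(grader, threads):
    # Grade threads concurrently; each Grade carries value 1.0 (PASS) or 0.0 (FAIL)
    grades = await asyncio.gather(*(grader.grade(t) for t in threads))
    pass_rate = sum(g.value for g in grades) / len(grades)  # assumes threads is non-empty
    return grades, pass_rate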

Using few-shot examples (optional)

For more consistent judging, you can provide few-shot examples via BinaryJudgeShot. Each shot pairs a thread with the expected score and reasoning, teaching the judge your standards for PASS vs. FAIL.

from adaptive_harmony.graders.binary_judge import BinaryJudgeShot
from adaptive_harmony import StringThread

# Example 1: Good response — should PASS
good_example = StringThread([
    ("user", "What is machine learning?"),
    ("assistant", "Machine learning is a subset of AI where systems learn patterns from data "
     "without explicit programming. For example, recommendation systems use ML to suggest "
     "products based on user history.")
])

# Example 2: Poor response — should FAIL
poor_example = StringThread([
    ("user", "What is machine learning?"),
    ("assistant", "ML is cool.")
])

shots = [
    BinaryJudgeShot(thread=good_example, score="PASS", reasoning="Clear definition with a concrete example"),
    BinaryJudgeShot(thread=poor_example, score="FAIL", reasoning="Too vague, no examples or detail"),
]
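
You then pass the shots when constructing the grader. The shots= parameter name below is an assumption; check the BinaryJudgeGrader signature for the exact argument:

grader = BinaryJudgeGrader(
    grader_key="quality-check",
    model=judge_model,
    criteria="The response must answer the question with at least one concrete example.",
    shots=shots,  # assumed parameter name; see the BinaryJudgeGrader API reference
)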

Validation

You can validate the Grade object structure and the few-shot setup without a live judge model. In a real recipe, calling grader.grade(thread) sends the thread to the judge LLM and returns one of these Grade objects.

from adaptive_harmony import Grade

# Simulate the Grade objects that BinaryJudgeGrader.grade() returns
pass_grade = Grade(
    value=1.0,
    grader_key="quality-check",
    reasoning="The response directly addresses the question with clear examples and accurate information.",
)

fail_grade = Grade(
    value=0.0,
    grader_key="quality-check",
    reasoning="The response is vague and does not provide concrete examples as required.",
)

# Validate structure
assert pass_grade.value == 1.0
assert fail_grade.value == 0.0
assert pass_grade.grader_key == "quality-check"
assert len(pass_grade.reasoning) > 0

print(f"✅ PASS grade: value={pass_grade.value}, reasoning=\"{pass_grade.reasoning}\"")
print(f"❌ FAIL grade: value={fail_grade.value}, reasoning=\"{fail_grade.reasoning}\"")

# Validate few-shot examples are well-formed
for shot in shots:
    assert shot.score in ("PASS", "FAIL"), f"Invalid score: {shot.score}"
    assert len(shot.thread.last_content()) > 0, "Shot thread must have content"
print(f"\n✅ All {len(shots)} few-shot examples validated")

Key takeaways

  1. BinaryJudgeGrader is the simplest LLM judge — Just provide a judge model and criteria string
  2. Write specific, measurable criteria — Vague criteria lead to inconsistent judging; be explicit about PASS/FAIL conditions
  3. Use for semantic quality checks — When rules can’t capture what makes a completion good (e.g., “helpful”, “clear”, “complete”)
  4. Optional: Add few-shot examples — Improves consistency by showing the judge what PASS and FAIL look like
  5. For finer-grained evaluation, see:
    • RangeJudgeGrader — Score on a numeric scale (e.g., 1-5) instead of binary PASS/FAIL
    • TemplatedPromptJudgeGrader — Write custom judge prompts with templates
    • CombinedGrader — Combine multiple graders for multi-dimensional evaluation