What BinaryJudgeGrader does

BinaryJudgeGrader evaluates a model completion using an LLM judge and returns either PASS (1.0) or FAIL (0.0). Unlike rule-based graders that check for patterns or structures, BinaryJudgeGrader:
  1. Takes a criteria string describing what makes a good completion
  2. Sends the completion to a judge LLM with the criteria
  3. Judge outputs BinaryJudgeOutput with reasoning and score (PASS/FAIL)
  4. Grader returns 1.0 for PASS, 0.0 for FAIL
This enables semantic evaluation—checking for qualities like “clarity”, “completeness”, or “consistency”—without writing custom judge prompts.
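
The last two steps amount to a fixed mapping from the judge's structured output to a float. The stand-in class below is a rough illustration of that mapping, not the library's actual BinaryJudgeOutput definition:

from dataclasses import dataclass

@dataclass
class JudgeOutputSketch:  # illustrative stand-in for BinaryJudgeOutput
    reasoning: str
    score: str  # "PASS" or "FAIL" (see the output format section below)

def score_to_value(output: JudgeOutputSketch) -> float:
    # PASS maps to 1.0; anything else maps to 0.0 in this sketch
    return 1.0 if output.score == "PASS" else 0.0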

When to use BinaryJudgeGrader

  • Semantic quality checks: You need LLM judgment on aspects rules can’t capture (“Is this response helpful?”)
  • Binary decisions are sufficient: You only care about PASS vs. FAIL, not a numeric scale (use RangeJudgeGrader for scoring on a range)
  • Quick setup: You want a working judge grader without writing custom judge prompts (use TemplatedPromptJudgeGrader for full control)
  • Training & evaluation: Works for both training and offline evaluation

Configuration

Create a BinaryJudgeGrader by providing:
  • grader_key: Unique identifier for this grader (used in logging/aggregation)
  • model: An inference model spawned for judging (must be an LLM capable of structured output)
  • criteria: A string describing what makes a completion PASS vs. FAIL

Basic usage

from adaptive_harmony.graders import BinaryJudgeGrader

grader = BinaryJudgeGrader(
    grader_key="quality-check",
    model=judge_model,
    criteria="The response must be factually accurate, well-structured, and directly address the user's request.",
)

# Use the grader
grade = await grader.grade(thread)
# grade.value is 1.0 (PASS) or 0.0 (FAIL)
# grade.reasoning contains the judge's explanation
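
Here, thread is a StringThread: the conversation containing the completion to evaluate. Constructing one looks like this (the same constructor appears in the few-shot section below; the example conversation is ours):

from adaptive_harmony import StringThread

thread = StringThread([
    ("user", "What is overfitting?"),
    ("assistant", "Overfitting is when a model memorizes its training data instead of "
     "learning general patterns, so it performs poorly on new data."),
])

grade = await grader.grade(thread)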

How the judge prompt works internally

BinaryJudgeGrader constructs a judge prompt with these components:
  1. System prompt: Instructs the judge LLM to evaluate the completion against the criteria
  2. Completion: The model’s output to be evaluated
  3. Output format: Judge must return JSON with two fields:
    • reasoning: Explanation of why the completion PASSES or FAILS
    • score: Either "PASS", "FAIL", or "NA" (not applicable)
If the judge fails to produce valid structured output, BinaryJudgeGrader retries automatically.

Example judge evaluation:
{
  "reasoning": "The response directly answers the question with multiple specific examples and clear explanations.",
  "score": "PASS"
}
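
The retry behavior can be pictured as a simple loop. The sketch below illustrates the pattern only; judge_fn, max_retries, and the validation checks are stand-ins, not the library's internals:

import json

async def call_judge_with_retries(judge_fn, prompt: str, max_retries: int = 3) -> dict:
    """Illustrative retry loop: re-ask the judge until it returns valid JSON."""
    for _ in range(max_retries):
        raw = await judge_fn(prompt)  # judge_fn stands in for the judge LLM call
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: retry
        if isinstance(parsed, dict) and parsed.get("score") in ("PASS", "FAIL", "NA"):
            return parsed
    raise ValueError(f"No valid judge output after {max_retries} attempts")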

Choosing good criteria

Criteria are critical—they directly shape judge behavior. Use these principles:
  1. Be specific and measurable — Avoid vague language like “be good”. Instead:
    • Bad: “The response should be helpful”
    • Good: “The response must address the user’s request with at least one concrete example”
  2. Include both positive and negative indicators — Help the judge distinguish PASS from FAIL:
    • “PASS: Contains verifiable facts and cites sources. FAIL: Contains unsupported claims or hallucinations.”
  3. Keep criteria focused — One dimension per grader. Use CombinedGrader to aggregate multiple graders for multi-dimensional evaluation:
    • One grader checks “Is this factually accurate?”
    • Another checks “Does this follow the requested format?”
    • Combine them to get an overall quality score (see the sketch after the examples below)

Example criteria of increasing specificity:

# Example 1: Generic (not recommended)
criteria_generic = "The response is good."

# Example 2: More specific
criteria_specific = (
    "The response must directly answer the user's question without introducing unrelated topics."
)

# Example 3: Highly specific with PASS/FAIL conditions
criteria_detailed = (
    "PASS: The response directly answers the user's question with clear reasoning, "
    "includes at least one relevant example or supporting detail, and contains no factual errors. "
    "FAIL: The response is vague, goes off-topic, lacks supporting details, or contains factual errors."
)

# Example 4: Domain-specific (customer support)
criteria_support = (
    "PASS: The response acknowledges the customer's problem, provides a clear solution, "
    "and offers follow-up support. FAIL: The response dismisses the problem, provides no solution, "
    "or is unclear."
)

# Example 5: Domain-specific (content summarization)
criteria_summary = (
    "PASS: The summary captures the main points, is concise (2-3 sentences), "
    "and uses language from the original text. FAIL: The summary is too long, "
    "omits key points, or introduces information not in the original."
)
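
Following the third principle above, focused graders can be aggregated for multi-dimensional evaluation. A sketch of per-dimension graders (judge_model as in Basic usage); the CombinedGrader call is commented out because its exact import path and constructor signature are not covered on this page:

from adaptive_harmony.graders import BinaryJudgeGrader

# Two focused graders, one dimension each
factuality_grader = BinaryJudgeGrader(
    grader_key="factuality",
    model=judge_model,
    criteria="PASS: Every claim is factually accurate. FAIL: Any claim is wrong or unsupported.",
)
format_grader = BinaryJudgeGrader(
    grader_key="format",
    model=judge_model,
    criteria="PASS: The response follows the requested format. FAIL: It deviates from the format.",
)

# Aggregation (arguments are assumptions; see the CombinedGrader docs):
# combined = CombinedGrader(graders=[factuality_grader, format_grader])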

Using BinaryJudgeGrader in a recipe

In a training recipe, spawn a judge model and create the grader (the Dataset and Model types used below come from adaptive_harmony; their imports are omitted in this snippet):

from adaptive_harmony.runtime import InputConfig, recipe_main
from adaptive_harmony.runtime.context import RecipeContext
from adaptive_harmony.graders import BinaryJudgeGrader

class MyConfig(InputConfig):
    train_dataset: Dataset
    model_to_train: Model
    judge_model: Model

@recipe_main
async def main(config: MyConfig, ctx: RecipeContext):
    # Spawn the judge model with spawn_inference
    judge_builder = await config.judge_model.to_builder(ctx, tp=1)
    judge = await judge_builder.spawn_inference("judge")
    
    # Create the grader
    grader = BinaryJudgeGrader(
        grader_key="factuality",
        model=judge,
        criteria="The response contains only factually verifiable statements and cites sources when needed.",
    )
    
    # Use the grader in training...
    # grade = await grader.grade(sample)
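
From there, a typical pattern is to grade each completion and aggregate results. A minimal sketch using only the documented grade() call; the thread list stands in for your dataset, and the concurrency choice is ours, not the library's:

import asyncio

async def grade_all(grader, threads):
    # Grade threads concurrently; each Grade carries value 1.0 (PASS) or 0.0 (FAIL)
    grades = await asyncio.gather(*(grader.grade(t) for t in threads))
    pass_rate = sum(g.value for g in grades) / len(grades)  # assumes threads is non-empty
    return grades, pass_rate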

Using few-shot examples (optional)

For more consistent judging, you can provide few-shot examples via BinaryJudgeShot. Each shot pairs a thread with the expected score and reasoning, teaching the judge your standards for PASS vs. FAIL.

from adaptive_harmony.graders.binary_judge import BinaryJudgeShot
from adaptive_harmony import StringThread

# Example 1: Good response — should PASS
good_example = StringThread([
    ("user", "What is machine learning?"),
    ("assistant", "Machine learning is a subset of AI where systems learn patterns from data "
     "without explicit programming. For example, recommendation systems use ML to suggest "
     "products based on user history.")
])

# Example 2: Poor response — should FAIL
poor_example = StringThread([
    ("user", "What is machine learning?"),
    ("assistant", "ML is cool.")
])

shots = [
    BinaryJudgeShot(thread=good_example, score="PASS", reasoning="Clear definition with a concrete example"),
    BinaryJudgeShot(thread=poor_example, score="FAIL", reasoning="Too vague, no examples or detail"),
]
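
You then pass the shots when constructing the grader. The shots= parameter name below is an assumption; check the BinaryJudgeGrader signature for the exact argument:

grader = BinaryJudgeGrader(
    grader_key="quality-check",
    model=judge_model,
    criteria="The response must answer the question with at least one concrete example.",
    shots=shots,  # assumed parameter name; see the BinaryJudgeGrader API reference
)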

Validation

You can validate the Grade object structure and the few-shot setup without a live judge model. In a real recipe, calling grader.grade(thread) sends the thread to the judge LLM and returns one of these Grade objects.

from adaptive_harmony import Grade

# Simulate the Grade objects that BinaryJudgeGrader.grade() returns
pass_grade = Grade(
    value=1.0,
    grader_key="quality-check",
    reasoning="The response directly addresses the question with clear examples and accurate information.",
)

fail_grade = Grade(
    value=0.0,
    grader_key="quality-check",
    reasoning="The response is vague and does not provide concrete examples as required.",
)

# Validate structure
assert pass_grade.value == 1.0
assert fail_grade.value == 0.0
assert pass_grade.grader_key == "quality-check"
assert len(pass_grade.reasoning) > 0

print(f"✅ PASS grade: value={pass_grade.value}, reasoning=\"{pass_grade.reasoning}\"")
print(f"❌ FAIL grade: value={fail_grade.value}, reasoning=\"{fail_grade.reasoning}\"")

# Validate few-shot examples are well-formed
for shot in shots:
    assert shot.score in ("PASS", "FAIL"), f"Invalid score: {shot.score}"
    assert len(shot.thread.last_content()) > 0, "Shot thread must have content"
print(f"\n✅ All {len(shots)} few-shot examples validated")

Key takeaways

  1. BinaryJudgeGrader is the simplest LLM judge — Just provide a judge model and criteria string
  2. Write specific, measurable criteria — Vague criteria lead to inconsistent judging; be explicit about PASS/FAIL conditions
  3. Use for semantic quality checks — When rules can’t capture what makes a completion good (e.g., “helpful”, “clear”, “complete”)
  4. Optional: Add few-shot examples — Improves consistency by showing the judge what PASS and FAIL look like
  5. For finer-grained evaluation, see:
    • RangeJudgeGrader — Score on a numeric scale (e.g., 1-5) instead of binary PASS/FAIL
    • TemplatedPromptJudgeGrader — Write custom judge prompts with templates
    • CombinedGrader — Combine multiple graders for multi-dimensional evaluation