What BinaryJudgeGrader does
BinaryJudgeGrader evaluates a model completion using an LLM judge and returns either PASS (1.0) or FAIL (0.0).
Unlike rule-based graders that check for patterns or structures, BinaryJudgeGrader:
- Takes a criteria string describing what makes a good completion
- Sends the completion to a judge LLM along with the criteria
- The judge outputs a BinaryJudgeOutput with reasoning and a score (PASS/FAIL)
- The grader returns 1.0 for PASS, 0.0 for FAIL
This enables semantic evaluation—checking for qualities like “clarity”, “completeness”, or “consistency”—without writing custom judge prompts.
When to use BinaryJudgeGrader
- Semantic quality checks: You need LLM judgment on aspects rules can’t capture (“Is this response helpful?”)
- Binary decisions are sufficient: You only care about PASS vs. FAIL, not a numeric scale (use RangeJudgeGrader for scoring on a range)
- Quick setup: You want a working judge grader without writing custom judge prompts (use TemplatedPromptJudgeGrader for full control)
- Training & evaluation: Works for both training and offline evaluation
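To make the rule-based vs. semantic split concrete, here is a small illustration; the regex check and the completion variable are hypothetical:

import re

completion = "Refunds are possible within 30 days of purchase."  # hypothetical model output

# Rule-based: a pattern can verify structural or lexical properties directly
mentions_refund_window = bool(re.search(r"\b\d+ days\b", completion))

# Semantic: "is this actually helpful?" requires judgment, so delegate it to a judge
criteria = "The response must resolve the customer's question, not merely restate policy."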
Configuration
Create a BinaryJudgeGrader by providing:
- grader_key: Unique identifier for this grader (used in logging/aggregation)
- model: An inference model spawned for judging (must be an LLM capable of structured output)
- criteria: A string describing what makes a completion PASS vs. FAIL
Basic usage
from adaptive_harmony.graders import BinaryJudgeGrader
grader = BinaryJudgeGrader(
    grader_key="quality-check",
    model=judge_model,
    criteria="The response must be factually accurate, well-structured, and directly address the user's request.",
)
# Use the grader
grade = await grader.grade(thread)
# grade.value is 1.0 (PASS) or 0.0 (FAIL)
# grade.reasoning contains the judge's explanation
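Because grade is async, a batch of threads can be graded concurrently. A minimal sketch, assuming threads is a list of already-built conversation threads:

import asyncio

# Fan out one grade() call per thread and await them together
grades = await asyncio.gather(*(grader.grade(t) for t in threads))
pass_rate = sum(g.value for g in grades) / len(grades)
print(f"Pass rate: {pass_rate:.2%}")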
How the judge prompt works internally
BinaryJudgeGrader constructs a judge prompt with these components:
- System prompt: Instructs the judge LLM to evaluate the completion against the criteria
- Completion: The model’s output to be evaluated
- Output format: Judge must return JSON with two fields:
  - reasoning: Explanation of why the completion PASSES or FAILS
  - score: Either "PASS", "FAIL", or "NA" (not applicable)
If the judge fails to produce valid structured output, BinaryJudgeGrader retries automatically.
Example judge evaluation:
{
  "reasoning": "The response directly answers the question with multiple specific examples and clear explanations.",
  "score": "PASS"
}
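For a concrete picture of this output contract, here is a sketch that mirrors the schema with a Pydantic model. BinaryJudgeOutputSketch is a hypothetical stand-in, not the library's actual BinaryJudgeOutput class:

from typing import Literal

from pydantic import BaseModel

class BinaryJudgeOutputSketch(BaseModel):
    reasoning: str  # explanation of the verdict
    score: Literal["PASS", "FAIL", "NA"]  # "NA" covers not-applicable cases

# Validate the example judge evaluation above
out = BinaryJudgeOutputSketch.model_validate_json(
    '{"reasoning": "Directly answers the question with examples.", "score": "PASS"}'
)
assert out.score == "PASS"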
Choosing good criteria
Criteria are critical—they directly shape judge behavior. Use these principles:
- Be specific and measurable: Avoid vague language like “be good”. Instead:
  - Bad: “The response should be helpful”
  - Good: “The response must address the user’s request with at least one concrete example”
- Include both positive and negative indicators: Help the judge distinguish PASS from FAIL:
  - “PASS: Contains verifiable facts and cites sources. FAIL: Contains unsupported claims or hallucinations.”
- Keep criteria focused: One dimension per grader. Use CombinedGrader to aggregate multiple graders for multi-dimensional evaluation (a sketch follows this list):
  - One grader checks “Is this factually accurate?”
  - Another checks “Does this follow the requested format?”
  - Combine them to get an overall quality score
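A sketch of that multi-grader setup follows. The CombinedGrader import path and constructor shape (a list of graders) are assumptions here; check its reference page for the actual signature:

from adaptive_harmony.graders import BinaryJudgeGrader, CombinedGrader  # import path assumed

factuality = BinaryJudgeGrader(
    grader_key="factuality",
    model=judge_model,
    criteria="PASS: Every claim is factually accurate. FAIL: Any claim is wrong or unsupported.",
)
format_check = BinaryJudgeGrader(
    grader_key="format",
    model=judge_model,
    criteria="PASS: The response follows the requested format. FAIL: It deviates from the requested format.",
)

# Aggregate both dimensions into one grade (constructor arguments assumed)
combined = CombinedGrader([factuality, format_check])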
Example criteria of increasing specificity:
# Example 1: Generic (not recommended)
criteria_generic = "The response is good."

# Example 2: More specific
criteria_specific = (
    "The response must directly answer the user's question without introducing unrelated topics."
)

# Example 3: Highly specific with PASS/FAIL conditions
criteria_detailed = (
    "PASS: The response directly answers the user's question with clear reasoning, "
    "includes at least one relevant example or supporting detail, and contains no factual errors. "
    "FAIL: The response is vague, goes off-topic, lacks supporting details, or contains factual errors."
)

# Example 4: Domain-specific (customer support)
criteria_support = (
    "PASS: The response acknowledges the customer's problem, provides a clear solution, "
    "and offers follow-up support. FAIL: The response dismisses the problem, provides no solution, "
    "or is unclear."
)

# Example 5: Domain-specific (content summarization)
criteria_summary = (
    "PASS: The summary captures the main points, is concise (2-3 sentences), "
    "and uses language from the original text. FAIL: The summary is too long, "
    "omits key points, or introduces information not in the original."
)
Using BinaryJudgeGrader in a recipe
In a training recipe, spawn a judge model and create the grader:
from adaptive_harmony.runtime import InputConfig, recipe_main
from adaptive_harmony.runtime.context import RecipeContext
from adaptive_harmony.graders import BinaryJudgeGrader
class MyConfig(InputConfig):
    # Dataset and Model are Adaptive config types; imports omitted here
    train_dataset: Dataset
    model_to_train: Model
    judge_model: Model

@recipe_main
async def main(config: MyConfig, ctx: RecipeContext):
    # Spawn the judge model with spawn_inference
    judge_builder = await config.judge_model.to_builder(ctx, tp=1)
    judge = await judge_builder.spawn_inference("judge")

    # Create the grader
    grader = BinaryJudgeGrader(
        grader_key="factuality",
        model=judge,
        criteria="The response contains only factually verifiable statements and cites sources when needed.",
    )

    # Use the grader in training...
    # grade = await grader.grade(sample)
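Note that the judge is spawned once, before the training loop; the resulting handle can then be reused for every grading call rather than re-spawned per sample.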
Using few-shot examples (optional)
For more consistent judging, you can provide few-shot examples via BinaryJudgeShot. Each shot pairs a thread with the expected score and reasoning, teaching the judge your standards for PASS vs. FAIL.
from adaptive_harmony.graders.binary_judge import BinaryJudgeShot
from adaptive_harmony import StringThread
# Example 1: Good response — should PASS
good_example = StringThread([
    ("user", "What is machine learning?"),
    ("assistant", "Machine learning is a subset of AI where systems learn patterns from data "
     "without explicit programming. For example, recommendation systems use ML to suggest "
     "products based on user history.")
])
# Example 2: Poor response — should FAIL
poor_example = StringThread([
    ("user", "What is machine learning?"),
    ("assistant", "ML is cool.")
])

shots = [
    BinaryJudgeShot(thread=good_example, score="PASS", reasoning="Clear definition with a concrete example"),
    BinaryJudgeShot(thread=poor_example, score="FAIL", reasoning="Too vague, no examples or detail"),
]
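To attach the shots to a grader, pass them at construction time. A minimal sketch, assuming the constructor accepts a shots parameter (verify the name against the BinaryJudgeGrader signature):

grader_with_shots = BinaryJudgeGrader(
    grader_key="quality-check",
    model=judge_model,
    criteria="The response must answer the question with at least one concrete example.",
    shots=shots,  # parameter name assumed; check the BinaryJudgeGrader reference
)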
Validation
We can validate the Grade object structure and few-shot setup without a live judge model. In a real recipe, calling grader.grade(thread) sends the thread to the judge LLM and returns one of these Grade objects.
from adaptive_harmony import Grade
# Simulate the Grade objects that BinaryJudgeGrader.grade() returns
pass_grade = Grade(
    value=1.0,
    grader_key="quality-check",
    reasoning="The response directly addresses the question with clear examples and accurate information.",
)

fail_grade = Grade(
    value=0.0,
    grader_key="quality-check",
    reasoning="The response is vague and does not provide concrete examples as required.",
)
# Validate structure
assert pass_grade.value == 1.0
assert fail_grade.value == 0.0
assert pass_grade.grader_key == "quality-check"
assert len(pass_grade.reasoning) > 0
print(f"✅ PASS grade: value={pass_grade.value}, reasoning=\"{pass_grade.reasoning}\"")
print(f"❌ FAIL grade: value={fail_grade.value}, reasoning=\"{fail_grade.reasoning}\"")
# Validate few-shot examples are well-formed
for shot in shots:
    assert shot.score in ("PASS", "FAIL"), f"Invalid score: {shot.score}"
    assert len(shot.thread.last_content()) > 0, "Shot thread must have content"
print(f"\n✅ All {len(shots)} few-shot examples validated")
Key takeaways
- BinaryJudgeGrader is the simplest LLM judge — Just provide a judge model and criteria string
- Write specific, measurable criteria — Vague criteria lead to inconsistent judging; be explicit about PASS/FAIL conditions
- Use for semantic quality checks — When rules can’t capture what makes a completion good (e.g., “helpful”, “clear”, “complete”)
- Optional: Add few-shot examples — Improves consistency by showing the judge what PASS and FAIL look like
- For finer-grained evaluation, see:
  - RangeJudgeGrader: Score on a numeric scale (e.g., 1-5) instead of binary PASS/FAIL
  - TemplatedPromptJudgeGrader: Write custom judge prompts with templates
  - CombinedGrader: Combine multiple graders for multi-dimensional evaluation