What BinaryJudgeGrader does
BinaryJudgeGrader evaluates a model completion using an LLM judge and returns either PASS (1.0) or FAIL (0.0). Unlike rule-based graders that check for patterns or structures, BinaryJudgeGrader:- Takes a
criteriastring describing what makes a good completion - Sends the completion to a judge LLM with the criteria
- Judge outputs
BinaryJudgeOutputwithreasoningandscore(PASS/FAIL) - Grader returns 1.0 for PASS, 0.0 for FAIL
When to use BinaryJudgeGrader
- Semantic quality checks: You need LLM judgment on aspects rules can’t capture (“Is this response helpful?”)
- Binary decisions are sufficient: You only care about PASS vs. FAIL, not a numeric scale (use
RangeJudgeGraderfor scoring on a range) - Quick setup: You want a working judge grader without writing custom judge prompts (use
TemplatedPromptJudgeGraderfor full control) - Training & evaluation: Works for both training and offline evaluation
Configuration
Create a BinaryJudgeGrader by providing:grader_key: Unique identifier for this grader (used in logging/aggregation)model: An inference model spawned for judging (must be an LLM capable of structured output)criteria: A string describing what makes a completion PASS vs. FAIL
Basic usage
How the judge prompt works internally
BinaryJudgeGrader constructs a judge prompt with these components:- System prompt: Instructs the judge LLM to evaluate the completion against the criteria
- Completion: The model’s output to be evaluated
- Output format: Judge must return JSON with two fields:
reasoning: Explanation of why the completion PASSES or FAILSscore: Either"PASS","FAIL", or"NA"(not applicable)
Choosing good criteria
Criteria are critical—they directly shape judge behavior. Use these principles:-
Be specific and measurable — Avoid vague language like “be good”. Instead:
- Bad: “The response should be helpful”
- Good: “The response must address the user’s request with at least one concrete example”
-
Include both positive and negative indicators — Help the judge distinguish PASS from FAIL:
- “PASS: Contains verifiable facts and cites sources. FAIL: Contains unsupported claims or hallucinations.”
-
Keep criteria focused — One dimension per grader. Use
CombinedGraderto aggregate multiple graders for multi-dimensional evaluation:- One grader checks “Is this factually accurate?”
- Another checks “Does this follow the requested format?”
- Combine them to get overall quality score
Example criteria of increasing specificity:
Using BinaryJudgeGrader in a recipe
In a training recipe, spawn a judge model and create the grader:Using few-shot examples (optional)
For more consistent judging, you can provide few-shot examples viaBinaryJudgeShot. Each shot pairs a thread with the expected score and reasoning, teaching the judge your standards for PASS vs. FAIL.
Validation
We can validate the Grade object structure and few-shot setup without a live judge model. In a real recipe, callinggrader.grade(thread) sends the thread to the judge LLM and returns one of these Grade objects.
Key takeaways
- BinaryJudgeGrader is the simplest LLM judge — Just provide a judge model and criteria string
- Write specific, measurable criteria — Vague criteria lead to inconsistent judging; be explicit about PASS/FAIL conditions
- Use for semantic quality checks — When rules can’t capture what makes a completion good (e.g., “helpful”, “clear”, “complete”)
- Optional: Add few-shot examples — Improves consistency by showing the judge what PASS and FAIL look like
- For finer-grained evaluation, see:
RangeJudgeGrader— Score on a numeric scale (e.g., 1-5) instead of binary PASS/FAILTemplatedPromptJudgeGrader— Write custom judge prompts with templatesCombinedGrader— Combine multiple graders for multi-dimensional evaluation

