Grader description

A RangeJudgeGrader evaluates completions on a numeric scale instead of binary pass/fail. This is useful when quality exists on a spectrum and you want fine-grained feedback about model behavior. Unlike a BinaryJudgeGrader (which scores 0.0 or 1.0), a RangeJudgeGrader returns scores across a range (e.g., 1-5). The grader:
  • Accepts criteria describing what each score level means
  • Uses evaluation steps to guide the judge’s reasoning
  • Takes few-shot examples showing different score levels
  • Returns a normalized score (0-1 by default) based on logprob-weighted scoring
  • Optionally uses subrange expectations to help the judge understand the rubric

Pseudocode

This is the pseudocode for a range-based judge grader:
  • Define explicit criteria describing what each score level means
  • Provide numbered evaluation steps the judge should follow
  • Describe what each score range means using SubrangeExpectations (textual descriptions, not numeric mappings)
  • Include 2-3 few-shot examples showing what different score levels look like
  • Pass all components to RangeJudgeGrader with an InferenceModel
  • The LLM judge scores the completion on the numeric range (e.g., 1-5)
  • Return a normalized grade (0-1 by default)

Implementation

from adaptive_harmony import StringThread
from adaptive_harmony.graders.range_judge import RangeJudgeShot, SubrangeExpectations

Defining the criteria and evaluation steps

Start by writing clear criteria that describe what each score level means. The rubric should make the boundaries explicit—what distinguishes a 1 from a 2, or a 3 from a 4. Then list numbered evaluation steps that guide the judge’s reasoning. These steps should be concrete and actionable.
# Define the scoring criteria
criteria = """
Score the quality of the completion on a scale of 1-5:

1 - Irrelevant or off-topic: Completion does not address the prompt or is contradictory.
2 - Incomplete: Completion touches on the topic but lacks detail or clarity.
3 - Adequate: Completion covers the main points but is missing some depth or supporting details.
4 - Good: Completion is relevant, clear, and covers most key aspects with supporting details.
5 - Excellent: Completion is thorough, well-structured, and provides comprehensive information with clear reasoning.
"""

# Define the evaluation steps
evaluation_steps = [
    "Does the completion directly address the prompt?",
    "Is the information relevant and accurate?",
    "How complete is the answer? Are key points covered?",
    "Is the information well-organized and easy to follow?",
    "Are claims supported with reasoning or examples?",
    "Assign a score from 1-5 based on the rubric above.",
]

print("📋 Criteria:")
print(criteria)
print("📝 Evaluation steps:")
for i, step in enumerate(evaluation_steps, 1):
    print(f"  {i}. {step}")

SubrangeExpectations: textual descriptions of score ranges

SubrangeExpectations is a NamedTuple with subrange (a tuple of min and max score) and expectation (a text description of what that score range means). These help the judge understand the rubric by describing the expected quality level for each score range. This is different from reward mapping—these are purely textual descriptions to guide the LLM judge.
# Define subrange expectations: what each score range means
subrange_expectations = [
    SubrangeExpectations(
        subrange=(1, 2),
        expectation="The response is off-topic, incomplete, or of very low quality"
    ),
    SubrangeExpectations(
        subrange=(3, 3),
        expectation="The response covers the main points adequately but lacks depth or polish"
    ),
    SubrangeExpectations(
        subrange=(4, 5),
        expectation="The response is comprehensive, well-structured, and demonstrates expert understanding"
    ),
]

print("🎯 Subrange expectations:")
for se in subrange_expectations:
    print(f"  📊 Scores {se.subrange[0]}-{se.subrange[1]}: {se.expectation}")

Few-shot examples for the judge

Few-shot examples teach the judge what each score looks like by showing real examples. Each shot is a RangeJudgeShot with:
  • thread: A StringThread with the prompt and completion
  • reasoning: The judge’s reasoning for that score
  • score: The gold-standard score for that example
# Few-shot examples for the judge
prompt = "Explain the difference between correlation and causation."

# Example 1: Score 2 (Incomplete)
score_2_completion = "Correlation is when two things happen together. Causation is when something causes something else."
score_2_thread = StringThread([
    ("user", prompt),
    ("assistant", score_2_completion)
])

# Example 2: Score 3 (Adequate)
score_3_completion = (
    "Correlation means two variables move together (both go up or both go down). "
    "Causation means one variable actually causes the change in the other. "
    "A classic example is ice cream sales and drowning deaths—both increase in summer, but ice cream doesn't cause drowning."
)
score_3_thread = StringThread([
    ("user", prompt),
    ("assistant", score_3_completion)
])

# Example 3: Score 5 (Excellent)
score_5_completion = (
    "Correlation describes a statistical relationship between two variables where they tend to move together. "
    "When one increases, the other tends to increase (positive correlation) or decrease (negative correlation). "
    "However, correlation does not imply causation. Two variables can be correlated for several reasons:\n"
    "1. One causes the other (genuine causal link)\n"
    "2. A third variable causes both (confounding variable)\n"
    "3. The relationship is coincidental\n\n"
    "Example: Ice cream sales and drowning deaths are positively correlated, both peak in summer. "
    "But ice cream doesn't cause drowning—warm weather is the confounding variable that causes both. "
    "To establish causation, you need controlled experiments or causal inference methods that rule out confounds."
)
score_5_thread = StringThread([
    ("user", prompt),
    ("assistant", score_5_completion)
])

# Create RangeJudgeShot objects
shots = [
    RangeJudgeShot(
        thread=score_2_thread,
        reasoning="This explanation is too brief and lacks the key insight that correlation does not imply causation. It provides only basic definitions without examples or nuance.",
        score=2
    ),
    RangeJudgeShot(
        thread=score_3_thread,
        reasoning="This explanation covers the main distinction and provides a useful example. However, it could go deeper by explaining confounding variables more clearly.",
        score=3
    ),
    RangeJudgeShot(
        thread=score_5_thread,
        reasoning="This is a comprehensive explanation that defines both concepts clearly, provides multiple reasons for correlation without causation, includes a concrete example, and explains the role of confounding variables. The structure is logical and easy to follow.",
        score=5
    ),
]

print(f"🎓 Created {len(shots)} few-shot examples:")
for shot in shots:
    print(f"  📌 Score {shot.score}: {len(shot.thread.last_content())} characters")

Validation

We can validate all the setup objects and demonstrate score normalization without a live judge model.

Creating the RangeJudgeGrader

The RangeJudgeGrader constructor requires:
  • grader_key: Unique identifier for this grader
  • model: An InferenceModel (required)
  • criteria: Description of the scoring rubric
  • score_range: Tuple of (min, max) for the numeric scale (default: (1, 5))
  • evaluation_steps: List of strings describing evaluation steps (optional; auto-generated if None)
  • subrange_expectations: List of SubrangeExpectations describing what each score range means (optional)
  • shots: List of RangeJudgeShot few-shot examples (optional)
  • normalize_score: If True (default), scores are normalized to 0-1 range
By default, scores are normalized: score 1 → 0.0, score 5 → 1.0.
from adaptive_harmony.graders import RangeJudgeGrader

# Requires an InferenceModel — shown as reference only
grader = RangeJudgeGrader(
    grader_key="completion-quality",
    model=inference_model,  # InferenceModel instance
    criteria=criteria,
    score_range=(1, 5),
    evaluation_steps=evaluation_steps,
    subrange_expectations=subrange_expectations,
    shots=shots,
    normalize_score=True,  # Scores 1-5 → 0.0-1.0
)

# Standalone score range used by the validation checks below
score_range = (1, 5)

# Validate subrange expectations cover the full score range
all_covered = set()
for se in subrange_expectations:
    for s in range(se.subrange[0], se.subrange[1] + 1):
        all_covered.add(s)
assert all_covered == set(range(score_range[0], score_range[1] + 1)), \
    f"Subranges don't cover full range: missing {set(range(*score_range)) - all_covered}"
print(f"✅ Subrange expectations cover the full {score_range[0]}-{score_range[1]} range")

# Validate few-shot examples have scores within range
for shot in shots:
    assert score_range[0] <= shot.score <= score_range[1], \
        f"Shot score {shot.score} outside range {score_range}"
    assert len(shot.reasoning) > 0, "Shot must have reasoning"
print(f"✅ All {len(shots)} few-shot scores are within [{score_range[0]}, {score_range[1]}]")

# Demonstrate normalization: raw score → [0, 1]
def normalize(raw_score, lo, hi):
    return (raw_score - lo) / (hi - lo)

print(f"\n📊 Score normalization (range {score_range[0]}-{score_range[1]}):")
for raw in range(score_range[0], score_range[1] + 1):
    norm = normalize(raw, *score_range)
    print(f"  Raw {raw} → Normalized {norm:.2f}")

How scoring works

The RangeJudgeGrader uses logprob-weighted scoring internally:
  1. The LLM judge produces a score (1-5) with reasoning
  2. The grader computes logprobs for each possible score (1, 2, 3, 4, 5)
  3. These logprobs are converted to probabilities
  4. A weighted average is computed: sum(score_i * prob_i for each score)
  5. If normalize_score=True, the weighted score is mapped to [0, 1]:
    • Raw score 1 → 0.0
    • Raw score 5 → 1.0
    • Linear interpolation in between
This design captures the judge's uncertainty about borderline cases and avoids hard jumps between adjacent scores.
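
To make steps 2-5 concrete, here is a minimal, self-contained sketch of logprob-weighted scoring. The logprob values are made up for illustration, and the code is not the library's internal implementation; it simply mirrors the softmax, weighted-average, and normalization steps described above:
import math

# Hypothetical logprobs the judge assigns to each candidate score token (illustrative values only)
score_logprobs = {1: -5.2, 2: -3.1, 3: -1.4, 4: -0.6, 5: -2.0}

# Step 3: convert logprobs to probabilities (softmax over the candidate scores)
max_lp = max(score_logprobs.values())
weights = {s: math.exp(lp - max_lp) for s, lp in score_logprobs.items()}
total = sum(weights.values())
probs = {s: w / total for s, w in weights.items()}

# Step 4: weighted average of the candidate scores under the judge's distribution
weighted_score = sum(s * p for s, p in probs.items())

# Step 5: map the weighted score onto [0, 1] using the score range
lo, hi = 1, 5
normalized = (weighted_score - lo) / (hi - lo)

rounded_probs = {s: round(p, 3) for s, p in probs.items()}
print(f"Probabilities: {rounded_probs}")
print(f"Weighted score: {weighted_score:.2f} → Normalized: {normalized:.2f}")

Note how the probability mass spread across 3, 4, and 5 pulls the weighted score between whole-number grades instead of snapping to the single most likely score.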

RangeJudgeGrader vs BinaryJudgeGrader

Aspect             BinaryJudgeGrader           RangeJudgeGrader
Score              0.0 or 1.0 (PASS/FAIL)      0.0 to 1.0 (normalized range)
Use case           Clear pass/fail criteria    Spectrum-based quality
Rubric             Single PASS criterion       Multiple rubric levels
Few-shots          1-2 examples                2-3 examples (low, mid, high)
Training signal    All-or-nothing reward       Fine-grained gradients
Judge reasoning    Simple: PASS or FAIL        Structured: reasoning + score
Use BinaryJudgeGrader when you have a clear PASS/FAIL threshold (e.g., “Does the output contain valid JSON?”). Use RangeJudgeGrader when quality exists on a spectrum (e.g., “How well does this summary capture the key points?”).
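
As a rough illustration of the "Training signal" row above, the sketch below compares the reward three completions of increasing quality would receive under an all-or-nothing scheme versus a normalized range. The raw scores and the passing threshold are hypothetical, and the two helper functions are not part of the library:
# Illustrative raw judge scores (1-5) for three completions of increasing quality
raw_scores = {"weak": 2, "decent": 3, "strong": 5}

def binary_reward(raw: int, passing_threshold: int = 4) -> float:
    # All-or-nothing: only completions at or above a (hypothetical) threshold get credit
    return 1.0 if raw >= passing_threshold else 0.0

def range_reward(raw: int, lo: int = 1, hi: int = 5) -> float:
    # Normalized range: partial credit proportional to position in the rubric
    return (raw - lo) / (hi - lo)

for name, raw in raw_scores.items():
    print(f"{name}: binary={binary_reward(raw):.1f}, range={range_reward(raw):.2f}")

The binary scheme gives the weak and decent completions identical zero reward, while the range scheme preserves the ordering between them; that is the fine-grained signal the table refers to.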

Key takeaways

  1. RangeJudgeGrader for spectrum-based quality - When binary PASS/FAIL is too coarse, numeric ranges capture degrees of quality
  2. Write explicit criteria - Define what each score level means so the judge has clear guidance
  3. Include numbered evaluation steps - Structured steps guide the judge’s reasoning process
  4. Few-shot examples are essential - Show examples at low, medium, and high scores so the judge understands the full scale
  5. Normalization is on by default - Scores are mapped to [0, 1] by default for consistency with other graders