Grader description

A RangeJudgeGrader evaluates completions on a numeric scale instead of binary pass/fail. This is useful when quality exists on a spectrum and you want fine-grained feedback about model behavior. Unlike a BinaryJudgeGrader (which scores 0.0 or 1.0), a RangeJudgeGrader returns scores across a range (e.g., 1-5). The grader:
  • Accepts criteria describing what each score level means
  • Uses evaluation steps to guide the judge’s reasoning
  • Takes few-shot examples showing different score levels
  • Returns a normalized score (0-1 by default) based on logprob-weighted scoring
  • Optionally uses subrange expectations to help the judge understand the rubric

Pseudocode

This is the pseudocode for a range-based judge grader:
  • Define explicit criteria describing what each score level means
  • Provide numbered evaluation steps the judge should follow
  • Describe what each score range means using SubrangeExpectations (textual descriptions, not numeric mappings)
  • Include 2-3 few-shot examples showing what different score levels look like
  • Pass all components to RangeJudgeGrader with an InferenceModel
  • The LLM judge scores the completion on the numeric range (e.g., 1-5)
  • Return a normalized grade (0-1 by default)

Implementation

from adaptive_harmony import StringThread
from adaptive_harmony.graders.range_judge import RangeJudgeShot, SubrangeExpectations

Defining the criteria and evaluation steps

Start by writing clear criteria that describe what each score level means. The rubric should make the boundaries explicit—what distinguishes a 1 from a 2, or a 3 from a 4. Then list numbered evaluation steps that guide the judge’s reasoning. These steps should be concrete and actionable.
# Define the scoring criteria
criteria = """
Score the quality of the completion on a scale of 1-5:

1 - Irrelevant or off-topic: Completion does not address the prompt or is contradictory.
2 - Incomplete: Completion touches on the topic but lacks detail or clarity.
3 - Adequate: Completion covers the main points but is missing some depth or supporting details.
4 - Good: Completion is relevant, clear, and covers most key aspects with supporting details.
5 - Excellent: Completion is thorough, well-structured, and provides comprehensive information with clear reasoning.
"""

# Define the evaluation steps
evaluation_steps = [
    "Does the completion directly address the prompt?",
    "Is the information relevant and accurate?",
    "How complete is the answer? Are key points covered?",
    "Is the information well-organized and easy to follow?",
    "Are claims supported with reasoning or examples?",
    "Assign a score from 1-5 based on the rubric above.",
]

print("📋 Criteria:")
print(criteria)
print("📝 Evaluation steps:")
for i, step in enumerate(evaluation_steps, 1):
    print(f"  {i}. {step}")

SubrangeExpectations: textual descriptions of score ranges

SubrangeExpectations is a NamedTuple with subrange (a tuple of min and max score) and expectation (a text description of what that score range means). These help the judge understand the rubric by describing the expected quality level for each score range. This is different from reward mapping—these are purely textual descriptions to guide the LLM judge.
# Define subrange expectations: what each score range means
subrange_expectations = [
    SubrangeExpectations(
        subrange=(1, 2),
        expectation="The response is off-topic, incomplete, or of very low quality"
    ),
    SubrangeExpectations(
        subrange=(3, 3),
        expectation="The response covers the main points adequately but lacks depth or polish"
    ),
    SubrangeExpectations(
        subrange=(4, 5),
        expectation="The response is comprehensive, well-structured, and demonstrates expert understanding"
    ),
]

print("🎯 Subrange expectations:")
for se in subrange_expectations:
    print(f"  📊 Scores {se.subrange[0]}-{se.subrange[1]}: {se.expectation}")

Few-shot examples for the judge

Few-shot examples teach the judge what each score looks like by showing real examples. Each shot is a RangeJudgeShot with:
  • thread: A StringThread with the prompt and completion
  • reasoning: The judge’s reasoning for that score
  • score: The gold-standard score for that example
# Few-shot examples for the judge
prompt = "Explain the difference between correlation and causation."

# Example 1: Score 2 (Incomplete)
score_2_completion = "Correlation is when two things happen together. Causation is when something causes something else."
score_2_thread = StringThread([
    ("user", prompt),
    ("assistant", score_2_completion)
])

# Example 2: Score 3 (Adequate)
score_3_completion = (
    "Correlation means two variables move together (both go up or both go down). "
    "Causation means one variable actually causes the change in the other. "
    "A classic example is ice cream sales and drowning deaths—both increase in summer, but ice cream doesn't cause drowning."
)
score_3_thread = StringThread([
    ("user", prompt),
    ("assistant", score_3_completion)
])

# Example 3: Score 5 (Excellent)
score_5_completion = (
    "Correlation describes a statistical relationship between two variables where they tend to move together. "
    "When one increases, the other tends to increase (positive correlation) or decrease (negative correlation). "
    "However, correlation does not imply causation. Two variables can be correlated for several reasons:\n"
    "1. One causes the other (genuine causal link)\n"
    "2. A third variable causes both (confounding variable)\n"
    "3. The relationship is coincidental\n\n"
    "Example: Ice cream sales and drowning deaths are positively correlated, both peak in summer. "
    "But ice cream doesn't cause drowning—warm weather is the confounding variable that causes both. "
    "To establish causation, you need controlled experiments or causal inference methods that rule out confounds."
)
score_5_thread = StringThread([
    ("user", prompt),
    ("assistant", score_5_completion)
])

# Create RangeJudgeShot objects
shots = [
    RangeJudgeShot(
        thread=score_2_thread,
        reasoning="This explanation is too brief and lacks the key insight that correlation does not imply causation. It provides only basic definitions without examples or nuance.",
        score=2
    ),
    RangeJudgeShot(
        thread=score_3_thread,
        reasoning="This explanation covers the main distinction and provides a useful example. However, it could go deeper by explaining confounding variables more clearly.",
        score=3
    ),
    RangeJudgeShot(
        thread=score_5_thread,
        reasoning="This is a comprehensive explanation that defines both concepts clearly, provides multiple reasons for correlation without causation, includes a concrete example, and explains the role of confounding variables. The structure is logical and easy to follow.",
        score=5
    ),
]

print(f"🎓 Created {len(shots)} few-shot examples:")
for shot in shots:
    print(f"  📌 Score {shot.score}: {len(shot.thread.last_content())} characters")

Validation

We can validate all the setup objects and demonstrate score normalization without a live judge model.

Creating the RangeJudgeGrader

The RangeJudgeGrader constructor requires:
  • grader_key: Unique identifier for this grader
  • model: An InferenceModel (required)
  • criteria: Description of the scoring rubric
  • score_range: Tuple of (min, max) for the numeric scale (default: (1, 5))
  • evaluation_steps: List of strings describing evaluation steps (optional; auto-generated if None)
  • subrange_expectations: List of SubrangeExpectations describing what each score range means (optional)
  • shots: List of RangeJudgeShot few-shot examples (optional)
  • normalize_score: If True (default), scores are normalized to 0-1 range
By default, scores are normalized: score 1 → 0.0, score 5 → 1.0.
from adaptive_harmony.graders import RangeJudgeGrader

# Requires an InferenceModel — shown as reference only
grader = RangeJudgeGrader(
    grader_key="completion-quality",
    model=inference_model,  # InferenceModel instance
    criteria=criteria,
    score_range=(1, 5),
    evaluation_steps=evaluation_steps,
    subrange_expectations=subrange_expectations,
    shots=shots,
    normalize_score=True,  # Scores 1-5 → 0.0-1.0
)

# Standalone score range used by the validation checks below
score_range = (1, 5)

# Validate subrange expectations cover the full score range
all_covered = set()
for se in subrange_expectations:
    for s in range(se.subrange[0], se.subrange[1] + 1):
        all_covered.add(s)
assert all_covered == set(range(score_range[0], score_range[1] + 1)), \
    f"Subranges don't cover full range: missing {set(range(*score_range)) - all_covered}"
print(f"✅ Subrange expectations cover the full {score_range[0]}-{score_range[1]} range")

# Validate few-shot examples have scores within range
for shot in shots:
    assert score_range[0] <= shot.score <= score_range[1], \
        f"Shot score {shot.score} outside range {score_range}"
    assert len(shot.reasoning) > 0, "Shot must have reasoning"
print(f"✅ All {len(shots)} few-shot scores are within [{score_range[0]}, {score_range[1]}]")

# Demonstrate normalization: raw score → [0, 1]
def normalize(raw_score, lo, hi):
    return (raw_score - lo) / (hi - lo)

print(f"\n📊 Score normalization (range {score_range[0]}-{score_range[1]}):")
for raw in range(score_range[0], score_range[1] + 1):
    norm = normalize(raw, *score_range)
    print(f"  Raw {raw} → Normalized {norm:.2f}")

How scoring works

The RangeJudgeGrader uses logprob-weighted scoring internally:
  1. The LLM judge produces a score (1-5) with reasoning
  2. The grader computes logprobs for each possible score (1, 2, 3, 4, 5)
  3. These logprobs are converted to probabilities
  4. A weighted average is computed: sum(score_i * prob_i for each score)
  5. If normalize_score=True, the weighted score is mapped to [0, 1]:
    • Raw score 1 → 0.0
    • Raw score 5 → 1.0
    • Linear interpolation in between
This design captures the judge's uncertainty about borderline cases and avoids hard jumps between adjacent scores.
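
To make steps 2-5 concrete, here is a minimal, self-contained sketch of logprob-weighted scoring. The logprob values are made up for illustration, and the code is not the library's internal implementation; it simply mirrors the softmax, weighted-average, and normalization steps described above:
import math

# Hypothetical logprobs the judge assigns to each candidate score token (illustrative values only)
score_logprobs = {1: -5.2, 2: -3.1, 3: -1.4, 4: -0.6, 5: -2.0}

# Step 3: convert logprobs to probabilities (softmax over the candidate scores)
max_lp = max(score_logprobs.values())
weights = {s: math.exp(lp - max_lp) for s, lp in score_logprobs.items()}
total = sum(weights.values())
probs = {s: w / total for s, w in weights.items()}

# Step 4: weighted average of the candidate scores under the judge's distribution
weighted_score = sum(s * p for s, p in probs.items())

# Step 5: map the weighted score onto [0, 1] using the score range
lo, hi = 1, 5
normalized = (weighted_score - lo) / (hi - lo)

rounded_probs = {s: round(p, 3) for s, p in probs.items()}
print(f"Probabilities: {rounded_probs}")
print(f"Weighted score: {weighted_score:.2f} → Normalized: {normalized:.2f}")

Note how the probability mass spread across 3, 4, and 5 pulls the weighted score between whole-number grades instead of snapping to the single most likely score.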

RangeJudgeGrader vs BinaryJudgeGrader

Aspect             BinaryJudgeGrader           RangeJudgeGrader
Score              0.0 or 1.0 (PASS/FAIL)      0.0 to 1.0 (normalized range)
Use case           Clear pass/fail criteria    Spectrum-based quality
Rubric             Single PASS criterion       Multiple rubric levels
Few-shots          1-2 examples                2-3 examples (low, mid, high)
Training signal    All-or-nothing reward       Fine-grained gradients
Judge reasoning    Simple: PASS or FAIL        Structured: reasoning + score
Use BinaryJudgeGrader when you have a clear PASS/FAIL threshold (e.g., “Does the output contain valid JSON?”). Use RangeJudgeGrader when quality exists on a spectrum (e.g., “How well does this summary capture the key points?”).
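
As a rough illustration of the "Training signal" row above, the sketch below compares the reward three completions of increasing quality would receive under an all-or-nothing scheme versus a normalized range. The raw scores and the passing threshold are hypothetical, and the two helper functions are not part of the library:
# Illustrative raw judge scores (1-5) for three completions of increasing quality
raw_scores = {"weak": 2, "decent": 3, "strong": 5}

def binary_reward(raw: int, passing_threshold: int = 4) -> float:
    # All-or-nothing: only completions at or above a (hypothetical) threshold get credit
    return 1.0 if raw >= passing_threshold else 0.0

def range_reward(raw: int, lo: int = 1, hi: int = 5) -> float:
    # Normalized range: partial credit proportional to position in the rubric
    return (raw - lo) / (hi - lo)

for name, raw in raw_scores.items():
    print(f"{name}: binary={binary_reward(raw):.1f}, range={range_reward(raw):.2f}")

The binary scheme gives the weak and decent completions identical zero reward, while the range scheme preserves the ordering between them; that is the fine-grained signal the table refers to.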

Key takeaways

  1. RangeJudgeGrader for spectrum-based quality - When binary PASS/FAIL is too coarse, numeric ranges capture degrees of quality
  2. Write explicit criteria - Define what each score level means so the judge has clear guidance
  3. Include numbered evaluation steps - Structured steps guide the judge’s reasoning process
  4. Few-shot examples are essential - Show examples at low, medium, and high scores so the judge understands the full scale
  5. Normalization is on by default - Scores are mapped to [0, 1] by default for consistency with other graders