## Prebuilt Judges
Harmony provides several prebuilt scorers that you can use immediately in your recipes.

### BinaryJudgeScorer
The `BinaryJudgeScorer` evaluates outputs against specific criteria and returns a binary 1/0 score. All of the response formatting and parsing logic is already implemented in the class; you only need to write your criteria in the `criteria` field.
**Use a BinaryJudgeScorer**
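A minimal sketch of configuring the scorer. The import path and anything other than the `criteria` field are assumptions, since only `criteria` is described above:

```python
from harmony.scoring import BinaryJudgeScorer  # import path assumed

# Only `criteria` is documented above; other constructor arguments
# (such as which judge model to use) would be set per your deployment.
scorer = BinaryJudgeScorer(
    criteria=(
        "The response directly answers the user's question and does not "
        "contradict the provided context."
    ),
)
```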
### FaithfulnessScorer
The `FaithfulnessScorer` evaluates how faithful a response is to the input context. It scores each sentence in the last assistant turn as either fully supported by the context or not (1 or 0); the final score is the average over all sentences. The context is the rest of the thread, excluding the system prompt. The scorer requires an input language code to help split the response into sentences.
**Use a FaithfulnessScorer**
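A minimal sketch, assuming the same import path as above; the name of the language-code parameter (`language` here) is an assumption:

```python
from harmony.scoring import FaithfulnessScorer  # import path assumed

# The language code helps the scorer split the last assistant turn into
# sentences; the parameter name `language` is an assumption.
scorer = FaithfulnessScorer(language="en")
```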
### CombinedScorer
The `CombinedScorer` allows you to combine multiple scorers into a single scoring function:
**Use the CombinedScorer**
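A sketch combining two elementary judges into one scoring function; the `scorers` keyword and the import paths are assumptions:

```python
from harmony.scoring import BinaryJudgeScorer, CombinedScorer  # import paths assumed

# Each sub-scorer checks one elementary criterion; the combined scorer
# merges their results into a single score.
scorer = CombinedScorer(
    scorers=[
        BinaryJudgeScorer(criteria="The answer is written in a polite tone."),
        BinaryJudgeScorer(criteria="The answer only uses facts from the context."),
    ],
)
```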
## Creating Your Own Scorer
You can create custom scorers by implementing the `Scorer` class. Here's how to build your own scorer:

The `Scorer` abstract class has a single async method, `score`, to implement, which should return a `ScoreWithMetadata` object. Both `Scorer` and `ScoreWithMetadata` are imported from `harmony.scoring.base_scorer`.
**Define a custom scorer from the `Scorer` class**
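A sketch of a subclass. What is stated above is that `score` is async and returns a `ScoreWithMetadata`; the method's argument (the completion text here) and the `ScoreWithMetadata` fields (`score`, `metadata`) are assumptions:

```python
from harmony.scoring.base_scorer import Scorer, ScoreWithMetadata

class KeywordScorer(Scorer):
    """Scores 1.0 if a required keyword appears in the completion, else 0.0."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    # The argument shape is an assumption; here we take the completion text.
    async def score(self, completion: str) -> ScoreWithMetadata:
        found = self.keyword.lower() in completion.lower()
        return ScoreWithMetadata(
            score=1.0 if found else 0.0,
            metadata={"keyword": self.keyword, "found": found},
        )
```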
Alternatively, you can build a scorer from a plain function using the `from_function` classmethod. In this case, the scoring function can simply return a float.
**Define a custom scorer from a custom function**
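A sketch using `from_function`; returning a plain float is stated above, while the wrapped function's argument shape (the completion text) is an assumption:

```python
from harmony.scoring.base_scorer import Scorer

def length_reward(completion: str) -> float:
    """Reward concise answers: 1.0 at zero length, 0.0 at 2000+ characters."""
    return max(0.0, 1.0 - len(completion) / 2000)

# Wrap the plain function into a Scorer via the classmethod described above.
scorer = Scorer.from_function(length_reward)
```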
## Formatting and Parsing Tooling
Current models in the Adaptive Engine do not support native structured output. To help you define formatted model outputs and parse them for more robust judges, we provide tooling functions.

### Render Schema from Pydantic Model
The `model.render_schema` function takes a Pydantic model and generates a schema description that can be used to instruct the model on the expected output format.
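A sketch of building a judge prompt from a rendered schema; the import path of `model` is an assumption, and the exact shape of the rendered text depends on the function's output:

```python
from pydantic import BaseModel

from harmony.scoring import model  # import path assumed

class Verdict(BaseModel):
    reasoning: str
    passed: bool

# Render a textual schema description and embed it in the judge prompt so
# the model knows the expected output format.
schema_text = model.render_schema(Verdict)
judge_prompt = (
    "Evaluate the answer below and respond in exactly this format:\n\n"
    + schema_text
)
```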
### Generate and Validate with Retry
The `generate_and_validate` function generates an answer with the model, validates the completion against a Pydantic model, and retries if validation fails.
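A sketch of a judge call built on this helper; the import path and all keyword names below are assumptions based on the description above:

```python
from pydantic import BaseModel

from harmony.scoring import generate_and_validate  # import path assumed

class Verdict(BaseModel):
    reasoning: str
    passed: bool

async def judge(judge_model, prompt: str) -> Verdict:
    # Generate with the model, validate the completion against `Verdict`,
    # and retry on validation failure; keyword names are assumptions.
    return await generate_and_validate(
        model=judge_model,
        prompt=prompt,
        output_model=Verdict,
        max_retries=3,
    )
```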
## Using Scorers in Training
Scorers are commonly used in training to provide feedback for model improvement:
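A sketch of turning a scorer into a reward signal; how a recipe actually consumes the scorer is an assumption, and `reward_fn` is a hypothetical hook:

```python
from harmony.scoring import BinaryJudgeScorer  # import path assumed

scorer = BinaryJudgeScorer(
    criteria="The response follows the requested output format.",
)

async def reward_fn(completions: list[str]) -> list[float]:
    """Hypothetical reward hook: score each sampled completion.

    The `score` argument shape and how a recipe consumes this function
    are assumptions, not documented interfaces.
    """
    results = [await scorer.score(c) for c in completions]
    return [r.score for r in results]
```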
## Scoring Best Practices

- **Clear Criteria**: Define specific, measurable criteria for your scorers
- **Elementary Criteria**: Splitting the rules to judge into several scorers, each with a single elementary criterion, gives better results
- **Consistent Evaluation**: Ensure your scorer produces consistent results, and test it with known good and bad examples
- **Appropriate Models**: Use a judge model suited to your evaluation task
- **Error Handling**: Implement proper error handling in custom scorers