Scorers evaluate model outputs based on specific criteria and provide feedback for training.
Prebuilt Judges
Harmony provides several prebuilt scorers that you can use immediately in your recipes.
BinaryJudgeScorer
The BinaryJudgeScorer evaluates outputs against specific criteria and returns a binary 1/0 score.
All the response formatting and parsing logic is already implemented in the class; you only need to write your criteria in the criteria field.
from adaptive_harmony.scoring.binary_judge_scorer import BinaryJudgeScorer
# Define your evaluation criteria
criteria = """**Quality Check**:
- The response should not be longer than 5 sentences
- The response should not use technical jargon"""
# Create the scorer
judge_model = await client.model("openai://gpt-4o").spawn_inference("judge")
scorer = BinaryJudgeScorer(model=judge_model, criteria=criteria)
# Use the scorer
score = await scorer.score("What is machine learning?", "Machine learning is a subset of AI...")
print(score)  # Binary result: 1 if the response meets the criteria, 0 otherwise
FaithfulnessScorer
The FaithfulnessScorer evaluates how faithful a response is to the input context.
It scores each sentence of the last assistant turn as either fully supported by the context or not (1 or 0); the final score is the average of these per-sentence scores.
The context is the rest of the thread, excluding the system prompt.
The scorer requires an input language code to help split the response into sentences.
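For intuition, the final score is just the mean of the per-sentence labels. A minimal illustration of that arithmetic (not the scorer's actual implementation):
# Suppose the judge labels four sentences of the assistant turn
sentence_labels = [1, 1, 0, 1]  # 1 = supported by the context, 0 = not supported
faithfulness = sum(sentence_labels) / len(sentence_labels)  # 0.75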
from adaptive_harmony.scoring.faithfulness_judge_scorer import FaithfulnessScorer
# Create faithfulness scorer
faithfulness_scorer = FaithfulnessScorer(model=judge_model, language="en")
# Use the scorer
context = "The capital of France is Paris."
response = "Paris is the capital of France."
score = await faithfulness_scorer.score(context, response)
print(score) # Higher score for more faithful responses
CombinedScorer
The CombinedScorer allows you to combine multiple scorers into a single scoring function:
from adaptive_harmony.scoring.combined_scorer import CombinedScorer
# Create multiple scorers
faithfulness_scorer = FaithfulnessScorer(model=judge_model, language="en")
jargon_scorer = BinaryJudgeScorer(model=judge_model, criteria="The response should not use technical jargon")
conciseness_scorer = BinaryJudgeScorer(model=judge_model, criteria="The response should not be longer than 5 sentences")
# Combine them
combined_scorer = CombinedScorer(scorers=[faithfulness_scorer, jargon_scorer, conciseness_scorer])
# Use the combined scorer
score = await combined_scorer.score(context, response)
print(score) # Combined score from all scorers
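CombinedScorer handles the aggregation of the sub-scores for you. If you need explicit control over how they are combined (for example, a weighted average), you can also build your own combiner on top of the Scorer base class described in the next section. A minimal sketch, not part of the library; the weighting scheme is purely illustrative:
from adaptive_harmony.scoring.base_scorer import Scorer, ScoreWithMetadata

class WeightedCombinedScorer(Scorer):
    def __init__(self, scorers, weights):
        self._scorers = scorers
        self._weights = weights

    async def score(self, sample) -> ScoreWithMetadata:
        # Score the sample with every sub-scorer, then take a weighted average
        sub_scores = [(await scorer.score(sample)).score for scorer in self._scorers]
        weighted = sum(w * s for w, s in zip(self._weights, sub_scores))
        return ScoreWithMetadata(score=weighted / sum(self._weights))

# Example: weight faithfulness twice as heavily as the two style checks
weighted_scorer = WeightedCombinedScorer(
    scorers=[faithfulness_scorer, jargon_scorer, conciseness_scorer],
    weights=[2.0, 1.0, 1.0],
)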
Creating Your Own Scorer
You can create custom scorers by subclassing the Scorer class. Here's how to build your own scorer:
The Scorer abstract class has a single async method to implement, score, which should return a ScoreWithMetadata object.
Both Scorer and ScoreWithMetadata are imported from adaptive_harmony.scoring.base_scorer.
Define a custom scorer from the Scorer class
from adaptive_harmony.scoring.base_scorer import Scorer, ScoreWithMetadata
class CustomScorer(Scorer):
    def __init__(self, threshold: int = 50):
        self._threshold = threshold

    async def score(self, sample: StringThread) -> ScoreWithMetadata:
        # Implement your scoring logic here
        # This simple example rewards threads longer than the threshold
        if len(sample) > self._threshold:
            return ScoreWithMetadata(score=1.0)
        else:
            return ScoreWithMetadata(score=0.0)
You can also define a Scorer directly from a function with the from_function classmethod. In this case, the scoring function can simply return a float instead of a ScoreWithMetadata object.
Define a custom scorer from a custom function
from adaptive_harmony.scoring.base_scorer import Scorer
async def score_fn(sample: StringThread) -> float:
    # Implement your scoring logic here
    # This simple example rewards threads longer than 50 characters
    if len(sample) > 50:
        return 1.0
    else:
        return 0.0

CustomScorer = Scorer.from_function(score_fn)
Current models in the Adaptive Engine do not support native structured output. To help you define the output format you expect from the model and parse its completions, so you can build more robust judges, we provide the following tooling functions.
Render Schema from Pydantic Model
The render_schema function takes a Pydantic model and generates a schema description that can be used to instruct the model on the expected output format.
from pydantic import BaseModel, Field
from adaptive_harmony.core.structured_output import render_schema
class EvaluationResult(BaseModel):
    strengths: list[str] = Field(description="List of strengths in the response")
    weaknesses: list[str] = Field(description="List of areas for improvement")
    completeness_score: float = Field(description="Completeness score from 0.0 to 1.0")
    conciseness_score: float = Field(description="Conciseness score from 0.0 to 1.0")
# Render the schema
schema = render_schema(EvaluationResult)
print(schema)
The rendered schema will look something like:
{
"strengths": [str],
"weaknesses": [str],
"completeness_score": float,
"conciseness_score": float
}
strengths: List of strengths in the response
weaknesses: List of areas for improvement
completeness_score: Completeness score from 0.0 to 1.0
conciseness_score: Conciseness score from 0.0 to 1.0
You can then insert this schema into your judge's scoring prompt to get the right output format from the model.
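For example, you might embed the rendered schema in a judge prompt along the lines of the sketch below; the prompt wording is illustrative, and response_text stands in for whatever text you want judged:
# Build a judge prompt around the rendered schema (illustrative wording only)
judge_prompt = f"""Evaluate the assistant response below.

Return ONLY a JSON object that matches this schema:
{schema}

Response to evaluate:
{response_text}"""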
Generate and Validate with Retry
The generate_and_validate function generates an answer with the model, validates the completion against a Pydantic model, and retries if validation fails.
from pydantic import BaseModel, Field
from adaptive_harmony.core.structured_output import generate_and_validate

# Define your Pydantic model
class EvaluationResult(BaseModel):
    strengths: list[str] = Field(description="List of strengths in the response")
    weaknesses: list[str] = Field(description="List of areas for improvement")
    completeness_score: float = Field(description="Completeness score from 0.0 to 1.0")
    conciseness_score: float = Field(description="Conciseness score from 0.0 to 1.0")

try:
    result = await generate_and_validate(
        model=model,
        prompt=prompt,
        pydantic_model=EvaluationResult,
        max_retries=3
    )
    print(f"Strengths: {result.strengths}")
    print(f"Weaknesses: {result.weaknesses}")
    print(f"Conciseness: {result.conciseness_score}")
    print(f"Completeness: {result.completeness_score}")
except Exception as e:
    print(f"Failed to generate valid output after retries: {e}")
Using Scorers in Training
Scorers are commonly used in training to provide feedback for model improvement:
from adaptive_harmony.common import PPO
from adaptive_harmony.scoring.binary_judge_scorer import BinaryJudgeScorer
from adaptive_harmony.scoring.faithfulness_judge_scorer import FaithfulnessScorer
from adaptive_harmony.scoring.combined_scorer import CombinedScorer

@recipe_main
async def training_recipe(ctx: RecipeContext):
    client = ctx.client

    # Load models
    policy_model = await client.model("my-model").spawn_train("policy", 4096)
    value_model = await client.model("my-model").spawn_train("value", 4096)
    judge_model = await client.model("openai://gpt-4o").spawn_inference("judge")

    # Create scorers
    quality_scorer = BinaryJudgeScorer(model=judge_model, criteria=quality_criteria)
    faithfulness_scorer = FaithfulnessScorer(model=judge_model, language="en")

    # Combine scorers
    combined_scorer = CombinedScorer(scorers=[quality_scorer, faithfulness_scorer])

    # Use in PPO training
    await PPO(
        training_dataset,
        policy_model,
        value_model,
        scoring_fn=combined_scorer.score_without_metadata,
        logger=logger,
        max_num_ppo_steps=100,
    ).run()
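Note that the recipe passes combined_scorer.score_without_metadata rather than score: as the name suggests, it returns only the numeric score, without the ScoreWithMetadata wrapper, which is the shape the PPO scoring_fn expects here.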
Scoring Best Practices
- Clear Criteria: Define specific, measurable criteria for your scorers
- Elementary Criteria: Splitting a compound rubric across several scorers, each judging one elementary criterion, tends to give better results
- Consistent Evaluation: Ensure your scorers produce consistent results, and test them with known good and bad examples
- Appropriate Models: Use judge models that are capable enough for your evaluation task
- Error Handling: Implement proper error handling in custom scorers (see the sketch below)
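As a sketch of the last point, you can wrap any scorer so that judge failures degrade to a fallback score instead of crashing a training run; the fallback value, and whether to re-raise instead, are choices to make for your own recipe:
from adaptive_harmony.scoring.base_scorer import Scorer, ScoreWithMetadata

class SafeScorer(Scorer):
    def __init__(self, inner: Scorer, fallback: float = 0.0):
        self._inner = inner
        self._fallback = fallback

    async def score(self, sample) -> ScoreWithMetadata:
        try:
            return await self._inner.score(sample)
        except Exception as e:
            # Log the failure and fall back to a neutral score
            print(f"Scoring failed, using fallback: {e}")
            return ScoreWithMetadata(score=self._fallback)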