## Prebuilt Judges
Harmony provides several prebuilt scorers that you can use immediately in your recipes.

### BinaryJudgeScorer
The `BinaryJudgeScorer` evaluates outputs against specific criteria and returns a binary 1/0 score. All of the response formatting and parsing logic is already implemented in the class; you only need to write your criteria in the `criteria` field.
**Use a BinaryJudgeScorer**
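A minimal sketch of configuring the scorer. The import path and anything other than the `criteria` field are assumptions, since only `criteria` is described above:

```python
from harmony.scoring import BinaryJudgeScorer  # import path assumed

# Only `criteria` is documented above; other constructor arguments
# (such as which judge model to use) would be set per your deployment.
scorer = BinaryJudgeScorer(
    criteria=(
        "The response directly answers the user's question and does not "
        "contradict the provided context."
    ),
)
```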
### FaithfulnessScorer
The `FaithfulnessScorer` evaluates how faithful a response is to the input context. It scores each sentence in the last assistant turn as either fully supported by the context or not (1 or 0); the final score is the average over all sentences. The context is the rest of the thread, excluding the system prompt. The scorer requires an input language code to help split the response into sentences.
**Use a FaithfulnessScorer**
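A minimal sketch, assuming the same import path as above; the name of the language-code parameter (`language` here) is an assumption:

```python
from harmony.scoring import FaithfulnessScorer  # import path assumed

# The language code helps the scorer split the last assistant turn into
# sentences; the parameter name `language` is an assumption.
scorer = FaithfulnessScorer(language="en")
```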
### CombinedScorer
The `CombinedScorer` allows you to combine multiple scorers into a single scoring function:
**Use the CombinedScorer**
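A sketch combining two elementary judges into one scoring function; the `scorers` keyword and the import paths are assumptions:

```python
from harmony.scoring import BinaryJudgeScorer, CombinedScorer  # import paths assumed

# Each sub-scorer checks one elementary criterion; the combined scorer
# merges their results into a single score.
scorer = CombinedScorer(
    scorers=[
        BinaryJudgeScorer(criteria="The answer is written in a polite tone."),
        BinaryJudgeScorer(criteria="The answer only uses facts from the context."),
    ],
)
```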
## Creating Your Own Scorer
You can create custom scorers by implementing the `Scorer` class. Here's how to build your own scorer:

The `Scorer` abstract class has a single async method, `score`, to implement, which should return a `ScoreWithMetadata` object. Both `Scorer` and `ScoreWithMetadata` are imported from `harmony.scoring.base_scorer`.
**Define a custom scorer from the `Scorer` class**
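A sketch of a subclass. What is stated above is that `score` is async and returns a `ScoreWithMetadata`; the method's argument (the completion text here) and the `ScoreWithMetadata` fields (`score`, `metadata`) are assumptions:

```python
from harmony.scoring.base_scorer import Scorer, ScoreWithMetadata

class KeywordScorer(Scorer):
    """Scores 1.0 if a required keyword appears in the completion, else 0.0."""

    def __init__(self, keyword: str):
        self.keyword = keyword

    # The argument shape is an assumption; here we take the completion text.
    async def score(self, completion: str) -> ScoreWithMetadata:
        found = self.keyword.lower() in completion.lower()
        return ScoreWithMetadata(
            score=1.0 if found else 0.0,
            metadata={"keyword": self.keyword, "found": found},
        )
```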
Alternatively, you can build a scorer from a plain function using the `from_function` classmethod. In this case, the scoring function can simply return a float.
**Define a custom scorer from a custom function**
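A sketch using `from_function`; returning a plain float is stated above, while the wrapped function's argument shape (the completion text) is an assumption:

```python
from harmony.scoring.base_scorer import Scorer

def length_reward(completion: str) -> float:
    """Reward concise answers: 1.0 at zero length, 0.0 at 2000+ characters."""
    return max(0.0, 1.0 - len(completion) / 2000)

# Wrap the plain function into a Scorer via the classmethod described above.
scorer = Scorer.from_function(length_reward)
```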
## Formatting and Parsing Tooling
Current models in the Adaptive Engine do not support native structured output. To help you define formatted model outputs and parse them for more robust judges, we provide tooling functions.

### Render Schema from Pydantic Model
The `model.render_schema` function takes a Pydantic model and generates a schema description that can be used to instruct the model on the expected output format.
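A sketch of building a judge prompt from a rendered schema; the import path of `model` is an assumption, and the exact shape of the rendered text depends on the function's output:

```python
from pydantic import BaseModel

from harmony.scoring import model  # import path assumed

class Verdict(BaseModel):
    reasoning: str
    passed: bool

# Render a textual schema description and embed it in the judge prompt so
# the model knows the expected output format.
schema_text = model.render_schema(Verdict)
judge_prompt = (
    "Evaluate the answer below and respond in exactly this format:\n\n"
    + schema_text
)
```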
### Generate and Validate with Retry
The `generate_and_validate` function generates an answer with the model, validates the completion against a Pydantic model, and retries if validation fails.
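A sketch of a judge call built on this helper; the import path and all keyword names below are assumptions based on the description above:

```python
from pydantic import BaseModel

from harmony.scoring import generate_and_validate  # import path assumed

class Verdict(BaseModel):
    reasoning: str
    passed: bool

async def judge(judge_model, prompt: str) -> Verdict:
    # Generate with the model, validate the completion against `Verdict`,
    # and retry on validation failure; keyword names are assumptions.
    return await generate_and_validate(
        model=judge_model,
        prompt=prompt,
        output_model=Verdict,
        max_retries=3,
    )
```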
## Using Scorers in Training
Scorers are commonly used in training to provide feedback for model improvement:
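A sketch of turning a scorer into a reward signal; how a recipe actually consumes the scorer is an assumption, and `reward_fn` is a hypothetical hook:

```python
from harmony.scoring import BinaryJudgeScorer  # import path assumed

scorer = BinaryJudgeScorer(
    criteria="The response follows the requested output format.",
)

async def reward_fn(completions: list[str]) -> list[float]:
    """Hypothetical reward hook: score each sampled completion.

    The `score` argument shape and how a recipe consumes this function
    are assumptions, not documented interfaces.
    """
    results = [await scorer.score(c) for c in completions]
    return [r.score for r in results]
```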
## Scoring Best Practices

- **Clear Criteria**: Define specific, measurable criteria for your scorers
- **Elementary Criteria**: Splitting the rules to judge into several scorers, each with a single elementary criterion, gives better results
- **Consistent Evaluation**: Ensure your scorer produces consistent results, and test it with known good and bad examples
- **Appropriate Models**: Use a judge model suited to your evaluation task
- **Error Handling**: Implement proper error handling in custom scorers