- Training: Graders provide reward signals during model training, enabling reinforcement learning techniques
- Evaluation: Graders score model outputs to assess performance, quality, and alignment with desired behaviors
Types of Graders
Numerical scores can be obtained by different means. To support a wide range of feedback sources and downstream applications, Adaptive Engine supports four distinct types of graders:
1. AI Judge
AI judge graders use an LLM to grade completions against a predefined criterion, effectively turning evaluation into a reasoning-backed binary classification task (pass or fail according to the criterion). Using LLMs as judges enables cheap and effective model evaluation at scale, with little to no dependence on human annotation and its labor cost.
Use cases:
- Go-to grader type for most situations, where human annotation would be expensive or time-consuming
- Evaluations that benefit from consistent, scalable judgment
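The pass/fail framing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Adaptive Engine's implementation: the prompt template, the `judge_completion` function, and the `fake_llm` stand-in are all hypothetical, and a real judge would call an actual model where `fake_llm` is used.

```python
from typing import Callable

# Hypothetical judge prompt; the wording is an assumption for illustration.
JUDGE_TEMPLATE = (
    "You are grading a model completion against one criterion.\n"
    "Criterion: {criterion}\n"
    "Prompt: {prompt}\n"
    "Completion: {completion}\n"
    "Explain your reasoning briefly, then finish with a final line that is "
    "exactly PASS or FAIL."
)


def judge_completion(
    llm: Callable[[str], str], criterion: str, prompt: str, completion: str
) -> float:
    """Return 1.0 if the judge's final line is PASS, otherwise 0.0."""
    reply = llm(
        JUDGE_TEMPLATE.format(
            criterion=criterion, prompt=prompt, completion=completion
        )
    )
    # The reasoning comes first; only the last non-empty line is the verdict.
    lines = [line.strip() for line in reply.splitlines() if line.strip()]
    verdict = lines[-1].upper() if lines else ""
    return 1.0 if verdict == "PASS" else 0.0


def fake_llm(judge_prompt: str) -> str:
    """Stand-in for a real model call: passes completions that mention Paris."""
    completion_line = next(
        line for line in judge_prompt.splitlines()
        if line.startswith("Completion:")
    )
    if "Paris" in completion_line:
        return "The completion names the correct city.\nPASS"
    return "The completion does not name the correct city.\nFAIL"


score = judge_completion(
    fake_llm,
    criterion="The answer names the capital of France.",
    prompt="What is the capital of France?",
    completion="It is Paris.",
)
```

Asking for the reasoning before the verdict keeps the judgment "reasoning-backed", while parsing only the final line keeps the score strictly binary. Anything other than a clean PASS is treated as a fail here, which is one possible design choice among several.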
2. Pre-built Graders
Pre-built graders are authored by Adaptive ML and designed for common grading tasks. They provide out-of-the-box solutions for standard quality metrics. Current pre-built graders use LLMs as judges.
Available pre-built graders:
- Faithfulness: Measures a completion’s adherence to the context or documents provided in the prompt.
- Context Relevancy: Measures the overall relevance of the information presented in supporting context/documents with regard to the prompt.
- Answer Relevancy: Measures the overall relevance/effectiveness of a completion when it comes to answering the user query.
Use cases:
- Quick setup for standard tasks with optimised scoring from the Adaptive team
3. Reward Server
Reward servers allow you to integrate your own grading systems with Adaptive Engine. You can send interactions to your external system and return reward scores via API.
Use cases:
- Complex setup and grading logic
- Any grading requiring an external system like a database, simulated environment or sandbox
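As a rough illustration of the reward server idea, the sketch below exposes a grading function over HTTP using only the Python standard library. The request and response shapes (`completion` and `max_words` in, `reward` out), the `score_interaction` function, and the `make_server` helper are assumptions made for this example; they are not the schema or SDK that Adaptive Engine actually expects, so consult the platform reference for the real contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score_interaction(interaction: dict) -> float:
    """Toy grading logic: reward non-empty completions within a word budget.

    In a real reward server, this is where a database, simulator, or sandbox
    would be queried.
    """
    completion = interaction.get("completion", "")
    max_words = interaction.get("max_words", 50)
    return 1.0 if 0 < len(completion.split()) <= max_words else 0.0


class RewardHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            # Hypothetical payload: a JSON object describing one interaction.
            interaction = json.loads(self.rfile.read(length) or b"{}")
            payload = {"reward": score_interaction(interaction)}
            status = 200
        except (ValueError, AttributeError, TypeError):
            payload = {"error": "invalid request body"}
            status = 400
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(host: str = "127.0.0.1", port: int = 8000) -> HTTPServer:
    """Build the server without starting it; call serve_forever() to run."""
    return HTTPServer((host, port), RewardHandler)
```

Calling `make_server().serve_forever()` would start the server locally. Keeping the scoring logic in a plain function, separate from the HTTP handler, makes it easy to test without any network traffic.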
4. Custom Grader
Custom graders are defined in Python code directly within your recipes, and integrate with the Adaptive platform to log and display the scores your custom logic generates.
Use cases:
- Custom code that does not require local external resources, such as API calls
- Simple Python function-based grading, such as checking part of a completion against a ground truth attached as interaction metadata (useful for classification, for example)
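The ground-truth check mentioned in the last use case might look like the function below. It is a minimal sketch under stated assumptions: the function name, the `ground_truth` metadata key, and the convention that the predicted label sits on the last non-empty line of the completion are all hypothetical, and the way such a function is registered as a custom grader inside a recipe is not shown here.

```python
def grade_classification(completion: str, metadata: dict) -> float:
    """Score 1.0 when the predicted label matches the ground-truth label."""
    # Assumed metadata key; the real key depends on how interactions are logged.
    ground_truth = str(metadata.get("ground_truth", "")).strip().lower()

    # Assume the label is on the last non-empty line of the completion.
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    predicted = lines[-1].lower().rstrip(".") if lines else ""

    # A missing ground truth never counts as a match.
    return 1.0 if ground_truth and predicted == ground_truth else 0.0


reward = grade_classification(
    "The review praises the product.\nPositive.",
    {"ground_truth": "positive"},
)
```

Normalising case and trailing punctuation before comparing avoids penalising a correct label for superficial formatting differences.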