Graders are scoring systems that evaluate LLM completions and provide quantitative feedback. They serve as the foundation for both training and evaluation workflows.
  • Training: Graders provide reward signals during model training, enabling reinforcement learning techniques
  • Evaluation: Graders score model outputs to assess performance, quality, and alignment with desired behaviors

Types of Graders

Numerical scores can be obtained through a variety of means. To support a wide range of feedback sources and downstream applications, Adaptive Engine supports four distinct types of graders:

1. AI Judge

AI judge graders use an LLM to grade completions against a predefined criterion, effectively turning evaluation into a reasoning-backed binary classification task (pass or fail according to the criterion). Using LLMs as judges enables cheap and effective model evaluation at scale, with little to no dependence on human annotation and its associated labor cost; a minimal sketch follows the list below. Use cases:
  • Go-to grader type for most situations where human annotation would be expensive or time-consuming
  • Evaluations that benefit from consistent, scalable judgment
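
To make the pass/fail mechanics concrete, here is a minimal illustrative sketch of LLM-as-judge grading in plain Python. This is not Adaptive Engine's grader API; the OpenAI client, the judge model name, and the prompt template are all assumptions chosen for the example.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt; a production template would be more careful.
JUDGE_PROMPT = """You are an evaluator. Criterion: {criterion}

Prompt:
{prompt}

Completion:
{completion}

Does the completion satisfy the criterion? Reason briefly, then answer
on the final line with exactly PASS or FAIL."""

def ai_judge(prompt: str, completion: str, criterion: str) -> float:
    """Return 1.0 if the judge model deems the completion passing, else 0.0."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works; this is an example
        temperature=0,        # keep verdicts as deterministic as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                criterion=criterion, prompt=prompt, completion=completion),
        }],
    )
    lines = (response.choices[0].message.content or "").strip().splitlines()
    return 1.0 if lines and "PASS" in lines[-1].upper() else 0.0
```

Asking the judge to reason before emitting the verdict, and parsing only the final line, is what keeps the task a reasoning-backed binary classification rather than a single-token guess.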

2. Pre-built Graders

Pre-built graders are authored by Adaptive ML and designed for common grading tasks. These provide out-of-the-box solutions for standard quality metrics. The current pre-built graders all use LLMs as judges. Available pre-built graders:
  • Faithfulness: Measures a completion’s adherence to the context or documents provided in the prompt.
  • Context Relevancy: Measures the overall relevance of the information presented in supporting context/documents with regard to the prompt.
  • Answer Relevancy: Measures the overall relevance and effectiveness of a completion in answering the user query.
Use cases:
  • Quick setup for standard tasks with optimised scoring from the Adaptive team
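
As a rough illustration of what a faithfulness-style check does, the ai_judge sketch above could be pointed at a criterion like the one below. The actual pre-built prompts are authored and tuned by Adaptive ML; this criterion wording and the toy data are assumptions for the example.

```python
# Hypothetical usage of the ai_judge sketch with a faithfulness-style criterion.
context = "The Eiffel Tower is 330 metres tall."
prompt = f"Context:\n{context}\n\nHow tall is the Eiffel Tower?"
completion = "It is 330 metres tall."

score = ai_judge(
    prompt,
    completion,
    criterion="The completion only states facts supported by the "
              "context provided in the prompt.",
)
print(score)  # 1.0 if judged faithful, 0.0 otherwise
```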

3. Reward Server

Reward servers allow you to integrate your own grading systems with Adaptive Engine. Interactions are sent to your external system, which returns reward scores via API. Use cases:
  • Complex setup and grading logic
  • Any grading requiring an external system like a database, simulated environment or sandbox
Prefer custom graders in all situations where only custom logic is required for grading, and reserve reward servers for cases where an external system or resource is needed to grade a completion.
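
For a sense of the shape such a service might take, here is a minimal sketch of a reward server built with FastAPI. The route, payload fields, and response schema are illustrative assumptions, not Adaptive Engine's actual reward-server contract.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Interaction(BaseModel):
    # Assumed payload fields; the real reward-server schema may differ.
    prompt: str
    completion: str
    metadata: dict = {}

@app.post("/score")
def score(interaction: Interaction) -> dict:
    # A real server would consult its external resource here: query a
    # database, run the completion in a sandbox, step a simulated
    # environment, etc. The toy logic below just rewards non-empty output.
    reward = 1.0 if interaction.completion.strip() else 0.0
    return {"reward": reward}

# Run with: uvicorn reward_server:app --port 8000
```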

4. Custom Grader

Custom graders are defined in Python code directly within your recipes, and integrate with the Adaptive platform to log and display the scores generated by your custom logic. Use cases:
  • Custom code that does not require external resources, such as API calls
  • Simple Python function-based grading, such as checking part of a completion against a ground truth attached as interaction metadata (useful for classification, for example; see the sketch after this list)
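
As a rough sketch of that second use case, a custom grader can be as small as a single function comparing the completion to a ground-truth label stored in interaction metadata. The function signature and the "label" metadata key are assumptions for illustration; adapt them to however your recipe attaches ground truth to interactions.

```python
def exact_match_grader(completion: str, interaction_metadata: dict) -> float:
    """Score 1.0 when the completion matches the ground-truth label, else 0.0.

    The "label" metadata key and this signature are illustrative, not the
    exact interface recipes expect.
    """
    predicted = completion.strip().lower()
    expected = str(interaction_metadata.get("label", "")).strip().lower()
    return 1.0 if predicted == expected else 0.0

# Example: grading a sentiment-classification completion.
print(exact_match_grader("Positive", {"label": "positive"}))  # 1.0
print(exact_match_grader("Neutral", {"label": "positive"}))   # 0.0
```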