In Adaptive Engine, you can evaluate models to determine which off-the-shelf model performs best on your task, or how much your fine-tuned model improved over its base model or other models after training.

Launch an evaluation

Evaluation is done by launching a run of an evaluation recipe, i.e. a recipe that produces an EvaluationArtefact. Adaptive provides a built-in recipe that supports most evaluation use cases. For more tailored usage, for example to use custom graders, you can create your own recipe with your own completion-grading strategy by following our guides Custom Graders and Write an Evaluation Recipe. The SDK snippet below shows an evaluation run scoring 3 models with 2 graders.
Adaptive SDK
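# `adaptive` is assumed to be an already-initialized Adaptive SDK client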
adaptive.jobs.run(
    recipe_key='eval',
    num_gpus=2,
    compute_pool='fraud-team-pool',
    args={
        'dataset': 'fraud-eval-prompts',
        'eval_model_tp': 1,  # tensor parallelism for the evaluated models
        'graders': [
            'fraud-ai-judge',
            'speech-clarity-ai-judge'
        ],
        'judge_model_tp': 2, # applies only if the AI judge is a local model
        'models_to_evaluate': [
            'fraud-agent-grpo-v7',
            'ext_gpt-5-nano-olivier',
            'gemma-3-4b'
        ]
    }
)

Visualize results

Once an evaluation run is finished, it will produce an evaluation artifact. This will contain:
  1. An evaluation score table that summarises each model's scores for the graders used during the eval.
  2. A detailed list of interactions for all graded samples in the evaluated dataset.

Drilling down to individual interactions

Evaluation results can also be explored in the interaction store by filtering on the evaluation artifact ID. Interactions can then be analyzed in the ways described below.

Grouping by model

Grouping by model lets you stack-rank models by the aggregate metric of interest.
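As an illustration of the same stack-ranking outside the UI, the sketch below assumes you have exported the graded interactions to a CSV (a hypothetical evaluation_interactions.csv with model, grader, and score columns; the actual export format may differ) and aggregates them with pandas:

import pandas as pd

# Hypothetical export of graded interactions: one row per (model, prompt, grader).
interactions = pd.read_csv("evaluation_interactions.csv")

# Average each model's score per grader, then stack-rank by overall mean score.
ranking = (
    interactions
    .groupby(["model", "grader"])["score"]
    .mean()
    .unstack("grader")                               # one column per grader
    .assign(mean_score=lambda df: df.mean(axis=1))   # aggregate metric of interest
    .sort_values("mean_score", ascending=False)
)
print(ranking)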

Grouping by prompt

Grouping by prompt lets you find prompts on which models agree or disagree. You can then click the fork icon to open a side-by-side view.
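To get a rough sense of agreement programmatically, the same hypothetical export can be used to rank prompts by the spread of model scores, where a high standard deviation indicates the models disagree:

import pandas as pd

interactions = pd.read_csv("evaluation_interactions.csv")  # same hypothetical export as above

# One score per (prompt, model), then the spread of those scores across models.
disagreement = (
    interactions
    .groupby(["prompt", "model"])["score"].mean()
    .groupby("prompt").std()
    .sort_values(ascending=False)
)
print(disagreement.head(10))  # prompts with the strongest disagreement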

Side-by-side comparison

This view lets you compare the responses and metrics of two models that were given the same prompt.
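A similar pairwise view can be sketched from the hypothetical export by pivoting two models' responses and scores next to each other, one row per prompt (the response column name is an assumption):

import pandas as pd

interactions = pd.read_csv("evaluation_interactions.csv")  # same hypothetical export as above

# Keep two models and place their responses and scores side by side per prompt.
pair = interactions[interactions["model"].isin(["fraud-agent-grpo-v7", "gemma-3-4b"])]
side_by_side = pair.pivot_table(
    index="prompt",
    columns="model",
    values=["response", "score"],
    aggfunc="first",
)
print(side_by_side.head())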