Write your own recipe for model evaluation and log eval artifacts to the product. The recipe produces an `EvaluationArtifact`, which is populated with several `EvalSample`s (the evaluated model's completions with attached grades). When you visit the Adaptive UI, the `EvaluationArtifact` will show aggregate evaluation scores per model for all graders selected or created in the evaluation.
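To make the data model concrete, here is a minimal, illustrative sketch of how an artifact aggregates grades per model and per grader. The class and field names below are stand-ins invented for this example, not the SDK's actual `EvalSample` and `EvaluationArtifact` definitions (the real class is shown at the end of this page).

```python
from dataclasses import dataclass, field
from statistics import mean

# Illustrative stand-ins for the SDK's EvalSample / EvaluationArtifact.
# Field names here are assumptions made for this sketch, not the real schema.
@dataclass
class EvalSampleSketch:
    model: str
    completion: str
    grades: dict[str, float] = field(default_factory=dict)  # grader_key -> score


@dataclass
class EvaluationArtifactSketch:
    eval_samples: list[EvalSampleSketch] = field(default_factory=list)

    def aggregate_scores(self) -> dict[tuple[str, str], float]:
        """Mean score per (model, grader_key): roughly what the UI table shows."""
        buckets: dict[tuple[str, str], list[float]] = {}
        for sample in self.eval_samples:
            for grader_key, score in sample.grades.items():
                buckets.setdefault((sample.model, grader_key), []).append(score)
        return {key: mean(scores) for key, scores in buckets.items()}
```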
Start by defining a `config` with the datasets, models, and graders we refer to in the next steps.
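A minimal config sketch is shown below; the field names, types, and example values are assumptions made for illustration, not the SDK's actual config schema.

```python
from dataclasses import dataclass

# A minimal config sketch. Field names and example values are assumptions
# made for this illustration, not the platform's actual config schema.
@dataclass
class EvalConfig:
    dataset_keys: list[str]   # datasets to evaluate on
    model_keys: list[str]     # models whose completions will be graded
    grader_keys: list[str]    # graders applied to every completion
    max_samples_per_dataset: int = 200


config = EvalConfig(
    dataset_keys=["customer-support-test"],
    model_keys=["llama-3.1-8b-instruct", "my-finetuned-model"],
    grader_keys=["helpfulness-judge", "my-recipe-grader"],
)
```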
Next, generate the `EvalSample` `Interaction`s by spawning each model, running inference to obtain completions, and collecting the results.
Use the `spawn_inference` method to create an inference instance for each model. After generating completions, deallocate the model to optimize memory usage. `async_map_fallible` is a utility function that processes all samples concurrently and skips completions that fail with errors (for example, when a sample exceeds the model's maximum sequence length).
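A rough sketch of this step follows. `spawn_inference` and `async_map_fallible` are the SDK utilities named above, but their exact signatures, as well as the `.generate()` and `.kill()` methods used on the spawned model, are assumptions made for this illustration; both utilities are passed in as parameters so the sketch stays self-contained.

```python
# Sketch of the inference loop. spawn_inference and async_map_fallible are the
# SDK utilities mentioned above; their signatures and the .generate() / .kill()
# methods on the spawned model are assumptions made for this illustration.
async def collect_completions(model_keys, prompts, spawn_inference, async_map_fallible):
    samples = []
    for model_key in model_keys:
        model = await spawn_inference(model_key)  # one inference instance per model

        async def complete_one(prompt, model=model, model_key=model_key):
            completion = await model.generate(prompt)  # assumed method name
            return {"model": model_key, "prompt": prompt, "completion": completion}

        # Run all prompts concurrently; samples that fail (for example, a prompt
        # longer than the model's maximum sequence length) are skipped.
        samples += await async_map_fallible(complete_one, prompts)

        await model.kill()  # deallocate the model to free memory (assumed method)
    return samples
```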
Then grade the completions: each sample will, for every grader in the `graders` list above, have its grades added to `eval_samples` and later shown in the UI. If the `grader_key` you hardcode for your recipe grader does not exist yet in the platform, it will be created automatically and reported in the evaluation results; there is no need to pre-create the recipe grader.
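A grading sketch under the same caveats: the `grade()` callable, the dict-based sample shape, and the example key are hypothetical; only the idea of a hardcoded `grader_key` comes from the text above.

```python
# Grading sketch: attach one grade per grader to every sample. The grade()
# callable and the dict-based sample shape are assumptions for this example.
RECIPE_GRADER_KEY = "my-recipe-grader"  # hardcoded key; created automatically
                                        # if it does not exist in the platform yet

async def grade_samples(eval_samples, grader_keys, grade):
    for sample in eval_samples:
        sample["grades"] = {}
        for grader_key in grader_keys:
            # grade() is a hypothetical callable returning a numeric score
            sample["grades"][grader_key] = await grade(grader_key, sample["completion"])
    return eval_samples
```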
Finally, create an `EvaluationArtifact` and populate it with your `EvalSample`s.
You do not need to push this to Adaptive with a specific method; the Adaptive Engine automatically detects that an `EvaluationArtifact` was written in the course of the recipe's execution, registering and displaying it in the UI once the run is complete.
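This last step might look roughly as follows; the constructor arguments are assumptions, since the actual fields are listed in the class definition below.

```python
# Constructing the artifact from the graded samples. The keyword arguments
# shown here are assumptions; see the EvaluationArtifact class below for the
# actual fields.
artifact = EvaluationArtifact(
    name="my-custom-eval",
    eval_samples=eval_samples,
)
# No explicit push or upload call is needed: the Adaptive Engine detects that
# an EvaluationArtifact was written during the recipe run and registers it.
```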
Evaluation table, the result of saving an `EvaluationArtifact`
For reference, here is the `EvaluationArtifact` class: