A/B Tests

A/B Tests allow you to evaluate a set of models on a given metric against a percentage of incoming inference requests. They can be used as the final quality assurance step to decide which model performs best based on feedback from your live production application, or select users/expert annotators.

When you make an inference request for a use case with an active A/B test without addressing a specific model, it is routed to the test with a probability of traffic split %. A/B test requests are then distributed equally among tested models. For example, if your configured traffic split is 10% and you are A/B testing 2 models, 5% of the full use case traffic will be routed to each model.

AB tests can run on metric or preference feedback. If you configure the test to run on metric feedback, preference feedback you log for completions during the course of the test will not count towards its results and vice versa, even if the request was routed to one of the AB tested models.

You can create an A/B Test as follows:

Create an Adaptive client first

response = adaptive.ab_tests.create(
    ab_test_key="my_ab_test",
    feedback_key="feedback_key",
    models=["model_key_1", "model_key_2", "model_key_3"],
    traffic_split=0.5,
    feedback_type="metric"
)

After creating an A/B test, you can start making inference requests and providing feedback on the resulting completions. When enough feedback has been logged to reach a statistically significant performance difference between the evaluated models, the A/B test ends automatically. If you ran the test on metric feedback, the test results will show the average value of feedbacks logged for each model. If you ran the test on preference feedback, the test results will show each model’s win rate against the others.

If you want to bypass the configured traffic split and guarantee your request counts towards the A/B test, you can specify the A/B test key in the ab_campaign parameter when using the Chat API.

See the SDK Reference for all A/B test-related methods.

.

Platform

Inference and Feedback

Datasets

Evaluation

Fine-tuning

Custom Recipes

Integrations

Deployment