Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.adaptive-ml.com/llms.txt

Use this file to discover all available pages before exploring further.

You can import datasets previously uploaded to Adaptive directly within your recipes. Datasets stored on Adaptive can be loaded by specifying a parameter of type AdaptiveDataset in your recipe’s InputConfig class. When a recipe is launched as a run and an AdaptiveDataset is found in your input config, a file is written to disk containing all the samples in that dataset. You can then load it in your recipe as a list of StringThread objects by referring to dataset.file.

Load a Dataset

First, define your dataset in your recipe’s input config:
from adaptive_harmony import InputConfig, AdaptiveDataset

class MyConfig(InputConfig):
    dataset: AdaptiveDataset
To load a dataset from Adaptive, you can use the load_dataset method on the dataset itself:
def my_recipe(ctx: RecipeContext, config: MyConfig):
    dataset = config.dataset.load_dataset(ctx)
This utility can also load local files structured in the Adaptive-supported format, which you can leverage if you are testing a recipe locally.

StringThread object

The atomic element of any dataset in the adaptive_harmony codebase is a StringThread, which is a Rust backed object exposed in Python. A StringThread simply contains all the messages in a thread of conversation, along with any metadata associated with that thread (such as metric feedback, ground truth labels or any other metadata). StringThread exposes a few helpful methods:
from adaptive_harmony import StringThread

thread = StringThread(
    turns=[
        ("user", "Hello, who are you?"),
        ("assistant", "I am a large language model. How can I help you today?"),
    ]
)

thread_with_metadata = StringThread.with_metadata(
    turns=[
        ("user", "Hello, who are you?"),
        ("assistant", "I am a large language model. How can I help you today?"),
    ],
    metadata={"label": "polite"}
)

thread.turns() # Returns a list of tuples, each containing a role and a message
thread.last_content() # Returns the last message in the thread, useful to get a model's response after calling .generate() (the last turn is the assistant response in this case)

new_thread = StringThread([])
new_thread = new_thread.system("You are a hel")
thread = thread.user("My name is John.") # Returns a new thread with a new user message added to the end
thread = thread.assistant("Nice to meet you, John!") # Returns a new thread with a new assistant message added to the end

Loading from Hugging Face

You can also load datasets directly from Hugging Face in your recipe. adaptive_harmony exposes helper methods to convert arbitrary datasets into a list of StringThread objects by allowing you to specify the column in the original dataset that contains chat messages.
from adaptive_harmony.core.dataset import load_from_hf, convert_sample_dict

def load_hf_dataset():
    # Helper function to convert HF dataset to Adaptive StringThread
    convert_sample_fn = convert_sample_dict(
        turns_key="messages", 
        role_key="role", 
        content_key="content"
    )
    
    # Load the dataset
    dataset = load_from_hf(
        "HuggingFaceH4/ultrachat_200k", 
        "train_sft", 
        convert_sample_fn
    )
    
    return dataset