Load datasets and StringThread - Adaptive ML Documentation

You can import datasets previously uploaded to Adaptive directly within your recipes. Datasets stored on Adaptive can be loaded by specifying a parameter of type AdaptiveDataset in your recipe’s InputConfig class. When a recipe is launched as a run and an AdaptiveDataset is found in your input config, a file is written to disk containing all the samples in that dataset. You can then load it in your recipe as a list of StringThread objects by referring to dataset.file.

Load a Dataset

First, define your dataset in your recipe’s input config:

from adaptive_harmony import InputConfig, AdaptiveDataset

class MyConfig(InputConfig):
    dataset: AdaptiveDataset

To load a dataset from Adaptive, you can use the load_dataset method on the dataset itself:

def my_recipe(ctx: RecipeContext, config: MyConfig):
    dataset = config.dataset.load_dataset(ctx)

This utility can also load local files structured in the Adaptive-supported format, which you can leverage if you are testing a recipe locally.

StringThread object

The atomic element of any dataset in the adaptive_harmony codebase is a StringThread, which is a Rust backed object exposed in Python. A StringThread simply contains all the messages in a thread of conversation, along with any metadata associated with that thread (such as metric feedback, ground truth labels or any other metadata). StringThread exposes a few helpful methods:

from adaptive_harmony import StringThread

thread = StringThread(
    turns=[
        ("user", "Hello, who are you?"),
        ("assistant", "I am a large language model. How can I help you today?"),
    ]
)

thread_with_metadata = StringThread.with_metadata(
    turns=[
        ("user", "Hello, who are you?"),
        ("assistant", "I am a large language model. How can I help you today?"),
    ],
    metadata={"label": "polite"}
)

thread.turns() # Returns a list of tuples, each containing a role and a message
thread.last_content() # Returns the last message in the thread, useful to get a model's response after calling .generate() (the last turn is the assistant response in this case)

new_thread = StringThread([])
new_thread = new_thread.system("You are a hel")
thread = thread.user("My name is John.") # Returns a new thread with a new user message added to the end
thread = thread.assistant("Nice to meet you, John!") # Returns a new thread with a new assistant message added to the end

Loading from Hugging Face

You can also load datasets directly from Hugging Face in your recipe. adaptive_harmony exposes helper methods to convert arbitrary datasets into a list of StringThread objects by allowing you to specify the column in the original dataset that contains chat messages.

from adaptive_harmony.core.dataset import load_from_hf, convert_sample_dict

def load_hf_dataset():
    # Helper function to convert HF dataset to Adaptive StringThread
    convert_sample_fn = convert_sample_dict(
        turns_key="messages", 
        role_key="role", 
        content_key="content"
    )
    
    # Load the dataset
    dataset = load_from_hf(
        "HuggingFaceH4/ultrachat_200k", 
        "train_sft", 
        convert_sample_fn
    )
    
    return dataset

Documentation Index

​Load a Dataset

​StringThread object

​Loading from Hugging Face

Load a Dataset

StringThread object

Loading from Hugging Face