Load datasets and StringThread

You can import datasets previously uploaded to Adaptive directly within your recipes. Datasets stored on Adaptive can be loaded by specifying a parameter of type Dataset from adaptive_harmony.parameters in your recipe’s InputConfig class. You can then load it in your recipe as a list of StringThread objects by calling await dataset.load(ctx).

Load a dataset

First, define your dataset in your recipe’s input config:

from adaptive_harmony.runtime import InputConfig
from adaptive_harmony.parameters import Dataset

class MyConfig(InputConfig):
    dataset: Dataset

To load a dataset from Adaptive, you can use the load method on the dataset:

async def my_recipe(config: MyConfig, ctx: RecipeContext):
    dataset = await config.dataset.load(ctx)

This utility can also load local files structured in the Adaptive-supported format, which you can leverage if you are testing a recipe locally. Load a local dataset with Dataset(dataset_key="local-file", local_file_path="your_file.jsonl").

StringThread object

The atomic element of any dataset in the adaptive_harmony codebase is a StringThread, which is a Rust backed object exposed in Python. A StringThread simply contains all the messages in a thread of conversation, along with any metadata associated with that thread (such as metric feedback, ground truth labels or any other metadata). StringThread exposes a few helpful methods:

from adaptive_harmony import StringThread

thread = StringThread(
    turns=[
        ("user", "Hello, who are you?"),
        ("assistant", "I am a large language model. How can I help you today?"),
    ]
)

thread_with_metadata = StringThread.with_metadata(
    turns=[
        ("user", "Hello, who are you?"),
        ("assistant", "I am a large language model. How can I help you today?"),
    ],
    metadata={"label": "polite"}
)

# Return all turns of thread as NamedTuples
all_turns = thread.get_turns()
all_turns[0][0] == all_turns[0].role
# Returns a list of tuples, each containing a role and a message
# Does not include the completion, if there is one (last turn with `assistant` role)
messages = thread.messages() 
assert messages[-1][0] != "assistant"
# Returns the string content of the last message in the thread if its role is `assistant`
# Otherwise, returns None
completion = thread.completion()


new_thread = StringThread([])
new_thread = new_thread.system("You are a helpful bot.")
thread = thread.user("My name is John.") # Returns a new thread with a new user message added to the end
thread = thread.assistant("Nice to meet you, John!") # Returns a new thread with a new assistant message added to the end

Loading from Hugging Face

You can also load datasets directly from Hugging Face in your recipe. adaptive_harmony exposes helper methods to convert arbitrary datasets into a list of StringThread objects by allowing you to specify the column in the original dataset that contains chat messages.

from adaptive_harmony.core.dataset import load_from_hf, convert_sample_dict

def load_hf_dataset():
    # Helper function to convert HF dataset to Adaptive StringThread
    convert_sample_fn = convert_sample_dict(
        turns_key="messages", 
        role_key="role", 
        content_key="content"
    )
    
    # Load the dataset
    dataset = load_from_hf(
        "HuggingFaceH4/ultrachat_200k", 
        "train_sft", 
        convert_sample_fn
    )
    
    return dataset

Getting Started

Building Blocks

Training & Evaluation

Cookbook

Updates

Load a dataset

StringThread object

Loading from Hugging Face

Getting Started

Building Blocks

Training & Evaluation

Cookbook

Updates

Documentation Index

​Load a dataset

​StringThread object

​Loading from Hugging Face

Load a dataset

StringThread object

Loading from Hugging Face