Overview

In custom recipes, you can load datasets using the AdaptiveDataset class, which provides a convenient way to access dataset files and work with them in your training or inference tasks. Dataset loading is done through the config feature.

Step 1: Define Dataset in Your Config

First, define your dataset in your recipe’s input configuration:

from typing import Annotated
from adaptive_harmony import InputConfig, AdaptiveDataset

class MyConfig(InputConfig):
    dataset: AdaptiveDataset

Step 2: Access the Dataset File

In your recipe function, you can access the dataset file directly:

def my_recipe(config: MyConfig):
    # Get the dataset file path
    dataset_file = config.dataset.file
    
    # Use the dataset file
    with open(dataset_file, "r") as file:
        # Read and process your dataset
        data = file.read()
        # Your processing logic here

Loading from Hugging Face

You can also load datasets directly from Hugging Face:

from adaptive_harmony.core.dataset import load_from_hf, convert_sample_dict

def load_hf_dataset():
    # Helper function to convert HF dataset to Adaptive StringThread
    convert_sample_fn = convert_sample_dict(
        turns_key="messages", 
        role_key="role", 
        content_key="content"
    )
    
    # Load the dataset
    dataset = load_from_hf(
        "HuggingFaceH4/ultrachat_200k", 
        "train_sft", 
        convert_sample_fn
    )
    
    return dataset