Artifacts are outputs that recipes create and register with the Adaptive Engine. Create one and add content to it in your recipe code, and it will appear in the UI after the run completes. Harmony supports two types of artifacts: evaluation results and new datasets.
Artifacts are currently registered in the platform only when the recipe that creates them is executed in the platform; artifacts created in interactive sessions are not logged to the platform.

Evaluation Artifact

EvaluationArtifact stores evaluation results from model assessments. Each artifact contains multiple EvalSample objects (model completions with attached grades). When viewed in the Adaptive UI, the artifact displays aggregate evaluation scores per model, across all graders used in the evaluation, in an interactive table format.

Import EvaluationArtifact, EvalSample, EvalSampleInteraction, and Grade from adaptive_harmony. Create an artifact with EvaluationArtifact(name, ctx), where name is the display name in the UI and ctx is the recipe context. Use add_samples(samples) to append a list of EvalSample objects to the artifact; the method returns self for chaining and raises ValueError if the samples list is empty.

Example

from adaptive_harmony import EvaluationArtifact, EvalSample, EvalSampleInteraction, Grade, StringThread
from adaptive_harmony.runtime import RecipeContext

# Create the artifact; ctx is the recipe's RecipeContext
eval_artifact = EvaluationArtifact(name="Model Evaluation", ctx=ctx)

# Build evaluation samples
samples = [
    EvalSample(
        interaction=EvalSampleInteraction(
            source="my-model-key",
            thread=StringThread(turns=[
                ("user", "What is the capital of France?"),
                ("assistant", "The capital of France is Paris.")
            ])
        ),
        grades=[
            Grade(
                value=0.95,
                grader_key="accuracy-grader",
                reason="Correct and concise answer"
            )
        ],
        dataset_key="geography-qa"
    )
]

# Add samples to the artifact
eval_artifact.add_samples(samples)
The artifact is automatically registered with the job and appears as an evaluation table in the UI when the recipe completes.
The evaluation artifact will fail to render in the UI if the source model key in EvalSampleInteraction does not exist in the platform. When using adaptive_harmony.parameters.Model instances, get the key via model.model_key to ensure it matches the spawned model. The same applies to dataset_key, which you can get from dataset.dataset_key if you are using adaptive_harmony.parameters.Dataset.
The grader_key in each Grade can be a new name for a custom grader defined in the recipe. If the grader key doesn’t exist in the platform, Adaptive automatically creates a new custom grader entry to reflect it in the evaluation results.
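A minimal sketch of building an EvalSample from platform parameter objects, assuming model and dataset are adaptive_harmony.parameters.Model and Dataset instances declared by the recipe:

from adaptive_harmony import EvalSample, EvalSampleInteraction, Grade, StringThread

# model and dataset are assumed to be adaptive_harmony.parameters instances
# declared by the recipe, so their keys match entries in the platform
sample = EvalSample(
    interaction=EvalSampleInteraction(
        source=model.model_key,  # key of the spawned model in the platform
        thread=StringThread(turns=[
            ("user", "Summarize the report in one sentence."),
            ("assistant", "Revenue grew 12% year over year, driven by new subscriptions.")
        ])
    ),
    grades=[
        Grade(
            value=1.0,
            grader_key="conciseness-grader",  # new key: Adaptive creates a custom grader entry
            reason="Answered in a single sentence"
        )
    ],
    dataset_key=dataset.dataset_key  # key of the platform dataset the prompt came from
)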

Dataset Artifact

DatasetArtifact saves datasets generated during recipe execution. It supports several dataset kinds and writes samples as JSONL compatible with the platform’s dataset format. Import DatasetArtifact and StringThread from adaptive_harmony, and AdaptiveDatasetKind from adaptive_harmony.runtime. Create an artifact with DatasetArtifact(name, ctx, kind=AdaptiveDatasetKind.Mixed), where kind specifies the dataset type:
  • Prompt — Prompt-only samples (no completions)
  • Completion — Prompt-completion pairs
  • Metric — Samples with evaluation metrics
  • Preference — Preference samples (good vs bad completions)
  • Mixed — Any combination of the above (default)
Use add_samples_from_thread(threads) to convert StringThread objects to dataset samples and append them. This method only works with Prompt, Completion, and Mixed kinds. For Metric and Preference datasets, use add_samples(samples) with structured sample objects instead. Both methods return self for chaining. The sample_count property returns the number of samples added.

Example

from adaptive_harmony import DatasetArtifact, StringThread
from adaptive_harmony.runtime import AdaptiveDatasetKind, RecipeContext

# Create prompt dataset artifact
prompt_artifact = DatasetArtifact(
    name="Generated Prompts",
    ctx=ctx,
    kind=AdaptiveDatasetKind.Prompt
)

# Create completion dataset artifact
completion_artifact = DatasetArtifact(
    name="Generated Completions",
    ctx=ctx,
    kind=AdaptiveDatasetKind.Completion
)

# Generate prompts (just conversation turns, no completion)
prompts = [
    StringThread(turns=[
        ("user", "Explain machine learning"),
        ("assistant", "What aspect would you like me to focus on?"),
        ("user", "Give me a comprehensive overview")
    ]),
    # ... more prompts
]

# Generate completions (conversation turns + final assistant response)
completions = [
    StringThread(turns=[
        ("user", "Explain machine learning"),
        ("assistant", "What aspect would you like me to focus on?"),
        ("user", "Give me a comprehensive overview")
    ]).assistant("Here's a comprehensive overview of machine learning...")
    # ... more completions
]

# Add samples from threads
prompt_artifact.add_samples_from_thread(prompts)
completion_artifact.add_samples_from_thread(completions)

print(f"Created {prompt_artifact.sample_count} prompts")
print(f"Created {completion_artifact.sample_count} completions")
Metric datasets store samples with evaluation scores from multiple graders. Each sample requires a metrics dict mapping grader keys to float values.
from adaptive_harmony import DatasetArtifact
from adaptive_harmony.runtime import AdaptiveDatasetKind
from harmony_client.runtime.dto.DatasetSampleFormats import DatasetMetricSample, SampleMetadata
import uuid
from datetime import datetime

# Create metric dataset artifact
metric_artifact = DatasetArtifact(
    name="Graded Samples",
    ctx=ctx,
    kind=AdaptiveDatasetKind.Metric
)

# Build metric samples
metric_samples = [
    DatasetMetricSample(
        prompt=[("user", "What is the capital of France?")],
        completion=("assistant", "The capital of France is Paris."),
        metrics={
            "accuracy": 1.0,
            "helpfulness": 0.95,
            "conciseness": 0.9
        },
        metadata=SampleMetadata(
            id=uuid.uuid4(),
            created_at=int(datetime.now().timestamp()),
            external_data={"source": "evaluation-run-1"}
        )
    ),
    # ... more samples
]

# Add samples to the artifact
metric_artifact.add_samples(metric_samples)
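If the scores come from Grade objects collected during an evaluation, one way to build the metrics dict is to map each grade's grader_key to its value. The helper below is illustrative, not part of the library:

# Hypothetical helper: convert a list of Grade objects into the
# grader-key -> float mapping expected by DatasetMetricSample.
def grades_to_metrics(grades):
    return {grade.grader_key: grade.value for grade in grades}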
Preference datasets store samples with both good and bad completions for the same prompt.
from adaptive_harmony import DatasetArtifact
from adaptive_harmony.runtime import AdaptiveDatasetKind
from harmony_client.runtime.dto.DatasetSampleFormats import DatasetPreferenceSample, SampleMetadata
import uuid
from datetime import datetime

# Create preference dataset artifact
preference_artifact = DatasetArtifact(
    name="Preference Pairs",
    ctx=ctx,
    kind=AdaptiveDatasetKind.Preference
)

# Build preference samples
preference_samples = [
    DatasetPreferenceSample(
        prompt=[("user", "Explain quantum computing")],
        good_completion=("assistant", "Quantum computing uses quantum bits (qubits) that can exist in superposition..."),
        bad_completion=("assistant", "Quantum computing is just really fast computers."),
        metric="preference",  # optional, defaults to "preference"
        metadata=SampleMetadata(
            id=uuid.uuid4(),
            created_at=int(datetime.now().timestamp()),
            external_data={"annotator": "expert-1"}
        )
    ),
    # ... more samples
]

# Add samples to the artifact
preference_artifact.add_samples(preference_samples)
All artifacts are automatically registered with the run and appear in the Adaptive UI when the run completes.