v1.1.0
2026-04-22
Depends on harmony-client~=1.1.

Breaking changes

Training length must be specified explicitly
PPO, GRPO, GSPO, and ENVPPO now require exactly one of epochs or the corresponding max_num_*_steps to be set. Recipes that previously relied on a default training length raise AssertionError.
# Before
await GRPO(dataset=dataset, model=model, grader=grader).run()

# After — pick one
await GRPO(dataset=dataset, model=model, grader=grader, epochs=1.0).run()
await GRPO(dataset=dataset, model=model, grader=grader, max_num_grpo_steps=100).run()
epochs is a float counting dataset passes; max_num_*_steps is an integer counting optimizer steps.

Environments must return a list of grades
Returning a single Grade instead of list[Grade] from react_to() now raises ValueError. Wrap single grades in a list.
# Before
return next_thread, Grade(value=1.0, grader_key="env")

# After
return next_thread, [Grade(value=1.0, grader_key="env")]
Removed batch_process_fallible
Use async_map_batch instead. async_map_batch works on an iterator, processes a fixed-size batch concurrently, and refills the batch from the iterator when samples fail (migration sketch below).

WandB runs now keyed by project ID
WandbLogger initialized by get_prod_logger() now passes the project ID (a stable identifier) as the WandB project, instead of the human-readable project name. Existing WandB dashboards filtering on the old project name lose continuity. Update filters to use the project ID, or set project_name explicitly when instantiating WandbLogger directly.
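
For the batch_process_fallible removal above, a minimal migration sketch. Only the callable-plus-iterator calling pattern is documented here; the batch-size keyword name below is an assumption.
# Hedged sketch: process samples from an iterator in fixed-size concurrent batches,
# refilling the batch as samples complete or fail.
results = await async_map_batch(
    grader.score_float_value,
    iter(samples),
    batch_size=32,  # keyword name assumed, not confirmed by this changelog
)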

What’s new

Structured output via response_model
InferenceModel.generate() and InferenceModel.generate_tokens() now accept a Pydantic model class. They return a (result, parsed) tuple where parsed is an instance of your model.
from pydantic import BaseModel

class Verdict(BaseModel):
    reasoning: str
    pass_: bool

result, parsed = await model.generate(thread, Verdict)
print(parsed.pass_)
This wraps constrained_decoding and Pydantic parsing, so JSON output is guaranteed to validate against the schema. JSON Schemas are emitted with additionalProperties: false for compatibility with OpenAI strict mode.

Function graders loadable from source code
BaseGrader.from_source_code(grader_key, source_code) compiles a string of Python that defines an async grade function and returns a grader. BaseGrader.from_config() now handles Function grader entries from the platform, calling from_source_code under the hood. Function graders only run inside sandboxed recipe jobs submitted via the platform; outside the sandbox the source code is not available and from_config raises a descriptive error. (A usage sketch follows the epochs example below.)

Train by epochs in RL recipes
PPO, GRPO, GSPO, and ENVPPO accept epochs: float | None to specify training length in dataset passes. This pairs with the breaking change above (exactly one of epochs or max_num_*_steps must be set).
await GRPO(
    dataset=prompts,
    model=policy_model,
    grader=grader,
    epochs=2.0,  # two passes over the dataset
).run()
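
For the function-grader loading above, a minimal sketch. The required signature and return type of the compiled grade function are not documented here, so the body below is a placeholder; the import path for BaseGrader is also an assumption.
from adaptive_harmony.graders import BaseGrader  # import path assumed

# A string of Python defining an async `grade` function (signature assumed)
source = """
async def grade(thread):
    return 1.0  # placeholder grade
"""

# Compile the source into a grader; this only works inside sandboxed recipe jobs
grader = BaseGrader.from_source_code("my_function_grader", source)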

DPO with arbitrary loss functions
A new train_dpo_tangents joins train_sft_tangents and train_grpo_tangents for tangent-based DPO training.
from adaptive_harmony.core.losses import train_dpo_tangents

metrics = await train_dpo_tangents(
    model,
    pos_thread,
    neg_thread,
    ref_logprobs_pos,
    ref_logprobs_neg,
    beta=0.1,
)
See the Bring your own loss functions section for the broader pattern.

Logger step persisted across resumes
All loggers (WandbLogger, MLFlowLogger, TBMetricsLogger, AdaptiveLogger, StdoutLogger, CompositeLogger) now expose get_state() / set_state(). The trainer saves logger state into the checkpoint and restores it on resume, so step counters no longer reset to 0 mid-run.

Multi-stage checkpointing
Recipes that chain multiple trainers (for example, SFT followed by GSPO) now checkpoint each stage independently. When a job is resumed, every stage picks up where it left off: earlier stages skip, the active stage restarts from its latest checkpoint, later stages run fresh. No code changes required; the runtime tracks the stage counter for you.

Concurrency control on async_map
async_map and async_map_fallible accept max_concurrent_samples: int | None = None to bound parallelism with a semaphore, and stage_notifier: StageNotifier | None = None to report per-sample progress to the platform.
results = await async_map(
    grader.score_float_value,
    samples,
    max_concurrent_samples=8,
    stage_notifier=ctx.job.stage_notifier("Grading"),
)
async_map_fallible adds a return_indices=True overload that returns list[tuple[int, T]], so you can map results back to the original input indices (see the sketch below).

Rich progress bars for model spawning
spawn_train() and spawn_inference() now render progress with rich.progress.Progress instead of tqdm. This is a visible behavior change in interactive sessions.
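
For the return_indices overload above, a minimal sketch; the callable and sample list are placeholders, and the call shape mirrors the async_map example.
# Returns list[tuple[int, T]]: each successful result paired with its original index,
# so failed samples simply leave gaps in the index sequence.
indexed_results = await async_map_fallible(
    grader.score_float_value,
    samples,
    return_indices=True,
)
for i, score in indexed_results:
    print(i, score)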

Bug fixes

  • OpenAI strict-mode schemas now validate. additionalProperties: false is added recursively to all object schemas, including those reached via $defs, definitions, anyOf/oneOf/allOf, and items.
  • Repeated samples in GRPO/GSPO deepcopy metadata. Sample metadata is no longer shared across grouped completions, preventing accidental mutation across the group.
  • Thread weighting in dataset loading. SFT and RLHF recipes had a bug where some assistant turns received incorrect weights. Loading and recipe weighting now agree.
  • MLflow experiment tag for project lookup. The MLflow run tag used for cross-system project lookup was wrong; it now matches what the platform expects.
  • Improved user-error messages in trainer init. Hyperparameter mismatches and missing inputs raise more specific messages instead of opaque assertion errors.
v1.0.0
2026-03-02
Depends on harmony-client~=1.0.

Breaking changes

“Use Cases” renamed to “Projects”
Anywhere use_case was used is now converted to project, namely in get_client and RecipeConfig.

Environment API rewrite
Environment.react_to() now returns a StringThread and a list of Grades, rather than a single next turn and a single reward float.
# Before (v0.1.x)
async def react_to(self, thread: StringThread) -> list[tuple[str, str]] | TrajectoryScore:
    ...

# After (v1.0.0)
async def react_to(self, thread: StringThread) -> tuple[StringThread | None, list[Grade] | None]:
    ...
  • Return (next_thread, grades) on each turn. Return (None, grades) to terminate the trajectory.
  • When next_thread is not a prefix of the current thread, the environment starts a new conversation segment within the same trajectory. This enables multi-turn training with arbitrary state transitions.
Environment.bootstrap_prompt() is renamed to Environment.initialize_state().

IgnoreScoreException renamed to IgnoreGradeException
All grader code that catches or raises this exception must update the class name.
# Before
from adaptive_harmony.graders import IgnoreScoreException

# After
from adaptive_harmony.graders import IgnoreGradeException
The exception now carries a detailed default message explaining when ignoring a grade is appropriate. The message parameter type changed from str to str | None (default None).

CheckpointCallback deprecated
CheckpointCallback is deprecated. Use trainer-level checkpointing (the checkpoint_frequency parameter) instead. Using both may produce inconsistent checkpoint states. (A sketch appears after the last entry in this section.)

Rich progress display off by default
The training progress display is now off by default. Set the environment variable ENABLE_RICH_PROGRESS=1 to enable it. The previous DISABLE_RICH_PROGRESS environment variable is no longer recognized.

Experiment tag key renamed
The metric logger tag adaptive.use_case_id is renamed to adaptive.project_id. Update any code or dashboards that filter on this tag.
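
For the CheckpointCallback deprecation above, a minimal sketch of trainer-level checkpointing. Only the checkpoint_frequency parameter name is documented; the frequency unit and the surrounding recipe arguments are assumptions.
from adaptive_harmony.common.sft import SFT

# Hedged sketch: rely on trainer-level checkpointing instead of CheckpointCallback
await SFT(
    dataset=dataset,
    model=model,
    checkpoint_frequency=100,  # unit assumed to be optimizer steps
).run()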

What’s new

Arbitrary state transition support in environment-based trainers
Environment-based trainers (PPO, GRPO, GSPO) can now handle environments that produce multiple conversation threads per trajectory. Future rewards are “propagated” back, with each turn’s advantage derived from the (normalized) future cumulative reward, as described in section 4.1.3 of DeepSeekMath.

ENVPPO: multi-turn PPO
A new trainer at adaptive_harmony.common.env_ppo.ENVPPO supports PPO training within multi-turn environments. It uses a separate value model for advantage estimation via GAE (Generalized Advantage Estimation), with independent learning rates and schedulers for the policy and value models.
from adaptive_harmony.common.env_ppo import ENVPPO

await ENVPPO(
    dataset=prompts,
    model=policy_model,
    value_model=value_model,
    environment_factory=env_factory,
    lr_policy=0.75e-6,
    lr_value=1e-6,
    max_num_ppo_steps=100,
    gae_lambda=0.95,
    gae_gamma=1.0,
    value_only_fraction=0.25,
).run()
Key parameters:
  • value_only_fraction: Fraction of training steps where only the value model updates (default: 0.25). The policy model stays frozen during this warmup.
  • lr_policy / lr_value: Independent learning rates for the policy and value models.
  • gae_lambda / gae_gamma: GAE parameters for advantage estimation (defaults: 0.95 and 1.0).
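
For reference, a standalone sketch of how gae_lambda and gae_gamma combine in Generalized Advantage Estimation. This illustrates the standard GAE recursion, not ENVPPO's internal implementation.
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    # Standard GAE: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    # then A_t = delta_t + gamma * lam * A_{t+1}, computed backwards over the trajectory.
    advantages = [0.0] * len(rewards)
    next_value, next_advantage = 0.0, 0.0  # terminal state bootstraps to zero
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages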

Tangent-based loss computation
SFT, GRPO, ENVGRPO, and ENVGSPO recipes now support use_tangents=True, which computes gradients through an explicit loss function (adaptive_harmony.core.losses) instead of relying on built-in training methods. This exposes per-step training metrics.
from adaptive_harmony.common.sft import SFT

await SFT(
    dataset=dataset,
    model=model,
    use_tangents=True,
    logger=logger,
).run()
When enabled, the logger receives additional policy/* metrics:
  • SFT: policy/loss, policy/num_tokens
  • GRPO/ENVGRPO/ENVGSPO: policy/pg_clipfrac, policy/inner_kl_div, policy/Dkl

Dataset seeding
All training recipes now accept a data_seed parameter (default: 42) that controls dataset shuffling. This ensures reproducible training runs across restarts.
await GRPO(
    dataset=dataset,
    model=model,
    grader=grader,
    data_seed=42,  # Reproducible shuffle
).run()

Optional system prompt in TemplatedPromptJudge
The system_template parameter in TemplatedPromptJudge now accepts None. When None, the judge thread contains only a user turn.

Dataset weight validation for SFT
SFT training now checks dataset turn weights before training starts and logs warnings for:
  • User turns with non-zero weight (you may be unintentionally training on user messages)
  • Assistant turns with zero weight (these turns do not contribute to training)

System prompt override utility
A new override_system_prompt() function replaces or adds a system prompt on every thread in a dataset:
from adaptive_harmony.core.utils import override_system_prompt

dataset = override_system_prompt(dataset, "You are a helpful assistant.")

Image utilities
New functions in adaptive_harmony.core.image_utils:
  • pil_to_base64(): converts a PIL image to a base64 string with optional resizing and grayscale conversion
  • image_to_base64(): converts an image file path to a base64 string
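
A minimal usage sketch; the keyword names for optional resizing and grayscale conversion are not documented in this changelog, so only the basic calls are shown.
from PIL import Image

from adaptive_harmony.core.image_utils import image_to_base64, pil_to_base64

img = Image.open("figure.png")
b64_from_pil = pil_to_base64(img)               # PIL image -> base64 string
b64_from_path = image_to_base64("figure.png")   # file path -> base64 string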

Weight-aware display
StringThread and TokenizedThread repr output now highlights trained turns (weight > 0) in green and untrained turns in blue. The HTML export (string_thread_to_html) applies matching color coding and green borders on images in trained turns.

Improved error messages
  • async_map_batch now raises a RuntimeError with a clear message when the batch failure rate exceeds the threshold, including the allowed failure percentage and the last error thrown.
  • async_map_fallible now logs exceptions before suppressing them.

Bug fixes

  • External dataset feedback values are no longer silently dropped. When parsing datasets with a feedbacks field and a configured feedback_key, the reward value is now correctly extracted into the thread metadata.
  • Progress percentage no longer exceeds 100%. When async_map_batch retries failed samples, the processed count could exceed the total. Progress reporting now clamps to the total and logs a warning.
  • Failed trajectories in ENVGRPO/ENVGSPO no longer discard the entire group. Trajectory generation within a group switched from async_map to async_map_batch. If a trajectory fails, a new one is attempted instead of losing all completions for that prompt. Group normalization runs on the successful trajectories.