harmony-client~=1.1

Breaking changes
Training length must be specified explicitly

PPO, GRPO, GSPO, and ENVPPO now require exactly one of epochs or the corresponding max_num_*_steps to be set. Recipes that previously relied on a default training length now raise an AssertionError. epochs is a float counting dataset passes; max_num_*_steps is an integer counting optimizer steps.
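For illustration, a sketch of the two mutually exclusive settings; import paths are omitted, the trainer arguments other than epochs are placeholders, and max_num_train_steps is a hypothetical stand-in for the trainer-specific max_num_*_steps name:

```python
# Sketch only: exactly one of the two length settings may be provided.
trainer = GRPO(..., epochs=2.0)  # 2.0 = two full passes over the dataset

# Or step-based; "max_num_train_steps" is a hypothetical stand-in for the
# trainer-specific max_num_*_steps parameter:
trainer = GRPO(..., max_num_train_steps=500)
```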
Environments must return a list of grades

Returning a single Grade instead of list[Grade] from react_to() now raises a ValueError. Wrap single grades in a list.

batch_process_fallible removed
Use async_map_batch instead. async_map_batch works on an iterator, processes a fixed-size batch concurrently, and refills the batch from the iterator when samples fail.
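A migration sketch, assuming the worker comes first and the batch-size keyword is named batch_size (both assumptions):

```python
# Hypothetical call shape; check the actual async_map_batch signature.
results = await async_map_batch(
    process_sample,  # async worker applied to each sample
    iter(samples),   # source iterator; the batch is refilled from it when samples fail
    batch_size=32,   # hypothetical name for the fixed batch size
)
```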
WandB runs now keyed by project ID

WandbLogger initialised by get_prod_logger() now passes the project ID (a stable identifier) as the WandB project, instead of the human-readable project name. Existing WandB dashboards filtering on the old project name lose continuity. Update filters to use the project ID, or set project_name explicitly when instantiating WandbLogger directly.
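To keep the old behaviour, a sketch of pinning the name explicitly; only the project_name parameter comes from these notes:

```python
# Pin the human-readable name to keep existing dashboards working;
# arguments other than project_name are placeholders.
logger = WandbLogger(..., project_name="my-team-project")
```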
What’s new

Structured output via response_model
InferenceModel.generate() and InferenceModel.generate_tokens() now accept a Pydantic model class. They return a (result, parsed) tuple where parsed is an instance of your model. Under the hood this combines constrained_decoding with Pydantic parsing, so JSON output is guaranteed to validate against the schema. JSON Schemas are emitted with additionalProperties: false for compatibility with OpenAI strict mode.
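A minimal sketch; passing the class as a response_model keyword and the surrounding setup are assumptions, while the (result, parsed) return shape comes from the notes:

```python
from pydantic import BaseModel

class Verdict(BaseModel):
    label: str
    score: float

# `model` is assumed to be an already-spawned InferenceModel and `thread`
# an existing prompt thread.
result, parsed = await model.generate(thread, response_model=Verdict)
print(parsed.label, parsed.score)  # parsed is a Verdict instance
```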
Function graders loadable from source code

BaseGrader.from_source_code(grader_key, source_code) compiles a string of Python that defines an async grade function and returns a grader. BaseGrader.from_config() now handles Function grader entries from the platform, calling from_source_code under the hood. Function graders only run inside sandboxed recipe jobs submitted via the platform; outside the sandbox the source code is not available and from_config raises a descriptive error.
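A sketch of compiling a grader from source; the grade() signature and grading logic are illustrative, while from_source_code(grader_key, source_code) comes from the notes:

```python
# Only from_source_code(grader_key, source_code) and the requirement that the
# source defines an async grade function come from the notes; the body is illustrative.
source_code = """
async def grade(thread):
    # Hypothetical logic: reward non-empty output.
    return 1.0 if str(thread).strip() else 0.0
"""

grader = BaseGrader.from_source_code("non_empty_answer", source_code)
```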
Train by epochs in RL recipes

PPO, GRPO, GSPO, and ENVPPO accept epochs: float | None to specify training length in dataset passes. This pairs with the breaking change above (exactly one of epochs or max_num_*_steps must be set).

train_dpo_tangents

A new train_dpo_tangents joins train_sft_tangents and train_grpo_tangents for tangent-based DPO training. This wraps DPO with arbitrary loss functions. See the Bring your own loss functions section for the broader pattern.

Logger step persisted across resumes

All loggers (WandbLogger, MLFlowLogger, TBMetricsLogger, AdaptiveLogger, StdoutLogger, CompositeLogger) now expose get_state() / set_state(). The trainer saves logger state into the checkpoint and restores it on resume, so step counters no longer reset to 0 mid-run.
Multi-stage checkpointing

Recipes that chain multiple trainers (for example, SFT followed by GSPO) now checkpoint each stage independently. When a job is resumed, every stage picks up where it left off: earlier stages skip, the active stage restarts from its latest checkpoint, and later stages run fresh. No code changes are required; the runtime tracks the stage counter for you.
Concurrency control on async_map

async_map and async_map_fallible accept max_concurrent_samples: int | None = None to bound parallelism with a semaphore, and stage_notifier: StageNotifier | None = None to report per-sample progress to the platform. async_map_fallible also adds a return_indices=True overload that returns list[tuple[int, T]], so you can map results back to the original input indices.
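A sketch, assuming the worker-first, iterable-second argument order:

```python
# Bound parallelism to 8 concurrent samples (semaphore-based).
results = await async_map(score_sample, samples, max_concurrent_samples=8)

# Successful results paired with their original input indices.
indexed = await async_map_fallible(
    score_sample, samples, return_indices=True
)  # -> list[tuple[int, T]]
```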
Rich progress bars for model spawning

spawn_train() and spawn_inference() now render progress with rich.progress.Progress instead of tqdm. This is a visible behaviour change in interactive sessions.

Bug fixes
- OpenAI strict-mode schemas now validate. additionalProperties: false is added recursively to all object schemas, including those reached via $defs, definitions, anyOf/oneOf/allOf, and items.
- Repeated samples in GRPO/GSPO deepcopy metadata. Sample metadata is no longer shared across grouped completions, preventing accidental mutation across the group.
- Thread weighting in dataset loading. SFT and RLHF recipes had a bug where some assistant turns received incorrect weights. Loading and recipe weighting now agree.
- MLflow experiment tag for project lookup. The MLflow run tag used for cross-system project lookup was wrong; it now matches what the platform expects.
- Improved user-error messages in trainer init. Hyperparameter mismatches and missing inputs raise more specific messages instead of opaque assertion errors.
harmony-client~=1.0

Breaking changes
“Use Cases” renamed to “Projects”

Anywhere use_case was used is now converted to project, namely in get_client and RecipeConfig.

Environment API rewrite
Environment.react_to() now returns a StringThread and a list of Grades, rather than a single next turn and a single reward float.
- Return (next_thread, grades) on each turn. Return (None, grades) to terminate the trajectory.
- When next_thread is not a prefix of the current thread, the environment starts a new conversation segment within the same trajectory. This enables multi-turn training with arbitrary state transitions.
Environment.bootstrap_prompt() is renamed to Environment.initialize_state().
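A sketch of the new contract for a minimal environment; imports are omitted, and Grade construction plus the helper methods are hypothetical:

```python
class CountdownEnv(Environment):
    async def react_to(self, thread: StringThread):
        grades = [Grade(1.0)]               # always a list; construction illustrative
        if self.is_done(thread):            # hypothetical termination check
            return None, grades             # (None, grades) ends the trajectory
        next_thread = self.advance(thread)  # hypothetical state transition
        return next_thread, grades          # (next_thread, grades) continues
```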
IgnoreScoreException renamed to IgnoreGradeException

All grader code that catches or raises this exception must update the class name. The message parameter type changed from str to str | None (default None), and the exception now carries a detailed default message explaining when ignoring a grade is appropriate.
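Migration is a straightforward rename; a sketch with a hypothetical grader call:

```python
# Before: except IgnoreScoreException
try:
    grade = await grader.grade(sample)  # hypothetical grader call
except IgnoreGradeException:
    grade = None  # the default message explains when ignoring a grade is appropriate
```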
CheckpointCallback deprecated

CheckpointCallback is deprecated. Use trainer-level checkpointing (the checkpoint_frequency parameter) instead. Using both may produce inconsistent checkpoint states.
Rich progress display off by default

The training progress display is now off by default. Set the environment variable ENABLE_RICH_PROGRESS=1 to enable it. The previous DISABLE_RICH_PROGRESS environment variable is no longer recognized.
Experiment tag key renamed

The metric logger tag adaptive.use_case_id is renamed to adaptive.project_id. Update any code or dashboards that filter on this tag.

What’s new
Arbitrary state transition support in environment-based trainers

Environment-based trainers (PPO, GRPO, GSPO) can now handle environments that produce multiple conversation threads per trajectory. Future rewards are “propagated” back, with each turn’s advantage derived from the (normalized) future cumulative reward, as described in section 4.1.3 of DeepSeekMath.

ENVPPO: multi-turn PPO

A new trainer at adaptive_harmony.common.env_ppo.ENVPPO supports PPO training within multi-turn environments. It uses a separate value model for advantage estimation via GAE (Generalized Advantage Estimation), with independent learning rates and schedulers for the policy and value models. Key parameters (see the sketch after this list):
- value_only_fraction: fraction of training steps during which only the value model updates (default: 0.25). The policy model stays frozen during this warmup.
- lr_policy / lr_value: independent learning rates for the policy and value models.
- gae_lambda / gae_gamma: GAE parameters for advantage estimation (defaults: 0.95 and 1.0).
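A sketch of configuring these knobs; the parameter names, defaults, and import path come from the notes, everything else is illustrative:

```python
from adaptive_harmony.common.env_ppo import ENVPPO

# Model and environment arguments are placeholders.
trainer = ENVPPO(
    ...,
    value_only_fraction=0.25,  # warmup fraction where only the value model updates
    lr_policy=1e-6,            # policy learning rate
    lr_value=5e-6,             # value model learning rate
    gae_lambda=0.95,           # GAE lambda
    gae_gamma=1.0,             # GAE gamma
)
```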
Bring your own loss functions

Trainers accept use_tangents=True, which computes gradients through an explicit loss function (adaptive_harmony.core.losses) instead of relying on built-in training methods. This exposes per-step training metrics. When enabled, the logger receives additional policy/* metrics:
- SFT: policy/loss, policy/num_tokens
- GRPO/ENVGRPO/ENVGSPO: policy/pg_clipfrac, policy/inner_kl_div, policy/Dkl
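For illustration, enabling the pattern might look like this; only use_tangents and the losses module path come from the notes:

```python
# Hypothetical construction; the explicit loss is taken from
# adaptive_harmony.core.losses when use_tangents=True.
trainer = GRPO(..., use_tangents=True)
```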
Reproducible dataset shuffling

A new data_seed parameter (default: 42) controls dataset shuffling, ensuring reproducible training runs across restarts.

Optional system prompt in TemplatedPromptJudge

The system_template parameter in TemplatedPromptJudge now accepts None. When None, the judge thread contains only a user turn.

Dataset weight validation for SFT
SFT training now checks dataset turn weights before training starts and logs warnings for:
- User turns with non-zero weight (you may be unintentionally training on user messages)
- Assistant turns with zero weight (these turns do not contribute to training)
override_system_prompt()

A new override_system_prompt() function replaces or adds a system prompt on every thread in a dataset.

Image utilities

New functions in adaptive_harmony.core.image_utils:
- pil_to_base64(): converts a PIL image to a base64 string, with optional resizing and grayscale conversion
- image_to_base64(): converts an image file path to a base64 string
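A sketch of the image helpers; optional keyword arguments for resizing and grayscale conversion exist per the notes but their names are not given here, so they are omitted:

```python
from PIL import Image
from adaptive_harmony.core.image_utils import pil_to_base64, image_to_base64

b64_from_pil = pil_to_base64(Image.open("figure.png"))  # PIL image -> base64 string
b64_from_path = image_to_base64("figure.png")           # file path -> base64 string
```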
Thread display improvements

StringThread and TokenizedThread repr output now highlights trained turns (weight > 0) in green and untrained turns in blue. The HTML export (string_thread_to_html) applies matching color coding, plus green borders on images in trained turns.

Improved error messages

async_map_batch now raises a RuntimeError with a clear message when the batch failure rate exceeds the threshold, including the allowed failure percentage and the last error thrown. async_map_fallible now logs exceptions before suppressing them.
Bug fixes
- External dataset feedback values are no longer silently dropped. When parsing datasets with a feedbacks field and a configured feedback_key, the reward value is now correctly extracted into the thread metadata.
- Progress percentage no longer exceeds 100%. When async_map_batch retries failed samples, the processed count could exceed the total. Progress reporting now clamps to the total and logs a warning.
- Failed trajectories in ENVGRPO/ENVGSPO no longer discard the entire group. Trajectory generation within a group switched from async_map to async_map_batch. If a trajectory fails, a new one is attempted instead of losing all completions for that prompt. Group normalization runs on the successful trajectories.

