# Release notes

> Release notes for adaptive-harmony

<Update label="v1.1.0" description="2026-04-22">
  Depends on `harmony-client~=1.1`.

  ### Breaking changes

  **Training length must be specified explicitly**
  PPO, GRPO, GSPO, and ENVPPO now require exactly one of `epochs` or the corresponding `max_num_*_steps` to be set. Recipes that previously relied on a default training length now raise `AssertionError`.

  ```python theme={null}
  # Before
  await GRPO(dataset=dataset, model=model, grader=grader).run()

  # After — pick one
  await GRPO(dataset=dataset, model=model, grader=grader, epochs=1.0).run()
  await GRPO(dataset=dataset, model=model, grader=grader, max_num_grpo_steps=100).run()
  ```

  `epochs` is a float counting dataset passes; `max_num_*_steps` is an integer counting optimizer steps.
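
As a rough illustration of the rule, the sketch below checks that exactly one length argument is set and turns `epochs` into a step count. The helper `resolve_num_steps` and its `dataset_size` and `samples_per_step` arguments are hypothetical, and the round-up conversion is an assumption, not the library's documented behaviour.

```python
import math
from typing import Optional


def resolve_num_steps(
    epochs: Optional[float],
    max_num_steps: Optional[int],
    dataset_size: int,
    samples_per_step: int,
) -> int:
    """Hypothetical helper: return the number of optimizer steps to run."""
    # Exactly one of the two length arguments may be set.
    assert (epochs is None) != (max_num_steps is None), (
        "Set exactly one of `epochs` or `max_num_steps`"
    )
    if max_num_steps is not None:
        return max_num_steps
    # Assumption: one step consumes `samples_per_step` dataset samples,
    # and a partial final step is rounded up.
    return math.ceil(epochs * dataset_size / samples_per_step)
```

Under these assumptions, `epochs=1.0` on 1,000 samples at 32 samples per step resolves to 32 steps.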

  **Environments must return a list of grades**
  Returning a single `Grade` instead of `list[Grade]` from `react_to()` now raises `ValueError`. Wrap single grades in a list.

  ```python theme={null}
  # Before
  return next_thread, Grade(value=1.0, grader_key="env")

  # After
  return next_thread, [Grade(value=1.0, grader_key="env")]
  ```

  **Removed `batch_process_fallible`**
  Use `async_map_batch` instead. `async_map_batch` works on an iterator, processes a fixed-size batch concurrently, and refills the batch from the iterator when samples fail.
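
The refill behaviour described above can be sketched with plain `asyncio`. This is not the library's implementation: `map_batch_with_refill` and `flaky` are hypothetical names, results come back in completion order, and the failure-rate threshold that the real `async_map_batch` enforces is left out.

```python
import asyncio
from typing import Awaitable, Callable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_batch_with_refill(
    fn: Callable[[T], Awaitable[R]],
    items: Iterator[T],
    batch_size: int,
) -> list:
    """Hypothetical sketch: aim for `batch_size` successes, replacing failures."""
    results = []
    pending = set()

    def launch() -> None:
        # Pull the next sample from the iterator, if any remain.
        try:
            item = next(items)
        except StopIteration:
            return
        pending.add(asyncio.ensure_future(fn(item)))

    # Start one task per slot in the batch.
    for _ in range(batch_size):
        launch()

    while pending:
        done, _ = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            pending.discard(task)
            if task.exception() is None:
                results.append(task.result())
            else:
                launch()  # a sample failed: refill its slot from the iterator
    return results


async def flaky(x: int) -> int:
    """Demo function that fails on multiples of three."""
    if x % 3 == 0:
        raise ValueError("bad sample")
    return x * 10
```

Calling `asyncio.run(map_batch_with_refill(flaky, iter(range(1, 100)), 4))` starts with samples 1 to 4, sees sample 3 fail, and pulls sample 5 to fill the gap.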

  **WandB runs now keyed by project ID**
  `WandbLogger` initialised by `get_prod_logger()` now passes the project ID (a stable identifier) as the WandB project, instead of the human-readable project name. Existing WandB dashboards filtering on the old project name lose continuity. Update filters to use the project ID, or set `project_name` explicitly when instantiating `WandbLogger` directly.

  ### What's new

  **Structured output via `response_model`**
  `InferenceModel.generate()` and `InferenceModel.generate_tokens()` now accept a Pydantic model class as `response_model`. When one is passed, they return a `(result, parsed)` tuple where `parsed` is an instance of your model.

  ```python theme={null}
  from pydantic import BaseModel

  class Verdict(BaseModel):
      reasoning: str
      pass_: bool

  result, parsed = await model.generate(thread, Verdict)
  print(parsed.pass_)
  ```

  This wraps `constrained_decoding` and Pydantic parsing, so JSON output is guaranteed to validate against the schema. JSON Schemas are emitted with `additionalProperties: false` for compatibility with OpenAI strict mode.
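
The `additionalProperties: false` requirement can be pictured as a small recursive pass over a schema dictionary. This is a sketch that assumes the schema is a plain `dict`; `forbid_additional_properties` is a hypothetical name and not the library's internal function.

```python
from typing import Any

SCHEMA_MAPS = ("properties", "$defs", "definitions")  # name -> subschema
SCHEMA_LISTS = ("anyOf", "oneOf", "allOf")  # list of subschemas


def forbid_additional_properties(schema: Any) -> Any:
    """Return a copy of `schema` with every object schema closed to extra keys."""
    if not isinstance(schema, dict):
        return schema  # boolean schemas and other non-dict values pass through
    out = dict(schema)
    for key in SCHEMA_MAPS:
        if isinstance(out.get(key), dict):
            out[key] = {
                name: forbid_additional_properties(sub)
                for name, sub in out[key].items()
            }
    for key in SCHEMA_LISTS:
        if isinstance(out.get(key), list):
            out[key] = [forbid_additional_properties(sub) for sub in out[key]]
    if isinstance(out.get("items"), dict):
        out["items"] = forbid_additional_properties(out["items"])
    # Only object schemas get the flag.
    if out.get("type") == "object" or "properties" in out:
        out["additionalProperties"] = False
    return out
```

The function copies as it goes, so the input schema is left unchanged.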

  **Function graders loadable from source code**
  `BaseGrader.from_source_code(grader_key, source_code)` compiles a string of Python that defines an async `grade` function and returns a grader. `BaseGrader.from_config()` now handles `Function` grader entries from the platform, calling `from_source_code` under the hood. Function graders only run inside sandboxed recipe jobs submitted via the platform — outside the sandbox the source code is not available and `from_config` raises a descriptive error.
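
To make the idea concrete, here is a minimal sketch of loading an async `grade` function from a string with the standard library. It is not how `from_source_code` is implemented: `load_grade_function` is a hypothetical name, and the `completion: str` signature of `grade` is an assumption made for the example.

```python
import asyncio
import inspect
from typing import Any, Callable


def load_grade_function(source_code: str) -> Callable[..., Any]:
    """Hypothetical loader: execute the source and return its async `grade`."""
    namespace: dict = {}
    # exec() runs arbitrary code, which is why this belongs inside a sandbox.
    exec(compile(source_code, "<grader>", "exec"), namespace)
    grade = namespace.get("grade")
    if not inspect.iscoroutinefunction(grade):
        raise ValueError("source code must define `async def grade(...)`")
    return grade


# Assumed signature, for illustration only.
SOURCE = '''
async def grade(completion: str) -> float:
    return 1.0 if "42" in completion else 0.0
'''

grade = load_grade_function(SOURCE)
```

The loaded function is an ordinary coroutine function, so `asyncio.run(grade("the answer is 42"))` evaluates it outside a training loop.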

  **Train by epochs in RL recipes**
  `PPO`, `GRPO`, `GSPO`, and `ENVPPO` accept `epochs: float | None` to specify training length in dataset passes. This pairs with the breaking change above: exactly one of `epochs` or `max_num_*_steps` must be set.

  ```python theme={null}
  await GRPO(
      dataset=prompts,
      model=policy_model,
      grader=grader,
      epochs=2.0,  # two passes over the dataset
  ).run()
  ```

  **DPO with arbitrary loss functions**
  A new `train_dpo_tangents` joins `train_sft_tangents` and `train_grpo_tangents` for tangent-based DPO training.

  ```python theme={null}
  from adaptive_harmony.core.losses import train_dpo_tangents

  metrics = await train_dpo_tangents(
      model,
      pos_thread,
      neg_thread,
      ref_logprobs_pos,
      ref_logprobs_neg,
      beta=0.1,
  )
  ```

  See the [Bring your own loss functions](/v0.14/harmony/algorithms#bring-your-own-loss-functions) section for the broader pattern.

  **Logger step persisted across resumes**
  All loggers (`WandbLogger`, `MLFlowLogger`, `TBMetricsLogger`, `AdaptiveLogger`, `StdoutLogger`, `CompositeLogger`) now expose `get_state()` / `set_state()`. The trainer saves logger state into the checkpoint and restores it on resume, so step counters no longer reset to 0 mid-run.
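
The pattern is easy to mirror in a custom logger. The sketch below uses a hypothetical `StepLogger` class, and it assumes the state is a plain dictionary holding the step; the real loggers may store more.

```python
class StepLogger:
    """Hypothetical logger that tracks an increasing step counter."""

    def __init__(self) -> None:
        self.step = 0

    def log(self, metrics: dict) -> None:
        print(f"step={self.step} {metrics}")
        self.step += 1

    def get_state(self) -> dict:
        # Everything needed to continue numbering after a resume.
        return {"step": self.step}

    def set_state(self, state: dict) -> None:
        self.step = state["step"]


# First run: log three steps, then save the logger state with the checkpoint.
first_run = StepLogger()
for loss in (0.9, 0.7, 0.6):
    first_run.log({"loss": loss})
checkpoint = {"logger": first_run.get_state()}

# Resumed run: restore the state so numbering continues instead of resetting.
resumed = StepLogger()
resumed.set_state(checkpoint["logger"])
```

After the restore, the next call to `resumed.log()` reports step 3 rather than step 0.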

  **Multi-stage checkpointing**
  Recipes that chain multiple trainers (for example, SFT followed by GSPO) now checkpoint each stage independently. When a job is resumed, every stage picks up where it left off: completed stages are skipped, the active stage restarts from its latest checkpoint, and later stages run fresh. No code changes are required; the runtime tracks the stage counter for you.
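
The resume rule can be pictured as a small planning function over the stage counter. `plan_stages` and the action labels are hypothetical and only restate the behaviour described above; the runtime does this bookkeeping internally.

```python
def plan_stages(stages: list, active_index: int) -> dict:
    """Hypothetical sketch: decide what each stage does when a job resumes."""
    plan = {}
    for index, name in enumerate(stages):
        if index < active_index:
            plan[name] = "skip"  # finished before the interruption
        elif index == active_index:
            plan[name] = "resume from latest checkpoint"
        else:
            plan[name] = "run fresh"  # never started
    return plan


# A job interrupted during the second of two stages.
plan = plan_stages(["sft", "gspo"], active_index=1)
```

Here the SFT stage is skipped and the GSPO stage resumes from its latest checkpoint.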

  **Concurrency control on `async_map`**
  `async_map` and `async_map_fallible` accept `max_concurrent_samples: int | None = None` to bound parallelism with a semaphore, and `stage_notifier: StageNotifier | None = None` to report per-sample progress to the platform.

  ```python theme={null}
  results = await async_map(
      grader.score_float_value,
      samples,
      max_concurrent_samples=8,
      stage_notifier=ctx.job.stage_notifier("Grading"),
  )
  ```

  `async_map_fallible` adds a `return_indices=True` overload that returns `list[tuple[int, T]]`, so you can map results back to the original input indices.
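
For intuition, semaphore-bounded mapping with index tracking looks roughly like the following. This is a standalone sketch, not the library's code: `map_fallible_with_indices` and `maybe_fail` are hypothetical names, and failures are silently dropped here for brevity.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_fallible_with_indices(
    fn: Callable[[T], Awaitable[R]],
    items: list,
    max_concurrent_samples: Optional[int] = None,
) -> list:
    """Hypothetical sketch: bounded-concurrency map that drops failures."""
    semaphore = (
        asyncio.Semaphore(max_concurrent_samples) if max_concurrent_samples else None
    )

    async def run_one(item):
        if semaphore is None:
            return await fn(item)
        async with semaphore:  # at most `max_concurrent_samples` run at once
            return await fn(item)

    outcomes = await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )
    # Keep (original_index, result) pairs for the samples that succeeded.
    return [
        (index, outcome)
        for index, outcome in enumerate(outcomes)
        if not isinstance(outcome, BaseException)
    ]


async def maybe_fail(x: int) -> int:
    """Demo function that rejects negative inputs."""
    if x < 0:
        raise ValueError("negative input")
    return x * 2
```

Mapping `maybe_fail` over `[1, -1, 3]` keeps indices 0 and 2, so each result can be matched to the input that produced it even though the middle sample failed.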

  **Rich progress bars for model spawning**
  `spawn_train()` and `spawn_inference()` now render progress with `rich.progress.Progress` instead of `tqdm`. This is a visible behaviour change in interactive sessions.

  ### Bug fixes

  * **OpenAI strict-mode schemas now validate.** `additionalProperties: false` is added recursively to all object schemas, including those reached via `$defs`, `definitions`, `anyOf`/`oneOf`/`allOf`, and `items`.
  * **Repeated samples in GRPO/GSPO deepcopy metadata.** Sample metadata is no longer shared across grouped completions, preventing accidental mutation across the group.
  * **Thread weighting in dataset loading.** SFT and RLHF recipes had a bug where some assistant turns received incorrect weights. Loading and recipe weighting now agree.
  * **MLflow experiment tag for project lookup.** The MLflow run tag used for cross-system project lookup was wrong; it now matches what the platform expects.
  * **Improved user-error messages in trainer init.** Hyperparameter mismatches and missing inputs raise more specific messages instead of opaque assertion errors.
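
The metadata fix for repeated GRPO/GSPO samples comes down to copying instead of sharing. The sketch below uses plain dictionaries and a hypothetical `repeat_sample` helper to show the difference; it does not reflect the library's sample types.

```python
import copy


def repeat_sample(sample: dict, group_size: int) -> list:
    """Hypothetical helper: one independent copy of `sample` per completion."""
    return [copy.deepcopy(sample) for _ in range(group_size)]


prompt = {"text": "2 + 2 = ?", "metadata": {"tags": []}}
group = repeat_sample(prompt, 3)

# Mutating one copy leaves its siblings and the original untouched.
group[0]["metadata"]["tags"].append("graded")
```

Building the group as `[sample] * group_size` would instead make every entry the same object, so the appended tag would show up on all of them.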
</Update>

<Update label="v1.0.0" description="2026-03-02">
  Depends on `harmony-client~=1.0`.

  ### Breaking changes

  **"Use Cases" renamed to "Projects"**
  All occurrences of `use_case` are renamed to `project`, notably in `get_client` and `RecipeConfig`.

  **Environment API rewrite**
  `Environment.react_to()` now returns a `StringThread` and a list of `Grade` objects, rather than a single next turn and a single reward float.

  ```python theme={null}
  # Before (v0.1.x)
  async def react_to(self, thread: StringThread) -> list[tuple[str, str]] | TrajectoryScore:
      ...

  # After (v1.0.0)
  async def react_to(self, thread: StringThread) -> tuple[StringThread | None, list[Grade] | None]:
      ...
  ```

  * Return `(next_thread, grades)` on each turn. Return `(None, grades)` to terminate the trajectory.
  * When `next_thread` is not a prefix of the current thread, the environment starts a new conversation segment within the same trajectory. This enables multi-turn training with arbitrary state transitions.
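
The return contract can be exercised with a toy environment. Everything in this sketch is a stand-in: `Thread` is a plain list of `(role, content)` tuples rather than `StringThread`, `Grade` is a local dataclass, and `CountdownEnv`, `rollout`, and `fake_model` are hypothetical. The trainers run their own rollout loop.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

Thread = list  # stand-in for StringThread: a list of (role, content) turns


@dataclass
class Grade:  # stand-in for the library's Grade
    value: float
    grader_key: str


class CountdownEnv:
    """Toy environment: the model must count down from `start` to 1."""

    def __init__(self, start: int) -> None:
        self.expected = start

    async def react_to(self, thread: Thread) -> tuple:
        correct = thread[-1][1] == str(self.expected)
        grades = [Grade(value=1.0 if correct else 0.0, grader_key="env")]
        self.expected -= 1
        if not correct or self.expected == 0:
            return None, grades  # (None, grades) terminates the trajectory
        return thread + [("user", "next")], grades  # continue with a user turn


async def rollout(env, thread: Optional[Thread], generate) -> list:
    """Hypothetical driver: alternate model turns and environment reactions."""
    all_grades = []
    while thread is not None:
        thread = thread + [("assistant", await generate(thread))]
        thread, grades = await env.react_to(thread)
        all_grades.extend(grades or [])
    return all_grades


async def fake_model(thread: Thread) -> str:
    """Scripted model that answers 3, then 2, then 1."""
    assistant_turns = sum(1 for role, _ in thread if role == "assistant")
    return str(3 - assistant_turns)
```

Running `asyncio.run(rollout(CountdownEnv(3), [("user", "count down from 3")], fake_model))` collects one grade per turn and stops when the environment returns `None` as the next thread.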

  `Environment.bootstrap_prompt()` is renamed to `Environment.initialize_state()`.

  **`IgnoreScoreException` renamed to `IgnoreGradeException`**
  All grader code that catches or raises this exception must update the class name.

  ```python theme={null}
  # Before
  from adaptive_harmony.graders import IgnoreScoreException

  # After
  from adaptive_harmony.graders import IgnoreGradeException
  ```

  The exception now carries a detailed default message explaining when ignoring a grade is appropriate. The `message` parameter type changed from `str` to `str | None` (default `None`).

  **`CheckpointCallback` deprecated**
  `CheckpointCallback` is deprecated. Use trainer-level checkpointing (`checkpoint_frequency` parameter) instead. Using both may produce inconsistent checkpoint states.

  **Rich progress display off by default**
  The training progress display is now off by default. Set environment variable `ENABLE_RICH_PROGRESS=1` to enable it. The previous `DISABLE_RICH_PROGRESS` environment variable is no longer recognized.

  **Experiment tag key renamed**
  The metric logger tag `adaptive.use_case_id` is renamed to `adaptive.project_id`. Update any code or dashboards that filter on this tag.

  ### What's new

  **Arbitrary state transition support in environment-based trainers**
  Environment-based trainers (PPO, GRPO, GSPO) can now handle environments that produce multiple conversation threads per trajectory.
  Future rewards are "propagated" back, with each turn's advantage derived from the (normalized) future cumulative reward as described in section 4.1.3 of [DeepSeekMath](https://arxiv.org/pdf/2402.03300).
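
The formulation in that section of the paper can be sketched in a few lines: normalize every per-turn reward against the whole group, then give each turn the sum of normalized rewards from that turn onward. The function name `turn_advantages`, the use of the population standard deviation, and the `eps` term are assumptions; the trainers' exact computation may differ.

```python
from statistics import mean, pstdev


def turn_advantages(group_rewards: list, eps: float = 1e-8) -> list:
    """Sketch: per-turn advantages from normalized future cumulative reward.

    `group_rewards[i][t]` is the reward of turn `t` in trajectory `i`.
    """
    flat = [reward for trajectory in group_rewards for reward in trajectory]
    mu, sigma = mean(flat), pstdev(flat)
    advantages = []
    for trajectory in group_rewards:
        normalized = [(reward - mu) / (sigma + eps) for reward in trajectory]
        # Walk backwards so each turn accumulates all later rewards.
        running, suffix_sums = 0.0, []
        for reward in reversed(normalized):
            running += reward
            suffix_sums.append(running)
        advantages.append(suffix_sums[::-1])
    return advantages


# Two trajectories of two turns each.
advantages = turn_advantages([[0.0, 1.0], [1.0, 0.0]])
```

In this example the first turn of each trajectory ends up with an advantage near zero, because the good and bad turns that follow it cancel out after normalization.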

  **ENVPPO: multi-turn PPO**
  A new trainer at `adaptive_harmony.common.env_ppo.ENVPPO` supports PPO training within multi-turn environments. It uses a separate value model for advantage estimation via GAE (Generalized Advantage Estimation), with independent learning rates and schedulers for the policy and value models.

  ```python theme={null}
  from adaptive_harmony.common.env_ppo import ENVPPO

  await ENVPPO(
      dataset=prompts,
      model=policy_model,
      value_model=value_model,
      environment_factory=env_factory,
      lr_policy=0.75e-6,
      lr_value=1e-6,
      max_num_ppo_steps=100,
      gae_lambda=0.95,
      gae_gamma=1.0,
      value_only_fraction=0.25,
  ).run()
  ```

  Key parameters:

  * **`value_only_fraction`**: Fraction of training steps where only the value model updates (default: `0.25`). The policy model stays frozen during this warmup.
  * **`lr_policy`** / **`lr_value`**: Independent learning rates for the policy and value models.
  * **`gae_lambda`** / **`gae_gamma`**: GAE parameters for advantage estimation (defaults: `0.95` and `1.0`).
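
For readers unfamiliar with GAE, the standard recursion that `gae_gamma` and `gae_lambda` control is sketched below. This is the textbook formulation over one trajectory with a zero value after the final step, not ENVPPO's internal code. `gae_advantages` is a hypothetical name, and whether the library applies the recursion per token or per turn is not stated here.

```python
def gae_advantages(
    rewards: list,
    values: list,
    gae_gamma: float = 1.0,
    gae_lambda: float = 0.95,
) -> list:
    """Textbook GAE: A_t = delta_t + gamma * lambda * A_{t+1}."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # nothing follows the terminal step
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: how much better this step went than the value predicted.
        delta = rewards[t] + gae_gamma * next_value - values[t]
        next_advantage = delta + gae_gamma * gae_lambda * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


# Sparse reward at the end, constant value estimate of 0.5.
example = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
```

With `gae_lambda=1.0` and `gae_gamma=1.0`, each advantage reduces to the sum of future rewards minus the value estimate at that step.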

  **Tangent-based loss computation**
  SFT, GRPO, ENVGRPO, and ENVGSPO recipes now support `use_tangents=True`, which computes gradients through an explicit loss function (`adaptive_harmony.core.losses`) instead of relying on built-in training methods. This exposes per-step training metrics.

  ```python theme={null}
  from adaptive_harmony.common.sft import SFT

  await SFT(
      dataset=dataset,
      model=model,
      use_tangents=True,
      logger=logger,
  ).run()
  ```

  When enabled, the logger receives additional `policy/*` metrics:

  * **SFT**: `policy/loss`, `policy/num_tokens`
  * **GRPO/ENVGRPO/ENVGSPO**: `policy/pg_clipfrac`, `policy/inner_kl_div`, `policy/Dkl`

  **Dataset seeding**
  All training recipes now accept a `data_seed` parameter (default: `42`) that controls dataset shuffling. A fixed seed keeps the shuffle order identical across runs and restarts.

  ```python theme={null}
  await GRPO(
      dataset=dataset,
      model=model,
      grader=grader,
      data_seed=42,  # Reproducible shuffle
  ).run()
  ```
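
The underlying idea is a shuffle driven by a private, seeded random generator. The sketch below shows it with the standard library; `shuffled` is a hypothetical helper, and the recipes' actual shuffling code may differ.

```python
import random


def shuffled(dataset: list, data_seed: int = 42) -> list:
    """Return a shuffled copy whose order depends only on `data_seed`."""
    rng = random.Random(data_seed)  # private RNG, independent of global state
    out = list(dataset)
    rng.shuffle(out)
    return out


samples = [f"prompt-{i}" for i in range(10)]
first = shuffled(samples, data_seed=42)
second = shuffled(samples, data_seed=42)
```

Both calls produce the same order, and the input list is left untouched.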

  **Optional system prompt in `TemplatedPromptJudge`**
  The `system_template` parameter in `TemplatedPromptJudge` now accepts `None`. When `None`, the judge thread contains only a user turn.

  **Dataset weight validation for SFT**
  SFT training now checks dataset turn weights before training starts and logs warnings for:

  * User turns with non-zero weight (you may be unintentionally training on user messages)
  * Assistant turns with zero weight (these turns do not contribute to training)
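
The two checks can be run by hand on your own data before launching a job. This sketch assumes each thread is a list of `(role, content, weight)` tuples, which is a simplification of the library's thread types; `check_turn_weights` is a hypothetical name.

```python
import logging

logger = logging.getLogger("sft_weight_check")


def check_turn_weights(thread: list) -> list:
    """Return warning messages for suspicious turn weights in one thread."""
    warnings = []
    for index, (role, _content, weight) in enumerate(thread):
        if role == "user" and weight != 0:
            warnings.append(f"turn {index}: user turn has non-zero weight {weight}")
        elif role == "assistant" and weight == 0:
            warnings.append(f"turn {index}: assistant turn has zero weight")
    for message in warnings:
        logger.warning(message)
    return warnings


thread = [
    ("user", "Hi", 1.0),  # would be trained on, probably by mistake
    ("assistant", "Hello!", 0.0),  # contributes nothing to training
    ("assistant", "How can I help?", 1.0),  # fine
]
found = check_turn_weights(thread)
```

The example thread triggers both warnings: one for the weighted user turn and one for the zero-weight assistant turn.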

  **System prompt override utility**
  A new `override_system_prompt()` function replaces or adds a system prompt on every thread in a dataset:

  ```python theme={null}
  from adaptive_harmony.core.utils import override_system_prompt

  dataset = override_system_prompt(dataset, "You are a helpful assistant.")
  ```

  **Image utilities**
  New functions in `adaptive_harmony.core.image_utils`:

  * `pil_to_base64()`: converts a PIL image to a base64 string with optional resizing and grayscale conversion
  * `image_to_base64()`: converts an image file path to a base64 string

  **Weight-aware display**
  `StringThread` and `TokenizedThread` repr output now highlights trained turns (weight > 0) in green and untrained turns in blue. The HTML export (`string_thread_to_html`) applies matching color coding and green borders on images in trained turns.

  **Improved error messages**

  * `async_map_batch` now raises a `RuntimeError` with a clear message when the batch failure rate exceeds the threshold, including the allowed failure percentage and the last error thrown.
  * `async_map_fallible` now logs exceptions before suppressing them.

  ### Bug fixes

  * **External dataset feedback values are no longer silently dropped.** When parsing datasets with a `feedbacks` field and a configured `feedback_key`, the reward value is now correctly extracted into the thread metadata.
  * **Progress percentage no longer exceeds 100%.** When `async_map_batch` retries failed samples, the processed count could exceed the total. Progress reporting now clamps to the total and logs a warning.
  * **Failed trajectories in ENVGRPO/ENVGSPO no longer discard the entire group.** Trajectory generation within a group switched from `async_map` to `async_map_batch`. If a trajectory fails, a new one is attempted instead of losing all completions for that prompt. Group normalization runs on the successful trajectories.
</Update>
