v1.0.0
2026-03-02
Depends on harmony-client~=1.0.

Breaking changes

“Use Cases” renamed to “Projects”
Everywhere use_case appeared is now project, notably in get_client and RecipeConfig.

Environment API rewrite
Environment.react_to() now returns a tuple of an optional StringThread and an optional list of Grades, rather than a single next turn and a single reward float.
```python
# Before (v0.1.x)
async def react_to(self, thread: StringThread) -> list[tuple[str, str]] | TrajectoryScore:
    ...

# After (v1.0.0)
async def react_to(self, thread: StringThread) -> tuple[StringThread | None, list[Grade] | None]:
    ...
```
  • Return (next_thread, grades) on each turn. Return (None, grades) to terminate the trajectory.
  • When next_thread is not a prefix of the current thread, the environment starts a new conversation segment within the same trajectory. This enables multi-turn training with arbitrary state transitions.
Environment.bootstrap_prompt() renamed
Environment.bootstrap_prompt() is renamed to Environment.initialize_state().

IgnoreScoreException renamed to IgnoreGradeException
All grader code that catches or raises this exception must update the class name.
```python
# Before
from adaptive_harmony.graders import IgnoreScoreException

# After
from adaptive_harmony.graders import IgnoreGradeException
```
The exception now carries a detailed default message explaining when ignoring a grade is appropriate, and the message parameter type changed from str to str | None (default None).

CheckpointCallback deprecated
CheckpointCallback is deprecated; use trainer-level checkpointing (the checkpoint_frequency parameter) instead. Using both may produce inconsistent checkpoint states.

Rich progress display off by default
The training progress display is now off by default. Set the environment variable ENABLE_RICH_PROGRESS=1 to enable it. The previous DISABLE_RICH_PROGRESS environment variable is no longer recognized.

Experiment tag key renamed
The metric logger tag adaptive.use_case_id is renamed to adaptive.project_id. Update any code or dashboards that filter on this tag.

What’s new

Arbitrary state transition support in environment-based trainers
Environment-based trainers (PPO, GRPO, GSPO) can now handle environments that produce multiple conversation threads per trajectory. Future rewards are propagated back, with each turn’s advantage derived from the (normalized) future cumulative reward, as described in section 4.1.3 of DeepSeekMath.

ENVPPO: multi-turn PPO
A new trainer, adaptive_harmony.common.env_ppo.ENVPPO, supports PPO training within multi-turn environments. It uses a separate value model for advantage estimation via GAE (Generalized Advantage Estimation), with independent learning rates and schedulers for the policy and value models.
```python
from adaptive_harmony.common.env_ppo import ENVPPO

await ENVPPO(
    dataset=prompts,
    model=policy_model,
    value_model=value_model,
    environment_factory=env_factory,
    lr_policy=0.75e-6,
    lr_value=1e-6,
    max_num_ppo_steps=100,
    gae_lambda=0.95,
    gae_gamma=1.0,
    value_only_fraction=0.25,
).run()
```
Key parameters:
  • value_only_fraction: Fraction of training steps where only the value model updates (default: 0.25). The policy model stays frozen during this warmup.
  • lr_policy / lr_value: Independent learning rates for the policy and value models.
  • gae_lambda / gae_gamma: GAE parameters for advantage estimation (defaults: 0.95 and 1.0).
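GAE itself is standard; a minimal sketch of the computation that gae_lambda and gae_gamma control (not the trainer's actual implementation) looks like this:

```python
def gae_advantages(
    rewards: list[float],
    values: list[float],
    gae_gamma: float = 1.0,
    gae_lambda: float = 0.95,
) -> list[float]:
    """Generalized Advantage Estimation over a finite trajectory.

    values[t] is the value estimate for step t; the value after the final
    step is taken to be 0 (episode termination).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        # TD residual for step t.
        delta = rewards[t] + gae_gamma * next_value - values[t]
        # Exponentially weighted sum of residuals, decayed by gamma * lambda.
        gae = delta + gae_gamma * gae_lambda * gae
        advantages[t] = gae
    return advantages
```

With gae_gamma=1.0 and gae_lambda=1.0 this reduces to the undiscounted future cumulative reward minus the value baseline; smaller gae_lambda trades variance for bias.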
Dataset seeding
All training recipes now accept a data_seed parameter (default: 42) that controls dataset shuffling, ensuring reproducible training runs across restarts.
```python
await GRPO(
    dataset=dataset,
    model=model,
    grader=grader,
    data_seed=42,  # Reproducible shuffle
).run()
```
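The effect of data_seed can be illustrated with a seeded shuffle (illustrative; not necessarily how the library shuffles internally):

```python
import random

def shuffled_dataset(dataset: list, data_seed: int = 42) -> list:
    """Return a deterministically shuffled copy of the dataset.

    A private random.Random(data_seed) instance keeps the shuffle
    reproducible without touching the global RNG state.
    """
    items = list(dataset)
    random.Random(data_seed).shuffle(items)
    return items
```

Two runs with the same seed yield the same order; changing the seed yields a different (but equally reproducible) order.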
Optional system prompt in TemplatedPromptJudge
The system_template parameter in TemplatedPromptJudge now accepts None. When it is None, the judge thread contains only a user turn.

Dataset weight validation for SFT
SFT training now checks dataset turn weights before training starts and logs warnings for:
  • User turns with non-zero weight (you may be unintentionally training on user messages)
  • Assistant turns with zero weight (these turns do not contribute to training)
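The two checks above amount to a scan over per-turn weights. This sketch models turns as (role, content, weight) tuples, which is a simplification of the library's thread types:

```python
import logging

logger = logging.getLogger(__name__)

def validate_turn_weights(threads: list[list[tuple[str, str, float]]]) -> list[str]:
    """Collect and log warnings for suspicious turn weights.

    Mirrors the two SFT checks: user turns with non-zero weight and
    assistant turns with zero weight.
    """
    warnings: list[str] = []
    for i, thread in enumerate(threads):
        for j, (role, _content, weight) in enumerate(thread):
            if role == "user" and weight != 0:
                warnings.append(
                    f"thread {i}, turn {j}: user turn has non-zero weight {weight}"
                )
            elif role == "assistant" and weight == 0:
                warnings.append(f"thread {i}, turn {j}: assistant turn has zero weight")
    for w in warnings:
        logger.warning(w)
    return warnings
```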
System prompt override utility
A new override_system_prompt() function replaces or adds a system prompt on every thread in a dataset:
```python
from adaptive_harmony.core.utils import override_system_prompt

dataset = override_system_prompt(dataset, "You are a helpful assistant.")
```
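The replace-or-add behavior can be sketched as follows, with threads modeled as lists of (role, content) tuples rather than the library's thread types:

```python
def override_system_prompt(dataset, system_prompt):
    """Replace an existing leading system turn, or prepend one, on every thread.

    Sketch only: the real function operates on the library's thread objects.
    """
    result = []
    for thread in dataset:
        if thread and thread[0][0] == "system":
            # Replace the existing system turn.
            result.append([("system", system_prompt)] + list(thread[1:]))
        else:
            # No system turn yet: prepend one.
            result.append([("system", system_prompt)] + list(thread))
    return result
```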
Image utilities
New functions in adaptive_harmony.core.image_utils:
  • pil_to_base64(): converts a PIL image to a base64 string with optional resizing and grayscale conversion
  • image_to_base64(): converts an image file path to a base64 string
Weight-aware display
StringThread and TokenizedThread repr output now highlights trained turns (weight > 0) in green and untrained turns in blue. The HTML export (string_thread_to_html) applies matching color coding and draws green borders on images in trained turns.

Improved error messages
  • async_map_batch now raises a RuntimeError with a clear message when the batch failure rate exceeds the threshold, including the allowed failure percentage and the last error thrown.
  • async_map_fallible now logs exceptions before suppressing them.
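The failure-threshold behavior can be sketched with a simplified sequential mapper. The function name and signature below are invented for illustration; async_map_batch itself runs samples concurrently and supports retries:

```python
import asyncio

async def map_with_failure_threshold(fn, items, max_failure_rate=0.2):
    """Apply async fn to each item, tolerating failures up to a threshold.

    Raises RuntimeError naming the allowed failure percentage and the last
    error once too many items have failed.
    """
    results, failures, last_error = [], 0, None
    for item in items:
        try:
            results.append(await fn(item))
        except Exception as e:  # collect failures, report below
            failures += 1
            last_error = e
    if failures / max(len(items), 1) > max_failure_rate:
        raise RuntimeError(
            f"{failures}/{len(items)} samples failed, exceeding the allowed "
            f"{max_failure_rate:.0%} failure rate; last error: {last_error!r}"
        )
    return results
```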

Bug fixes

  • External dataset feedback values are no longer silently dropped. When parsing datasets with a feedbacks field and a configured feedback_key, the reward value is now correctly extracted into the thread metadata.
  • Progress percentage no longer exceeds 100%. When async_map_batch retries failed samples, the processed count could exceed the total. Progress reporting now clamps to the total and logs a warning.
  • Failed trajectories in ENVGRPO/ENVGSPO no longer discard the entire group. Trajectory generation within a group switched from async_map to async_map_batch. If a trajectory fails, a new one is attempted instead of losing all completions for that prompt. Group normalization runs on the successful trajectories.