Depends on `harmony-client~=1.0`.

## Breaking changes
### "Use Cases" renamed to "Projects"

Anywhere `use_case` was used is now converted to `project`, namely in `get_client` and `RecipeConfig`.

### Environment API rewrite
`Environment.react_to()` now returns a `StringThread` and a list of `Grade`s, rather than a single next turn and a single reward float.

- Return `(next_thread, grades)` on each turn. Return `(None, grades)` to terminate the trajectory.
- When `next_thread` is not a prefix of the current thread, the environment starts a new conversation segment within the same trajectory. This enables multi-turn training with arbitrary state transitions.
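The contract above can be sketched with a toy environment. The `StringThread` and `Grade` stand-ins below are simplified stubs for illustration only; the real classes come from `adaptive_harmony` and have richer APIs.

```python
from dataclasses import dataclass, field

# Stand-in types for illustration; the real StringThread and Grade
# classes come from adaptive_harmony.
@dataclass
class Grade:
    name: str
    value: float

@dataclass
class StringThread:
    turns: list = field(default_factory=list)  # (role, content) tuples

class CountdownEnv:
    """Toy environment: terminates after `max_turns` assistant replies."""

    def __init__(self, max_turns: int = 2):
        self.max_turns = max_turns
        self.seen_turns = 0

    def react_to(self, thread: StringThread):
        self.seen_turns += 1
        grades = [Grade("step_reward", 1.0)]
        if self.seen_turns >= self.max_turns:
            # Return (None, grades) to terminate the trajectory.
            return None, grades
        # Return (next_thread, grades) to continue. A next_thread that is
        # not a prefix of the current thread would start a new conversation
        # segment within the same trajectory.
        next_thread = StringThread(thread.turns + [("user", "keep going")])
        return next_thread, grades

env = CountdownEnv(max_turns=2)
t0 = StringThread([("user", "start"), ("assistant", "ok")])
t1, g1 = env.react_to(t0)  # continues: t1 extends the thread
t2, g2 = env.react_to(t1)  # terminates: t2 is None
```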
`Environment.bootstrap_prompt()` is renamed to `Environment.initialize_state()`.

### IgnoreScoreException renamed to IgnoreGradeException
All grader code that catches or raises this exception must update the class name. The `message` parameter type changed from `str` to `str | None` (default `None`); the exception now carries a detailed default message explaining when ignoring a grade is appropriate.
### CheckpointCallback deprecated

`CheckpointCallback` is deprecated. Use trainer-level checkpointing (the `checkpoint_frequency` parameter) instead. Using both may produce inconsistent checkpoint states.
### Rich progress display off by default

The training progress display is now off by default. Set the environment variable `ENABLE_RICH_PROGRESS=1` to enable it. The previous `DISABLE_RICH_PROGRESS` environment variable is no longer recognized.
### Experiment tag key renamed

The metric logger tag `adaptive.use_case_id` is renamed to `adaptive.project_id`. Update any code or dashboards that filter on this tag.

## What's new
### Arbitrary state transition support in environment-based trainers

Environment-based trainers (PPO, GRPO, GSPO) can now handle environments that produce multiple conversation threads per trajectory. Future rewards are propagated back, with each turn's advantage derived from the (normalized) future cumulative reward as described in section 4.1.3 of DeepSeekMath.

### ENVPPO: multi-turn PPO

A new trainer at `adaptive_harmony.common.env_ppo.ENVPPO` supports PPO training within multi-turn environments. It uses a separate value model for advantage estimation via GAE (Generalized Advantage Estimation), with independent learning rates and schedulers for the policy and value models. Key parameters:

- `value_only_fraction`: Fraction of training steps where only the value model updates (default: `0.25`). The policy model stays frozen during this warmup.
- `lr_policy` / `lr_value`: Independent learning rates for the policy and value models.
- `gae_lambda` / `gae_gamma`: GAE parameters for advantage estimation (defaults: `0.95` and `1.0`).
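The two advantage computations involved here can be illustrated library-independently: propagating future cumulative reward backwards with optional normalization, and standard GAE with the `gae_gamma`/`gae_lambda` defaults. This is a sketch of the textbook formulas, not the trainers' exact code:

```python
def rewards_to_go(rewards, gamma=1.0):
    """Future cumulative reward for each turn, propagated backwards."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def normalize(xs):
    """Zero-mean / unit-std normalization of per-turn returns."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
    return [(x - mean) / std for x in xs]

def gae(rewards, values, gae_gamma=1.0, gae_lambda=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` holds one entry per turn plus a bootstrap value
    (0.0 for a terminated trajectory).
    """
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gae_gamma * values[t + 1] - values[t]
        last = delta + gae_gamma * gae_lambda * last
        advantages[t] = last
    return advantages

rtg = rewards_to_go([0.0, 0.0, 1.0])             # terminal reward reaches every turn
adv = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0])  # discounted by lambda toward earlier turns
```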
### New data_seed parameter

A `data_seed` parameter (default: `42`) controls dataset shuffling. This ensures reproducible training runs across restarts.

### Optional system prompt in TemplatedPromptJudge

The `system_template` parameter in `TemplatedPromptJudge` now accepts `None`. When `None`, the judge thread contains only a user turn.

### Dataset weight validation for SFT
SFT training now checks dataset turn weights before training starts and logs warnings for:

- User turns with non-zero weight (you may be unintentionally training on user messages)
- Assistant turns with zero weight (these turns do not contribute to training)
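The check described above is straightforward to reason about. Here is a hedged, self-contained sketch of the logic, modeling turns as `(role, content, weight)` tuples rather than the library's own thread types:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("sft_checks")

def check_turn_weights(threads):
    """Illustrative version of the pre-training weight validation:
    collects (and logs) one warning per suspicious turn."""
    warnings = []
    for i, thread in enumerate(threads):
        for j, (role, _content, weight) in enumerate(thread):
            if role == "user" and weight != 0:
                warnings.append(
                    f"thread {i}, turn {j}: user turn has non-zero weight "
                    f"({weight}); you may be training on user messages"
                )
            elif role == "assistant" and weight == 0:
                warnings.append(
                    f"thread {i}, turn {j}: assistant turn has zero weight "
                    "and will not contribute to training"
                )
    for w in warnings:
        log.warning(w)
    return warnings

dataset = [
    [("user", "hi", 1.0), ("assistant", "hello", 1.0)],  # user weight suspicious
    [("user", "q", 0.0), ("assistant", "a", 0.0)],       # assistant weight suspicious
]
found = check_turn_weights(dataset)
```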
### override_system_prompt helper

The `override_system_prompt()` function replaces or adds a system prompt on every thread in a dataset.

### Image utilities

New functions in `adaptive_harmony.core.image_utils`:

- `pil_to_base64()`: converts a PIL image to a base64 string with optional resizing and grayscale conversion
- `image_to_base64()`: converts an image file path to a base64 string
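Stdlib-only sketches of the two helpers above, under loose assumptions: threads are modeled as `(role, content)` tuple lists rather than `StringThread`s, and the `pil_to_base64` resizing/grayscale options are omitted since they would require Pillow. The real implementations live in `adaptive_harmony` and may differ:

```python
import base64
from pathlib import Path

def override_system_prompt(threads, system_prompt):
    """Sketch: replace any existing system turn, or prepend one,
    on every thread in the dataset."""
    updated = []
    for thread in threads:
        rest = [turn for turn in thread if turn[0] != "system"]
        updated.append([("system", system_prompt)] + rest)
    return updated

def image_to_base64(path):
    """Sketch: read an image file and return its base64 string.
    The real helper may validate the image format."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")

dataset = [
    [("system", "old"), ("user", "hi")],
    [("user", "hello")],
]
new = override_system_prompt(dataset, "Be concise.")
```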
### Thread display improvements

`StringThread` and `TokenizedThread` repr output now highlights trained turns (weight > 0) in green and untrained turns in blue. The HTML export (`string_thread_to_html`) applies matching color coding and green borders on images in trained turns.

### Improved error messages

`async_map_batch` now raises a `RuntimeError` with a clear message when the batch failure rate exceeds the threshold, including the allowed failure percentage and the last error thrown. `async_map_fallible` now logs exceptions before suppressing them.
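The failure-accounting behavior described for `async_map_batch` can be sketched as follows. This is a simplified, sequential stand-in (the real helper maps concurrently); the function name and error wording here are illustrative, not the library's:

```python
import asyncio

async def async_map_batch_like(fn, items, max_failure_rate=0.2):
    """Illustrative failure accounting: collect results, count failures,
    and raise RuntimeError with the allowed percentage and the last
    error once the failure rate exceeds the threshold."""
    results, failures, last_error = [], 0, None
    for item in items:
        try:
            results.append(await fn(item))
        except Exception as exc:  # counted, like async_map_fallible logs then suppresses
            failures += 1
            last_error = exc
    if failures / len(items) > max_failure_rate:
        raise RuntimeError(
            f"batch failure rate {failures / len(items):.0%} exceeded the "
            f"allowed {max_failure_rate:.0%}; last error: {last_error!r}"
        )
    return results

async def flaky(x):
    """Hypothetical worker that fails on odd inputs."""
    if x % 2:
        raise ValueError(f"bad item {x}")
    return x * 10

try:
    asyncio.run(async_map_batch_like(flaky, [0, 1, 2, 3], max_failure_rate=0.25))
except RuntimeError as exc:
    message = str(exc)
```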
## Bug fixes
- External dataset feedback values are no longer silently dropped. When parsing datasets with a `feedbacks` field and a configured `feedback_key`, the reward value is now correctly extracted into the thread metadata.
- Progress percentage no longer exceeds 100%. When `async_map_batch` retries failed samples, the processed count could exceed the total. Progress reporting now clamps to the total and logs a warning.
- Failed trajectories in ENVGRPO/ENVGSPO no longer discard the entire group. Trajectory generation within a group switched from `async_map` to `async_map_batch`. If a trajectory fails, a new one is attempted instead of losing all completions for that prompt. Group normalization runs on the successful trajectories only.

