[skyrl-train] Add trainer-side max_response_length for Dr. GRPO normalization and DAPO overlong handling #1440
Conversation
Code Review
This pull request introduces `max_response_length` as a dedicated configuration parameter for loss normalization and algorithmic corrections such as DAPO overlong handling. It centralizes the DAPO punishment logic into a new utility function, `apply_dapo_soft_overlong_punishment`, and refactors several examples to use it. Additionally, the trainer now resolves the appropriate normalizer for `seq_mean_token_sum_norm` loss reduction, maintaining backward compatibility with `max_seq_len`. A review comment highlighted a missing validation in the new utility function, where an `overlong_buffer_len` exceeding the response length could result in incorrect penalty application.
```python
adjusted_rewards: List[Reward] = []
for response, reward, sample_max_response_length in zip(response_ids, rewards, max_response_lengths):
    response_length = len(response)
```
The validation logic in `validate_cfg` checks whether `overlong_buffer_len > max_response_length`, but that check is missing here. If `overlong_buffer_len` is greater than `sample_max_response_length`, `max_exceed_length` becomes negative, so the penalty is incorrectly applied to every sequence, including those that are not overlong. Adding this validation would make the utility more robust, especially for direct usage outside the main training entrypoint.
```python
if overlong_buffer_len > sample_max_response_length:
    raise ValueError(
        f"`overlong_buffer_len` ({overlong_buffer_len}) must be <= `sample_max_response_length` ({sample_max_response_length})"
    )
response_length = len(response)
```
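Putting the suggested guard together with the loop shown in the diff, the helper could look like the sketch below. This is not the PR's actual implementation: the `Reward` type from the diff is simplified to plain `float`, the exact signature is assumed, and the ramp formula follows DAPO-style soft overlong punishment (penalty grows linearly once a response enters the overlong buffer).

```python
from typing import List


def apply_dapo_soft_overlong_punishment(
    response_ids: List[List[int]],
    rewards: List[float],
    max_response_lengths: List[int],
    overlong_buffer_len: int,
    overlong_buffer_penalty_factor: float,
) -> List[float]:
    """Sketch of DAPO soft overlong punishment with the reviewer's guard.

    Assumptions: rewards are plain floats (the real utility may use a
    Reward object) and the signature mirrors the loop in the diff.
    """
    adjusted_rewards: List[float] = []
    for response, reward, sample_max_response_length in zip(
        response_ids, rewards, max_response_lengths
    ):
        # Guard suggested in the review: a buffer longer than the response
        # budget would make max_exceed_length negative and penalize everything.
        if overlong_buffer_len > sample_max_response_length:
            raise ValueError(
                f"`overlong_buffer_len` ({overlong_buffer_len}) must be <= "
                f"`sample_max_response_length` ({sample_max_response_length})"
            )
        response_length = len(response)
        # Length at which the penalty starts ramping up.
        max_exceed_length = sample_max_response_length - overlong_buffer_len
        if response_length > max_exceed_length:
            exceed = response_length - max_exceed_length
            # Linear ramp from 0 to the full penalty factor across the buffer.
            penalty = min(exceed / overlong_buffer_len, 1.0) * overlong_buffer_penalty_factor
            reward = reward - penalty
        adjusted_rewards.append(reward)
    return adjusted_rewards
```

With the guard in place, a response at the full budget receives the maximum penalty, while anything at or below `sample_max_response_length - overlong_buffer_len` is untouched.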
Summary
Refs #756.
This PR adds a trainer-side `max_response_length` configuration that can be used for Dr. GRPO-style loss normalization and for DAPO overlong punishment logic, without changing the existing meaning of `trainer.algorithm.max_seq_len`. The implementation stays additive and backward-compatible:
- `trainer.algorithm.max_response_length` is now the preferred constant for `seq_mean_token_sum_norm`
- `trainer.algorithm.max_seq_len` remains the total sequence length knob and the legacy fallback normalizer

What Changed
Core training/config behavior
- Added `trainer.algorithm.max_response_length` to the train config
- `seq_mean_token_sum_norm` now resolves `max_response_length`, with fallback to `max_seq_len`
- Kept `max_seq_len` as the prompt+response / batching-oriented setting instead of repurposing it
- Added validation for `max_response_length` values
- Handled the case where `seq_mean_token_sum_norm` is used without `max_response_length`

DAPO support
- Extracted `apply_dapo_soft_overlong_punishment(...)` into a shared trainer utility
- Falls back to `max_seq_len` when `max_response_length` is unset

Docs and examples
- Documented `max_response_length` in the config docs and base PPO config
- Updated examples to use `trainer.algorithm.max_response_length`

Tests
- Added tests for `resolve_seq_loss_normalizer`
- Added a test that `_normalize_advantages()` prefers `max_response_length` over `max_seq_len`

Why This Design
Issue #756 asks for a trainer-side response-length flag that can be used by DAPO and Dr. GRPO. The key design goal here is to separate two concepts that were previously overloaded:

- `max_seq_len`: the total prompt+response sequence length, oriented toward batching
- `max_response_length`: the response-only length used for loss normalization and overlong punishment

Keeping those separate avoids changing the semantics of `max_seq_len`, preserves compatibility with existing batching and Harbor expectations, and gives algorithms a cleaner trainer-side constant to reference.

Validation
Targeted checks run in the isolated worktree:
- `pytest --noconftest tests/train/test_config.py -q` -> 20 passed
- `pytest --noconftest tests/train/utils/test_reward_shaping.py -q` -> 5 passed
- `pytest --noconftest tests/backends/skyrl_train/utils/test_ppo_utils.py -q -k 'ApplyLossReductionToAdvantagesMinibatch or resolve_seq_loss_normalizer'` -> 9 passed, 22 deselected
- `pytest --noconftest tests/train/test_trainer.py::test_normalize_advantages_prefers_max_response_length -q` -> 1 passed
- `python -m compileall` on the changed modules/tests -> passed
- `git diff --check` -> passed

Known Limitation In This Environment
I could not run the full Ray-backed pytest paths in this macOS sandbox because unrelated Ray initialization hits a `psutil` `PermissionError` during process inspection before the relevant test bodies execute. The focused checks above cover the new config, trainer normalization, and DAPO helper behavior introduced in this PR.
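For reference, the normalizer resolution described under "Core training/config behavior" can be sketched as below. The function name `resolve_seq_loss_normalizer` comes from the PR's test list, but its signature and the exact warning/error behavior are assumptions, not the PR's actual code.

```python
from typing import Optional
import warnings


def resolve_seq_loss_normalizer(
    loss_reduction: str,
    max_response_length: Optional[int],
    max_seq_len: Optional[int],
) -> Optional[int]:
    """Sketch: pick the constant used by `seq_mean_token_sum_norm`.

    Prefers the new `max_response_length`; falls back to the legacy
    `max_seq_len` for backward compatibility. Signature assumed.
    """
    if loss_reduction != "seq_mean_token_sum_norm":
        # Other reductions do not need a fixed-length normalizer.
        return None
    if max_response_length is not None:
        return max_response_length
    if max_seq_len is not None:
        # Legacy fallback: keep old configs working, but nudge users
        # toward the dedicated response-length knob.
        warnings.warn(
            "seq_mean_token_sum_norm is using max_seq_len as the normalizer; "
            "set trainer.algorithm.max_response_length explicitly."
        )
        return max_seq_len
    raise ValueError(
        "seq_mean_token_sum_norm requires max_response_length (or legacy max_seq_len)"
    )
```

The fallback ordering is what keeps the change additive: configs that only set `max_seq_len` behave exactly as before, while configs that set `max_response_length` get the Dr. GRPO-style response-length normalizer.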