
[skyrl-train] Add trainer-side max_response_length for Dr. GRPO normalization and DAPO overlong handling#1440

Open
taivu1998 wants to merge 2 commits into NovaSky-AI:main from taivu1998:tdv/issue-756-max-response-length

Conversation


@taivu1998 taivu1998 commented Apr 2, 2026

Summary

Refs #756.

This PR adds a trainer-side max_response_length configuration that can be used for Dr. GRPO-style loss normalization and for DAPO overlong punishment logic, without changing the existing meaning of trainer.algorithm.max_seq_len.

The implementation stays additive and backward-compatible:

  • trainer.algorithm.max_response_length is now the preferred constant for seq_mean_token_sum_norm
  • trainer.algorithm.max_seq_len remains the total sequence length knob and the legacy fallback normalizer
  • no trainer-side truncation is introduced in this PR
  • DAPO overlong logic is shared through a reusable helper instead of duplicating slightly different loops across examples

What Changed

Core training/config behavior

  • added trainer.algorithm.max_response_length to the train config
  • updated Dr. GRPO normalization to prefer max_response_length, with fallback to max_seq_len
  • kept max_seq_len as the prompt+response / batching-oriented setting instead of repurposing it
  • added config validation for invalid max_response_length values
  • added an explicit warning when seq_mean_token_sum_norm is used without max_response_length
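The prefer-then-fall-back behavior described above can be sketched as follows. This is an illustrative standalone version, assuming a `resolve_seq_loss_normalizer` helper with this signature; the actual implementation in the PR may differ.

```python
import warnings
from typing import Optional


def resolve_seq_loss_normalizer(
    max_response_length: Optional[int],
    max_seq_len: Optional[int],
) -> int:
    """Pick the constant used by seq_mean_token_sum_norm.

    Prefers the trainer-side max_response_length; falls back to the
    legacy max_seq_len with a warning for backward compatibility.
    """
    if max_response_length is not None:
        if max_response_length <= 0:
            raise ValueError(
                f"max_response_length must be positive, got {max_response_length}"
            )
        return max_response_length
    if max_seq_len is None:
        raise ValueError(
            "seq_mean_token_sum_norm requires max_response_length or max_seq_len"
        )
    warnings.warn(
        "seq_mean_token_sum_norm is falling back to max_seq_len; "
        "set trainer.algorithm.max_response_length instead"
    )
    return max_seq_len
```

Because the fallback only warns rather than raises, existing configs that set only `max_seq_len` keep working unchanged.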

DAPO support

  • extracted apply_dapo_soft_overlong_punishment(...) into a shared trainer utility
  • updated the first-party DAPO example entrypoints to use the shared helper
  • preserved prompt-aware per-sample response budgets for the TIS DAPO path when max_response_length is unset
  • allowed examples to opt into the new trainer-side constant while still falling back to generator-side limits for backward compatibility
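The shared helper's reward shaping can be sketched as below. The signature and parameter names (`overlong_buffer_len`, `overlong_penalty_factor`) are assumptions based on the PR description and the DAPO paper's piecewise soft-punishment rule, not the actual `apply_dapo_soft_overlong_punishment` implementation.

```python
from typing import List, Sequence


def apply_dapo_soft_overlong_punishment(
    response_lengths: Sequence[int],
    rewards: Sequence[float],
    max_response_length: int,
    overlong_buffer_len: int,
    overlong_penalty_factor: float,
) -> List[float]:
    """DAPO-style soft overlong punishment (sketch).

    Responses shorter than (max_response_length - overlong_buffer_len)
    are untouched; inside the buffer a linearly growing penalty is
    subtracted, capped at overlong_penalty_factor once the response
    fills the full budget.
    """
    if overlong_buffer_len > max_response_length:
        raise ValueError(
            f"overlong_buffer_len ({overlong_buffer_len}) must be <= "
            f"max_response_length ({max_response_length})"
        )
    expected_len = max_response_length - overlong_buffer_len
    shaped: List[float] = []
    for length, reward in zip(response_lengths, rewards):
        exceed = length - expected_len
        if exceed > 0:
            # Linear ramp over the buffer, clipped at the full penalty.
            reward -= min(exceed / overlong_buffer_len, 1.0) * overlong_penalty_factor
        shaped.append(reward)
    return shaped
```

For example, with a 500-token budget and a 100-token buffer, a 450-token response sits halfway into the buffer and loses half the penalty, while a 600-token response takes the full penalty.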

Docs and examples

  • documented max_response_length in the config docs and base PPO config
  • updated the Dr. GRPO example launcher to pass trainer.algorithm.max_response_length
  • updated the DAPO README to describe the new knob and the fallback behavior

Tests

  • added tests for config defaults, explicit config parsing, validation, and legacy fallback warnings
  • added tests for resolve_seq_loss_normalizer
  • added a trainer test proving _normalize_advantages() prefers max_response_length over max_seq_len
  • added focused unit tests for the shared DAPO reward-shaping helper

Why This Design

Issue #756 asks for a trainer-side response-length flag that can be used by DAPO and Dr. GRPO. The key design goal here is to separate two concepts that were previously overloaded:

  • total sequence length / context budget
  • response-length normalization constant

Keeping those separate avoids changing the semantics of max_seq_len, preserves compatibility with existing batching and Harbor expectations, and gives algorithms a cleaner trainer-side constant to reference.
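A toy comparison (not the trainer's code) illustrates why the normalization constant matters: per-sequence token-mean down-weights each token of a long response, whereas dividing the token-loss sum by a fixed budget such as `max_response_length` gives every token the same weight regardless of its sequence's length, which is the length-bias fix Dr. GRPO targets.

```python
def seq_mean_token_mean(token_losses):
    # Per-sequence mean: a token in a long response counts less.
    return sum(sum(seq) / len(seq) for seq in token_losses) / len(token_losses)


def seq_mean_token_sum_norm(token_losses, normalizer):
    # Dr. GRPO-style: divide by a fixed constant so every token
    # carries the same weight regardless of sequence length.
    return sum(sum(seq) / normalizer for seq in token_losses) / len(token_losses)


short, long = [1.0] * 4, [1.0] * 16
# Token-mean: both sequences contribute 1.0 despite a 4x token count.
print(seq_mean_token_mean([short, long]))
# Fixed-constant norm: the longer sequence contributes proportionally more.
print(seq_mean_token_sum_norm([short, long], 16))
```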

Validation

Targeted checks run in the isolated worktree:

  • pytest --noconftest tests/train/test_config.py -q -> 20 passed
  • pytest --noconftest tests/train/utils/test_reward_shaping.py -q -> 5 passed
  • pytest --noconftest tests/backends/skyrl_train/utils/test_ppo_utils.py -q -k 'ApplyLossReductionToAdvantagesMinibatch or resolve_seq_loss_normalizer' -> 9 passed, 22 deselected
  • pytest --noconftest tests/train/test_trainer.py::test_normalize_advantages_prefers_max_response_length -q -> 1 passed
  • python -m compileall on the changed modules/tests -> passed
  • git diff --check -> passed

Known Limitation In This Environment

I could not run the full Ray-backed pytest paths in this macOS sandbox because unrelated Ray initialization hits a psutil PermissionError during process inspection before the relevant test bodies execute. The focused checks above cover the new config, trainer normalization, and DAPO helper behavior introduced in this PR.



@taivu1998 taivu1998 closed this Apr 3, 2026
@taivu1998 taivu1998 reopened this Apr 4, 2026
@taivu1998 taivu1998 marked this pull request as ready for review April 5, 2026 00:07
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces max_response_length as a dedicated configuration parameter for loss normalization and algorithmic corrections like DAPO overlong handling. It centralizes the DAPO punishment logic into a new utility function, apply_dapo_soft_overlong_punishment, and refactors several examples to use it. Additionally, the trainer now resolves the appropriate normalizer for seq_mean_token_sum_norm loss reduction, maintaining backward compatibility with max_seq_len. A review comment highlighted a missing validation in the new utility function where an overlong_buffer_len exceeding the response length could result in incorrect penalty applications.


Referenced code:

    adjusted_rewards: List[Reward] = []
    for response, reward, sample_max_response_length in zip(response_ids, rewards, max_response_lengths):
        response_length = len(response)

Severity: high

The validation logic in validate_cfg checks if overlong_buffer_len > max_response_length, but this check is missing here. If overlong_buffer_len is greater than sample_max_response_length, max_exceed_length will be negative. This causes the penalty logic to be incorrectly applied to all sequences that are not overlong, which is likely not the intended behavior. Adding this validation would make this utility more robust, especially for direct usage outside the main training entrypoint.

        if overlong_buffer_len > sample_max_response_length:
            raise ValueError(
                f"`overlong_buffer_len` ({overlong_buffer_len}) must be <= `sample_max_response_length` ({sample_max_response_length})"
            )
        response_length = len(response)

Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

