[skyrl-train] Add trainer-side max_response_length for Dr. GRPO normalization and DAPO overlong handling #1440
Conversation
Code Review
This pull request introduces `max_response_length` as a dedicated configuration parameter for loss normalization and algorithmic corrections such as DAPO overlong handling. It centralizes the DAPO punishment logic into a new utility function, `apply_dapo_soft_overlong_punishment`, and refactors several examples to use it. Additionally, the trainer now resolves the appropriate normalizer for `seq_mean_token_sum_norm` loss reduction, maintaining backward compatibility with `max_seq_len`. A review comment highlighted a missing validation in the new utility function, where an `overlong_buffer_len` exceeding the response length could result in incorrect penalty application.
```python
adjusted_rewards: List[Reward] = []
for response, reward, sample_max_response_length in zip(response_ids, rewards, max_response_lengths):
    response_length = len(response)
```
The validation logic in `validate_cfg` checks whether `overlong_buffer_len > max_response_length`, but that check is missing here. If `overlong_buffer_len` is greater than `sample_max_response_length`, `max_exceed_length` becomes negative, so the penalty is incorrectly applied to every sequence, including those that are not overlong. Adding this validation would make the utility more robust, especially for direct usage outside the main training entrypoint.
```python
if overlong_buffer_len > sample_max_response_length:
    raise ValueError(
        f"`overlong_buffer_len` ({overlong_buffer_len}) must be <= `sample_max_response_length` ({sample_max_response_length})"
    )
response_length = len(response)
```
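Putting the suggested guard together with the loop shown in the diff, the helper could look like the sketch below. This is not the PR's actual implementation: the `Reward` type from the diff is simplified to plain `float`, the exact signature is assumed, and the ramp formula follows DAPO-style soft overlong punishment (penalty grows linearly once a response enters the overlong buffer).

```python
from typing import List


def apply_dapo_soft_overlong_punishment(
    response_ids: List[List[int]],
    rewards: List[float],
    max_response_lengths: List[int],
    overlong_buffer_len: int,
    overlong_buffer_penalty_factor: float,
) -> List[float]:
    """Sketch of DAPO soft overlong punishment with the reviewer's guard.

    Assumptions: rewards are plain floats (the real utility may use a
    Reward object) and the signature mirrors the loop in the diff.
    """
    adjusted_rewards: List[float] = []
    for response, reward, sample_max_response_length in zip(
        response_ids, rewards, max_response_lengths
    ):
        # Guard suggested in the review: a buffer longer than the response
        # budget would make max_exceed_length negative and penalize everything.
        if overlong_buffer_len > sample_max_response_length:
            raise ValueError(
                f"`overlong_buffer_len` ({overlong_buffer_len}) must be <= "
                f"`sample_max_response_length` ({sample_max_response_length})"
            )
        response_length = len(response)
        # Length at which the penalty starts ramping up.
        max_exceed_length = sample_max_response_length - overlong_buffer_len
        if response_length > max_exceed_length:
            exceed = response_length - max_exceed_length
            # Linear ramp from 0 to the full penalty factor across the buffer.
            penalty = min(exceed / overlong_buffer_len, 1.0) * overlong_buffer_penalty_factor
            reward = reward - penalty
        adjusted_rewards.append(reward)
    return adjusted_rewards
```

With the guard in place, a response at the full budget receives the maximum penalty, while anything at or below `sample_max_response_length - overlong_buffer_len` is untouched.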
Summary
Refs #756.
This PR adds a trainer-side `max_response_length` configuration that can be used for Dr. GRPO-style loss normalization and for DAPO overlong punishment logic, without changing the existing meaning of `trainer.algorithm.max_seq_len`. The implementation stays additive and backward-compatible:
- `trainer.algorithm.max_response_length` is now the preferred constant for `seq_mean_token_sum_norm`
- `trainer.algorithm.max_seq_len` remains the total sequence length knob and the legacy fallback normalizer

What Changed
Core training/config behavior
- Added `trainer.algorithm.max_response_length` to the train config
- `seq_mean_token_sum_norm` now resolves `max_response_length`, with fallback to `max_seq_len`
- Kept `max_seq_len` as the prompt+response / batching-oriented setting instead of repurposing it
- Added validation for `max_response_length` values
- Handled the case where `seq_mean_token_sum_norm` is used without `max_response_length`

DAPO support
- Extracted `apply_dapo_soft_overlong_punishment(...)` into a shared trainer utility
- Falls back to `max_seq_len` when `max_response_length` is unset

Docs and examples
- Documented `max_response_length` in the config docs and base PPO config
- Updated examples to use `trainer.algorithm.max_response_length`

Tests
- Added tests for `resolve_seq_loss_normalizer`
- Added a test that `_normalize_advantages()` prefers `max_response_length` over `max_seq_len`

Why This Design
Issue #756 asks for a trainer-side response-length flag that can be used by DAPO and Dr. GRPO. The key design goal here is to separate two concepts that were previously overloaded:

- `max_seq_len`: the total prompt+response sequence length, oriented toward batching
- `max_response_length`: the response-only length used for loss normalization and overlong punishment

Keeping those separate avoids changing the semantics of `max_seq_len`, preserves compatibility with existing batching and Harbor expectations, and gives algorithms a cleaner trainer-side constant to reference.

Validation
Targeted checks run in the isolated worktree:
- `pytest --noconftest tests/train/test_config.py -q` -> 20 passed
- `pytest --noconftest tests/train/utils/test_reward_shaping.py -q` -> 5 passed
- `pytest --noconftest tests/backends/skyrl_train/utils/test_ppo_utils.py -q -k 'ApplyLossReductionToAdvantagesMinibatch or resolve_seq_loss_normalizer'` -> 9 passed, 22 deselected
- `pytest --noconftest tests/train/test_trainer.py::test_normalize_advantages_prefers_max_response_length -q` -> 1 passed
- `python -m compileall` on the changed modules/tests -> passed
- `git diff --check` -> passed

Known Limitation In This Environment
I could not run the full Ray-backed pytest paths in this macOS sandbox because unrelated Ray initialization hits a `psutil` `PermissionError` during process inspection before the relevant test bodies execute. The focused checks above cover the new config, trainer normalization, and DAPO helper behavior introduced in this PR.
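For reference, the normalizer resolution described under "Core training/config behavior" can be sketched as below. The function name `resolve_seq_loss_normalizer` comes from the PR's test list, but its signature and the exact warning/error behavior are assumptions, not the PR's actual code.

```python
from typing import Optional
import warnings


def resolve_seq_loss_normalizer(
    loss_reduction: str,
    max_response_length: Optional[int],
    max_seq_len: Optional[int],
) -> Optional[int]:
    """Sketch: pick the constant used by `seq_mean_token_sum_norm`.

    Prefers the new `max_response_length`; falls back to the legacy
    `max_seq_len` for backward compatibility. Signature assumed.
    """
    if loss_reduction != "seq_mean_token_sum_norm":
        # Other reductions do not need a fixed-length normalizer.
        return None
    if max_response_length is not None:
        return max_response_length
    if max_seq_len is not None:
        # Legacy fallback: keep old configs working, but nudge users
        # toward the dedicated response-length knob.
        warnings.warn(
            "seq_mean_token_sum_norm is using max_seq_len as the normalizer; "
            "set trainer.algorithm.max_response_length explicitly."
        )
        return max_seq_len
    raise ValueError(
        "seq_mean_token_sum_norm requires max_response_length (or legacy max_seq_len)"
    )
```

The fallback ordering is what keeps the change additive: configs that only set `max_seq_len` behave exactly as before, while configs that set `max_response_length` get the Dr. GRPO-style response-length normalizer.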