Change default KL estimator from k3 to k2 for loss-based KL (#1445)
taivu1998 wants to merge 2 commits into NovaSky-AI:main
Code Review
This pull request updates the default kl_estimator_type from k3 to k2 across the codebase, including configuration files, utility functions, and documentation, while adding a unit test to verify the new defaults. Feedback suggests refining the documentation to clarify that k2 is the global default and to remove redundant descriptions of the use_kl_in_reward parameter.
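The unit test itself is not shown in this view; a minimal sketch of what such a default check could look like is below. The `AlgorithmConfig` dataclass here is a hypothetical stand-in, not SkyRL's actual typed config class; only the field names (`kl_estimator_type`, `use_kl_loss`, `use_kl_in_reward`) and their default values come from the PR description.

```python
from dataclasses import dataclass

# Hypothetical stand-in for SkyRL's typed algorithm config dataclass.
# Field names and defaults follow the PR description, not the repo source.
@dataclass
class AlgorithmConfig:
    kl_estimator_type: str = "k2"   # new default (was "k3")
    use_kl_loss: bool = True        # unchanged: KL applied as a loss term
    use_kl_in_reward: bool = False  # unchanged: KL not folded into rewards


def test_default_kl_estimator() -> None:
    cfg = AlgorithmConfig()
    assert cfg.kl_estimator_type == "k2"
    assert cfg.use_kl_loss and not cfg.use_kl_in_reward
```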
```diff
  - `horizon`: Controls the update rate of the adaptive KL controller.

- - `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. We use `k3` as the default.
+ - `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. We use `k2` as the default for loss-based KL. Reward-based KL remains supported via `algorithm.use_kl_in_reward`.
```
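For context on the `horizon` line in the diff above: a common formulation of an adaptive KL controller (as in the original PPO work and libraries such as TRL) scales a KL coefficient toward a target KL, with `horizon` damping the update rate. This is an illustrative sketch, not SkyRL's implementation; the class name, clipping constant, and update rule are assumptions.

```python
# Illustrative adaptive KL controller; SkyRL's actual controller may differ.
class AdaptiveKLController:
    def __init__(self, init_kl_coef: float, target: float, horizon: float):
        self.kl_coef = init_kl_coef
        self.target = target      # desired KL per update
        self.horizon = horizon    # larger horizon -> slower coefficient updates

    def update(self, current_kl: float, n_steps: int) -> float:
        # Proportional error, clipped so one bad batch cannot swing the coef.
        error = min(max(current_kl / self.target - 1.0, -0.2), 0.2)
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef
```

With `target=0.01` and `horizon=10000`, observing a KL of 0.05 over 100 steps nudges the coefficient up by at most 2%, which is the slow-update behavior the `horizon` knob controls.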
The phrasing "We use k2 as the default for loss-based KL" is slightly misleading because k2 is now the global default for the kl_estimator_type field, regardless of whether loss-based or reward-based KL is used. Additionally, the clarification about use_kl_in_reward is redundant as it is explicitly described in the very next bullet point. A more concise description would improve clarity.
```suggestion
- `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. The default is `k2` (optimized for loss-based KL).
```
## Summary
Addresses #805 by updating SkyRL's default KL estimator from `k3` to `k2` while keeping the existing default KL placement unchanged.

This PR intentionally keeps:

- `trainer.algorithm.use_kl_loss = true`
- `trainer.algorithm.use_kl_in_reward = false`

That makes the change narrow and low-risk: it modernizes the default estimator without also flipping the repo-wide default from loss-based KL to reward-based KL.
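Concretely, the kept and changed defaults might look like this in `ppo_base_config.yaml`. The key nesting below is assumed from the dotted paths in this PR, not copied from the repo:

```yaml
# Hypothetical fragment of ppo_base_config.yaml; nesting is an assumption.
trainer:
  algorithm:
    use_kl_loss: true        # kept: KL applied as a loss term
    use_kl_in_reward: false  # kept: KL not added to the reward
    kl_estimator_type: k2    # changed: was k3
```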
## What Changed

- Changed the default `kl_estimator_type` to `k2` in the typed config dataclass.
- Updated `ppo_base_config.yaml` to match the Python config default.
- Updated the docs to state that `k2` is the default for loss-based KL and clarified that reward-based KL is still supported.
- Changed the default in `compute_approx_kl(...)` to `k2` for consistency with the new repo default.
- Added a unit test verifying the new defaults.

## Why This Shape
Issue #805 discusses two possible directions:

- `k1` in reward
- `k2` in loss

This PR takes the `k2`-in-loss path because it aligns with the issue while preserving SkyRL's current default training behavior. It avoids turning this fix into a larger redesign of KL placement across the codebase.

## Files Changed
- `skyrl/train/config/config.py`
- `skyrl/train/config/ppo_base_config.yaml`
- `skyrl/backends/skyrl_train/utils/ppo_utils.py`
- `docs/content/docs/configuration/config.mdx`
- `tests/train/test_config.py`

## Validation
Directly relevant checks passed locally, and a broader non-Ray-gated pass also succeeded:

```bash
RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 /Users/vuductai/Documents/Projects/SkyRL/.venv/bin/python -m pytest --noconftest tests/train/test_config.py tests/backends/skyrl_train/utils/test_ppo_utils.py -q -k 'not test_registry_cross_ray_process and not test_registry_named_actor_creation and not test_registry_reset_after_ray_shutdown'
```

Result: `35 passed, 3 deselected`

The three deselected tests are Ray registry tests that still require `ray.init()` process inspection, which is sandbox-restricted in this local macOS environment. CI should cover those normally.

## Notes
- `kl_loss_coef` is unchanged.