
Change default KL estimator from k3 to k2 for loss-based KL#1445

Open
taivu1998 wants to merge 2 commits into NovaSky-AI:main from taivu1998:tdv/issue-805-kl-defaults

Conversation


@taivu1998 taivu1998 commented Apr 2, 2026

Summary

Addresses #805 by updating SkyRL's default KL estimator from k3 to k2 while keeping the existing default KL placement unchanged.

This PR intentionally keeps:

  • trainer.algorithm.use_kl_loss = true
  • trainer.algorithm.use_kl_in_reward = false

That makes the change narrow and low-risk: it modernizes the default estimator without also flipping the repo-wide default from loss-based KL to reward-based KL.

What Changed

  • Changed the default KL estimator to k2 in the typed config dataclass.
  • Updated ppo_base_config.yaml to match the Python config default.
  • Updated the configuration docs to say k2 is the default for loss-based KL and clarified that reward-based KL is still supported.
  • Added a focused regression test for the default KL configuration.
  • Updated the fallback default on compute_approx_kl(...) to k2 for consistency with the new repo default.
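For context, the k1/k2/k3 estimators referenced above follow John Schulman's approximate-KL blog post linked in the docs. The following is a minimal illustrative sketch, not SkyRL's actual `compute_approx_kl(...)` implementation; the function name, scalar signature, and the `abs` variant's definition are assumptions here:

```python
import math

def approx_kl(log_prob: float, ref_log_prob: float, estimator: str = "k2") -> float:
    """Per-token estimators of KL(pi || pi_ref), per Schulman's blog post.

    Illustrative sketch only; SkyRL's compute_approx_kl may differ in
    signature and naming. Samples are assumed drawn from pi, so
    log_ratio = log pi(x) - log pi_ref(x).
    """
    log_ratio = log_prob - ref_log_prob
    if estimator == "k1":
        return log_ratio                           # unbiased, high variance
    if estimator == "k2":
        return 0.5 * log_ratio ** 2                # biased, low variance, always >= 0
    if estimator == "k3":
        return math.exp(-log_ratio) - 1.0 + log_ratio  # unbiased-ish, always >= 0
    if estimator == "abs":
        return abs(log_ratio)                      # assumed to be |log r|; check the repo
    raise ValueError(f"unknown estimator: {estimator}")
```

Note that k2 and k3 are nonnegative for every sample, while k1 can go negative per token, which is one reason low-variance estimators are preferred as KL penalties in the loss.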

Why This Shape

Issue #805 discusses two possible directions:

  • k1 in reward
  • k2 in loss

This PR takes the k2-in-loss path because it aligns with the issue while preserving SkyRL's current default training behavior. It avoids turning this fix into a larger redesign of KL placement across the codebase.
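The distinction between the two placements can be sketched as follows. This is a hedged illustration of the concept only; the function and variable names (`policy_objective`, `kl_loss_coef`, `kl_reward_coef`, and the default coefficients) are hypothetical, not SkyRL's internals:

```python
def policy_objective(pg_loss: float, kl: float, reward: float,
                     use_kl_loss: bool = True, use_kl_in_reward: bool = False,
                     kl_loss_coef: float = 0.001, kl_reward_coef: float = 0.001):
    """Illustrative sketch of the two KL placements discussed in #805.

    Loss-based KL (the default this PR preserves) adds the penalty to the
    policy loss; reward-based KL instead subtracts it from the reward
    before advantage estimation.
    """
    shaped_reward = reward - (kl_reward_coef * kl if use_kl_in_reward else 0.0)
    loss = pg_loss + (kl_loss_coef * kl if use_kl_loss else 0.0)
    return loss, shaped_reward
```

Under the retained defaults (`use_kl_loss=True`, `use_kl_in_reward=False`), only the loss term carries the KL penalty and the reward passes through unchanged.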

Files Changed

  • skyrl/train/config/config.py
  • skyrl/train/config/ppo_base_config.yaml
  • skyrl/backends/skyrl_train/utils/ppo_utils.py
  • docs/content/docs/configuration/config.mdx
  • tests/train/test_config.py

Validation

Directly relevant checks passed locally:

RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 /Users/vuductai/Documents/Projects/SkyRL/.venv/bin/python -m pytest --noconftest tests/train/test_config.py::test_default_kl_regularization_config tests/backends/skyrl_train/utils/test_ppo_utils.py::test_compute_approx_kl -q

A broader non-Ray-gated pass also succeeded:

RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 /Users/vuductai/Documents/Projects/SkyRL/.venv/bin/python -m pytest --noconftest tests/train/test_config.py tests/backends/skyrl_train/utils/test_ppo_utils.py -q -k 'not test_registry_cross_ray_process and not test_registry_named_actor_creation and not test_registry_reset_after_ray_shutdown'

Result:

  • 35 passed, 3 deselected

The three deselected tests are Ray registry tests that still require ray.init() process inspection, which is sandbox-restricted in this local macOS environment. CI should cover those normally.

Notes

  • This PR does not change kl_loss_coef.
  • This PR does not change the default KL placement.
  • Explicit user overrides of KL settings continue to behave the same way.
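For example, a user who wants the previous estimator back could pin it explicitly. The key paths below mirror the settings named in this PR, but the exact YAML nesting in `ppo_base_config.yaml` may differ, so treat this as a hypothetical fragment:

```yaml
# Hypothetical override fragment; key paths assumed from the settings above.
trainer:
  algorithm:
    use_kl_loss: true        # retained default: KL applied in the loss
    use_kl_in_reward: false  # retained default: no KL in the reward
    kl_estimator_type: k3    # explicit override restores the previous estimator
```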


@taivu1998 taivu1998 closed this Apr 3, 2026
@taivu1998 taivu1998 reopened this Apr 4, 2026
@taivu1998 taivu1998 marked this pull request as ready for review April 5, 2026 00:07

@gemini-code-assist bot left a comment


Code Review

This pull request updates the default kl_estimator_type from k3 to k2 across the codebase, including configuration files, utility functions, and documentation, while adding a unit test to verify the new defaults. Feedback suggests refining the documentation to clarify that k2 is the global default and to remove redundant descriptions of the use_kl_in_reward parameter.

- `horizon`: Controls the update rate of the adaptive KL controller.

Before: `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. We use `k3` as the default.
After: `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. We use `k2` as the default for loss-based KL. Reward-based KL remains supported via `algorithm.use_kl_in_reward`.


Severity: medium

The phrasing "We use k2 as the default for loss-based KL" is slightly misleading because k2 is now the global default for the kl_estimator_type field, regardless of whether loss-based or reward-based KL is used. Additionally, the clarification about use_kl_in_reward is redundant as it is explicitly described in the very next bullet point. A more concise description would improve clarity.

- `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. The default is `k2` (optimized for loss-based KL).


@devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.


