Change default KL estimator from k3 to k2 for loss-based KL (#1445)
taivu1998 wants to merge 2 commits into NovaSky-AI:main
Code Review
This pull request updates the default kl_estimator_type from k3 to k2 across the codebase, including configuration files, utility functions, and documentation, while adding a unit test to verify the new defaults. Feedback suggests refining the documentation to clarify that k2 is the global default and to remove redundant descriptions of the use_kl_in_reward parameter.
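The unit test itself is not shown in this view; a minimal sketch of what such a default check could look like is below. The `AlgorithmConfig` dataclass here is a hypothetical stand-in, not SkyRL's actual typed config class; only the field names (`kl_estimator_type`, `use_kl_loss`, `use_kl_in_reward`) and their default values come from the PR description.

```python
from dataclasses import dataclass

# Hypothetical stand-in for SkyRL's typed algorithm config dataclass.
# Field names and defaults follow the PR description, not the repo source.
@dataclass
class AlgorithmConfig:
    kl_estimator_type: str = "k2"   # new default (was "k3")
    use_kl_loss: bool = True        # unchanged: KL applied as a loss term
    use_kl_in_reward: bool = False  # unchanged: KL not folded into rewards


def test_default_kl_estimator() -> None:
    cfg = AlgorithmConfig()
    assert cfg.kl_estimator_type == "k2"
    assert cfg.use_kl_loss and not cfg.use_kl_in_reward
```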
```diff
  - `horizon`: Controls the update rate of the adaptive KL controller.

- - `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. We use `k3` as the default.
+ - `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. We use `k2` as the default for loss-based KL. Reward-based KL remains supported via `algorithm.use_kl_in_reward`.
```
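For context on the `horizon` line in the diff above: a common formulation of an adaptive KL controller (as in the original PPO work and libraries such as TRL) scales a KL coefficient toward a target KL, with `horizon` damping the update rate. This is an illustrative sketch, not SkyRL's implementation; the class name, clipping constant, and update rule are assumptions.

```python
# Illustrative adaptive KL controller; SkyRL's actual controller may differ.
class AdaptiveKLController:
    def __init__(self, init_kl_coef: float, target: float, horizon: float):
        self.kl_coef = init_kl_coef
        self.target = target      # desired KL per update
        self.horizon = horizon    # larger horizon -> slower coefficient updates

    def update(self, current_kl: float, n_steps: int) -> float:
        # Proportional error, clipped so one bad batch cannot swing the coef.
        error = min(max(current_kl / self.target - 1.0, -0.2), 0.2)
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef
```

With `target=0.01` and `horizon=10000`, observing a KL of 0.05 over 100 steps nudges the coefficient up by at most 2%, which is the slow-update behavior the `horizon` knob controls.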
The phrasing "We use k2 as the default for loss-based KL" is slightly misleading because k2 is now the global default for the kl_estimator_type field, regardless of whether loss-based or reward-based KL is used. Additionally, the clarification about use_kl_in_reward is redundant as it is explicitly described in the very next bullet point. A more concise description would improve clarity.
```suggestion
- `algorithm.kl_estimator_type`: KL estimator type to use. Options include: `k1`, `k2`, `k3`, `abs`. See [this blog post](http://joschu.net/blog/kl-approx.html) for details. The default is `k2` (optimized for loss-based KL).
```
## Summary
Addresses #805 by updating SkyRL's default KL estimator from `k3` to `k2` while keeping the existing default KL placement unchanged.

This PR intentionally keeps:

- `trainer.algorithm.use_kl_loss = true`
- `trainer.algorithm.use_kl_in_reward = false`

That makes the change narrow and low-risk: it modernizes the default estimator without also flipping the repo-wide default from loss-based KL to reward-based KL.
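Concretely, the kept and changed defaults might look like this in `ppo_base_config.yaml`. The key nesting below is assumed from the dotted paths in this PR, not copied from the repo:

```yaml
# Hypothetical fragment of ppo_base_config.yaml; nesting is an assumption.
trainer:
  algorithm:
    use_kl_loss: true        # kept: KL applied as a loss term
    use_kl_in_reward: false  # kept: KL not added to the reward
    kl_estimator_type: k2    # changed: was k3
```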
## What Changed

- Changed the default `kl_estimator_type` to `k2` in the typed config dataclass.
- Updated `ppo_base_config.yaml` to match the Python config default.
- Updated the docs to state that `k2` is the default for loss-based KL and clarified that reward-based KL is still supported.
- Changed the default in `compute_approx_kl(...)` to `k2` for consistency with the new repo default.
- Added a unit test verifying the new defaults.

## Why This Shape
Issue #805 discusses two possible directions:

- `k1` in reward
- `k2` in loss

This PR takes the `k2`-in-loss path because it aligns with the issue while preserving SkyRL's current default training behavior. It avoids turning this fix into a larger redesign of KL placement across the codebase.

## Files Changed
- `skyrl/train/config/config.py`
- `skyrl/train/config/ppo_base_config.yaml`
- `skyrl/backends/skyrl_train/utils/ppo_utils.py`
- `docs/content/docs/configuration/config.mdx`
- `tests/train/test_config.py`

## Validation
Directly relevant checks passed locally, and a broader non-Ray-gated pass also succeeded:

```bash
RAY_ENABLE_UV_RUN_RUNTIME_ENV=0 /Users/vuductai/Documents/Projects/SkyRL/.venv/bin/python -m pytest --noconftest tests/train/test_config.py tests/backends/skyrl_train/utils/test_ppo_utils.py -q -k 'not test_registry_cross_ray_process and not test_registry_named_actor_creation and not test_registry_reset_after_ray_shutdown'
```

Result: `35 passed, 3 deselected`

The three deselected tests are Ray registry tests that still require `ray.init()` process inspection, which is sandbox-restricted in this local macOS environment. CI should cover those normally.

## Notes
- `kl_loss_coef` is unchanged.