feat(tj): update unizero ppo implementation by tAnGjIa520 · Pull Request #473 · opendilab/LightZero

tAnGjIa520 · 2026-02-13T07:36:49Z

Main Changes

- Use the ding library's GAE implementation: replace the custom GAE computation in muzero_collector with the ding.rl_utils.gae and ding.rl_utils.gae_data functions
- Fix PPO value loss computation: remove the incorrect gamma^t discounting from the PPO value loss (the world model loss still correctly uses gamma^t for temporal discounting)
- Fix PPO returns computation order: compute returns from the raw values and advantages before advantage normalization, so the value targets stay correct
- Optimize buffer and collector logic:
  - Fix data duplication issues in Segment concatenation
  - Add proper masking for variable-length sequences
  - Improve sample logic to prevent out-of-range errors
- Enhance world model training:
  - Add proper device handling and logging
  - Improve PPO loss computation with correct advantage and value target handling
  - Optimize memory management with cache clearing
- Refine training configurations:
  - Add online training config (lunarlander_disc_unizero_ppo_online_config.py)
  - Update hyperparameters (learning rate, batch size, PPO epochs)
  - Add comprehensive configuration documentation
- Update .gitignore: add Claude Code related directories (.claude/, .omc/, codex_home/), scripts/, and docs/
- Remove docs/ from git tracking: keep local documentation but exclude it from version control
- Add comprehensive documentation:
  - Enhance CLAUDE.md with framework architecture guide
  - Add GAE computation implementation plan
  - Add online training flow documentation
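
For readers unfamiliar with GAE, the recursion that the collector now delegates to ding.rl_utils.gae can be sketched in plain Python. This is an illustrative reimplementation, not the ding code: the function name `compute_gae`, its argument layout, and the default `gamma` and `lam` values are assumptions made for this example.

```python
def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory (hypothetical helper).

    rewards[t], values[t], next_values[t] and dones[t] describe step t;
    next_values[t] is the value estimate of the state reached after step t.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        # One-step TD error; bootstrapping is cut at episode ends.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    adv = compute_gae(
        rewards=[1.0, 1.0, 1.0],
        values=[0.0, 0.0, 0.0],
        next_values=[0.0, 0.0, 0.0],
        dones=[False, False, True],
        gamma=1.0,
        lam=1.0,
    )
    print(adv)
```

With gamma = lam = 1 and zero value estimates, the advantages reduce to the undiscounted reward-to-go, which makes a convenient sanity check when comparing against a library implementation.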

- Replace manual GAE computation with ding.rl_utils.gae_data and gae
- Keep original implementation as _batch_compute_gae_for_pool_bak for backup
- Add test script to verify GAE computation correctness
- Fix lunarlander_env.py to handle both int and numpy array actions
- Add lunarlander_disc_unizero_ppo_config.py for PPO training

…d game demos
- Refine train_unizero, game buffers, world_model, unizero policy and collector
- Add policy utils (GAE-related); remove test_gae_computation.py
- Add LunarLander online PPO config; update Atari/LunarLander PPO configs
- Add GAME_README, GAE plan, lunarlander flow doc, game scripts and requirements

Made-with: Cursor

…O value loss
- Restructure CLAUDE.md with LightZero framework overview
- Add quick start commands, testing instructions, and architecture guide
- Document PPO modifications with detailed data flow
- Include development patterns and debugging tips
- Fix PPO value loss: remove gamma^t discount from value loss computation
- Keep gamma^t discount only for world model losses (obs/rewards)
- Align with standard PPO value loss formulation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
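
The distinction between the two losses can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, both losses are averaged over the unroll steps, and the value loss is a plain squared error without the clipping or coefficients that the real implementation may apply.

```python
def world_model_loss(per_step_losses, gamma):
    """Average per-step losses (e.g. obs/reward) with step t weighted by gamma**t."""
    weights = [gamma ** t for t in range(len(per_step_losses))]
    weighted = sum(w * l for w, l in zip(weights, per_step_losses))
    return weighted / len(per_step_losses)


def ppo_value_loss(values, returns):
    """Mean squared error between predicted values and return targets.

    No gamma**t weighting: every step contributes equally.
    """
    squared_errors = [(v - r) ** 2 for v, r in zip(values, returns)]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    print(world_model_loss([1.0, 1.0], gamma=0.5))
    print(ppo_value_loss([1.0, 3.0], [0.0, 1.0]))
```

The returns used as value targets already account for discounting through GAE, so weighting the value loss by gamma^t again would down-weight later steps a second time.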

- Add Claude Code directories to .gitignore (.codex, codex_home/, .omc/, .claude/, scripts/, docs/)
- Update CLAUDE.md with environment setup and PPO implementation details
- Refine split training mode in train_unizero.py for better World Model/PPO separation
- Enhance buffer tracking with latest_push_count and new_data_ratio in game_buffer.py
- Improve PPO batch extraction and padding logic in game_buffer_unizero.py
- Add loss_type parameter support in world_model.py for flexible training modes
- Update PPO configurations with split training options and improved hyperparameters

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

xiongjyu and others added 10 commits August 27, 2025 13:11

adaptively set the config of batchsize and accumulation_steps

825225e

Merge branch 'opendilab:main' into main

1bce74a

Unizero changes MCTs to PPO for strategy optimization in the Jericho …

fd81ca3

…environment

your commit message

20b909e

Update README.md

8ac34c0

Rename project from LightZero1 to LightZero

944fc6b

Add test file

1fa0f0f

Update UniZero PPO implementation: merge PPO functionality into main …

2500766

…files and update configurations

Update: PPO loss computation improvements and code cleanup

226fb46

puyuan1996 added the research (Research work in progress) and enhancement (New feature or request) labels Mar 3, 2026

tAnGjIa520 and others added 4 commits April 17, 2026 14:07

fix(tj): compute PPO returns before advantage normalization

b744cda

Ensure returns are derived from raw value+advantage before adv normalization to keep target values correct. Made-with: Cursor
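
A minimal sketch of why the order matters: the value targets are built from the raw advantages first, and only afterwards are the advantages normalized for the policy loss. The helper name, the use of the population standard deviation, and the `eps` constant are assumptions for this example, not details taken from the PR.

```python
from statistics import fmean, pstdev


def returns_and_normalized_advantages(values, advantages, eps=1e-8):
    """Compute value targets from raw advantages, then normalize the advantages."""
    # Step 1: returns use the raw (unnormalized) advantages.
    returns = [v + a for v, a in zip(values, advantages)]
    # Step 2: normalize advantages for the policy loss only.
    mean = fmean(advantages)
    std = pstdev(advantages)
    normalized = [(a - mean) / (std + eps) for a in advantages]
    return returns, normalized


if __name__ == "__main__":
    rets, norm_adv = returns_and_normalized_advantages(
        values=[1.0, 2.0, 3.0], advantages=[3.0, 2.0, 1.0]
    )
    print(rets)
    print(norm_adv)
```

If the two steps were swapped, the returns would be computed as value plus a zero-mean, unit-scale quantity, which no longer corresponds to a discounted return and would give the value head the wrong targets.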

tAnGjIa520 force-pushed the unizero-ppo-updates branch from c10fa1a to 48b72b2 April 30, 2026 04:22

chore: remove docs/ directory from git tracking

d6c9c92

Remove docs/ directory from version control as specified in .gitignore. The directory will remain locally but won't be tracked by git. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(tj): update unizero ppo implementation #473
tAnGjIa520 wants to merge 15 commits into opendilab:main from tAnGjIa520:unizero-ppo-updates

3 participants
