feat(tj): update unizero ppo implementation #473

Open: tAnGjIa520 wants to merge 15 commits into opendilab:main from tAnGjIa520:unizero-ppo-updates

tAnGjIa520 commented Feb 13, 2026

Main Changes

  • Use ding library's GAE implementation: Replace custom GAE computation with ding.rl_utils.gae and ding.rl_utils.gae_data
    functions in muzero_collector
  • Fix PPO value loss computation: Remove incorrect gamma^t discounting in PPO value loss calculation (world model loss
    still correctly uses gamma^t for temporal discounting)
  • Fix PPO returns computation order: Compute returns from the raw value + advantage before advantage normalization, so the value targets stay correct
  • Optimize buffer and collector logic:
    • Fix data duplication issues in Segment concatenation
    • Add proper masking for variable-length sequences
    • Improve sample logic to prevent out-of-range errors
  • Enhance world model training:
    • Add proper device handling and logging
    • Improve PPO loss computation with correct advantage and value target handling
    • Optimize memory management with cache clearing
  • Refine training configurations:
    • Add online training config (lunarlander_disc_unizero_ppo_online_config.py)
    • Update hyperparameters (learning rate, batch size, PPO epochs)
    • Add comprehensive configuration documentation
  • Update .gitignore: Add Claude Code related directories (.claude/, .omc/, codex_home/), scripts/, and docs/
  • Remove docs/ from git tracking: Keep local documentation but exclude from version control
  • Add comprehensive documentation:
    • Enhance CLAUDE.md with framework architecture guide
    • Add GAE computation implementation plan
    • Add online training flow documentation
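
For readers unfamiliar with the GAE step mentioned in the first bullet, the recursion can be sketched as below. This is a minimal pure-Python illustration, not the `ding.rl_utils.gae` implementation and not the collector code from this PR; the function name `compute_gae`, its argument layout, and the default `gamma` and `lam` values are assumptions chosen for the example.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    next_values: Sequence[float],
    dones: Sequence[bool],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    Hypothetical helper for illustration only; argument names are assumptions.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut off at episode ends.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages
```

With `gamma = lam = 1` the result reduces to the undiscounted return minus the value estimate, which is a convenient sanity check when comparing against a library implementation.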

xiongjyu and others added 10 commits August 27, 2025 13:11
- Replace manual GAE computation with ding.rl_utils.gae_data and gae
- Keep original implementation as _batch_compute_gae_for_pool_bak for backup
- Add test script to verify GAE computation correctness
- Fix lunarlander_env.py to handle both int and numpy array actions
- Add lunarlander_disc_unizero_ppo_config.py for PPO training
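
The fix to accept both int and numpy array actions could look roughly like the sketch below. The helper name `to_discrete_action` is hypothetical and is not taken from `lunarlander_env.py`; it only assumes that array-like actions expose an `.item()` method, as NumPy scalars and single-element arrays do.

```python
from typing import Any


def to_discrete_action(action: Any) -> int:
    """Coerce an int or a single-element array-like into a plain Python int.

    Hypothetical helper for illustration; not the code from this PR.
    """
    # NumPy scalars and size-1 arrays expose .item(), which returns a Python scalar.
    if hasattr(action, "item"):
        action = action.item()
    return int(action)
```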
puyuan1996 added the research (Research work in progress) and enhancement (New feature or request) labels Mar 3, 2026
tAnGjIa520 and others added 4 commits April 17, 2026 14:07
…d game demos

- Refine train_unizero, game buffers, world_model, unizero policy and collector
- Add policy utils (GAE-related); remove test_gae_computation.py
- Add LunarLander online PPO config; update Atari/LunarLander PPO configs
- Add GAME_README, GAE plan, lunarlander flow doc, game scripts and requirements

Made-with: Cursor
Ensure returns are derived from raw value+advantage before adv normalization to keep target values correct.

Made-with: Cursor
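
The ordering described in this commit can be sketched as follows. This is an illustrative stdlib-only version under stated assumptions, not the code from the PR: the function name `returns_then_normalize`, the use of the population standard deviation, and the `eps` value are all choices made for the example.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def returns_then_normalize(
    values: Sequence[float],
    advantages: Sequence[float],
    eps: float = 1e-8,
) -> Tuple[List[float], List[float]]:
    """Build value targets from raw advantages, then normalize the advantages.

    Hypothetical helper for illustration only.
    """
    # Value targets must use the raw advantages: return = value + advantage.
    returns = [v + a for v, a in zip(values, advantages)]
    # Only afterwards are the advantages standardized for the policy loss.
    mean = fmean(advantages)
    std = pstdev(advantages)
    normalized = [(a - mean) / (std + eps) for a in advantages]
    return returns, normalized
```

If the two steps were swapped, the targets would become `value + normalized_advantage`, which no longer estimates the return.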
…O value loss

- Restructure CLAUDE.md with LightZero framework overview
- Add quick start commands, testing instructions, and architecture guide
- Document PPO modifications with detailed data flow
- Include development patterns and debugging tips

- Fix PPO value loss: remove gamma^t discount from value loss computation
- Keep gamma^t discount only for world model losses (obs/rewards)
- Align with standard PPO value loss formulation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
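
The difference between the two weighting schemes named in this commit can be sketched as below. Both functions are hypothetical illustrations written for this example, not the `world_model.py` code: the first is an unweighted mean squared error over timesteps, the second applies gamma^t weights to per-step losses, as the commit describes for the observation and reward terms.

```python
from typing import Sequence


def ppo_value_loss(pred_values: Sequence[float], returns: Sequence[float]) -> float:
    """Mean squared error with every timestep weighted equally (no gamma^t)."""
    errors = [(p - r) ** 2 for p, r in zip(pred_values, returns)]
    return sum(errors) / len(errors)


def discounted_sequence_loss(per_step_losses: Sequence[float], gamma: float) -> float:
    """Mean of per-step losses weighted by gamma^t, where t is the step index."""
    weighted = [(gamma ** t) * loss for t, loss in enumerate(per_step_losses)]
    return sum(weighted) / len(weighted)
```

The returns used as value targets already account for discounting, so an extra gamma^t factor on the value loss would down-weight later timesteps a second time.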
- Add Claude Code directories to .gitignore (.codex, codex_home/, .omc/, .claude/, scripts/, docs/)
- Update CLAUDE.md with environment setup and PPO implementation details
- Refine split training mode in train_unizero.py for better World Model/PPO separation
- Enhance buffer tracking with latest_push_count and new_data_ratio in game_buffer.py
- Improve PPO batch extraction and padding logic in game_buffer_unizero.py
- Add loss_type parameter support in world_model.py for flexible training modes
- Update PPO configurations with split training options and improved hyperparameters

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
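
The padding and masking idea mentioned for PPO batch extraction can be sketched as follows. This is a generic stdlib illustration, not the logic in `game_buffer_unizero.py`; the names `pad_with_mask` and `masked_mean` and the choice of `0.0` as the pad value are assumptions.

```python
from typing import List, Sequence, Tuple


def pad_with_mask(
    seqs: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[float]]]:
    """Right-pad sequences to a common length and return a 1/0 validity mask."""
    max_len = max(len(s) for s in seqs)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
    mask = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in seqs]
    return padded, mask


def masked_mean(values: Sequence[Sequence[float]], mask: Sequence[Sequence[float]]) -> float:
    """Average only over positions where the mask is 1, ignoring padding."""
    total = sum(
        v * m for row_v, row_m in zip(values, mask) for v, m in zip(row_v, row_m)
    )
    count = sum(sum(row) for row in mask)
    return total / max(count, 1.0)
```

Averaging a loss with `masked_mean` keeps padded positions from diluting the result when sequences in a batch have different lengths.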
tAnGjIa520 force-pushed the unizero-ppo-updates branch from c10fa1a to 48b72b2 April 30, 2026 04:22
Remove docs/ directory from version control as specified in .gitignore.
The directory will remain locally but won't be tracked by git.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>