Skip to content

Fix: rezero buffers variable action space#480

Open
SteamMachinist wants to merge 3 commits into
opendilab:mainfrom
SteamMachinist:fix/rezero-buffers-variable-action-space
Open

Fix: rezero buffers variable action space#480
SteamMachinist wants to merge 3 commits into
opendilab:mainfrom
SteamMachinist:fix/rezero-buffers-variable-action-space

Conversation

@SteamMachinist

Copy link
Copy Markdown
Contributor

No description provided.

@puyuan1996

Copy link
Copy Markdown
Collaborator

Thanks for the PR. I agree with the motivation to handle variable action spaces correctly, but I think this change may break the existing child_visit_segment contract.

For varied_action_space, the current non-reanalyzed target path treats child_visit[current_index] as a compact distribution aligned with legal_actions, and expands it back to the full action space later. This PR writes the expanded full-action policy back into child_visit_segment. As a result, when the same segment is later used by the non-reanalyzed path, it will be expanded again incorrectly.

For example, with action_space_size=5, legal_actions=[1, 3], and compact distribution [0.25, 0.75], this PR stores [0, 0.25, 0, 0.75, 0]. Later _compute_target_policy_non_reanalyzed() will read distributions[0] and distributions[1] for legal actions 1 and 3, producing an incorrect target.

I think we should either:

  1. keep child_visit_segment compact and only return full-size target_policies, or
  2. change all consumers of child_visit_segment to treat it as full-size, update the docs/comments, and add regression tests.

Could you add a unit test covering ReZero + varied_action_space, especially the path where a reanalyzed segment is later sampled through the non-reanalyzed target computation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants