[graph_trainer] Fix DSv3 bucketing order for multinode bitwise numerics with eager #3770
Merged
Bug report:
DSv3 GraphTrainer numerics sweeps compare Eager and GraphTrainer weight hashes after each step. The bucketing-order investigation found one ordering difference: with chunked loss, GraphTrainer can move the lm_head weight use under module_fqn "loss". The default transformer buckets for non-chunked loss end with only ["norm", "lm_head"], so with chunked loss the final lm_head-related work fell outside the bucket intended to preserve final-layer ordering.

Repro:
Run the 16-GPU DSv3 numerics matrix with weight hashes enabled, for example the FLASH/NOFLASH, CG/NOCG, BS=1/16 sweep with:

```bash
GT_WEIGHT_HASH=1 --debug.deterministic --metrics.perf_metrics_only
```

The BS16 cases exposed weight-hash mismatches and required checking bucket ordering around the final norm/lm_head/loss region.

Explanation and fix:
The default final transformer block bucket remains ["norm", "lm_head"] for non-chunked loss. When GraphTrainer is configured with chunked CE loss, the pass builder calls the bucket helper with `chunked_loss_enabled=True`, which extends that final bucket to include "loss". This keeps chunked-loss lm_head uses in the same final bucket as the lm_head module, so GraphTrainer cannot reorder that final dependency relative to the rest of the model, and the original bucket plan for non-chunked loss is preserved.

The unit test pins the compile-time pass decision so that the loss bucket is enabled only when the configured loss is chunked CE, without asserting the helper's literal bucket values.

Test Plan:

```bash
/home/ivankobzarev/local/b/pytorch-env/bin/python -m py_compile \
  torchtitan/experiments/graph_trainer/common_utils.py \
  torchtitan/experiments/graph_trainer/passes.py \
  torchtitan/experiments/graph_trainer/tests/test_passes.py
/home/ivankobzarev/local/b/pytorch-env/bin/python -m unittest \
  torchtitan.experiments.graph_trainer.tests.test_passes.TestDefaultTransformerBlockBuckets
```

Authored with assistance from OpenAI Codex.
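For readers skimming the description, the bucket change can be sketched roughly as below. This is an illustrative sketch only, not the code in `common_utils.py` or `passes.py`: the function names (`final_bucket`, `build_bucket_plan`, `bucket_index`), the one-bucket-per-block layout, and the `layers.N` FQNs are assumptions made for the example. Only the final bucket contents (["norm", "lm_head"], plus "loss" when `chunked_loss_enabled=True`) come from the description above.

```python
from typing import List, Optional


def final_bucket(chunked_loss_enabled: bool = False) -> List[str]:
    """Return the final bucket (hypothetical helper name).

    Non-chunked loss keeps ["norm", "lm_head"]; chunked CE loss adds "loss"
    so lm_head uses traced under module_fqn "loss" stay in the same bucket.
    """
    bucket = ["norm", "lm_head"]
    if chunked_loss_enabled:
        bucket.append("loss")
    return bucket


def build_bucket_plan(
    block_fqns: List[str], chunked_loss_enabled: bool = False
) -> List[List[str]]:
    """Assumed layout: one bucket per transformer block, final bucket last."""
    plan = [[fqn] for fqn in block_fqns]
    plan.append(final_bucket(chunked_loss_enabled))
    return plan


def bucket_index(plan: List[List[str]], module_fqn: str) -> Optional[int]:
    """Index of the bucket containing module_fqn, or None if unbucketed."""
    for i, bucket in enumerate(plan):
        if module_fqn in bucket:
            return i
    return None


blocks = ["layers.0", "layers.1"]  # placeholder FQNs for illustration
default_plan = build_bucket_plan(blocks)
chunked_plan = build_bucket_plan(blocks, chunked_loss_enabled=True)

# Without the flag, "loss" belongs to no bucket; with it, "loss" shares
# a bucket with "lm_head".
print(bucket_index(default_plan, "loss"))
print(bucket_index(chunked_plan, "loss") == bucket_index(chunked_plan, "lm_head"))
```

In this sketch the non-chunked plan is unchanged by the flag's existence, which mirrors the stated goal of preserving the original bucket plan for non-chunked loss.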
aditvenk approved these changes on Jun 24, 2026