[Pytorch] Make test script generate checkpoints if they don't exist #2650

Open
kainzhong wants to merge 1 commit into NVIDIA:main from kainzhong:fix/generate_ckpt_for_test

@kainzhong
Collaborator

Description

Users running the PyTorch tests for the first time have likely not generated the checkpoint files yet, and they have to read tests/pytorch/test_checkpoint.py to figure out how to run it to generate them.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • If the checkpoint path doesn't exist, run tests/pytorch/test_checkpoint.py to generate all the checkpoint files under that path
  • Make checkpoint_dir.mkdir create parent directories if they don't exist, which is likely the case for people who run export TE_PATH=$(pwd) before running the tests
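
The second change has a close shell analogue in `mkdir -p`. A minimal sketch of why it matters, assuming a fresh checkout where none of the `artifacts/...` parent directories exist yet; the temporary directory stands in for `$TE_PATH` and is purely illustrative:

```shell
#!/usr/bin/env bash
# Illustrative only: a temp directory stands in for $TE_PATH.
fake_te_path=$(mktemp -d)
ckpt_dir="$fake_te_path/artifacts/tests/pytorch/test_checkpoint"

# Plain mkdir fails because the parent directories are missing.
if mkdir "$ckpt_dir" 2>/dev/null; then
    plain_result="created"
else
    plain_result="failed"
fi

# mkdir -p creates missing parents and does not fail if the directory
# already exists, analogous to Path.mkdir(parents=True, exist_ok=True).
mkdir -p "$ckpt_dir" && parents_result="created"
mkdir -p "$ckpt_dir" && second_call="ok"

echo "plain mkdir: $plain_result, mkdir -p: $parents_result, repeat: $second_call"

# Clean up the temporary directory.
rm -rf "$fake_te_path"
```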

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps
Contributor

greptile-apps bot commented Feb 4, 2026

Greptile Overview

Greptile Summary

This PR improves the first-run experience for users running PyTorch tests by automatically generating checkpoint files if they don't exist. The changes address previous review feedback by properly exporting the NVTE_TEST_CHECKPOINT_ARTIFACT_PATH environment variable before use.

Key changes:

  • Test script now exports NVTE_TEST_CHECKPOINT_ARTIFACT_PATH environment variable before checking directory existence
  • Added automatic checkpoint generation when the checkpoint directory doesn't exist
  • Updated mkdir call to create parent directories with parents=True parameter

Impact:
Users running tests for the first time no longer need to figure out manually how to generate checkpoint files; the test script handles this automatically.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are straightforward infrastructure improvements that enhance test automation without modifying core functionality. Previous review concerns have been addressed (environment variable export, error handling with error_exit). The mkdir change ensures robustness when parent directories don't exist.
  • No files require special attention

Important Files Changed

Filename | Overview
qa/L0_pytorch_unittest/test.sh | Added checkpoint generation logic that exports environment variable and generates checkpoints if directory doesn't exist; addresses previous review feedback about missing export
tests/pytorch/test_checkpoint.py | Added parents=True to mkdir call to ensure parent directories are created when needed

Sequence Diagram

sequenceDiagram
    participant Script as test.sh
    participant Env as Environment
    participant FS as File System
    participant CheckpointGen as test_checkpoint.py
    participant PyTest as pytest

    Script->>Env: export NVTE_TEST_CHECKPOINT_ARTIFACT_PATH
    Script->>FS: Check if directory exists
    alt Directory does not exist
        Script->>CheckpointGen: python3 test_checkpoint.py --save-checkpoint all
        CheckpointGen->>FS: mkdir(parents=True, exist_ok=True)
        loop For each checkpoint type
            CheckpointGen->>CheckpointGen: Create module
            CheckpointGen->>FS: Save checkpoint file
        end
        CheckpointGen-->>Script: Success or error_exit
    else Directory exists
        Script->>Script: Skip generation
    end
    Script->>PyTest: Run pytest test_checkpoint.py
    PyTest->>FS: Load checkpoint files
    PyTest->>PyTest: Test module loading
    PyTest-->>Script: test_fail or success

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$TE_PATH/artifacts/tests/pytorch/test_checkpoint python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_checkpoint.xml $TE_PATH/tests/pytorch/test_checkpoint.py || test_fail "test_checkpoint.py"
NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$TE_PATH/artifacts/tests/pytorch/test_checkpoint
if [ ! -d "$NVTE_TEST_CHECKPOINT_ARTIFACT_PATH" ]; then
python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all

NVTE_TEST_CHECKPOINT_ARTIFACT_PATH is assigned but not exported, so the checkpoint generation command won't see it

Suggested change
python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all
NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$NVTE_TEST_CHECKPOINT_ARTIFACT_PATH python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all
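
For context on this comment: a plain assignment in a shell script sets a shell variable that child processes do not inherit. The sketch below shows the difference; `DEMO_CKPT_PATH` is a made-up variable name used only for the demonstration, and `sh -c` stands in for the `python3` child process.

```shell
#!/usr/bin/env bash
# DEMO_CKPT_PATH is a hypothetical name used only for this demonstration.
unset DEMO_CKPT_PATH

# Plain assignment: visible in this shell, but not passed to child processes.
DEMO_CKPT_PATH=/tmp/demo_checkpoints
seen_without_export=$(sh -c 'echo "${DEMO_CKPT_PATH:-unset}"')

# After export, child processes inherit the variable.
export DEMO_CKPT_PATH
seen_with_export=$(sh -c 'echo "${DEMO_CKPT_PATH:-unset}"')

echo "without export: $seen_without_export"
echo "with export:    $seen_with_export"
```

Prefixing a single command with `VAR=value`, as in the suggested change, also places the variable in that command's environment, but only for that one command.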

@kainzhong kainzhong force-pushed the fix/generate_ckpt_for_test branch from 5eb12d9 to 63b6471 on February 4, 2026 22:08
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 1 comment

NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$TE_PATH/artifacts/tests/pytorch/test_checkpoint python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_checkpoint.xml $TE_PATH/tests/pytorch/test_checkpoint.py || test_fail "test_checkpoint.py"
export NVTE_TEST_CHECKPOINT_ARTIFACT_PATH=$TE_PATH/artifacts/tests/pytorch/test_checkpoint
if [ ! -d "$NVTE_TEST_CHECKPOINT_ARTIFACT_PATH" ]; then
python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all

Missing error handling for checkpoint generation: if this command fails, the script continues and pytest will fail with a less clear error message.

Suggested change
python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all
python3 $TE_PATH/tests/pytorch/test_checkpoint.py --save-checkpoint all || error_exit "Failed to generate checkpoint files"
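
The `cmd || handler` pattern in this suggestion can be sketched as follows. Both functions here are hypothetical stand-ins: `error_exit` approximates the helper the script is assumed to define (the real definition in test.sh may differ), and `generate_checkpoints` simulates a failing generation step.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the script's error_exit helper; the real
# definition in test.sh may differ.
error_exit() {
    echo "Error: $1" >&2
    exit 1
}

# Stand-in for a checkpoint generation command that fails.
generate_checkpoints() {
    return 1
}

# The command substitution runs in a subshell, so exit stops only the
# demonstration and not the calling shell.
if output=$(
    generate_checkpoints || error_exit "Failed to generate checkpoint files"
    echo "pytest would run here"
); then
    status=0
else
    status=$?
fi

echo "exit status: $status, later steps ran: ${output:-no}"
```

Because the handler exits immediately, the later step never runs and the failure is reported at its source rather than as a missing-file error from pytest.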

Signed-off-by: Kaining Zhong <kainingz@nvidia.com>
@kainzhong kainzhong force-pushed the fix/generate_ckpt_for_test branch from 63b6471 to 66aaead on February 4, 2026 22:13
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@kainzhong
Collaborator Author

/te-ci
