Torch-NPU main branch and upstream community integration tests #127
- Add Dockerfile for pytorch-npu-builder with CANN 9.0.0-beta.2
- Add build-docker-image.yml for scheduled/manual image build
- Add _build.yml for PyTorch and torch_npu compilation
- Add _collect.yml for pytest case collection and sharding
- Add _test.yml for test execution with subprocess isolation
- Add npu-full-test.yml as main orchestration workflow
- Add scripts: collect_all_cases.py, run_npu_test_shard.py, generate_report.py
- Add CLAUDE.md with complete design documentation
Key features:
- Docker image pass-through via needs.build.outputs.docker-image
- Case-level sharding for load balancing
- Per-case subprocess execution for crash isolation
- Distributed (serial) vs Regular (32 workers) test execution
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
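The commit message names case-level sharding but does not say how cases are assigned to shards. A minimal sketch, assuming a simple round-robin split over sorted node IDs; the helper name `shard_cases` and the sample IDs are hypothetical and not taken from collect_all_cases.py:

```python
from typing import List


def shard_cases(case_ids: List[str], num_shards: int) -> List[List[str]]:
    """Distribute test node IDs across shards round-robin (hypothetical helper)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    # Sort first so the same input always yields the same assignment.
    for index, case_id in enumerate(sorted(case_ids)):
        shards[index % num_shards].append(case_id)
    return shards


if __name__ == "__main__":
    sample = [
        "test_a.py::TestA::test_one",
        "test_a.py::TestA::test_two",
        "test_b.py::TestB::test_one",
    ]
    for shard_index, shard in enumerate(shard_cases(sample, 2)):
        print(shard_index, shard)
```

Splitting at the case level rather than the file level keeps shard sizes within one case of each other, so a single very large test file cannot dominate one shard.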
CLA Signature Pass: kerer-ai, thanks for your pull request. All authors of the commits have signed the CLA. 👍
- Rename npu-full-test.yml to "PyTorch NPU Full Test(main 分支)"
- Add pull_request trigger to build-docker-image.yml for Dockerfile and workflow changes
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Change runner to ubuntu-22.04-arm (supports Docker)
- Skip login and push for pull_request events (only build test)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Convert repository_owner to lowercase using tr command
- Docker requires all image names to be lowercase
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Change base image from ghcr.io/pytorch/manylinux-builder (private) to quay.io/pypa/manylinux_2_28_aarch64 (public, matches PyTorch main)
- Add necessary OS packages matching PyTorch's Dockerfile
- Set Python 3.11 from manylinux as default (PATH=/opt/python/cp311-cp311/bin)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add gcc-toolset-13 for modern GCC toolchain (required for PyTorch build)
- Add language environment variables (LC_ALL, LANG, LANGUAGE)
- Add git safe.directory config for bind-mounted repos
- Remove pytest dependencies (install at test time, not in build image)
- Match PyTorch's .ci/docker/manywheel/Dockerfile_2_28_aarch64
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CANN toolkit installation requires a Python environment. Previously the Python PATH was set after the CANN installation, which caused the cann-ge-compiler install to fail (exit code 4). The Python 3.11 PATH is now configured before the CANN installation.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
NNAL package (atb) requires CANN environment variables to be set. Add 'source /usr/local/Ascend/cann/set_env.sh' before nnal install.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Use 'ascend' (lowercase) instead of github.repository_owner to match the actual image name pushed to ghcr.io. Docker requires image names to be lowercase.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Trigger when workflow files, scripts, or docker files are modified:
- .github/workflows/** (workflow files)
- .github/scripts/** (Python scripts)
- .github/docker/** (Dockerfile)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
In reusable workflows (workflow_call), the env context is not available in container configuration. Replace env.REGISTRY with hardcoded ghcr.io value.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ghcr.io images are private by default and require authentication. Add credentials configuration to all reusable workflows that pull the pytorch-npu-builder image:
- _build.yml
- _collect.yml
- _test.yml
Uses github.actor and GITHUB_TOKEN for authentication.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Also remove related PR conditional checks since they are no longer needed:
- Remove "if: github.event_name != 'pull_request'" from login step
- Change "push: ${{ github.event_name != 'pull_request' }}" to "push: true"
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Use GitHub API to set the container package visibility to public
after pushing the image. This allows anyone to pull the image
without authentication.
PATCH /orgs/{org}/packages/container/{package_name}
with {"visibility":"public"}
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add pull_request trigger for Dockerfile and workflow changes
- PR builds only test the build and do not push (push: ${{ github.event_name != 'pull_request' }})
- Skip visibility step for PR events
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CLA Signature Guide: @kerer-ai, thanks for your pull request. The following commit(s) are not associated with a signed Contributor License Agreement (CLA).
To sign CLA, click here. To check if your email is configured correctly, refer to the FAQs. Once you've signed the CLA or updated your email, please comment
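1. Unify the pip cache configuration across all workflows
   - build workflow: PIP_CACHE_DIR set in 4 places
   - collect/regular/dist/custom workflows: add PIP_CACHE_DIR and the Cache pip action
   - All pip install steps use the cache to speed up downloads
2. Add torchvision installation (ignoring the version check)
   - Use --no-deps to bypass the torch version binding issue
   - Resolves the missing dependency for the onnx/test_models test series
3. Update the blacklist configuration case_paths_ci.yml
   - Add torch_openreg tests (require separate compilation; excluded by default upstream)
   - Add dynamo/test_torchrec (fbgemm-gpu has no ARM64 support)
   - Add onnx/exporter/test_hf_models_e2e (depends on transformers)
   - Add detailed comments explaining the reason for each blacklist entry
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>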
- run_npu_test_shard.py: remove the --case-paths-config argument; pass None when calling discover
- _torch-npu-upstream-collect.yml: stop passing --case-paths-config to collect_all_cases.py
- _torch-npu-upstream-test-dist.yml: delete the redundant error logs upload step
- _torch-npu-upstream-test-regular.yml: delete the redundant error logs upload step
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
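Main changes:
- Add the setup-npu-test-env action: wraps the common steps (checkout, cache, installing torch/torch_npu, test dependencies, and so on)
- Simplify the 4 sub-workflows: collect/custom/dist/regular all call the action
- Improve collect_all_cases.py logging: compute display_name up front to avoid duplicated logic
- Simplify run_npu_test_shard.py: remove the deprecated shard discovery mode; keep the cases-json and test-files modes
- Distributed tests run serially by setting max_workers=1
Parameter cleanup:
- The action keeps only the parameters actually used: python_version, torch_wheel_artifact, torch_npu_wheel_artifact, pytorch_src_artifact
- Remove the meaningless pytorch_version, cache_key_prefix, and patch_log_suffix parameters
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>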
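To illustrate how one worker-count parameter can cover both the serial distributed runs and the parallel regular runs, here is a minimal sketch of per-case subprocess execution. The names `run_case`, `run_cases`, and `TIMEOUT_EXIT_CODE` are hypothetical, and the actual run_npu_test_shard.py may be structured differently:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical sentinel for a case that exceeded its time limit.
TIMEOUT_EXIT_CODE = 124


def run_case(command: List[str], timeout: float = 600.0) -> int:
    """Run one case in its own process so a crash cannot take down the runner."""
    try:
        completed = subprocess.run(command, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return TIMEOUT_EXIT_CODE
    return completed.returncode


def run_cases(commands: Dict[str, List[str]], max_workers: int) -> Dict[str, int]:
    """Run every command and map its name to an exit code.

    max_workers=1 gives strictly serial execution; larger values run cases
    concurrently, each still in a separate process.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(run_case, cmd) for name, cmd in commands.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    demo = {
        "passes": [sys.executable, "-c", "pass"],
        "fails": [sys.executable, "-c", "import sys; sys.exit(3)"],
    }
    print(run_cases(demo, max_workers=1))
```

In a real shard each command would presumably be something like `[sys.executable, "-m", "pytest", nodeid]`; the demo uses trivial commands so that it runs without pytest installed.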
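Problem: the collection logic only checked for the presence of "::", so it wrongly collected lines that are not test cases:
- @torch.library.register_fake("torchvision::nms") contains :: but is not a test case
Fix: add strict filter conditions:
1. Must contain ".py::" (marks a Python test file)
2. Must not start with "@" (decorator/registration symbol)
3. Must not start with "<" (pytest collection marker)
4. Must not contain "(" (function call syntax)
Affected files:
- collect_all_cases.py: collect_cases_for_file()
- run_npu_test_shard.py: collect_test_cases()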
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
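The four filter conditions in this commit message can be sketched as a single predicate. This is an illustration of the stated rules only; the function name `is_test_case_line` is hypothetical and the code is not copied from collect_cases_for_file():

```python
def is_test_case_line(line: str) -> bool:
    """Return True if a collected output line looks like a pytest node ID."""
    text = line.strip()
    return (
        ".py::" in text                  # rule 1: refers to a Python test file
        and not text.startswith("@")     # rule 2: not a decorator/registration
        and not text.startswith("<")     # rule 3: not a pytest collection marker
        and "(" not in text              # rule 4: not function call syntax
    )


if __name__ == "__main__":
    for candidate in (
        "test_ops.py::TestOps::test_add",
        '@torch.library.register_fake("torchvision::nms")',
    ):
        print(candidate, is_test_case_line(candidate))
```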
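pytest --collect-only -q prints output in a standard format:
- One full nodeid per line: test_file.py::TestClass::test_method
- A summary at the end: "X tests collected"
Simplified parsing rules:
1. Skip empty lines
2. Skip summary lines containing "collected"/"selected"
3. Skip separator lines starting with "="
4. Only check for ".py::" to confirm the line refers to a Python test file
Remove the earlier complex string matching (checks for @, <, ( and so on), because the -q output is already clean and needs no defensive filtering.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>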
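The simplified parsing rules can be sketched as follows. The function name `parse_collect_output` and the sample output are hypothetical, and the code follows the four rules exactly as written in the commit message rather than the repository's actual implementation:

```python
from typing import List


def parse_collect_output(output: str) -> List[str]:
    """Extract node IDs from the text printed by `pytest --collect-only -q`."""
    cases: List[str] = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # rule 1: skip empty lines
        if "collected" in line or "selected" in line:
            continue  # rule 2: skip summary lines
        if line.startswith("="):
            continue  # rule 3: skip separator lines
        if ".py::" in line:
            cases.append(line)  # rule 4: keep Python test node IDs
    return cases


if __name__ == "__main__":
    sample = (
        "test_a.py::TestA::test_one\n"
        "test_a.py::TestA::test_two\n"
        "\n"
        "2 tests collected in 0.01s\n"
    )
    print(parse_collect_output(sample))
```

One consequence of rule 2 as stated is that a test whose name happens to contain "collected" or "selected" would also be skipped, which may be worth checking against the real test suite.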
Problem: collect_test_cases() duplicates the functionality of collect_cases_for_file() in collect_all_cases.py
Changes:
1. Delete the collect_test_cases() function (110 lines)
2. Delete the run_tests_with_concurrent_isolation() function (214 lines)
3. Add import collect_all_cases
4. The --test-files mode now:
   - Calls collect_all_cases.collect_all_cases() to collect cases
   - Builds a list of CaseExecutionTask objects
   - Calls run_tests_with_tasks_concurrent() to execute them
Unified execution flow:
- --cases-json mode: pre-collected cases → run_tests_with_tasks_concurrent()
- --test-files mode: cases collected on the spot → run_tests_with_tasks_concurrent()
Code reduced by 291 lines (1584→1293)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Key features: