Conversation
…ning containers in bash
… for each other before anyone is allowed to reset the memory pool
| pip install dist/apex-*.whl | ||
| " | ||
|
|
||
| - name: Run L0 tests |
There was a problem hiding this comment.
Could you run the import extensions test first
python tests/test_extension_import.py
.github/workflows/rocm-ci.yml
Outdated
| docker_image: | ||
| description: 'Docker image to use' | ||
| required: false | ||
| default: 'rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1' |
There was a problem hiding this comment.
| default: 'rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1' | |
| default: 'rocm/pytorch:latest' |
We don't want to have to update this from time-to-time, so if rocm/pytorch:latest works as a reasonable default, let's use that. If it doesn't, then we need to understand why and make it work.
.github/workflows/rocm-ci.yml
Outdated
| type: boolean | ||
|
|
||
| env: | ||
| DOCKER_IMAGE: ${{ inputs.docker_image || 'rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1' }} |
There was a problem hiding this comment.
| DOCKER_IMAGE: ${{ inputs.docker_image || 'rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1' }} | |
| DOCKER_IMAGE: ${{ inputs.docker_image || 'rocm/pytorch:latest' }} |
Also, to follow the ODR (One Definition Rule), let's remove the default value for the inputs.docker_image since it'll get set from here if it's empty.
| jobs: | ||
| build: | ||
| name: Build Apex Wheel | ||
| runs-on: build-only-apex |
There was a problem hiding this comment.
@leo-amd Is this a runner label configured just for Apex builds? I'd assume we would use a more generic runner label. In fact, we should use Github's runners for this, because this is just the build job
There was a problem hiding this comment.
Yes, they are configured per repository
| run: | | ||
| docker exec apex-test-container bash -c " | ||
| set -e | ||
| pip install expecttest onnxscript |
There was a problem hiding this comment.
@amd-sriram Can these be added to some requirements.txt or requirements-dev.txt file of Apex?
| run: | | ||
| docker exec apex-test-container bash -c " | ||
| set -eo pipefail | ||
| export HSA_FORCE_FINE_GRAIN_PCIE=1 |
There was a problem hiding this comment.
@amd-sriram RCCL team (Wenkai) told me a while back that HSA_FORCE_FINE_GRAIN_PCIE is not needed anymore: "since ROCm 5.7, no longer required". Can you please confirm if that's true on ROCm7.2, and remove it from here (and any Apex documentation), if so?
| docker exec apex-test-container bash -c " | ||
| set -eo pipefail | ||
| export HSA_FORCE_FINE_GRAIN_PCIE=1 | ||
| export HSA_ENABLE_SDMA=0 |
.github/workflows/rocm-ci.yml
Outdated
| L0_results.log | ||
| contrib_results.log | ||
| halo_results.log | ||
| syncbn_results.log |
There was a problem hiding this comment.
@leo-amd Can we just do *.log here to reduce maintenance when/if we update any of the test steps or log names?
There was a problem hiding this comment.
@amd-sriram / @leo-amd The changes to this file shouldn't be part of this PR, they belong in a different PR. If they are already merged into master, please rebase this PR branch so it doesn't show the diffs in this file.
No description provided.