[CI][Packaging] Cibuildwheel-based multi-platform wheel publishing#19641
[CI][Packaging] Cibuildwheel-based multi-platform wheel publishing#19641tlopex wants to merge 43 commits into
Conversation
Linking the full LLVM static archives into libtvm_compiler.so on Linux produces a library large enough to trigger a GNU ld (binutils) R_X86_64_GOTPCRELX relaxation bug. conda-forge LLVM is built with -fno-plt, so calls are emitted as indirect GOT calls; ld relaxes one of them to a direct call with an incorrect displacement that lands in read-only data instead of the intended function. The result is a runtime SIGSEGV inside llvm::X86Subtarget while building the code-generation pass pipeline (e.g. tvm.compile(target="llvm")). This reproduces with a freshly linked, unrepaired manylinux build (gcc-toolset-14 + GNU ld 2.41), so it is the linker relaxation -- not auditwheel, LLVM 22, or any TVM source change. The same objects linked by a different ld (or with relaxation disabled) work correctly. Pass -Wl,--no-relax when linking tvm_compiler with LLVM on Linux to keep the GOT-indirect sequences and avoid the miscompilation. It is harmless when LLVM is linked dynamically. See binutils bug ld/25754. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the debugging instrumentation added while diagnosing the Linux wheel SIGSEGV now that the root cause is fixed (linker --no-relax): - verify_tvm_linux.sh: remove the gdb backtrace wrapper and the readelf/nm symbol dumps; it now just clears dev library-path overrides and runs the cross-platform verifier directly. - publish workflow + wheel action: remove the debug_symbols option (which produced unstripped -g wheels and is unsuitable for publishing). The legitimate verification (import, runtime-from-wheel and static-LLVM checks, and the TIRX/Relax compile smoke tests) is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Replace the select_build_matrix job (which generated the matrix from embedded Python and the build_target input) with a static strategy.matrix.include in build_wheels. The workflow always builds all four platform wheels, which is what publishing requires anyway, so the dynamic-matrix indirection and the build_target input/validation are removed. - Single-source the LLVM toolchain version: it is now set once in LLVM_VERSION and referenced by the cache key and the conda install steps, instead of being hardcoded in three places. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Windows repair command (rewrite_wheel.py) passed --distribution-name but not --distribution-version, so Windows wheels kept the pyproject version while Linux/macOS wheels got the override. When publishing to TestPyPI with a version override this produced a mismatched, stale-version Windows wheel that collided with a previously used filename and failed the upload. Pass --distribution-version too (empty is a no-op). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
find_include_path() only searched source-tree-relative locations (<pkg>/../include, ...), so it could not find the headers bundled in an installed wheel (at <pkg>/include), and it looked for the tvm-ffi/dlpack headers under 3rdparty/ which wheels do not ship. As a result target="c" and Module.export_library() (compiling generated C/C++ sources) failed from an installed wheel with "Cannot find the source directory" or a missing <tvm/ffi/c_api.h>. Search "." as well (the wheel layout), and fall back to the include directories reported by the installed apache-tvm-ffi package (tvm_ffi.libinfo) for the tvm-ffi and dlpack headers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
test_transform_lower_tirx.py imported and called Simplify, but the pass was renamed to StmtSimplify (the only simplify pass exported from tvm.tirx.transform). The stale name caused a collection-time ImportError. Use StmtSimplify; the file now collects and all 32 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a new wheel packaging flow for TVM, adding GitHub composite actions and helper scripts to build, repair, validate, and verify LLVM-enabled and CUDA-enabled Python wheels. It also updates CMake configurations to handle relative RPATHs and work around a linker relaxation bug, and adjusts TVM's Python library and RPC server to better support wheel-installed environments. The review feedback suggests copying the create_system attribute in rewrite_wheel.py to preserve ZIP platform metadata, and adding a defensive check in verify_tvm_install.py to ensure the TVM runtime library is loaded correctly before accessing its path.
| def _copy_info(info: zipfile.ZipInfo, filename: str) -> zipfile.ZipInfo: | ||
| copied = zipfile.ZipInfo(filename=filename, date_time=info.date_time) | ||
| copied.compress_type = info.compress_type | ||
| copied.comment = info.comment | ||
| copied.extra = info.extra | ||
| copied.internal_attr = info.internal_attr | ||
| copied.external_attr = info.external_attr | ||
| return copied |
There was a problem hiding this comment.
The _copy_info function does not copy the create_system attribute from the original ZipInfo object. This can cause the rewritten wheel to lose its original system platform metadata (e.g., defaulting to Windows/MS-DOS 0 instead of Unix 3), which can affect file permissions when the wheel is unpacked and makes the build non-reproducible depending on the host OS running the script.
| def _copy_info(info: zipfile.ZipInfo, filename: str) -> zipfile.ZipInfo: | |
| copied = zipfile.ZipInfo(filename=filename, date_time=info.date_time) | |
| copied.compress_type = info.compress_type | |
| copied.comment = info.comment | |
| copied.extra = info.extra | |
| copied.internal_attr = info.internal_attr | |
| copied.external_attr = info.external_attr | |
| return copied | |
| def _copy_info(info: zipfile.ZipInfo, filename: str) -> zipfile.ZipInfo: | |
| copied = zipfile.ZipInfo(filename=filename, date_time=info.date_time) | |
| copied.create_system = info.create_system | |
| copied.compress_type = info.compress_type | |
| copied.comment = info.comment | |
| copied.extra = info.extra | |
| copied.internal_attr = info.internal_attr | |
| copied.external_attr = info.external_attr | |
| return copied |
| def _assert_loaded_runtime_from_wheel(libdir: Path, runtime_candidates: list[Path]) -> None: | ||
| import tvm.base as tvm_base # pylint: disable=import-outside-toplevel | ||
|
|
||
| loaded_runtime = Path(tvm_base._LIB_RUNTIME._name).resolve() # pylint: disable=protected-access |
There was a problem hiding this comment.
To prevent potential AttributeError or TypeError if the TVM runtime library fails to load or is initialized as None, add a defensive check before accessing _LIB_RUNTIME._name. This aligns with the defensive checks used elsewhere in the file (e.g., _log_tvm_ffi_details).
| loaded_runtime = Path(tvm_base._LIB_RUNTIME._name).resolve() # pylint: disable=protected-access | |
| if tvm_base._LIB_RUNTIME is None or not hasattr(tvm_base._LIB_RUNTIME, "_name"): | |
| raise RuntimeError("TVM runtime library was not loaded correctly or is missing '_name' attribute") | |
| loaded_runtime = Path(tvm_base._LIB_RUNTIME._name).resolve() # pylint: disable=protected-access |
MasterJH5574
left a comment
There was a problem hiding this comment.
Finished a round on .github/workflows/publish_wheel.yml
| fi | ||
|
|
||
| - name: Checkout source | ||
| uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 |
There was a problem hiding this comment.
Use tag instead of hash. Same for other places.
| default: "none" | ||
| type: choice | ||
| options: | ||
| - none |
There was a problem hiding this comment.
What does "none" mean? If it means a build dryrun, maybe add a brief comment about it.
| permissions: | ||
| actions: read | ||
| contents: read | ||
| id-token: write | ||
| attestations: write |
There was a problem hiding this comment.
Are the permissions necessary?
| - name: Check wheel sizes | ||
| shell: bash | ||
| run: | | ||
| set -euo pipefail | ||
| limit_bytes=100000000 | ||
| shopt -s nullglob | ||
| wheels=(dist/*.whl) | ||
| if [[ "${#wheels[@]}" -eq 0 ]]; then | ||
| echo "No wheel artifacts found under dist/" >&2 | ||
| exit 1 | ||
| fi | ||
| if [[ "${#wheels[@]}" -ne 4 ]]; then | ||
| echo "Expected 4 wheel artifacts, found ${#wheels[@]}" >&2 | ||
| printf '%s\n' "${wheels[@]}" >&2 | ||
| exit 1 | ||
| fi | ||
| failed=0 | ||
| for wheel in "${wheels[@]}"; do | ||
| size="$(stat -c '%s' "$wheel")" | ||
| printf '%s %s bytes\n' "$wheel" "$size" | ||
| if (( size > limit_bytes )); then | ||
| echo "Wheel exceeds 100 MB PyPI/TestPyPI upload limit: ${wheel}" >&2 | ||
| failed=1 | ||
| fi | ||
| done | ||
| exit "$failed" |
There was a problem hiding this comment.
Do we need to manual check? I assume it will automatically fail at the publish step if the wheel is larger than 100MB?
| - name: Generate artifact attestation for wheels | ||
| uses: actions/attest-build-provenance@v1 | ||
| with: | ||
| subject-path: dist/* |
There was a problem hiding this comment.
I was wondering what this is used for.
| - name: Publish package distributions to TestPyPI | ||
| if: ${{ inputs.publish_repository == 'testpypi' }} | ||
| uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0 | ||
| with: | ||
| attestations: true | ||
| verbose: true | ||
| repository-url: https://test.pypi.org/legacy/ | ||
|
|
||
| - name: Publish package distributions to PyPI | ||
| if: ${{ inputs.publish_repository == 'pypi' }} | ||
| uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0 | ||
| with: | ||
| attestations: true | ||
| verbose: true |
There was a problem hiding this comment.
Not sure if it's possible to merge these two branches.
| - name: Check out source | ||
| uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 | ||
| with: | ||
| ref: ${{ inputs.tag }} | ||
| submodules: recursive | ||
| fetch-depth: 0 | ||
| fetch-tags: true | ||
|
|
||
| - uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1 | ||
| with: | ||
| name: tvm-wheel-linux-x86_64-manylinux_2_28 | ||
| path: wheelhouse |
There was a problem hiding this comment.
Do we need checkout and artifact download here?
| - name: Verify package from TestPyPI | ||
| if: ${{ inputs.publish_repository == 'testpypi' }} | ||
| env: | ||
| TVM_PYTHON: python | ||
| TVM_TEST_INDEX_URL: https://test.pypi.org/simple/ | ||
| TVM_EXTRA_INDEX_URL: https://pypi.org/simple | ||
| TVM_EXPECT_LLVM_ENABLED: "1" | ||
| TVM_EXPECT_STATIC_LLVM: "1" | ||
| TVM_EXPECT_CUDA_RUNTIME: "1" | ||
| run: ci/scripts/package/tvm_wheel_helper.sh verify-pypi | ||
|
|
||
| - name: Verify package from PyPI | ||
| if: ${{ inputs.publish_repository == 'pypi' }} | ||
| env: | ||
| TVM_PYTHON: python | ||
| TVM_TEST_INDEX_URL: https://pypi.org/simple/ | ||
| TVM_EXTRA_INDEX_URL: https://pypi.org/simple | ||
| TVM_EXPECT_LLVM_ENABLED: "1" | ||
| TVM_EXPECT_STATIC_LLVM: "1" | ||
| TVM_EXPECT_CUDA_RUNTIME: "1" | ||
| run: ci/scripts/package/tvm_wheel_helper.sh verify-pypi |
There was a problem hiding this comment.
I wonder if it's possible to merge these two as well.
| - name: Build CUDA runtime | ||
| id: build_cuda | ||
| uses: ./.github/actions/build-cuda | ||
| with: | ||
| arch: ${{ matrix.arch }} | ||
| linux_image: ${{ matrix.linux_image }} | ||
| linux_image_tag: ${{ matrix.linux_image_tag }} | ||
| cuda_architectures: ${{ inputs.cuda_architectures }} | ||
| include_cuda_runtime: ${{ matrix.include_cuda_runtime }} |
There was a problem hiding this comment.
Skip when cuda is not needed.
| include_cuda_runtime: | ||
| description: "Set to true to build the CUDA runtime library" | ||
| required: false | ||
| default: "false" |
There was a problem hiding this comment.
It should always be true? If that's the case, we can remove this param?
| - name: Detect CUDA inputs | ||
| id: cuda_inputs | ||
| shell: bash -l {0} | ||
| env: | ||
| INPUT_INCLUDE_CUDA_RUNTIME: ${{ inputs.include_cuda_runtime }} | ||
| run: | | ||
| set -eux | ||
| include_cuda_runtime="$(printf '%s' "${INPUT_INCLUDE_CUDA_RUNTIME}" | tr '[:upper:]' '[:lower:]')" | ||
| case "${include_cuda_runtime}" in | ||
| 1|true|yes|on) include_cuda_runtime=1 ;; | ||
| 0|false|no|off) include_cuda_runtime=0 ;; | ||
| *) | ||
| echo "include_cuda_runtime must be a boolean value" >&2 | ||
| exit 1 | ||
| ;; | ||
| esac | ||
| echo "include_cuda_runtime=${include_cuda_runtime}" >> "${GITHUB_OUTPUT}" |
There was a problem hiding this comment.
Is this check unnecessary?
| - name: Reject non-Linux CUDA runtime builds | ||
| if: runner.os != 'Linux' && steps.cuda_inputs.outputs.include_cuda_runtime == '1' | ||
| shell: bash -l {0} | ||
| run: | | ||
| echo "CUDA runtime wheels are only enabled on Linux in this workflow" >&2 | ||
| exit 1 |
| - name: Report CUDA runtime output | ||
| id: cuda_runtime | ||
| shell: bash -l {0} | ||
| env: | ||
| INCLUDE_CUDA_RUNTIME: ${{ steps.cuda_inputs.outputs.include_cuda_runtime }} | ||
| TVM_CUDA_BUILD_DIR: ${{ runner.temp }}/tvm-wheel-cuda | ||
| TVM_CUDA_RUNTIME_PATH: "" | ||
| run: | | ||
| set -eux | ||
| if [[ "${INCLUDE_CUDA_RUNTIME}" != "1" ]]; then | ||
| echo "path=" >> "${GITHUB_OUTPUT}" | ||
| exit 0 | ||
| fi | ||
| cuda_runtime="$(ci/scripts/package/tvm_wheel_helper.sh cuda-path)" | ||
| if [[ -z "${cuda_runtime}" ]]; then | ||
| echo "CUDA runtime build did not produce libtvm_runtime_cuda.so" >&2 | ||
| exit 1 | ||
| fi | ||
| case "${cuda_runtime}" in | ||
| "${TVM_CUDA_BUILD_DIR}"/*) ;; | ||
| *) | ||
| echo "CUDA runtime path is outside the expected build directory: ${cuda_runtime}" >&2 | ||
| exit 1 | ||
| ;; | ||
| esac | ||
| echo "path=${cuda_runtime}" >> "${GITHUB_OUTPUT}" |
There was a problem hiding this comment.
Do we need this report step? Or the "Build CUDA runtime in manylinux" step above already reports necessary info (e.g., success, failure, or failure reason)
This adds a reproducible
cibuildwheel-based pipeline to build and publish TVM binary wheels across platforms, with LLVM linked statically and an optional CUDA runtime, so users canpip installa working TVM without a source build.