[CI] Add GitHub workflow for building and releasing fat wheels#91
[CI] Add GitHub workflow for building and releasing fat wheels#91tongke6 wants to merge 9 commits into
Conversation
…83) Replace the monolithic `cula.cudac` extension with per-arch extensions (`cula._cudac_sm90`, `cula._cudac_sm100`) so that SM90 and SM100/SM103 kernels are compiled independently with their own `-gencode` flags. This enables building fat-binary wheels containing all architectures without needing the target GPU present at build time. Key changes: - Split pybind.cu into per-file PYBIND11_MODULE definitions - Add `cula/cudac.py` proxy module for backwards-compatible imports - Add `CULA_BUILD_ALL_ARCHS=1` env var to enable all SM targets - Add `--fat` flag to build_wheel.sh for CI fat-binary builds - Pin dependency versions and use `no-local-version` scheme for reproducible wheel filenames - Use setuptools_scm for dynamic `__version__` - Document pre-built wheel installation in README
There was a problem hiding this comment.
Code Review
This pull request restructures the build system and CUDA extension packaging for cuLA to support separate per-architecture extensions (_cudac_sm100 and _cudac_sm90) and fat-binary builds. It introduces a lazy-loading proxy module (cula.cudac) to dynamically expose the compiled extension functions, and updates the versioning to use setuptools_scm. Feedback on these changes highlights a thread-safety vulnerability in the lazy-loading proxy that could cause race conditions during concurrent imports, and advises against strict version pinning of runtime dependencies in pyproject.toml to prevent dependency conflicts for downstream users.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Pull request overview
This PR refactors cuLA’s CUDA packaging so architecture-specific kernels are built as separate extensions (SM90 vs SM100/SM103) and adds CI automation to build and publish “fat” wheels (multi-arch) via GitHub Releases, while keeping import cula.cudac working via a Python proxy module.
Changes:
- Split the monolithic CUDA extension into per-architecture
CUDAExtensions and movePYBIND11_MODULEbindings into per-arch.cuentrypoints. - Add
CULA_BUILD_ALL_ARCHS=1support and a--fatoption in the wheel build script for CI fat-binary builds. - Introduce a GitHub Actions workflow to build cu129/cu130 wheels for x86_64/aarch64 and attach them to GitHub Releases; update versioning + README install docs accordingly.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/build-release.yml |
New CI workflow to build and upload CUDA-versioned wheel artifacts and draft a GitHub Release. |
setup.py |
Builds per-arch CUDA extensions (cula._cudac_sm90, cula._cudac_sm100) and adds CULA_BUILD_ALL_ARCHS behavior. |
csrc/api/pybind.cu |
Removes the monolithic binding module entrypoint. |
csrc/api/kda_sm90.cu |
Adds SM90-specific PYBIND11_MODULE bindings. |
csrc/api/kda_sm100.cu |
Adds SM100/SM103-specific PYBIND11_MODULE bindings. |
cula/cudac.py |
New proxy module to preserve import cula.cudac API across split extensions. |
cula/__init__.py |
Switches to setuptools_scm-generated runtime version when available. |
scripts/build_wheel.sh |
Adds --fat flag to set CULA_BUILD_ALL_ARCHS=1 during wheel builds. |
README.md |
Documents installing pre-built wheels from GitHub Releases. |
pyproject.toml |
Uses setuptools_scm no-local-version scheme and keeps writing cula/_version.py. |
tests/conftest.py |
Minor formatting adjustment in test collection marker logic. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add double-checked locking to _CudacProxy._load() to prevent race conditions in multi-threaded environments. Raise a descriptive ImportError when no CUDA extensions can be loaded instead of silently producing AttributeError later.
The blanket `except ImportError: pass` swallowed the actual failure reason, making it impossible to diagnose missing shared libraries or build issues. Collect each extension's ImportError and include them in the raised message.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
/gemini review again |
There was a problem hiding this comment.
Code Review
This pull request refactors the build system and CUDA extension loading mechanism of cuLA by splitting the monolithic cula.cudac extension into two separate per-architecture extensions (cula._cudac_sm100 and cula._cudac_sm90) and introducing a lazy-loading proxy module. It also adds support for building fat binary wheels. The review feedback points out that the CULA_SM100_ENABLED and CULA_SM103_ENABLED preprocessor macros were omitted in the new setup.py, which could lead to compilation failures. Additionally, it is recommended to simplify the proxy implementation using standard PEP 562 module-level functions and to broaden exception handling during dynamic imports to catch RuntimeError and OSError.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if not DISABLE_SM100 or not DISABLE_SM103: | ||
| cuda_sources.extend( | ||
| [ | ||
| "csrc/api/kda_sm100.cu", | ||
| "csrc/kda/sm100/kda_fwd_sm100.cu", | ||
| ] | ||
| ) | ||
| if not DISABLE_SM90: | ||
| cuda_sources.extend( | ||
| [ | ||
| "csrc/api/kda_sm90.cu", | ||
| "csrc/kda/sm90/kda_fwd_sm90.cu", | ||
| "csrc/kda/sm90/kda_fwd_sm90_safe_gate.cu", | ||
| ] | ||
| sm100_arch_flags = [] | ||
| if not DISABLE_SM100: | ||
| sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a"]) | ||
| if not DISABLE_SM103: | ||
| sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a"]) |
There was a problem hiding this comment.
The macros CULA_SM100_ENABLED and CULA_SM103_ENABLED are omitted from sm100_arch_flags in the new split extension setup. In the previous monolithic setup, these macros were defined via -DCULA_SM100_ENABLED and -DCULA_SM103_ENABLED when compiling. Omitting them can lead to compilation failures or silently disabled features if any CUDA source or header files conditionally check for these macros.
| if not DISABLE_SM100 or not DISABLE_SM103: | |
| cuda_sources.extend( | |
| [ | |
| "csrc/api/kda_sm100.cu", | |
| "csrc/kda/sm100/kda_fwd_sm100.cu", | |
| ] | |
| ) | |
| if not DISABLE_SM90: | |
| cuda_sources.extend( | |
| [ | |
| "csrc/api/kda_sm90.cu", | |
| "csrc/kda/sm90/kda_fwd_sm90.cu", | |
| "csrc/kda/sm90/kda_fwd_sm90_safe_gate.cu", | |
| ] | |
| sm100_arch_flags = [] | |
| if not DISABLE_SM100: | |
| sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a"]) | |
| if not DISABLE_SM103: | |
| sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a"]) | |
| if not DISABLE_SM100 or not DISABLE_SM103: | |
| sm100_arch_flags = [] | |
| if not DISABLE_SM100: | |
| sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a", "-DCULA_SM100_ENABLED"]) | |
| if not DISABLE_SM103: | |
| sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a", "-DCULA_SM103_ENABLED"]) |
| class _CudacProxy(ModuleType): | ||
| """Lazy proxy that exposes functions from all built arch extensions.""" | ||
|
|
||
| def __init__(self): | ||
| super().__init__(__name__) | ||
| self.__path__ = [] | ||
| self._modules_loaded = False | ||
| self._funcs: dict[str, object] = {} | ||
| self._lock = threading.Lock() | ||
|
|
||
| def _load(self): | ||
| if self._modules_loaded: | ||
| return | ||
| with self._lock: | ||
| if self._modules_loaded: | ||
| return | ||
| loaded_any = False | ||
| errors: dict[str, Exception] = {} | ||
| for ext_name in ("cula._cudac_sm100", "cula._cudac_sm90"): | ||
| try: | ||
| mod = importlib.import_module(ext_name) | ||
| for attr in dir(mod): | ||
| if not attr.startswith("_"): | ||
| self._funcs[attr] = getattr(mod, attr) | ||
| loaded_any = True | ||
| except ImportError as exc: | ||
| errors[ext_name] = exc | ||
| if not loaded_any: | ||
| details = "; ".join(f"{name}: {exc}" for name, exc in errors.items()) | ||
| raise ImportError( | ||
| "None of the cuLA CUDA extensions could be imported. " | ||
| f"Per-extension errors: [{details}]. " | ||
| "Please make sure cuLA is compiled correctly." | ||
| ) | ||
| self.__dict__.update(self._funcs) | ||
| self._modules_loaded = True | ||
|
|
||
| def __getattr__(self, name: str): | ||
| if name.startswith("_"): | ||
| raise AttributeError(name) | ||
| self._load() | ||
| try: | ||
| return self._funcs[name] | ||
| except KeyError: | ||
| raise AttributeError(f"module 'cula.cudac' has no attribute '{name}'") from None | ||
|
|
||
| def __dir__(self): | ||
| self._load() | ||
| return list(self._funcs.keys()) | ||
|
|
||
|
|
||
| _proxy = _CudacProxy() | ||
| _proxy.__dict__.update({k: globals().get(k) for k in ("__spec__", "__file__", "__package__", "__loader__")}) | ||
| sys.modules[__name__] = _proxy |
There was a problem hiding this comment.
Using a custom ModuleType subclass and replacing sys.modules is a legacy pattern that can be replaced with standard PEP 562 module-level __getattr__ and __dir__ functions (supported in Python 3.7+). This is much cleaner, avoids subclassing overhead, and ensures standard module attributes are correctly preserved in __dir__.
Additionally, catching only ImportError during dynamic loading of C++ extensions is risky. Dynamic loading of PyTorch/CUDA extensions can frequently raise RuntimeError (e.g., due to CUDA driver/runtime mismatch) or OSError (e.g., due to missing shared libraries). Catching these exceptions ensures that if one extension fails to load, the other can still be loaded successfully.
_modules_loaded = False
_funcs = {}
_lock = threading.Lock()
def _load():
global _modules_loaded
if _modules_loaded:
return
with _lock:
if _modules_loaded:
return
loaded_any = False
errors = {}
for ext_name in ("cula._cudac_sm100", "cula._cudac_sm90"):
try:
mod = importlib.import_module(ext_name)
for attr in dir(mod):
if not attr.startswith("_"):
_funcs[attr] = getattr(mod, attr)
loaded_any = True
except (ImportError, RuntimeError, OSError) as exc:
errors[ext_name] = exc
if not loaded_any:
details = "; ".join(f"{name}: {exc}" for name, exc in errors.items())
raise ImportError(
"None of the cuLA CUDA extensions could be imported. "
f"Per-extension errors: [{details}]. "
"Please make sure cuLA is compiled correctly."
)
globals().update(_funcs)
_modules_loaded = True
def __getattr__(name: str):
if name.startswith("_"):
raise AttributeError(name)
_load()
try:
return _funcs[name]
except KeyError:
raise AttributeError(f"module 'cula.cudac' has no attribute '{name}'") from None
def __dir__():
_load()
return sorted(k for k in globals().keys() if not k.startswith("_") or k.startswith("__"))
📌 Description
Replace the monolithic
cula.cudacextension with per-arch extensions (cula._cudac_sm90,cula._cudac_sm100) so that SM90 and SM100/SM103 kernels are compiled independently with their own-gencodeflags. This enables building fat-binary wheels containing all architectures without needing the target GPU present at build time.Key changes:
cula/cudac.pyproxy module for backwards-compatible importsCULA_BUILD_ALL_ARCHS=1env var to enable all SM targets--fatflag to build_wheel.sh for CI fat-binary buildsno-local-versionscheme for reproducible wheel filenames__version__🔍 Related Issues
Fix #83
🚀 Pull Request Checklist
Thank you for contributing to cuLA! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
pre-commitby runningpip install pre-commit(or used your preferred method).pre-commit install.pre-commit run --all-filesand fixed any reported issues.🧪 Tests
⚡ Performance
Reviewer Notes