[CI] Add GitHub workflow for building and releasing fat wheels by tongke6 · Pull Request #91 · inclusionAI/cuLA

tongke6 · 2026-06-12T03:35:38Z

📌 Description

Replace the monolithic cula.cudac extension with per-arch extensions (cula._cudac_sm90, cula._cudac_sm100) so that SM90 and SM100/SM103 kernels are compiled independently with their own -gencode flags. This enables building fat-binary wheels containing all architectures without needing the target GPU present at build time.

Key changes:

Split pybind.cu into per-file PYBIND11_MODULE definitions
Add cula/cudac.py proxy module for backwards-compatible imports
Add CULA_BUILD_ALL_ARCHS=1 env var to enable all SM targets
Add --fat flag to build_wheel.sh for CI fat-binary builds
Pin dependency versions and use no-local-version scheme for reproducible wheel filenames
Use setuptools_scm for dynamic __version__
Document pre-built wheel installation in README

🔍 Related Issues

Fix #83

🚀 Pull Request Checklist

Thank you for contributing to cuLA! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing.

⚡ Performance

Reviewer Notes

…83) Replace the monolithic `cula.cudac` extension with per-arch extensions (`cula._cudac_sm90`, `cula._cudac_sm100`) so that SM90 and SM100/SM103 kernels are compiled independently with their own `-gencode` flags. This enables building fat-binary wheels containing all architectures without needing the target GPU present at build time. Key changes: - Split pybind.cu into per-file PYBIND11_MODULE definitions - Add `cula/cudac.py` proxy module for backwards-compatible imports - Add `CULA_BUILD_ALL_ARCHS=1` env var to enable all SM targets - Add `--fat` flag to build_wheel.sh for CI fat-binary builds - Pin dependency versions and use `no-local-version` scheme for reproducible wheel filenames - Use setuptools_scm for dynamic `__version__` - Document pre-built wheel installation in README

gemini-code-assist

Code Review

This pull request restructures the build system and CUDA extension packaging for cuLA to support separate per-architecture extensions (_cudac_sm100 and _cudac_sm90) and fat-binary builds. It introduces a lazy-loading proxy module (cula.cudac) to dynamically expose the compiled extension functions, and updates the versioning to use setuptools_scm. Feedback on these changes highlights a thread-safety vulnerability in the lazy-loading proxy that could cause race conditions during concurrent imports, and advises against strict version pinning of runtime dependencies in pyproject.toml to prevent dependency conflicts for downstream users.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Copilot

Pull request overview

This PR refactors cuLA’s CUDA packaging so architecture-specific kernels are built as separate extensions (SM90 vs SM100/SM103) and adds CI automation to build and publish “fat” wheels (multi-arch) via GitHub Releases, while keeping import cula.cudac working via a Python proxy module.

Changes:

Split the monolithic CUDA extension into per-architecture CUDAExtensions and move PYBIND11_MODULE bindings into per-arch .cu entrypoints.
Add CULA_BUILD_ALL_ARCHS=1 support and a --fat option in the wheel build script for CI fat-binary builds.
Introduce a GitHub Actions workflow to build cu129/cu130 wheels for x86_64/aarch64 and attach them to GitHub Releases; update versioning + README install docs accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`.github/workflows/build-release.yml`	New CI workflow to build and upload CUDA-versioned wheel artifacts and draft a GitHub Release.
`setup.py`	Builds per-arch CUDA extensions (`cula._cudac_sm90`, `cula._cudac_sm100`) and adds `CULA_BUILD_ALL_ARCHS` behavior.
`csrc/api/pybind.cu`	Removes the monolithic binding module entrypoint.
`csrc/api/kda_sm90.cu`	Adds SM90-specific `PYBIND11_MODULE` bindings.
`csrc/api/kda_sm100.cu`	Adds SM100/SM103-specific `PYBIND11_MODULE` bindings.
`cula/cudac.py`	New proxy module to preserve `import cula.cudac` API across split extensions.
`cula/__init__.py`	Switches to setuptools_scm-generated runtime version when available.
`scripts/build_wheel.sh`	Adds `--fat` flag to set `CULA_BUILD_ALL_ARCHS=1` during wheel builds.
`README.md`	Documents installing pre-built wheels from GitHub Releases.
`pyproject.toml`	Uses `setuptools_scm` `no-local-version` scheme and keeps writing `cula/_version.py`.
`tests/conftest.py`	Minor formatting adjustment in test collection marker logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add double-checked locking to _CudacProxy._load() to prevent race conditions in multi-threaded environments. Raise a descriptive ImportError when no CUDA extensions can be loaded instead of silently producing AttributeError later.

The blanket `except ImportError: pass` swallowed the actual failure reason, making it impossible to diagnose missing shared libraries or build issues. Collect each extension's ImportError and include them in the raised message.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

tongke6 · 2026-06-17T15:19:41Z

/gemini review again

gemini-code-assist

Code Review

This pull request refactors the build system and CUDA extension loading mechanism of cuLA by splitting the monolithic cula.cudac extension into two separate per-architecture extensions (cula._cudac_sm100 and cula._cudac_sm90) and introducing a lazy-loading proxy module. It also adds support for building fat binary wheels. The review feedback points out that the CULA_SM100_ENABLED and CULA_SM103_ENABLED preprocessor macros were omitted in the new setup.py, which could lead to compilation failures. Additionally, it is recommended to simplify the proxy implementation using standard PEP 562 module-level functions and to broaden exception handling during dynamic imports to catch RuntimeError and OSError.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-17T15:21:30Z

 if not DISABLE_SM100 or not DISABLE_SM103:
-    cuda_sources.extend(
-        [
-            "csrc/api/kda_sm100.cu",
-            "csrc/kda/sm100/kda_fwd_sm100.cu",
-        ]
-    )
-if not DISABLE_SM90:
-    cuda_sources.extend(
-        [
-            "csrc/api/kda_sm90.cu",
-            "csrc/kda/sm90/kda_fwd_sm90.cu",
-            "csrc/kda/sm90/kda_fwd_sm90_safe_gate.cu",
-        ]
+    sm100_arch_flags = []
+    if not DISABLE_SM100:
+        sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a"])
+    if not DISABLE_SM103:
+        sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a"])


The macros CULA_SM100_ENABLED and CULA_SM103_ENABLED are omitted from sm100_arch_flags in the new split extension setup. In the previous monolithic setup, these macros were defined via -DCULA_SM100_ENABLED and -DCULA_SM103_ENABLED when compiling. Omitting them can lead to compilation failures or silently disabled features if any CUDA source or header files conditionally check for these macros.

Suggested change

if not DISABLE_SM100 or not DISABLE_SM103:

cuda_sources.extend(

[

"csrc/api/kda_sm100.cu",

"csrc/kda/sm100/kda_fwd_sm100.cu",

]

)

if not DISABLE_SM90:

cuda_sources.extend(

[

"csrc/api/kda_sm90.cu",

"csrc/kda/sm90/kda_fwd_sm90.cu",

"csrc/kda/sm90/kda_fwd_sm90_safe_gate.cu",

]

sm100_arch_flags = []

if not DISABLE_SM100:

sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a"])

if not DISABLE_SM103:

sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a"])

if not DISABLE_SM100 or not DISABLE_SM103:

sm100_arch_flags = []

if not DISABLE_SM100:

sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a", "-DCULA_SM100_ENABLED"])

if not DISABLE_SM103:

sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a", "-DCULA_SM103_ENABLED"])

gemini-code-assist · 2026-06-17T15:21:31Z

+class _CudacProxy(ModuleType):
+    """Lazy proxy that exposes functions from all built arch extensions."""
+
+    def __init__(self):
+        super().__init__(__name__)
+        self.__path__ = []
+        self._modules_loaded = False
+        self._funcs: dict[str, object] = {}
+        self._lock = threading.Lock()
+
+    def _load(self):
+        if self._modules_loaded:
+            return
+        with self._lock:
+            if self._modules_loaded:
+                return
+            loaded_any = False
+            errors: dict[str, Exception] = {}
+            for ext_name in ("cula._cudac_sm100", "cula._cudac_sm90"):
+                try:
+                    mod = importlib.import_module(ext_name)
+                    for attr in dir(mod):
+                        if not attr.startswith("_"):
+                            self._funcs[attr] = getattr(mod, attr)
+                    loaded_any = True
+                except ImportError as exc:
+                    errors[ext_name] = exc
+            if not loaded_any:
+                details = "; ".join(f"{name}: {exc}" for name, exc in errors.items())
+                raise ImportError(
+                    "None of the cuLA CUDA extensions could be imported. "
+                    f"Per-extension errors: [{details}]. "
+                    "Please make sure cuLA is compiled correctly."
+                )
+            self.__dict__.update(self._funcs)
+            self._modules_loaded = True
+
+    def __getattr__(self, name: str):
+        if name.startswith("_"):
+            raise AttributeError(name)
+        self._load()
+        try:
+            return self._funcs[name]
+        except KeyError:
+            raise AttributeError(f"module 'cula.cudac' has no attribute '{name}'") from None
+
+    def __dir__(self):
+        self._load()
+        return list(self._funcs.keys())
+
+
+_proxy = _CudacProxy()
+_proxy.__dict__.update({k: globals().get(k) for k in ("__spec__", "__file__", "__package__", "__loader__")})
+sys.modules[__name__] = _proxy


Using a custom ModuleType subclass and replacing sys.modules is a legacy pattern that can be replaced with standard PEP 562 module-level __getattr__ and __dir__ functions (supported in Python 3.7+). This is much cleaner, avoids subclassing overhead, and ensures standard module attributes are correctly preserved in __dir__.

Additionally, catching only ImportError during dynamic loading of C++ extensions is risky. Dynamic loading of PyTorch/CUDA extensions can frequently raise RuntimeError (e.g., due to CUDA driver/runtime mismatch) or OSError (e.g., due to missing shared libraries). Catching these exceptions ensures that if one extension fails to load, the other can still be loaded successfully.

_modules_loaded = False _funcs = {} _lock = threading.Lock() def _load(): global _modules_loaded if _modules_loaded: return with _lock: if _modules_loaded: return loaded_any = False errors = {} for ext_name in ("cula._cudac_sm100", "cula._cudac_sm90"): try: mod = importlib.import_module(ext_name) for attr in dir(mod): if not attr.startswith("_"): _funcs[attr] = getattr(mod, attr) loaded_any = True except (ImportError, RuntimeError, OSError) as exc: errors[ext_name] = exc if not loaded_any: details = "; ".join(f"{name}: {exc}" for name, exc in errors.items()) raise ImportError( "None of the cuLA CUDA extensions could be imported. " f"Per-extension errors: [{details}]. " "Please make sure cuLA is compiled correctly." ) globals().update(_funcs) _modules_loaded = True def __getattr__(name: str): if name.startswith("_"): raise AttributeError(name) _load() try: return _funcs[name] except KeyError: raise AttributeError(f"module 'cula.cudac' has no attribute '{name}'") from None def __dir__(): _load() return sorted(k for k in globals().keys() if not k.startswith("_") or k.startswith("__"))

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread cula/cudac.py Outdated

Comment thread pyproject.toml

tongke6 added 2 commits June 12, 2026 11:38

fix ruff lint errors

7616737

revert version requirements changes

b148c52

tongke6 requested a review from Copilot June 12, 2026 03:53

Copilot started reviewing on behalf of tongke6 June 12, 2026 03:54 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread .github/workflows/build-release.yml Outdated

Comment thread cula/cudac.py Outdated

Comment thread cula/cudac.py Outdated

Comment thread setup.py Outdated

Comment thread README.md Outdated

tongke6 and others added 2 commits June 12, 2026 12:02

Surface per-extension import errors in cudac proxy

c955d47

The blanket `except ImportError: pass` swallowed the actual failure reason, making it impossible to diagnose missing shared libraries or build issues. Collect each extension's ImportError and include them in the raised message.

Copilot started work on behalf of tongke6 June 17, 2026 14:45 View session

Fix build-release matrix with DRY expression mapping

6ccb8fa

Copilot finished work on behalf of tongke6 June 17, 2026 14:48

tongke6 and others added 2 commits June 17, 2026 22:51

Apply suggestions from code review

f700810

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

e0a1e21

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

tongke6 commented Jun 17, 2026

View reviewed changes

Comment thread README.md

Copilot started work on behalf of tongke6 June 17, 2026 15:14 View session

tongke6 marked this pull request as ready for review June 17, 2026 15:14

tongke6 requested review from icavan and zheyang0825 June 17, 2026 15:14

Add README example for building fat wheels

ead4f7e

Copilot finished work on behalf of tongke6 June 17, 2026 15:15

gemini-code-assist Bot reviewed Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] Add GitHub workflow for building and releasing fat wheels#91

[CI] Add GitHub workflow for building and releasing fat wheels#91
tongke6 wants to merge 9 commits into
mainfrom
tk/gh-workflow

tongke6 commented Jun 12, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tongke6 commented Jun 17, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

tongke6 commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

✅ Pre-commit Checks

🧪 Tests

⚡ Performance

Reviewer Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tongke6 commented Jun 17, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tongke6 commented Jun 12, 2026 •

edited

Loading