Skip to content

[CI] Add GitHub workflow for building and releasing fat wheels#91

Open
tongke6 wants to merge 9 commits into
mainfrom
tk/gh-workflow
Open

[CI] Add GitHub workflow for building and releasing fat wheels#91
tongke6 wants to merge 9 commits into
mainfrom
tk/gh-workflow

Conversation

@tongke6

@tongke6 tongke6 commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

📌 Description

Replace the monolithic cula.cudac extension with per-arch extensions (cula._cudac_sm90, cula._cudac_sm100) so that SM90 and SM100/SM103 kernels are compiled independently with their own -gencode flags. This enables building fat-binary wheels containing all architectures without needing the target GPU present at build time.

Key changes:

  • Split pybind.cu into per-file PYBIND11_MODULE definitions
  • Add cula/cudac.py proxy module for backwards-compatible imports
  • Add CULA_BUILD_ALL_ARCHS=1 env var to enable all SM targets
  • Add --fat flag to build_wheel.sh for CI fat-binary builds
  • Pin dependency versions and use no-local-version scheme for reproducible wheel filenames
  • Use setuptools_scm for dynamic __version__
  • Document pre-built wheel installation in README

🔍 Related Issues

Fix #83

🚀 Pull Request Checklist

Thank you for contributing to cuLA! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing.

⚡ Performance

Reviewer Notes

…83)

Replace the monolithic `cula.cudac` extension with per-arch extensions
(`cula._cudac_sm90`, `cula._cudac_sm100`) so that SM90 and SM100/SM103
kernels are compiled independently with their own `-gencode` flags. This
enables building fat-binary wheels containing all architectures without
needing the target GPU present at build time.

Key changes:
- Split pybind.cu into per-file PYBIND11_MODULE definitions
- Add `cula/cudac.py` proxy module for backwards-compatible imports
- Add `CULA_BUILD_ALL_ARCHS=1` env var to enable all SM targets
- Add `--fat` flag to build_wheel.sh for CI fat-binary builds
- Pin dependency versions and use `no-local-version` scheme for
  reproducible wheel filenames
- Use setuptools_scm for dynamic `__version__`
- Document pre-built wheel installation in README

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request restructures the build system and CUDA extension packaging for cuLA to support separate per-architecture extensions (_cudac_sm100 and _cudac_sm90) and fat-binary builds. It introduces a lazy-loading proxy module (cula.cudac) to dynamically expose the compiled extension functions, and updates the versioning to use setuptools_scm. Feedback on these changes highlights a thread-safety vulnerability in the lazy-loading proxy that could cause race conditions during concurrent imports, and advises against strict version pinning of runtime dependencies in pyproject.toml to prevent dependency conflicts for downstream users.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread cula/cudac.py Outdated
Comment thread pyproject.toml

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors cuLA’s CUDA packaging so architecture-specific kernels are built as separate extensions (SM90 vs SM100/SM103) and adds CI automation to build and publish “fat” wheels (multi-arch) via GitHub Releases, while keeping import cula.cudac working via a Python proxy module.

Changes:

  • Split the monolithic CUDA extension into per-architecture CUDAExtensions and move PYBIND11_MODULE bindings into per-arch .cu entrypoints.
  • Add CULA_BUILD_ALL_ARCHS=1 support and a --fat option in the wheel build script for CI fat-binary builds.
  • Introduce a GitHub Actions workflow to build cu129/cu130 wheels for x86_64/aarch64 and attach them to GitHub Releases; update versioning + README install docs accordingly.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
.github/workflows/build-release.yml New CI workflow to build and upload CUDA-versioned wheel artifacts and draft a GitHub Release.
setup.py Builds per-arch CUDA extensions (cula._cudac_sm90, cula._cudac_sm100) and adds CULA_BUILD_ALL_ARCHS behavior.
csrc/api/pybind.cu Removes the monolithic binding module entrypoint.
csrc/api/kda_sm90.cu Adds SM90-specific PYBIND11_MODULE bindings.
csrc/api/kda_sm100.cu Adds SM100/SM103-specific PYBIND11_MODULE bindings.
cula/cudac.py New proxy module to preserve import cula.cudac API across split extensions.
cula/__init__.py Switches to setuptools_scm-generated runtime version when available.
scripts/build_wheel.sh Adds --fat flag to set CULA_BUILD_ALL_ARCHS=1 during wheel builds.
README.md Documents installing pre-built wheels from GitHub Releases.
pyproject.toml Uses setuptools_scm no-local-version scheme and keeps writing cula/_version.py.
tests/conftest.py Minor formatting adjustment in test collection marker logic.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/build-release.yml Outdated
Comment thread cula/cudac.py Outdated
Comment thread cula/cudac.py Outdated
Comment thread setup.py Outdated
Comment thread README.md Outdated
tongke6 and others added 2 commits June 12, 2026 12:02
Add double-checked locking to _CudacProxy._load() to prevent race
conditions in multi-threaded environments. Raise a descriptive
ImportError when no CUDA extensions can be loaded instead of silently
producing AttributeError later.
The blanket `except ImportError: pass` swallowed the actual failure
reason, making it impossible to diagnose missing shared libraries or
build issues. Collect each extension's ImportError and include them
in the raised message.
tongke6 and others added 2 commits June 17, 2026 22:51
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Comment thread README.md
@tongke6 tongke6 marked this pull request as ready for review June 17, 2026 15:14
@tongke6 tongke6 requested review from icavan and zheyang0825 June 17, 2026 15:14
@tongke6

tongke6 commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

/gemini review again

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the build system and CUDA extension loading mechanism of cuLA by splitting the monolithic cula.cudac extension into two separate per-architecture extensions (cula._cudac_sm100 and cula._cudac_sm90) and introducing a lazy-loading proxy module. It also adds support for building fat binary wheels. The review feedback points out that the CULA_SM100_ENABLED and CULA_SM103_ENABLED preprocessor macros were omitted in the new setup.py, which could lead to compilation failures. Additionally, it is recommended to simplify the proxy implementation using standard PEP 562 module-level functions and to broaden exception handling during dynamic imports to catch RuntimeError and OSError.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread setup.py
Comment on lines 164 to +169
if not DISABLE_SM100 or not DISABLE_SM103:
cuda_sources.extend(
[
"csrc/api/kda_sm100.cu",
"csrc/kda/sm100/kda_fwd_sm100.cu",
]
)
if not DISABLE_SM90:
cuda_sources.extend(
[
"csrc/api/kda_sm90.cu",
"csrc/kda/sm90/kda_fwd_sm90.cu",
"csrc/kda/sm90/kda_fwd_sm90_safe_gate.cu",
]
sm100_arch_flags = []
if not DISABLE_SM100:
sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a"])
if not DISABLE_SM103:
sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a"])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The macros CULA_SM100_ENABLED and CULA_SM103_ENABLED are omitted from sm100_arch_flags in the new split extension setup. In the previous monolithic setup, these macros were defined via -DCULA_SM100_ENABLED and -DCULA_SM103_ENABLED when compiling. Omitting them can lead to compilation failures or silently disabled features if any CUDA source or header files conditionally check for these macros.

Suggested change
if not DISABLE_SM100 or not DISABLE_SM103:
cuda_sources.extend(
[
"csrc/api/kda_sm100.cu",
"csrc/kda/sm100/kda_fwd_sm100.cu",
]
)
if not DISABLE_SM90:
cuda_sources.extend(
[
"csrc/api/kda_sm90.cu",
"csrc/kda/sm90/kda_fwd_sm90.cu",
"csrc/kda/sm90/kda_fwd_sm90_safe_gate.cu",
]
sm100_arch_flags = []
if not DISABLE_SM100:
sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a"])
if not DISABLE_SM103:
sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a"])
if not DISABLE_SM100 or not DISABLE_SM103:
sm100_arch_flags = []
if not DISABLE_SM100:
sm100_arch_flags.extend(["-gencode", "arch=compute_100a,code=sm_100a", "-DCULA_SM100_ENABLED"])
if not DISABLE_SM103:
sm100_arch_flags.extend(["-gencode", "arch=compute_103a,code=sm_103a", "-DCULA_SM103_ENABLED"])

Comment thread cula/cudac.py
Comment on lines +29 to +82
class _CudacProxy(ModuleType):
"""Lazy proxy that exposes functions from all built arch extensions."""

def __init__(self):
super().__init__(__name__)
self.__path__ = []
self._modules_loaded = False
self._funcs: dict[str, object] = {}
self._lock = threading.Lock()

def _load(self):
if self._modules_loaded:
return
with self._lock:
if self._modules_loaded:
return
loaded_any = False
errors: dict[str, Exception] = {}
for ext_name in ("cula._cudac_sm100", "cula._cudac_sm90"):
try:
mod = importlib.import_module(ext_name)
for attr in dir(mod):
if not attr.startswith("_"):
self._funcs[attr] = getattr(mod, attr)
loaded_any = True
except ImportError as exc:
errors[ext_name] = exc
if not loaded_any:
details = "; ".join(f"{name}: {exc}" for name, exc in errors.items())
raise ImportError(
"None of the cuLA CUDA extensions could be imported. "
f"Per-extension errors: [{details}]. "
"Please make sure cuLA is compiled correctly."
)
self.__dict__.update(self._funcs)
self._modules_loaded = True

def __getattr__(self, name: str):
if name.startswith("_"):
raise AttributeError(name)
self._load()
try:
return self._funcs[name]
except KeyError:
raise AttributeError(f"module 'cula.cudac' has no attribute '{name}'") from None

def __dir__(self):
self._load()
return list(self._funcs.keys())


_proxy = _CudacProxy()
_proxy.__dict__.update({k: globals().get(k) for k in ("__spec__", "__file__", "__package__", "__loader__")})
sys.modules[__name__] = _proxy

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using a custom ModuleType subclass and replacing sys.modules is a legacy pattern that can be replaced with standard PEP 562 module-level __getattr__ and __dir__ functions (supported in Python 3.7+). This is much cleaner, avoids subclassing overhead, and ensures standard module attributes are correctly preserved in __dir__.

Additionally, catching only ImportError during dynamic loading of C++ extensions is risky. Dynamic loading of PyTorch/CUDA extensions can frequently raise RuntimeError (e.g., due to CUDA driver/runtime mismatch) or OSError (e.g., due to missing shared libraries). Catching these exceptions ensures that if one extension fails to load, the other can still be loaded successfully.

_modules_loaded = False
_funcs = {}
_lock = threading.Lock()

def _load():
    global _modules_loaded
    if _modules_loaded:
        return
    with _lock:
        if _modules_loaded:
            return
        loaded_any = False
        errors = {}
        for ext_name in ("cula._cudac_sm100", "cula._cudac_sm90"):
            try:
                mod = importlib.import_module(ext_name)
                for attr in dir(mod):
                    if not attr.startswith("_"):
                        _funcs[attr] = getattr(mod, attr)
                loaded_any = True
            except (ImportError, RuntimeError, OSError) as exc:
                errors[ext_name] = exc
        if not loaded_any:
            details = "; ".join(f"{name}: {exc}" for name, exc in errors.items())
            raise ImportError(
                "None of the cuLA CUDA extensions could be imported. "
                f"Per-extension errors: [{details}]. "
                "Please make sure cuLA is compiled correctly."
            )
        globals().update(_funcs)
        _modules_loaded = True

def __getattr__(name: str):
    if name.startswith("_"):
        raise AttributeError(name)
    _load()
    try:
        return _funcs[name]
    except KeyError:
        raise AttributeError(f"module 'cula.cudac' has no attribute '{name}'") from None

def __dir__():
    _load()
    return sorted(k for k in globals().keys() if not k.startswith("_") or k.startswith("__"))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Refactor] Optimize CUDA Packaging: Fat Binaries & Dynamic Wheel Resolution

4 participants