Skip to content

fix(backend): detect AMD GPU before setting HSA_OVERRIDE_GFX_VERSION#785

Open
singlaamitesh wants to merge 1 commit into
jamiepine:mainfrom
singlaamitesh:fix/rocm-hsa-override-gfx-version
Open

fix(backend): detect AMD GPU before setting HSA_OVERRIDE_GFX_VERSION#785
singlaamitesh wants to merge 1 commit into
jamiepine:mainfrom
singlaamitesh:fix/rocm-hsa-override-gfx-version

Conversation

@singlaamitesh

@singlaamitesh singlaamitesh commented Jun 23, 2026

Copy link
Copy Markdown

Summary

Fixes #469. Previously, HSA_OVERRIDE_GFX_VERSION=10.3.0 was unconditionally set for all AMD GPUs, which caused suboptimal performance on RDNA 3/4 GPUs (gfx11xx/gfx12xx) that have native ROCm support.

Changes

  • Uses rocminfo to detect the GPU model before setting the override
  • Only sets HSA_OVERRIDE_GFX_VERSION=10.3.0 for RDNA 2 and older (gfx10xx and below)
  • RDNA 3+ GPUs are left untouched (native ROCm support)
  • Gracefully handles missing rocminfo (no override set)

Testing

  • Tested on RDNA 4 (gfx1201) — override correctly skipped
  • Tested on system without rocminfo — no override set (safe fallback)

Summary by CodeRabbit

  • Bug Fixes
    • Improved GPU compatibility handling by detecting available GPU capabilities at startup instead of applying a fixed graphics override. The app now reads ROCm information to determine whether an override is necessary for older devices, applies it when required, and skips it on newer hardware where native support is available. If ROCm information can’t be retrieved or parsed, the app leaves the setting unchanged and continues safely.

Copilot AI review requested due to automatic review settings June 23, 2026 07:56

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

backend/app.py replaces the unconditional HSA_OVERRIDE_GFX_VERSION=10.3.0 assignment with a rocminfo-based detection step. It adds re and subprocess imports, runs rocminfo with a 5-second timeout, parses all gfx version tokens, selects the lowest numeric value, and sets the override only for GPUs with a numeric gfx version below 1100. Errors fall back to leaving the variable unset.

Changes

Conditional ROCm GPU Version Override

Layer / File(s) Summary
rocminfo-based GPU detection and conditional HSA override
backend/app.py
Adds re and subprocess imports. Replaces the blanket HSA_OVERRIDE_GFX_VERSION=10.3.0 assignment with a block that invokes rocminfo (5s timeout, captured stdout/stderr), extracts all gfx<digits> tokens via regex, selects the lowest numeric gfx value, and applies the override only when that value is < 1100; logs at info for detection/skip decisions and info for rocminfo failures, leaving the variable unset on error.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 A rabbit once set every GPU the same,
But RDNA 4 said, "That's a shame!"
Now rocminfo is queried with care,
Old GPUs get the flag — new ones: "don't dare!"
Each card gets the version it truly deserves. 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: implementing GPU detection before setting HSA_OVERRIDE_GFX_VERSION, which is the primary objective of the PR.
Linked Issues check ✅ Passed The PR implements the dynamic GPU detection solution from issue #469, correctly applying the HSA override only for GPUs older than gfx1100 while preserving performance for RDNA 3/4 GPUs.
Out of Scope Changes check ✅ Passed All changes in backend/app.py are directly related to implementing dynamic GPU detection for the HSA_OVERRIDE_GFX_VERSION variable, with no unrelated modifications present.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
backend/app.py (2)

77-79: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Consider logging rocminfo unavailability at a higher level.

When rocminfo is unavailable or fails (line 79), the code logs at DEBUG level and silently leaves the override unset. For users with older AMD GPUs who need the override, this silent fallback might make troubleshooting difficult—they won't see any indication in default logs that GPU detection failed.

Consider INFO or WARNING level for rocminfo unavailability, especially since downstream troubleshooting docs (context snippet 2) instruct users to verify that HSA_OVERRIDE_GFX_VERSION is set:

-        logger.debug("Could not detect AMD GPU via rocminfo: %s", e)
+        logger.info("Could not detect AMD GPU via rocminfo, skipping automatic HSA_OVERRIDE_GFX_VERSION configuration: %s", e)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app.py` around lines 77 - 79, The rocminfo GPU detection failure is
currently logged at DEBUG level, which users won't see by default when
troubleshooting GPU detection issues. In the exception handler that catches
FileNotFoundError, subprocess.TimeoutExpired, and Exception when rocminfo fails,
change the logger.debug call to logger.info or logger.warning to ensure users
can see this important failure message in standard logs without needing to
enable debug logging.

76-76: 📐 Maintainability & Code Quality | 🔵 Trivial

Consider mixed-GPU systems with different AMD architectures.

The break statement on line 76 stops detection after finding the first GPU. In systems with multiple AMD GPUs of different generations (e.g., RDNA 2 and RDNA 4), only the first detected GPU determines whether to set HSA_OVERRIDE_GFX_VERSION. Since this environment variable is process-global, this could lead to:

  • An RDNA 2 card (needing the override) being skipped if RDNA 4 is detected first
  • An RDNA 4 card receiving an unnecessary override if RDNA 2 is detected first

Multi-GPU configurations are a documented issue area in the project. If you want to better support mixed-architecture setups, consider iterating through all detected GPUs to make a more informed decision (e.g., only set the override if all GPUs require it, or document the current "first GPU" strategy).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app.py` at line 76, The break statement prematurely exits GPU
detection after finding the first GPU, which causes issues in mixed-architecture
AMD GPU systems where different generations may have different requirements for
HSA_OVERRIDE_GFX_VERSION. Remove the early break and instead iterate through all
detected GPUs to gather information about all present architectures. Once all
GPUs are evaluated, make a single decision about setting
HSA_OVERRIDE_GFX_VERSION based on the complete GPU inventory (such as setting
the override only if all GPUs require it, or documenting that the "first GPU"
strategy is intentional). This ensures the process-global environment variable
is set appropriately for the entire mixed-GPU configuration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app.py`:
- Around line 56-58: The GPU detection logic has a case-sensitivity
inconsistency where the conditional check on line 56 uses `line.lower()` but the
regex pattern application on line 58 uses the original `line` variable. Refactor
the code to apply both the string check and the regex pattern to the same cased
version of the line. Either apply both operations to `line.lower()` (the
preferred approach) or use the case-insensitive regex flag `re.IGNORECASE` in
the `re.search()` call to ensure consistent case handling throughout the GPU
detection block.

---

Nitpick comments:
In `@backend/app.py`:
- Around line 77-79: The rocminfo GPU detection failure is currently logged at
DEBUG level, which users won't see by default when troubleshooting GPU detection
issues. In the exception handler that catches FileNotFoundError,
subprocess.TimeoutExpired, and Exception when rocminfo fails, change the
logger.debug call to logger.info or logger.warning to ensure users can see this
important failure message in standard logs without needing to enable debug
logging.
- Line 76: The break statement prematurely exits GPU detection after finding the
first GPU, which causes issues in mixed-architecture AMD GPU systems where
different generations may have different requirements for
HSA_OVERRIDE_GFX_VERSION. Remove the early break and instead iterate through all
detected GPUs to gather information about all present architectures. Once all
GPUs are evaluated, make a single decision about setting
HSA_OVERRIDE_GFX_VERSION based on the complete GPU inventory (such as setting
the override only if all GPUs require it, or documenting that the "first GPU"
strategy is intentional). This ensures the process-global environment variable
is set appropriately for the entire mixed-GPU configuration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 587b631e-4b2b-4861-a0f4-1d7164bf61ae

📥 Commits

Reviewing files that changed from the base of the PR and between b35b909 and 070335f.

📒 Files selected for processing (1)
  • backend/app.py

Comment thread backend/app.py Outdated
Previously, HSA_OVERRIDE_GFX_VERSION=10.3.0 was unconditionally set for
all AMD GPUs, which caused suboptimal performance on RDNA 3/4 GPUs
(gfx11xx/gfx12xx) that have native ROCm support.

Now uses rocminfo to detect all GPUs and only sets the override for
systems where the oldest GPU needs it (RDNA 2 and older, gfx10xx and
below). Newer GPUs are left untouched.

Addresses CodeRabbit review:
- Case-insensitive regex matching on lowercased line
- Log level changed to INFO for rocminfo failures
- Multi-GPU support: iterates all GPUs, uses oldest for decision

Fixes jamiepine#469

Signed-off-by: Amitesh Gupta
Signed-off-by: singlaamitesh <singlaamitesh@gmail.com>
@singlaamitesh singlaamitesh force-pushed the fix/rocm-hsa-override-gfx-version branch from 070335f to c22a70b Compare June 23, 2026 09:59

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app.py`:
- Around line 53-54: The code currently only logs and processes the success case
when returncode equals 0, but silently skips auto-detection if rocminfo exits
with a non-zero return code. Add an else clause or error handling after the if
statement checking returncode == 0 to log the failure case with details about
why rocminfo failed, so AMD compatibility issues are visible in logs rather than
silently falling back without any diagnostic information.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc907544-58bd-433f-b85b-6cbd45250beb

📥 Commits

Reviewing files that changed from the base of the PR and between 070335f and c22a70b.

📒 Files selected for processing (1)
  • backend/app.py

Comment thread backend/app.py
Comment on lines +53 to +54
if result.returncode == 0:
# Collect all GPUs found in rocminfo output

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

Add logging for non-zero rocminfo exits to avoid silent fallback.

Line 53 only processes the success case (returncode == 0). If rocminfo is present but exits non-zero, auto-detection is skipped without any log, which makes AMD compatibility regressions hard to diagnose.

🔧 Suggested patch
         if result.returncode == 0:
             # Collect all GPUs found in rocminfo output
             gfx_versions = []
             for line in result.stdout.splitlines():
                 line_lower = line.lower()
                 if "gfx" in line_lower:
                     match = re.search(r"(gfx\d+)", line_lower)
                     if match:
                         gfx_versions.append(match.group(1))

             if gfx_versions:
                 # Check if any GPU needs the override (RDNA 2 and older)
                 # Use the oldest GPU (lowest gfx number) for the decision
                 try:
                     gfx_nums = []
                     for v in gfx_versions:
                         m = re.search(r"\d+", v)
                         if m:
                             gfx_nums.append(int(m.group()))
                     if gfx_nums:
                         oldest_num = min(gfx_nums)
                         oldest_gfx = gfx_versions[gfx_nums.index(oldest_num)]
                         if oldest_num < 1100:
                             os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"
                             logger.info(
                                 "AMD GPU detected (%s), setting HSA_OVERRIDE_GFX_VERSION=10.3.0 for compatibility. All GPUs: %s",
                                 oldest_gfx,
                                 ", ".join(gfx_versions),
                             )
                         else:
                             logger.info(
                                 "AMD GPU detected (%s), native ROCm support available, skipping HSA_OVERRIDE_GFX_VERSION. All GPUs: %s",
                                 oldest_gfx,
                                 ", ".join(gfx_versions),
                             )
                 except (ValueError, AttributeError) as e:
                     logger.info("Could not parse GPU version from rocminfo output: %s", e)
+        else:
+            logger.info(
+                "rocminfo exited with code %s, skipping automatic HSA_OVERRIDE_GFX_VERSION configuration: %s",
+                result.returncode,
+                (result.stderr or "").strip(),
+            )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app.py` around lines 53 - 54, The code currently only logs and
processes the success case when returncode equals 0, but silently skips
auto-detection if rocminfo exits with a non-zero return code. Add an else clause
or error handling after the if statement checking returncode == 0 to log the
failure case with details about why rocminfo failed, so AMD compatibility issues
are visible in logs rather than silently falling back without any diagnostic
information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ROCm HSA_OVERRIDE_GFX_VERSION unnecessary/harmful for RDNA 3 / 4 GPUs

2 participants