fix(backend): detect AMD GPU before setting HSA_OVERRIDE_GFX_VERSION#785
fix(backend): detect AMD GPU before setting HSA_OVERRIDE_GFX_VERSION#785singlaamitesh wants to merge 1 commit into
Conversation
📝 WalkthroughWalkthrough
ChangesConditional ROCm GPU Version Override
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
backend/app.py (2)
77-79: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueConsider logging rocminfo unavailability at a higher level.
When
rocminfois unavailable or fails (line 79), the code logs atDEBUGlevel and silently leaves the override unset. For users with older AMD GPUs who need the override, this silent fallback might make troubleshooting difficult—they won't see any indication in default logs that GPU detection failed.Consider
INFOorWARNINGlevel for rocminfo unavailability, especially since downstream troubleshooting docs (context snippet 2) instruct users to verify thatHSA_OVERRIDE_GFX_VERSIONis set:- logger.debug("Could not detect AMD GPU via rocminfo: %s", e) + logger.info("Could not detect AMD GPU via rocminfo, skipping automatic HSA_OVERRIDE_GFX_VERSION configuration: %s", e)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app.py` around lines 77 - 79, The rocminfo GPU detection failure is currently logged at DEBUG level, which users won't see by default when troubleshooting GPU detection issues. In the exception handler that catches FileNotFoundError, subprocess.TimeoutExpired, and Exception when rocminfo fails, change the logger.debug call to logger.info or logger.warning to ensure users can see this important failure message in standard logs without needing to enable debug logging.
76-76: 📐 Maintainability & Code Quality | 🔵 TrivialConsider mixed-GPU systems with different AMD architectures.
The
breakstatement on line 76 stops detection after finding the first GPU. In systems with multiple AMD GPUs of different generations (e.g., RDNA 2 and RDNA 4), only the first detected GPU determines whether to setHSA_OVERRIDE_GFX_VERSION. Since this environment variable is process-global, this could lead to:
- An RDNA 2 card (needing the override) being skipped if RDNA 4 is detected first
- An RDNA 4 card receiving an unnecessary override if RDNA 2 is detected first
Multi-GPU configurations are a documented issue area in the project. If you want to better support mixed-architecture setups, consider iterating through all detected GPUs to make a more informed decision (e.g., only set the override if all GPUs require it, or document the current "first GPU" strategy).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@backend/app.py` at line 76, The break statement prematurely exits GPU detection after finding the first GPU, which causes issues in mixed-architecture AMD GPU systems where different generations may have different requirements for HSA_OVERRIDE_GFX_VERSION. Remove the early break and instead iterate through all detected GPUs to gather information about all present architectures. Once all GPUs are evaluated, make a single decision about setting HSA_OVERRIDE_GFX_VERSION based on the complete GPU inventory (such as setting the override only if all GPUs require it, or documenting that the "first GPU" strategy is intentional). This ensures the process-global environment variable is set appropriately for the entire mixed-GPU configuration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/app.py`:
- Around line 56-58: The GPU detection logic has a case-sensitivity
inconsistency where the conditional check on line 56 uses `line.lower()` but the
regex pattern application on line 58 uses the original `line` variable. Refactor
the code to apply both the string check and the regex pattern to the same cased
version of the line. Either apply both operations to `line.lower()` (the
preferred approach) or use the case-insensitive regex flag `re.IGNORECASE` in
the `re.search()` call to ensure consistent case handling throughout the GPU
detection block.
---
Nitpick comments:
In `@backend/app.py`:
- Around line 77-79: The rocminfo GPU detection failure is currently logged at
DEBUG level, which users won't see by default when troubleshooting GPU detection
issues. In the exception handler that catches FileNotFoundError,
subprocess.TimeoutExpired, and Exception when rocminfo fails, change the
logger.debug call to logger.info or logger.warning to ensure users can see this
important failure message in standard logs without needing to enable debug
logging.
- Line 76: The break statement prematurely exits GPU detection after finding the
first GPU, which causes issues in mixed-architecture AMD GPU systems where
different generations may have different requirements for
HSA_OVERRIDE_GFX_VERSION. Remove the early break and instead iterate through all
detected GPUs to gather information about all present architectures. Once all
GPUs are evaluated, make a single decision about setting
HSA_OVERRIDE_GFX_VERSION based on the complete GPU inventory (such as setting
the override only if all GPUs require it, or documenting that the "first GPU"
strategy is intentional). This ensures the process-global environment variable
is set appropriately for the entire mixed-GPU configuration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
Previously, HSA_OVERRIDE_GFX_VERSION=10.3.0 was unconditionally set for all AMD GPUs, which caused suboptimal performance on RDNA 3/4 GPUs (gfx11xx/gfx12xx) that have native ROCm support. Now uses rocminfo to detect all GPUs and only sets the override for systems where the oldest GPU needs it (RDNA 2 and older, gfx10xx and below). Newer GPUs are left untouched. Addresses CodeRabbit review: - Case-insensitive regex matching on lowercased line - Log level changed to INFO for rocminfo failures - Multi-GPU support: iterates all GPUs, uses oldest for decision Fixes jamiepine#469 Signed-off-by: Amitesh Gupta Signed-off-by: singlaamitesh <singlaamitesh@gmail.com>
070335f to
c22a70b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/app.py`:
- Around line 53-54: The code currently only logs and processes the success case
when returncode equals 0, but silently skips auto-detection if rocminfo exits
with a non-zero return code. Add an else clause or error handling after the if
statement checking returncode == 0 to log the failure case with details about
why rocminfo failed, so AMD compatibility issues are visible in logs rather than
silently falling back without any diagnostic information.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| if result.returncode == 0: | ||
| # Collect all GPUs found in rocminfo output |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Add logging for non-zero rocminfo exits to avoid silent fallback.
Line 53 only processes the success case (returncode == 0). If rocminfo is present but exits non-zero, auto-detection is skipped without any log, which makes AMD compatibility regressions hard to diagnose.
🔧 Suggested patch
if result.returncode == 0:
# Collect all GPUs found in rocminfo output
gfx_versions = []
for line in result.stdout.splitlines():
line_lower = line.lower()
if "gfx" in line_lower:
match = re.search(r"(gfx\d+)", line_lower)
if match:
gfx_versions.append(match.group(1))
if gfx_versions:
# Check if any GPU needs the override (RDNA 2 and older)
# Use the oldest GPU (lowest gfx number) for the decision
try:
gfx_nums = []
for v in gfx_versions:
m = re.search(r"\d+", v)
if m:
gfx_nums.append(int(m.group()))
if gfx_nums:
oldest_num = min(gfx_nums)
oldest_gfx = gfx_versions[gfx_nums.index(oldest_num)]
if oldest_num < 1100:
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"
logger.info(
"AMD GPU detected (%s), setting HSA_OVERRIDE_GFX_VERSION=10.3.0 for compatibility. All GPUs: %s",
oldest_gfx,
", ".join(gfx_versions),
)
else:
logger.info(
"AMD GPU detected (%s), native ROCm support available, skipping HSA_OVERRIDE_GFX_VERSION. All GPUs: %s",
oldest_gfx,
", ".join(gfx_versions),
)
except (ValueError, AttributeError) as e:
logger.info("Could not parse GPU version from rocminfo output: %s", e)
+ else:
+ logger.info(
+ "rocminfo exited with code %s, skipping automatic HSA_OVERRIDE_GFX_VERSION configuration: %s",
+ result.returncode,
+ (result.stderr or "").strip(),
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@backend/app.py` around lines 53 - 54, The code currently only logs and
processes the success case when returncode equals 0, but silently skips
auto-detection if rocminfo exits with a non-zero return code. Add an else clause
or error handling after the if statement checking returncode == 0 to log the
failure case with details about why rocminfo failed, so AMD compatibility issues
are visible in logs rather than silently falling back without any diagnostic
information.
Summary
Fixes #469. Previously,
HSA_OVERRIDE_GFX_VERSION=10.3.0was unconditionally set for all AMD GPUs, which caused suboptimal performance on RDNA 3/4 GPUs (gfx11xx/gfx12xx) that have native ROCm support.Changes
rocminfoto detect the GPU model before setting the overrideHSA_OVERRIDE_GFX_VERSION=10.3.0for RDNA 2 and older (gfx10xx and below)rocminfo(no override set)Testing
Summary by CodeRabbit