devices: add retry wrapper on VFIO ops#3185
Conversation
|
This PR modifies files containing For more on why we check whole files, instead of just diffs, check out the Rustonomicon |
There was a problem hiding this comment.
Pull request overview
This PR introduces a reusable retry wrapper for VFIO operations to better tolerate transient failures (notably during restore/servicing flows), and wires it into the VFIO device bring-up path.
Changes:
- Refactors
vfio_sys::GroupVFIO operations (open_device,set_keep_alive) to be synchronous and adds a new asyncVfioRetryhelper for retry-with-delay behavior. - Uses
VfioRetryinuser_driver’s VFIO restore path to retry specific transient error cases (e.g.,ENODEV,NotFound). - Enables an OpenHCL Linux-direct integration test variant that exercises a heavier NVMe servicing scenario.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| vmm_tests/vmm_tests/tests/tests/x86_64/openhcl_linux_direct.rs | Enables a Linux-direct OpenHCL test that stresses servicing with many NVMe devices. |
| vm/devices/user_driver/vfio_sys/src/lib.rs | Removes per-method retry logic, makes ops sync, and adds VfioRetry retry wrapper. |
| vm/devices/user_driver/vfio_sys/Cargo.toml | Adds tracelimit dependency for ratelimited retry logging. |
| vm/devices/user_driver/src/vfio.rs | Wraps VFIO group/device open + keepalive calls with VfioRetry and error classifiers. |
| vm/devices/user_driver/Cargo.toml | Adds optional nix dependency under the vfio feature to inspect Errno. |
| Cargo.lock | Updates lockfile for the new dependency edges (tracelimit, nix). |
| tracelimit::warn_ratelimited!( | ||
| operation = context, | ||
| attempt, | ||
| "retrying after transient error: {err}" | ||
| ); |
There was a problem hiding this comment.
VfioRetry::retry logs a ratelimited warning on retries, but the log fields don’t include any identifier for the VFIO device/group being operated on. The earlier per-callsite retry logic logged pci_id, and losing that context will make these warnings hard to action when multiple VFIO devices are present. Consider extending retry to accept an optional identifier (e.g., pci_id/path) and include it as a structured field in the warning, or ensure callers enter a span that carries pci_id before invoking retry so the emitted warnings are attributable.
|
have we identified why these transient failures occur? |
|
I guess i'm wondering if this is a kernel issue, we can workaround it usermode sure, but we should fix the underlying kernel issue. Is anyone looking at that? |
I know that I'm not looking at it - hold on, I'll ask around |
nobody on the kernel side is looking at it today, I've reached out and started a thread offline |
Implement a wrapper struct on VFIO ops that retries failed driver operations. This is useful for transient issues in the restore path of VFIO devices