Skip to content

devices: add retry wrapper on VFIO ops#3185

Open
babayet2 wants to merge 1 commit intomicrosoft:mainfrom
babayet2:retry-errata
Open

devices: add retry wrapper on VFIO ops#3185
babayet2 wants to merge 1 commit intomicrosoft:mainfrom
babayet2:retry-errata

Conversation

@babayet2
Copy link
Copy Markdown
Collaborator

@babayet2 babayet2 commented Apr 2, 2026

Implement a wrapper struct on VFIO ops that retries failed driver operations. This is useful for transient issues in the restore path of VFIO devices

@babayet2 babayet2 requested a review from a team as a code owner April 2, 2026 23:48
Copilot AI review requested due to automatic review settings April 2, 2026 23:48
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2026

⚠️ Unsafe Code Detected

This PR modifies files containing unsafe Rust code. Extra scrutiny is required during review.

For more on why we check whole files, instead of just diffs, check out the Rustonomicon

@github-actions github-actions bot added the unsafe Related to unsafe code label Apr 2, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a reusable retry wrapper for VFIO operations to better tolerate transient failures (notably during restore/servicing flows), and wires it into the VFIO device bring-up path.

Changes:

  • Refactors vfio_sys::Group VFIO operations (open_device, set_keep_alive) to be synchronous and adds a new async VfioRetry helper for retry-with-delay behavior.
  • Uses VfioRetry in user_driver’s VFIO restore path to retry specific transient error cases (e.g., ENODEV, NotFound).
  • Enables an OpenHCL Linux-direct integration test variant that exercises a heavier NVMe servicing scenario.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
vmm_tests/vmm_tests/tests/tests/x86_64/openhcl_linux_direct.rs Enables a Linux-direct OpenHCL test that stresses servicing with many NVMe devices.
vm/devices/user_driver/vfio_sys/src/lib.rs Removes per-method retry logic, makes ops sync, and adds VfioRetry retry wrapper.
vm/devices/user_driver/vfio_sys/Cargo.toml Adds tracelimit dependency for ratelimited retry logging.
vm/devices/user_driver/src/vfio.rs Wraps VFIO group/device open + keepalive calls with VfioRetry and error classifiers.
vm/devices/user_driver/Cargo.toml Adds optional nix dependency under the vfio feature to inspect Errno.
Cargo.lock Updates lockfile for the new dependency edges (tracelimit, nix).

Comment on lines +242 to 246
tracelimit::warn_ratelimited!(
operation = context,
attempt,
"retrying after transient error: {err}"
);
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VfioRetry::retry logs a ratelimited warning on retries, but the log fields don’t include any identifier for the VFIO device/group being operated on. The earlier per-callsite retry logic logged pci_id, and losing that context will make these warnings hard to action when multiple VFIO devices are present. Consider extending retry to accept an optional identifier (e.g., pci_id/path) and include it as a structured field in the warning, or ensure callers enter a span that carries pci_id before invoking retry so the emitted warnings are attributable.

Copilot uses AI. Check for mistakes.
@chris-oo
Copy link
Copy Markdown
Member

chris-oo commented Apr 3, 2026

have we identified why these transient failures occur?

@babayet2
Copy link
Copy Markdown
Collaborator Author

babayet2 commented Apr 3, 2026

have we identified why these transient failures occur?

@chris-oo yeah Matt has, I should have linked the issue
#3123

there's reportedly a race between when the device becomes available in userspace and fully initialized

@chris-oo
Copy link
Copy Markdown
Member

chris-oo commented Apr 3, 2026

I guess i'm wondering if this is a kernel issue, we can workaround it usermode sure, but we should fix the underlying kernel issue. Is anyone looking at that?

@babayet2
Copy link
Copy Markdown
Collaborator Author

babayet2 commented Apr 3, 2026

I guess i'm wondering if this is a kernel issue, we can workaround it usermode sure, but we should fix the underlying kernel issue. Is anyone looking at that?

I know that I'm not looking at it - hold on, I'll ask around

@babayet2
Copy link
Copy Markdown
Collaborator Author

babayet2 commented Apr 3, 2026

I guess i'm wondering if this is a kernel issue, we can workaround it usermode sure, but we should fix the underlying kernel issue. Is anyone looking at that?

nobody on the kernel side is looking at it today, I've reached out and started a thread offline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

unsafe Related to unsafe code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants