devices: add retry wrapper on VFIO ops by babayet2 · Pull Request #3185 · microsoft/openvmm

babayet2 · 2026-04-02T23:48:43Z

Implement a wrapper struct on VFIO ops that retries failed driver operations. This is useful for transient issues in the restore path of VFIO devices

github-actions · 2026-04-02T23:49:00Z

⚠️ Unsafe Code Detected

This PR modifies files containing unsafe Rust code. Extra scrutiny is required during review.

For more on why we check whole files, instead of just diffs, check out the Rustonomicon

Copilot

Pull request overview

This PR introduces a reusable retry wrapper for VFIO operations to better tolerate transient failures (notably during restore/servicing flows), and wires it into the VFIO device bring-up path.

Changes:

Refactors vfio_sys::Group VFIO operations (open_device, set_keep_alive) to be synchronous and adds a new async VfioRetry helper for retry-with-delay behavior.
Uses VfioRetry in user_driver’s VFIO restore path to retry specific transient error cases (e.g., ENODEV, NotFound).
Enables an OpenHCL Linux-direct integration test variant that exercises a heavier NVMe servicing scenario.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
vmm_tests/vmm_tests/tests/tests/x86_64/openhcl_linux_direct.rs	Enables a Linux-direct OpenHCL test that stresses servicing with many NVMe devices.
vm/devices/user_driver/vfio_sys/src/lib.rs	Removes per-method retry logic, makes ops sync, and adds `VfioRetry` retry wrapper.
vm/devices/user_driver/vfio_sys/Cargo.toml	Adds `tracelimit` dependency for ratelimited retry logging.
vm/devices/user_driver/src/vfio.rs	Wraps VFIO group/device open + keepalive calls with `VfioRetry` and error classifiers.
vm/devices/user_driver/Cargo.toml	Adds optional `nix` dependency under the `vfio` feature to inspect `Errno`.
Cargo.lock	Updates lockfile for the new dependency edges (`tracelimit`, `nix`).

Copilot · 2026-04-02T23:54:42Z

vm/devices/user_driver/vfio_sys/src/lib.rs

+                    tracelimit::warn_ratelimited!(
+                        operation = context,
+                        attempt,
+                        "retrying after transient error: {err}"
                    );


VfioRetry::retry logs a ratelimited warning on retries, but the log fields don’t include any identifier for the VFIO device/group being operated on. The earlier per-callsite retry logic logged pci_id, and losing that context will make these warnings hard to action when multiple VFIO devices are present. Consider extending retry to accept an optional identifier (e.g., pci_id/path) and include it as a structured field in the warning, or ensure callers enter a span that carries pci_id before invoking retry so the emitted warnings are attributable.

chris-oo · 2026-04-03T21:14:46Z

have we identified why these transient failures occur?

babayet2 · 2026-04-03T22:24:23Z

have we identified why these transient failures occur?

@chris-oo yeah Matt has, I should have linked the issue
#3123

there's reportedly a race between when the device becomes available in userspace and fully initialized

chris-oo · 2026-04-03T22:37:07Z

I guess i'm wondering if this is a kernel issue, we can workaround it usermode sure, but we should fix the underlying kernel issue. Is anyone looking at that?

babayet2 · 2026-04-03T22:58:23Z

I guess i'm wondering if this is a kernel issue, we can workaround it usermode sure, but we should fix the underlying kernel issue. Is anyone looking at that?

I know that I'm not looking at it - hold on, I'll ask around

babayet2 · 2026-04-03T23:23:26Z

I guess i'm wondering if this is a kernel issue, we can workaround it usermode sure, but we should fix the underlying kernel issue. Is anyone looking at that?

nobody on the kernel side is looking at it today, I've reached out and started a thread offline

retry wrapper on VFIO ops + reintroduce nvme test

a7a7f94

babayet2 requested a review from a team as a code owner April 2, 2026 23:48

Copilot AI review requested due to automatic review settings April 2, 2026 23:48

github-actions bot added the unsafe Related to unsafe code label Apr 2, 2026

Copilot started reviewing on behalf of babayet2 April 2, 2026 23:49 View session

Copilot AI reviewed Apr 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

devices: add retry wrapper on VFIO ops#3185

devices: add retry wrapper on VFIO ops#3185
babayet2 wants to merge 1 commit intomicrosoft:mainfrom
babayet2:retry-errata

babayet2 commented Apr 2, 2026

Uh oh!

github-actions bot commented Apr 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

chris-oo commented Apr 3, 2026

Uh oh!

babayet2 commented Apr 3, 2026

Uh oh!

chris-oo commented Apr 3, 2026

Uh oh!

babayet2 commented Apr 3, 2026

Uh oh!

babayet2 commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

babayet2 commented Apr 2, 2026

Uh oh!

github-actions bot commented Apr 2, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

chris-oo commented Apr 3, 2026

Uh oh!

babayet2 commented Apr 3, 2026

Uh oh!

chris-oo commented Apr 3, 2026

Uh oh!

babayet2 commented Apr 3, 2026

Uh oh!

babayet2 commented Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants