mana_driver: vf reconfiguration revokes vtl0 vf faster#3164
mana_driver: vf reconfiguration revokes vtl0 vf faster#3164erfrimod wants to merge 2 commits intomicrosoft:mainfrom
Conversation
|
Traces from my lab machine. Note the time between EQE and VF in the Guest is now about 1.39888 seconds. |
There was a problem hiding this comment.
Pull request overview
Improves VF Reconfiguration handling for the MANA driver / Underhill NetVSP path by treating the HWC channel as unavailable immediately after the VF reconfig EQE, avoiding long teardown timeouts and reducing noisy failing teardown attempts.
Changes:
- Set
hwc_timeout_in_msto0on VF reconfiguration EQE to make subsequent HWC waits fail fast, and gate timeout reporting to avoid extra work when timeout is0. - Update driver teardown behavior to skip HWC resource teardown after
hwc_failureis detected. - Update
test_gdma_reconfig_vfto validate the new “timeout becomes 0 after EQE 135” behavior and that teardown fails fast. - In VF reconfig handling, remove the VTL0 VF via
remove_vtl0_vf()and clear saved datapath/filter state.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vm/devices/net/mana_driver/src/tests.rs | Extends VF reconfig test to assert timeout transitions to 0 and hwc_failure behavior during deregister. |
| vm/devices/net/mana_driver/src/resources.rs | Skips HWC teardown commands when hwc_failure is set to avoid repeated failing requests/log spam. |
| vm/devices/net/mana_driver/src/gdma_driver.rs | Sets timeout to 0 on VF reconfig EQE; adds early timeout exit; preserves 0 timeout through deregister; exposes hwc_failure() and test getter. |
| openhcl/underhill_core/src/emuplat/netvsp.rs | Switches VF reconfig VTL0 VF removal path to remove_vtl0_vf() and clears saved direction_to_vtl0 state. |
| // When HWC has already failed, skip sending teardown commands for HWC resources: | ||
| // DmaRegion, Eq, BnicQueue. HWC requests all fail: "Previous hardware failure". | ||
| // Device should reclaim resources on its own reset. |
There was a problem hiding this comment.
I would recommend moving this comment down to the '_ if skip_hwc ...' code because this one line will be easy to miss when skimming the code, so the comment will help draw attention to it
ben-zen
left a comment
There was a problem hiding this comment.
Looks good to me with the change Brian suggested to the comment location. That's a subtle choice, calling it out with a comment makes the logic clear.
127b33f to
fffdb4f
Compare
| tracing::info!( | ||
| count = self.resources.len(), |
There was a problem hiding this comment.
The log field count = self.resources.len() includes MemoryBlock entries, but the message says "skipping HWC resource teardown". Consider either counting only the HWC-backed resources or renaming the field/message so the telemetry can be interpreted accurately during failures.
| tracing::info!( | |
| count = self.resources.len(), | |
| let hwc_resource_count = self | |
| .resources | |
| .iter() | |
| .filter(|r| { | |
| matches!( | |
| r, | |
| Resource::DmaRegion { .. } | |
| | Resource::Eq { .. } | |
| | Resource::BnicQueue { .. } | |
| ) | |
| }) | |
| .count(); | |
| tracing::info!( | |
| hwc_resource_count = hwc_resource_count, |
| self.report_hwc_timeout(wait_failed, interrupt_loss, eqe_wait_result.elapsed as u32) | ||
| // Don't report the timeout once VF reconfiguration is pending, | ||
| // since the SoC will not respond. | ||
| if !self.vf_reconfiguration_pending { |
There was a problem hiding this comment.
It would be good to keep a consistent check. In other places we are checking for hwc_failure. We should do the same here.
There was a problem hiding this comment.
And, if there is hwc_failure, why not return an error?
There was a problem hiding this comment.
We would lose reports if the wait times out (hwc_failure set), then eqe is found, and wait_failed is false. Soc is alive, but slow. This case is reflected in the check at the end of the function where if a reconfig is not pending, hwc_failure is set back to false. On one hand, making the change is entirely safe because all we lose is a log sent to soc. On the other hand, I think the intent of the log is to help the soc diagnose when responses are slow and timing out.
Edit: It's a little frustrating the hwc_failure is set to true and then back. A stronger design might have multiple states, or maybe a bool to track hwc_timeout that could get cleared...
| // When HWC has already failed, skip sending teardown commands for HWC resources: | ||
| // DmaRegion, Eq, BnicQueue. HWC requests all fail: "Previous hardware failure". | ||
| // Device should reclaim resources on its own reset. | ||
| _ if skip_hwc => continue, |
There was a problem hiding this comment.
I am not sure we need to do anything here. GDMA driver has the logic to handle what to do when HWC has been marked for failure (during reconfig) and will handle this. So, for example, when this code makes a call to disable EQ or DMA region, gdma driver will error out if HWC failure is set to true (which will be in the case of reconfig)
There was a problem hiding this comment.
When testing on my lab machine, this check removed a dozen or so ignorable "previous hardware failure" traces. They are expected, but I would prefer future failure triage doesn't see them.
When VF Reconfiguration attempts to revoke the VTL0 VF, it can get stuck attempting to send HWC commands which have no chance of succeeding. This is causing try_notify_guest_and_revoke_vtl0_vf() to timeout, leaving the Guest in an inconsistent state that can cause the Reconfiguration to fail to restore the VTL0 VF.
hwc_failureto signal that the HWC channel is gone. Avoids waiting for timeouts on HWC commands.