[release-4.22] NVIDIA-596: Enable dpu healthcheck#3009
Add configurable DPU node lease health monitoring to detect when the DPU-side OVN-Kubernetes component is down or not installed. Without this, pods are scheduled to DPU-accelerated nodes regardless of DPU readiness, causing silent 2-minute CNI ADD timeouts with no visibility or automated remediation.

DPU lease configuration:
- Read dpu-node-lease-renew-interval and dpu-node-lease-duration from the hardware-offload-config ConfigMap (defaults: 10s / 40s).
- Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION env vars into ovnkube-controller for dpu-host/dpu node modes.
- Script-lib translates env vars into --dpu-node-lease-renew-interval and --dpu-node-lease-duration CLI flags for ovnkube-node.
- Setting renew-interval to 0 disables the health check; duration must always be > 0 (required by ovn-kubernetes).
- Lease namespace is derived via downward API (fieldRef).

Jira: https://issues.redhat.com/browse/NVIDIA-596
Made-with: Cursor
Signed-off-by: Igal Tsoiref <itsoiref@redhat.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
|
@tsorya: This pull request references NVIDIA-596 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tsorya |
|
/payload 4.22 ci blocking |
|
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/65638f00-5190-11f1-9814-cded5d8a1eab-0
trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/65638f00-5190-11f1-9814-cded5d8a1eab-1 |
|
/payload 4.22 ci blocking |
|
@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/daa71920-5263-11f1-9827-2279928092a2-0
trigger 13 job(s) of type blocking for the nightly release of OCP 4.22
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/daa71920-5263-11f1-9827-2279928092a2-1 |
|
/retest-required |
|
/hold |
Is there any issue open on the Multus side, or on Cilium? |
I can see cilium/cilium#30363 on cniVersion bumping for cilium. |
|
/hold |
|
This works for me: openshift/release#79593, which overrides the CNI version in the Cilium config. |
4071939 to 8796e18
|
@tsorya: The following tests failed. |
This is a manual cherry-pick of #2941 to release-4.22.
Changes
Cherry-picks both commits from #2941:
- NVIDIA-596: Enable DPU healthcheck. Adds configurable DPU node lease health monitoring (`dpu-node-lease-renew-interval`, `dpu-node-lease-duration`) read from the `hardware-offload-config` ConfigMap. Env vars `OVNKUBE_NODE_LEASE_RENEW_INTERVAL` / `OVNKUBE_NODE_LEASE_DURATION` are injected into ovnkube-node for dpu-host/dpu modes. Script-lib translates them into `--dpu-node-lease-renew-interval` / `--dpu-node-lease-duration` CLI flags. Setting renew-interval to 0 disables the health check.
- NVIDIA-616: Bump Multus CNI API version to 1.1.0. Required for CNI STATUS and GC verbs used by the DPU health check. Backward compatible with 0.3.1.
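
For illustration only, the env-var-to-flag translation described above could look roughly like the sketch below. The helper name `build_dpu_lease_flags`, the `--flag=value` form, and the choice to emit nothing when a variable is unset are assumptions; the actual script-lib code is not reproduced in this description. Values are passed through verbatim, so a renew interval of 0 is forwarded unchanged.

```shell
# Hypothetical helper (not the actual script-lib code): builds the DPU lease
# flag string from the two env vars named in this PR.
# The function body runs in a subshell so no variables leak to the caller.
build_dpu_lease_flags() (
  flags=""
  # Emit a flag only when the corresponding env var is set and non-empty.
  if [ -n "${OVNKUBE_NODE_LEASE_RENEW_INTERVAL:-}" ]; then
    flags="--dpu-node-lease-renew-interval=${OVNKUBE_NODE_LEASE_RENEW_INTERVAL}"
  fi
  if [ -n "${OVNKUBE_NODE_LEASE_DURATION:-}" ]; then
    # Insert a separating space only if a previous flag was already added.
    flags="${flags:+${flags} }--dpu-node-lease-duration=${OVNKUBE_NODE_LEASE_DURATION}"
  fi
  printf '%s' "${flags}"
)

# Example usage: capture the result in the variable that is later appended
# to the ovnkube-node command line.
dpu_lease_flags="$(build_dpu_lease_flags)"
```

When neither variable is set, the helper prints an empty string, so an unquoted `${dpu_lease_flags}` appended to a command line expands to nothing.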
Conflict Resolution
One conflict in `bindata/network/ovn-kubernetes/common/008-script-lib.yaml` at the end of the `start-ovnkube-node()` function: release-4.22 still has `${egress_features_enable_flag}` and `${multi_external_gateway_enable_flag}` (which were removed on master by #2944). Resolution keeps these existing lines and appends the new `${dpu_lease_flags}`.

This is compatible with #2997 (cherry-pick of #2944), which is pending on release-4.22: when #2997 merges, it will cleanly remove the egress/gateway lines while `${dpu_lease_flags}` remains.