OCPBUGS-67161: Replace HTTP backend liveness check with admin socket check#737
Conversation
@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
(force-pushed 9b63bde to 6329b86)
/jira refresh
@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
/retest
@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.

/retest
@alebedev87: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

13 similar comments
/assign
@alebedev87 I didn't notice the failed readiness probes in the test. I tried to create a cluster via the bot, but it failed. I will try to test it tomorrow.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jcmoraisjr. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Use HAProxy admin socket `show version` command for the liveness probe instead of sending an HTTP request to the backend. This directly tests whether the HAProxy process is alive and responsive, rather than testing through the data plane.

The HTTP-based liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the liveness probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running. The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load.

The readiness probe continues to use the HTTP backend check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the admin socket URL definition to the top of the Run method and reuse it for the Prometheus collector ScrapeURI default, the liveness probe, and the ConfigManager connection info. Remove the hardcoded default from the haproxy metrics package. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
(force-pushed 5c31f66 to 4b76ac6)
@jcmoraisjr: Thanks for the lgtm! I fixed up the last commit (which changes the admin command from
/lgtm
Tested it with 4.22.0-0-2026-03-06-015102-test-ci-ln-3bs1pmk-latest.
/verified by @ShudiLi
@ShudiLi: This PR has been marked as verified.
/retest-required

/retest

2 similar comments
The latest failure, and the two runs before it, failed on the same test. This seems to be related to this CCO PR, which was recently reverted, and that's why it was not seen in the latest failure. There was a successful run of the same commit, though. Both analyses look reasonable; I fail to see an evident link to this PR, so let's give it another chance.

/test e2e-aws-serial-2of2
/test e2e-aws-serial-2of2

/test e2e-aws-serial-1of2

/test e2e-aws-serial-1of2 (failed to install just now)

/override ci/prow/e2e-aws-serial-1of2
@alebedev87: Overrode contexts on behalf of alebedev87: ci/prow/e2e-aws-serial-1of2
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@alebedev87: all tests passed! Full PR test history. Your PR dashboard.
@alebedev87: Jira Issue Verification Checks: Jira Issue OCPBUGS-67161 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload.
Summary

Use the HAProxy admin socket `show version` command for the liveness probe instead of an HTTP request to the backend. The HTTP-based check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running. The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load.

Manual test

Using https://github.com/mparram/test-backend.

Standard router image:

Router image with the fix (and `show info` logs):

If the haproxy process is dead, the liveness probe kicks in as expected.