
OCPBUGS-67161: Replace HTTP backend liveness check with admin socket check#737

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from
alebedev87:OCPBUGS-67161-liveness-probe-show-info
Mar 10, 2026

Conversation

@alebedev87
Contributor

@alebedev87 alebedev87 commented Feb 24, 2026

Summary

  • Replace the HTTP-based HAProxy liveness check with an admin socket show version command
  • The HTTP liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running
  • The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load
  • The readiness probe continues to use the HTTP backend check

Manual test

Using https://github.com/mparram/test-backend.

Standard router image:

# 1 pod sends ~500 req/s
$ oc -n test-client get pods
NAME                           READY   STATUS    RESTARTS   AGE
test-client-58c4687f55-5skdb   1/1     Running   0          9m13s
test-client-58c4687f55-bqffc   1/1     Running   0          9m13s
test-client-58c4687f55-gvsl7   1/1     Running   0          9m13s
test-client-58c4687f55-kpvm9   1/1     Running   0          9m13s
test-client-58c4687f55-wkxpp   1/1     Running   0          9m13s

$ oc -n openshift-ingress get pods router-default-d5db46b5d-p6k9w -o yaml | grep image:
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bdfb4ce97c4391f07c4183b771dab332f311f2d707adb03281b43fd4dc7e196
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bdfb4ce97c4391f07c4183b771dab332f311f2d707adb03281b43fd4dc7e196

$ oc -n openshift-ingress get pods router-default-d5db46b5d-p6k9w -o yaml | grep -A1 MAX_CONN
    - name: ROUTER_MAX_CONNECTIONS
      value: "2000"

# Router pods are restarting
$ oc -n openshift-ingress  get pods
NAME                             READY   STATUS    RESTARTS        AGE
router-default-d5db46b5d-2q7mh   1/1     Running   2 (8m11s ago)   11m
router-default-d5db46b5d-p6k9w   1/1     Running   4 (7m31s ago)   11m

$ oc -n openshift-ingress logs router-default-d5db46b5d-p6k9w -p
. . .
I0224 17:45:24.118761       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:27.608009       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:34.118481       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:37.607969       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:43.161206       1 template.go:844] "msg"="Shutdown requested, waiting 45s for new connections to cease" "logger"="router"
I0224 17:45:44.119374       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: read tcp 127.0.0.1:49768->127.0.0.1:80: i/o timeout
I0224 17:45:47.607867       1 healthz.go:311] backend-proxy-http,process-running check failed: healthz
[-]backend-proxy-http failed: read tcp 127.0.0.1:49782->127.0.0.1:80: i/o timeout
[-]process-running failed: process is terminating

Router image with the fix (and show info logs):

$ oc -n openshift-ingress get pods router-default-7fc6b96c5b-thbsm -o yaml | grep -A1 MAX_CONN
    - name: ROUTER_MAX_CONNECTIONS
      value: "2000"

$ oc -n openshift-ingress get pods router-default-7fc6b96c5b-thbsm -o yaml | grep image:
    image: quay.io/alebedev/router:2.24.173
    image: quay.io/alebedev/router:2.24.173

# readiness probe goes on/off
$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7fc6b96c5b-qbnzp   0/1     Running   0          4m9s
router-default-7fc6b96c5b-thbsm   1/1     Running   0          4m10s

$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7fc6b96c5b-qbnzp   0/1     Running   0          4m41s
router-default-7fc6b96c5b-thbsm   0/1     Running   0          4m42s

# while liveness probe keeps running ok (no admin-socket healthz failures)
$ oc -n openshift-ingress logs router-default-7fc6b96c5b-qbnzp | grep -c admin-socket
0

$ oc -n openshift-ingress logs router-default-7fc6b96c5b-thbsm | grep CurrConns
CurrConns: 0
CurrConns: 1
CurrConns: 1
CurrConns: 2
CurrConns: 3
CurrConns: 4
CurrConns: 5
CurrConns: 0
CurrConns: 1
CurrConns: 2
CurrConns: 3
CurrConns: 4
CurrConns: 5
CurrConns: 6
CurrConns: 7
CurrConns: 8
CurrConns: 0
CurrConns: 1
CurrConns: 2
CurrConns: 931
CurrConns: 2000
CurrConns: 2000
CurrConns: 1998
CurrConns: 1994
CurrConns: 2000
CurrConns: 2000
CurrConns: 2000
CurrConns: 824
CurrConns: 359
CurrConns: 2000
CurrConns: 2000
CurrConns: 2000
CurrConns: 1810
CurrConns: 9
CurrConns: 10
CurrConns: 11
CurrConns: 12
CurrConns: 13

If the HAProxy process is dead, the liveness probe kicks in as expected:

$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS       AGE
router-default-66b9cf9587-r7s2b   1/1     Running   1 (2m9s ago)   11m

$ oc -n openshift-ingress logs router-default-66b9cf9587-r7s2b -p | grep admin-socket
I0305 14:18:14.175477       1 healthz.go:311] admin-socket check failed: healthz
[-]admin-socket failed: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
I0305 14:18:24.175643       1 healthz.go:311] admin-socket check failed: healthz
[-]admin-socket failed: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
I0305 14:18:34.175771       1 healthz.go:311] admin-socket check failed: healthz
[-]admin-socket failed: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused

@openshift-ci-robot added the jira/severity-critical (referenced Jira bug's severity is critical for the branch this PR is targeting), jira/valid-reference (PR references a valid Jira ticket of any type), and jira/invalid-bug (referenced Jira bug is invalid for the branch this PR is targeting) labels on Feb 24, 2026
@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Replace the HTTP-based HAProxy liveness check with an admin socket show info command
  • The HTTP liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running
  • The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load
  • The readiness probe continues to use the HTTP backend check

Test plan

  • go build ./... passes
  • go test ./pkg/router/metrics/... ./pkg/cmd/infra/router/... passes
  • Deploy to a cluster and verify curl localhost:1936/healthz returns 200 when HAProxy is running
  • Verify the liveness probe restarts the container if HAProxy becomes unresponsive

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@alebedev87 force-pushed the OCPBUGS-67161-liveness-probe-show-info branch from 9b63bde to 6329b86 on February 24, 2026 at 20:58
@alebedev87
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label on Feb 24, 2026
@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi


@openshift-ci openshift-ci bot requested a review from ShudiLi February 24, 2026 21:02
@alebedev87
Contributor Author

/retest

@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi


@alebedev87
Contributor Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

@alebedev87: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

13 similar comments
@jcmoraisjr
Member

/assign

@ShudiLi
Member

ShudiLi commented Mar 5, 2026

@alebedev87 I didn't see the failed readiness probes in my test. I tried to create a cluster via the bot, but that failed. I will try to test it again tomorrow.

@jcmoraisjr
Member

/lgtm
/approve

@openshift-ci bot added the lgtm label (PR is ready to be merged) on Mar 5, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcmoraisjr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label (PR has been approved by an approver from all required OWNERS files) on Mar 5, 2026
alebedev87 and others added 2 commits March 5, 2026 13:48
…check

Use HAProxy admin socket `show version` command for the liveness probe
instead of sending an HTTP request to the backend. This directly tests
whether the HAProxy process is alive and responsive, rather than
testing through the data plane.

The HTTP-based liveness check counts against HAProxy's maxconn limit.
When maxconn is reached due to client traffic, the liveness probe HTTP
request gets queued or rejected, causing probe failures and unnecessary
container restarts even though HAProxy is still running. The admin
socket is not subject to maxconn, so the liveness probe remains
reliable under high connection load.

The readiness probe continues to use the HTTP backend check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the admin socket URL definition to the top of the Run method
and reuse it for the Prometheus collector ScrapeURI default, the
liveness probe, and the ConfigManager connection info. Remove the
hardcoded default from the haproxy metrics package.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot
Contributor

@alebedev87: This pull request references Jira Issue OCPBUGS-67161, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi


@alebedev87 force-pushed the OCPBUGS-67161-liveness-probe-show-info branch from 5c31f66 to 4b76ac6 on March 5, 2026 at 14:21
@openshift-ci bot removed the lgtm label on Mar 5, 2026
@alebedev87
Contributor Author

@jcmoraisjr: Thanks for the lgtm! I fixed up the last commit (which changed the admin command from show info to show version) into the first one. No changes were made to the code. Can you please re-tag the PR?

@jcmoraisjr
Member

/lgtm

@openshift-ci bot added the lgtm label on Mar 5, 2026
@ShudiLi
Member

ShudiLi commented Mar 6, 2026

Tested with 4.22.0-0-2026-03-06-015102-test-ci-ln-3bs1pmk-latest

1.
% oc get clusterversion
NAME      VERSION                                                AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.22.0-0-2026-03-06-015102-test-ci-ln-3bs1pmk-latest   True        False         40m     Cluster version is 4.22.0-0-2026-03-06-015102-test-ci-ln-3bs1pmk-latest

2.
% oc get route
NAME          HOST/PORT                                                             PATH   SERVICES      PORT          TERMINATION   WILDCARD
unsec-apach   unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org          unsec-apach   unsec-apach                 None
% oc get ep   
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
NAME          ENDPOINTS                                                        AGE
kubernetes    10.0.37.209:6443,10.0.51.149:6443,10.0.67.142:6443               75m
unsec-apach   10.128.2.22:8080,10.128.2.23:8080,10.128.2.24:8080 + 7 more...   23m

3. Let 10 pods send flood traffic
sh-4.4# hey -n 50000 -c 30000 http://unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org

Summary:
  Total:	0.9707 secs
  Slowest:	0.8607 secs
  Fastest:	0.4919 secs
  Average:	0.7507 secs
  Requests/sec:	30906.7499
  
  Total data:	18382 bytes
  Size/request:	14 bytes

Response time histogram:
  0.492 [1]	|
  0.529 [9]	|■
  0.566 [40]	|■■■■■
  0.603 [86]	|■■■■■■■■■■■
  0.639 [19]	|■■
  0.676 [79]	|■■■■■■■■■■
  0.713 [189]	|■■■■■■■■■■■■■■■■■■■■■■■■
  0.750 [168]	|■■■■■■■■■■■■■■■■■■■■■
  0.787 [116]	|■■■■■■■■■■■■■■■
  0.824 [317]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.861 [289]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■


Latency distribution:
  10% in 0.5981 secs
  25% in 0.6993 secs
  50% in 0.7763 secs
  75% in 0.8200 secs
  90% in 0.8361 secs
  95% in 0.8507 secs
  99% in 0.8580 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.1869 secs, 0.4919 secs, 0.8607 secs
  DNS-lookup:	0.2913 secs, 0.0000 secs, 0.8206 secs
  req write:	0.0122 secs, 0.0000 secs, 0.4529 secs
  resp wait:	0.1503 secs, 0.0663 secs, 0.3534 secs
  resp read:	0.0076 secs, 0.0000 secs, 0.4877 secs

Status code distribution:
  [200]	1313 responses

Error distribution:
  [9337]	Get "http://unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org": dial tcp 34.223.173.39:80: socket: too many open files
  [17337]	Get "http://unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org": dial tcp: lookup unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org on 172.30.0.10:53: dial udp 172.30.0.10:53: socket: too many open files
  [2013]	Get "http://unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org": dial tcp: lookup unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org on 172.30.0.10:53: no such host

sh-4.4# hey -n 50000 -c 30000 http://unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org

Summary:
  Total:	1.2640 secs
  Slowest:	1.2622 secs
  Fastest:	0.6246 secs
  Average:	1.0410 secs
  Requests/sec:	23735.0323
  
  Total data:	71582 bytes
  Size/request:	14 bytes

Response time histogram:
  0.625 [1]	|
  0.688 [220]	|■■■■■■
  0.752 [0]	|
  0.816 [3]	|
  0.880 [186]	|■■■■■
  0.943 [254]	|■■■■■■■
  1.007 [733]	|■■■■■■■■■■■■■■■■■■■■■
  1.071 [1402]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.135 [1405]	|■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.198 [742]	|■■■■■■■■■■■■■■■■■■■■■
  1.262 [167]	|■■■■■


Latency distribution:
  10% in 0.9042 secs
  25% in 0.9994 secs
  50% in 1.0636 secs
  75% in 1.1111 secs
  90% in 1.1619 secs
  95% in 1.1887 secs
  99% in 1.2260 secs

Details (average, fastest, slowest):
  DNS+dialup:	0.0596 secs, 0.6246 secs, 1.2622 secs
  DNS-lookup:	0.0860 secs, 0.0000 secs, 0.5801 secs
  req write:	0.0212 secs, 0.0000 secs, 0.7431 secs
  resp wait:	0.0762 secs, 0.0013 secs, 0.4120 secs
  resp read:	0.0047 secs, 0.0000 secs, 0.2996 secs

Status code distribution:
  [200]	5113 responses

Error distribution:
  [12508]	Get "http://unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org": dial tcp 34.223.173.39:80: socket: too many open files
  [12379]	Get "http://unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org": dial tcp: lookup unsec-apach-default.apps.ci-ln-3bs1pmk-76ef8.aws-4.ci.openshift.org on 172.30.0.10:53: dial udp 172.30.0.10:53: socket: too many open files

sh-4.4# 

4. Check the logs: the router pod doesn't reload or restart, and there are no health-check failures either
% oc -n openshift-ingress logs router-default-5c54fb6599-pw6sq
I0306 02:48:19.689300       1 template.go:561] "msg"="starting router" "logger"="router" "version"="majorFromGit: \nminorFromGit: \ncommitFromGit: 13e30fab\nversionFromGit: v0.0.0-unknown\ngitTreeState: dirty\nbuildDate: 2026-03-06T01:44:57Z\n"
I0306 02:48:19.689905       1 envvar.go:172] "Feature gate default state" feature="InOrderInformers" enabled=true
I0306 02:48:19.689918       1 envvar.go:172] "Feature gate default state" feature="InOrderInformersBatchProcess" enabled=true
I0306 02:48:19.689923       1 envvar.go:172] "Feature gate default state" feature="InformerResourceVersion" enabled=true
I0306 02:48:19.689926       1 envvar.go:172] "Feature gate default state" feature="WatchListClient" enabled=true
I0306 02:48:19.689929       1 envvar.go:172] "Feature gate default state" feature="ClientsAllowCBOR" enabled=false
I0306 02:48:19.689932       1 envvar.go:172] "Feature gate default state" feature="ClientsPreferCBOR" enabled=false
I0306 02:48:19.691551       1 metrics.go:156] "msg"="router health and metrics port listening on HTTP and HTTPS" "address"="0.0.0.0:1936" "logger"="metrics"
I0306 02:48:19.694943       1 router.go:214] "msg"="creating a new template router" "logger"="template" "writeDir"="/var/lib/haproxy"
I0306 02:48:19.694986       1 router.go:298] "msg"="router will coalesce reloads within an interval of each other" "interval"="5s" "logger"="template"
I0306 02:48:19.695332       1 router.go:368] "msg"="watching for changes" "logger"="template" "path"="/etc/pki/tls/private"
I0306 02:48:19.695378       1 router.go:283] "msg"="router is including routes in all namespaces" "logger"="router"
I0306 02:48:19.708917       1 reflector.go:446] "Caches populated" type="*v1.EndpointSlice" reflector="github.com/openshift/router/pkg/router/controller/factory/factory.go:124"
I0306 02:48:19.709499       1 reflector.go:446] "Caches populated" type="*v1.Service" reflector="github.com/openshift/router/pkg/router/template/service_lookup.go:33"
I0306 02:48:19.709982       1 reflector.go:446] "Caches populated" type="*v1.Route" reflector="github.com/openshift/router/pkg/router/controller/factory/factory.go:124"
E0306 02:48:19.801185       1 haproxy.go:418] "Unhandled Error" err="can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory" logger="UnhandledError"
I0306 02:48:19.859726       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
I0306 03:10:06.819743       1 router.go:665] "msg"="router reloaded" "logger"="template" "output"=" - Checking http://localhost:80 using PROXY protocol ...\n - Health check ok : 0 retry attempt(s).\n"
%


ShudiLi commented Mar 6, 2026

/verified by @ShudiLi

@openshift-ci-robot added the verified label (signifies that the PR passed pre-merge verification criteria) on Mar 6, 2026.
@openshift-ci-robot

@ShudiLi: This PR has been marked as verified by @ShudiLi.


In response to this:

/verified by @ShudiLi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


lihongan commented Mar 6, 2026

/retest-required

@alebedev87

/retest

@alebedev87

/retest


lihongan commented Mar 9, 2026

/retest

@alebedev87

The latest failure of serial-2of2 seems to be related to etcd instability. From Claude Code:

 ---
  Root Cause Analysis

  Summary

  All 126 monitor test failures stem from a single root cause: etcd instability on node ip-10-0-5-164.ec2.internal that cascaded into cluster-wide API server disruption.

  The Chain of Events

  1. Etcd under pressure on node ip-10-0-5-164 (03:54:00)
  - etcd member 84e681fa7f985160 on node ip-10-0-5-164 started experiencing issues:
    - apply request took too long
    - waiting for ReadIndex response took too long, retrying
    - Started new elections at term 4
    - This repeated at 03:55:00 as well

  2. Etcd leader flapping (03:54:37 - 03:55:33)
  - The etcd member rapidly lost and re-found the leader (20dcc50fad3b1af7) multiple times:
    - Lost/found at 03:54:37, again at 03:54:41, again at 03:55:32
  - etcdMemberCommunicationSlow alert fired at 03:54:19 (pending for 4m28s)
  - The etcd operator detected leader change rate of 3.3/5min (reported starting 03:55:19)
  - etcd Raft buffers were full: 100 dropped internal Raft messages on ip-10-0-57-43, 19 on ip-10-0-5-164

  3. API server disruption (03:54:31 - ~03:55:31)
  - Starting at 03:54:31, KubeAPIServer500s errors appeared (1-2 per second out of ~66-95 total requests)
  - All 3 API servers became unreachable for 14-17 seconds (max allowed was 6s):
    - kube-api (new: 14s, reused: 17s)
    - openshift-api (new: 14s, reused: 14s)
    - oauth-api (new: 14s, reused: 17s)
  - Error type: net/http: timeout awaiting response headers and dial tcp ... i/o timeout
  - Host-to-host connectivity to node 10.0.5.164 was disrupted for 13s
  - Scheduler on 10.0.5.164 observed API unreachable for 1m, as did kube-controller-manager

  4. Cascading effects (04:50 - 05:22)
  - KubePodNotReady alerts fired across ~15 namespaces (58s-2min each)
  - etcdMembersDown alert (pending, never firing) across all 3 etcd members from 05:05-05:18
  - machine-config operator went Degraded due to API timeout getting MachineConfigPools
  
Likely Root Cause

  The etcd instability on ip-10-0-5-164 appears to be caused by resource contention or I/O pressure on that master node. The etcd disk metrics show ip-10-0-5-164 had the lowest disk latency (0.006s vs 0.015s for other nodes), yet it was the one experiencing the most etcd issues — suggesting the problem may be network-related rather than disk-related. The node experienced host-to-host connectivity failures to/from multiple other nodes simultaneously, pointing to a transient network partition or NIC/hypervisor-level issue on that specific AWS instance.

The two runs before that failed on the same test: [sig-instrumentation][Late] Platform Prometheus targets should not be accessible without auth [Serial] [Suite:openshift/conformance/serial]. Claude Code tracked it down to this:

  Both runs are from the same CI job (ci-op-6vy7stnn) testing the same PR (revision 4b76ac62), but they used different release
  payloads because the CI system builds a fresh release image for each attempt:

  ┌───────────────────────────┬────────────────────────┬────────────────────┐
  │                           │ Failed Run (previous/) │    Passing Run     │
  ├───────────────────────────┼────────────────────────┼────────────────────┤
  │ Release payload timestamp │ 2026-03-08-193303      │ 2026-03-09-024639  │
  ├───────────────────────────┼────────────────────────┼────────────────────┤
  │ Release digest            │ sha256:99e118a1...     │ sha256:62870a08... │
  ├───────────────────────────┼────────────────────────┼────────────────────┤
  │ CCO image                 │ sha256:7d0676b7...     │ sha256:50ca8cd1... │
  ├───────────────────────────┼────────────────────────┼────────────────────┤
  │ kube-rbac-proxy sidecar   │ Missing                │ Present            │
  └───────────────────────────┴────────────────────────┴────────────────────┘

  The CI release payload is assembled from the latest builds of all OCP components at that point in time. Between the two payload assemblies (~7 hours apart), the cloud-credential-operator component was rebuilt — and that rebuild is what added/restored the kube-rbac-proxy sidecar container.

  This strongly suggests there was a transient regression in the CCO's deployment manifest in the 4.21 branch around 2026-03-08. Someone likely landed a change that removed the kube-rbac-proxy sidecar from the CCO deployment (intentionally or accidentally), and then it was fixed before the second payload was assembled.

This seems to be related to a CCO PR that landed recently, which is why the issue was not seen in the latest failure.

There was a successful run of the same commit though.

Both analyses look reasonable, and I fail to see an evident link to this PR; let's give it another chance.

/test e2e-aws-serial-2of2

@alebedev87

/test e2e-aws-serial-2of2

@lihongan

/test e2e-aws-serial-1of2

@lihongan

/test e2e-aws-serial-1of2

(failed to install just now)

@alebedev87

e2e-aws-serial-1of2 has already passed more than twice with the same commit. The master branch's head keeps moving, and tests keep getting rerun (due to installation errors). Let me try to speed this up, if I have the rights of course.

/override ci/prow/e2e-aws-serial-1of2


openshift-ci bot commented Mar 10, 2026

@alebedev87: Overrode contexts on behalf of alebedev87: ci/prow/e2e-aws-serial-1of2


In response to this:

e2e-aws-serial-1of2 has already passed more than twice with the same commit. The master branch's head keeps moving, and tests keep getting rerun (due to installation errors). Let me try to speed this up, if I have the rights of course.

/override ci/prow/e2e-aws-serial-1of2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


openshift-ci bot commented Mar 10, 2026

@alebedev87: all tests passed!




openshift-merge-bot merged commit ffb930d into openshift:master on Mar 10, 2026
10 checks passed
@openshift-ci-robot

@alebedev87: Jira Issue Verification Checks: Jira Issue OCPBUGS-67161
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-67161 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓


In response to this:

Summary

  • Replace the HTTP-based HAProxy liveness check with an admin socket show version command
  • The HTTP liveness check counts against HAProxy's maxconn limit. When maxconn is reached due to client traffic, the probe HTTP request gets queued or rejected, causing probe failures and unnecessary container restarts even though HAProxy is still running
  • The admin socket is not subject to maxconn, so the liveness probe remains reliable under high connection load
  • The readiness probe continues to use the HTTP backend check
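
The socket-based check described above can be sketched as follows. This is a minimal Python illustration, not the router's actual Go implementation: the function names and the in-process stand-in server (which answers `show version` the way HAProxy's admin socket would) are assumptions made for the demo.

```python
import os
import socket
import tempfile
import threading
import time

def check_admin_socket(path: str, timeout: float = 5.0) -> bool:
    """Liveness-style check: send 'show version' to the admin socket.

    Healthy means the socket accepts a connection and replies with any
    data. Unlike an HTTP probe, this path is not subject to maxconn.
    """
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(path)
            s.sendall(b"show version\n")
            return len(s.recv(1024)) > 0
    except OSError:
        return False

def fake_haproxy(path: str) -> None:
    """Stand-in for HAProxy: answer one 'show version' on a unix socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(path)
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            conn.recv(1024)           # read the command
            conn.sendall(b"3.0.0\n")  # reply as 'show version' would

sock_path = os.path.join(tempfile.mkdtemp(), "haproxy.sock")
threading.Thread(target=fake_haproxy, args=(sock_path,), daemon=True).start()

# Retry briefly: the stand-in server needs a moment to bind and listen.
alive = False
for _ in range(100):
    if check_admin_socket(sock_path):
        alive = True
        break
    time.sleep(0.01)

print(alive)                                    # True: process is up
print(check_admin_socket(sock_path + ".gone"))  # False: socket is gone
```

The second call shows the failure mode the real probe relies on: when the HAProxy process is gone, the socket connect fails immediately, so the liveness check turns unhealthy without depending on the HTTP data path.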

Manual test

Using https://github.com/mparram/test-backend.

Standard router image:

# 1 pod sends ~500 req/s
$ oc -n test-client get pods
NAME                           READY   STATUS    RESTARTS   AGE
test-client-58c4687f55-5skdb   1/1     Running   0          9m13s
test-client-58c4687f55-bqffc   1/1     Running   0          9m13s
test-client-58c4687f55-gvsl7   1/1     Running   0          9m13s
test-client-58c4687f55-kpvm9   1/1     Running   0          9m13s
test-client-58c4687f55-wkxpp   1/1     Running   0          9m13s

$ oc -n openshift-ingress get pods router-default-d5db46b5d-p6k9w -o yaml | grep image:
   image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bdfb4ce97c4391f07c4183b771dab332f311f2d707adb03281b43fd4dc7e196
   image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4bdfb4ce97c4391f07c4183b771dab332f311f2d707adb03281b43fd4dc7e196

$ oc -n openshift-ingress get pods router-default-d5db46b5d-p6k9w -o yaml | grep -A1 MAX_CONN
   - name: ROUTER_MAX_CONNECTIONS
     value: "2000"

# Router pods are restarting
$ oc -n openshift-ingress  get pods
NAME                             READY   STATUS    RESTARTS        AGE
router-default-d5db46b5d-2q7mh   1/1     Running   2 (8m11s ago)   11m
router-default-d5db46b5d-p6k9w   1/1     Running   4 (7m31s ago)   11m

$ oc -n openshift-ingress logs router-default-d5db46b5d-p6k9w -p
. . .
I0224 17:45:24.118761       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:27.608009       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:34.118481       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:37.607969       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: dial tcp [::1]:80: connect: connection refused
I0224 17:45:43.161206       1 template.go:844] "msg"="Shutdown requested, waiting 45s for new connections to cease" "logger"="router"
I0224 17:45:44.119374       1 healthz.go:311] backend-proxy-http check failed: healthz
[-]backend-proxy-http failed: read tcp 127.0.0.1:49768->127.0.0.1:80: i/o timeout
I0224 17:45:47.607867       1 healthz.go:311] backend-proxy-http,process-running check failed: healthz
[-]backend-proxy-http failed: read tcp 127.0.0.1:49782->127.0.0.1:80: i/o timeout
[-]process-running failed: process is terminating

Router image with the fix (and show info logs):

$ oc -n openshift-ingress get pods router-default-7fc6b96c5b-thbsm -o yaml | grep -A1 MAX_CONN
   - name: ROUTER_MAX_CONNECTIONS
     value: "2000"

$ oc -n openshift-ingress get pods router-default-7fc6b96c5b-thbsm -o yaml | grep image:
   image: quay.io/alebedev/router:2.24.173
   image: quay.io/alebedev/router:2.24.173

# readiness probe goes on/off
$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7fc6b96c5b-qbnzp   0/1     Running   0          4m9s
router-default-7fc6b96c5b-thbsm   1/1     Running   0          4m10s

$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7fc6b96c5b-qbnzp   0/1     Running   0          4m41s
router-default-7fc6b96c5b-thbsm   0/1     Running   0          4m42s

# while liveness probe keeps running ok (no admin-socket healthz failures)
$ oc -n openshift-ingress logs router-default-7fc6b96c5b-qbnzp | grep -c admin-socket
0

$ oc -n openshift-ingress logs router-default-7fc6b96c5b-thbsm | grep CurrConns
CurrConns: 0
CurrConns: 1
CurrConns: 1
CurrConns: 2
CurrConns: 3
CurrConns: 4
CurrConns: 5
CurrConns: 0
CurrConns: 1
CurrConns: 2
CurrConns: 3
CurrConns: 4
CurrConns: 5
CurrConns: 6
CurrConns: 7
CurrConns: 8
CurrConns: 0
CurrConns: 1
CurrConns: 2
CurrConns: 931
CurrConns: 2000
CurrConns: 2000
CurrConns: 1998
CurrConns: 1994
CurrConns: 2000
CurrConns: 2000
CurrConns: 2000
CurrConns: 824
CurrConns: 359
CurrConns: 2000
CurrConns: 2000
CurrConns: 2000
CurrConns: 1810
CurrConns: 9
CurrConns: 10
CurrConns: 11
CurrConns: 12
CurrConns: 13
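
The `CurrConns` lines above come from `show info` on the admin socket, which prints one `Key: value` pair per line. A small, hypothetical parser sketch (the sample text below is abbreviated, illustrative output, not a capture from this test):

```python
def parse_show_info(text: str) -> dict:
    """Parse HAProxy 'show info' output ('Key: value' lines) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            info[key.strip()] = value.strip()
    return info

# Abbreviated, illustrative 'show info' output (not a real capture).
sample = """Name: HAProxy
Version: 2.8.10
CurrConns: 2000
Maxconn: 2000
"""

info = parse_show_info(sample)
saturated = int(info["CurrConns"]) >= int(info["Maxconn"])
print(info["CurrConns"], saturated)  # 2000 True
```

Comparing `CurrConns` against `Maxconn` like this is exactly the saturation condition visible in the log: the connection count pins at 2000 while the socket check keeps answering.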

If haproxy process is dead the liveness probe kicks in as expected:

$ oc -n openshift-ingress get pods
NAME                              READY   STATUS    RESTARTS       AGE
router-default-66b9cf9587-r7s2b   1/1     Running   1 (2m9s ago)   11m

$ oc -n openshift-ingress logs router-default-66b9cf9587-r7s2b -p | grep admin-socket
I0305 14:18:14.175477       1 healthz.go:311] admin-socket check failed: healthz
[-]admin-socket failed: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
I0305 14:18:24.175643       1 healthz.go:311] admin-socket check failed: healthz
[-]admin-socket failed: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused
I0305 14:18:34.175771       1 healthz.go:311] admin-socket check failed: healthz
[-]admin-socket failed: dial unix /var/lib/haproxy/run/haproxy.sock: connect: connection refused



Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/severity-critical: Referenced Jira bug's severity is critical for the branch this PR is targeting.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
qe-approved: Signifies that QE has signed off on this PR.
verified: Signifies that the PR passed pre-merge verification criteria.
