fix(inc-2141): only check essential clusters in liveness healthcheck#7925
Open
onewland wants to merge 7 commits into
Conversation
The `/tmp/snuba.down` drain mechanism worked by having a uwsgi worker consult `uwsgi.started_on` and compare it to the down file's mtime. When we migrated from uwsgi to granian (#7566), `pyuwsgi` was dropped from the runtime deps and Dockerfile, so `import uwsgi` always fails and `check_down_file_exists()` has been a no-op ever since. The ops repo no longer references this drain mechanism either.

Remove the entire path: the uwsgi import block, `shutdown_time`, `_set_shutdown`, `get_shutdown`, `check_down_file_exists`, `check_shutdown` and its call sites, and the now-pointless `down_file_exists` field/tag in /health output. /health_envoy is left as a static 200 tombstone so existing envoy callers don't 404; it can be removed once envoy config no longer references it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `sanity_check_clickhouse_connections` returned `True` if any cluster responded, so a stalled high-traffic cluster (EAP) couldn't fail health and trigger a pod recycle as long as some lower-priority cluster answered. Restrict the check to the clusters backing errors, errors_ro, and events_analytics_platform, and require all of them to respond successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Using `with ThreadPoolExecutor(...)` calls `shutdown(wait=True)` on exit, which blocks until every submitted future finishes, including stalled ones. That defeats the timeout entirely: the very scenario INC-2141 is about (EAP holding broken connections) is exactly when we'd block here. Manage the executor manually and call `shutdown(wait=False)` so the method returns within the timeout regardless. Stalled worker threads run to completion in the background; that's acceptable because the pod is failing health and will be recycled by k8s. Adds a regression test that submits a 30s sleep as the cluster check and asserts the function returns within the timeout window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
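The manual-executor pattern this commit describes can be sketched as follows. This is a minimal illustration, not snuba's real code: `all_clusters_healthy` and the `checks` callables are hypothetical stand-ins for the actual cluster-connection checks.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def all_clusters_healthy(checks, timeout_seconds):
    """Run each check in a worker thread; fail if any check raises or
    misses the shared deadline. Never block on stalled workers."""
    executor = ThreadPoolExecutor(max_workers=max(len(checks), 1))
    try:
        futures = [executor.submit(check) for check in checks]
        deadline = time.monotonic() + timeout_seconds
        for future in futures:
            remaining = deadline - time.monotonic()
            try:
                # result() re-raises worker exceptions and raises
                # TimeoutError if the deadline passes first.
                future.result(timeout=max(remaining, 0))
            except Exception:
                return False
        return True
    finally:
        # wait=True here would join stalled workers and defeat the
        # timeout; wait=False lets them finish in the background while
        # the (unhealthy) pod waits to be recycled.
        executor.shutdown(wait=False)
```

With `wait=True` the final `shutdown` would block for as long as the slowest stalled check runs; with `wait=False` the function's runtime is bounded by `timeout_seconds` plus scheduling overhead.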
Reference `StorageSetKey.EVENTS`/`EVENTS_RO`/`EVENTS_ANALYTICS_PLATFORM` directly instead of bare strings, so the constant stays in sync with the canonical source and we don't need a string lookup at filter time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
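As a rough illustration of that change, with a stand-in enum (the real `StorageSetKey` lives in snuba's codebase; the member values here are assumptions):

```python
from enum import Enum

# Stand-in for snuba's StorageSetKey enum; member names follow the
# commit message, values are illustrative only.
class StorageSetKey(Enum):
    EVENTS = "events"
    EVENTS_RO = "events_ro"
    EVENTS_ANALYTICS_PLATFORM = "events_analytics_platform"
    REPLAYS = "replays"

# Referencing members (not bare strings) keeps this set in sync with
# the canonical source and avoids a string lookup at filter time.
ESSENTIAL_STORAGE_SETS = frozenset(
    {
        StorageSetKey.EVENTS,
        StorageSetKey.EVENTS_RO,
        StorageSetKey.EVENTS_ANALYTICS_PLATFORM,
    }
)
```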
Skip the intermediate list comprehension and filter inside the unique-cluster loop directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
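Filtering inside the unique-cluster loop can be sketched like so. This is a toy version: the real code uses `StorageSetKey` members and snuba's cluster objects, while plain strings and duck-typed storages are used here only to keep the sketch self-contained.

```python
# Essential storage sets, as strings purely for illustration.
ESSENTIAL_STORAGE_SETS = {"events", "events_ro", "events_analytics_platform"}

def unique_essential_clusters(storages):
    """Collect each essential storage's cluster exactly once,
    filtering inside the loop instead of first building an
    intermediate list of essential storages."""
    seen = set()
    clusters = []
    for storage in storages:
        if storage.storage_set_key not in ESSENTIAL_STORAGE_SETS:
            continue  # non-essential: skip without materializing a list
        cluster = storage.get_cluster()
        if id(cluster) in seen:
            continue  # several storages can share one cluster
        seen.add(id(cluster))
        clusters.append(cluster)
    return clusters
```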
The stall sleep just has to outrun the 5s assertion bound; 10s is plenty of headroom and shaves 20s off the worst-case test runtime if the regression ever resurfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +199 to +203
    assert stall_started.wait(timeout=2.0), "stalled worker never started"
    assert not result
    # Generous bound: timeout + executor teardown + test scheduler jitter, but
    # still well short of the 10s sleep. Without shutdown(wait=False) this
    # would wait ~10s.
Contributor
Bug: The test passes due to a KeyError from a mocked storage key, which aborts the function under test and bypasses the logic it's meant to validate.
Severity: MEDIUM
Suggested Fix
Modify the test setup to prevent the KeyError. Either properly register the mocked storage in the _storage_factory or patch the get_storage call to return a valid mock storage object. This will ensure the filter_checked_storages function executes completely and the intended logic is tested.
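One way to realize that suggestion, sketched with `unittest.mock` against a stand-in factory object. Every name here is hypothetical; the real test would patch snuba's actual storage-factory path so that `get_storage` returns a valid mock instead of raising `KeyError`.

```python
import types
from unittest import mock

def _raising_get_storage(key):
    # Mirrors the failure mode described above: unknown keys raise.
    raise KeyError(key)

# Stand-in for the real storage factory module.
factory = types.SimpleNamespace(get_storage=_raising_get_storage)

# A stub storage object for the fake key to resolve to.
stub_storage = mock.Mock(name="stub_storage")

with mock.patch.object(factory, "get_storage", return_value=stub_storage):
    # Inside the patch the fake key resolves instead of raising, so the
    # filtering logic under test would run to completion.
    assert factory.get_storage("fake_storage_key") is stub_storage

# Outside the patch the original KeyError behavior is restored.
restored = False
try:
    factory.get_storage("fake_storage_key")
except KeyError:
    restored = True
assert restored
```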
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: tests/utils/test_check_clickhouse.py#L199-L203
Potential issue: The tests `test_no_essential_storages_fails` and
`test_non_essential_bad_storage_ignored` pass for the wrong reason. The test setup uses
a mocked `FakeStorageKey` which is not found in the global storage factory, causing
`get_storage` to raise a `KeyError`. This exception is caught and silently passed in
`sanity_check_clickhouse_connections`, causing the `filter_checked_storages` function to
abort prematurely. Consequently, the new logic intended to filter non-essential storages
is never actually executed or tested, making the test's success misleading.
Depends on #7924
Per INC-2141: previously `sanity_check_clickhouse_connections` returned `True` as soon as any clickhouse cluster answered. The intent was to keep the API alive when low-priority datasets (e.g. replays) had a cluster outage, but it backfired when the events_analytics_platform (EAP) cluster held onto stalled connections: some other cluster always answered fast, so health stayed green and pods that should have been recycled weren't.

This change scopes the liveness check to a small set of essential clusters and fails health if any of them is unreachable.
Identifying essential clusters
Match on `storage_set_key` (which maps 1:1 to a cluster in `settings.CLUSTERS`). Picking storages by `storage_set_key` rather than storage key is also more robust against renaming individual storages (e.g. `eap_items`): the cluster topology is the load-bearing concept here.
Behavior change
- `sanity_check_clickhouse_connections` filters to essential storages only, then requires every unique cluster to return successfully within `timeout_seconds`.
- Returns `False` if any cluster raises, any cluster fails to respond within the timeout, any essential storage has an undefined cluster, or no essential storages are found at all (defensive: should never happen in prod).
- The thorough check (`check_all_tables_present`, gated by `?thorough=true`) is unchanged.
- The `health_check_ignore_clickhouse` config knob and the `health_check.timeout_override_seconds` override are unchanged.
ThreadPoolExecutor: don't block on stalled futures
Previously the executor was used as a context manager:
`__exit__` calls `shutdown(wait=True)`, which blocks until every submitted future finishes, including stalled ones. That defeats the timeout entirely: the very scenario INC-2141 is about (EAP holding broken connections) is exactly when we'd block on the way out.

The executor is now managed manually, with `shutdown(wait=False)` in a `finally:` block. Stalled worker threads run to completion in the background; that's acceptable because the pod is failing health and will be recycled by k8s. The function itself is now bounded by the timeout regardless of cluster behavior.
Tests
- `test_sanity_check_clickhouse_connections`: unchanged, still passes (`errors_ro` is essential and reachable).
- `test_single_bad_dataset_passes_healthcheck` → renamed `test_non_essential_bad_storage_ignored`, with an updated comment: a broken non-essential storage doesn't affect health.
- `test_no_essential_storages_fails`: only non-essential storage keys configured → fails defensively.
- `test_unreachable_essential_cluster_fails_healthcheck`: mocks `_execute_show_tables` to raise → fails (the key INC-2141 regression test).
- `test_stalled_cluster_does_not_block_healthcheck`: mocks `_execute_show_tables` with a 10s sleep, asserts the function returns within 5s under a 0.2s timeout. Without `shutdown(wait=False)` this would block ~10s. This is the regression test for the executor change.
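The stalled-cluster regression test can be sketched in isolation like this. It is a toy stand-in for the real health check, with the stall shortened from 10s to 1s to keep the sketch fast; the names and bounds are illustrative, not snuba's.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_stalled_check_sketch():
    """Sketch of the regression test: a stalled worker must not delay
    the caller past its timeout. Returns the elapsed wall time."""
    stall_started = threading.Event()

    def stalled_check():
        stall_started.set()
        time.sleep(1.0)  # shortened from the PR's 10s for illustration

    executor = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    try:
        future = executor.submit(stalled_check)
        try:
            future.result(timeout=0.2)  # raises TimeoutError here
            result = True
        except Exception:
            result = False
    finally:
        # The point under test: wait=False means we don't join the
        # still-sleeping worker on the way out.
        executor.shutdown(wait=False)
    elapsed = time.monotonic() - start

    assert stall_started.wait(timeout=2.0), "stalled worker never started"
    assert not result
    # Generous bound, but still well short of the 1s sleep; with
    # shutdown(wait=True) this would be ~1s instead.
    assert elapsed < 0.8
    return elapsed
```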