fix(inc-2141): only check essential clusters in liveness healthcheck #7925

Open

onewland wants to merge 7 commits into master from oliver/inc-2141-essential-health-check

Conversation

@onewland onewland commented May 6, 2026

Depends on #7924

Per INC-2141: previously sanity_check_clickhouse_connections returned True as soon as any clickhouse cluster answered. The intent was to keep the API alive when low-priority datasets (e.g. replays) had a cluster outage, but it backfired when the events_analytics_platform (EAP) cluster held onto stalled connections — some other cluster always answered fast, so health stayed green and pods that should have been recycled weren't.

This change scopes the liveness check to a small set of essential clusters and fails health if any of them is unreachable.
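
In any()/all() terms, the semantic shift is roughly the following. This is a sketch of the semantics only, not the PR's code; cluster_ok stands in for the real connection probe:

# Sketch only: old behavior vs. new behavior of the liveness check.
def old_check(cluster_ok, all_clusters) -> bool:
    # Healthy as long as ANY cluster answers.
    return any(cluster_ok(c) for c in all_clusters)

def new_check(cluster_ok, essential_clusters) -> bool:
    # Healthy only if there is at least one essential cluster and EVERY one answers.
    return bool(essential_clusters) and all(cluster_ok(c) for c in essential_clusters)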

Identifying essential clusters

Match on storage_set_key (which maps 1:1 to a cluster in settings.CLUSTERS)

ESSENTIAL_STORAGE_SET_KEYS = {
    StorageSetKey.EVENTS,                     # errors storage
    StorageSetKey.EVENTS_RO,                  # errors_ro storage
    StorageSetKey.EVENTS_ANALYTICS_PLATFORM,
}

Picking storages by storage_set_key rather than storage key is also more robust against renaming individual storages (e.g. eap_items) — the cluster topology is the load-bearing concept here.
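
A minimal sketch of that resolution step, assuming accessors shaped like snuba's storage factory; the callables are passed in as parameters because the exact helper names are not quoted in this PR:

from typing import Any, Callable, Iterable, List, Set

def unique_essential_clusters(
    storage_keys: Iterable[Any],
    get_storage: Callable[[Any], Any],
    essential_set_keys: Set[Any],
) -> List[Any]:
    # Collect the distinct clusters backing the essential storage sets.
    clusters: List[Any] = []
    for storage_key in storage_keys:
        storage = get_storage(storage_key)
        if storage.get_storage_set_key() not in essential_set_keys:
            continue  # non-essential storages never gate liveness
        cluster = storage.get_cluster()
        if cluster not in clusters:  # several storages can share one cluster
            clusters.append(cluster)
    return clusters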

Behavior change

  • sanity_check_clickhouse_connections filters to essential storages only, then requires every unique cluster to return successfully within timeout_seconds.
  • Returns False if any cluster raises, any cluster fails to respond within the timeout, any essential storage has an undefined cluster, or no essential storages are found at all (defensive — should never happen in prod).
  • The thorough path (check_all_tables_present, gated by ?thorough=true) is unchanged.
  • The health_check_ignore_clickhouse config knob and the health_check.timeout_override_seconds override are unchanged.

ThreadPoolExecutor: don't block on stalled futures

Previously the executor was used as a context manager:

with ThreadPoolExecutor(...) as executor:
    ...

__exit__ calls shutdown(wait=True), which blocks until every submitted future finishes — including stalled ones. That defeats the timeout entirely: the very scenario INC-2141 is about (EAP holding broken connections) is exactly when we'd block on the way out.

The executor is now managed manually with shutdown(wait=False) in a finally:

executor = ThreadPoolExecutor(...)
try:
    ...
finally:
    executor.shutdown(wait=False)

Stalled worker threads run to completion in the background; that's acceptable because the pod is failing health and will be recycled by k8s. The function itself is now bounded by the timeout regardless of cluster behavior.
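
Putting the pieces together, a minimal sketch of the bounded check could look like this; the probe parameter plays the role of the _execute_show_tables helper mentioned in the tests, and the real function's exact signature is an assumption:

from concurrent.futures import ThreadPoolExecutor, wait
from typing import Any, Callable, Sequence

def sanity_check_sketch(
    clusters: Sequence[Any],
    probe: Callable[[Any], Any],  # plays the role of _execute_show_tables
    timeout_seconds: float,
) -> bool:
    if not clusters:
        return False  # defensive: no essential clusters resolved at all
    executor = ThreadPoolExecutor(max_workers=len(clusters))
    try:
        futures = [executor.submit(probe, cluster) for cluster in clusters]
        done, not_done = wait(futures, timeout=timeout_seconds)
        if not_done:
            return False  # an essential cluster stalled past the timeout
        # Any probe that raised marks the pod unhealthy.
        return all(f.exception() is None for f in done)
    finally:
        # Never block on stalled workers; k8s will recycle the failing pod.
        executor.shutdown(wait=False)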

Tests

  • test_sanity_check_clickhouse_connections — unchanged, still passes (errors_ro is essential and reachable).
  • test_single_bad_dataset_passes_healthcheck → renamed test_non_essential_bad_storage_ignored with updated comment: a broken non-essential storage doesn't affect health.
  • New test_no_essential_storages_fails: only non-essential storage keys configured → fails defensively.
  • New test_unreachable_essential_cluster_fails_healthcheck: mocks _execute_show_tables to raise → fails (the key INC-2141 regression test).
  • New test_stalled_cluster_does_not_block_healthcheck: mocks _execute_show_tables with a 10s sleep, asserts the function returns within 5s under a 0.2s timeout. Without shutdown(wait=False) this would block ~10s. This is the regression test for the executor change; a rough sketch follows below.
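
A rough shape of that last regression test, written against the sanity_check_sketch above rather than the real function; the event handshake, stall duration, and bounds mirror the description, but the PR's actual mock targets differ:

import threading
import time

def test_stalled_cluster_does_not_block_sketch() -> None:
    stall_started = threading.Event()

    def stalled_probe(cluster: object) -> None:
        stall_started.set()
        time.sleep(10)  # simulate a cluster holding a broken connection

    start = time.monotonic()
    result = sanity_check_sketch(["eap_cluster"], stalled_probe, timeout_seconds=0.2)
    elapsed = time.monotonic() - start

    assert stall_started.wait(timeout=2.0), "stalled worker never started"
    assert not result
    # Generous bound: timeout + executor teardown + scheduler jitter, still
    # well short of the 10s sleep. With shutdown(wait=True) this would block
    # for roughly the full 10 seconds instead.
    assert elapsed < 5.0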

onewland and others added 6 commits May 6, 2026 16:40
The `/tmp/snuba.down` drain mechanism worked by having a uwsgi worker
consult `uwsgi.started_on` and compare it to the down file's mtime. When
we migrated from uwsgi to granian (#7566), `pyuwsgi` was dropped from the
runtime deps and Dockerfile, so `import uwsgi` always fails and
`check_down_file_exists()` has been a no-op ever since. The ops repo no
longer references this drain mechanism either.

Remove the entire path: the uwsgi import block, `shutdown_time`,
`_set_shutdown`, `get_shutdown`, `check_down_file_exists`,
`check_shutdown` and its call sites, and the now-pointless
`down_file_exists` field/tag in /health output. /health_envoy is left as
a static 200 tombstone so existing envoy callers don't 404 — it can be
removed once envoy config no longer references it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously sanity_check_clickhouse_connections returned True if any
cluster responded, so a stalled high-traffic cluster (EAP) couldn't
fail health and trigger a pod recycle as long as some lower-priority
cluster answered. Restrict the check to clusters backing errors,
errors_ro, and events_analytics_platform, and require all of them to
respond successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Using `with ThreadPoolExecutor(...)` calls shutdown(wait=True) on exit,
which blocks until every submitted future finishes — including stalled
ones. That defeats the timeout entirely: the very scenario INC-2141 is
about (EAP holding broken connections) is exactly when we'd block here.

Manage the executor manually and call shutdown(wait=False) so the method
returns within the timeout regardless. Stalled worker threads run to
completion in the background; that's acceptable because the pod is
failing health and will be recycled by k8s.

Adds a regression test that submits a 30s sleep as the cluster check and
asserts the function returns within the timeout window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reference StorageSetKey.EVENTS/EVENTS_RO/EVENTS_ANALYTICS_PLATFORM
directly instead of bare strings, so the constant stays in sync with
the canonical source and we don't need a string lookup at filter time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Skip the intermediate list comprehension and filter inside the
unique-cluster loop directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The stall sleep just has to outrun the 5s assertion bound; 10s is
plenty of headroom and shaves 20s off the worst-case test runtime if
the regression ever resurfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@onewland onewland marked this pull request as ready for review May 7, 2026 13:37
@onewland onewland requested a review from a team as a code owner May 7, 2026 13:37
Base automatically changed from oliver/remove-down-file-shutdown to master May 7, 2026 21:05

linear-code Bot commented May 7, 2026

INC-2141

Comment on lines +199 to +203
assert stall_started.wait(timeout=2.0), "stalled worker never started"
assert not result
# Generous bound: timeout + executor teardown + test scheduler jitter, but
# still well short of the 10s sleep. Without shutdown(wait=False) this
# would wait ~10s.

Bug: The test passes due to a KeyError from a mocked storage key, which aborts the function under test and bypasses the logic it's meant to validate.
Severity: MEDIUM

Suggested Fix

Modify the test setup to prevent the KeyError. Either properly register the mocked storage in the _storage_factory or patch the get_storage call to return a valid mock storage object. This will ensure the filter_checked_storages function executes completely and the intended logic is tested.
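
One way the suggested fix could be sketched; the patch approach, FakeStorageKey, and any import paths are assumptions about this repo rather than verified code:

from unittest import mock

def make_get_storage_stub(real_get_storage, fake_key, fake_set_key):
    # Resolve the fake key to a mock storage (so get_storage no longer raises
    # KeyError) and defer every other key to the real factory, letting
    # filter_checked_storages run to completion over the fake storage too.
    def _get_storage(storage_key):
        if storage_key == fake_key:
            storage = mock.MagicMock()
            storage.get_storage_set_key.return_value = fake_set_key
            return storage
        return real_get_storage(storage_key)
    return _get_storage

Patched in place of the module-level get_storage used by the health check, a stub like this lets the fake storage reach the filter and be classified as non-essential, so the test exercises the new logic instead of an exception path.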

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: tests/utils/test_check_clickhouse.py#L199-L203

Potential issue: The tests `test_no_essential_storages_fails` and
`test_non_essential_bad_storage_ignored` pass for the wrong reason. The test setup uses
a mocked `FakeStorageKey` which is not found in the global storage factory, causing
`get_storage` to raise a `KeyError`. This exception is caught and silently passed in
`sanity_check_clickhouse_connections`, causing the `filter_checked_storages` function to
abort prematurely. Consequently, the new logic intended to filter non-essential storages
is never actually executed or tested, making the test's success misleading.

