fix(inc-2141): only check essential clusters in liveness healthcheck #7925

Open

onewland wants to merge 7 commits into master from oliver/inc-2141-essential-health-check

Conversation

@onewland onewland commented May 6, 2026

Depends on #7924

Per INC-2141: previously sanity_check_clickhouse_connections returned True as soon as any clickhouse cluster answered. The intent was to keep the API alive when low-priority datasets (e.g. replays) had a cluster outage, but it backfired when the events_analytics_platform (EAP) cluster held onto stalled connections — some other cluster always answered fast, so health stayed green and pods that should have been recycled weren't.

This change scopes the liveness check to a small set of essential clusters and fails health if any of them is unreachable.
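
In any()/all() terms, the semantic shift is roughly the following. This is a sketch of the semantics only, not the PR's code; cluster_ok stands in for the real connection probe:

# Sketch only: old behavior vs. new behavior of the liveness check.
def old_check(cluster_ok, all_clusters) -> bool:
    # Healthy as long as ANY cluster answers.
    return any(cluster_ok(c) for c in all_clusters)

def new_check(cluster_ok, essential_clusters) -> bool:
    # Healthy only if there is at least one essential cluster and EVERY one answers.
    return bool(essential_clusters) and all(cluster_ok(c) for c in essential_clusters)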

Identifying essential clusters

Match on storage_set_key (which maps 1:1 to a cluster in settings.CLUSTERS)

ESSENTIAL_STORAGE_SET_KEYS = {
    StorageSetKey.EVENTS,                     # errors storage
    StorageSetKey.EVENTS_RO,                  # errors_ro storage
    StorageSetKey.EVENTS_ANALYTICS_PLATFORM,
}

Picking storages by storage_set_key rather than storage key is also more robust against renaming individual storages (e.g. eap_items) — the cluster topology is the load-bearing concept here.
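
A minimal sketch of that resolution step, assuming accessors shaped like snuba's storage factory; the callables are passed in as parameters because the exact helper names are not quoted in this PR:

from typing import Any, Callable, Iterable, List, Set

def unique_essential_clusters(
    storage_keys: Iterable[Any],
    get_storage: Callable[[Any], Any],
    essential_set_keys: Set[Any],
) -> List[Any]:
    # Collect the distinct clusters backing the essential storage sets.
    clusters: List[Any] = []
    for storage_key in storage_keys:
        storage = get_storage(storage_key)
        if storage.get_storage_set_key() not in essential_set_keys:
            continue  # non-essential storages never gate liveness
        cluster = storage.get_cluster()
        if cluster not in clusters:  # several storages can share one cluster
            clusters.append(cluster)
    return clusters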

Behavior change

  • sanity_check_clickhouse_connections filters to essential storages only, then requires every unique cluster to return successfully within timeout_seconds.
  • Returns False if any cluster raises, any cluster fails to respond within the timeout, any essential storage has an undefined cluster, or no essential storages are found at all (defensive — should never happen in prod).
  • The thorough path (check_all_tables_present, gated by ?thorough=true) is unchanged.
  • The health_check_ignore_clickhouse config knob and the health_check.timeout_override_seconds override are unchanged.

ThreadPoolExecutor: don't block on stalled futures

Previously the executor was used as a context manager:

with ThreadPoolExecutor(...) as executor:
    ...

__exit__ calls shutdown(wait=True), which blocks until every submitted future finishes — including stalled ones. That defeats the timeout entirely: the very scenario INC-2141 is about (EAP holding broken connections) is exactly when we'd block on the way out.

The executor is now managed manually with shutdown(wait=False) in a finally:

executor = ThreadPoolExecutor(...)
try:
    ...
finally:
    executor.shutdown(wait=False)

Stalled worker threads run to completion in the background; that's acceptable because the pod is failing health and will be recycled by k8s. The function itself is now bounded by the timeout regardless of cluster behavior.
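
Putting the pieces together, a minimal sketch of the bounded check could look like this; the probe parameter plays the role of the _execute_show_tables helper mentioned in the tests, and the real function's exact signature is an assumption:

from concurrent.futures import ThreadPoolExecutor, wait
from typing import Any, Callable, Sequence

def sanity_check_sketch(
    clusters: Sequence[Any],
    probe: Callable[[Any], Any],  # plays the role of _execute_show_tables
    timeout_seconds: float,
) -> bool:
    if not clusters:
        return False  # defensive: no essential clusters resolved at all
    executor = ThreadPoolExecutor(max_workers=len(clusters))
    try:
        futures = [executor.submit(probe, cluster) for cluster in clusters]
        done, not_done = wait(futures, timeout=timeout_seconds)
        if not_done:
            return False  # an essential cluster stalled past the timeout
        # Any probe that raised marks the pod unhealthy.
        return all(f.exception() is None for f in done)
    finally:
        # Never block on stalled workers; k8s will recycle the failing pod.
        executor.shutdown(wait=False)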

Tests

  • test_sanity_check_clickhouse_connections — unchanged, still passes (errors_ro is essential and reachable).
  • test_single_bad_dataset_passes_healthcheck → renamed test_non_essential_bad_storage_ignored with updated comment: a broken non-essential storage doesn't affect health.
  • New test_no_essential_storages_fails: only non-essential storage keys configured → fails defensively.
  • New test_unreachable_essential_cluster_fails_healthcheck: mocks _execute_show_tables to raise → fails (the key INC-2141 regression test).
  • New test_stalled_cluster_does_not_block_healthcheck: mocks _execute_show_tables with a 10s sleep, asserts the function returns within 5s under a 0.2s timeout. Without shutdown(wait=False) this would block ~10s. This is the regression test for the executor change; a rough sketch follows below.
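
A rough shape of that last regression test, written against the sanity_check_sketch above rather than the real function; the event handshake, stall duration, and bounds mirror the description, but the PR's actual mock targets differ:

import threading
import time

def test_stalled_cluster_does_not_block_sketch() -> None:
    stall_started = threading.Event()

    def stalled_probe(cluster: object) -> None:
        stall_started.set()
        time.sleep(10)  # simulate a cluster holding a broken connection

    start = time.monotonic()
    result = sanity_check_sketch(["eap_cluster"], stalled_probe, timeout_seconds=0.2)
    elapsed = time.monotonic() - start

    assert stall_started.wait(timeout=2.0), "stalled worker never started"
    assert not result
    # Generous bound: timeout + executor teardown + scheduler jitter, still
    # well short of the 10s sleep. With shutdown(wait=True) this would block
    # for roughly the full 10 seconds instead.
    assert elapsed < 5.0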

onewland and others added 6 commits May 6, 2026 16:40
The `/tmp/snuba.down` drain mechanism worked by having a uwsgi worker
consult `uwsgi.started_on` and compare it to the down file's mtime. When
we migrated from uwsgi to granian (#7566), `pyuwsgi` was dropped from the
runtime deps and Dockerfile, so `import uwsgi` always fails and
`check_down_file_exists()` has been a no-op ever since. The ops repo no
longer references this drain mechanism either.

Remove the entire path: the uwsgi import block, `shutdown_time`,
`_set_shutdown`, `get_shutdown`, `check_down_file_exists`,
`check_shutdown` and its call sites, and the now-pointless
`down_file_exists` field/tag in /health output. /health_envoy is left as
a static 200 tombstone so existing envoy callers don't 404 — it can be
removed once envoy config no longer references it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously sanity_check_clickhouse_connections returned True if any
cluster responded, so a stalled high-traffic cluster (EAP) couldn't
fail health and trigger a pod recycle as long as some lower-priority
cluster answered. Restrict the check to clusters backing errors,
errors_ro, and events_analytics_platform, and require all of them to
respond successfully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Using `with ThreadPoolExecutor(...)` calls shutdown(wait=True) on exit,
which blocks until every submitted future finishes — including stalled
ones. That defeats the timeout entirely: the very scenario INC-2141 is
about (EAP holding broken connections) is exactly when we'd block here.

Manage the executor manually and call shutdown(wait=False) so the method
returns within the timeout regardless. Stalled worker threads run to
completion in the background; that's acceptable because the pod is
failing health and will be recycled by k8s.

Adds a regression test that submits a 30s sleep as the cluster check and
asserts the function returns within the timeout window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reference StorageSetKey.EVENTS/EVENTS_RO/EVENTS_ANALYTICS_PLATFORM
directly instead of bare strings, so the constant stays in sync with
the canonical source and we don't need a string lookup at filter time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Skip the intermediate list comprehension and filter inside the
unique-cluster loop directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The stall sleep just has to outrun the 5s assertion bound; 10s is
plenty of headroom and shaves 20s off the worst-case test runtime if
the regression ever resurfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@onewland onewland marked this pull request as ready for review May 7, 2026 13:37
@onewland onewland requested a review from a team as a code owner May 7, 2026 13:37
Base automatically changed from oliver/remove-down-file-shutdown to master May 7, 2026 21:05

linear-code Bot commented May 7, 2026

INC-2141

Comment on lines +199 to +203
assert stall_started.wait(timeout=2.0), "stalled worker never started"
assert not result
# Generous bound: timeout + executor teardown + test scheduler jitter, but
# still well short of the 10s sleep. Without shutdown(wait=False) this
# would wait ~10s.

Bug: The test passes due to a KeyError from a mocked storage key, which aborts the function under test and bypasses the logic it's meant to validate.
Severity: MEDIUM

Suggested Fix

Modify the test setup to prevent the KeyError. Either properly register the mocked storage in the _storage_factory or patch the get_storage call to return a valid mock storage object. This will ensure the filter_checked_storages function executes completely and the intended logic is tested.
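
One way the suggested fix could be sketched; the patch approach, FakeStorageKey, and any import paths are assumptions about this repo rather than verified code:

from unittest import mock

def make_get_storage_stub(real_get_storage, fake_key, fake_set_key):
    # Resolve the fake key to a mock storage (so get_storage no longer raises
    # KeyError) and defer every other key to the real factory, letting
    # filter_checked_storages run to completion over the fake storage too.
    def _get_storage(storage_key):
        if storage_key == fake_key:
            storage = mock.MagicMock()
            storage.get_storage_set_key.return_value = fake_set_key
            return storage
        return real_get_storage(storage_key)
    return _get_storage

Patched in place of the module-level get_storage used by the health check, a stub like this lets the fake storage reach the filter and be classified as non-essential, so the test exercises the new logic instead of an exception path.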

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: tests/utils/test_check_clickhouse.py#L199-L203

Potential issue: The tests `test_no_essential_storages_fails` and
`test_non_essential_bad_storage_ignored` pass for the wrong reason. The test setup uses
a mocked `FakeStorageKey` which is not found in the global storage factory, causing
`get_storage` to raise a `KeyError`. This exception is caught and silently passed in
`sanity_check_clickhouse_connections`, causing the `filter_checked_storages` function to
abort prematurely. Consequently, the new logic intended to filter non-essential storages
is never actually executed or tested, making the test's success misleading.

