fix: recover IDLE gRPC connections in health monitor#233
Conversation
Add restart_count field (uint32, field 14) to the HostInfo protobuf message. This allows hadron to report its systemd restart counter to ion during the session hello, enabling crash-loop detection in the debug machines endpoint.

- gravity_session.proto: added restart_count = 14
- gravity_session.pb.go: regenerated
- grpc_client.go: populate RestartCount from GravityConfig
- types.go: added RestartCount to GravityConfig
When an ion restarts (tableflip), the gRPC connection transitions to IDLE state. In this state, gRPC does not attempt to reconnect until a new RPC call is issued or Connect() is explicitly called. The health monitor detected IDLE connections (logging 'connection N is unhealthy, state: IDLE') but never attempted recovery — causing hadrons to permanently lose tunnel connectivity to restarted ions.

This caused a 48-minute full outage in us-central: all 3 gravity endpoints went IDLE after an ion rollout, hadrons had 0 tunnel streams, and no deployments could be provisioned or serve traffic. A manual hadron restart was required to recover.

Fix: when performHealthCheck detects an IDLE connection, call conn.Connect() to force gRPC to transition from IDLE → CONNECTING → READY (or TRANSIENT_FAILURE if the server is unreachable, which has its own retry logic). Track consecutive idle counts for observability.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough

Adds per-connection idle counters to StreamManager, initializes and grows them with endpoint slots, updates performHealthCheck to increment/reset counters and call conn.Connect() for unhealthy IDLEs, escalates persistent IDLEs to trigger endpoint disconnection and stream rebuilds, and adds tests covering these behaviors and edge cases.

Changes
🚥 Pre-merge checks: ✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Actionable comments posted: 1
🧹 Nitpick comments (1)
gravity/grpc_client.go (1)
5611-5647: Well-implemented IDLE recovery logic.

The implementation correctly addresses the root cause from the PR objectives:

- Detects IDLE state via `state.String() == "IDLE"`
- Calls `conn.Connect()` to force transition to CONNECTING
- Tracks consecutive idle counts for observability
- Resets counters appropriately on state transitions

The defensive bounds checking (`i < len(g.streamManager.connectionIdleCount)`) handles edge cases where arrays may be out of sync.

The `idleCount` is tracked but currently only used for logging. If repeated IDLE transitions become problematic, consider adding a threshold (e.g., after N consecutive IDLEs, trigger full endpoint reconnection):

```go
// Future enhancement: escalate if IDLE persists across multiple health checks
if idleCount > 3 {
    g.logger.Warn("connection %d stuck in IDLE after %d attempts, triggering endpoint reconnection",
        i, idleCount)
    go g.handleEndpointDisconnection(i, "persistent_idle")
}
```

This is optional since the current implementation should handle normal tableflip scenarios.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@gravity/grpc_client.go` around lines 5611 - 5647, Approve the current IDLE recovery logic but optionally add escalation for persistent IDLEs: in the health-check loop where connectionIdleCount is incremented (refer to g.streamManager.connectionIdleCount, conn.Connect(), and the for loop over g.connections), after incrementing idleCount check a threshold (e.g., >3) and then log a warning and trigger an endpoint reconnection/handling routine (e.g., call go g.handleEndpointDisconnection(i, "persistent_idle")) so repeated IDLEs escalate beyond just logging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@gravity/grpc_client.go`:
- Line 923: Start initializes streamManager.connectionIdleCount with make([]int,
connectionCount), but addEndpoint (method addEndpoint) only grows
streamManager.connectionHealth when adding endpoints dynamically, leaving
connectionIdleCount unexpanded and breaking idle recovery tracking; update
addEndpoint to append or re-slice streamManager.connectionIdleCount in the same
place it grows streamManager.connectionHealth so that connectionIdleCount length
matches the new endpoint count (mirror the growth logic used in Start for
consistency).
---
Nitpick comments:
In `@gravity/grpc_client.go`:
- Around line 5611-5647: Approve the current IDLE recovery logic but optionally
add escalation for persistent IDLEs: in the health-check loop where
connectionIdleCount is incremented (refer to
g.streamManager.connectionIdleCount, conn.Connect(), and the for loop over
g.connections), after incrementing idleCount check a threshold (e.g., >3) and
then log a warning and trigger an endpoint reconnection/handling routine (e.g.,
call go g.handleEndpointDisconnection(i, "persistent_idle")) so repeated IDLEs
escalate beyond just logging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: aee4ecd3-b74a-4899-9b97-8d5ebf20f6b0
📒 Files selected for processing (1)
gravity/grpc_client.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (go)
🔇 Additional comments (2)
gravity/grpc_client.go (2)
416-418: LGTM! The new `connectionIdleCount` field is well-placed alongside `connectionHealth` and follows the existing pattern for per-connection tracking.

658-658: LGTM! Correct initialization of `connectionIdleCount` to match `connectionCount`, consistent with the `connectionHealth` initialization pattern.
Two fixes from PR review:

1. addEndpoint grows connectionHealth when adding endpoints dynamically but did not grow connectionIdleCount, leaving it shorter than connectionHealth. This would cause idle recovery tracking to silently skip dynamically-added endpoints. Mirror the growth logic.

2. When a connection stays IDLE for >3 consecutive health checks, escalate beyond just conn.Connect() — trigger a full endpoint reconnection via handleEndpointDisconnection, which tears down and rebuilds all streams for that endpoint. This handles cases where the underlying gRPC connection is fundamentally broken and Connect() alone cannot recover it.
… escalation

- TestAddEndpoint_GrowsConnectionIdleCount: verifies connectionIdleCount is grown alongside connectionHealth when addEndpoint adds a new endpoint dynamically (prevents idle recovery from silently skipping dynamically-added endpoints)
- TestPerformHealthCheck_PersistentIDLEEscalates: verifies that when connectionIdleCount exceeds the escalation threshold (>3), performHealthCheck triggers handleEndpointDisconnection without panicking (pre-sets the count to 3, runs one health check, verifies count=4 and no crash from the async escalation)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@gravity/hardening_test.go`:
- Around line 854-873: The test seeds connectionIdleCount too high (3) so
performHealthCheck() triggers escalation and concurrent writes; change the seed
to a below-threshold value (e.g., 2) so the test only verifies increment
behavior, and when asserting connectionIdleCount and connectionHealth read them
while holding the streamManager.healthMu lock (use streamManager.healthMu to
protect access) to avoid unlocked race conditions with performHealthCheck().
Ensure you update references to connectionIdleCount, connectionHealth,
performHealthCheck, and streamManager.healthMu in the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 72d60b7a-5308-4d33-985e-7bf71ea3a054
📒 Files selected for processing (1)
gravity/hardening_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (go)
🧰 Additional context used
🪛 GitHub Actions: Go
gravity/hardening_test.go
[error] 871-871: go test -race detected a DATA RACE (read) in TestPerformHealthCheck_IDLECounterIncrementsOnConsecutiveIDLE at /home/runner/work/go-common/go-common/gravity/hardening_test.go:871 during parallel execution; race reported as write at grpc_client.go:3679.
…ion, lifecycle tests

P1 — Connection state transitions:
- SHUTDOWN connection: no recovery attempt, counter reset
- Nil connection: marked unhealthy without panic
- IDLE → CONNECTING transition: idle counter resets to 0

P1 — Endpoint disconnection cascade:
- Single endpoint IDLE while others healthy: targeted recovery
- All endpoints IDLE simultaneously: all get recovery (rolling restart)
- Invalid endpoint index: no panic
- Already-reconnecting guard: duplicate disconnection skipped

P2 — Stream + connection interaction:
- Streams on IDLE connection: not recovered (connReady=false)
- Stream recovery blocked until connection reaches READY
- refreshEndpointHealth: derives from connectionHealth correctly
- All connections unhealthy: all endpoints marked unhealthy

P3 — Timing & lifecycle:
- Context cancellation: health monitor goroutine exits promptly
- Empty connections: no panic
- Mismatched array lengths: documents current panic behavior (connectionHealth shorter than connections — needs bounds fix)

Also found a real bug: performHealthCheck panics when the connectionHealth array is shorter than the connections array. Documented as TestPerformHealthCheck_MismatchedArrayLengths_Panics.
performHealthCheck iterated over g.connections but indexed into g.streamManager.connectionHealth without bounds checking. If connectionHealth was shorter than connections (e.g., during dynamic endpoint addition before the arrays are grown), this caused an index-out-of-range panic.

Fix: break the loop when i >= len(connectionHealth). Connections beyond the health array are skipped — they'll be picked up once the arrays are grown by addEndpoint.

Updated TestPerformHealthCheck_MismatchedArrayLengths_NoPanic to verify the fix (it previously documented the panic).
♻️ Duplicate comments (1)
gravity/hardening_test.go (1)
850-874: ⚠️ Potential issue | 🔴 Critical

Data race: escalation triggers async writes that race with unlocked reads.

Pre-setting `connectionIdleCount` to `3` makes the next `performHealthCheck()` cross the `>3` threshold and trigger escalation logic, which spawns an async goroutine that can mutate health state. The unlocked reads at lines 865-872 race with those writes. The pipeline failure confirms this: `go test -race` detected a DATA RACE at line 871.

Fix options:

- Use a below-threshold seed (e.g., `2`) so this test validates only increment behavior without triggering escalation
- Read shared state under `healthMu.RLock()` (as done correctly in `TestPerformHealthCheck_PersistentIDLEEscalates` at lines 1167-1170)

🔧 Proposed fix (option 1, simpler):

```diff
-	// Pre-set the idle count to 3 to simulate 3 previous IDLE detections.
-	g.streamManager.connectionIdleCount = []int{3}
+	// Pre-set below escalation threshold so this test validates increment-only behavior.
+	g.streamManager.connectionIdleCount = []int{2}

 	// Create a fresh IDLE connection (simulates a connection that went back to IDLE).
 	conn := newIdleGRPCConn(t)
@@ -862,13 +862,13 @@
 	g.performHealthCheck()

-	// Counter should have incremented from 3 to 4.
-	if g.streamManager.connectionIdleCount[0] != 4 {
-		t.Fatalf("expected connectionIdleCount[0]=4 (incremented from 3), got %d",
+	// Counter should have incremented from 2 to 3.
+	if g.streamManager.connectionIdleCount[0] != 3 {
+		t.Fatalf("expected connectionIdleCount[0]=3 (incremented from 2), got %d",
 			g.streamManager.connectionIdleCount[0])
 	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@gravity/hardening_test.go` around lines 850 - 874, The test pre-seeds g.streamManager.connectionIdleCount = []int{3} which triggers the escalation goroutine inside performHealthCheck and causes a data race with the subsequent unlocked reads; change the seed to a below-threshold value (e.g., set g.streamManager.connectionIdleCount = []int{2}) so the test only verifies the increment behavior without spawning the async escalation, or alternatively read the shared state under the streamManager.healthMu.RLock() before asserting (use streamManager.healthMu.RLock()/RUnlock() around reads of connectionIdleCount and connectionHealth) to eliminate the race.
🧹 Nitpick comments (2)
gravity/hardening_test.go (2)
1226-1242: Consider asserting that the idle count is unchanged for nil connections.

The comment at line 1240 states the idle count should not be modified, but there's no assertion to verify this. Adding one would strengthen the test:

💡 Suggested addition

```diff
 	if g.streamManager.connectionHealth[0] {
 		t.Fatal("expected nil connection to be marked unhealthy")
 	}
-	// Idle count should not be modified for nil connections
-	// (the nil path doesn't touch connectionIdleCount)
+	// Idle count should not be modified for nil connections
+	if g.streamManager.connectionIdleCount[0] != 5 {
+		t.Fatalf("expected connectionIdleCount unchanged (5) for nil connection, got %d",
+			g.streamManager.connectionIdleCount[0])
+	}
 }
```
Verify each finding against the current code and only fix it if needed. In `@gravity/hardening_test.go` around lines 1226 - 1242, Add an assertion to TestPerformHealthCheck_NilConnection_MarkedUnhealthy to verify that g.streamManager.connectionIdleCount remains unchanged after calling g.performHealthCheck(); locate the test function TestPerformHealthCheck_NilConnection_MarkedUnhealthy and after the existing check for connectionHealth[0] add an assertion that g.streamManager.connectionIdleCount[0] still equals the pre-existing value (5) to ensure the nil-connection path does not modify connectionIdleCount.
1343-1369: Near-duplicate of `TestPerformHealthCheck_AllConnectionsIDLE`.

This test (`TestPerformHealthCheck_AllEndpointsIDLE_AllRecover`) is structurally identical to `TestPerformHealthCheck_AllConnectionsIDLE` (lines 991-1028). Both create 3 IDLE connections and verify all get recovery treatment.

Consider consolidating or differentiating them (e.g., one tests connection-level behavior, the other tests endpoint-level health propagation via `refreshEndpointHealth`).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@gravity/hardening_test.go` around lines 1343 - 1369, This test is a near-duplicate of TestPerformHealthCheck_AllConnectionsIDLE; either consolidate or change it to explicitly test endpoint-level propagation: update TestPerformHealthCheck_AllEndpointsIDLE_AllRecover to exercise refreshEndpointHealth by setting up g.endpoints (or whatever endpoint health structure your code uses), ensuring performHealthCheck triggers refreshEndpointHealth for each endpoint, and then assert endpoint-level health state changes in addition to per-connection checks (use symbols performHealthCheck, refreshEndpointHealth, streamManager.connectionHealth, streamManager.connectionIdleCount, and streamManager.tunnelStreams to locate and modify the logic); alternatively remove this test and keep the original TestPerformHealthCheck_AllConnectionsIDLE if no distinct behavior is required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@gravity/hardening_test.go`:
- Around line 850-874: The test pre-seeds g.streamManager.connectionIdleCount =
[]int{3} which triggers the escalation goroutine inside performHealthCheck and
causes a data race with the subsequent unlocked reads; change the seed to a
below-threshold value (e.g., set g.streamManager.connectionIdleCount = []int{2})
so the test only verifies the increment behavior without spawning the async
escalation, or alternatively read the shared state under the
streamManager.healthMu.RLock() before asserting (use
streamManager.healthMu.RLock()/RUnlock() around reads of connectionIdleCount and
connectionHealth) to eliminate the race.
---
Nitpick comments:
In `@gravity/hardening_test.go`:
- Around line 1226-1242: Add an assertion to
TestPerformHealthCheck_NilConnection_MarkedUnhealthy to verify that
g.streamManager.connectionIdleCount remains unchanged after calling
g.performHealthCheck(); locate the test function
TestPerformHealthCheck_NilConnection_MarkedUnhealthy and after the existing
check for connectionHealth[0] add an assertion that
g.streamManager.connectionIdleCount[0] still equals the pre-existing value (5)
to ensure the nil-connection path does not modify connectionIdleCount.
- Around line 1343-1369: This test is a near-duplicate of
TestPerformHealthCheck_AllConnectionsIDLE; either consolidate or change it to
explicitly test endpoint-level propagation: update
TestPerformHealthCheck_AllEndpointsIDLE_AllRecover to exercise
refreshEndpointHealth by setting up g.endpoints (or whatever endpoint health
structure your code uses), ensuring performHealthCheck triggers
refreshEndpointHealth for each endpoint, and then assert endpoint-level health
state changes in addition to per-connection checks (use symbols
performHealthCheck, refreshEndpointHealth, streamManager.connectionHealth,
streamManager.connectionIdleCount, and streamManager.tunnelStreams to locate and
modify the logic); alternatively remove this test and keep the original
TestPerformHealthCheck_AllConnectionsIDLE if no distinct behavior is required.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e2a4b586-a67f-4d64-ad33-7a060496d762
📒 Files selected for processing (1)
gravity/hardening_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (go)
🧰 Additional context used
🪛 GitHub Actions: Go
gravity/hardening_test.go
[error] 871-871: WARNING: DATA RACE (race detected by Go race detector). Previous read at 0x00c0004729af by goroutine 1355 in TestPerformHealthCheck_IDLECounterIncrementsOnConsecutiveIDLE.
🔇 Additional comments (5)
gravity/hardening_test.go (5)
751-765: Well-designed helper for testing real gRPC IDLE state.

Using `grpc.NewClient` with a bogus target to get a lazy connection in IDLE state is a clean approach that tests real gRPC behavior rather than mocking state. The cleanup registration ensures no resource leaks.
774-808: LGTM! Core regression test for IDLE recovery.

This test correctly validates the critical fix: detecting IDLE state, marking the connection unhealthy, incrementing the counter, and calling `Connect()`. The state-change verification confirms the recovery actually triggers gRPC reconnection attempts.
1138-1182: LGTM! Correct synchronization for the escalation test.

This test properly uses `healthMu.RLock()` when reading shared state after triggering escalation, avoiding the race condition present in the consecutive-IDLE test. Good coverage of the `handleEndpointDisconnection` escalation path.
1582-1603: Good defensive test documenting the panic on mismatched array lengths.

This test documents a real bug (per the commit messages) and serves as a regression canary. If `performHealthCheck` is later fixed to handle mismatched lengths gracefully, this test will fail and prompt an update.
741-750: Comprehensive IDLE recovery test suite addresses a critical production bug.

The test organization with clear section headers (P1: Connection State Transitions, P1: Endpoint Disconnection Cascade, P2: Stream + Connection Interaction, P3: Timing & Lifecycle) makes it easy to understand coverage. Good use of `t.Skip` for timing-sensitive assertions that may be flaky in CI environments.
♻️ Duplicate comments (1)
gravity/hardening_test.go (1)
853-873: ⚠️ Potential issue | 🔴 Critical

Fix unresolved data race in the consecutive-IDLE test (it still escalates asynchronously).

`connectionIdleCount` seeded to `3` at line 853 makes the next `performHealthCheck()` increment it to `4` and trigger async escalation (`handleEndpointDisconnection`). Unlocked reads at lines 865 and 871 then race with the concurrent writes. This matches the race failure reported by CI.

Proposed patch:

```diff
-	// Pre-set the idle count to 3 to simulate 3 previous IDLE detections.
-	g.streamManager.connectionIdleCount = []int{3}
+	// Keep below escalation threshold so this test validates increment-only behavior.
+	g.streamManager.connectionIdleCount = []int{2}
@@
-	// Counter should have incremented from 3 to 4.
-	if g.streamManager.connectionIdleCount[0] != 4 {
-		t.Fatalf("expected connectionIdleCount[0]=4 (incremented from 3), got %d",
-			g.streamManager.connectionIdleCount[0])
-	}
-
-	// Health should be false (IDLE is not healthy).
-	if g.streamManager.connectionHealth[0] {
+	g.streamManager.healthMu.RLock()
+	idleCount := g.streamManager.connectionIdleCount[0]
+	isHealthy := g.streamManager.connectionHealth[0]
+	g.streamManager.healthMu.RUnlock()
+
+	// Counter should have incremented from 2 to 3.
+	if idleCount != 3 {
+		t.Fatalf("expected connectionIdleCount[0]=3 (incremented from 2), got %d", idleCount)
+	}
+
+	// Health should be false (IDLE is not healthy).
+	if isHealthy {
 		t.Fatal("expected IDLE connection to be marked unhealthy")
 	}
```
Verify each finding against the current code and only fix it if needed. In `@gravity/hardening_test.go` around lines 853 - 873, The test seeds streamManager.connectionIdleCount to 3 which causes performHealthCheck to increment it and asynchronously call handleEndpointDisconnection, creating a data race when the test reads connectionIdleCount and connectionHealth without synchronization; fix by preventing the async escalation during the assertion: either seed connectionIdleCount to a value that does not trigger escalation, or mock/disable handleEndpointDisconnection (or inject a sync hook) so performHealthCheck runs synchronously for the test; specifically update the test around connectionIdleCount, performHealthCheck, and the subsequent reads of streamManager.connectionIdleCount and streamManager.connectionHealth to avoid concurrent writes (e.g., set connectionIdleCount to 2 or stub out handleEndpointDisconnection on streamManager) so the assertions read stable state.
🧹 Nitpick comments (1)
gravity/hardening_test.go (1)
801-823: Consider extracting the state-wait logic into a small test helper.

The repeated `context.WithTimeout + WaitForStateChange + GetState` blocks increase duplication and make timeout handling less consistent. A shared helper would tighten readability and reduce flaky-test maintenance.

Also applies to: 887-896, 1017-1023
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@gravity/hardening_test.go` around lines 801 - 823, Extract the repeated context.WithTimeout + WaitForStateChange + GetState pattern into a small test helper (e.g., waitForStateChange(t *testing.T, conn *grpc.ClientConn, from connectivity.State, timeout time.Duration) or waitForStateNot(t, conn, state, timeout)) and replace the duplicated blocks in TestPerformHealthCheck_IDLECounterResets and the other locations (lines referenced around WaitForStateChange/GetState) with calls to that helper; the helper should create the context with the provided timeout, call conn.WaitForStateChange(ctx, from), check conn.GetState(), and call t.Fatalf/t.Helper() on timeout or unexpected state to preserve test failure semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@gravity/hardening_test.go`:
- Around line 853-873: The test seeds streamManager.connectionIdleCount to 3
which causes performHealthCheck to increment it and asynchronously call
handleEndpointDisconnection, creating a data race when the test reads
connectionIdleCount and connectionHealth without synchronization; fix by
preventing the async escalation during the assertion: either seed
connectionIdleCount to a value that does not trigger escalation, or mock/disable
handleEndpointDisconnection (or inject a sync hook) so performHealthCheck runs
synchronously for the test; specifically update the test around
connectionIdleCount, performHealthCheck, and the subsequent reads of
streamManager.connectionIdleCount and streamManager.connectionHealth to avoid
concurrent writes (e.g., set connectionIdleCount to 2 or stub out
handleEndpointDisconnection on streamManager) so the assertions read stable
state.
---
Nitpick comments:
In `@gravity/hardening_test.go`:
- Around line 801-823: Extract the repeated context.WithTimeout +
WaitForStateChange + GetState pattern into a small test helper (e.g.,
waitForStateChange(t *testing.T, conn *grpc.ClientConn, from connectivity.State,
timeout time.Duration) or waitForStateNot(t, conn, state, timeout)) and replace
the duplicated blocks in TestPerformHealthCheck_IDLECounterResets and the other
locations (lines referenced around WaitForStateChange/GetState) with calls to
that helper; the helper should create the context with the provided timeout,
call conn.WaitForStateChange(ctx, from), check conn.GetState(), and call
t.Fatalf/t.Helper() on timeout or unexpected state to preserve test failure
semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5f9eef56-6a7c-4019-a6a0-2267b892927c
📒 Files selected for processing (2)
gravity/grpc_client.go
gravity/hardening_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- gravity/grpc_client.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (go)
🧰 Additional context used
🪛 GitHub Actions: Go
gravity/hardening_test.go
[error] 871-871: Go test -race detected a data race (previous read) in TestPerformHealthCheck_IDLECounterIncrementsOnConsecutiveIDLE at hardening_test.go:871.
[error] 1-1: Go test -race process failed with exit code 1 due to race detected during execution of tests in package github.com/agentuity/go-common/gravity.
🔇 Additional comments (1)
gravity/hardening_test.go (1)
767-808: Strong regression coverage for the IDLE reconnect bug.

The test correctly validates the critical IDLE path: unhealthy marking, idle counter increment, and forced state transition via `Connect()`. Good production-focused hardening.
…eIDLE

The test read connectionHealth[0] immediately after performHealthCheck() returned, but the write happens in a background goroutine spawned by handleEndpointDisconnection. Wait for the async write with a polling loop under the healthMu lock.
Summary
Fixes a critical bug where gRPC connections stuck in IDLE state after ion restarts are never recovered, causing permanent tunnel loss and full region outages.
Problem
When an ion restarts via tableflip, the gRPC connection transitions to IDLE. The health monitor detected this (`connection N is unhealthy, state: IDLE` logged every 30s) but never attempted recovery. gRPC does not auto-reconnect from IDLE — it requires either a new RPC call or an explicit `Connect()`.

Production impact: 48-minute full outage in us-central. All 3 ion gravity endpoints went IDLE after rollout. Both hadrons had 0 tunnel streams. 241 deployments were unreachable. A manual hadron restart was required to recover.
Root Cause
`performHealthCheck()` in `grpc_client.go` had:

Fix
When an IDLE connection is detected, call `conn.Connect()` to force gRPC to transition from IDLE → CONNECTING → READY (or TRANSIENT_FAILURE, which has its own retry backoff):

Also tracks `connectionIdleCount` per connection for observability.

Changes
- `gravity/grpc_client.go`: added `connectionIdleCount []int` tracking, IDLE recovery via `conn.Connect()` in `performHealthCheck()`, and idle counter reset on healthy/non-IDLE states
All gravity tests pass. The fix is minimal and targeted — only adds recovery behavior to the existing health check loop.
Summary by CodeRabbit
Bug Fixes
Tests