
fix: recover IDLE gRPC connections in health monitor#233

Merged
jhaynie merged 8 commits into main from fix/hostinfo-restart-count
Apr 21, 2026

Conversation


@jhaynie jhaynie commented Apr 20, 2026

Summary

Fixes a critical bug where gRPC connections stuck in IDLE state after ion restarts are never recovered, causing permanent tunnel loss and full region outages.

Problem

When an ion restarts via tableflip, the gRPC connection transitions to IDLE. The health monitor detected this ("connection N is unhealthy, state: IDLE" was logged every 30s) but never attempted recovery. gRPC does not auto-reconnect from IDLE — it requires either a new RPC call or an explicit Connect().

Production impact: 48-minute full outage in us-central. All 3 ion gravity endpoints went IDLE after rollout. Both hadrons had 0 tunnel streams. 241 deployments were unreachable. Manual hadron restart was required to recover.

Root Cause

performHealthCheck() in grpc_client.go had:

if !isHealthy {
    g.logger.Debug("connection %d is unhealthy, state: %s", i, state.String())
    // ← nothing else — just logged and moved on
}

Fix

When an IDLE connection is detected, call conn.Connect() to force gRPC to transition from IDLE → CONNECTING → READY (or TRANSIENT_FAILURE, which has its own retry backoff):

if stateStr == "IDLE" {
    conn.Connect()
    g.logger.Info("connection %d is IDLE (count=%d), forced reconnection", i, idleCount)
}

Also tracks connectionIdleCount per connection for observability.

Changes

  • gravity/grpc_client.go: Added connectionIdleCount []int tracking, IDLE recovery via conn.Connect() in performHealthCheck(), idle counter reset on healthy/non-IDLE states

Testing

All gravity tests pass. The fix is minimal and targeted — only adds recovery behavior to the existing health check loop.

Summary by CodeRabbit

  • Bug Fixes

    • Improved per-connection idle detection and handling: idle connections now trigger reconnection attempts, maintain per-connection idle counters that reset when healthy, escalate after persistent idle, and are safely initialized and grown when endpoints are added.
  • Tests

    • Added comprehensive hardening tests covering idle recovery, multi-connection mixed states, escalation flows, bounds-safety, lifecycle/monitor shutdown, and related edge cases.

jhaynie added 3 commits April 19, 2026 18:51
Add restart_count field (uint32, field 14) to the HostInfo protobuf
message. This allows hadron to report its systemd restart counter
to ion during the session hello, enabling crash-loop detection in
the debug machines endpoint.

- gravity_session.proto: added restart_count = 14
- gravity_session.pb.go: regenerated
- grpc_client.go: populate RestartCount from GravityConfig
- types.go: added RestartCount to GravityConfig

When an ion restarts (tableflip), the gRPC connection transitions to
IDLE state. In this state, gRPC does not attempt to reconnect until a
new RPC call is issued or Connect() is explicitly called. The health
monitor detected IDLE connections (logging 'connection N is unhealthy,
state: IDLE') but never attempted recovery — causing hadrons to
permanently lose tunnel connectivity to restarted ions.

This caused a 48-minute full outage in us-central: all 3 gravity
endpoints went IDLE after ion rollout, hadrons had 0 tunnel streams,
and no deployments could be provisioned or serve traffic. A manual
hadron restart was required to recover.

Fix: when performHealthCheck detects an IDLE connection, call
conn.Connect() to force gRPC to transition from IDLE → CONNECTING →
READY (or TRANSIENT_FAILURE if the server is unreachable, which has
its own retry logic). Track consecutive idle counts for observability.

coderabbitai Bot commented Apr 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds per-connection idle counters to StreamManager, initializes and grows them with endpoint slots, updates performHealthCheck to increment/reset counters and call conn.Connect() for unhealthy IDLEs, escalate persistent IDLEs to trigger endpoint disconnection and stream rebuilds, and adds tests covering these behaviors and edge cases.

Changes

Cohort / File(s) Summary
Connection Health Tracking
gravity/grpc_client.go
Introduce connectionIdleCount []int; align struct fields and mutex usage; initialize idle counters in single- and multi-endpoint startup paths and when addEndpoint() grows slots; update performHealthCheck() to increment idle counters for unhealthy IDLE connections, call conn.Connect(), reset counters for healthy/non-IDLE, log transient vs persistent IDLEs, and asynchronously invoke handleEndpointDisconnection(i, "persistent_idle") when threshold exceeded.
Hardening Tests
gravity/hardening_test.go
Add comprehensive tests and helper newIdleGRPCConn to produce real lazy grpc.ClientConn (initial connectivity.Idle); verify idle-count increments, resets, conn.Connect() invocation, escalation after consecutive IDLEs, mixed-pool behavior, slice-length/bounds safety, addEndpoint() growth, null/SHUTDOWN handling, endpoint-disconnection helpers' no-panic behavior, and monitor goroutine shutdown on context cancel.
🚥 Pre-merge checks | ✅ 1
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
gravity/grpc_client.go (1)

5611-5647: Well-implemented IDLE recovery logic.

The implementation correctly addresses the root cause from the PR objectives:

  • Detects IDLE state via state.String() == "IDLE"
  • Calls conn.Connect() to force transition to CONNECTING
  • Tracks consecutive idle counts for observability
  • Resets counters appropriately on state transitions

The defensive bounds checking (i < len(g.streamManager.connectionIdleCount)) handles edge cases where arrays may be out of sync.

The idleCount is tracked but currently only used for logging. If repeated IDLE transitions become problematic, consider adding a threshold (e.g., after N consecutive IDLEs, trigger full endpoint reconnection):

// Future enhancement: escalate if IDLE persists across multiple health checks
if idleCount > 3 {
    g.logger.Warn("connection %d stuck in IDLE after %d attempts, triggering endpoint reconnection", i, idleCount)
    go g.handleEndpointDisconnection(i, "persistent_idle")
}

This is optional since the current implementation should handle normal tableflip scenarios.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gravity/grpc_client.go`:
- Line 923: Start initializes streamManager.connectionIdleCount with make([]int,
connectionCount), but addEndpoint (method addEndpoint) only grows
streamManager.connectionHealth when adding endpoints dynamically, leaving
connectionIdleCount unexpanded and breaking idle recovery tracking; update
addEndpoint to append or re-slice streamManager.connectionIdleCount in the same
place it grows streamManager.connectionHealth so that connectionIdleCount length
matches the new endpoint count (mirror the growth logic used in Start for
consistency).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aee4ecd3-b74a-4899-9b97-8d5ebf20f6b0

📥 Commits

Reviewing files that changed from the base of the PR and between a3221e8 and c7d62a6.

📒 Files selected for processing (1)
  • gravity/grpc_client.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (2)
gravity/grpc_client.go (2)

416-418: LGTM!

The new connectionIdleCount field is well-placed alongside connectionHealth and follows the existing pattern for per-connection tracking.


658-658: LGTM!

Correct initialization of connectionIdleCount to match connectionCount, consistent with the connectionHealth initialization pattern.

Comment thread gravity/grpc_client.go
jhaynie added 2 commits April 19, 2026 21:31
Two fixes from PR review:

1. addEndpoint grows connectionHealth when adding endpoints dynamically
   but did not grow connectionIdleCount, leaving it shorter than
   connectionHealth. This would cause idle recovery tracking to silently
   skip dynamically-added endpoints. Mirror the growth logic.

2. When a connection stays IDLE for >3 consecutive health checks,
   escalate beyond just conn.Connect() — trigger a full endpoint
   reconnection via handleEndpointDisconnection which tears down and
   rebuilds all streams for that endpoint. This handles cases where
   the underlying gRPC connection is fundamentally broken and
   Connect() alone cannot recover it.
… escalation

- TestAddEndpoint_GrowsConnectionIdleCount: verifies connectionIdleCount
  is grown alongside connectionHealth when addEndpoint adds a new
  endpoint dynamically (prevents idle recovery from silently skipping
  dynamically-added endpoints)

- TestPerformHealthCheck_PersistentIDLEEscalates: verifies that when
  connectionIdleCount exceeds the escalation threshold (>3),
  performHealthCheck triggers handleEndpointDisconnection without
  panicking (pre-sets count to 3, runs one health check, verifies
  count=4 and no crash from the async escalation)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@gravity/hardening_test.go`:
- Around line 854-873: The test seeds connectionIdleCount too high (3) so
performHealthCheck() triggers escalation and concurrent writes; change the seed
to a below-threshold value (e.g., 2) so the test only verifies increment
behavior, and when asserting connectionIdleCount and connectionHealth read them
while holding the streamManager.healthMu lock (use streamManager.healthMu to
protect access) to avoid unlocked race conditions with performHealthCheck().
Ensure you update references to connectionIdleCount, connectionHealth,
performHealthCheck, and streamManager.healthMu in the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72d60b7a-5308-4d33-985e-7bf71ea3a054

📥 Commits

Reviewing files that changed from the base of the PR and between cc410af and ff6a75d.

📒 Files selected for processing (1)
  • gravity/hardening_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🧰 Additional context used
🪛 GitHub Actions: Go
gravity/hardening_test.go

[error] 871-871: go test -race detected a DATA RACE (read) in TestPerformHealthCheck_IDLECounterIncrementsOnConsecutiveIDLE at /home/runner/work/go-common/go-common/gravity/hardening_test.go:871 during parallel execution; race reported as write at grpc_client.go:3679.

Comment thread gravity/hardening_test.go
jhaynie added 2 commits April 19, 2026 22:14
…ion, lifecycle tests

P1 — Connection state transitions:
- SHUTDOWN connection: no recovery attempt, counter reset
- Nil connection: marked unhealthy without panic
- IDLE → CONNECTING transition: idle counter resets to 0

P1 — Endpoint disconnection cascade:
- Single endpoint IDLE while others healthy: targeted recovery
- All endpoints IDLE simultaneously: all get recovery (rolling restart)
- Invalid endpoint index: no panic
- Already reconnecting guard: duplicate disconnection skipped

P2 — Stream + connection interaction:
- Streams on IDLE connection: not recovered (connReady=false)
- Stream recovery blocked until connection reaches READY
- refreshEndpointHealth: derives from connectionHealth correctly
- All connections unhealthy: all endpoints marked unhealthy

P3 — Timing & lifecycle:
- Context cancellation: health monitor goroutine exits promptly
- Empty connections: no panic
- Mismatched array lengths: documents current panic behavior
  (connectionHealth shorter than connections — needs bounds fix)

Also found a real bug: performHealthCheck panics when connectionHealth
array is shorter than connections array. Documented as
TestPerformHealthCheck_MismatchedArrayLengths_Panics.
performHealthCheck iterated over g.connections but indexed into
g.streamManager.connectionHealth without bounds checking. If
connectionHealth was shorter than connections (e.g., during dynamic
endpoint addition before arrays are grown), this caused an index
out of range panic.

Fix: break the loop when i >= len(connectionHealth). Connections
beyond the health array are skipped — they'll be picked up once
the arrays are grown by addEndpoint.

Updated TestPerformHealthCheck_MismatchedArrayLengths_NoPanic to
verify the fix (was previously documenting the panic).

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
gravity/hardening_test.go (1)

850-874: ⚠️ Potential issue | 🔴 Critical

Data race: escalation triggers async writes that race with unlocked reads.

Pre-setting connectionIdleCount to 3 makes the next performHealthCheck() cross the >3 threshold and trigger escalation logic, which spawns an async goroutine that can mutate health state. The unlocked reads at lines 865-872 race with those writes.

The pipeline failure confirms this: go test -race detected a DATA RACE at line 871.

Fix options:

  1. Use below-threshold seed (e.g., 2) so this test validates only increment behavior without triggering escalation
  2. Read shared state under healthMu.RLock() (as done correctly in TestPerformHealthCheck_PersistentIDLEEscalates at lines 1167-1170)
🔧 Proposed fix (option 1 - simpler)
-	// Pre-set the idle count to 3 to simulate 3 previous IDLE detections.
-	g.streamManager.connectionIdleCount = []int{3}
+	// Pre-set below escalation threshold so this test validates increment-only behavior.
+	g.streamManager.connectionIdleCount = []int{2}
 
 	// Create a fresh IDLE connection (simulates a connection that went back to IDLE).
 	conn := newIdleGRPCConn(t)
@@ -862,13 +862,13 @@
 
 	g.performHealthCheck()
 
-	// Counter should have incremented from 3 to 4.
-	if g.streamManager.connectionIdleCount[0] != 4 {
-		t.Fatalf("expected connectionIdleCount[0]=4 (incremented from 3), got %d",
+	// Counter should have incremented from 2 to 3.
+	if g.streamManager.connectionIdleCount[0] != 3 {
+		t.Fatalf("expected connectionIdleCount[0]=3 (incremented from 2), got %d",
 			g.streamManager.connectionIdleCount[0])
 	}


🧹 Nitpick comments (2)
gravity/hardening_test.go (2)

1226-1242: Consider asserting that idle count is unchanged for nil connections.

The comment at line 1240 states idle count should not be modified, but there's no assertion to verify this. Adding one would strengthen the test:

💡 Suggested addition
 	if g.streamManager.connectionHealth[0] {
 		t.Fatal("expected nil connection to be marked unhealthy")
 	}
-	// Idle count should not be modified for nil connections
-	// (the nil path doesn't touch connectionIdleCount)
+	// Idle count should not be modified for nil connections
+	if g.streamManager.connectionIdleCount[0] != 5 {
+		t.Fatalf("expected connectionIdleCount unchanged (5) for nil connection, got %d",
+			g.streamManager.connectionIdleCount[0])
+	}
 }

1343-1369: Near-duplicate of TestPerformHealthCheck_AllConnectionsIDLE.

This test (TestPerformHealthCheck_AllEndpointsIDLE_AllRecover) is structurally identical to TestPerformHealthCheck_AllConnectionsIDLE (lines 991-1028). Both create 3 IDLE connections and verify all get recovery treatment.

Consider consolidating or differentiating them (e.g., one tests connection-level behavior, the other tests endpoint-level health propagation via refreshEndpointHealth).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e2a4b586-a67f-4d64-ad33-7a060496d762

📥 Commits

Reviewing files that changed from the base of the PR and between ff6a75d and 8780c9e.

📒 Files selected for processing (1)
  • gravity/hardening_test.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🧰 Additional context used
🪛 GitHub Actions: Go
gravity/hardening_test.go

[error] 871-871: WARNING: DATA RACE (race detected by Go race detector). Previous read at 0x00c0004729af by goroutine 1355 in TestPerformHealthCheck_IDLECounterIncrementsOnConsecutiveIDLE.

🔇 Additional comments (5)
gravity/hardening_test.go (5)

751-765: Well-designed helper for testing real gRPC IDLE state.

Using grpc.NewClient with a bogus target to get a lazy connection in IDLE state is a clean approach that tests real gRPC behavior rather than mocking state. The cleanup registration ensures no resource leaks.


774-808: LGTM! Core regression test for IDLE recovery.

This test correctly validates the critical fix: detecting IDLE state, marking unhealthy, incrementing counter, and calling Connect(). The state-change verification confirms the recovery actually triggers gRPC reconnection attempts.


1138-1182: LGTM! Correct synchronization for escalation test.

This test properly uses healthMu.RLock() when reading shared state after triggering escalation, avoiding the race condition present in the consecutive-IDLE test. Good coverage of the handleEndpointDisconnection escalation path.


1582-1603: Good defensive test documenting panic on mismatched array lengths.

This test documents a real bug (per commit messages) and serves as a regression canary. If performHealthCheck is later fixed to handle mismatched lengths gracefully, this test will fail and prompt an update.


741-750: Comprehensive IDLE recovery test suite addresses critical production bug.

The test organization with clear section headers (P1: Connection State Transitions, P1: Endpoint Disconnection Cascade, P2: Stream + Connection Interaction, P3: Timing & Lifecycle) makes it easy to understand coverage. Good use of t.Skip for timing-sensitive assertions that may be flaky in CI environments.


@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
gravity/hardening_test.go (1)

853-873: ⚠️ Potential issue | 🔴 Critical

Fix unresolved data race in consecutive-IDLE test (still escalates asynchronously).

connectionIdleCount seeded to 3 at Line 853 makes the next performHealthCheck() increment to 4 and trigger async escalation (handleEndpointDisconnection). Unlocked reads at Line 865 and Line 871 then race with concurrent writes. This matches the race failure reported by CI.

Proposed patch
-	// Pre-set the idle count to 3 to simulate 3 previous IDLE detections.
-	g.streamManager.connectionIdleCount = []int{3}
+	// Keep below escalation threshold so this test validates increment-only behavior.
+	g.streamManager.connectionIdleCount = []int{2}
@@
-	// Counter should have incremented from 3 to 4.
-	if g.streamManager.connectionIdleCount[0] != 4 {
-		t.Fatalf("expected connectionIdleCount[0]=4 (incremented from 3), got %d",
-			g.streamManager.connectionIdleCount[0])
-	}
-
-	// Health should be false (IDLE is not healthy).
-	if g.streamManager.connectionHealth[0] {
+	g.streamManager.healthMu.RLock()
+	idleCount := g.streamManager.connectionIdleCount[0]
+	isHealthy := g.streamManager.connectionHealth[0]
+	g.streamManager.healthMu.RUnlock()
+
+	// Counter should have incremented from 2 to 3.
+	if idleCount != 3 {
+		t.Fatalf("expected connectionIdleCount[0]=3 (incremented from 2), got %d", idleCount)
+	}
+
+	// Health should be false (IDLE is not healthy).
+	if isHealthy {
 		t.Fatal("expected IDLE connection to be marked unhealthy")
 	}
🧹 Nitpick comments (1)
gravity/hardening_test.go (1)

801-823: Consider extracting state-wait logic into a small test helper.

The repeated context.WithTimeout + WaitForStateChange + GetState blocks increase duplication and make timeout handling less consistent. A shared helper would tighten readability and reduce flaky-test maintenance.

Also applies to: 887-896, 1017-1023


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5f9eef56-6a7c-4019-a6a0-2267b892927c

📥 Commits

Reviewing files that changed from the base of the PR and between 8780c9e and 1635a0f.

📒 Files selected for processing (2)
  • gravity/grpc_client.go
  • gravity/hardening_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • gravity/grpc_client.go
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (go)
🧰 Additional context used
🪛 GitHub Actions: Go
gravity/hardening_test.go

[error] 871-871: Go test -race detected a data race (previous read) in TestPerformHealthCheck_IDLECounterIncrementsOnConsecutiveIDLE at hardening_test.go:871.


[error] 1-1: Go test -race process failed with exit code 1 due to race detected during execution of tests in package github.com/agentuity/go-common/gravity.

🔇 Additional comments (1)
gravity/hardening_test.go (1)

767-808: Strong regression coverage for the IDLE reconnect bug.

The test correctly validates the critical IDLE path: unhealthy marking, idle counter increment, and forced state transition via Connect(). Good production-focused hardening.

…eIDLE

The test read connectionHealth[0] immediately after performHealthCheck()
returned, but the write happens in a background goroutine spawned by
handleEndpointDisconnection. Wait for the async write with a polling
loop under the healthMu lock.
@jhaynie jhaynie merged commit 6806372 into main Apr 21, 2026
5 checks passed
@jhaynie jhaynie deleted the fix/hostinfo-restart-count branch April 21, 2026 12:18