
fix: handle ResourceSyncResponse in gravity client#234

Merged
jhaynie merged 11 commits into main from fix/hostinfo-restart-count
Apr 21, 2026

Conversation

@jhaynie
Member

@jhaynie jhaynie commented Apr 21, 2026

Summary

Add a handler for the ResourceSyncResponse message type in the gravity client; previously the response was logged as unhandled.

Changes

  • Add handleResourceSyncResponse function in grpc_client.go
  • Add case for SessionMessage_ResourceSyncResponse in processSessionMessage

Why

The ResourceSync flow sends a request from hadron to ion, and ion sends back a response. Without this handler, the response was logged as "unhandled session message type".

This completes the ResourceSync message flow:

  1. Hadron broadcasts ResourceSyncRequest to all ions
  2. Ion processes and sends ResourceSyncResponse back
  3. Hadron now handles the response properly
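
The new dispatch path can be sketched as follows. This is a minimal stand-in, not the actual grpc_client.go code: the oneof wrapper types are really generated in gravity_session.pb.go, and the handler body here is hypothetical.

```go
package main

import "fmt"

// Stand-in types for the generated protobuf oneof wrappers; the real
// definitions live in gravity_session.pb.go.
type ResourceSyncResponse struct {
	Success bool
	Error   string
}

type SessionMessage_ResourceSyncResponse struct {
	ResourceSyncResponse *ResourceSyncResponse
}

type SessionMessage struct {
	MessageType any // one of the oneof wrapper types
}

// handleResourceSyncResponse is a hypothetical handler body: surface the
// sync outcome instead of dropping the message.
func handleResourceSyncResponse(resp *ResourceSyncResponse) string {
	if resp.Success {
		return "resource sync confirmed by ion"
	}
	return "resource sync failed: " + resp.Error
}

// processSessionMessage dispatches on the oneof variant; the new case routes
// ResourceSyncResponse to its handler instead of the default "unhandled" arm.
func processSessionMessage(msg *SessionMessage) string {
	switch m := msg.MessageType.(type) {
	case *SessionMessage_ResourceSyncResponse:
		return handleResourceSyncResponse(m.ResourceSyncResponse)
	default:
		return "unhandled session message type"
	}
}

func main() {
	msg := &SessionMessage{MessageType: &SessionMessage_ResourceSyncResponse{
		ResourceSyncResponse: &ResourceSyncResponse{Success: true},
	}}
	fmt.Println(processSessionMessage(msg))
}
```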

Summary by CodeRabbit

Release Notes

  • New Features
    • Added resource synchronization functionality for deployments and sandbox resources with configurable timeout duration.
    • Integrated synchronization status tracking to monitor resource additions and removals with success confirmation.
    • Response handling provides comprehensive error reporting and detailed synchronization operation feedback.

jhaynie added 11 commits April 19, 2026 18:51
Add restart_count field (uint32, field 14) to the HostInfo protobuf
message. This allows hadron to report its systemd restart counter
to ion during the session hello, enabling crash-loop detection in
the debug machines endpoint.

- gravity_session.proto: added restart_count = 14
- gravity_session.pb.go: regenerated
- grpc_client.go: populate RestartCount from GravityConfig
- types.go: added RestartCount to GravityConfig
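
The plumbing this commit describes can be sketched as below. The struct shapes are assumed mirrors of the real ones (HostInfo is generated from the proto, GravityConfig lives in types.go); only the copy from config into the session-hello message is illustrated.

```go
package main

import "fmt"

// Hypothetical mirror of the config struct from types.go; the real
// RestartCount would be populated from the systemd restart counter.
type GravityConfig struct {
	RestartCount uint32
}

// Hypothetical mirror of the generated HostInfo message
// (restart_count = field 14 in gravity_session.proto).
type HostInfo struct {
	RestartCount uint32
}

// buildHostInfo sketches the session-hello path: carry the restart counter
// from the config into HostInfo so ion can detect crash loops.
func buildHostInfo(cfg *GravityConfig) *HostInfo {
	return &HostInfo{RestartCount: cfg.RestartCount}
}

func main() {
	hi := buildHostInfo(&GravityConfig{RestartCount: 3})
	fmt.Println("restart_count:", hi.RestartCount)
}
```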
When an ion restarts (tableflip), the gRPC connection transitions to
IDLE state. In this state, gRPC does not attempt to reconnect until a
new RPC call is issued or Connect() is explicitly called. The health
monitor detected IDLE connections (logging 'connection N is unhealthy,
state: IDLE') but never attempted recovery — causing hadrons to
permanently lose tunnel connectivity to restarted ions.

This caused a 48-minute full outage in us-central: all 3 gravity
endpoints went IDLE after ion rollout, hadrons had 0 tunnel streams,
and no deployments could be provisioned or serve traffic. A manual
hadron restart was required to recover.

Fix: when performHealthCheck detects an IDLE connection, call
conn.Connect() to force gRPC to transition from IDLE → CONNECTING →
READY (or TRANSIENT_FAILURE if the server is unreachable, which has
its own retry logic). Track consecutive idle counts for observability.
Two fixes from PR review:

1. addEndpoint grows connectionHealth when adding endpoints dynamically
   but did not grow connectionIdleCount, leaving it shorter than
   connectionHealth. This would cause idle recovery tracking to silently
   skip dynamically-added endpoints. Mirror the growth logic.

2. When a connection stays IDLE for >3 consecutive health checks,
   escalate beyond just conn.Connect() — trigger a full endpoint
   reconnection via handleEndpointDisconnection which tears down and
   rebuilds all streams for that endpoint. This handles cases where
   the underlying gRPC connection is fundamentally broken and
   Connect() alone cannot recover it.
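
The recovery-plus-escalation logic from the two commits above can be sketched like this. The interface mocks the two *grpc.ClientConn methods involved (GetState() and Connect() are real grpc-go APIs); the state enum, counter handling, and the escalate callback (standing in for handleEndpointDisconnection) are assumptions for illustration.

```go
package main

import "fmt"

// connState mimics grpc-go's connectivity.State values for this sketch.
type connState int

const (
	Idle connState = iota
	Connecting
	Ready
)

// clientConn abstracts the two *grpc.ClientConn methods this logic needs.
type clientConn interface {
	GetState() connState
	Connect() // kicks an IDLE connection into CONNECTING
}

const idleEscalationThreshold = 3

// checkConnection sketches one health-check pass for a single endpoint:
// nudge an IDLE connection with Connect(), and after more than three
// consecutive IDLE observations escalate to a full endpoint reconnection.
// idleCount is the per-endpoint consecutive-idle counter; escalate stands
// in for handleEndpointDisconnection.
func checkConnection(conn clientConn, idleCount *int, escalate func()) {
	switch conn.GetState() {
	case Idle:
		*idleCount++
		conn.Connect()
		if *idleCount > idleEscalationThreshold {
			escalate() // tear down and rebuild all streams for this endpoint
		}
	default:
		*idleCount = 0 // any non-IDLE state resets the counter
	}
}

type fakeConn struct {
	state     connState
	connected int
}

func (f *fakeConn) GetState() connState { return f.state }
func (f *fakeConn) Connect()            { f.connected++ }

func main() {
	conn := &fakeConn{state: Idle}
	idle, escalations := 0, 0
	for i := 0; i < 5; i++ {
		checkConnection(conn, &idle, func() { escalations++ })
	}
	fmt.Println("connect calls:", conn.connected, "escalations:", escalations)
}
```

Resetting the counter on any non-IDLE state is what lets an IDLE → CONNECTING transition avoid a spurious escalation, matching the later lifecycle tests.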
… escalation

- TestAddEndpoint_GrowsConnectionIdleCount: verifies connectionIdleCount
  is grown alongside connectionHealth when addEndpoint adds a new
  endpoint dynamically (prevents idle recovery from silently skipping
  dynamically-added endpoints)

- TestPerformHealthCheck_PersistentIDLEEscalates: verifies that when
  connectionIdleCount exceeds the escalation threshold (>3),
  performHealthCheck triggers handleEndpointDisconnection without
  panicking (pre-sets count to 3, runs one health check, verifies
  count=4 and no crash from the async escalation)
…ion, lifecycle tests

P1 — Connection state transitions:
- SHUTDOWN connection: no recovery attempt, counter reset
- Nil connection: marked unhealthy without panic
- IDLE → CONNECTING transition: idle counter resets to 0

P1 — Endpoint disconnection cascade:
- Single endpoint IDLE while others healthy: targeted recovery
- All endpoints IDLE simultaneously: all get recovery (rolling restart)
- Invalid endpoint index: no panic
- Already reconnecting guard: duplicate disconnection skipped

P2 — Stream + connection interaction:
- Streams on IDLE connection: not recovered (connReady=false)
- Stream recovery blocked until connection reaches READY
- refreshEndpointHealth: derives from connectionHealth correctly
- All connections unhealthy: all endpoints marked unhealthy

P3 — Timing & lifecycle:
- Context cancellation: health monitor goroutine exits promptly
- Empty connections: no panic
- Mismatched array lengths: documents current panic behavior
  (connectionHealth shorter than connections — needs bounds fix)

Also found a real bug: performHealthCheck panics when connectionHealth
array is shorter than connections array. Documented as
TestPerformHealthCheck_MismatchedArrayLengths_Panics.
performHealthCheck iterated over g.connections but indexed into
g.streamManager.connectionHealth without bounds checking. If
connectionHealth was shorter than connections (e.g., during dynamic
endpoint addition before arrays are grown), this caused an index
out of range panic.

Fix: break the loop when i >= len(connectionHealth). Connections
beyond the health array are skipped — they'll be picked up once
the arrays are grown by addEndpoint.
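
A minimal sketch of the bounds guard, with plain slices standing in for the real connection and health arrays:

```go
package main

import "fmt"

// healthCheckIndices sketches the fixed loop: iterate connections but stop
// once the index runs past connectionHealth instead of indexing out of
// range. Returns the indices that would actually be health-checked.
func healthCheckIndices(connections []string, connectionHealth []bool) []int {
	var checked []int
	for i := range connections {
		if i >= len(connectionHealth) {
			// Arrays can be momentarily mismatched during dynamic endpoint
			// addition; skip the tail until addEndpoint grows both.
			break
		}
		checked = append(checked, i)
	}
	return checked
}

func main() {
	conns := []string{"ep0", "ep1", "ep2"}
	health := []bool{true, true} // shorter: ep2 added, health not yet grown
	fmt.Println(healthCheckIndices(conns, health))
}
```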

Updated TestPerformHealthCheck_MismatchedArrayLengths_NoPanic to
verify the fix (was previously documenting the panic).
…eIDLE

The test read connectionHealth[0] immediately after performHealthCheck()
returned, but the write happens in a background goroutine spawned by
handleEndpointDisconnection. Wait for the async write with a polling
loop under the healthMu lock.
…iliation

Add ResourceSyncRequest/Response proto messages and SendResourceSync()
client method. Replaces per-resource RouteDeploymentRequest calls during
reconnection with a single atomic message containing all active resources.
Gravity compares the list against its cached state and unprovisions any
stale entries, preventing VIP collisions from gossip contamination.
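
The reconciliation on the receiving side can be sketched as follows. ResourceSyncItem's fields (ID, VirtualIP, OwnerID) come from the PR description; the compare-and-unprovision logic is an assumed illustration of "compares the list against its cached state and unprovisions any stale entries."

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceSyncItem mirrors the struct described in the PR.
type ResourceSyncItem struct {
	ID        string
	VirtualIP string
	OwnerID   string
}

// staleResources returns cached resource IDs absent from the incoming
// atomic sync list; these are the entries to unprovision, which is what
// prevents VIP collisions from lingering gossip state.
func staleResources(cached map[string]ResourceSyncItem, synced []ResourceSyncItem) []string {
	active := make(map[string]struct{}, len(synced))
	for _, it := range synced {
		active[it.ID] = struct{}{}
	}
	var stale []string
	for id := range cached {
		if _, ok := active[id]; !ok {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale) // deterministic order for logging
	return stale
}

func main() {
	cached := map[string]ResourceSyncItem{
		"dep-1": {ID: "dep-1", VirtualIP: "10.0.0.1"},
		"dep-2": {ID: "dep-2", VirtualIP: "10.0.0.2"}, // no longer active
	}
	synced := []ResourceSyncItem{{ID: "dep-1", VirtualIP: "10.0.0.1"}}
	fmt.Println("unprovision:", staleResources(cached, synced))
}
```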
Add handler for ResourceSyncResponse message type that was being
logged as unhandled. This completes the ResourceSync flow where
ion sends a response after processing the sync request.
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

This pull request adds a new resource synchronization capability to the gRPC client, introducing protobuf message types and a SendResourceSync API that enables bidirectional resource sync messages over the existing EstablishSession stream.

Changes

Changes by cohort / file(s):

  • Protobuf Protocol Definition (gravity/proto/gravity_session.proto):
    Added new message types ResourceSyncRequest, ResourceSyncEntry, and ResourceSyncResponse to define resource synchronization payloads. Extended SessionMessage with two new oneof message_type variants (fields 70-71) for routing sync requests and responses.
  • Client Implementation (gravity/grpc_client.go):
    Implemented the SendResourceSync() API method that converts deployment and sandbox items to protobuf format and broadcasts them over healthy control streams. Added handleResourceSyncResponse() to process incoming sync responses and introduced the ResourceSyncItem struct with ID, VirtualIP, and OwnerID fields.
🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@jhaynie jhaynie merged commit 094774f into main Apr 21, 2026
4 of 5 checks passed
@jhaynie jhaynie deleted the fix/hostinfo-restart-count branch April 21, 2026 20:37