
fix: handle ResourceSyncResponse in gravity client#234

Merged
jhaynie merged 11 commits into main from fix/hostinfo-restart-count
Apr 21, 2026

Conversation

@jhaynie
Member

@jhaynie jhaynie commented Apr 21, 2026

Summary

Add a handler for the ResourceSyncResponse message type in the gravity client; previously the response was logged as unhandled.

Changes

  • Add handleResourceSyncResponse function in grpc_client.go
  • Add case for SessionMessage_ResourceSyncResponse in processSessionMessage

Why

The ResourceSync flow sends a request from hadron to ion, and ion sends back a response. Without this handler, the response was logged as "unhandled session message type".

This completes the ResourceSync message flow:

  1. Hadron broadcasts ResourceSyncRequest to all ions
  2. Ion processes and sends ResourceSyncResponse back
  3. Hadron now handles the response properly
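
The new dispatch path can be sketched as follows. This is a minimal stand-in, not the actual grpc_client.go code: the oneof wrapper types are really generated in gravity_session.pb.go, and the handler body here is hypothetical.

```go
package main

import "fmt"

// Stand-in types for the generated protobuf oneof wrappers; the real
// definitions live in gravity_session.pb.go.
type ResourceSyncResponse struct {
	Success bool
	Error   string
}

type SessionMessage_ResourceSyncResponse struct {
	ResourceSyncResponse *ResourceSyncResponse
}

type SessionMessage struct {
	MessageType any // one of the oneof wrapper types
}

// handleResourceSyncResponse is a hypothetical handler body: surface the
// sync outcome instead of dropping the message.
func handleResourceSyncResponse(resp *ResourceSyncResponse) string {
	if resp.Success {
		return "resource sync confirmed by ion"
	}
	return "resource sync failed: " + resp.Error
}

// processSessionMessage dispatches on the oneof variant; the new case routes
// ResourceSyncResponse to its handler instead of the default "unhandled" arm.
func processSessionMessage(msg *SessionMessage) string {
	switch m := msg.MessageType.(type) {
	case *SessionMessage_ResourceSyncResponse:
		return handleResourceSyncResponse(m.ResourceSyncResponse)
	default:
		return "unhandled session message type"
	}
}

func main() {
	msg := &SessionMessage{MessageType: &SessionMessage_ResourceSyncResponse{
		ResourceSyncResponse: &ResourceSyncResponse{Success: true},
	}}
	fmt.Println(processSessionMessage(msg))
}
```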

Summary by CodeRabbit

Release Notes

  • New Features
    • Added resource synchronization functionality for deployments and sandbox resources with configurable timeout duration.
    • Integrated synchronization status tracking to monitor resource additions and removals with success confirmation.
    • Response handling provides comprehensive error reporting and detailed synchronization operation feedback.

jhaynie added 11 commits April 19, 2026 18:51
Add restart_count field (uint32, field 14) to the HostInfo protobuf
message. This allows hadron to report its systemd restart counter
to ion during the session hello, enabling crash-loop detection in
the debug machines endpoint.

- gravity_session.proto: added restart_count = 14
- gravity_session.pb.go: regenerated
- grpc_client.go: populate RestartCount from GravityConfig
- types.go: added RestartCount to GravityConfig
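
The plumbing this commit describes can be sketched as below. The struct shapes are assumed mirrors of the real ones (HostInfo is generated from the proto, GravityConfig lives in types.go); only the copy from config into the session-hello message is illustrated.

```go
package main

import "fmt"

// Hypothetical mirror of the config struct from types.go; the real
// RestartCount would be populated from the systemd restart counter.
type GravityConfig struct {
	RestartCount uint32
}

// Hypothetical mirror of the generated HostInfo message
// (restart_count = field 14 in gravity_session.proto).
type HostInfo struct {
	RestartCount uint32
}

// buildHostInfo sketches the session-hello path: carry the restart counter
// from the config into HostInfo so ion can detect crash loops.
func buildHostInfo(cfg *GravityConfig) *HostInfo {
	return &HostInfo{RestartCount: cfg.RestartCount}
}

func main() {
	hi := buildHostInfo(&GravityConfig{RestartCount: 3})
	fmt.Println("restart_count:", hi.RestartCount)
}
```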
When an ion restarts (tableflip), the gRPC connection transitions to
IDLE state. In this state, gRPC does not attempt to reconnect until a
new RPC call is issued or Connect() is explicitly called. The health
monitor detected IDLE connections (logging 'connection N is unhealthy,
state: IDLE') but never attempted recovery — causing hadrons to
permanently lose tunnel connectivity to restarted ions.

This caused a 48-minute full outage in us-central: all 3 gravity
endpoints went IDLE after ion rollout, hadrons had 0 tunnel streams,
and no deployments could be provisioned or serve traffic. A manual
hadron restart was required to recover.

Fix: when performHealthCheck detects an IDLE connection, call
conn.Connect() to force gRPC to transition from IDLE → CONNECTING →
READY (or TRANSIENT_FAILURE if the server is unreachable, which has
its own retry logic). Track consecutive idle counts for observability.
Two fixes from PR review:

1. addEndpoint grows connectionHealth when adding endpoints dynamically
   but did not grow connectionIdleCount, leaving it shorter than
   connectionHealth. This would cause idle recovery tracking to silently
   skip dynamically-added endpoints. Mirror the growth logic.

2. When a connection stays IDLE for >3 consecutive health checks,
   escalate beyond just conn.Connect() — trigger a full endpoint
   reconnection via handleEndpointDisconnection which tears down and
   rebuilds all streams for that endpoint. This handles cases where
   the underlying gRPC connection is fundamentally broken and
   Connect() alone cannot recover it.
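
The recovery-plus-escalation logic from the two commits above can be sketched like this. The interface mocks the two *grpc.ClientConn methods involved (GetState() and Connect() are real grpc-go APIs); the state enum, counter handling, and the escalate callback (standing in for handleEndpointDisconnection) are assumptions for illustration.

```go
package main

import "fmt"

// connState mimics grpc-go's connectivity.State values for this sketch.
type connState int

const (
	Idle connState = iota
	Connecting
	Ready
)

// clientConn abstracts the two *grpc.ClientConn methods this logic needs.
type clientConn interface {
	GetState() connState
	Connect() // kicks an IDLE connection into CONNECTING
}

const idleEscalationThreshold = 3

// checkConnection sketches one health-check pass for a single endpoint:
// nudge an IDLE connection with Connect(), and after more than three
// consecutive IDLE observations escalate to a full endpoint reconnection.
// idleCount is the per-endpoint consecutive-idle counter; escalate stands
// in for handleEndpointDisconnection.
func checkConnection(conn clientConn, idleCount *int, escalate func()) {
	switch conn.GetState() {
	case Idle:
		*idleCount++
		conn.Connect()
		if *idleCount > idleEscalationThreshold {
			escalate() // tear down and rebuild all streams for this endpoint
		}
	default:
		*idleCount = 0 // any non-IDLE state resets the counter
	}
}

type fakeConn struct {
	state     connState
	connected int
}

func (f *fakeConn) GetState() connState { return f.state }
func (f *fakeConn) Connect()            { f.connected++ }

func main() {
	conn := &fakeConn{state: Idle}
	idle, escalations := 0, 0
	for i := 0; i < 5; i++ {
		checkConnection(conn, &idle, func() { escalations++ })
	}
	fmt.Println("connect calls:", conn.connected, "escalations:", escalations)
}
```

Resetting the counter on any non-IDLE state is what lets an IDLE → CONNECTING transition avoid a spurious escalation, matching the later lifecycle tests.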
… escalation

- TestAddEndpoint_GrowsConnectionIdleCount: verifies connectionIdleCount
  is grown alongside connectionHealth when addEndpoint adds a new
  endpoint dynamically (prevents idle recovery from silently skipping
  dynamically-added endpoints)

- TestPerformHealthCheck_PersistentIDLEEscalates: verifies that when
  connectionIdleCount exceeds the escalation threshold (>3),
  performHealthCheck triggers handleEndpointDisconnection without
  panicking (pre-sets count to 3, runs one health check, verifies
  count=4 and no crash from the async escalation)
…ion, lifecycle tests

P1 — Connection state transitions:
- SHUTDOWN connection: no recovery attempt, counter reset
- Nil connection: marked unhealthy without panic
- IDLE → CONNECTING transition: idle counter resets to 0

P1 — Endpoint disconnection cascade:
- Single endpoint IDLE while others healthy: targeted recovery
- All endpoints IDLE simultaneously: all get recovery (rolling restart)
- Invalid endpoint index: no panic
- Already reconnecting guard: duplicate disconnection skipped

P2 — Stream + connection interaction:
- Streams on IDLE connection: not recovered (connReady=false)
- Stream recovery blocked until connection reaches READY
- refreshEndpointHealth: derives from connectionHealth correctly
- All connections unhealthy: all endpoints marked unhealthy

P3 — Timing & lifecycle:
- Context cancellation: health monitor goroutine exits promptly
- Empty connections: no panic
- Mismatched array lengths: documents current panic behavior
  (connectionHealth shorter than connections — needs bounds fix)

Also found a real bug: performHealthCheck panics when connectionHealth
array is shorter than connections array. Documented as
TestPerformHealthCheck_MismatchedArrayLengths_Panics.
performHealthCheck iterated over g.connections but indexed into
g.streamManager.connectionHealth without bounds checking. If
connectionHealth was shorter than connections (e.g., during dynamic
endpoint addition before arrays are grown), this caused an index
out of range panic.

Fix: break the loop when i >= len(connectionHealth). Connections
beyond the health array are skipped — they'll be picked up once
the arrays are grown by addEndpoint.
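
A minimal sketch of the bounds guard, with plain slices standing in for the real connection and health arrays:

```go
package main

import "fmt"

// healthCheckIndices sketches the fixed loop: iterate connections but stop
// once the index runs past connectionHealth instead of indexing out of
// range. Returns the indices that would actually be health-checked.
func healthCheckIndices(connections []string, connectionHealth []bool) []int {
	var checked []int
	for i := range connections {
		if i >= len(connectionHealth) {
			// Arrays can be momentarily mismatched during dynamic endpoint
			// addition; skip the tail until addEndpoint grows both.
			break
		}
		checked = append(checked, i)
	}
	return checked
}

func main() {
	conns := []string{"ep0", "ep1", "ep2"}
	health := []bool{true, true} // shorter: ep2 added, health not yet grown
	fmt.Println(healthCheckIndices(conns, health))
}
```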

Updated TestPerformHealthCheck_MismatchedArrayLengths_NoPanic to
verify the fix (was previously documenting the panic).
…eIDLE

The test read connectionHealth[0] immediately after performHealthCheck()
returned, but the write happens in a background goroutine spawned by
handleEndpointDisconnection. Wait for the async write with a polling
loop under the healthMu lock.
…iliation

Add ResourceSyncRequest/Response proto messages and SendResourceSync()
client method. Replaces per-resource RouteDeploymentRequest calls during
reconnection with a single atomic message containing all active resources.
Gravity compares the list against its cached state and unprovisions any
stale entries, preventing VIP collisions from gossip contamination.
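
The reconciliation on the receiving side can be sketched as follows. ResourceSyncItem's fields (ID, VirtualIP, OwnerID) come from the PR description; the compare-and-unprovision logic is an assumed illustration of "compares the list against its cached state and unprovisions any stale entries."

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceSyncItem mirrors the struct described in the PR.
type ResourceSyncItem struct {
	ID        string
	VirtualIP string
	OwnerID   string
}

// staleResources returns cached resource IDs absent from the incoming
// atomic sync list; these are the entries to unprovision, which is what
// prevents VIP collisions from lingering gossip state.
func staleResources(cached map[string]ResourceSyncItem, synced []ResourceSyncItem) []string {
	active := make(map[string]struct{}, len(synced))
	for _, it := range synced {
		active[it.ID] = struct{}{}
	}
	var stale []string
	for id := range cached {
		if _, ok := active[id]; !ok {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale) // deterministic order for logging
	return stale
}

func main() {
	cached := map[string]ResourceSyncItem{
		"dep-1": {ID: "dep-1", VirtualIP: "10.0.0.1"},
		"dep-2": {ID: "dep-2", VirtualIP: "10.0.0.2"}, // no longer active
	}
	synced := []ResourceSyncItem{{ID: "dep-1", VirtualIP: "10.0.0.1"}}
	fmt.Println("unprovision:", staleResources(cached, synced))
}
```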
Add handler for ResourceSyncResponse message type that was being
logged as unhandled. This completes the ResourceSync flow where
ion sends a response after processing the sync request.
@coderabbitai
Contributor

coderabbitai Bot commented Apr 21, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

This pull request adds a new resource synchronization capability to the gRPC client, introducing protobuf message types and a SendResourceSync API that enables bidirectional resource sync messages over the existing EstablishSession stream.

Changes

Changes by cohort / file(s):

  • Protobuf Protocol Definition (gravity/proto/gravity_session.proto):
    Added new message types ResourceSyncRequest, ResourceSyncEntry, and ResourceSyncResponse to define resource synchronization payloads. Extended SessionMessage with two new oneof message_type variants (fields 70-71) for routing sync requests and responses.
  • Client Implementation (gravity/grpc_client.go):
    Implemented the SendResourceSync() API method that converts deployment and sandbox items to protobuf format and broadcasts them over healthy control streams. Added handleResourceSyncResponse() to process incoming sync responses and introduced the ResourceSyncItem struct with ID, VirtualIP, and OwnerID fields.
🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@jhaynie jhaynie merged commit 094774f into main Apr 21, 2026
4 of 5 checks passed
@jhaynie jhaynie deleted the fix/hostinfo-restart-count branch April 21, 2026 20:37