fix(nonce-init): retry on RPC errors, not just on tokio timeouts by keanji-x · Pull Request #47 · Galxe/gravity_bench

keanji-x · 2026-04-27T12:25:27Z

Summary

The 5-retry loop in `init_nonce` only checked the outer `Result` from the tokio timeout, not the inner RPC `Result`. Inner RPC errors (connection drops, etc.) still satisfied `res.is_ok()`, so the loop broke on the first attempt and the task panicked.
Match both layers and add a small backoff so we actually retry on RPC errors and on timeouts.

Why this matters

Reproducible against a slow / overloaded RPC endpoint with ~100k accounts in --recover mode: a single dropped connection during nonce initialization aborted the whole run. After this fix, the same scenario completes nonce init in ~4-5 minutes with one or two warning logs as transient errors get retried instead of being fatal.

Behavior changes

✅ Transient RPC errors are retried (same as transient timeouts already were).
✅ Persistent failure still panics after 5 attempts (unchanged).
✅ Successful first attempt: no behavior change.
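
The three behaviors above can be sketched as a small retry helper. This is a minimal, std-only illustration and not the code from this PR: `Elapsed` is a local stand-in for `tokio::time::error::Elapsed`, `fetch_nonce_with_retry` and its parameters are hypothetical names, the closure stands in for the timed `get_pending_txn_count` call, and the backoff duration is an assumption (the PR only says "a small backoff"). The helper returns `None` after the last attempt, where the real code panics.

```rust
use std::thread::sleep;
use std::time::Duration;

// Hypothetical stand-in for tokio::time::error::Elapsed (keeps the sketch std-only).
#[derive(Debug)]
struct Elapsed;

// Hypothetical helper: retries while either layer fails.
// `attempt` models one timed RPC call: outer Err = timeout, inner Err = RPC error.
fn fetch_nonce_with_retry<F>(mut attempt: F, max_attempts: u32, backoff: Duration) -> Option<u64>
where
    F: FnMut() -> Result<Result<u64, String>, Elapsed>,
{
    for n in 1..=max_attempts {
        match attempt() {
            // Both layers succeeded: stop immediately (first-attempt success is unchanged).
            Ok(Ok(nonce)) => return Some(nonce),
            // The call finished in time but the RPC itself failed: retry.
            Ok(Err(e)) => eprintln!("attempt {}/{}: RPC error: {}", n, max_attempts, e),
            // The timeout fired before the call finished: retry.
            Err(Elapsed) => eprintln!("attempt {}/{}: timed out", n, max_attempts),
        }
        // Back off between attempts, but not after the final one.
        if n < max_attempts {
            sleep(backoff);
        }
    }
    // Persistent failure: the caller decides what to do (the PR keeps the panic).
    None
}
```

A caller that wants the behavior described in this PR would pass `5` for `max_attempts` and panic when the helper returns `None`.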

Test plan

cargo check passes.
Verified end-to-end against gravity testnet (~100k accounts, RPC under heavy load): nonce init completes successfully where it previously panicked on the first dropped connection.

🤖 Generated with Claude Code

The previous logic was

    let res = tokio::time::timeout(..., client.get_pending_txn_count(addr)).await;
    if res.is_ok() { init_nonce = res.ok(); break; }

`res` is `Result<Result<u64, _>, Elapsed>`, so `res.is_ok()` is true whenever the timeout did not expire, including when the inner RPC call returned `Err` (connection reset, request error, etc.). In that case the loop broke after a single attempt and `init_nonce` was set to `Some(Err(_))`, which then panicked the task with "Failed to get nonce for address". The five-retry intent never took effect.

Match both layers so we actually retry on inner RPC errors and on outer timeouts, and add a small backoff between attempts. Behavior on success is unchanged; behavior on persistent failure is the same panic after exhausting all 5 attempts.

This was reproducible against a slow / overloaded RPC endpoint with ~100k accounts in --recover mode: a single dropped connection during nonce initialization aborted the whole run.
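
The nested-`Result` pitfall described in this commit message can be reproduced without tokio. The sketch below is illustrative only: `Elapsed` is a local stand-in for `tokio::time::error::Elapsed`, the inner error type is simplified to `String`, and `TimedRpc`, `old_check_accepts`, and `both_layers_ok` are hypothetical names that do not appear in the repository.

```rust
// Hypothetical stand-in for tokio::time::error::Elapsed (keeps the sketch std-only).
#[derive(Debug)]
struct Elapsed;

// Shape produced by awaiting a timeout wrapped around a fallible RPC call:
// the outer layer says whether the timeout fired, the inner layer whether the RPC succeeded.
type TimedRpc = Result<Result<u64, String>, Elapsed>;

/// The old check: inspects only the outer (timeout) layer.
fn old_check_accepts(res: &TimedRpc) -> bool {
    res.is_ok()
}

/// The intended check: both the timeout layer and the RPC layer must be Ok.
fn both_layers_ok(res: &TimedRpc) -> bool {
    matches!(res, Ok(Ok(_)))
}
```

With `Ok(Err("connection reset".to_string()))` as input, `old_check_accepts` returns `true` while `both_layers_ok` returns `false`, which is the case that let the old loop break on the first attempt.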

keanji-x mentioned this pull request Apr 27, 2026

chore(fmt): apply rustfmt + strip trailing whitespace across the tree #49

Merged

3 tasks

keanji-x force-pushed the fix/nonce-init-retry-bug branch from dc8a4c1 to dd4308f Compare April 27, 2026 14:34

keanji-x merged commit c5e6a2c into main Apr 27, 2026
1 check failed

keanji-x deleted the fix/nonce-init-retry-bug branch April 27, 2026 14:34

keanji-x merged 1 commit into main from fix/nonce-init-retry-bug