fix(nonce-init): retry on RPC errors, not just on tokio timeouts #47

Merged: keanji-x merged 1 commit into main from fix/nonce-init-retry-bug on Apr 27, 2026
@keanji-x (Collaborator)

Summary

  • The 5-retry loop in init_nonce checked only the outer Result produced by the tokio timeout, not the inner RPC Result. An inner RPC error (a dropped connection, for example) still satisfied res.is_ok(), so the loop broke after the first attempt and the task panicked.
  • Match both layers and add a small backoff between attempts, so that both RPC errors and timeouts are actually retried.

Why this matters

Reproducible against a slow / overloaded RPC endpoint with ~100k accounts in --recover mode: a single dropped connection during nonce initialization aborted the whole run. After this fix, the same scenario completes nonce init in ~4-5 minutes with one or two warning logs, because transient errors are retried instead of being fatal.

Behavior changes

  • ✅ Transient RPC errors are retried (same as transient timeouts already were).
  • ✅ Persistent failure still panics after 5 attempts (unchanged).
  • ✅ Successful first attempt: no behavior change.

Test plan

  • cargo check passes.
  • Verified end-to-end against gravity testnet (~100k accounts, RPC under heavy load): nonce init completes successfully where it previously panicked on the first dropped connection.

🤖 Generated with Claude Code

The previous logic was:

    let res = tokio::time::timeout(..., client.get_pending_txn_count(addr)).await;
    if res.is_ok() { init_nonce = res.ok(); break; }

`res` is `Result<Result<u64, _>, Elapsed>`, so `res.is_ok()` is true
whenever the timeout did not expire — including when the inner RPC call
returned `Err` (connection reset, request error, etc.). In that case
the loop broke after a single attempt and `init_nonce` was set to
`Some(Err(_))`, which then panicked the task with "Failed to get nonce
for address". The five-retry intent never took effect.
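
The nested-Result pitfall described above can be reproduced without tokio. The following is a minimal std-only sketch: `Elapsed` here is a local stand-in for tokio's timeout error, and `RpcError`, `Attempt`, `old_check_accepts`, and `both_layers_ok` are hypothetical names introduced only for illustration, not the project's own types.

```rust
// Local stand-in for tokio's timeout error, so the sketch needs only std.
#[derive(Debug, PartialEq)]
struct Elapsed;

// Hypothetical error type for the inner RPC call.
#[derive(Debug, PartialEq)]
struct RpcError(&'static str);

// Shape of one attempt: outer layer = timeout, inner layer = RPC result.
type Attempt = Result<Result<u64, RpcError>, Elapsed>;

/// The old check: inspects only the outer (timeout) layer.
fn old_check_accepts(res: &Attempt) -> bool {
    res.is_ok()
}

/// A check that requires success on both layers.
fn both_layers_ok(res: &Attempt) -> bool {
    matches!(res, Ok(Ok(_)))
}
```

With these definitions, `old_check_accepts(&Ok(Err(RpcError("connection reset"))))` returns `true`, which is the case that ended the loop early, while `both_layers_ok` returns `false` for the same value.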

Match both layers so we actually retry on inner RPC errors and on
outer timeouts, and add a small backoff between attempts. Behavior on
success is unchanged; behavior on persistent failure is the same panic
after exhausting all 5 attempts.
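
As a rough illustration of matching both layers, here is a synchronous, std-only model of such a retry loop. It is a sketch under stated assumptions, not the merged code: the real loop is async and wraps the RPC call in `tokio::time::timeout`, whereas this version takes a closure, uses a local `Elapsed` stand-in, represents RPC errors as `String`, and sleeps with `std::thread::sleep`. The name `init_nonce_with_retry`, the `MAX_ATTEMPTS` constant, and the returned attempt count are hypothetical.

```rust
use std::time::Duration;

// Local stand-in for tokio's timeout error (assumption for this sketch).
#[derive(Debug)]
struct Elapsed;

const MAX_ATTEMPTS: u32 = 5;

/// Calls `attempt` up to MAX_ATTEMPTS times, retrying on both failure layers.
/// Returns the nonce plus the number of attempts used, or the last error seen.
fn init_nonce_with_retry<F>(mut attempt: F, backoff: Duration) -> Result<(u64, u32), String>
where
    F: FnMut() -> Result<Result<u64, String>, Elapsed>,
{
    let mut last_err = String::from("no attempts made");
    for n in 1..=MAX_ATTEMPTS {
        match attempt() {
            // Both layers succeeded: stop retrying.
            Ok(Ok(nonce)) => return Ok((nonce, n)),
            // Timeout did not fire, but the RPC itself failed: retry.
            Ok(Err(e)) => last_err = format!("rpc error: {e}"),
            // Timeout fired: retry.
            Err(Elapsed) => last_err = String::from("timed out"),
        }
        // Small pause between attempts; skipped after the final one.
        if n < MAX_ATTEMPTS {
            std::thread::sleep(backoff);
        }
    }
    Err(last_err)
}
```

A caller that wants to keep the panic on persistent failure could call `.expect("Failed to get nonce for address")` on the returned value, which fails only after all five attempts are exhausted.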

This was reproducible against a slow / overloaded RPC endpoint with
~100k accounts in --recover mode: a single dropped connection during
nonce initialization aborted the whole run.
@keanji-x keanji-x force-pushed the fix/nonce-init-retry-bug branch from dc8a4c1 to dd4308f Compare April 27, 2026 14:34
@keanji-x keanji-x merged commit c5e6a2c into main Apr 27, 2026
1 check failed
@keanji-x keanji-x deleted the fix/nonce-init-retry-bug branch April 27, 2026 14:34