Skip to content

fix(bench): connection leak + faucet U256 underflow#50

Merged
keanji-x merged 2 commits into
mainfrom
fix/connection-leak-and-faucet-underflow
May 26, 2026
Merged

fix(bench): connection leak + faucet U256 underflow#50
keanji-x merged 2 commits into
mainfrom
fix/connection-leak-and-faucet-underflow

Conversation

@keanji-x

Copy link
Copy Markdown
Collaborator

Summary

Two production bugs surfaced during a 1500-sender long-run bench against a single reth RPC.

Bug 1 — 20K-connection leak (src/eth/eth_cli.rs)

Symptom: Target reth's rpc.max-connections=20000 cap is exhausted within minutes, wedging the RPC server. ss -t state established on the bench host shows ~20K sockets to one peer.

Root causes:

  1. pool_idle_timeout(Duration::from_secs(10)) — too aggressive. reqwest's default is 90s. With 10s, sockets get evicted from the pool faster than 1500 concurrent senders can reuse them, so each request opens a new TCP connection.
  2. Each hot RPC (get_pending_txn_count, send_raw_tx, get_latest_txn_count) was wrapped in tokio::time::timeout(Duration::from_secs(10), ...). When that fires, the future is dropped mid-response; reqwest treats that as an aborted request and discards the socket instead of returning it to the pool. Under any tail latency this leaks connections fast.

Fix:

  • pool_idle_timeout → 90s (reqwest default)
  • Add .timeout(Duration::from_secs(10)) to the reqwest::ClientBuilder chain. reqwest's built-in timeout is connection-aware and recycles the socket back to the pool when it fires.
  • Remove all three tokio::time::timeout(...) wraps at the call sites.

Measured impact: 20K → 2K ESTABLISHED connections in the long-run bench.

Bug 2 — U256 underflow in faucet cascade (src/main.rs, src/txn_plan/constructor/faucet.rs)

Scenario: If fauce_eth_balance (config) is larger than the actual on-chain faucet balance, or gas_budget × total_txns exceeds fauce_eth_balance, then inside FaucetTreePlanBuilder::new:

let amount_for_leaves = faucet_balance - total_cost;  // U256 wraps silently

The U256 subtraction wraps to ~2^256, producing per-child funding values that permanently strip ENOUGH_BALANCE checks in the reth txpool — the cascade silently deadlocks with no progress and no error.

Fixes:

  • src/main.rs: query eth_getBalance(faucet) at startup and compute effective_faucet_eth = min(configured, on_chain) − 1% headroom (or clamp down when config exceeds on-chain). Pass this to create_faucet_tree_plan_builder instead of the raw config value.
  • src/txn_plan/constructor/faucet.rs: assert!(faucet_balance >= total_cost, ...) before the subtraction, with a diagnostic that names every input.

Drive-by — nonce-init tuning (src/main.rs)

  • buffer_unordered: 1024 → 16 (don't hammer the RPC during init)
  • per-call timeout: 10s → 60s (init under load is genuinely slow)

Test plan

  • cargo check clean (no warnings)
  • Stable across overnight 1500-sender long-run bench; ESTABLISHED socket count plateaus around 2K instead of climbing to 20K
  • Cascade no longer deadlocks when fauce_eth_balance is misconfigured — assertion now fires with a readable diagnostic

keanji-x and others added 2 commits May 12, 2026 13:16
Two production bugs uncovered during a long-running 1500-sender bench
against a single reth RPC endpoint.

eth_cli.rs — connection leak under high load
- pool_idle_timeout was 10s, well below reqwest's 90s default. Short
  idle eviction churned sockets faster than they were reused, opening
  a fresh TCP connection for nearly every RPC call. With 1500 senders
  this drove ~20K ESTABLISHED sockets to a single RPC port and tripped
  the target reth's rpc.max-connections=20000 cap, wedging the server.
  Restored to 90s and added a connection-aware reqwest .timeout(10s)
  on the ClientBuilder.
- The three hot RPCs (get_pending_txn_count, send_raw_tx,
  get_latest_txn_count) were wrapped in tokio::time::timeout(10s, ...).
  When that fires it drops the future mid-response; reqwest sees the
  abort and discards the socket rather than returning it to the pool,
  which leaks connections under any tail latency. Replaced with
  reqwest's built-in per-request timeout (set on the ClientBuilder
  above), which is connection-aware and recycles sockets correctly.
- Measured impact: 20K -> 2K ESTABLISHED connections in long-run bench.

main.rs — clamp faucet ETH to on-chain reality
- start_bench now queries eth_getBalance(faucet) and computes
  effective_faucet_eth = on_chain - 1% headroom when the configured
  fauce_eth_balance is smaller than the on-chain amount, and clamps
  down when the config exceeds on-chain. This prevents the cascade
  builder from receiving a value that can't cover total_cost, which
  would silently U256-underflow downstream.
- init_nonce: per-RPC timeout 10s -> 60s and buffer_unordered 1024 -> 16
  to stop hammering the RPC endpoint during the nonce-init phase.

faucet.rs — assert before U256 subtraction
- Added assert!(faucet_balance >= total_cost, ...) before the
  faucet_balance - total_cost subtraction inside
  FaucetTreePlanBuilder::new. U256 wraps silently on underflow,
  producing per-child funding ~2^256/N that permanently strips
  ENOUGH_BALANCE in the txpool and deadlocks the cascade. Fail fast
  with a diagnostic instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- main.rs: rewrite self-contradicting comment block on effective_faucet_eth
  selection (clamp DOWN vs scale UP semantics) in plain English
- faucet.rs: add symmetric faucet_balance >= total_gas_cost assert to the
  single-level (total_levels == 1) branch so it surfaces the same diagnostic
  as the multi-level path before U256::sub silently underflows
- main.rs: document the buffer_unordered(16) concurrency choice and the 60s
  per-fetch timeout in init_nonce
- main.rs: note that the faucet-balance .expect() is wrapped in
  retry_with_backoff (3x with 5s/10s/10s, ~25s wall before panic)
- main.rs: drop the two redundant warn! lines on the cascade-sizing branches;
  the single info! "Faucet ETH plan: configured=.. on_chain=.. effective=.."
  already carries every field, and the warn! for the "configured exceeds
  on-chain" anomaly is retained as it signals operator misconfig
- main.rs: add inline note that on_chain == 0 falls through to the cascade
  builder's assert (rather than U256-underflowing inside this fn)

No behavior change. cargo check clean, no new warnings.
@keanji-x keanji-x merged commit 204f083 into main May 26, 2026
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant