chore: add actor-level retry for infra errors#2062
Conversation
| }, | ||
| ExecutionOutcome::IntrinsicFailure => { | ||
| self.reset_infra_backoff(); | ||
| self.mode = ActorMode::NoViableNotes; |
There was a problem hiding this comment.
I think this is incorrect? A failure doesn't mean that there are notes to consume?
Mirko-von-Leipzig
left a comment
There was a problem hiding this comment.
Some minor nits, suggestions, but LGTM!
There was a problem hiding this comment.
LGTM! Just a few non-blocking minor nits, feel free to disregard if you prefer otherwise.
Also, not sure but this might also close #1550, no? Or maybe we want to separately implement the crate in other parts of the code first.
| .notify(|err, dur| { | ||
| tracing::warn!( | ||
| err = %err.as_report(), | ||
| sleep_ms = dur.as_millis() as u64, | ||
| "store get_note_script_by_root failed transiently; retrying after backoff", | ||
| ); | ||
| }) |
There was a problem hiding this comment.
These are all very similar so might be worth doing a generic function for logging any NTX transient errors. This may seem overkill, but when debugging live services and filtering logs I've found it helps a lot to have as much of the error wording be predictable/structured so you can narrow down a log as much as possible (e.g., a new retryable call may be introduced with different wording so when you are filtering it's harder to identify a call or error)
Does not fully, to achieve that we should replace our implementation of retry and backoff in several places. The scope of this PR was to, besides fixing the actor level errors, to try a first approach to that. If we did like this, I can open another issue to replace it in some more places |
Mirko-von-Leipzig
left a comment
There was a problem hiding this comment.
Thanks, feel free to merge when you're happy.
closes #2052.
The NTX builder was permanently dropping notes after a run of unrelated infrastructure failures because every error type counted against the per-note
max_note_attemptscap.This PR classifies
NtxErrorinto infrastructure vs. intrinsic. Infra failures now retry the same candidate after a exponential backoff (usingbackon) without touching per-note state. Intrinsic failures keep the existing behaviour.This PR also introduces a dependency for backoff,
backon, and note that I didn't used theretryfeature (probably the most important part) because we use atokio::select!that needs to interleave several other things around each retry. Making use of retry will require further "plumbing" changes.