Skip to content

chore: add actor-level retry for infra errors#2062

Merged
SantiagoPittella merged 5 commits into
mainfrom
santiagopittella-fix-ntx-builder-note-discarding
May 19, 2026
Merged

chore: add actor-level retry for infra errors#2062
SantiagoPittella merged 5 commits into
mainfrom
santiagopittella-fix-ntx-builder-note-discarding

Conversation

@SantiagoPittella
Copy link
Copy Markdown
Collaborator

@SantiagoPittella SantiagoPittella commented May 7, 2026

closes #2052.

The NTX builder was permanently dropping notes after a run of unrelated infrastructure failures because every error type counted against the per-note max_note_attempts cap.

This PR classifies NtxError into infrastructure vs. intrinsic. Infra failures now retry the same candidate after a exponential backoff (using backon) without touching per-note state. Intrinsic failures keep the existing behaviour.

This PR also introduces a dependency for backoff, backon, and note that I didn't used the retry feature (probably the most important part) because we use a tokio::select! that needs to interleave several other things around each retry. Making use of retry will require further "plumbing" changes.

Comment thread crates/ntx-builder/src/actor/mod.rs Outdated
},
ExecutionOutcome::IntrinsicFailure => {
self.reset_infra_backoff();
self.mode = ActorMode::NoViableNotes;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is incorrect? A failure doesn't mean that there are notes to consume?

Comment thread crates/ntx-builder/src/actor/mod.rs
Copy link
Copy Markdown
Collaborator

@Mirko-von-Leipzig Mirko-von-Leipzig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits, suggestions, but LGTM!

Comment thread crates/ntx-builder/src/actor/mod.rs
Comment thread crates/ntx-builder/src/actor/execute.rs Outdated
Copy link
Copy Markdown
Collaborator

@igamigo igamigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a few non-blocking minor nits, feel free to disregard if you prefer otherwise.
Also, not sure but this might also close #1550, no? Or maybe we want to separately implement the crate in other parts of the code first.

Comment thread crates/ntx-builder/src/lib.rs Outdated
Comment on lines +715 to +721
.notify(|err, dur| {
tracing::warn!(
err = %err.as_report(),
sleep_ms = dur.as_millis() as u64,
"store get_note_script_by_root failed transiently; retrying after backoff",
);
})
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all very similar so might be worth doing a generic function for logging any NTX transient errors. This may seem overkill, but when debugging live services and filtering logs I've found it helps a lot to have as much of the error wording be predictable/structured so you can narrow down a log as much as possible (e.g., a new retryable call may be introduced with different wording so when you are filtering it's harder to identify a call or error)

Comment thread crates/ntx-builder/src/actor/execute.rs Outdated
@SantiagoPittella
Copy link
Copy Markdown
Collaborator Author

Also, not sure but this might also close #1550, no? Or maybe we want to separately implement the crate in other parts of the code first.

Does not fully, to achieve that we should replace our implementation of retry and backoff in several places. The scope of this PR was to, besides fixing the actor level errors, to try a first approach to that. If we did like this, I can open another issue to replace it in some more places

Copy link
Copy Markdown
Collaborator

@Mirko-von-Leipzig Mirko-von-Leipzig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, feel free to merge when you're happy.

@SantiagoPittella SantiagoPittella merged commit 3f7a18f into main May 19, 2026
17 checks passed
@SantiagoPittella SantiagoPittella deleted the santiagopittella-fix-ntx-builder-note-discarding branch May 19, 2026 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants