Skip to content

feat(load/peak_ewma): expose round-trip time estimate#871

Open
cratelyn wants to merge 4 commits into
tower-rs:masterfrom
cratelyn:peak-ewma-expose-rtt
Open

feat(load/peak_ewma): expose round-trip time estimate#871
cratelyn wants to merge 4 commits into
tower-rs:masterfrom
cratelyn:peak-ewma-expose-rtt

Conversation

@cratelyn

@cratelyn cratelyn commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

this pull request proposes some extensions to the public interface of the tower::load::peak_ewma middleware.

today, this submodule provides a tower::load::peak_ewma::PeakEwma<S, C> service that implements tower::load::Load. this is useful for using peak latency as a component that influences load-balancing decisions in tower::balance.

today, users that may wish to build middleware that wish to build upon this exponentially-weighted moving average do not have a way to access the observed round-trip time (RTT) estimate. this prevents callers from, for example, injecting additional penalties for particular status codes, in order to steer traffic away from services returning 5XX or 429 responses.

this branch makes a few changes:

  • expose RttEstimate in the public interface
  • make RttEstimate a Cloneable type
  • add accessors to RttEstimate

cratelyn added 4 commits June 4, 2026 08:19
Signed-off-by: katelyn martin <git@katelyn.world>
Signed-off-by: katelyn martin <git@katelyn.world>
this commit introduces a method to allow the current round-trip time
(RTT) estimate of a `PeakEwma<S, C>` service to be accessed.

Signed-off-by: katelyn martin <git@katelyn.world>
Signed-off-by: katelyn martin <git@katelyn.world>

@cratelyn cratelyn left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some notes for review...

Comment thread tower/src/load/peak_ewma.rs
Comment on lines +231 to +239
/// Returns the [`Instant`] that this estimate was last updated.
pub fn updated_at(&self) -> Instant {
self.update_at
}

/// Returns the round-trip time estimate, in nanoseconds.
pub fn rtt_ns(&self) -> f64 {
self.rtt_ns
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another way to do this would be making the inner update_at and rtt_ns fields public, but i didn't like that. introducing new fields would become a semver breaking change, and i'm fairly sure that the name update_at contains a typo (i would expect that this be named updated_at).

i went this route to provide a stable public interface.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't even remember, why is nanoseconds stored in f64 instead of a u64 or u128? Is it to make use of the exponent to represent even bigger numbers in 64 bits?

Would it be worth exposing the value publicly as an unsigned int?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, a different crazy idea: struct Rtt(_), so the exact repr inside is not public? Does the number truly matter, or do you just need an impl Ord?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, a different crazy idea: struct Rtt(_), so the exact repr inside is not public? Does the number truly matter, or do you just need an impl Ord?

in our case, the number does matter. you're right that Ord is the way that tower::balance steers traffic given two Load::metrics (see https://github.com/cratelyn/tower/blob/251296dc54a044383dffd16d2179b443e2615672/tower/src/balance/p2c/service.rs#L168-L171), but i'm interested in providing a way for tower middleware to build atop this layer with their own Load implementations.

we'd like to be able to take this RTT estimate, while also taking into account other properties of the traffic like response status codes.

@cratelyn cratelyn Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don't even remember, why is nanoseconds stored in f64 instead of a u64 or u128? Is it to make use of the exponent to represent even bigger numbers in 64 bits?

Would it be worth exposing the value publicly as an unsigned int?

my understanding is that this is stored as a floating-point number because of the math in the RttEstimate::update(..) method here: https://github.com/tower-rs/tower/blob/master/tower/src/load/peak_ewma.rs#L263-L277

// When an RTT is observed that is less than the estimated RTT, we decay the
// prior estimate according to how much time has elapsed since the last
// update. The inverse of the decay is used to scale the estimate towards the
// observed RTT value.
let elapsed = nanos(now.saturating_duration_since(self.update_at));
let decay = (-elapsed / decay_ns).exp();
let recency = 1.0 - decay;
let next_estimate = (self.rtt_ns * decay) + (rtt * recency);
trace!(
    "update rtt={:03.0}ms decay={:06.0}ns; next={:03.0}ms",
    rtt / NANOS_PER_MILLI,
    self.rtt_ns - next_estimate,
    next_estimate / NANOS_PER_MILLI,
);

because of this, when we observe requests taking less time than the original peak, we decay the estimate exponentially, which can lead to fractional estimates.

rounding to the nearest nanosecond, i.e. writing an inverse of the nanos() helper here, would let us return the estimate as a Duration (via Duration::from_nanos()), which does seem like a much more intuitive interface to expose publicly.

what do you think, @seanmonstar?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, gotcha. So, we don't have to do RTT math with floating points. The Linux kernel does TCP RTT/EWMA with integers. 🤷

But, it also could be just whatever, I guess if we change the types internally, the numbers will never be so large that a cast will lose anything.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a great point. i can refactor this middleware to switch away from floating point numbers to integers, which would make exposing the current estimate more natural.

i'll do that as a separate pull request, to keep each change easily reviewable.

Comment on lines +236 to +239
/// Returns the round-trip time estimate, in nanoseconds.
pub fn rtt_ns(&self) -> f64 {
self.rtt_ns
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i toyed briefly with finding a way to convert this into a Duration, but it seems like a fundamentally lossy translation that's best handled by callers. we could add a rtt that rounds to the nearest nanosecond, if that'd be welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants