feat(load/peak_ewma): expose round-trip time estimate#871
Conversation
Signed-off-by: katelyn martin <git@katelyn.world>
Signed-off-by: katelyn martin <git@katelyn.world>
this commit introduces a method to allow the current round-trip time (RTT) estimate of a `PeakEwma<S, C>` service to be accessed. Signed-off-by: katelyn martin <git@katelyn.world>
Signed-off-by: katelyn martin <git@katelyn.world>
cratelyn
left a comment
There was a problem hiding this comment.
some notes for review...
| /// Returns the [`Instant`] that this estimate was last updated. | ||
| pub fn updated_at(&self) -> Instant { | ||
| self.update_at | ||
| } | ||
|
|
||
| /// Returns the round-trip time estimate, in nanoseconds. | ||
| pub fn rtt_ns(&self) -> f64 { | ||
| self.rtt_ns | ||
| } |
There was a problem hiding this comment.
another way to do this would be making the inner update_at and rtt_ns fields public, but i didn't like that. introducing new fields would become a semver breaking change, and i'm fairly sure that the name update_at contains a typo (i would expect that this be named updated_at).
i went this route to provide a stable public interface.
There was a problem hiding this comment.
Hm, I don't even remember, why is nanoseconds stored in f64 instead of a u64 or u128? Is it to make use of the exponent to represent even bigger numbers in 64 bits?
Would it be worth exposing the value publicly as an unsigned int?
There was a problem hiding this comment.
Or, a different crazy idea: struct Rtt(_), so the exact repr inside is not public? Does the number truly matter, or do you just need an impl Ord?
There was a problem hiding this comment.
Or, a different crazy idea:
struct Rtt(_), so the exact repr inside is not public? Does the number truly matter, or do you just need animpl Ord?
in our case, the number does matter. you're right that Ord is the way that tower::balance steers traffic given two Load::metrics (see https://github.com/cratelyn/tower/blob/251296dc54a044383dffd16d2179b443e2615672/tower/src/balance/p2c/service.rs#L168-L171), but i'm interested in providing a way for tower middleware to build atop this layer with their own Load implementations.
we'd like to be able to take this RTT estimate, while also taking into account other properties of the traffic like response status codes.
There was a problem hiding this comment.
Hm, I don't even remember, why is nanoseconds stored in f64 instead of a u64 or u128? Is it to make use of the exponent to represent even bigger numbers in 64 bits?
Would it be worth exposing the value publicly as an unsigned int?
my understanding is that this is stored as a floating-point number because of the math in the RttEstimate::update(..) method here: https://github.com/tower-rs/tower/blob/master/tower/src/load/peak_ewma.rs#L263-L277
// When an RTT is observed that is less than the estimated RTT, we decay the
// prior estimate according to how much time has elapsed since the last
// update. The inverse of the decay is used to scale the estimate towards the
// observed RTT value.
let elapsed = nanos(now.saturating_duration_since(self.update_at));
let decay = (-elapsed / decay_ns).exp();
let recency = 1.0 - decay;
let next_estimate = (self.rtt_ns * decay) + (rtt * recency);
trace!(
"update rtt={:03.0}ms decay={:06.0}ns; next={:03.0}ms",
rtt / NANOS_PER_MILLI,
self.rtt_ns - next_estimate,
next_estimate / NANOS_PER_MILLI,
);because of this, when we observe requests taking less time than the original peak, we decay the estimate exponentially, which can lead to fractional estimates.
rounding to the nearest nanosecond, i.e. writing an inverse of the nanos() helper here, would let us return the estimate as a Duration (via Duration::from_nanos()), which does seem like a much more intuitive interface to expose publicly.
what do you think, @seanmonstar?
There was a problem hiding this comment.
Oh, gotcha. So, we don't have to do RTT math with floating points. The Linux kernel does TCP RTT/EWMA with integers. 🤷
But, it also could be just whatever, I guess if we change the types internally, the numbers will never be so large that a cast will lose anything.
There was a problem hiding this comment.
that's a great point. i can refactor this middleware to switch away from floating point numbers to integers, which would make exposing the current estimate more natural.
i'll do that as a separate pull request, to keep each change easily reviewable.
| /// Returns the round-trip time estimate, in nanoseconds. | ||
| pub fn rtt_ns(&self) -> f64 { | ||
| self.rtt_ns | ||
| } |
There was a problem hiding this comment.
i toyed briefly with finding a way to convert this into a Duration, but it seems like a fundamentally lossy translation that's best handled by callers. we could add a rtt that rounds to the nearest nanosecond, if that'd be welcome.
this pull request proposes some extensions to the public interface of the
tower::load::peak_ewmamiddleware.today, this submodule provides a
tower::load::peak_ewma::PeakEwma<S, C>service that implementstower::load::Load. this is useful for using peak latency as a component that influences load-balancing decisions intower::balance.today, users that may wish to build middleware that wish to build upon this exponentially-weighted moving average do not have a way to access the observed round-trip time (RTT) estimate. this prevents callers from, for example, injecting additional penalties for particular status codes, in order to steer traffic away from services returning 5XX or 429 responses.
this branch makes a few changes:
RttEstimatein the public interfaceRttEstimateaCloneable typeRttEstimate