Skip to content

feat(tonic-xds): implement gRFC A50 outlier detection success-rate algorithm#2673

Open
LYZJU2019 wants to merge 1 commit into
grpc:masterfrom
LYZJU2019:lyzju2019/a50-outlier-detector-success-rate
Open

feat(tonic-xds): implement gRFC A50 outlier detection success-rate algorithm#2673
LYZJU2019 wants to merge 1 commit into
grpc:masterfrom
LYZJU2019:lyzju2019/a50-outlier-detector-success-rate

Conversation

@LYZJU2019

@LYZJU2019 LYZJU2019 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Motivation

#2619 landed the failure-percentage algorithm and the shared sweep plumbing. This PR fills in the second gRFC A50 ejection algorithm: success-rate (cross-host mean/stdev).

Solution

In the housekeeping sweep, the success-rate algorithm now runs before the failure-percentage algorithm (A50 lists them in that order, and Envoy's implementation follows the same order; the two algorithms are otherwise independent and gated separately).

Tests

Seven new unit tests covering:

  • success_rate_ejects_outlier_below_threshold — 4× 100% + 1× 0%, mean=80, stdev=40, threshold@1900 = 4 → outlier ejected.
  • success_rate_uniform_population_does_not_eject — stdev=0 → threshold=mean, nothing strictly below.
  • success_rate_minimum_hosts_gates_ejection — qualifying<minimum_hosts → algorithm skipped.
  • success_rate_request_volume_filters_low_traffic — low-volume outlier excluded from population and candidates.
  • success_rate_enforcement_zero_never_ejectsenforcing_success_rate = 0 short-circuits the roll.
  • success_rate_stdev_factor_zero_ejects_below_mean — factor=0 collapses threshold to the mean.
  • success_rate_max_ejection_percent_caps_concurrent_ejections — cap (with floor) holds.
  • success_rate_and_failure_percentage_compose — both algorithms configured: success-rate ejects, failure-percentage skips already-ejected host (no double-count).

All 33 tests in client::loadbalance::outlier_detection pass, including the 26 from #2619.

…gorithm

grpc#2619 landed the failure-percentage algorithm and the shared sweep
plumbing. This change fills in the second A50 ejection algorithm:
success-rate (cross-host mean/stdev).

Algorithm (run before failure-percentage in the same sweep, gated by
SuccessRateConfig being present on the cluster config):

  1. For each host with total >= request_volume, compute its success
     rate as a percentage in 0.0..=100.0.
  2. If fewer than minimum_hosts qualify, skip the algorithm.
  3. Compute mean and population stdev of the success rates.
  4. threshold = mean - stdev * stdev_factor / 1000
  5. For each qualifying host whose success rate is strictly below the
     threshold, attempt ejection — subject to max_ejection_percent (with
     A50's at-least-one floor) and the enforcing_success_rate roll.
     Hosts already ejected (e.g., by a previous algorithm in this sweep)
     are skipped, and ejections feed into ejected_count so the
     subsequent failure-percentage pass respects the cap.

Success-rate runs first because A50 lists it first and Envoy's
implementation runs the algorithms in the same order; the two
algorithms are otherwise independent.

Seven new unit tests cover the outlier-below-threshold happy path,
uniform-population no-eject (stdev = 0), minimum_hosts gating,
request_volume filtering, enforcement = 0 no-op, stdev_factor = 0
collapsing to the mean, max_ejection_percent interaction, and
combined success-rate + failure-percentage composition.
Comment on lines +221 to +239
for (state, s, f) in &snapshots {
let total = s + f;
if total < request_volume || state.is_ejected() {
continue;
}
if self.ejected_count.load(Ordering::Relaxed) >= max_ejections {
break;
}
let rate = 100.0 * (*s as f64) / (total as f64);
if rate >= threshold {
continue;
}
if !roll(enforcing) {
continue;
}
if state.try_eject(now) {
self.ejected_count.fetch_add(1, Ordering::Relaxed);
self.ejected_set_version.fetch_add(1, Ordering::Relaxed);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is duplicated both for success_rate and failure_percentage logic except one line. Can you pleas create a new function for the common logic so that no drift is introduced in future between the two impls

if self.ejected_count.load(Ordering::Relaxed) >= max_ejections {
break;
}
let rate = 100.0 * (*s as f64) / (total as f64);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If total and request_volume are 0, then rate will (100*0)/0 which is NaN. Can this might lead to ejection of all the hosts

let mut config = sr_config(1900, 10, 3);
config.max_ejection_percent = pct(20);
let registry = make_registry_only(config);
// 4 hosts at 100%, 1 at 0%. The outlier is the only candidate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only 1 bad host and max cap is also 1. So this test will never breach the cap. Maybe keep number of bad hosts > 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants