Skip to content

locator: Don't crash when on startup when control plane is unreachable#128

Merged
lynnagara merged 4 commits into
mainfrom
allow-startup-without-control-plane
May 18, 2026
Merged

locator: Don't crash when on startup when control plane is unreachable#128
lynnagara merged 4 commits into
mainfrom
allow-startup-without-control-plane

Conversation

@lynnagara
Copy link
Copy Markdown
Member

In sentry devservices, the proxy and ingest router must be available even when the Sentry application (control plane) isn't up and running yet AND no backup file exists from a previous run.

Now the locator can start up even with the initial load fails: this is retried periodically in line with the incremental load loop.

The devservices healthcheck now hits the /health endpoint instead of the /ready endpoint. /ready is unchanged and still reports 503 until routing data is loaded, so it can still be used to gate the kubernetes service from routing traffic before control plane data is loaded.

In sentry devservices, the proxy and ingest router must be available even when the
Sentry application (control plane) isn't up and running yet AND no backup file exists
from a previous run.

Now the locator can start up even with the initial load fails: this is retried
periodically in line with the incremental load loop.

The devservices healthcheck now hits the /health endpoint instead of the /ready endpoint.
/ready is unchanged and still reports 503 until routing data is loaded, so it
can still be used to gate the kubernetes service from routing traffic before control
plane data is loaded.
@lynnagara lynnagara requested a review from a team as a code owner May 15, 2026 23:19
Copy link
Copy Markdown
Member

@strongs strongs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nitpick comment but otherwise LGTM

Comment thread locator/src/locator.rs Outdated
match self.load_incremental().await {
// Flip ready on first success — handles the case where
// initial snapshot failed but a later incremental succeeded.
Ok(()) => self.ready.store(true, Ordering::Relaxed),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: we're setting to true on every successful incremental load, just means store() gets called every loop - we could gate w/ if !self.ready.laod(Ordering::Relaxed) to only write once

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Copy Markdown
Member Author

@lynnagara lynnagara May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i rewrote this splitting out the incremental and snapshot load cases more clearly..

this solves the problem that the sentry bot called out in #128 (comment)

it also means the self.ready.store() is now only inside the snapshot load path and not the incremental one

Comment thread locator/src/locator.rs Outdated
Comment on lines 302 to 312
// initial snapshot failed but a later incremental succeeded.
Ok(()) => {
if !self.ready.load(Ordering::Relaxed) {
self.ready.store(true, Ordering::Relaxed);
}
}
Err(err) => tracing::warn!("Incremental load failed: {err:?}; will retry"),
}
}
Some(cmd) = rx.recv() => {
match cmd {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Data recovered via load_incremental() after a failed snapshot is not persisted to the backup file, leading to data loss on restart.
Severity: HIGH

Suggested Fix

After successfully loading data within the load_incremental function, especially when it performs a full data fetch, call self.backup_routes.store() to persist the newly acquired data to the backup file. This ensures that subsequent restarts can use the recovered data even if the control plane is unreachable.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: locator/src/locator.rs#L297-L312

Potential issue: When an initial `load_snapshot()` fails, the service can recover via a
subsequent `load_incremental()` call. However, the `load_incremental` function loads
data into memory and marks the service as ready without ever persisting this data to the
backup store. The function `load_incremental` never calls `backup_routes.store()`. If
the service restarts while the control plane is still unavailable, it will have no
backup data to fall back on, losing all previously recovered information and starting
from an empty state. This makes any data recovered through the incremental path
ephemeral.

Did we get this right? 👍 / 👎 to inform future reviews.

@lynnagara lynnagara merged commit db1cd1b into main May 18, 2026
20 checks passed
@lynnagara lynnagara deleted the allow-startup-without-control-plane branch May 18, 2026 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants