locator: Don't crash when on startup when control plane is unreachable#128
Conversation
In sentry devservices, the proxy and ingest router must be available even when the Sentry application (control plane) isn't up and running yet AND no backup file exists from a previous run. Now the locator can start up even with the initial load fails: this is retried periodically in line with the incremental load loop. The devservices healthcheck now hits the /health endpoint instead of the /ready endpoint. /ready is unchanged and still reports 503 until routing data is loaded, so it can still be used to gate the kubernetes service from routing traffic before control plane data is loaded.
strongs
left a comment
There was a problem hiding this comment.
One nitpick comment but otherwise LGTM
| match self.load_incremental().await { | ||
| // Flip ready on first success — handles the case where | ||
| // initial snapshot failed but a later incremental succeeded. | ||
| Ok(()) => self.ready.store(true, Ordering::Relaxed), |
There was a problem hiding this comment.
nitpick: we're setting to true on every successful incremental load, just means store() gets called every loop - we could gate w/ if !self.ready.laod(Ordering::Relaxed) to only write once
There was a problem hiding this comment.
actually i rewrote this splitting out the incremental and snapshot load cases more clearly..
this solves the problem that the sentry bot called out in #128 (comment)
it also means the self.ready.store() is now only inside the snapshot load path and not the incremental one
| // initial snapshot failed but a later incremental succeeded. | ||
| Ok(()) => { | ||
| if !self.ready.load(Ordering::Relaxed) { | ||
| self.ready.store(true, Ordering::Relaxed); | ||
| } | ||
| } | ||
| Err(err) => tracing::warn!("Incremental load failed: {err:?}; will retry"), | ||
| } | ||
| } | ||
| Some(cmd) = rx.recv() => { | ||
| match cmd { |
There was a problem hiding this comment.
Bug: Data recovered via load_incremental() after a failed snapshot is not persisted to the backup file, leading to data loss on restart.
Severity: HIGH
Suggested Fix
After successfully loading data within the load_incremental function, especially when it performs a full data fetch, call self.backup_routes.store() to persist the newly acquired data to the backup file. This ensures that subsequent restarts can use the recovered data even if the control plane is unreachable.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: locator/src/locator.rs#L297-L312
Potential issue: When an initial `load_snapshot()` fails, the service can recover via a
subsequent `load_incremental()` call. However, the `load_incremental` function loads
data into memory and marks the service as ready without ever persisting this data to the
backup store. The function `load_incremental` never calls `backup_routes.store()`. If
the service restarts while the control plane is still unavailable, it will have no
backup data to fall back on, losing all previously recovered information and starting
from an empty state. This makes any data recovered through the incremental path
ephemeral.
Did we get this right? 👍 / 👎 to inform future reviews.
In sentry devservices, the proxy and ingest router must be available even when the Sentry application (control plane) isn't up and running yet AND no backup file exists from a previous run.
Now the locator can start up even with the initial load fails: this is retried periodically in line with the incremental load loop.
The devservices healthcheck now hits the /health endpoint instead of the /ready endpoint. /ready is unchanged and still reports 503 until routing data is loaded, so it can still be used to gate the kubernetes service from routing traffic before control plane data is loaded.