fix(watchdog): force-kill stuck-busy backends instead of deadlocking the loader #10578
Open
nandanadileep wants to merge 1 commit into
…the loader

When the watchdog's busy-killer decides a backend has been busy past the busy timeout, it shuts it down via ModelLoader.ShutdownModel -> deleteProcess, which grabs ml.mu and then waits for IsBusy() to clear BEFORE stopping the process. But a backend that exceeds the busy timeout is, by definition, stuck on an in-flight gRPC call, so the graceful wait never returns, ml.mu is held forever, and every other ml.Load blocks, including the shared opus backend load at the start of every realtime (WebRTC) session. New realtime connections then hang at "Connected, waiting for session..." whenever the watchdog is enabled, while logs repeatedly print the watchdog's busy / "active connection" line.

Fix: add a force shutdown path (ShutdownModelForce / deleteProcess(s, force=true)) that stops the process FIRST, dropping the stuck call's gRPC connection and unblocking it, instead of waiting on it. Route the watchdog's busy-killer and busy LRU / group / memory evictions through the force path; keep the graceful wait for idle and user-initiated unloads. Graceful/unforced kills are unchanged.

Regression test: the watchdog busy-killer uses ShutdownModelForce.

Fixes mudler#10391
Collaborator
I wonder why this would affect realtime specifically?
Problem
With the watchdog enabled, the first realtime (Talk) WebRTC session works, but the second connection hangs at "Connected, waiting for session...". The logs keep printing the watchdog's busy / "active connection" line (and a VAD-already-loaded message) forever, and no further realtime session starts.
Fixes #10391
Root cause
When the watchdog's busy-killer decides a backend has been busy past the busy timeout, it shuts it down via ModelLoader.ShutdownModel → deleteProcess. That path grabs ml.mu and then waits for IsBusy() to clear BEFORE stopping the process. But a backend that exceeds the busy timeout is, by definition, stuck on an in-flight gRPC call, so the graceful wait never returns.
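The ordering problem described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: the loader and backend types, gracefulDelete, and tryLoad are hypothetical stand-ins, with loader.mu playing the role of ml.mu and backend.busy the role of IsBusy(). The demo clears the busy flag at the end only so that it terminates; a genuinely stuck backend never would.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// backend is a hypothetical stand-in for a model backend process.
type backend struct {
	busy atomic.Bool // true while a call is in flight
}

// loader is a hypothetical stand-in for ModelLoader; mu plays the role of ml.mu.
type loader struct {
	mu sync.Mutex
}

// gracefulDelete mimics the problematic ordering: take the lock, then wait
// for the backend to go idle before stopping it. locked is closed once the
// lock is held so the demo can sequence its steps deterministically.
func (l *loader) gracefulDelete(b *backend, locked chan<- struct{}) {
	l.mu.Lock()
	defer l.mu.Unlock()
	close(locked)
	for b.busy.Load() {
		time.Sleep(time.Millisecond)
	}
	// Only here would the process actually be stopped.
}

// tryLoad reports whether a Load-like caller could take the lock in time.
func (l *loader) tryLoad(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if l.mu.TryLock() {
			l.mu.Unlock()
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

// demoGracefulWedge runs a graceful delete against a stuck backend and
// checks whether a concurrent load can make progress.
func demoGracefulWedge() (blockedWhileStuck, loadedAfterRelease bool) {
	l := &loader{}
	b := &backend{}
	b.busy.Store(true) // the in-flight call never finishes on its own

	locked := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		l.gracefulDelete(b, locked)
	}()

	<-locked // the delete now holds the lock and is waiting on busy
	blockedWhileStuck = !l.tryLoad(50 * time.Millisecond)

	b.busy.Store(false) // demo-only escape hatch so the goroutine exits
	<-done
	loadedAfterRelease = l.tryLoad(50 * time.Millisecond)
	return blockedWhileStuck, loadedAfterRelease
}

func main() {
	blocked, after := demoGracefulWedge()
	fmt.Println("load blocked while backend is stuck:", blocked)
	fmt.Println("load succeeds once the lock is released:", after)
}
```

In the sketch, every caller that needs the lock is starved for as long as the busy flag stays set, which matches the symptom of new sessions hanging until the server is restarted.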
ml.mu is held forever, and every subsequent ml.Load blocks, including the shared opus backend load that runs at the start of every realtime (WebRTC) session (RealtimeCalls → application.ModelLoader().Load(...)). New realtime connections therefore hang at "Connected, waiting for session..." until the server is restarted. The repeated "vad already loaded" log line is the watchdog re-logging the busy model that nothing can touch while ml.mu is wedged.

Fix
Add a force shutdown path (ShutdownModelForce / deleteProcess(force=true)) that stops the process FIRST, dropping the stuck call's gRPC connection and unblocking it, instead of waiting on it, and skips the Free() RPC (a stuck-busy backend won't answer it; stopping the process releases its VRAM anyway).

Route the watchdog's in-place evictions through the force path, since those only run on backends that are busy:
- checkBusy (busy-killer) → ShutdownModelForce
- EnforceLRULimit / EnforceGroupExclusivity → force for busy targets, graceful for idle ones
- evictLRUModel (memory reclaimer) → force for busy targets

Graceful behaviour for idle timeouts and user-initiated unloads is unchanged.
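The two orderings can be compared side by side in a small model. This is a sketch under stated assumptions rather than the code in this PR: the backend and loader types, newBackend, and evict are hypothetical, the in-flight gRPC call is modelled as a goroutine that only returns once the process is stopped, and the Free() RPC is reduced to a recorded step.

```go
package main

import (
	"fmt"
	"sync"
)

// backend is a hypothetical model of a backend process. stopped is closed
// when the process is stopped; inflight tracks calls that are still running.
type backend struct {
	stopped  chan struct{}
	inflight sync.WaitGroup
}

// newBackend optionally starts a call that never finishes on its own,
// which is how a stuck-busy backend is modelled here.
func newBackend(stuck bool) *backend {
	b := &backend{stopped: make(chan struct{})}
	if stuck {
		b.inflight.Add(1)
		go func() {
			defer b.inflight.Done()
			<-b.stopped // returns only once the process is gone
		}()
	}
	return b
}

// loader is a hypothetical stand-in for ModelLoader; steps records the
// order of operations so the two paths can be compared.
type loader struct {
	mu    sync.Mutex
	steps []string
}

// deleteProcess sketches both orderings. The graceful path waits for
// in-flight calls and frees resources before stopping; the force path
// skips both and stops the process first.
func (l *loader) deleteProcess(b *backend, force bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !force {
		b.inflight.Wait() // would block forever on a stuck backend
		l.steps = append(l.steps, "free")
	}
	close(b.stopped) // stopping the process unblocks any stuck call
	l.steps = append(l.steps, "stop")
}

// evict routes busy targets through the force path and idle ones through
// the graceful path, mirroring the routing listed above.
func (l *loader) evict(b *backend, busy bool) {
	l.deleteProcess(b, busy)
}

// demoForceRouting evicts one stuck backend and one idle backend and
// returns the recorded steps.
func demoForceRouting() []string {
	l := &loader{}
	stuck := newBackend(true)
	idle := newBackend(false)

	l.evict(stuck, true)
	stuck.inflight.Wait() // returns, because the stop released the call
	l.evict(idle, false)
	return l.steps
}

func main() {
	fmt.Println(demoForceRouting()) // prints [stop free stop]
}
```

The stuck backend records only a stop, while the idle one records free followed by stop. Passing force=false for the stuck backend in this model would hang on the wait while holding the lock, which is the deadlock the force path avoids.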
Verification
- go build ./core/... ./pkg/... ./cmd/... ./internal/... ✅
- go test ./pkg/model/ ✅ (114 specs, incl. new regression test)
- Regression test: the watchdog busy-killer uses ShutdownModelForce (not the graceful ShutdownModel)
- gofmt clean

Files
- pkg/model/process.go: deleteProcess(s, force bool); skip the IsBusy() wait and Free() on force; stop the process first.
- pkg/model/loader.go: add ShutdownModelForce; thread force=false through existing deleteProcess callers.
- pkg/model/watchdog.go: ProcessManager gains ShutdownModelForce; evictionTarget carries busy state; checkBusy/LRU/group/memory busy evictions use the force path.
- pkg/model/watchdog_test.go: mock PM implements ShutdownModelForce + regression spec.
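The shape of the regression check can be illustrated with a recording mock. This is only a sketch: the method signatures on ProcessManager, the busyEntry type, and this simplified checkBusy are assumptions made for illustration and may not match the real interface or the spec in watchdog_test.go, which the spec count above suggests is written with a BDD-style framework rather than plain Go.

```go
package main

import (
	"fmt"
	"time"
)

// ProcessManager is an assumed shape for the interface the watchdog uses;
// the real method signatures may differ.
type ProcessManager interface {
	ShutdownModel(name string) error
	ShutdownModelForce(name string) error
}

// mockPM records which shutdown path was used for each model.
type mockPM struct {
	graceful []string
	forced   []string
}

func (m *mockPM) ShutdownModel(name string) error {
	m.graceful = append(m.graceful, name)
	return nil
}

func (m *mockPM) ShutdownModelForce(name string) error {
	m.forced = append(m.forced, name)
	return nil
}

// busyEntry is a hypothetical record of when a model became busy.
type busyEntry struct {
	name  string
	since time.Time
}

// checkBusy is a simplified busy-killer: any model busy for longer than
// timeout is shut down through the force path.
func checkBusy(pm ProcessManager, busy []busyEntry, timeout time.Duration, now time.Time) {
	for _, e := range busy {
		if now.Sub(e.since) > timeout {
			_ = pm.ShutdownModelForce(e.name)
		}
	}
}

// runRegression exercises checkBusy with one model past the timeout and
// one still within it, and returns the mock for inspection.
func runRegression() *mockPM {
	now := time.Now()
	pm := &mockPM{}
	checkBusy(pm, []busyEntry{
		{name: "stuck", since: now.Add(-10 * time.Minute)},
		{name: "recent", since: now.Add(-1 * time.Second)},
	}, 5*time.Minute, now)
	return pm
}

func main() {
	pm := runRegression()
	fmt.Println("forced:", pm.forced, "graceful:", pm.graceful)
}
```

Only the model past the timeout should appear in the forced list, and the graceful list should stay empty. That is the property the regression spec is described as guarding.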