fix(watchdog): force-kill stuck-busy backends instead of deadlocking the loader #10578

Open
nandanadileep wants to merge 1 commit into mudler:master from nandanadileep:fix/watchdog-realtime-stuck-busy

@nandanadileep

Problem

With the watchdog enabled, the first realtime (Talk) WebRTC session works, but the second connection hangs at "Connected, waiting for session...". The logs keep printing the watchdog's busy / "active connection" line (and a VAD-already-loaded message) forever, and no further realtime session starts.

Fixes #10391

Root cause

When the watchdog's busy-killer decides a backend has been busy past the busy timeout, it shuts it down via ModelLoader.ShutdownModel → deleteProcess. That path grabs ml.mu and then waits for IsBusy() to clear BEFORE stopping the process:

func (ml *ModelLoader) deleteProcess(s string) error {
	...
	for model.GRPC(false, ml.wd).IsBusy() {   // blocks while ml.mu is held
		time.Sleep(...)
	}
	...
	process.Stop()
}

But a backend that exceeds the busy timeout is, by definition, stuck on an in-flight gRPC call, so the graceful wait never returns. ml.mu is held forever, and every subsequent ml.Load blocks, including the shared opus backend load that runs at the start of every realtime (WebRTC) session (RealtimeCalls → application.ModelLoader().Load(...)). New realtime connections therefore hang at "Connected, waiting for session..." until the server is restarted. The repeated "vad already loaded" log line is the watchdog re-logging the busy model that nothing can touch while ml.mu is wedged.
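The lock-then-wait ordering described above can be reproduced in isolation. The following is a minimal sketch, not the LocalAI code: `toyLoader`, `toyBackend`, `gracefulShutdown`, and `loadWouldBlock` are hypothetical names, and the lock is probed with `TryLock` so the demonstration terminates instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// toyBackend is a hypothetical stand-in for a backend process;
// busy plays the role of IsBusy().
type toyBackend struct {
	busy    atomic.Bool
	stopped atomic.Bool
}

// toyLoader is a hypothetical stand-in for the loader; mu plays the role of ml.mu.
type toyLoader struct {
	mu sync.Mutex
	b  *toyBackend
}

// gracefulShutdown mirrors the problematic ordering: take the lock,
// wait for the backend to go idle, and only then stop it.
func (l *toyLoader) gracefulShutdown() {
	l.mu.Lock()
	defer l.mu.Unlock()
	for l.b.busy.Load() {
		time.Sleep(time.Millisecond)
	}
	l.b.stopped.Store(true)
}

// loadWouldBlock reports whether a Load-style caller would currently
// fail to acquire the loader lock.
func (l *toyLoader) loadWouldBlock() bool {
	if l.mu.TryLock() {
		l.mu.Unlock()
		return false
	}
	return true
}

// demoDeadlock returns whether a load was blocked while the backend was
// stuck busy, and whether it was still blocked after the backend went idle.
func demoDeadlock() (blockedWhileBusy, blockedAfterRelease bool) {
	l := &toyLoader{b: &toyBackend{}}
	l.b.busy.Store(true) // simulate an in-flight call that never completes

	done := make(chan struct{})
	go func() {
		l.gracefulShutdown()
		close(done)
	}()

	// Wait until the shutdown goroutine holds the lock.
	for !l.loadWouldBlock() {
		time.Sleep(time.Millisecond)
	}
	blockedWhileBusy = l.loadWouldBlock()

	// Only an external event (here: clearing busy by hand) releases the lock.
	l.b.busy.Store(false)
	<-done
	blockedAfterRelease = l.loadWouldBlock()
	return blockedWhileBusy, blockedAfterRelease
}

func main() {
	busy, after := demoDeadlock()
	fmt.Println("load blocked while backend busy:", busy)
	fmt.Println("load blocked after backend idle:", after)
}
```

In the sketch the busy flag is cleared by hand; in the scenario described above nothing clears it, so the lock is never released.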

Fix

Add a force shutdown path (ShutdownModelForce / deleteProcess(force=true)) that stops the process FIRST — dropping the stuck call's gRPC connection and unblocking it — instead of waiting on it, and skips the Free() RPC (a stuck-busy backend won't answer it; stopping the process releases its VRAM anyway).

Route the watchdog's in-place evictions through the force path whenever the target backend is busy:

  • checkBusy (busy-killer) → ShutdownModelForce
  • EnforceLRULimit / EnforceGroupExclusivity → force for busy targets, graceful for idle ones
  • evictLRUModel (memory reclaimer) → force for busy targets

Graceful behaviour for idle timeouts and user-initiated unloads is unchanged.

Verification

  • go build ./core/... ./pkg/... ./cmd/... ./internal/...
  • go test ./pkg/model/ ✅ (114 specs, incl. new regression test)
  • new regression test: the watchdog busy-killer uses ShutdownModelForce (not the graceful ShutdownModel)
  • gofmt clean

Files

  • pkg/model/process.go: deleteProcess(s, force bool); skip the IsBusy() wait and Free() on force; stop the process first.
  • pkg/model/loader.go: add ShutdownModelForce; thread force=false through existing deleteProcess callers.
  • pkg/model/watchdog.go: ProcessManager gains ShutdownModelForce; evictionTarget carries busy state; checkBusy/LRU/group/memory busy evictions use the force path.
  • pkg/model/watchdog_test.go: mock PM implements ShutdownModelForce + regression spec.

…the loader

When the watchdog's busy-killer decides a backend has been busy past the
busy timeout, it shuts it down via ModelLoader.ShutdownModel -> deleteProcess,
which grabs ml.mu and then waits for IsBusy() to clear BEFORE stopping the
process. But a backend that exceeds the busy timeout is, by definition,
stuck on an in-flight gRPC call, so the graceful wait never returns, ml.mu
is held forever, and every other ml.Load blocks — including the shared
opus backend load at the start of every realtime (WebRTC) session. New
realtime connections then hang at "Connected, waiting for session..."
whenever the watchdog is enabled, while logs repeatedly print the
watchdog's busy / "active connection" line.

Fix: add a force shutdown path (ShutdownModelForce / deleteProcess(s,
force=true)) that stops the process FIRST — dropping the stuck call's
gRPC connection and unblocking it — instead of waiting on it. Route the
watchdog's busy-killer and busy LRU / group / memory evictions through
the force path; keep the graceful wait for idle and user-initiated
unloads. Graceful/unforced kills are unchanged.

Regression test: the watchdog busy-killer uses ShutdownModelForce.

Fixes mudler#10391
@mudler mudler requested a review from richiejp June 28, 2026 17:09
@richiejp

Collaborator

I wonder why this would affect realtime specifically?

Development

Successfully merging this pull request may close these issues.

When watchdog enabled, realtime (talk in the web app) works once, and then fails the second connection waiting for a session

2 participants