fix(uninstall): kill the Ollama auth proxy on uninstall#3167
Conversation
The Local Ollama onboard path spawns a detached Node child
(`scripts/ollama-auth-proxy.js`) listening on `:11435`, persists its PID
to `~/.nemoclaw/ollama-auth-proxy.pid`, and relies on `killStaleProxy()`
inside onboard to clean up before each fresh start. But uninstall never
touched the proxy: the PID file was wiped along with `~/.nemoclaw`, the
process kept running, and the next `curl … nemoclaw.sh | bash` reinstall
hit `Ollama auth proxy failed to start on :11435` because
`killStaleProxy()` could not always find the orphan (cmdline drift across
versions, missing `lsof`, root-vs-user listener ownership).
Add a `stop-ollama-auth-proxy` action to the "Stopping services" step.
The handler mirrors `killStaleProxy()` semantics rather than running a
blanket `pgrep -f ollama-auth-proxy.js`:
1. Read the persisted PID file at `~/.nemoclaw/ollama-auth-proxy.pid`,
verify the PID's cmdline contains `ollama-auth-proxy.js`, then kill.
2. Fall back to `lsof -ti :11435` for orphans whose PID file is gone,
filtering each candidate through the same cmdline check.
Both branches gate every kill on the cmdline match, so an unrelated
process bound to `:11435` (custom proxy, foreign service, another user's
session under `sudo uninstall`) is never touched. Running this in step 1
ensures the proxy is gone before the State step removes the PID file.
Co-located plan-shape and handler tests cover: PID-file path,
lsof-orphan path, foreign-process-on-port path, and no-proxy-running.
Closes #2759
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…all-kill-ollama-proxy
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/uninstall/run-plan.ts`:
- Around line 282-285: The current branch in run-plan.ts calls
runtime.commandExists("lsof") and, if false, logs "No Ollama auth proxy
processes found" which is incorrect when lsof is simply unavailable; update the
block around runtime.commandExists("lsof") to avoid emitting a success-looking
message when lsof is missing: instead log a warning that orphaned-PID detection
via lsof was skipped (include context like "lsof not available; skipped proxy
process scan") and as an improvement add a fallback probe using "ss" or "fuser"
(check runtime.commandExists("ss")/runtime.commandExists("fuser") and run the
secondary check to populate stopped) so you only report "No Ollama auth proxy
processes found" when all detection methods ran and found none.
- Around line 242-248: The function tryStopOllamaProxyPid incorrectly logs
success immediately after runtime.kill returns true; change it to send a
graceful signal first (runtime.kill(pid) / SIGTERM), then poll/verify that the
process no longer exists (e.g., check existence via a runtime.processExists or
by attempting a zero signal) with short retries and a timeout, and only if the
process is still alive after the retries send SIGKILL (runtime.kill(pid,
"SIGKILL")), re-check until confirmed dead, then call runtime.log(`Stopped
Ollama auth proxy ${pid}`) only when the PID is proven gone; if it cannot be
confirmed dead after SIGKILL, call runtime.warn(`Failed to stop Ollama auth
proxy ${pid}`) instead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4cc3d11c-045a-4fa6-9383-dc33aedbf250
📒 Files selected for processing (2)
src/lib/actions/uninstall/run-plan.test.tssrc/lib/actions/uninstall/run-plan.ts
`runtime.kill()` only confirms the signal was sent; the proxy may ignore SIGTERM, take time to clean up, or linger as a zombie. The previous code logged "Stopped" the moment kill() returned true, so uninstall could claim success while :11435 stayed bound — recreating the very failure mode this cleanup was added to fix. Send SIGTERM, poll the PID with signal 0 for up to 1s, escalate to SIGKILL and re-poll if needed, and only log success when the PID is proven gone. Otherwise warn the operator. Also rewrite the lsof-missing branch to warn so a missing tool is not mistaken for "no proxy running". Per #3167 review feedback. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/actions/uninstall/run-plan.ts (1)
29-42:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winAdd
warntoUninstallRunDepsinterface and wire it inbuildRuntime.
warnis missing from theUninstallRunDepstype contract but is expected by callers and tests. This causes type mismatches when passingwarntorunUninstallPlan. The internalUninstallRuntimeinterface correctly declareswarn, and theremoveFileWithOptionalSudofunction uses it, butbuildRuntimeat line 169 only falls back todeps.errorinstead of checkingdeps.warnfirst.Suggested fix
export interface UninstallRunDeps { commandExists?: (command: string) => boolean; env?: NodeJS.ProcessEnv; error?: (message: string) => void; existsSync?: (target: string) => boolean; fs?: FileSystemDeps; isTty?: boolean; kill?: (pid: number, signal?: NodeJS.Signals | number) => boolean; log?: (message: string) => void; + warn?: (message: string) => void; readLine?: () => string | null; rmSync?: typeof fs.rmSync; run?: (command: string, args: string[], options?: SpawnSyncOptions) => RunResult; runDocker?: (args: string[], options?: SpawnSyncOptions) => RunResult; } @@ - warn: deps.error ?? ((message) => console.warn(message)), + warn: deps.warn ?? deps.error ?? ((message) => console.warn(message)), }; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/actions/uninstall/run-plan.ts` around lines 29 - 42, UninstallRunDeps is missing a warn?: (message: string) => void property while callers and tests expect it; update the UninstallRunDeps interface to include warn, and in buildRuntime (the function that constructs UninstallRuntime) wire deps.warn into the runtime (fall back to deps.error only if deps.warn is undefined) so removeFileWithOptionalSudo and runUninstallPlan receive the correct warn handler; ensure the runtime property name matches UninstallRuntime.warn and any usages call runtime.warn.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/uninstall/run-plan.test.ts`:
- Around line 257-260: The inline kill stub pushes a value that can be
number|NodeJS.Signals into a NodeJS.Signals[]; change the guard to narrow to
string signals (e.g., check typeof signal === 'string' or signal is in known
signal set) before pushing into the signals array in the kill closure so only
NodeJS.Signals are added; and for the logging callback mismatch, stop using the
non-existent warn property on UninstallRunDeps—either use the existing error
property (replace calls to deps.warn with deps.error) or add warn to the
UninstallRunDeps interface if a separate warn callback was intended.
---
Outside diff comments:
In `@src/lib/actions/uninstall/run-plan.ts`:
- Around line 29-42: UninstallRunDeps is missing a warn?: (message: string) =>
void property while callers and tests expect it; update the UninstallRunDeps
interface to include warn, and in buildRuntime (the function that constructs
UninstallRuntime) wire deps.warn into the runtime (fall back to deps.error only
if deps.warn is undefined) so removeFileWithOptionalSudo and runUninstallPlan
receive the correct warn handler; ensure the runtime property name matches
UninstallRuntime.warn and any usages call runtime.warn.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: aa9015a6-46a6-4453-8f46-4716211475da
📒 Files selected for processing (2)
src/lib/actions/uninstall/run-plan.test.tssrc/lib/actions/uninstall/run-plan.ts
Two TS2345/TS2353 errors landed in CI on the new tests: - `signals.push(signal)` typed as `NodeJS.Signals[]` rejected the `number | NodeJS.Signals` parameter; widen to `Array<NodeJS.Signals | number>` and switch the guard to `signal !== undefined` so the signal-0 probe is filtered cleanly. - `warn:` is not a key on `UninstallRunDeps`; the runtime maps `deps.error` to `runtime.warn` (run-plan.ts:168). Drop the duplicate `warn:` overrides and rely on `error:` alone. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
CodeRabbit flagged the broader `signal !== undefined` guard: pushing a `number | NodeJS.Signals` into `Array<NodeJS.Signals | number>` works but conflates the signal-0 existence probe with real terminating signals. Tighten the array to `NodeJS.Signals[]` and use `typeof signal === "string"` so only SIGTERM/SIGKILL land in the recorded list. Per #3167 review feedback. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/uninstall/run-plan.ts`:
- Around line 255-263: waitForPidExit currently treats runtime.kill(pid, 0) ===
false as "gone" but buildRuntime().kill collapses errors to false so EPERM
(process exists but not permitted) is indistinguishable from ESRCH; update
waitForPidExit to call a runtime method that returns detailed result (or
throw/catch the underlying error) and only consider the PID exited when the
probe returns ESRCH/NOT_FOUND; treat EPERM as still alive and keep waiting (or
return failure after timeout). Also tighten the fallback that uses lsof
(referenced in the same flow around lines mentioned) so it filters candidates to
the current effective UID/owner before attempting to stop or reporting
"Stopped". Ensure references: waitForPidExit, runtime.kill (or replace with
runtime.probePid / runtime.killWithErr), and the lsof-based fallback candidate
selection are updated accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5303098d-3aa3-46ef-8b06-d8bf516a5054
📒 Files selected for processing (2)
src/lib/actions/uninstall/run-plan.test.tssrc/lib/actions/uninstall/run-plan.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/actions/uninstall/run-plan.test.ts
…configured port Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
## Summary - Bump the docs release metadata to `0.0.38`. - Document release-prep updates for status policy versions, Local Ollama validation and cleanup, blueprint policy additions, rebuild backup handling, and NemoHermes uninstall branding. - Refresh generated `nemoclaw-user-*` skills from the updated docs. ## Source summary - #3185 -> `docs/reference/commands.md`: Documents that `nemoclaw <name> status` displays the gateway active policy version when OpenShell reports one. - #3167 -> `docs/reference/commands.md`, `docs/inference/use-local-inference.md`: Documents uninstall cleanup for matching Local Ollama auth proxy processes. - #2737 -> `docs/inference/use-local-inference.md`, `docs/network-policy/customize-network-policy.md`, `docs/manage-sandboxes/lifecycle.md`, `docs/reference/commands.md`: Documents stricter Local Ollama tool-call validation, blueprint policy additions, and partial rebuild backup handling. - #3220 -> `docs/reference/commands.md`: Documents NemoHermes-specific uninstall progress and completion text. - #3158 -> `.agents/skills/nemoclaw-user-configure-inference/*`: Refreshes generated user skills from existing `docs/inference/switch-inference-providers.md` heartbeat documentation. - #3199 -> `.agents/skills/nemoclaw-user-get-started/SKILL.md`: Refreshes generated user skills from existing `docs/get-started/quickstart.md` Model Router wording. ## Skipped - #3272 and #3268 were already documented by their merged docs updates on `main`. - #3154, #3216, #3166, and #3195 have no additional user-facing docs impact for this release-prep pass. - No commits matched `docs/.docs-skip`. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `npm run build:cli` - Commit and pre-push hooks: markdownlint, docs-to-skills verification, gitleaks, commitlint, skills YAML tests, CLI typecheck Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Behavior Changes** * Rebuild now safely handles partial backups, preserving successfully captured entries while reporting only unarchived paths * Uninstall for Local Ollama setups now stops proxy processes before cleanup * Local Ollama models require stricter tool-call response validation during onboarding * Blueprint policy additions enable custom network policy extensions via `components.policy.additions` * New `NEMOCLAW_AGENT_HEARTBEAT_EVERY` configuration controls agent periodic task frequency * Status display now shows active policy version when available <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
nemoclaw uninstalldid not stop the detached Ollama auth proxy started by the Local Ollama onboard path. The PID file at~/.nemoclaw/ollama-auth-proxy.pidwas wiped along with the rest of~/.nemoclaw, but the proxy kept listening on:11435. The nextcurl … nemoclaw.sh | bashreinstall then hitOllama auth proxy failed to start on :11435because the existing in-onboardkillStaleProxy()could not always find the orphan (cmdline drift across versions, missinglsof, root-vs-user listener ownership). Add a targetedstop-ollama-auth-proxyaction to the uninstall "Stopping services" step that mirrorskillStaleProxy()semantics.Related Issue
Closes #2759
Changes
stop-ollama-auth-proxyin the "Stopping services" step (src/lib/domain/uninstall/plan.ts). Order matters: it runs before the "State and binaries" step deletes the PID file.stopOllamaAuthProxy(paths, runtime)in src/lib/actions/uninstall-run-plan.ts:~/.nemoclaw/ollama-auth-proxy.pid(mirrorsPROXY_PID_PATHinsrc/lib/onboard-ollama-proxy.ts).ollama-auth-proxy.jsviaps -p <pid> -o args=. If it does, send SIGTERM (escalating to SIGKILL).lsof -ti :11435for orphans whose PID file is gone, filtering each candidate through the same cmdline check.:11435(custom proxy, foreign service, another user's session undersudo uninstall) is never touched.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests