Skip to content

fix(uninstall): kill the Ollama auth proxy on uninstall#3167

Merged
ericksoa merged 8 commits into
mainfrom
fix/issue-2759-uninstall-kill-ollama-proxy
May 8, 2026
Merged

fix(uninstall): kill the Ollama auth proxy on uninstall#3167
ericksoa merged 8 commits into
mainfrom
fix/issue-2759-uninstall-kill-ollama-proxy

Conversation

@laitingsheng
Copy link
Copy Markdown
Contributor

@laitingsheng laitingsheng commented May 7, 2026

Summary

nemoclaw uninstall did not stop the detached Ollama auth proxy started by the Local Ollama onboard path. The PID file at ~/.nemoclaw/ollama-auth-proxy.pid was wiped along with the rest of ~/.nemoclaw, but the proxy kept listening on :11435. The next curl … nemoclaw.sh | bash reinstall then hit Ollama auth proxy failed to start on :11435 because the existing in-onboard killStaleProxy() could not always find the orphan (cmdline drift across versions, missing lsof, root-vs-user listener ownership). Add a targeted stop-ollama-auth-proxy action to the uninstall "Stopping services" step that mirrors killStaleProxy() semantics.

Related Issue

Closes #2759

Changes

  • New plan action stop-ollama-auth-proxy in the "Stopping services" step (src/lib/domain/uninstall/plan.ts). Order matters: it runs before the "State and binaries" step deletes the PID file.
  • Handler stopOllamaAuthProxy(paths, runtime) in src/lib/actions/uninstall-run-plan.ts:
    1. Read the persisted PID file at ~/.nemoclaw/ollama-auth-proxy.pid (mirrors PROXY_PID_PATH in src/lib/onboard-ollama-proxy.ts).
    2. Verify the candidate PID's cmdline contains ollama-auth-proxy.js via ps -p <pid> -o args=. If it does, send SIGTERM (escalating to SIGKILL).
    3. Fall back to lsof -ti :11435 for orphans whose PID file is gone, filtering each candidate through the same cmdline check.
    4. Both branches gate every kill on the cmdline match, so an unrelated process bound to :11435 (custom proxy, foreign service, another user's session under sudo uninstall) is never touched.
  • Co-located handler tests cover the PID-file branch, lsof-orphan branch, foreign-process-on-port branch, and the no-proxy-running case. Plan-shape test asserts the new action is in the "Stopping services" step (so order can't drift).

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Uninstall now reliably stops the Ollama auth proxy before cleaning state, with clear logging, fallback discovery when needed, and escalation if the proxy doesn’t respond—preventing orphaned processes or race conditions during uninstall.
  • Tests

    • Added comprehensive tests covering proxy shutdown, fallback discovery, validation safeguards, escalation behavior, and related success/failure logging.

The Local Ollama onboard path spawns a detached Node child
(`scripts/ollama-auth-proxy.js`) listening on `:11435`, persists its PID
to `~/.nemoclaw/ollama-auth-proxy.pid`, and relies on `killStaleProxy()`
inside onboard to clean up before each fresh start. But uninstall never
touched the proxy: the PID file was wiped along with `~/.nemoclaw`, the
process kept running, and the next `curl … nemoclaw.sh | bash` reinstall
hit `Ollama auth proxy failed to start on :11435` because
`killStaleProxy()` could not always find the orphan (cmdline drift across
versions, missing `lsof`, root-vs-user listener ownership).

Add a `stop-ollama-auth-proxy` action to the "Stopping services" step.
The handler mirrors `killStaleProxy()` semantics rather than running a
blanket `pgrep -f ollama-auth-proxy.js`:

  1. Read the persisted PID file at `~/.nemoclaw/ollama-auth-proxy.pid`,
     verify the PID's cmdline contains `ollama-auth-proxy.js`, then kill.
  2. Fall back to `lsof -ti :11435` for orphans whose PID file is gone,
     filtering each candidate through the same cmdline check.

Both branches gate every kill on the cmdline match, so an unrelated
process bound to `:11435` (custom proxy, foreign service, another user's
session under `sudo uninstall`) is never touched. Running this in step 1
ensures the proxy is gone before the State step removes the PID file.

Co-located plan-shape and handler tests cover: PID-file path,
lsof-orphan path, foreign-process-on-port path, and no-proxy-running.

Closes #2759

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e62e6b0a-d047-4911-9a64-e548950df35f

📥 Commits

Reviewing files that changed from the base of the PR and between a03cd3c and 00c9b6a.

📒 Files selected for processing (2)
  • src/lib/actions/uninstall/run-plan.test.ts
  • src/lib/actions/uninstall/run-plan.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/actions/uninstall/run-plan.ts
  • src/lib/actions/uninstall/run-plan.test.ts

📝 Walkthrough

Walkthrough

Adds stopping of the Ollama auth proxy to uninstall. The plan includes a stop-ollama-auth-proxy action; implementation stops a persisted PID (or falls back to lsof :11435), validates cmdline, uses SIGTERM→SIGKILL escalation, and integrates the call into executePlan. Tests cover success and edge cases.

Changes

Ollama Auth Proxy Shutdown in Uninstall Flow

Layer / File(s) Summary
Plan Definition
src/lib/domain/uninstall/plan.ts, src/lib/domain/uninstall/plan.test.ts
Uninstall plan now includes { kind: "stop-ollama-auth-proxy" } in "Stopping services"; tests assert presence and ordering before state/binaries cleanup.
Stop Action Implementation
src/lib/actions/uninstall/run-plan.ts
Adds stopOllamaAuthProxy with cmdline validation (ps -p <pid> -o args=), two-phase termination (SIGTERM then SIGKILL with wait via sleepMs), persisted PID-file check under NemoClaw state dir, and lsof -ti :11435 fallback that filters by cmdline and owner.
Execution Integration
src/lib/actions/uninstall/run-plan.ts
Calls stopOllamaAuthProxy(paths, runtime) from executePlan during the "Stopping services" step before other cleanup actions.
Test Coverage
src/lib/actions/uninstall/run-plan.test.ts
Adds fs import and tests for: persisted PID stop, lsof fallback stop, cmdline-mismatch safety, SIGTERM→SIGKILL escalation, lsof-unavailable warning, and no-candidate success; uses temporary HOME/PID fixtures.

Sequence Diagram

sequenceDiagram
  participant Executor
  participant stopOllamaAuthProxy
  participant PS
  participant LSOF
  Executor->>stopOllamaAuthProxy: request stop during "Stopping services"
  stopOllamaAuthProxy->>PS: ps -p <pid> -o args= / ps checks
  stopOllamaAuthProxy->>LSOF: lsof -ti :<port> (fallback)
  stopOllamaAuthProxy-->>Executor: log success / warn / no-op
Loading

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hop-hop, the proxy must go,
Before we scrub the state below.
PID or lsof, we check the line,
SIGTERM waits — SIGKILL when due — fine.
Now uninstall hops on, calm and bright.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: stopping the Ollama auth proxy during uninstall, which directly addresses the core problem in the PR objectives.
Linked Issues check ✅ Passed The PR fully implements the objectives from issue #2759: stops the Ollama auth proxy using multiple strategies (PID file, lsof fallback), validates process identity before killing, prevents unrelated process termination, and includes comprehensive test coverage.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing Ollama auth proxy shutdown: test cases for the new behavior, plan action definition, and the handler implementation with no unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue-2759-uninstall-kill-ollama-proxy

Comment @coderabbitai help to get the list of available commands and usage tips.

@laitingsheng laitingsheng added the v0.0.37 Release target label May 7, 2026
@jyaunches jyaunches self-assigned this May 7, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/uninstall/run-plan.ts`:
- Around line 282-285: The current branch in run-plan.ts calls
runtime.commandExists("lsof") and, if false, logs "No Ollama auth proxy
processes found" which is incorrect when lsof is simply unavailable; update the
block around runtime.commandExists("lsof") to avoid emitting a success-looking
message when lsof is missing: instead log a warning that orphaned-PID detection
via lsof was skipped (include context like "lsof not available; skipped proxy
process scan") and as an improvement add a fallback probe using "ss" or "fuser"
(check runtime.commandExists("ss")/runtime.commandExists("fuser") and run the
secondary check to populate stopped) so you only report "No Ollama auth proxy
processes found" when all detection methods ran and found none.
- Around line 242-248: The function tryStopOllamaProxyPid incorrectly logs
success immediately after runtime.kill returns true; change it to send a
graceful signal first (runtime.kill(pid) / SIGTERM), then poll/verify that the
process no longer exists (e.g., check existence via a runtime.processExists or
by attempting a zero signal) with short retries and a timeout, and only if the
process is still alive after the retries send SIGKILL (runtime.kill(pid,
"SIGKILL")), re-check until confirmed dead, then call runtime.log(`Stopped
Ollama auth proxy ${pid}`) only when the PID is proven gone; if it cannot be
confirmed dead after SIGKILL, call runtime.warn(`Failed to stop Ollama auth
proxy ${pid}`) instead.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4cc3d11c-045a-4fa6-9383-dc33aedbf250

📥 Commits

Reviewing files that changed from the base of the PR and between f4bbd71 and 9a872b1.

📒 Files selected for processing (2)
  • src/lib/actions/uninstall/run-plan.test.ts
  • src/lib/actions/uninstall/run-plan.ts

Comment thread src/lib/actions/uninstall/run-plan.ts
Comment thread src/lib/actions/uninstall/run-plan.ts
`runtime.kill()` only confirms the signal was sent; the proxy may
ignore SIGTERM, take time to clean up, or linger as a zombie. The
previous code logged "Stopped" the moment kill() returned true, so
uninstall could claim success while :11435 stayed bound — recreating
the very failure mode this cleanup was added to fix.

Send SIGTERM, poll the PID with signal 0 for up to 1s, escalate to
SIGKILL and re-poll if needed, and only log success when the PID is
proven gone. Otherwise warn the operator. Also rewrite the
lsof-missing branch to warn so a missing tool is not mistaken for
"no proxy running".

Per #3167 review feedback.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/actions/uninstall/run-plan.ts (1)

29-42: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add warn to UninstallRunDeps interface and wire it in buildRuntime.

warn is missing from the UninstallRunDeps type contract but is expected by callers and tests. This causes type mismatches when passing warn to runUninstallPlan. The internal UninstallRuntime interface correctly declares warn, and the removeFileWithOptionalSudo function uses it, but buildRuntime at line 169 only falls back to deps.error instead of checking deps.warn first.

Suggested fix
 export interface UninstallRunDeps {
   commandExists?: (command: string) => boolean;
   env?: NodeJS.ProcessEnv;
   error?: (message: string) => void;
   existsSync?: (target: string) => boolean;
   fs?: FileSystemDeps;
   isTty?: boolean;
   kill?: (pid: number, signal?: NodeJS.Signals | number) => boolean;
   log?: (message: string) => void;
+  warn?: (message: string) => void;
   readLine?: () => string | null;
   rmSync?: typeof fs.rmSync;
   run?: (command: string, args: string[], options?: SpawnSyncOptions) => RunResult;
   runDocker?: (args: string[], options?: SpawnSyncOptions) => RunResult;
 }
@@
-    warn: deps.error ?? ((message) => console.warn(message)),
+    warn: deps.warn ?? deps.error ?? ((message) => console.warn(message)),
   };
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/actions/uninstall/run-plan.ts` around lines 29 - 42, UninstallRunDeps
is missing a warn?: (message: string) => void property while callers and tests
expect it; update the UninstallRunDeps interface to include warn, and in
buildRuntime (the function that constructs UninstallRuntime) wire deps.warn into
the runtime (fall back to deps.error only if deps.warn is undefined) so
removeFileWithOptionalSudo and runUninstallPlan receive the correct warn
handler; ensure the runtime property name matches UninstallRuntime.warn and any
usages call runtime.warn.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/uninstall/run-plan.test.ts`:
- Around line 257-260: The inline kill stub pushes a value that can be
number|NodeJS.Signals into a NodeJS.Signals[]; change the guard to narrow to
string signals (e.g., check typeof signal === 'string' or signal is in known
signal set) before pushing into the signals array in the kill closure so only
NodeJS.Signals are added; and for the logging callback mismatch, stop using the
non-existent warn property on UninstallRunDeps—either use the existing error
property (replace calls to deps.warn with deps.error) or add warn to the
UninstallRunDeps interface if a separate warn callback was intended.

---

Outside diff comments:
In `@src/lib/actions/uninstall/run-plan.ts`:
- Around line 29-42: UninstallRunDeps is missing a warn?: (message: string) =>
void property while callers and tests expect it; update the UninstallRunDeps
interface to include warn, and in buildRuntime (the function that constructs
UninstallRuntime) wire deps.warn into the runtime (fall back to deps.error only
if deps.warn is undefined) so removeFileWithOptionalSudo and runUninstallPlan
receive the correct warn handler; ensure the runtime property name matches
UninstallRuntime.warn and any usages call runtime.warn.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: aa9015a6-46a6-4453-8f46-4716211475da

📥 Commits

Reviewing files that changed from the base of the PR and between 9a872b1 and 42f5ee8.

📒 Files selected for processing (2)
  • src/lib/actions/uninstall/run-plan.test.ts
  • src/lib/actions/uninstall/run-plan.ts

Comment thread src/lib/actions/uninstall/run-plan.test.ts
Two TS2345/TS2353 errors landed in CI on the new tests:

- `signals.push(signal)` typed as `NodeJS.Signals[]` rejected the
  `number | NodeJS.Signals` parameter; widen to
  `Array<NodeJS.Signals | number>` and switch the guard to
  `signal !== undefined` so the signal-0 probe is filtered cleanly.
- `warn:` is not a key on `UninstallRunDeps`; the runtime maps
  `deps.error` to `runtime.warn` (run-plan.ts:168). Drop the
  duplicate `warn:` overrides and rely on `error:` alone.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
CodeRabbit flagged the broader `signal !== undefined` guard: pushing a
`number | NodeJS.Signals` into `Array<NodeJS.Signals | number>` works
but conflates the signal-0 existence probe with real terminating
signals. Tighten the array to `NodeJS.Signals[]` and use
`typeof signal === "string"` so only SIGTERM/SIGKILL land in the
recorded list.

Per #3167 review feedback.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@cv cv added v0.0.38 Release target and removed v0.0.37 Release target labels May 8, 2026
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/uninstall/run-plan.ts`:
- Around line 255-263: waitForPidExit currently treats runtime.kill(pid, 0) ===
false as "gone" but buildRuntime().kill collapses errors to false so EPERM
(process exists but not permitted) is indistinguishable from ESRCH; update
waitForPidExit to call a runtime method that returns detailed result (or
throw/catch the underlying error) and only consider the PID exited when the
probe returns ESRCH/NOT_FOUND; treat EPERM as still alive and keep waiting (or
return failure after timeout). Also tighten the fallback that uses lsof
(referenced in the same flow around lines mentioned) so it filters candidates to
the current effective UID/owner before attempting to stop or reporting
"Stopped". Ensure references: waitForPidExit, runtime.kill (or replace with
runtime.probePid / runtime.killWithErr), and the lsof-based fallback candidate
selection are updated accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5303098d-3aa3-46ef-8b06-d8bf516a5054

📥 Commits

Reviewing files that changed from the base of the PR and between a03cd3c and 6542e0f.

📒 Files selected for processing (2)
  • src/lib/actions/uninstall/run-plan.test.ts
  • src/lib/actions/uninstall/run-plan.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/actions/uninstall/run-plan.test.ts

Comment thread src/lib/actions/uninstall/run-plan.ts Outdated
…configured port

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@ericksoa ericksoa merged commit 0549d60 into main May 8, 2026
19 checks passed
@ericksoa ericksoa deleted the fix/issue-2759-uninstall-kill-ollama-proxy branch May 8, 2026 18:48
miyoungc added a commit that referenced this pull request May 9, 2026
## Summary
- Bump the docs release metadata to `0.0.38`.
- Document release-prep updates for status policy versions, Local Ollama
validation and cleanup, blueprint policy additions, rebuild backup
handling, and NemoHermes uninstall branding.
- Refresh generated `nemoclaw-user-*` skills from the updated docs.

## Source summary
- #3185 -> `docs/reference/commands.md`: Documents that `nemoclaw <name>
status` displays the gateway active policy version when OpenShell
reports one.
- #3167 -> `docs/reference/commands.md`,
`docs/inference/use-local-inference.md`: Documents uninstall cleanup for
matching Local Ollama auth proxy processes.
- #2737 -> `docs/inference/use-local-inference.md`,
`docs/network-policy/customize-network-policy.md`,
`docs/manage-sandboxes/lifecycle.md`, `docs/reference/commands.md`:
Documents stricter Local Ollama tool-call validation, blueprint policy
additions, and partial rebuild backup handling.
- #3220 -> `docs/reference/commands.md`: Documents NemoHermes-specific
uninstall progress and completion text.
- #3158 -> `.agents/skills/nemoclaw-user-configure-inference/*`:
Refreshes generated user skills from existing
`docs/inference/switch-inference-providers.md` heartbeat documentation.
- #3199 -> `.agents/skills/nemoclaw-user-get-started/SKILL.md`:
Refreshes generated user skills from existing
`docs/get-started/quickstart.md` Model Router wording.

## Skipped
- #3272 and #3268 were already documented by their merged docs updates
on `main`.
- #3154, #3216, #3166, and #3195 have no additional user-facing docs
impact for this release-prep pass.
- No commits matched `docs/.docs-skip`.

## Test plan
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user`
- `make docs`
- `npm run build:cli`
- Commit and pre-push hooks: markdownlint, docs-to-skills verification,
gitleaks, commitlint, skills YAML tests, CLI typecheck


Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Behavior Changes**
* Rebuild now safely handles partial backups, preserving successfully
captured entries while reporting only unarchived paths
* Uninstall for Local Ollama setups now stops proxy processes before
cleanup
* Local Ollama models require stricter tool-call response validation
during onboarding
* Blueprint policy additions enable custom network policy extensions via
`components.policy.additions`
* New `NEMOCLAW_AGENT_HEARTBEAT_EVERY` configuration controls agent
periodic task frequency
  * Status display now shows active policy version when available

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.38 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Nemoclaw installation fails with local ollama

5 participants