Skip to content

fix(agent): recover from panics in per-file review and comment-pool goroutines (#171)#182

Open
eldar702 wants to merge 3 commits into
alibaba:mainfrom
eldar702:fix/recover-panic-review-goroutines
Open

fix(agent): recover from panics in per-file review and comment-pool goroutines (#171)#182
eldar702 wants to merge 3 commits into
alibaba:mainfrom
eldar702:fix/recover-panic-review-goroutines

Conversation

@eldar702

@eldar702 eldar702 commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Problem (CWE-248: Uncaught Exception)

ocr reviews each file in its own goroutine, and the comment worker pool runs
each submitted unit of work in its own goroutine too. Both sites defer only
wg.Done() + semaphore release — neither has a recover():

  • CommentWorkerPool.Submit (internal/agent/agent.go)
  • the per-file review goroutine in dispatchSubtasks (internal/agent/agent.go)

In Go, an unrecovered panic in any goroutine terminates the entire process.
So a panic while reviewing a single file (a malformed diff, a nil deref deep in
a tool/LLM path, etc.) aborts the whole ocr run and discards every other
file's results. This directly defeats the per-file fault isolation the code
already implements for error returns: an error from one file is counted,
recorded as a subtask_error warning, and lets the other files finish — but a
panic from the same code path crashes everything. The isolation is only
half-built.

Fix

Add a recover() defer to both goroutine sites. The crux is defer ordering:
Go runs deferred calls LIFO, so the recover defer is registered last in each
goroutine and therefore runs first on unwind — it recovers before the
semaphore-release and wg.Done() defers run, and all of them still execute.
Result: the semaphore is always drained and the WaitGroup always decremented,
so there is no goroutine/semaphore leak and Await() / wg.Wait() never
deadlocks. The recover defer itself does not re-panic.

  • CommentWorkerPool.Submit — on recover, log the panic to stdout. This
    matches the depth of the existing error handling on that path (a stderr log);
    the pool has no per-task counter to update.
  • dispatchSubtasks — on recover, convert the panic into the existing
    counted warning shape
    : atomic.AddInt64(&a.subtaskFailed, 1) +
    a.recordWarning("subtask_error", d.NewPath, fmt.Sprintf("panic: %v", r)) +
    a stderr line. It reuses the existing subtask_error warning type (no new
    type invented), so panics and errors surface identically downstream and the
    end-of-run all-failed rollup (failed > 0 && failed == dispatched) stays
    correct. The recover defer is registered before the timeout-cancel() defer,
    so cancel() still runs first on unwind (the file context is cancelled before
    the panic is contained) — no ordering hazard.

No new imports (fmt, sync/atomic, and the stdout package were already
imported). The happy path is unchanged; no signatures change.

Tests — genuine RED → GREEN under -race

TestCommentWorkerPool_PanicIsIsolated (new internal/agent/worker_pool_test.go)
submits a panicking closure plus a healthy closure and asserts that Await()
does not crash and returns the healthy result (len == 1).

$ go test -race -run TestCommentWorkerPool_PanicIsIsolated ./internal/agent/...

RED (before the Site A fix) — the panic in one closure crashes the whole
test process:

panic: boom while reviewing EVIL.go

goroutine 6 [running]:
github.com/open-code-review/open-code-review/internal/agent.TestCommentWorkerPool_PanicIsIsolated.func1()
	/tmp/oss-r4/open-code-review/internal/agent/worker_pool_test.go:20 +0x34
github.com/open-code-review/open-code-review/internal/agent.(*CommentWorkerPool).Submit.func1()
	/tmp/oss-r4/open-code-review/internal/agent/agent.go:162 +0xd4
created by github.com/open-code-review/open-code-review/internal/agent.(*CommentWorkerPool).Submit in goroutine 5
	/tmp/oss-r4/open-code-review/internal/agent/agent.go:157 +0xd4
FAIL	github.com/open-code-review/open-code-review/internal/agent	0.376s
FAIL

GREEN (after the fix) — the panic is contained, the healthy closure's result
still arrives, race detector clean:

ok  	github.com/open-code-review/open-code-review/internal/agent	1.526s

Also verified: go build ./... (clean), go vet ./internal/agent/... (clean),
gofmt -l internal/agent/agent.go internal/agent/worker_pool_test.go (empty),
and the full go test -race ./internal/agent/... suite passes.

Test-coverage honesty

Submit takes a caller-supplied closure, so Site A is directly unit-tested
with no new production seam. dispatchSubtasks calls the private
executeSubtask, which needs a fully-wired Agent + LLM client + parsed diffs;
forcing a panic there in a unit test would require adding a net-new production
seam purely for the test. Rather than reshape production code for a test hook,
Site B is covered by the identical recover() idiom + code review (it
converts the panic into the same subtask_error warning path that is already
exercised for error returns). This asymmetry is intentional and called out so
reviewers know Site B is pattern-covered, not unit-covered.

Fixes #171


AI-assisted disclosure: this change was prepared with AI assistance (analysis,
patch, and test authored with an AI coding assistant) and reviewed by the
submitter. The RED→GREEN -race output above is from a real local run on the
checked-out code.


Updates since opening (review follow-ups)

  • 6d00066 — capture debug.Stack() in both recover logs (OpenCodeReview bot suggestion). This adds one import, runtime/debug.
  • b5e2ca3 — Site B panic-recover now calls telemetry.ErrorEvent(ctx, "subtask.panic", …) for telemetry parity with the error path (uses the parent ctx, not the already-cancelled fileCtx); and CommentWorkerPool.Await now documents that a panicking unit of work is recovered+logged but not surfaced in the result.

Correction: the "No new imports" note above is superseded — runtime/debug is now imported (for debug.Stack()). fmt, sync/atomic, stdout, and telemetry were already imported.

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔍 OpenCodeReview found 1 issue(s) in this PR.

  • ✅ 1 posted as inline comment(s)
  • 📝 0 posted as summary

Comment thread internal/agent/agent.go
Comment on lines +165 to +169
defer func() {
if r := recover(); r != nil {
fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n", r)
}
}()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both panic recovery handlers log only %v of the recovered value, which typically yields just the panic message (e.g., "nil pointer dereference") without any stack trace. In concurrent goroutines where panics are being swallowed, this makes production debugging extremely difficult. Consider capturing the stack trace via debug.Stack() from runtime/debug.

Same suggestion applies to the subtask panic handler below.

Suggestion:

Suggested change
defer func() {
if r := recover(); r != nil {
fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n", r)
}
}()
defer func() {
if r := recover(); r != nil {
fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n%s\n", r, debug.Stack())
}
}()

Addresses OpenCodeReview finding on alibaba#182: both panic-recovery handlers logged
only %v of the recovered value (the panic message, no stack), making swallowed
panics in concurrent goroutines hard to debug in production. Capture the stack
via debug.Stack() (runtime/debug) in both the CommentWorkerPool and per-file
subtask recover handlers.

go build/vet clean; internal/agent tests pass.

[AI-assisted]
@eldar702

Copy link
Copy Markdown
Contributor Author

Good catch — addressed in 6d00066: both recover handlers now log debug.Stack() (via runtime/debug) alongside the panic value, so a swallowed goroutine panic carries a full stack for production debugging. go build/go vet are clean and the internal/agent tests pass.

@lizhengfeng101 lizhengfeng101 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #182 Code Review

Changes: 2 files, +52/-0

Overview

This PR adds recover() guards to two goroutine entry points in agent.go:

  1. Site A: CommentWorkerPool.Submit — comment worker pool goroutines
  2. Site B: dispatchSubtasks — per-file review goroutines

The goal is to prevent a panic in a single file review from crashing the entire process (CWE-248), converting panics into warnings consistent with the existing error handling path.

Positives

  • Correct defer ordering: Both recover() defers are registered last (run first in LIFO), ensuring panic recovery happens before semaphore release and wg.Done() — no deadlocks or goroutine leaks.
  • Consistent with existing error path: Site B reuses the subtask_error warning type and subtaskFailed counter, so the all-failed rollup logic remains correct.
  • recordWarning is thread-safe: Confirmed it uses warningsMu mutex protection (agent.go:306-312).
  • Good test quality: The test case clearly validates panic isolation (panicking closure doesn't affect healthy closure). RED→GREEN evidence in PR description is thorough.
  • Excellent PR description: Deep problem analysis, clear fix rationale, honest test-coverage disclosure.

Issues to Address

1. PR description claims "No new imports" — inaccurate

The PR body states "No new imports", but runtime/debug is a newly added import (not previously present in agent.go). Minor inaccuracy, but the description should be precise.

2. Site B is missing telemetry reporting

The existing error path has a telemetry.ErrorEvent call (agent.go:421-422), but the panic recovery path does not. This means panics will appear in warnings and stderr but not in telemetry/tracing.

Suggest adding telemetry in the recover block, using ctx (not fileCtx, since cancel() runs before recover in LIFO order and fileCtx may already be cancelled):

if r := recover(); r != nil {
    atomic.AddInt64(&a.subtaskFailed, 1)
    fmt.Fprintf(stdout.Writer(), "[ocr] Subtask panic for %s: %v\n%s\n", d.NewPath, r, debug.Stack())
    a.recordWarning("subtask_error", d.NewPath, fmt.Sprintf("panic: %v", r))
    telemetry.ErrorEvent(ctx, "subtask.panic", fmt.Errorf("panic: %v", r),
        telemetry.AnyToAttr("file.path", d.NewPath))
}

3. Site A silently swallows panics

CommentWorkerPool.Submit's recover only prints a log. The caller of Await() cannot distinguish between "0 comments because nothing to report" and "0 comments because processing panicked". Consider whether a similar warning/error counting mechanism is needed, or at minimum document this behavior in Await's docstring.

4. What if operations inside recover themselves panic?

If a.recordWarning or debug.Stack() itself panics (extremely rare but theoretically possible), the second panic would not be caught by recover and would still crash the process. A conservative nested recover could guard against this, but this is a very low-priority edge case.

Summary

This is a high-quality defensive fix addressing a real process crash risk. Defer ordering, concurrency safety, and consistency with the existing error path are all handled well. The main suggestion is to add telemetry reporting in Site B so panics and errors have consistent observability. The remaining issues are lower priority and can be addressed as follow-ups.

Recommendation: merge after adding telemetry.

@eldar702

Copy link
Copy Markdown
Contributor Author

Thanks for the careful review — addressed in b5e2ca3:

  1. "No new imports" was inaccurate — correct, runtime/debug was added for debug.Stack(). I've fixed that line in the PR description.
  2. Site B telemetry parity — added, mirroring the error path:
    telemetry.ErrorEvent(ctx, "subtask.panic", fmt.Errorf("panic: %v", r),
        telemetry.AnyToAttr("file.path", d.NewPath))
    Using the parent ctx exactly as you noted — the per-file cancel() defer runs before this recover defer (LIFO), so fileCtx is already cancelled here. Now panics and errors have consistent telemetry/tracing.
  3. Site A silent swallow — documented on CommentWorkerPool.Await's docstring (a panicking unit of work is recovered + logged in Submit but not surfaced as an error or reflected in the count, so callers shouldn't rely on the count alone to detect failures). I kept the pool's behavior/signature unchanged per your "at minimum document" suggestion; happy to add an explicit panic counter if you'd prefer the stronger version.
  4. Nested recover in the handler — agreed it's a very-low-priority edge. recordWarning (a mutex-guarded slice append) and debug.Stack() are panic-free in practice, so I left it out to keep the diff minimal; trivial to add a defensive nested recover if you'd like the belt-and-suspenders.

go build / go vet / gofmt clean; go test -race ./internal/agent/... green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unrecovered panic in a per-file review goroutine crashes the whole run (missing recover, CWE-248)

2 participants