fix(agent): recover from panics in per-file review and comment-pool goroutines (#171) by eldar702 · Pull Request #182 · alibaba/open-code-review

eldar702 · 2026-06-21T17:35:21Z

Problem (CWE-248: Uncaught Exception)

ocr reviews each file in its own goroutine, and the comment worker pool runs
each submitted unit of work in its own goroutine too. Both sites defer only
wg.Done() + semaphore release — neither has a recover():

CommentWorkerPool.Submit (internal/agent/agent.go)
the per-file review goroutine in dispatchSubtasks (internal/agent/agent.go)

In Go, an unrecovered panic in any goroutine terminates the entire process.
So a panic while reviewing a single file (a malformed diff, a nil deref deep in
a tool/LLM path, etc.) aborts the whole ocr run and discards every other
file's results. This directly defeats the per-file fault isolation the code
already implements for error returns: an error from one file is counted,
recorded as a subtask_error warning, and lets the other files finish — but a
panic from the same code path crashes everything. The isolation is only
half-built.

Fix

Add a recover() defer to both goroutine sites. The crux is defer ordering:
Go runs deferred calls LIFO, so the recover defer is registered last in each
goroutine and therefore runs first on unwind — it recovers before the
semaphore-release and wg.Done() defers run, and all of them still execute.
Result: the semaphore is always drained and the WaitGroup always decremented,
so there is no goroutine/semaphore leak and Await() / wg.Wait() never
deadlocks. The recover defer itself does not re-panic.

CommentWorkerPool.Submit — on recover, log the panic to stdout. This
matches the depth of the existing error handling on that path (a stderr log);
the pool has no per-task counter to update.
dispatchSubtasks — on recover, convert the panic into the existing
counted warning shape: atomic.AddInt64(&a.subtaskFailed, 1) +
a.recordWarning("subtask_error", d.NewPath, fmt.Sprintf("panic: %v", r)) +
a stderr line. It reuses the existing subtask_error warning type (no new
type invented), so panics and errors surface identically downstream and the
end-of-run all-failed rollup (failed > 0 && failed == dispatched) stays
correct. The recover defer is registered before the timeout-cancel() defer,
so cancel() still runs first on unwind (the file context is cancelled before
the panic is contained) — no ordering hazard.

No new imports (fmt, sync/atomic, and the stdout package were already
imported). The happy path is unchanged; no signatures change.

Tests — genuine RED → GREEN under `-race`

TestCommentWorkerPool_PanicIsIsolated (new internal/agent/worker_pool_test.go)
submits a panicking closure plus a healthy closure and asserts that Await()
does not crash and returns the healthy result (len == 1).

$ go test -race -run TestCommentWorkerPool_PanicIsIsolated ./internal/agent/...

RED (before the Site A fix) — the panic in one closure crashes the whole
test process:

panic: boom while reviewing EVIL.go

goroutine 6 [running]:
github.com/open-code-review/open-code-review/internal/agent.TestCommentWorkerPool_PanicIsIsolated.func1()
	/tmp/oss-r4/open-code-review/internal/agent/worker_pool_test.go:20 +0x34
github.com/open-code-review/open-code-review/internal/agent.(*CommentWorkerPool).Submit.func1()
	/tmp/oss-r4/open-code-review/internal/agent/agent.go:162 +0xd4
created by github.com/open-code-review/open-code-review/internal/agent.(*CommentWorkerPool).Submit in goroutine 5
	/tmp/oss-r4/open-code-review/internal/agent/agent.go:157 +0xd4
FAIL	github.com/open-code-review/open-code-review/internal/agent	0.376s
FAIL

GREEN (after the fix) — the panic is contained, the healthy closure's result
still arrives, race detector clean:

ok  	github.com/open-code-review/open-code-review/internal/agent	1.526s

Also verified: go build ./... (clean), go vet ./internal/agent/... (clean),
gofmt -l internal/agent/agent.go internal/agent/worker_pool_test.go (empty),
and the full go test -race ./internal/agent/... suite passes.

Test-coverage honesty

Submit takes a caller-supplied closure, so Site A is directly unit-tested
with no new production seam. dispatchSubtasks calls the private
executeSubtask, which needs a fully-wired Agent + LLM client + parsed diffs;
forcing a panic there in a unit test would require adding a net-new production
seam purely for the test. Rather than reshape production code for a test hook,
Site B is covered by the identical recover() idiom + code review (it
converts the panic into the same subtask_error warning path that is already
exercised for error returns). This asymmetry is intentional and called out so
reviewers know Site B is pattern-covered, not unit-covered.

Fixes #171

AI-assisted disclosure: this change was prepared with AI assistance (analysis,
patch, and test authored with an AI coding assistant) and reviewed by the
submitter. The RED→GREEN -race output above is from a real local run on the
checked-out code.

Updates since opening (review follow-ups)

6d00066 — capture debug.Stack() in both recover logs (OpenCodeReview bot suggestion). This adds one import, runtime/debug.
b5e2ca3 — Site B panic-recover now calls telemetry.ErrorEvent(ctx, "subtask.panic", …) for telemetry parity with the error path (uses the parent ctx, not the already-cancelled fileCtx); and CommentWorkerPool.Await now documents that a panicking unit of work is recovered+logged but not surfaced in the result.

Correction: the "No new imports" note above is superseded — runtime/debug is now imported (for debug.Stack()). fmt, sync/atomic, stdout, and telemetry were already imported.

…oroutines (alibaba#171) Fixes alibaba#171

github-actions

🔍 OpenCodeReview found 1 issue(s) in this PR.

✅ 1 posted as inline comment(s)
📝 0 posted as summary

github-actions · 2026-06-21T17:37:00Z

+		defer func() {
+			if r := recover(); r != nil {
+				fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n", r)
+			}
+		}()


Both panic recovery handlers log only %v of the recovered value, which typically yields just the panic message (e.g., "nil pointer dereference") without any stack trace. In concurrent goroutines where panics are being swallowed, this makes production debugging extremely difficult. Consider capturing the stack trace via debug.Stack() from runtime/debug.

Same suggestion applies to the subtask panic handler below.

Suggestion:

Suggested change

defer func() {

if r := recover(); r != nil {

fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n", r)

}

}()

defer func() {

if r := recover(); r != nil {

fmt.Fprintf(stdout.Writer(), "[ocr] CommentWorkerPool panic: %v\n%s\n", r, debug.Stack())

}

}()

Addresses OpenCodeReview finding on alibaba#182: both panic-recovery handlers logged only %v of the recovered value (the panic message, no stack), making swallowed panics in concurrent goroutines hard to debug in production. Capture the stack via debug.Stack() (runtime/debug) in both the CommentWorkerPool and per-file subtask recover handlers. go build/vet clean; internal/agent tests pass. [AI-assisted]

eldar702 · 2026-06-21T18:14:17Z

Good catch — addressed in 6d00066: both recover handlers now log debug.Stack() (via runtime/debug) alongside the panic value, so a swallowed goroutine panic carries a full stack for production debugging. go build/go vet are clean and the internal/agent tests pass.

lizhengfeng101

PR #182 Code Review

Changes: 2 files, +52/-0

Overview

This PR adds recover() guards to two goroutine entry points in agent.go:

Site A: CommentWorkerPool.Submit — comment worker pool goroutines
Site B: dispatchSubtasks — per-file review goroutines

The goal is to prevent a panic in a single file review from crashing the entire process (CWE-248), converting panics into warnings consistent with the existing error handling path.

Positives

Correct defer ordering: Both recover() defers are registered last (run first in LIFO), ensuring panic recovery happens before semaphore release and wg.Done() — no deadlocks or goroutine leaks.
Consistent with existing error path: Site B reuses the subtask_error warning type and subtaskFailed counter, so the all-failed rollup logic remains correct.
recordWarning is thread-safe: Confirmed it uses warningsMu mutex protection (agent.go:306-312).
Good test quality: The test case clearly validates panic isolation (panicking closure doesn't affect healthy closure). RED→GREEN evidence in PR description is thorough.
Excellent PR description: Deep problem analysis, clear fix rationale, honest test-coverage disclosure.

Issues to Address

1. PR description claims "No new imports" — inaccurate

The PR body states "No new imports", but runtime/debug is a newly added import (not previously present in agent.go). Minor inaccuracy, but the description should be precise.

2. Site B is missing telemetry reporting

The existing error path has a telemetry.ErrorEvent call (agent.go:421-422), but the panic recovery path does not. This means panics will appear in warnings and stderr but not in telemetry/tracing.

Suggest adding telemetry in the recover block, using ctx (not fileCtx, since cancel() runs before recover in LIFO order and fileCtx may already be cancelled):

if r := recover(); r != nil {
    atomic.AddInt64(&a.subtaskFailed, 1)
    fmt.Fprintf(stdout.Writer(), "[ocr] Subtask panic for %s: %v\n%s\n", d.NewPath, r, debug.Stack())
    a.recordWarning("subtask_error", d.NewPath, fmt.Sprintf("panic: %v", r))
    telemetry.ErrorEvent(ctx, "subtask.panic", fmt.Errorf("panic: %v", r),
        telemetry.AnyToAttr("file.path", d.NewPath))
}

3. Site A silently swallows panics

CommentWorkerPool.Submit's recover only prints a log. The caller of Await() cannot distinguish between "0 comments because nothing to report" and "0 comments because processing panicked". Consider whether a similar warning/error counting mechanism is needed, or at minimum document this behavior in Await's docstring.

4. What if operations inside recover themselves panic?

If a.recordWarning or debug.Stack() itself panics (extremely rare but theoretically possible), the second panic would not be caught by recover and would still crash the process. A conservative nested recover could guard against this, but this is a very low-priority edge case.

Summary

This is a high-quality defensive fix addressing a real process crash risk. Defer ordering, concurrency safety, and consistency with the existing error path are all handled well. The main suggestion is to add telemetry reporting in Site B so panics and errors have consistent observability. The remaining issues are lower priority and can be addressed as follow-ups.

Recommendation: merge after adding telemetry.

… comment-pool panic swallow (alibaba#171)

eldar702 · 2026-06-22T12:01:26Z

Thanks for the careful review — addressed in b5e2ca3:

"No new imports" was inaccurate — correct, runtime/debug was added for debug.Stack(). I've fixed that line in the PR description.
Site B telemetry parity — added, mirroring the error path:
```
telemetry.ErrorEvent(ctx, "subtask.panic", fmt.Errorf("panic: %v", r),
    telemetry.AnyToAttr("file.path", d.NewPath))
```
Using the parent ctx exactly as you noted — the per-file cancel() defer runs before this recover defer (LIFO), so fileCtx is already cancelled here. Now panics and errors have consistent telemetry/tracing.
Site A silent swallow — documented on CommentWorkerPool.Await's docstring (a panicking unit of work is recovered + logged in Submit but not surfaced as an error or reflected in the count, so callers shouldn't rely on the count alone to detect failures). I kept the pool's behavior/signature unchanged per your "at minimum document" suggestion; happy to add an explicit panic counter if you'd prefer the stronger version.
Nested recover in the handler — agreed it's a very-low-priority edge. recordWarning (a mutex-guarded slice append) and debug.Stack() are panic-free in practice, so I left it out to keep the diff minimal; trivial to add a defensive nested recover if you'd like the belt-and-suspenders.

go build / go vet / gofmt clean; go test -race ./internal/agent/... green.

fix(agent): recover from panics in per-file review and comment-pool g…

d5b1a4c

…oroutines (alibaba#171) Fixes alibaba#171

github-actions Bot reviewed Jun 21, 2026

View reviewed changes

lizhengfeng101 reviewed Jun 22, 2026

View reviewed changes

fix(agent): report recovered subtask panics to telemetry and document…

b5e2ca3

… comment-pool panic swallow (alibaba#171)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agent): recover from panics in per-file review and comment-pool goroutines (#171)#182

fix(agent): recover from panics in per-file review and comment-pool goroutines (#171)#182
eldar702 wants to merge 3 commits into
alibaba:mainfrom
eldar702:fix/recover-panic-review-goroutines

eldar702 commented Jun 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot Jun 21, 2026

Uh oh!

eldar702 commented Jun 21, 2026

Uh oh!

lizhengfeng101 left a comment

Uh oh!

eldar702 commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eldar702 commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem (CWE-248: Uncaught Exception)

Fix

Tests — genuine RED → GREEN under -race

Test-coverage honesty

Updates since opening (review follow-ups)

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 21, 2026

Choose a reason for hiding this comment

Uh oh!

eldar702 commented Jun 21, 2026

Uh oh!

lizhengfeng101 left a comment

Choose a reason for hiding this comment

PR #182 Code Review

Overview

Positives

Issues to Address

1. PR description claims "No new imports" — inaccurate

2. Site B is missing telemetry reporting

3. Site A silently swallows panics

4. What if operations inside recover themselves panic?

Summary

Uh oh!

eldar702 commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

eldar702 commented Jun 21, 2026 •

edited

Loading

Tests — genuine RED → GREEN under `-race`