
feat(coordinator): limit parallelism during chunk assignment#1797

Open
Thegaram wants to merge 1 commit into develop from coordinator-limit-chunk-parallelism

Conversation

Contributor

Thegaram commented Feb 26, 2026

Purpose or design rationale of this PR

applyUniversal calls debug_executionWitness, which degrades significantly under concurrent load. When multiple chunk proof assignments arrive simultaneously, the unthrottled parallel calls cause severe slowdowns.

This PR adds a semaphore (capacity 2) that limits concurrent applyUniversal invocations. Requests that arrive while both slots are occupied block until a slot opens or the request context is cancelled, preventing goroutine accumulation.

PR title

Your PR title must follow conventional commits (as we are doing squash merge for each PR), so it must start with one of the following types:

  • feat: A new feature

Deployment tag versioning

Has tag in common/version.go been updated or have you added bump-version label to this PR?

  • No, this PR doesn't involve a new deployment, git tag, docker image tag
  • Yes

Breaking change label

Does this PR have the breaking-change label?

  • No, this PR is not a breaking change
  • Yes

Summary by CodeRabbit

  • Bug Fixes

    • Optimized execution witness generation with resource throttling to prevent system overload during concurrent operations and enhance stability.
  • Chores

    • Version bumped to 4.7.13.


coderabbitai bot commented Feb 26, 2026

📝 Walkthrough

This PR bumps the version tag from v4.7.12 to v4.7.13 in the version file and introduces a global throttle mechanism with a maximum parallelism of 2 for debug_executionWitness processing in the prover task coordinator, using semaphores to limit concurrent universal-witness operations across two code paths.

Changes

  • Version Bump (common/version/version.go): Updated the Go build tag from "v4.7.12" to "v4.7.13"; affects the Version composition string at runtime.
  • Concurrency Throttle for Witness Processing (coordinator/internal/logic/provertask/chunk_prover_task.go): Introduced package-level throttle variables (applyUniversalMaxParallelism = 2 and witnessSemaphore) and added semaphore acquisition/release logic in two code paths (the Assign path and the subsequent task formatting path) when universal witness processing is enabled; includes context cancellation handling with logging and active attempt recovery.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Suggested labels

bump-version

Suggested reviewers

  • georgehao

Poem

🐰 A rabbit hops through version bumps with glee,
From 4.7.12 to lucky thirteen it shall be,
With semaphores standing guard, just two at a time,
The witness threads dance to concurrency's rhyme,
Control flows smoothly as context bells chime.

🚥 Pre-merge checks | ✅ 3 passed
  • Title check: Passed. The title 'feat(coordinator): limit parallelism during chunk assignment' accurately describes the main change: adding parallelism limits to chunk assignment operations.
  • Description check: Passed. The description provides a clear purpose, follows the template structure, answers key questions (what/why/how), confirms the PR title follows conventional commits, and documents deployment versioning. All major sections are covered.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.




coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
coordinator/internal/logic/provertask/chunk_prover_task.go (2)

27-32: Consider making universal parallelism configurable.

Hard-coding 2 at package scope makes tuning production behavior require code changes. A config-backed value (with sane default) would be easier to operate under varying prover capacity and API latency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/internal/logic/provertask/chunk_prover_task.go` around lines 27 -
32, The package currently hard-codes applyUniversalMaxParallelism = 2 and
constructs witnessSemaphore at package init; make this configurable by reading a
config/env/flag value at startup and initializing applyUniversalMaxParallelism
and witnessSemaphore there instead of at package scope. Specifically: add a
config key (or env var/flag) for universal max parallelism with a sane default,
validate it (ensure >=1), and create witnessSemaphore = make(chan struct{}, max)
during application initialization (e.g., in an init hook or the coordinator
startup/new constructor) so the semaphore capacity reflects the configured
value; update any code that references applyUniversalMaxParallelism to use the
configured variable.

212-215: Release the semaphore immediately after applyUniversal completes.

Line 215 defers release until Assign returns, so post-processing (DB insert/metrics/logging) still occupies a slot. Scope the acquire/release around cp.applyUniversal only to keep throughput predictable.

💡 Proposed refactor
 	if getTaskParameter.Universal {
 		var metadata []byte
-
-		select {
-		case witnessSemaphore <- struct{}{}:
-			// Released when Assign returns (defer).
-			defer func() { <-witnessSemaphore }()
-		case <-ctx.Done():
-			log.Warn("context canceled waiting for witness semaphore", "task_id", chunkTask.Hash, "err", ctx.Err())
-			cp.recoverActiveAttempts(ctx, chunkTask)
-			return nil, ctx.Err()
-		}
-
-		taskMsg, metadata, err = cp.applyUniversal(taskMsg)
+		taskMsg, metadata, err = func() (*coordinatorType.GetTaskSchema, []byte, error) {
+			select {
+			case witnessSemaphore <- struct{}{}:
+				defer func() { <-witnessSemaphore }()
+			case <-ctx.Done():
+				log.Warn("context canceled waiting for witness semaphore", "task_id", chunkTask.Hash, "err", ctx.Err())
+				if taskCtx.hasAssignedTask == nil {
+					cp.recoverActiveAttempts(ctx, chunkTask)
+				}
+				return nil, nil, ctx.Err()
+			}
+			return cp.applyUniversal(taskMsg)
+		}()
 		if err != nil {
 			cp.recoverActiveAttempts(ctx, chunkTask)
 			log.Error("Generate universal prover task failure", "task_id", chunkTask.Hash, "type", "chunk", "err", err)
 			return nil, ErrCoordinatorInternalFailure
 		}
 		proverTask.Metadata = metadata
 	}

Also applies to: 222-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/internal/logic/provertask/chunk_prover_task.go` around lines 212
- 215, The semaphore acquire around the witness work currently holds the slot
until Assign returns because the release is deferred; change the scope so
witnessSemaphore is acquired before calling cp.applyUniversal and released
immediately after cp.applyUniversal completes (before any DB
inserts/metrics/logging or before calling Assign) instead of deferring until
Assign returns; update both places that acquire witnessSemaphore (the block
around cp.applyUniversal and the similar block later) so the defer is removed
and a direct <-witnessSemaphore is executed right after cp.applyUniversal
returns to limit semaphore occupancy to only the applyUniversal duration.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2b5e0 and f8cec5d.

📒 Files selected for processing (2)
  • common/version/version.go
  • coordinator/internal/logic/provertask/chunk_prover_task.go

Comment on lines +217 to +219
log.Warn("context canceled waiting for witness semaphore", "task_id", chunkTask.Hash, "err", ctx.Err())
cp.recoverActiveAttempts(ctx, chunkTask)
return nil, ctx.Err()

coderabbitai bot commented Feb 26, 2026


⚠️ Potential issue | 🟠 Major

Guard active-attempt recovery to only run when this request incremented attempts.

Line 218 decrements active attempts unconditionally, but increments only happen in the taskCtx.hasAssignedTask == nil path. This can corrupt attempt accounting for already-assigned tasks.

💡 Proposed fix
 		case <-ctx.Done():
 			log.Warn("context canceled waiting for witness semaphore", "task_id", chunkTask.Hash, "err", ctx.Err())
-			cp.recoverActiveAttempts(ctx, chunkTask)
+			if taskCtx.hasAssignedTask == nil {
+				cp.recoverActiveAttempts(ctx, chunkTask)
+			}
 			return nil, ctx.Err()
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@coordinator/internal/logic/provertask/chunk_prover_task.go` around lines 217
- 219, The early return on context cancellation is unconditionally calling
cp.recoverActiveAttempts, which must only run when this request performed the
increment (the increment happens in the taskCtx.hasAssignedTask == nil path).
Update the context-cancel branch in chunk_prover_task.go to check
taskCtx.hasAssignedTask == nil (or another flag that indicates this request
incremented attempts) before calling cp.recoverActiveAttempts(ctx, chunkTask)
and before returning; leave the return of ctx.Err() unchanged when no increment
occurred. Ensure the guard uses the same condition used where attempts are
incremented so accounting stays consistent.

Contributor Author

This seems to be a pre-existing issue on other code paths as well, might this be a problem in practice? @georgehao


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

@codecov-commenter

Codecov Report

❌ Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.42%. Comparing base (9b2b5e0) to head (f8cec5d).

Files with missing lines | Patch % | Lines
...tor/internal/logic/provertask/chunk_prover_task.go | 55.55% | 4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1797   +/-   ##
========================================
  Coverage    36.41%   36.42%           
========================================
  Files          248      248           
  Lines        21321    21330    +9     
========================================
+ Hits          7765     7770    +5     
- Misses       12733    12737    +4     
  Partials       823      823           
Flag Coverage Δ
coordinator 32.32% <55.55%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

