
Fix scheduler killing healthy running tasks on duplicate dispatch#69336

Open
dsuhinin wants to merge 3 commits into apache:main from dsuhinin:dsuhinin/fix-scheduler-fail-running-ti-on-duplicate-dispatch

@dsuhinin dsuhinin commented Jul 3, 2026

Contributor

When a task instance is dispatched twice, only one worker actually runs it; the duplicate loses the race and exits. The problem was that the scheduler treated the leftover failure event from that dead duplicate as a failure of the task itself, and failed the live run even though it was still executing normally.

This fixes both sides of the race:

  • Scheduler: before failing a queued task, it now checks whether the task is actually alive, meaning RUNNING with a recent heartbeat. If a worker is clearly still executing it, the stale event is ignored. Tasks that genuinely died are still caught later by heartbeat-timeout detection.
  • Supervisor: when the server tells a duplicate worker to back off (TaskAlreadyRunningError), it now exits quietly instead of reporting a crash. The duplicate never did any real work, so there's nothing to fail.
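
A minimal Python sketch of the two changes above, under stated assumptions: `TaskInstance`, `TIState`, `is_alive`, `handle_executor_failure_event` and `supervise` are hypothetical stand-ins, not Airflow's actual scheduler or Task SDK code, and the local `TaskAlreadyRunningError` class only mimics the error this PR refers to. The liveness rule follows the description: a task counts as alive only if it is RUNNING and its last heartbeat falls within a timeout window.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Callable


class TIState(str, Enum):
    # Hypothetical subset of task-instance states, for illustration only.
    QUEUED = "queued"
    RUNNING = "running"
    FAILED = "failed"


@dataclass
class TaskInstance:
    # Hypothetical, stripped-down stand-in for a task-instance row.
    state: TIState
    last_heartbeat_at: datetime | None


class TaskAlreadyRunningError(Exception):
    """Local stand-in for the server's "another worker already owns this TI" signal."""


def is_alive(ti: TaskInstance, now: datetime, heartbeat_timeout: timedelta) -> bool:
    # Alive = RUNNING *and* heartbeating recently; RUNNING alone is not enough.
    return (
        ti.state == TIState.RUNNING
        and ti.last_heartbeat_at is not None
        and now - ti.last_heartbeat_at < heartbeat_timeout
    )


def handle_executor_failure_event(
    ti: TaskInstance, now: datetime, heartbeat_timeout: timedelta
) -> bool:
    """Scheduler side: return True if the TI was failed, False if the event was ignored."""
    if is_alive(ti, now, heartbeat_timeout):
        # A worker is clearly still on this TI, so the failure event most likely
        # came from a duplicate that lost the race: ignore it.
        return False
    ti.state = TIState.FAILED
    return True


def supervise(start_task: Callable[[], None]) -> str:
    """Supervisor side: treat TaskAlreadyRunningError as a quiet exit, not a crash."""
    try:
        start_task()
    except TaskAlreadyRunningError:
        # The duplicate never ran any user code, so there is nothing to report.
        return "skipped"
    # Any other exception propagates and is reported as a real failure.
    return "ran"
```

Requiring a recent heartbeat, not just the RUNNING state, keeps the guard narrow: a task whose worker died after reaching RUNNING is not shielded from the failure event, and, per the description, anything that does slip through is still picked up by heartbeat-timeout detection.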

Possible fix for #57041.


@boring-cyborg (bot) added the area:Scheduler including HA (high availability) scheduler and area:task-sdk labels on Jul 3, 2026
