requeue_stalled_tasks() uses flow timeout instead of effective step timeout
Summary
Stalled task recovery appears to ignore per-step timeout values and uses only flows.opt_timeout when deciding when to requeue tasks stuck in started.
start_tasks() sets the PGMQ visibility timeout from the effective timeout:
coalesce(step.opt_timeout, flow.opt_timeout) + 2
But requeue_stalled_tasks() detects stalled tasks using only:
flow.opt_timeout + 30s buffer
This creates a mismatch where a message can become visible and be repeatedly read by workers, but start_tasks() cannot start it because the step_tasks row remains status = 'started' until the longer flow-level timeout expires.
Observed Impact
In production, steps with configured timeouts of 5s, 10s, 20s, 25s, and 30s all show stalled-task recovery around 90-105s when the flow timeout is 60s and the hardcoded stall buffer is 30s.
Example pattern from an affected run:
- Step:
generateChunkMetadata
- Step timeout:
30s
- Flow timeout:
60s
- 7 parallel map tasks queued at same time
- 6 tasks started immediately and completed in ~3.5-5.8s
- 1 task stayed stuck until requeue/restart
- Delayed task final state:
attempts_count = 2
requeued_count = 1
- queue-to-final-start: ~98s
- final execution time: ~4.9s
This suggests task handler was not intrinsically slow; it was stuck in started until stall recovery requeued it.
Aggregate Evidence
Over a recent 60-day window, excluding a known unrelated backfill day:
- 157 requeued tasks
- 77 affected runs
- All active flows affected
- Requeued task queue-to-final-start distribution:
- min: ~90.3s
- p50: ~95.9s
- p95: ~139.3s
- max: ~236.7s
- Final execution after restart was usually normal:
By step:
| Step |
Step timeout |
Requeued tasks |
Typical queue-to-restart |
generateChunkMetadata |
30s |
90 |
~98s median |
inferOccurredAt |
20s |
17 |
~94s median |
generateEntryMetadata |
20s |
16 |
~94s median |
inferTags |
25s |
13 |
~94s median |
planChunks |
10s |
6 |
~91s median |
| DB/RPC steps |
5-10s |
smaller counts |
~90-106s |
The common ~90-105s delay matches:
flow timeout 60s + stall buffer 30s + cron lag up to 15s
It does not match the configured per-step timeouts.
PGMQ Evidence
For requeued tasks, archived PGMQ messages often had high read_ct values before final completion:
- p50
read_ct: ~20
- p95
read_ct: ~25
- max
read_ct: 45
For a representative delayed task:
- PGMQ message
read_ct = 22
- The message became visible/readable repeatedly
- The related
pgflow.step_tasks row was still status = 'started'
- Only after
requeue_stalled_tasks() reset the row to queued could a worker start it again
This supports the mismatch: PGMQ visibility timeout uses step timeout, but stalled-row recovery uses flow timeout.
Relevant Code
In current schema code, start_tasks() uses the effective step timeout when setting visibility timeout:
coalesce(s.opt_timeout, f.opt_timeout) + 2 as vt_delay
But requeue_stalled_tasks() joins pgflow.flows only and checks:
st.started_at < now() - (f.opt_timeout * interval '1 second') - interval '30 seconds'
It does not join pgflow.steps and does not use coalesce(step.opt_timeout, flow.opt_timeout).
Expected Behavior
Stalled task detection should use the same effective timeout semantics as task start / PGMQ visibility timeout:
coalesce(step.opt_timeout, flow.opt_timeout)
Step-level timeout should override flow-level timeout. Flow timeout should be fallback only when step timeout is null.
Suggested Fix
Update pgflow.requeue_stalled_tasks() to join pgflow.steps and use effective timeout:
coalesce(s.opt_timeout, f.opt_timeout)
Conceptually:
from pgflow.step_tasks st
join pgflow.flows f on f.flow_slug = st.flow_slug
join pgflow.steps s
on s.flow_slug = st.flow_slug
and s.step_slug = st.step_slug
where st.status = 'started'
and st.started_at < now()
- (coalesce(s.opt_timeout, f.opt_timeout) * interval '1 second')
- interval '30 seconds'
Suggested Tests
Add pgTAP coverage for:
- Step timeout shorter than flow timeout requeues after
step_timeout + buffer, not flow_timeout + buffer.
- Step timeout longer than flow timeout does not requeue early at
flow_timeout + buffer.
- Null step timeout falls back to flow timeout.
- Boundary behavior around the stall buffer.
- Requeued task has message visibility set to immediate availability via
set_vt_batch(..., 0).
Secondary Consideration
The hardcoded 30s buffer may be too conservative for workloads with short step timeouts (5s-30s). However, even with the existing 30s buffer, using effective step timeout would significantly reduce stall recovery delay for short steps.
This issue is independent of whether the buffer should later become configurable.
requeue_stalled_tasks()uses flow timeout instead of effective step timeoutSummary
Stalled task recovery appears to ignore per-step
timeoutvalues and uses onlyflows.opt_timeoutwhen deciding when to requeue tasks stuck instarted.start_tasks()sets the PGMQ visibility timeout from the effective timeout:But
requeue_stalled_tasks()detects stalled tasks using only:This creates a mismatch where a message can become visible and be repeatedly read by workers, but
start_tasks()cannot start it because thestep_tasksrow remainsstatus = 'started'until the longer flow-level timeout expires.Observed Impact
In production, steps with configured timeouts of
5s,10s,20s,25s, and30sall show stalled-task recovery around90-105swhen the flow timeout is60sand the hardcoded stall buffer is30s.Example pattern from an affected run:
generateChunkMetadata30s60sattempts_count = 2requeued_count = 1This suggests task handler was not intrinsically slow; it was stuck in
starteduntil stall recovery requeued it.Aggregate Evidence
Over a recent 60-day window, excluding a known unrelated backfill day:
By step:
generateChunkMetadatainferOccurredAtgenerateEntryMetadatainferTagsplanChunksThe common ~90-105s delay matches:
It does not match the configured per-step timeouts.
PGMQ Evidence
For requeued tasks, archived PGMQ messages often had high
read_ctvalues before final completion:read_ct: ~20read_ct: ~25read_ct: 45For a representative delayed task:
read_ct = 22pgflow.step_tasksrow was stillstatus = 'started'requeue_stalled_tasks()reset the row toqueuedcould a worker start it againThis supports the mismatch: PGMQ visibility timeout uses step timeout, but stalled-row recovery uses flow timeout.
Relevant Code
In current schema code,
start_tasks()uses the effective step timeout when setting visibility timeout:But
requeue_stalled_tasks()joinspgflow.flowsonly and checks:It does not join
pgflow.stepsand does not usecoalesce(step.opt_timeout, flow.opt_timeout).Expected Behavior
Stalled task detection should use the same effective timeout semantics as task start / PGMQ visibility timeout:
Step-level timeout should override flow-level timeout. Flow timeout should be fallback only when step timeout is null.
Suggested Fix
Update
pgflow.requeue_stalled_tasks()to joinpgflow.stepsand use effective timeout:Conceptually:
Suggested Tests
Add pgTAP coverage for:
step_timeout + buffer, notflow_timeout + buffer.flow_timeout + buffer.set_vt_batch(..., 0).Secondary Consideration
The hardcoded
30sbuffer may be too conservative for workloads with short step timeouts (5s-30s). However, even with the existing30sbuffer, using effective step timeout would significantly reduce stall recovery delay for short steps.This issue is independent of whether the buffer should later become configurable.