Fix dagrun starvation#64109
Conversation
|
I'm struggling to understand how the problem has been solved from reading the code. Can you explain your solution for preventing the starvation? |
Sure, Assume dags a, b A
|
6b09818 to
fe5462d
Compare
|
This will have to wait until 3.2.0 -- This touches the core and I don't want to hurry until 3.2.0 is out. We have 1200+ commits in 3.2.0 |
There was a problem hiding this comment.
Pull request overview
This PR addresses scheduler DAG run starvation by changing how queued DagRuns are selected for transition to RUNNING when max_active_runs is low and there are large backlogs of queued runs.
Changes:
- Update
DagRun.get_queued_dag_runs_to_set_running()to limit queued candidates per DAG/backfill using arow_number()window so the scheduler doesn’t spend its per-loop budget examining non-runnable runs from a single DAG. - Adjust scheduler unit tests to validate the improved fairness/scheduling behavior under backlog conditions.
- Remove a max-active guard/logging in
_create_dag_runs()and add a TODO note around_set_exceeds_max_active_runs.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
airflow-core/src/airflow/models/dagrun.py |
Changes queued dagrun selection query to avoid starvation by limiting per DAG/backfill candidates. |
airflow-core/src/airflow/jobs/scheduler_job_runner.py |
Removes a creation-time max-active skip and adds a TODO comment near _set_exceeds_max_active_runs. |
airflow-core/tests/unit/jobs/test_scheduler_job.py |
Updates/reshapes scheduling tests to assert non-starved behavior and updated run counts. |
kaxil
left a comment
There was a problem hiding this comment.
The core idea — using row_number() to cap how many queued runs per DAG/backfill pass through the query — is a reasonable approach to solving the starvation problem in get_queued_dag_runs_to_set_running. However, this PR bundles in unrelated and incorrect changes to the DagRun creation path (_create_dag_runs), which is a separate concern from the QUEUED→RUNNING promotion path.
See inline comments for details.
Sure, no problem |
a9c77cd to
c47e216
Compare
|
Hello, I would appreciate a review of this PR again, I have updated everything according to the commets given |
|
@Nataneljpwd A few things need addressing before review — see our Pull Request quality criteria.
No rush. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting |
|
@Nataneljpwd can you fix the failing tests? |
|
@Nataneljpwd — There is 1 unresolved review thread on this PR from If yes, please mark the thread as resolved and ping the reviewer ( If you are still working on the thread, please reply with what is outstanding so the thread stays unresolved on purpose. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
I have left the comment on purpose for Kaxil to decide if it's resolved or not |
|
|
||
| assert running_dagrun_count == max_active_runs * len(dag_ids) | ||
|
|
||
| def test_no_more_dagruns_are_set_to_running_when_max_active_runs_exceeded(self, dag_maker, session): |
There was a problem hiding this comment.
I'm not convinced this test really adequately covers the case you claim it does.
With max_active_runs=1 and 6 runs already RUNNING, the query's new rn <= max_active_runs - num_running filter never gets a chance to matter — the candidates that reach _start_queued_dagruns are rejected by the existing Python guard at scheduler_job_runner.py:2345 (active_runs >= dag_run.max_active_runs), which this PR doesn't touch. So running_pre == running_post holds with or without your query change; i suspect this test would pass on main. To actually cover the fix, you need a case where the query's per-DAG cap is what limits candidates: e.g. one DAG at its max_active_runs with non-backfill runs already running plus a large queued backlog, alongside a second DAG with free capacity, then assert the second DAG isn't starved out of the 20-candidate budget
There was a problem hiding this comment.
Hello Ash, I am not sure I understood, as I have a test for it on line 3810, where I have 3 dag id's, i create 50 of each instance and I set the max_active_runs to 3, when I query, I use batches of 20 (the default), which does cover the changes I have made in the PR
talking about
test_runs_are_not_starved_by_max_active_runs_limit
correct me if I am wrong, but I think that is what you were talking about, in 1 query in the past it would only get 3 form dag1, for the next iteration get nothing, for the next 3 from dag2 and so on
There was a problem hiding this comment.
I am also pretty sure that now it is safe to remove the check in the scheduler job runner, as it is done in the query instead, and there is no way for a race condition as the rows are locked, though I left it as per Kaxil's request
|
@Nataneljpwd — There are 2 unresolved review thread(s) on this PR, and you have engaged with each one (post-review commits and/or in-thread replies). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation? If yes, reply here (a short "yes / ready" is fine) and an Apache Airflow maintainer will pick the PR up from the review queue on the next sweep. If you are still working on a thread, please reply with what is outstanding so the threads stay unresolved on purpose. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
Yes, it is ready, I am waiting for a response from Ash |
We have been experiencing severe dagrun starvation at our cluster, where when there were a lot of dagruns, and a low max_active_runs limit (hundreds to thousands runs with a limit in the 10s) this caused a lot of dags to get stuck in queued state without moving to running, causing those dagruns to timeout.
After investigation, we found that the reason was due to the _start_queued_dagruns method, where the query was returning dagruns which cannot be set to running due to the max_active_runs limit, meaning that other dagruns where starved.
A similar issue occurs when new dagruns are created in large batches (due to the nulls first), yet this is out of scope for the given pr, I will submit an additional PR soon.
closes #49508
Was generative AI tooling used to co-author this PR?
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.Important
🛠️ Maintainer triage note for @Nataneljpwd · by
@potiuk· 2026-06-17 18:53 UTCSome review feedback from
ashb,kaxilis waiting on you:The ball is in your court — you've been assigned to this PR. Reply or push a fix in each thread, then mark them resolved.
Automated triage — may be imperfect; a maintainer takes the next look.