Fix dagrun starvation by Nataneljpwd · Pull Request #64109 · apache/airflow

Nataneljpwd · 2026-03-23T16:45:40Z

We have been experiencing severe dagrun starvation at our cluster, where when there were a lot of dagruns, and a low max_active_runs limit (hundreds to thousands runs with a limit in the 10s) this caused a lot of dags to get stuck in queued state without moving to running, causing those dagruns to timeout.
After investigation, we found that the reason was due to the _start_queued_dagruns method, where the query was returning dagruns which cannot be set to running due to the max_active_runs limit, meaning that other dagruns where starved.

A similar issue occurs when new dagruns are created in large batches (due to the nulls first), yet this is out of scope for the given pr, I will submit an additional PR soon.

closes #49508

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)

Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
When adding dependency, check compliance with the ASF 3rd Party License Policy.
For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

Important

🛠️ Maintainer triage note for @Nataneljpwd · by @potiuk · 2026-06-17 18:53 UTC

Some review feedback from ashb, kaxil is waiting on you:

Please respond to the open review thread(s) — reply or push a fix in each, then mark them resolved.
See the Pull Request quality criteria.

The ball is in your court — you've been assigned to this PR. Reply or push a fix in each thread, then mark them resolved.

_{Automated triage — may be imperfect; a maintainer takes the next look.}

collinmcnulty · 2026-03-23T17:10:14Z

I'm struggling to understand how the problem has been solved from reading the code. Can you explain your solution for preventing the starvation?

Nataneljpwd · 2026-03-23T17:31:38Z

I'm struggling to understand how the problem has been solved from reading the code. Can you explain your solution for preventing the starvation?

Sure,
Instead of (as of now) querying N first runs, and then filtering on the max active runs, we query the first N runs where we (in SQL) check the the max active runs (before the limit is applied)
And so we skip a lot of runs which cannot be scheduled

Assume dags a, b
a - 3 max active runs
b - no limit (default to 16 from config)
If now the query result looked like so (small letter is schedulable, capital letter is schedulable according to ) where each row represents a run (the - determine the limit, all runs before the - are selected, all other are ignored) where the max dagruns to schedule per loop (the limit) is 5

A
A
A
a
a

B
B
B

Here (as of now) the last 3 dagruns are ommitted and ignored (starving runs from b)

After the change it will look like so:

A
A
A
B
B

B

Now we do schedule everything we queried without dagruns from a limiting us (the limit now becomes the max dagruns per loop to schedule configuration) and it is guaranteed that the runs queried will be able to run

Hope this explained it, if anything is not clear feel free to let me know, I will write a better explanation.

kaxil · 2026-03-24T12:16:16Z

This will have to wait until 3.2.0 -- This touches the core and I don't want to hurry until 3.2.0 is out.

We have 1200+ commits in 3.2.0

Copilot

Pull request overview

This PR addresses scheduler DAG run starvation by changing how queued DagRuns are selected for transition to RUNNING when max_active_runs is low and there are large backlogs of queued runs.

Changes:

Update DagRun.get_queued_dag_runs_to_set_running() to limit queued candidates per DAG/backfill using a row_number() window so the scheduler doesn’t spend its per-loop budget examining non-runnable runs from a single DAG.
Adjust scheduler unit tests to validate the improved fairness/scheduling behavior under backlog conditions.
Remove a max-active guard/logging in _create_dag_runs() and add a TODO note around _set_exceeds_max_active_runs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`airflow-core/src/airflow/models/dagrun.py`	Changes queued dagrun selection query to avoid starvation by limiting per DAG/backfill candidates.
`airflow-core/src/airflow/jobs/scheduler_job_runner.py`	Removes a creation-time max-active skip and adds a TODO comment near `_set_exceeds_max_active_runs`.
`airflow-core/tests/unit/jobs/test_scheduler_job.py`	Updates/reshapes scheduling tests to assert non-starved behavior and updated run counts.

kaxil

The core idea — using row_number() to cap how many queued runs per DAG/backfill pass through the query — is a reasonable approach to solving the starvation problem in get_queued_dag_runs_to_set_running. However, this PR bundles in unrelated and incorrect changes to the DagRun creation path (_create_dag_runs), which is a separate concern from the QUEUED→RUNNING promotion path.

See inline comments for details.

Nataneljpwd · 2026-03-24T13:10:19Z

This will have to wait until 3.2.0 -- This touches the core and I don't want to hurry until 3.2.0 is out.

We have 1200+ commits in 3.2.0

Sure, no problem

…ore to be set to running

Nataneljpwd · 2026-03-24T19:55:18Z

@kaxil, @eladkal I have addressed the review comments and would appreciate another review

…fix-dagrun-starvation

Nataneljpwd · 2026-05-21T16:14:03Z

Hello, I would appreciate a review of this PR again, I have updated everything according to the commets given

potiuk · 2026-05-24T18:03:56Z

@Nataneljpwd A few things need addressing before review — see our Pull Request quality criteria.

❌ CI fails: Integration and System Tests / Integration core otel (and possibly other checks — see the Checks tab for the full list).

No rush.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

eladkal · 2026-05-26T08:54:48Z

@Nataneljpwd can you fix the failing tests?

…fix-dagrun-starvation

potiuk · 2026-05-27T19:57:29Z

@Nataneljpwd — There is 1 unresolved review thread on this PR from kaxil, and you have engaged with it (you replied in-thread and have pushed commits since). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, please mark the thread as resolved and ping the reviewer (kaxil) for a final look. They will either label the PR ready for maintainer review or follow up with additional feedback.

If you are still working on the thread, please reply with what is outstanding so the thread stays unresolved on purpose.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Nataneljpwd · 2026-05-27T20:24:14Z

@Nataneljpwd — There is 1 unresolved review thread on this PR from kaxil, and you have engaged with it (you replied in-thread and have pushed commits since). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, please mark the thread as resolved and ping the reviewer (kaxil) for a final look. They will either label the PR ready for maintainer review or follow up with additional feedback.

If you are still working on the thread, please reply with what is outstanding so the thread stays unresolved on purpose.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

I have left the comment on purpose for Kaxil to decide if it's resolved or not

eladkal · 2026-06-02T15:02:25Z

Will need a 2nd review from @kaxil and possibly @ashb

ashb · 2026-06-04T06:08:01Z

+
+        assert running_dagrun_count == max_active_runs * len(dag_ids)
+
+    def test_no_more_dagruns_are_set_to_running_when_max_active_runs_exceeded(self, dag_maker, session):


I'm not convinced this test really adequately covers the case you claim it does.

With max_active_runs=1 and 6 runs already RUNNING, the query's new rn <= max_active_runs - num_running filter never gets a chance to matter — the candidates that reach _start_queued_dagruns are rejected by the existing Python guard at scheduler_job_runner.py:2345 (active_runs >= dag_run.max_active_runs), which this PR doesn't touch. So running_pre == running_post holds with or without your query change; i suspect this test would pass on main. To actually cover the fix, you need a case where the query's per-DAG cap is what limits candidates: e.g. one DAG at its max_active_runs with non-backfill runs already running plus a large queued backlog, alongside a second DAG with free capacity, then assert the second DAG isn't starved out of the 20-candidate budget

Hello Ash, I am not sure I understood, as I have a test for it on line 3810, where I have 3 dag id's, i create 50 of each instance and I set the max_active_runs to 3, when I query, I use batches of 20 (the default), which does cover the changes I have made in the PR

talking about
test_runs_are_not_starved_by_max_active_runs_limit

correct me if I am wrong, but I think that is what you were talking about, in 1 query in the past it would only get 3 form dag1, for the next iteration get nothing, for the next 3 from dag2 and so on

I am also pretty sure that now it is safe to remove the check in the scheduler job runner, as it is done in the query instead, and there is no way for a race condition as the rows are locked, though I left it as per Kaxil's request

potiuk · 2026-06-09T09:45:48Z

@Nataneljpwd — There are 2 unresolved review thread(s) on this PR, and you have engaged with each one (post-review commits and/or in-thread replies). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, reply here (a short "yes / ready" is fine) and an Apache Airflow maintainer will pick the PR up from the review queue on the next sweep.

If you are still working on a thread, please reply with what is outstanding so the threads stay unresolved on purpose.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Nataneljpwd · 2026-06-12T16:27:40Z

@Nataneljpwd — There are 2 unresolved review thread(s) on this PR, and you have engaged with each one (post-review commits and/or in-thread replies). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, reply here (a short "yes / ready" is fine) and an Apache Airflow maintainer will pick the PR up from the review queue on the next sweep.

If you are still working on a thread, please reply with what is outstanding so the threads stay unresolved on purpose.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Yes, it is ready, I am waiting for a response from Ash

Nataneljpwd added 4 commits March 23, 2026 18:24

fixed dagrun starvation with the max_active_tasks limit

f922f2b

formatted files

85e4c6a

removed print

dd6b74d

removed redundent tests

ecdb722

Nataneljpwd requested review from XD-DENG and ashb as code owners March 23, 2026 16:45

boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Mar 23, 2026

Nataneljpwd marked this pull request as draft March 23, 2026 16:48

Nataneljpwd added 2 commits March 23, 2026 21:20

fix mysql test

8356843

merge branch main

fe5462d

Nataneljpwd force-pushed the bugfix/fix-dagrun-starvation branch from 6b09818 to fe5462d Compare March 23, 2026 19:21

Nataneljpwd marked this pull request as ready for review March 23, 2026 21:01

eladkal requested a review from kaxil March 24, 2026 10:05

eladkal reviewed Mar 24, 2026

View reviewed changes

Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py

kaxil requested a review from Copilot March 24, 2026 12:15

kaxil added this to the Airflow 3.2.1 milestone Mar 24, 2026

Copilot started reviewing on behalf of kaxil March 24, 2026 12:16 View session

Copilot AI reviewed Mar 24, 2026

View reviewed changes

Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated

Comment thread airflow-core/tests/unit/jobs/test_scheduler_job.py Outdated

Comment thread airflow-core/src/airflow/models/dagrun.py Outdated

kaxil requested changes Mar 24, 2026

View reviewed changes

eladkal added the type:bug-fix Changelog: Bug Fixes label Mar 24, 2026

Nataneljpwd added 2 commits March 24, 2026 21:35

address cr comments

1c0c28f

merge branch main

c47e216

Nataneljpwd force-pushed the bugfix/fix-dagrun-starvation branch from a9c77cd to c47e216 Compare March 24, 2026 19:35

Nataneljpwd and others added 2 commits March 24, 2026 21:52

added explicit test for more running dagruns than limit not causing m…

9c28d0d

…ore to be set to running

Merge branch 'main' into bugfix/fix-dagrun-starvation

9992724

Natanel Rudyuklakir added 3 commits May 21, 2026 19:08

resolved CR comments and cleaned up the comments

c432be9

update comment

ea29eb6

Merge branch 'main' of https://github.com/apache/airflow into bugfix/…

a3bdc84

…fix-dagrun-starvation

Nataneljpwd added 3 commits May 27, 2026 20:07

fixed failing tests

e2e6a1f

Merge branch 'main' of https://github.com/apache/airflow into bugfix/…

77ba9b2

…fix-dagrun-starvation

fixed backfill query

d68d3a1

Merge branch 'main' into bugfix/fix-dagrun-starvation

168e86b

Nataneljpwd requested review from eladkal and kaxil June 1, 2026 17:27

Merge branch 'main' into bugfix/fix-dagrun-starvation

f1ce411

potiuk added the ready for maintainer review Set after triaging when all criteria pass. label Jun 3, 2026

ashb requested changes Jun 4, 2026

View reviewed changes

ashb removed the ready for maintainer review Set after triaging when all criteria pass. label Jun 4, 2026

Merge branch 'main' into bugfix/fix-dagrun-starvation

8bbc5b3

Nataneljpwd requested a review from ashb June 7, 2026 18:13

potiuk assigned Nataneljpwd Jun 17, 2026

Nataneljpwd added 2 commits June 23, 2026 00:15

Merge branch 'main' into bugfix/fix-dagrun-starvation

b9a3644

Merge branch 'main' into bugfix/fix-dagrun-starvation

86b6d99

vatsrahul1001 modified the milestones: Airflow 3.3.0, Airflow 3.3.1 Jun 25, 2026

Merge branch 'main' into bugfix/fix-dagrun-starvation

9ed23af

Nataneljpwd changed the title ~~Bugfix/fix dagrun starvation~~ Fix dagrun starvation Jul 2, 2026


		assert running_dagrun_count == max_active_runs * len(dag_ids)

		def test_no_more_dagruns_are_set_to_running_when_max_active_runs_exceeded(self, dag_maker, session):

Uh oh!

Conversation

Nataneljpwd commented Mar 23, 2026 • edited by potiuk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Was generative AI tooling used to co-author this PR?

Uh oh!

collinmcnulty commented Mar 23, 2026

Uh oh!

Nataneljpwd commented Mar 23, 2026

A A A a a

A A A B B

Uh oh!

Uh oh!

kaxil commented Mar 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaxil left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Nataneljpwd commented Mar 24, 2026

Uh oh!

Nataneljpwd commented Mar 24, 2026

Uh oh!

Nataneljpwd commented May 21, 2026

Uh oh!

potiuk commented May 24, 2026

Uh oh!

eladkal commented May 26, 2026

Uh oh!

potiuk commented May 27, 2026

Uh oh!

Nataneljpwd commented May 27, 2026

Uh oh!

eladkal commented Jun 2, 2026

Uh oh!

ashb Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Nataneljpwd Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

Nataneljpwd Jun 7, 2026

Choose a reason for hiding this comment

Uh oh!

potiuk commented Jun 9, 2026

Uh oh!

Nataneljpwd commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

Nataneljpwd commented Mar 23, 2026 •

edited by potiuk

Loading

A
A
A
a
a

A
A
A
B
B