Skip to content

Fix dagrun starvation#64109

Open
Nataneljpwd wants to merge 35 commits into
apache:mainfrom
Nataneljpwd:bugfix/fix-dagrun-starvation
Open

Fix dagrun starvation#64109
Nataneljpwd wants to merge 35 commits into
apache:mainfrom
Nataneljpwd:bugfix/fix-dagrun-starvation

Conversation

@Nataneljpwd

@Nataneljpwd Nataneljpwd commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

We have been experiencing severe dagrun starvation at our cluster, where when there were a lot of dagruns, and a low max_active_runs limit (hundreds to thousands runs with a limit in the 10s) this caused a lot of dags to get stuck in queued state without moving to running, causing those dagruns to timeout.
After investigation, we found that the reason was due to the _start_queued_dagruns method, where the query was returning dagruns which cannot be set to running due to the max_active_runs limit, meaning that other dagruns where starved.

A similar issue occurs when new dagruns are created in large batches (due to the nulls first), yet this is out of scope for the given pr, I will submit an additional PR soon.

closes #49508


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

Important

🛠️ Maintainer triage note for @Nataneljpwd · by @potiuk · 2026-06-17 18:53 UTC

Some review feedback from ashb, kaxil is waiting on you:

The ball is in your court — you've been assigned to this PR. Reply or push a fix in each thread, then mark them resolved.

Automated triage — may be imperfect; a maintainer takes the next look.

@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Mar 23, 2026
@Nataneljpwd Nataneljpwd marked this pull request as draft March 23, 2026 16:48
@collinmcnulty

Copy link
Copy Markdown
Contributor

I'm struggling to understand how the problem has been solved from reading the code. Can you explain your solution for preventing the starvation?

@Nataneljpwd

Copy link
Copy Markdown
Contributor Author

I'm struggling to understand how the problem has been solved from reading the code. Can you explain your solution for preventing the starvation?

Sure,
Instead of (as of now) querying N first runs, and then filtering on the max active runs, we query the first N runs where we (in SQL) check the the max active runs (before the limit is applied)
And so we skip a lot of runs which cannot be scheduled

Assume dags a, b
a - 3 max active runs
b - no limit (default to 16 from config)
If now the query result looked like so (small letter is schedulable, capital letter is schedulable according to ) where each row represents a run (the - determine the limit, all runs before the - are selected, all other are ignored) where the max dagruns to schedule per loop (the limit) is 5

A
A
A
a
a

B
B
B

Here (as of now) the last 3 dagruns are ommitted and ignored (starving runs from b)

After the change it will look like so:

A
A
A
B
B

B

Now we do schedule everything we queried without dagruns from a limiting us (the limit now becomes the max dagruns per loop to schedule configuration) and it is guaranteed that the runs queried will be able to run

Hope this explained it, if anything is not clear feel free to let me know, I will write a better explanation.

@Nataneljpwd Nataneljpwd force-pushed the bugfix/fix-dagrun-starvation branch from 6b09818 to fe5462d Compare March 23, 2026 19:21
@Nataneljpwd Nataneljpwd marked this pull request as ready for review March 23, 2026 21:01
@eladkal eladkal requested a review from kaxil March 24, 2026 10:05
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py
@kaxil kaxil requested a review from Copilot March 24, 2026 12:15
@kaxil kaxil added this to the Airflow 3.2.1 milestone Mar 24, 2026
@kaxil

kaxil commented Mar 24, 2026

Copy link
Copy Markdown
Member

This will have to wait until 3.2.0 -- This touches the core and I don't want to hurry until 3.2.0 is out.

We have 1200+ commits in 3.2.0

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses scheduler DAG run starvation by changing how queued DagRuns are selected for transition to RUNNING when max_active_runs is low and there are large backlogs of queued runs.

Changes:

  • Update DagRun.get_queued_dag_runs_to_set_running() to limit queued candidates per DAG/backfill using a row_number() window so the scheduler doesn’t spend its per-loop budget examining non-runnable runs from a single DAG.
  • Adjust scheduler unit tests to validate the improved fairness/scheduling behavior under backlog conditions.
  • Remove a max-active guard/logging in _create_dag_runs() and add a TODO note around _set_exceeds_max_active_runs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
airflow-core/src/airflow/models/dagrun.py Changes queued dagrun selection query to avoid starvation by limiting per DAG/backfill candidates.
airflow-core/src/airflow/jobs/scheduler_job_runner.py Removes a creation-time max-active skip and adds a TODO comment near _set_exceeds_max_active_runs.
airflow-core/tests/unit/jobs/test_scheduler_job.py Updates/reshapes scheduling tests to assert non-starved behavior and updated run counts.

Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread airflow-core/tests/unit/jobs/test_scheduler_job.py Outdated
Comment thread airflow-core/src/airflow/models/dagrun.py Outdated

@kaxil kaxil left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core idea — using row_number() to cap how many queued runs per DAG/backfill pass through the query — is a reasonable approach to solving the starvation problem in get_queued_dag_runs_to_set_running. However, this PR bundles in unrelated and incorrect changes to the DagRun creation path (_create_dag_runs), which is a separate concern from the QUEUED→RUNNING promotion path.

See inline comments for details.

Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread airflow-core/src/airflow/models/dagrun.py
Comment thread airflow-core/src/airflow/models/dagrun.py
Comment thread airflow-core/tests/unit/jobs/test_scheduler_job.py
@Nataneljpwd

Copy link
Copy Markdown
Contributor Author

This will have to wait until 3.2.0 -- This touches the core and I don't want to hurry until 3.2.0 is out.

We have 1200+ commits in 3.2.0

Sure, no problem

@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Mar 24, 2026
@Nataneljpwd Nataneljpwd force-pushed the bugfix/fix-dagrun-starvation branch from a9c77cd to c47e216 Compare March 24, 2026 19:35
@Nataneljpwd

Copy link
Copy Markdown
Contributor Author

@kaxil, @eladkal I have addressed the review comments and would appreciate another review

@Nataneljpwd

Copy link
Copy Markdown
Contributor Author

Hello, I would appreciate a review of this PR again, I have updated everything according to the commets given

@potiuk

potiuk commented May 24, 2026

Copy link
Copy Markdown
Member

@Nataneljpwd A few things need addressing before review — see our Pull Request quality criteria.

  • CI fails: Integration and System Tests / Integration core otel (and possibly other checks — see the Checks tab for the full list).

No rush.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@eladkal

eladkal commented May 26, 2026

Copy link
Copy Markdown
Contributor

@Nataneljpwd can you fix the failing tests?

@potiuk

potiuk commented May 27, 2026

Copy link
Copy Markdown
Member

@Nataneljpwd — There is 1 unresolved review thread on this PR from kaxil, and you have engaged with it (you replied in-thread and have pushed commits since). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, please mark the thread as resolved and ping the reviewer (kaxil) for a final look. They will either label the PR ready for maintainer review or follow up with additional feedback.

If you are still working on the thread, please reply with what is outstanding so the thread stays unresolved on purpose.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@Nataneljpwd

Copy link
Copy Markdown
Contributor Author

@Nataneljpwd — There is 1 unresolved review thread on this PR from kaxil, and you have engaged with it (you replied in-thread and have pushed commits since). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, please mark the thread as resolved and ping the reviewer (kaxil) for a final look. They will either label the PR ready for maintainer review or follow up with additional feedback.

If you are still working on the thread, please reply with what is outstanding so the thread stays unresolved on purpose.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

I have left the comment on purpose for Kaxil to decide if it's resolved or not

@Nataneljpwd Nataneljpwd requested review from eladkal and kaxil June 1, 2026 17:27
@eladkal

eladkal commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Will need a 2nd review from @kaxil and possibly @ashb

@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label Jun 3, 2026

assert running_dagrun_count == max_active_runs * len(dag_ids)

def test_no_more_dagruns_are_set_to_running_when_max_active_runs_exceeded(self, dag_maker, session):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced this test really adequately covers the case you claim it does.

With max_active_runs=1 and 6 runs already RUNNING, the query's new rn <= max_active_runs - num_running filter never gets a chance to matter — the candidates that reach _start_queued_dagruns are rejected by the existing Python guard at scheduler_job_runner.py:2345 (active_runs >= dag_run.max_active_runs), which this PR doesn't touch. So running_pre == running_post holds with or without your query change; i suspect this test would pass on main. To actually cover the fix, you need a case where the query's per-DAG cap is what limits candidates: e.g. one DAG at its max_active_runs with non-backfill runs already running plus a large queued backlog, alongside a second DAG with free capacity, then assert the second DAG isn't starved out of the 20-candidate budget

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Ash, I am not sure I understood, as I have a test for it on line 3810, where I have 3 dag id's, i create 50 of each instance and I set the max_active_runs to 3, when I query, I use batches of 20 (the default), which does cover the changes I have made in the PR

talking about
test_runs_are_not_starved_by_max_active_runs_limit

correct me if I am wrong, but I think that is what you were talking about, in 1 query in the past it would only get 3 form dag1, for the next iteration get nothing, for the next 3 from dag2 and so on

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also pretty sure that now it is safe to remove the check in the scheduler job runner, as it is done in the query instead, and there is no way for a race condition as the rows are locked, though I left it as per Kaxil's request

@ashb ashb removed the ready for maintainer review Set after triaging when all criteria pass. label Jun 4, 2026
@Nataneljpwd Nataneljpwd requested a review from ashb June 7, 2026 18:13
@potiuk

potiuk commented Jun 9, 2026

Copy link
Copy Markdown
Member

@Nataneljpwd — There are 2 unresolved review thread(s) on this PR, and you have engaged with each one (post-review commits and/or in-thread replies). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, reply here (a short "yes / ready" is fine) and an Apache Airflow maintainer will pick the PR up from the review queue on the next sweep.

If you are still working on a thread, please reply with what is outstanding so the threads stay unresolved on purpose.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@Nataneljpwd

Copy link
Copy Markdown
Contributor Author

@Nataneljpwd — There are 2 unresolved review thread(s) on this PR, and you have engaged with each one (post-review commits and/or in-thread replies). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, reply here (a short "yes / ready" is fine) and an Apache Airflow maintainer will pick the PR up from the review queue on the next sweep.

If you are still working on a thread, please reply with what is outstanding so the threads stay unresolved on purpose.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Yes, it is ready, I am waiting for a response from Ash

@Nataneljpwd Nataneljpwd changed the title Bugfix/fix dagrun starvation Fix dagrun starvation Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DAG Runs from a single DAG can prevent scheduler from seeing other DAG's runs

9 participants