blktests-ci bot commented Feb 3, 2026

Pull request for series with
subject: blk-cgroup: fix races and deadlocks
version: 2
url: https://patchwork.kernel.org/project/linux-block/list/?series=1050146

hailan94 and others added 7 commits February 4, 2026 04:09
… blkcg_mutex

blkg_destroy_all() iterates q->blkg_list without holding blkcg_mutex, so
it can race with blkg_free_workfn(), which removes blkgs from the list
while holding blkcg_mutex.

Add blkcg_mutex protection around the q->blkg_list iteration to prevent
potential list corruption or use-after-free issues.

Signed-off-by: Yu Kuai <[email protected]>
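
As a rough sketch of the fix described above (not the literal diff; the
real blkg_destroy_all() also batches the walk and skips already-destroyed
blkgs), the intent is that the whole q->blkg_list iteration runs with
q->blkcg_mutex held:

	/* Simplified sketch of blkg_destroy_all() with the added locking. */
	static void blkg_destroy_all(struct gendisk *disk)
	{
		struct request_queue *q = disk->queue;
		struct blkcg_gq *blkg;

		/* Serialize with blkg_free_workfn(), which unlinks blkgs from q->blkg_list. */
		mutex_lock(&q->blkcg_mutex);
		spin_lock_irq(&q->queue_lock);
		list_for_each_entry(blkg, &q->blkg_list, q_node) {
			struct blkcg *blkcg = blkg->blkcg;

			spin_lock(&blkcg->lock);
			blkg_destroy(blkg);
			spin_unlock(&blkcg->lock);
		}
		q->root_blkg = NULL;
		spin_unlock_irq(&q->queue_lock);
		mutex_unlock(&q->blkcg_mutex);
	}
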
…mutex

bfq_end_wr_async() iterates q->blkg_list while holding only bfqd->lock,
not blkcg_mutex. This can race with blkg_free_workfn(), which removes
blkgs from the list while holding blkcg_mutex.

Add blkcg_mutex protection in bfq_end_wr() before taking bfqd->lock to
ensure proper synchronization when iterating q->blkg_list.

Signed-off-by: Yu Kuai <[email protected]>
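
A minimal sketch of the ordering described above (sync-queue handling and
per-actuator lists elided); a mutex cannot be taken inside the bfqd->lock
spinlock, so blkcg_mutex has to come first:

	static void bfq_end_wr(struct bfq_data *bfqd)
	{
		/* The q->blkg_list walk in bfq_end_wr_async() needs blkcg_mutex. */
		mutex_lock(&bfqd->queue->blkcg_mutex);
		spin_lock_irq(&bfqd->lock);

		/* ... end weight-raising for the sync bfq_queues ... */
		bfq_end_wr_async(bfqd);		/* iterates q->blkg_list */

		spin_unlock_irq(&bfqd->lock);
		mutex_unlock(&bfqd->queue->blkcg_mutex);
	}
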
When switching an IO scheduler on a block device, blkcg_activate_policy()
allocates blkg_policy_data (pd) for all blkgs attached to the queue.
However, blkcg_activate_policy() may race with concurrent blkcg deletion,
leading to use-after-free and memory leak issues.

The use-after-free occurs in the following race:

T1 (blkcg_activate_policy):
  - Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
  - Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
  - Enters the enomem rollback path to release blkg1 resources

T2 (blkcg deletion):
  - blkcgA is deleted concurrently
  - blkg1 is freed via blkg_free_workfn()
  - blkg1->pd is freed

T1 (continued):
  - Rollback path accesses blkg1->pd->online after pd is freed
  - Triggers use-after-free

In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.

Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.

Link: https://lore.kernel.org/all/[email protected]/
Fixes: f1c006f ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
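
A sketch of the enomem rollback path with the extended coverage (names
follow blk-cgroup.c; the allocation loop, blkcg->lock handling and other
details are omitted):

	enomem:
		/*
		 * Roll back under q->blkcg_mutex so blkg_free_workfn() can neither
		 * free a pd we are still inspecting (the use-after-free above) nor
		 * unlink a blkg whose freshly allocated pd would then be leaked.
		 */
		mutex_lock(&q->blkcg_mutex);
		spin_lock_irq(&q->queue_lock);
		list_for_each_entry(blkg, &q->blkg_list, q_node) {
			struct blkg_policy_data *pd = blkg->pd[pol->plid];

			if (!pd)
				continue;
			if (pd->online && pol->pd_offline_fn)
				pol->pd_offline_fn(pd);
			pol->pd_free_fn(pd);
			blkg->pd[pol->plid] = NULL;
		}
		spin_unlock_irq(&q->queue_lock);
		mutex_unlock(&q->blkcg_mutex);
		ret = -ENOMEM;
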
When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free in
rcu_accelerate_cbs.

T1:                               T2:
                                  blkg_destroy
                                  kill(&blkg->refcnt) // blkg->refcnt=1->0
                                  blkg_release // call_rcu(__blkg_release)
                                  ...
                                  blkg_free_workfn
                                  ->pd_free_fn(pd)
elv_iosched_store
elevator_switch
...
iterate blkg list
blkg_get(blkg) // blkg->refcnt=0->1
                                  list_del_init(&blkg->q_node)
blkg_put(pinned_blkg) // blkg->refcnt=1->0
blkg_release // call_rcu again
rcu_accelerate_cbs // uaf

Fix this by checking hlist_unhashed(&blkg->blkcg_node) before getting
a reference to the blkg. This is the same check used in blkg_destroy()
to detect if a blkg has already been destroyed. If the blkg is already
unhashed, skip processing it since it's being destroyed.

Link: https://lore.kernel.org/all/[email protected]/
Fixes: f1c006f ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
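
A sketch of where the check sits in the blkcg_activate_policy() list walk
(surrounding allocation logic trimmed):

	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
		/*
		 * Same test blkg_destroy() uses: an unhashed blkg is already being
		 * destroyed and may have dropped its last reference, so blkg_get()
		 * here would lead to a second blkg_release() and the double
		 * call_rcu() shown above.
		 */
		if (hlist_unhashed(&blkg->blkcg_node))
			continue;

		/* ... allocate or reuse blkg->pd[pol->plid] as before ... */
	}
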
Move the teardown sequence that offlines and frees per-policy
blkg_policy_data (pd) into a helper for readability.

No functional change intended.

Signed-off-by: Zheng Qixing <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
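
The commit message does not quote the helper, so the following is only a
guess at its shape; the name blkg_pd_offline_and_free() is hypothetical:

	/* Hypothetical helper: offline and free one policy's pd for a blkg. */
	static void blkg_pd_offline_and_free(struct blkcg_gq *blkg,
					     struct blkcg_policy *pol)
	{
		struct blkg_policy_data *pd = blkg->pd[pol->plid];

		if (!pd)
			return;
		if (pd->online && pol->pd_offline_fn)
			pol->pd_offline_fn(pd);
		pd->online = false;
		pol->pd_free_fn(pd);
		blkg->pd[pol->plid] = NULL;
	}
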
…cy()

Some policies, such as iocost and iolatency, perform percpu allocation in
pd_alloc_fn(). Doing that allocation while the queue is frozen can
deadlock, because the allocation may enter memory reclaim and reclaim may
issue IO that then waits on the frozen queue.

Now that q->blkg_list is protected by blkcg_mutex, restructure
blkcg_activate_policy() to allocate all pds before freezing the queue:
1. Allocate all pds with GFP_KERNEL before freezing the queue
2. Freeze the queue
3. Initialize and online all pds

Note: future work is to remove the queue freezing done before
blkcg_activate_policy() altogether, to fix these deadlocks thoroughly.

Signed-off-by: Yu Kuai <[email protected]>
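
A rough sketch of the reordered flow (error rollback omitted; the exact
pd_alloc_fn() and blk_mq_freeze_queue()/blk_mq_unfreeze_queue() signatures
depend on the kernel version):

	mutex_lock(&q->blkcg_mutex);

	/* 1. Allocate every pd with GFP_KERNEL while the queue can still do IO,
	 *    since the allocation may enter reclaim and reclaim may issue IO. */
	list_for_each_entry(blkg, &q->blkg_list, q_node) {
		struct blkg_policy_data *pd;

		pd = pol->pd_alloc_fn(disk, blkg->blkcg, GFP_KERNEL);
		if (!pd)
			goto enomem;
		blkg->pd[pol->plid] = pd;
	}

	/* 2. Freeze the queue only after all allocations have succeeded. */
	memflags = blk_mq_freeze_queue(q);

	/* 3. Initialize and online the pds (parents first, hence the reverse walk). */
	list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
		struct blkg_policy_data *pd = blkg->pd[pol->plid];

		if (pol->pd_init_fn)
			pol->pd_init_fn(pd);
		if (pol->pd_online_fn)
			pol->pd_online_fn(pd);
		pd->online = true;
	}

	blk_mq_unfreeze_queue(q, memflags);
	mutex_unlock(&q->blkcg_mutex);
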
The current rq_qos_mutex handling has an awkward pattern where callers
must acquire the mutex before calling rq_qos_add()/rq_qos_del(), and
blkg_conf_open_bdev_frozen() has to release and re-acquire the mutex
around queue freezing to maintain the proper locking order (freeze the
queue before taking the mutex).

On top of that, with rq_qos_mutex still held after blkg_conf_prep(),
several deadlocks are possible:

- allocating memory with GFP_KERNEL, as in blk_throtl_init();
- allocating percpu memory, as in pd_alloc_fn() for iocost/iolatency.

This patch refactors the locking by:

1. Moving queue freeze and rq_qos_mutex acquisition inside
   rq_qos_add()/rq_qos_del(), with the correct order: freeze first,
   then acquire mutex.

2. Removing external mutex handling from wbt_init() since rq_qos_add()
   now handles it internally.

3. Removing rq_qos_mutex handling from blkg_conf_open_bdev() entirely,
   making it only responsible for parsing MAJ:MIN and opening the bdev.

4. Removing blkg_conf_open_bdev_frozen() and blkg_conf_exit_frozen()
   functions which are no longer needed.

5. Updating ioc_qos_write() to use the simpler blkg_conf_open_bdev()
   and blkg_conf_exit() functions.

This eliminates the release-and-reacquire pattern and makes
rq_qos_add()/rq_qos_del() self-contained, which is cleaner and reduces
complexity. Each function now properly manages its own locking with
the correct order: queue freeze → mutex acquire → modify → mutex
release → queue unfreeze.

Signed-off-by: Yu Kuai <[email protected]>
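
A sketch of the self-contained rq_qos_add() described above (simplified
from block/blk-rq-qos.c; debugfs registration and some checks omitted, and
the freeze-queue API is shown as in recent kernels):

	int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk,
		       enum rq_qos_id id, const struct rq_qos_ops *ops)
	{
		struct request_queue *q = disk->queue;
		unsigned int memflags;
		int ret = 0;

		memflags = blk_mq_freeze_queue(q);	/* 1. freeze the queue */
		mutex_lock(&q->rq_qos_mutex);		/* 2. then take the mutex */

		if (rq_qos_id(q, id)) {			/* policy already attached */
			ret = -EBUSY;
			goto out;
		}

		rqos->disk = disk;
		rqos->id = id;
		rqos->ops = ops;
		rqos->next = q->rq_qos;
		q->rq_qos = rqos;
	out:
		mutex_unlock(&q->rq_qos_mutex);
		blk_mq_unfreeze_queue(q, memflags);
		return ret;
	}
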
blktests-ci bot commented Feb 3, 2026

Upstream branch: 130b8e3
series: https://patchwork.kernel.org/project/linux-block/list/?series=1050146
version: 2
