blk-cgroup: fix races and deadlocks #540
Open · blktests-ci wants to merge 7 commits into for-next_base from series/1050146=>for-next
+90 −161
Conversation
… blkcg_mutex

blkg_destroy_all() iterates q->blkg_list without holding blkcg_mutex,
which can race with blkg_free_workfn() that removes blkgs from the
list while holding blkcg_mutex. Add blkcg_mutex protection around the
q->blkg_list iteration to prevent potential list corruption or
use-after-free issues.

Signed-off-by: Yu Kuai <[email protected]>
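A minimal sketch of the resulting locking, simplified from the
blkg_destroy_all() shape in block/blk-cgroup.c (the real function's
restart batching and error handling are omitted):

    static void blkg_destroy_all(struct gendisk *disk)
    {
            struct request_queue *q = disk->queue;
            struct blkcg_gq *blkg, *n;

            /* serialize against blkg_free_workfn() deleting entries */
            mutex_lock(&q->blkcg_mutex);
            spin_lock_irq(&q->queue_lock);
            list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node) {
                    struct blkcg *blkcg = blkg->blkcg;

                    spin_lock(&blkcg->lock);
                    blkg_destroy(blkg);
                    spin_unlock(&blkcg->lock);
            }
            q->root_blkg = NULL;
            spin_unlock_irq(&q->queue_lock);
            mutex_unlock(&q->blkcg_mutex);
    }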
…mutex

bfq_end_wr_async() iterates q->blkg_list while holding only
bfqd->lock, not blkcg_mutex. This can race with blkg_free_workfn()
that removes blkgs from the list while holding blkcg_mutex. Take
blkcg_mutex in bfq_end_wr() before taking bfqd->lock to ensure proper
synchronization when iterating q->blkg_list.

Signed-off-by: Yu Kuai <[email protected]>
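A sketch of the ordering fix described above, assuming the bfq_end_wr()
shape in block/bfq-iosched.c (the weight-raising loops over the sync
bfq_queues are elided):

    void bfq_end_wr(struct bfq_data *bfqd)
    {
            struct request_queue *q = bfqd->queue;

            /* blkcg_mutex protects q->blkg_list from blkg_free_workfn() */
            mutex_lock(&q->blkcg_mutex);
            spin_lock_irq(&bfqd->lock);
            /* ... end weight-raising for active/idle bfq_queues ... */
            bfq_end_wr_async(bfqd);         /* iterates q->blkg_list */
            spin_unlock_irq(&bfqd->lock);
            mutex_unlock(&q->blkcg_mutex);
    }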
When switching an IO scheduler on a block device,
blkcg_activate_policy() allocates blkg_policy_data (pd) for all blkgs
attached to the queue. However, blkcg_activate_policy() may race with
concurrent blkcg deletion, leading to use-after-free and memory leak
issues.

The use-after-free occurs in the following race:

T1 (blkcg_activate_policy):
- Successfully allocates pd for blkg1 (loop0->queue, blkcgA)
- Fails to allocate pd for blkg2 (loop0->queue, blkcgB)
- Enters the enomem rollback path to release blkg1 resources

T2 (blkcg deletion):
- blkcgA is deleted concurrently
- blkg1 is freed via blkg_free_workfn()
- blkg1->pd is freed

T1 (continued):
- Rollback path accesses blkg1->pd->online after pd is freed
- Triggers use-after-free

In addition, blkg_free_workfn() frees pd before removing the blkg from
q->blkg_list. This allows blkcg_activate_policy() to allocate a new pd
for a blkg that is being destroyed, leaving the newly allocated pd
unreachable when the blkg is finally freed.

Fix these races by extending blkcg_mutex coverage to serialize
blkcg_activate_policy() rollback and blkg destruction, ensuring pd
lifecycle is synchronized with blkg list visibility.

Link: https://lore.kernel.org/all/[email protected]/
Fixes: f1c006f ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
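A sketch of the serialized rollback described above; the helper name
blkcg_policy_rollback() is hypothetical, but pd->online, pd_offline_fn
and pd_free_fn are the upstream blkcg policy hooks:

    static void blkcg_policy_rollback(struct request_queue *q,
                                      const struct blkcg_policy *pol)
    {
            struct blkcg_gq *blkg;

            /* under blkcg_mutex, blkg_free_workfn() cannot free pd */
            mutex_lock(&q->blkcg_mutex);
            spin_lock_irq(&q->queue_lock);
            list_for_each_entry(blkg, &q->blkg_list, q_node) {
                    struct blkg_policy_data *pd = blkg->pd[pol->plid];

                    if (!pd)
                            continue;
                    if (pd->online && pol->pd_offline_fn)
                            pol->pd_offline_fn(pd);
                    pd->online = false;
                    pol->pd_free_fn(pd);
                    blkg->pd[pol->plid] = NULL;
            }
            spin_unlock_irq(&q->queue_lock);
            mutex_unlock(&q->blkcg_mutex);
    }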
When switching IO schedulers on a block device, blkcg_activate_policy()
can race with concurrent blkcg deletion, leading to a use-after-free in
rcu_accelerate_cbs:

T1:                             T2:
                                blkg_destroy
                                  kill(&blkg->refcnt)   // blkg->refcnt=1->0
                                  blkg_release          // call_rcu(__blkg_release)
                                ...
                                blkg_free_workfn
                                  ->pd_free_fn(pd)
elv_iosched_store
  elevator_switch
    ...
    iterate blkg list
      blkg_get(blkg)            // blkg->refcnt=0->1
                                  list_del_init(&blkg->q_node)
      blkg_put(pinned_blkg)     // blkg->refcnt=1->0
        blkg_release            // call_rcu again
          rcu_accelerate_cbs    // uaf
Fix this by checking hlist_unhashed(&blkg->blkcg_node) before getting
a reference to the blkg. This is the same check used in blkg_destroy()
to detect if a blkg has already been destroyed. If the blkg is already
unhashed, skip processing it since it's being destroyed.
Link: https://lore.kernel.org/all/[email protected]/
Fixes: f1c006f ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Signed-off-by: Zheng Qixing <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
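A minimal sketch of the check described above, assuming the blkg
iteration in blkcg_activate_policy(); the surrounding pd allocation
logic is elided:

    list_for_each_entry(blkg, &q->blkg_list, q_node) {
            /*
             * A blkg whose blkcg_node is already unhashed has gone
             * through blkg_destroy(); taking a reference here would
             * resurrect a zero refcount and queue __blkg_release()
             * a second time.
             */
            if (hlist_unhashed(&blkg->blkcg_node))
                    continue;

            blkg_get(blkg);
            /* ... allocate and attach pd for this blkg ... */
            blkg_put(blkg);
    }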
Move the teardown sequence which offlines and frees per-policy
blkg_policy_data (pd) into a helper for readability. No functional
change intended.

Signed-off-by: Zheng Qixing <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Yu Kuai <[email protected]>
Signed-off-by: Yu Kuai <[email protected]>
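The helper's name below is hypothetical (the patch may use a different
one); the body follows the known per-policy teardown pattern from
blk-cgroup.c:

    static void blkg_offline_and_free_pds(struct blkcg_gq *blkg)
    {
            int i;

            for (i = 0; i < BLKCG_MAX_POLS; i++) {
                    struct blkcg_policy *pol = blkcg_policy[i];
                    struct blkg_policy_data *pd = blkg->pd[i];

                    if (!pd)
                            continue;
                    if (pd->online && pol->pd_offline_fn)
                            pol->pd_offline_fn(pd); /* offline first */
                    pd->online = false;
                    pol->pd_free_fn(pd);
                    blkg->pd[i] = NULL;
            }
    }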
…cy()

Some policies like iocost and iolatency perform percpu allocation in
pd_alloc_fn(). Percpu allocation with the queue frozen can cause
deadlock because percpu memory reclaim may issue IO.

Now that q->blkg_list is protected by blkcg_mutex, restructure
blkcg_activate_policy() to allocate all pds before freezing the queue:

1. Allocate all pds with GFP_KERNEL before freezing the queue
2. Freeze the queue
3. Initialize and online all pds

Note: future work is to remove all queue freezing before
blkcg_activate_policy() to fix the deadlocks thoroughly.

Signed-off-by: Yu Kuai <[email protected]>
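A sketch of the restructured flow following the three steps above,
assuming the pre-memflags blk_mq_freeze_queue()/blk_mq_unfreeze_queue()
signatures; the enomem rollback is elided:

    int blkcg_activate_policy(struct gendisk *disk,
                              const struct blkcg_policy *pol)
    {
            struct request_queue *q = disk->queue;
            struct blkcg_gq *blkg;

            mutex_lock(&q->blkcg_mutex);

            /* 1. allocate every pd with GFP_KERNEL, queue not frozen,
             *    so percpu reclaim is still free to issue IO
             */
            list_for_each_entry(blkg, &q->blkg_list, q_node) {
                    blkg->pd[pol->plid] = pol->pd_alloc_fn(disk,
                                            blkg->blkcg, GFP_KERNEL);
                    if (!blkg->pd[pol->plid])
                            goto enomem;    /* rollback elided */
            }

            /* 2. freeze only after all allocations succeeded */
            blk_mq_freeze_queue(q);

            /* 3. init and online the pds with no IO in flight */
            list_for_each_entry(blkg, &q->blkg_list, q_node) {
                    if (pol->pd_init_fn)
                            pol->pd_init_fn(blkg->pd[pol->plid]);
                    if (pol->pd_online_fn)
                            pol->pd_online_fn(blkg->pd[pol->plid]);
                    blkg->pd[pol->plid]->online = true;
            }

            blk_mq_unfreeze_queue(q);
            mutex_unlock(&q->blkcg_mutex);
            return 0;
    enomem:
            mutex_unlock(&q->blkcg_mutex);
            return -ENOMEM;
    }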
The current rq_qos_mutex handling has an awkward pattern where callers
must acquire the mutex before calling rq_qos_add()/rq_qos_del(), and
blkg_conf_open_bdev_frozen() had to release and re-acquire the mutex
around queue freezing to maintain proper locking order (freeze queue
before mutex). On the other hand, with rq_qos_mutex held after
blkg_conf_prep(), there are many possible deadlocks:

- allocating memory with GFP_KERNEL, like blk_throtl_init();
- allocating percpu memory, like pd_alloc_fn() for iocost/iolatency.

This patch refactors the locking by:

1. Moving queue freeze and rq_qos_mutex acquisition inside
   rq_qos_add()/rq_qos_del(), with the correct order: freeze first,
   then acquire mutex.
2. Removing external mutex handling from wbt_init() since rq_qos_add()
   now handles it internally.
3. Removing rq_qos_mutex handling from blkg_conf_open_bdev() entirely,
   making it only responsible for parsing MAJ:MIN and opening the bdev.
4. Removing the blkg_conf_open_bdev_frozen() and
   blkg_conf_exit_frozen() functions, which are no longer needed.
5. Updating ioc_qos_write() to use the simpler blkg_conf_open_bdev()
   and blkg_conf_exit() functions.

This eliminates the release-and-reacquire pattern and makes
rq_qos_add()/rq_qos_del() self-contained, which is cleaner and reduces
complexity. Each function now properly manages its own locking with
the correct order: queue freeze → mutex acquire → modify → mutex
release → queue unfreeze.

Signed-off-by: Yu Kuai <[email protected]>
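A sketch of the self-contained ordering in rq_qos_add(), simplified
from block/blk-rq-qos.c (duplicate-id checks, debugfs registration and
queue-freeze memflags plumbing are elided):

    int rq_qos_add(struct rq_qos *rqos, struct gendisk *disk,
                   enum rq_qos_id id, const struct rq_qos_ops *ops)
    {
            struct request_queue *q = disk->queue;

            blk_mq_freeze_queue(q);         /* 1. freeze first ...   */
            mutex_lock(&q->rq_qos_mutex);   /* 2. ... then the mutex */

            rqos->disk = disk;
            rqos->id = id;
            rqos->ops = ops;
            rqos->next = q->rq_qos;
            q->rq_qos = rqos;               /* push onto the qos list */

            mutex_unlock(&q->rq_qos_mutex);
            blk_mq_unfreeze_queue(q);
            return 0;
    }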
Upstream branch: 130b8e3
Pull request for series with
subject: blk-cgroup: fix races and deadlocks
version: 2
url: https://patchwork.kernel.org/project/linux-block/list/?series=1050146