DAOS-18541 rebuild: accumulate more OIDs per migrate RPC to reduce RPC count #17704

Open
wangshilong wants to merge 3 commits into master from shilongw/DAOS-18541

@wangshilong wangshilong commented Mar 14, 2026

Fix yield-count accounting in the scanner. A send-side batching policy is also introduced: the send ULT defers flushing until at least REBUILD_SEND_BATCH_MIN OIDs are queued or REBUILD_SEND_BATCH_TIMEOUT_SEC seconds have elapsed.

Without batching, a fast scanner floods the destination rank with many small RPCs, exhausting IB receive buffers and triggering timeouts. This is especially severe during reintegration, where all OIDs are concentrated on a single target rank.
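
For reviewers, the flush condition described above can be sketched as follows. This is only an illustration of the policy, not the code in this PR: the struct, its field names, and `send_should_flush()` are hypothetical, and the two macro values are placeholders, since the actual definitions are not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values (assumption): the PR names these macros, but their
 * real definitions are not shown in this description. */
#define REBUILD_SEND_BATCH_MIN         64
#define REBUILD_SEND_BATCH_TIMEOUT_SEC 2

/* Hypothetical bookkeeping for the send ULT's pending OID queue. */
struct send_queue {
	uint32_t sq_nr_queued;      /* OIDs queued but not yet sent */
	uint64_t sq_first_queued;   /* time (seconds) the oldest pending OID was queued */
};

/*
 * Return true when the send ULT should flush its queue:
 *  - never flush an empty queue;
 *  - flush once at least REBUILD_SEND_BATCH_MIN OIDs are pending;
 *  - otherwise flush once the oldest pending OID has waited at least
 *    REBUILD_SEND_BATCH_TIMEOUT_SEC seconds, so a slow scanner cannot
 *    hold OIDs back indefinitely.
 */
static bool
send_should_flush(const struct send_queue *sq, uint64_t now_sec)
{
	assert(sq != NULL);

	if (sq->sq_nr_queued == 0)
		return false;

	if (sq->sq_nr_queued >= REBUILD_SEND_BATCH_MIN)
		return true;

	/* Guard against a clock reading earlier than the enqueue time. */
	if (now_sec < sq->sq_first_queued)
		return false;

	return now_sec - sq->sq_first_queued >= REBUILD_SEND_BATCH_TIMEOUT_SEC;
}
```

With these placeholder values, a queue holding a single OID is flushed only after two seconds, while a queue that reaches 64 OIDs is flushed immediately. The timeout bounds the latency of the tail of a scan, and the minimum bounds the RPC rate while the scanner is running ahead of the sender.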

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Fix yield-count accounting in the scanner: rebuild_object() is a pure in-memory
btree insert and does not need to contribute yield pressure. A send-side batching
policy is also introduced: the send ULT defers flushing until at least REBUILD_SEND_BATCH_MIN
OIDs are queued or REBUILD_SEND_BATCH_TIMEOUT_SEC seconds have elapsed, preventing a flood
of small migrate RPCs when the scanner runs faster than the sender — particularly
under reintegration workloads.

Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong requested review from a team as code owners March 14, 2026 15:27
@github-actions

Ticket title is 'Rebuild stuck on Bear cluster'
Status is 'In Progress'
Labels: 'test_2.8'
https://daosio.atlassian.net/browse/DAOS-18541

@wangshilong wangshilong changed the title DAOS-18541 rebuild: reduce redundant migration OID RPCs DAOS-18541 rebuild: batch migration OID send RPCs Mar 14, 2026
@wangshilong wangshilong changed the title DAOS-18541 rebuild: batch migration OID send RPCs DAOS-18541 rebuild: increase migration OID batch size to reduce RPC flood Mar 15, 2026
@wangshilong wangshilong changed the title DAOS-18541 rebuild: increase migration OID batch size to reduce RPC flood DAOS-18541 rebuild: accumulate more OIDs per migrate RPC to reduce RPC count Mar 15, 2026
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong requested a review from gnailzenh March 16, 2026 01:56
liuxuezhao previously approved these changes Mar 16, 2026
Signed-off-by: Wang Shilong <shilong.wang@hpe.com>
@wangshilong wangshilong force-pushed the shilongw/DAOS-18541 branch from 5ec1fbe to f4154c1 Compare March 16, 2026 06:23
