@PointKernel PointKernel commented Dec 5, 2025

This PR switches from cudaMemcpyAsync to cudaMemcpyBatchAsync to eliminate a performance regression caused by driver-side locking in the legacy memcpy path. Using the new batch-async API removes that locking overhead and restores the expected performance.
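A minimal sketch of the change being described (not code from this PR: the array names `dsts`, `srcs`, and `sizes` are hypothetical, and the `cudaMemcpyBatchAsync` signature and attribute fields follow my reading of the CUDA runtime docs, so they may differ between toolkit versions):

```cuda
// Before: one cudaMemcpyAsync per copy, each call taking the driver lock.
for (size_t i = 0; i < count; ++i) {
  cudaMemcpyAsync(dsts[i], srcs[i], sizes[i], cudaMemcpyDefault, stream);
}

// After: a single batched submission covering all copies on the stream.
cudaMemcpyAttributes attr{};
attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;  // ordinary stream-ordered copies
size_t attr_idx = 0;  // assumption: one attribute entry applies to the whole batch
size_t fail_idx = 0;  // receives the index of the first failing copy, if any
cudaMemcpyBatchAsync(dsts, srcs, sizes, count,
                     &attr, &attr_idx, /*numAttrs=*/1, &fail_idx, stream);
```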

@PointKernel PointKernel added helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue labels Dec 5, 2025
@PointKernel PointKernel self-assigned this Dec 5, 2025
@PointKernel PointKernel added the Needs Review Awaiting reviews before merging label Dec 6, 2025

@sleeepyjack sleeepyjack left a comment


LGTM
Could you share some details offline on why this fix is needed?

@PointKernel PointKernel (Member, Author) left a comment

> LGTM
> Could you share some details offline on why this fix is needed?

I’ve expanded the PR description with more context on the issue. In short, cudaMemcpyAsync incurs a costly driver lock when used across multiple streams, which leads to a noticeable performance hit. Switching to the new batch async API removes this locking behavior and resolves the regression.

@PointKernel PointKernel changed the title from "Repalce cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking" to "Replace cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking" Dec 15, 2025

@bdice bdice left a comment


The symbol cudaMemcpyBatchAsync exists for CUDA 12.8+ but this won't work with older CUDA 12 releases. We don't see this in cuCollections CI because it doesn't test minor version compatibility (cudf's CI builds with 12.9 and runs tests with 12.2).

@PointKernel and I discussed possible workarounds that would support CUDA 12.8+. All involve checking for and using the driver API cuMemcpyBatchAsync instead of the runtime API.

For now, we concluded it's easiest to just require CUDA 13, guarded by #if CUDART_VERSION >= 13000. For CUDA 12.8 and 12.9, we would just use cudaMemcpyAsync as in the current state.

I plan to adopt the same solution for rapidsai/cudf#20800 until we can invest the additional engineering time to support CUDA 12.8-12.9.
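The CUDA-13 gate described above can be sketched roughly as follows (a hypothetical helper, not the PR's actual code; the batch signature and `cudaMemcpyAttributes` fields follow my reading of the CUDA runtime docs and may differ across toolkit versions):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: use the batch API on CUDA 13+, otherwise fall
// back to per-copy cudaMemcpyAsync as in the pre-PR code path.
cudaError_t copy_all(void** dsts, void** srcs, size_t* sizes,
                     size_t count, cudaStream_t stream)
{
#if CUDART_VERSION >= 13000  // require CUDA 13 or newer for the batch path
  cudaMemcpyAttributes attr{};
  attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;  // stream-ordered source access
  size_t attr_idx = 0;  // assumption: a single attribute entry covers all copies
  size_t fail_idx = 0;  // receives the index of the first failing copy, if any
  return cudaMemcpyBatchAsync(dsts, srcs, sizes, count,
                              &attr, &attr_idx, /*numAttrs=*/1,
                              &fail_idx, stream);
#else
  // CUDA 12.x path: legacy per-copy API (the one with the driver lock).
  for (size_t i = 0; i < count; ++i) {
    auto err = cudaMemcpyAsync(dsts[i], srcs[i], sizes[i],
                               cudaMemcpyDefault, stream);
    if (err != cudaSuccess) { return err; }
  }
  return cudaSuccess;
#endif
}
```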

@PointKernel PointKernel requested a review from bdice January 22, 2026 19:24

bdice commented Jan 22, 2026

Thanks! Looks good (though I lack approval power).

@PointKernel PointKernel merged commit 5ff4084 into NVIDIA:dev Jan 22, 2026
29 checks passed
@PointKernel PointKernel deleted the fix-memcpy-locking branch January 22, 2026 22:04