Replace cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking #777
Conversation
sleeepyjack
left a comment
LGTM
Could you share some details offline on why this fix is needed?
I’ve expanded the PR description with more context on the issue. In short, the legacy `cudaMemcpyAsync` path takes a driver-side lock that was causing a performance regression; the batch API avoids that lock.
bdice
left a comment
The symbol cudaMemcpyBatchAsync exists for CUDA 12.8+ but this won't work with older CUDA 12 releases. We don't see this in cuCollections CI because it doesn't test minor version compatibility (cudf's CI builds with 12.9 and runs tests with 12.2).
@PointKernel and I discussed possible workarounds that would support CUDA 12.8+. All involve checking for and using the driver API cuMemcpyBatchAsync instead of the runtime API.
- We might be able to use `cudaGetDriverEntryPoint` (deprecated in CUDA 13.0) to get the driver symbol for `cuMemcpyBatchAsync` and use that (see the sketch after this list), but I think there are some sharp edges with this approach which led to the introduction of `cudaGetDriverEntryPointByVersion`.
- We can't use `cudaGetDriverEntryPointByVersion` because it requires CUDA 12.5+, which kind of defeats the purpose of supporting CUDA < 12.8.
- CCCL has a somewhat complex solution for dlopen'ing the driver and getting `cuGetProcAddress`, which we could then use to check for `cuMemcpyBatchAsync` support. This definitely works, but it's a lot of work to get right.
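As a rough illustration of the first option, here is a minimal probe that uses `cudaGetDriverEntryPoint` to check whether the installed driver exposes `cuMemcpyBatchAsync`. This is my own sketch, not code from this PR: it assumes the four-argument CUDA 12 form of `cudaGetDriverEntryPoint` (with the `driverStatus` out-parameter) and only checks for the symbol's presence rather than calling it.

```cpp
#include <cuda_runtime_api.h>
#include <cstdio>

// Ask the runtime to resolve the driver symbol without linking against it.
// A null function pointer or a non-success query result means the installed
// driver predates the batch memcpy API.
bool driver_has_memcpy_batch()
{
  void* fn = nullptr;
  cudaDriverEntryPointQueryResult status{};
  auto const err = cudaGetDriverEntryPoint("cuMemcpyBatchAsync", &fn, cudaEnableDefault, &status);
  return err == cudaSuccess && status == cudaDriverEntryPointSuccess && fn != nullptr;
}

int main()
{
  std::printf("cuMemcpyBatchAsync available: %s\n", driver_has_memcpy_batch() ? "yes" : "no");
  return 0;
}
```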
For now, we concluded it's easiest to just require CUDA 13, using the condition `#if CUDART_VERSION >= 13000`. For CUDA 12.8 and 12.9, we would just use `cudaMemcpyAsync` as in the current state (a minimal sketch of this gate follows below).
I plan to adopt the same solution for rapidsai/cudf#20800 until we can invest the additional engineering time to support CUDA 12.8-12.9.
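For illustration, a minimal sketch of that compile-time gate is shown here. It is my own sketch, not code from this PR or cuCollections: `CUCO_HAS_MEMCPY_BATCH` and `copy_buffers_async` are hypothetical names, and the CUDA 13 branch only marks where the `cudaMemcpyBatchAsync` call would go; its exact parameter list should be taken from the CUDA 13 runtime headers rather than from this sketch.

```cpp
#include <cuda_runtime_api.h>
#include <cstddef>

// Hypothetical feature macro: enable the batch path only when building
// against the CUDA 13 runtime headers.
#if defined(CUDART_VERSION) && (CUDART_VERSION >= 13000)
#define CUCO_HAS_MEMCPY_BATCH 1
#else
#define CUCO_HAS_MEMCPY_BATCH 0
#endif

// Copy `count` independent buffers onto `stream`.
inline cudaError_t copy_buffers_async(void* const* dsts, void const* const* srcs,
                                      std::size_t const* sizes, std::size_t count,
                                      cudaStream_t stream)
{
#if CUCO_HAS_MEMCPY_BATCH
  // CUDA 13+: a single cudaMemcpyBatchAsync submission would go here, avoiding
  // the per-copy locking of the legacy memcpy path (call elided in this sketch).
#endif
  // CUDA 12.8/12.9 behavior, and the placeholder path for this sketch:
  // one cudaMemcpyAsync per buffer, as in the pre-PR code.
  for (std::size_t i = 0; i < count; ++i) {
    auto const err = cudaMemcpyAsync(dsts[i], srcs[i], sizes[i], cudaMemcpyDefault, stream);
    if (err != cudaSuccess) { return err; }
  }
  return cudaSuccess;
}
```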
Thanks! Looks good (though I lack approval power).
This PR switches from `cudaMemcpyAsync` to `cudaMemcpyBatchAsync` to eliminate a performance regression caused by driver-side locking in the legacy memcpy path. Using the new batch-async API removes that locking overhead and restores the expected performance.