@PointKernel PointKernel commented Dec 5, 2025

This PR switches from cudaMemcpyAsync to cudaMemcpyBatchAsync to eliminate a performance regression caused by driver-side locking in the legacy memcpy path. Using the new batch-async API removes that locking overhead and restores the expected performance.
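A minimal sketch of the change being described (not code from this PR: the array names `dsts`, `srcs`, and `sizes` are hypothetical, and the `cudaMemcpyBatchAsync` signature and attribute fields follow my reading of the CUDA runtime docs, so they may differ between toolkit versions):

```cuda
// Before: one cudaMemcpyAsync per copy, each call taking the driver lock.
for (size_t i = 0; i < count; ++i) {
  cudaMemcpyAsync(dsts[i], srcs[i], sizes[i], cudaMemcpyDefault, stream);
}

// After: a single batched submission covering all copies on the stream.
cudaMemcpyAttributes attr{};
attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;  // ordinary stream-ordered copies
size_t attr_idx = 0;  // assumption: one attribute entry applies to the whole batch
size_t fail_idx = 0;  // receives the index of the first failing copy, if any
cudaMemcpyBatchAsync(dsts, srcs, sizes, count,
                     &attr, &attr_idx, /*numAttrs=*/1, &fail_idx, stream);
```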

@PointKernel PointKernel added helps: rapids Helps or needed by RAPIDS topic: performance Performance related issue labels Dec 5, 2025
@PointKernel PointKernel self-assigned this Dec 5, 2025
@PointKernel PointKernel added the Needs Review Awaiting reviews before merging label Dec 6, 2025

@sleeepyjack sleeepyjack left a comment


LGTM
Could you share some details offline on why this fix is needed?

@PointKernel PointKernel (Member, Author) left a comment

> LGTM
> Could you share some details offline on why this fix is needed?

I’ve expanded the PR description with more context on the issue. In short, cudaMemcpyAsync incurs a costly driver lock when used across multiple streams, which leads to a noticeable performance hit. Switching to the new batch async API removes this locking behavior and resolves the regression.

@PointKernel PointKernel changed the title from "Repalce cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking" to "Replace cudaMemcpyAsync with cudaMemcpyBatchAsync to avoid locking" Dec 15, 2025

@bdice bdice left a comment


The symbol cudaMemcpyBatchAsync exists for CUDA 12.8+ but this won't work with older CUDA 12 releases. We don't see this in cuCollections CI because it doesn't test minor version compatibility (cudf's CI builds with 12.9 and runs tests with 12.2).

@PointKernel and I discussed possible workarounds that would support CUDA 12.8+. All involve checking for and using the driver API cuMemcpyBatchAsync instead of the runtime API.

For now, we concluded it's easiest to just require CUDA 13, guarded by #if CUDART_VERSION >= 13000. For CUDA 12.8 and 12.9, we would just use cudaMemcpyAsync as in the current state.

I plan to adopt the same solution for rapidsai/cudf#20800 until we can invest the additional engineering time to support CUDA 12.8-12.9.
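The CUDA-13 gate described above can be sketched roughly as follows (a hypothetical helper, not the PR's actual code; the batch signature and `cudaMemcpyAttributes` fields follow my reading of the CUDA runtime docs and may differ across toolkit versions):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: use the batch API on CUDA 13+, otherwise fall
// back to per-copy cudaMemcpyAsync as in the pre-PR code path.
cudaError_t copy_all(void** dsts, void** srcs, size_t* sizes,
                     size_t count, cudaStream_t stream)
{
#if CUDART_VERSION >= 13000  // require CUDA 13 or newer for the batch path
  cudaMemcpyAttributes attr{};
  attr.srcAccessOrder = cudaMemcpySrcAccessOrderStream;  // stream-ordered source access
  size_t attr_idx = 0;  // assumption: a single attribute entry covers all copies
  size_t fail_idx = 0;  // receives the index of the first failing copy, if any
  return cudaMemcpyBatchAsync(dsts, srcs, sizes, count,
                              &attr, &attr_idx, /*numAttrs=*/1,
                              &fail_idx, stream);
#else
  // CUDA 12.x path: legacy per-copy API (the one with the driver lock).
  for (size_t i = 0; i < count; ++i) {
    auto err = cudaMemcpyAsync(dsts[i], srcs[i], sizes[i],
                               cudaMemcpyDefault, stream);
    if (err != cudaSuccess) { return err; }
  }
  return cudaSuccess;
#endif
}
```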

@PointKernel PointKernel requested a review from bdice January 22, 2026 19:24

bdice commented Jan 22, 2026

Thanks! Looks good (though I lack approval power).

@PointKernel PointKernel merged commit 5ff4084 into NVIDIA:dev Jan 22, 2026
29 checks passed
@PointKernel PointKernel deleted the fix-memcpy-locking branch January 22, 2026 22:04