Feature/qgpu optimize nonbonded ww using openmm method #68
goodstudyqaq wants to merge 2 commits into feature/qgpu from
Conversation
Pull request overview
This PR refactors the CUDA water-water nonbonded force calculation kernel to use a warp-synchronous algorithm inspired by OpenMM. The new implementation replaces the previous block-based approach with a warp-level computation that leverages shuffle operations for better performance.
Key Changes:
- Replaced template-based block computation with warp-synchronous tile-based algorithm
- Introduced helper functions for index mapping (idx2xy) and warp shuffle operations for double/coord_t types
- Simplified grid/block configuration from 2D to 1D with tile-based work distribution
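The rotation scheme behind the warp-synchronous tiles can be checked on the host. Below is a minimal C++ sketch (not the PR's kernel; `coveredPairs`, `LANES`, and the tile origins are invented for illustration) simulating how, for one off-diagonal 32x32 tile, each lane keeps a fixed row atom while the column atom rotates one lane per step; after 32 steps every row-column pair has been visited exactly once, which is what the warp shuffle achieves on the GPU.

```cpp
#include <set>
#include <utility>

// Host-side sketch of the warp-synchronous tile rotation (illustrative names;
// the real kernel moves `col` between lanes with __shfl_sync instead of a loop).
// Each of 32 "lanes" owns one row atom; the column atoms rotate one lane per
// step, so after 32 steps every (row, col) pair in the tile is covered once.
static int coveredPairs(int baseX, int baseY) {
    const int LANES = 32;
    int col[LANES];
    for (int lane = 0; lane < LANES; ++lane) col[lane] = baseY + lane;

    std::set<std::pair<int, int>> pairs;
    for (int step = 0; step < LANES; ++step) {
        for (int lane = 0; lane < LANES; ++lane)
            pairs.insert({baseX + lane, col[lane]});  // "interact" row with current col
        // rotate: each lane takes the col its neighbour (lane + 1, wrapping) held
        int first = col[0];
        for (int lane = 0; lane + 1 < LANES; ++lane) col[lane] = col[lane + 1];
        col[LANES - 1] = first;
    }
    return static_cast<int>(pairs.size());
}
```

Because the set ends up with exactly 32 * 32 = 1024 distinct pairs, no pairwise interaction is skipped or double-counted within the tile.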
Multiple blank lines should be removed. This appears to be accidental whitespace that reduces code readability.
// printf("WW E_vdw: %f, E_coul: %f\n", WW_evdw_TOT, WW_ecoul_TOT);
Commented-out printf statement should be removed before merging. Debug code should not remain in production code.
coord_t invalid = {-1e9, -1e9, -1e9};
coord_t row = (rowIdx < N ? W[rowIdx] : invalid);

// "col" state that will be rotated around the warp
int colIdx = colIdx0;
coord_t col = (colIdx < N ? W[colIdx] : invalid);
Variable name 'invalid' is vague and could be more descriptive. Consider renaming to 'invalidCoord' or 'outOfBoundsMarker' to clarify its purpose as a sentinel value for out-of-range atoms.
Suggested change:
-coord_t invalid = {-1e9, -1e9, -1e9};
-coord_t row = (rowIdx < N ? W[rowIdx] : invalid);
-// "col" state that will be rotated around the warp
-int colIdx = colIdx0;
-coord_t col = (colIdx < N ? W[colIdx] : invalid);
+coord_t invalidCoord = {-1e9, -1e9, -1e9};
+coord_t row = (rowIdx < N ? W[rowIdx] : invalidCoord);
+// "col" state that will be rotated around the warp
+int colIdx = colIdx0;
+coord_t col = (colIdx < N ? W[colIdx] : invalidCoord);
__device__ __forceinline__ void idx2xy(int n, int t, int& x, int& y) {
    x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);
    y = t - (x * n - (x * (x - 1) >> 1));
    if (y < 0) {
        x--;
        y += (n - x);
    }
    y += x;
}
The idx2xy function lacks documentation explaining its algorithm and parameters. Add a comment describing that this maps a linear tile index to 2D coordinates in the upper triangle of a matrix, and explain the meaning of parameters n (number of blocks per dimension) and t (tile index).
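To sketch what such documentation could cover, here is a host-side C++ rendering of the same mapping (`idx2xy_host` is a name invented for this sketch; double precision stands in for the device's floorf/sqrtf). Row x of the upper triangle holds n - x tiles starting at offset x*n - x*(x-1)/2; the closed form inverts that triangular sum, and the y < 0 branch repairs the off-by-one that floating-point rounding can introduce.

```cpp
#include <cmath>

// Host-side sketch of idx2xy (hypothetical helper; the device code uses floorf/sqrtf).
// Maps linear tile index t to (x, y), y >= x, in the upper triangle of an n x n grid.
// Row x contains n - x tiles and starts at offset x*n - x*(x-1)/2; solving
// offset(x) <= t for x gives the closed form below. If floating-point rounding
// overshoots x by one row, y goes negative and the branch steps x back.
static void idx2xy_host(int n, int t, int& x, int& y) {
    x = (int)std::floor((2.0 * n + 1.0
                         - std::sqrt((2.0 * n + 1.0) * (2.0 * n + 1.0) - 8.0 * t)) * 0.5);
    y = t - (x * n - (x * (x - 1) >> 1));  // t minus the start offset of row x
    if (y < 0) {                           // rounding pushed x one row too far
        x--;
        y += (n - x);
    }
    y += x;                                // shift into the upper triangle (y >= x)
}
```

Enumerating all tiles of a small grid and comparing against this function is a quick way to convince oneself the inversion is exact.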
__device__ __forceinline__ double shfl(double v, int srcLane, unsigned mask = 0xffffffffu) {
    int2 a = *reinterpret_cast<int2*>(&v);
    a.x = __shfl_sync(mask, a.x, srcLane);
    a.y = __shfl_sync(mask, a.y, srcLane);
    return *reinterpret_cast<double*>(&a);
}
// Initialize sum arrays
#pragma unroll
for (int i = cur_thread_num; i < y_cal_num && block_y_left_begin + i < NY; i += thread_num) {
    sum_row_x[i] = sum_row_x[y_cal_num + i] = 0;
    sum_row_y[i] = sum_row_y[y_cal_num + i] = 0;
    sum_row_z[i] = sum_row_z[y_cal_num + i] = 0;
}
__syncthreads();
// Initialize column sums
double sum_col_x[2 * Block_x], sum_col_y[2 * Block_x], sum_col_z[2 * Block_x];
#pragma unroll
for (int i = 0; i < Block_x; i++) {
    sum_col_x[i] = sum_col_x[Block_x + i] = 0;
    sum_col_y[i] = sum_col_y[Block_x + i] = 0;
    sum_col_z[i] = sum_col_z[Block_x + i] = 0;
}
__device__ __forceinline__ coord_t shfl_coord(coord_t v, int srcLane, unsigned mask = 0xffffffffu) {
    v.x = shfl(v.x, srcLane, mask);
    v.y = shfl(v.y, srcLane, mask);
    v.z = shfl(v.z, srcLane, mask);
    return v;
}
The shfl and shfl_coord helper functions lack documentation. Add comments explaining that these functions enable warp shuffle operations for double and coord_t types respectively, which are essential for the warp-synchronous algorithm.
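As background for such a comment: the helper exists because the integer `__shfl_sync` moves 32-bit words, so the 64-bit double is split into two halves that are shuffled independently and reassembled. A host-side sketch of just the reinterpretation step (`split_and_rejoin` is invented for illustration; memcpy is the strictly portable equivalent of the kernel's reinterpret_cast to int2):

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the bit reinterpretation behind shfl(double, ...): view the 64-bit
// double as two 32-bit words (the kernel shuffles each word with __shfl_sync),
// then reassemble. memcpy avoids the strict-aliasing concerns of reinterpret_cast.
static double split_and_rejoin(double v) {
    int32_t lo, hi;
    std::memcpy(&lo, &v, 4);                               // low 32 bits
    std::memcpy(&hi, reinterpret_cast<char*>(&v) + 4, 4);  // high 32 bits
    // ...each half would cross lanes here via __shfl_sync...
    double out;
    std::memcpy(&out, &lo, 4);
    std::memcpy(reinterpret_cast<char*>(&out) + 4, &hi, 4);
    return out;
}
```

The round trip is bit-exact for any double, which is why the two-word shuffle loses no precision.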
@@ -1,5 +1,6 @@
#include "cuda/include/CudaContext.cuh"
#include "cuda/include/CudaNonbondedWWForce.cuh"
#include <iostream>
The iostream header is included but not used in this file. This adds unnecessary compilation overhead and should be removed.
    evdw_sum += __shfl_down_sync(0xffffffffu, evdw_sum, offset);
    ecoul_sum += __shfl_down_sync(0xffffffffu, ecoul_sum, offset);
}
// printf("evdw_sum: %f, ecoul_sum: %f\n", evdw_sum, ecoul_sum);
Commented-out printf statement should be removed before merging. Debug code should not remain in production code.
p[i] = W[block_x_left_begin + i];
p[x_cal_num + i] = W[N - (block_x_left_begin + i)];
__device__ __forceinline__ void idx2xy(int n, int t, int& x, int& y) {
    x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);
The idx2xy function uses float precision (floorf, sqrtf) but operates on integer inputs. For large values of n and t, this could lead to precision loss and incorrect index calculations. Consider using double precision (floor, sqrt) or an integer-only algorithm for robustness.
Suggested change:
-x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);
+// Use double precision for the quadratic inversion to avoid precision loss for large n, t.
+double dn = static_cast<double>(n);
+double dt = static_cast<double>(t);
+double tmp = 2.0 * dn + 1.0;
+double disc = tmp * tmp - 8.0 * dt;
+double xd = floor((tmp - sqrt(disc)) * 0.5);
+x = static_cast<int>(xd);
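A host-side check of that suggestion (a sketch; `invert_row`, the choice n = 5000, and the sampled rows are arbitrary): with double floor/sqrt, the inversion recovers the correct row even at block counts where the float expression (2n+1)^2 - 8t exceeds float's 24-bit integer range (roughly 2n+1 > 2^12).

```cpp
#include <cmath>

// Host-side sketch validating the double-precision inversion for large grids.
// Row x of the upper triangle starts at tile offset x*n - x*(x-1)/2; invert_row
// recovers x from a tile index t, with the same off-by-one repair as the kernel.
static long long rowStart(long long n, long long x) {
    return x * n - (x * (x - 1) >> 1);
}

static int invert_row(int n, long long t) {
    double tmp = 2.0 * n + 1.0;
    double xd = std::floor((tmp - std::sqrt(tmp * tmp - 8.0 * (double)t)) * 0.5);
    int x = (int)xd;
    if (t - rowStart(n, x) < 0) x--;  // rounding overshot by one row
    return x;
}
```

Checking the first and last tile of several rows at n = 5000 (twelve and a half million tiles) exercises exactly the regime where the float version could go wrong.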
// Map tile -> (x,y) with y>=x (upper triangle)
int x, y;
idx2xy((N + 31) >> 5, tile, x, y); // nBlocks = ceil(N/32)

// Optimized reduction using warp-level primitives
#pragma unroll
for (unsigned w = 16; w >= 1; w /= 2) {
    ecoul_tot += __shfl_down_sync(0xffffffff, ecoul_tot, w);
    evdw_tot += __shfl_down_sync(0xffffffff, evdw_tot, w);
}
const int baseX = x << 5; // x*32
const int baseY = y << 5; // y*32
Add a comment explaining the grid decomposition strategy. The calculation (N + 31) >> 5 computes the number of 32-atom blocks (ceiling division by 32), and idx2xy maps the linear tile index to upper-triangle coordinates. This is a key part of the algorithm and deserves documentation for maintainability.
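For reference, the arithmetic that comment describes can be stated as two tiny helpers (a sketch; `numBlocks` and `numTiles` are names invented here): (N + 31) >> 5 is ceiling division of N by 32, and the upper triangle of the resulting block grid, diagonal included, holds nBlocks * (nBlocks + 1) / 2 tiles, one per warp.

```cpp
// Sketch of the grid decomposition arithmetic (hypothetical helper names).
// (N + 31) >> 5 == ceil(N / 32): the number of 32-atom blocks covering N atoms.
// The upper triangle (diagonal included) of the block grid is the set of
// unordered block pairs, so there are nb * (nb + 1) / 2 tiles to schedule.
static int numBlocks(int N) { return (N + 31) >> 5; }

static int numTiles(int N) {
    int nb = numBlocks(N);
    return nb * (nb + 1) / 2;
}
```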
const int N_ITERATION_Y = 32; // Keep at 32 for better memory access pattern
const int N_ITERATION_X = 1;  // Keep at 1 for better memory coalescing
const int thread_num = 128;
The magic number 128 for thread_num should be defined as a named constant (e.g., THREAD_BLOCK_SIZE) to improve code maintainability and make it clear that this value is related to warpsPerBlock calculation in the kernel (line 74).
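One way that suggestion could look (a sketch, not the PR's code; the constant names are invented): deriving warps-per-block from a single named block size makes the 128-threads-equals-4-warps relationship explicit and checkable at compile time.

```cpp
// Hypothetical named constants replacing the magic number 128 (sketch only).
constexpr int WARP_SIZE = 32;
constexpr int THREAD_BLOCK_SIZE = 128;  // was: const int thread_num = 128;
constexpr int WARPS_PER_BLOCK = THREAD_BLOCK_SIZE / WARP_SIZE;

// A block must be a whole number of warps for a warp-synchronous kernel.
static_assert(THREAD_BLOCK_SIZE % WARP_SIZE == 0,
              "thread block size must be a multiple of the warp size");
```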
No description provided.