Conversation
There was a problem hiding this comment.
Pull request overview
Adds a ping-pong (“latency testing”) mode to TransferBench, including parsing a ping/pong pair syntax and reporting per-lap latency in the client output.
Changes:
- Extend
Transfer/ config options to support ping-pong laps and a configurable flag stride. - Implement ping-pong execution behavior across CPU/GFX/DMA/NIC paths, including flag allocation/offsetting and timing adjustments.
- Update client result tables and environment variables to display latency-oriented ping-pong results and configure stride.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
src/header/TransferBench.hpp |
Adds ping-pong transfer fields, parsing, resource linking, flag wait/slot logic, execution changes across executors, and iteration timing adjustments. |
src/client/Utilities.hpp |
Updates result rendering to hide pong rows and show ping-pong latency rows/iteration details. |
src/client/EnvVars.hpp |
Adds PINGPONG_STRIDE env var and wires it into cfg.general.pingpongStride. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int32_t exeSubSlot = 0; ///< Executor subslot | ||
| int numSubExecs = 0; ///< Number of subExecutors to use for this Transfer | ||
| int laps = 0; ///< 0 = one-way transfer, >0 = ping (forward), <0 = pong (backward). abs(laps) = lap count | ||
| MemDevice flag = {}; ///< Memory device for pingpong signal flag (only used when laps != 0) |
There was a problem hiding this comment.
Transfer::flag is default-initialized with {}, which sets memType to MEM_CPU (0) rather than MEM_NULL. For pingpong transfers created via other code paths (not the parser), this makes it impossible to reliably detect “flag not set” and can cause the pingpong validation to incorrectly pass. Initialize flag.memType to MEM_NULL by default (or add an explicit boolean/optional to indicate presence).
| MemDevice flag = {}; ///< Memory device for pingpong signal flag (only used when laps != 0) | |
| MemDevice flag = {MEM_NULL, 0, 0}; ///< Memory device for pingpong signal flag (only used when laps != 0) |
|
|
||
| // Pingpong flag checks | ||
| if (t.laps != 0) { | ||
| if (t.flag.memType == MEM_NULL) { |
There was a problem hiding this comment.
Pingpong validation only checks t.flag.memType == MEM_NULL and memRank, but it never validates cfg.general.pingpongStride (positive, multiple-of-8) even though later code assumes 8-byte alignment for atomic loads and uses the stride in modulo/address arithmetic. Add explicit validation for pingpongStride (and ideally validate it against lap count / allocation size) so invalid values don’t lead to misalignment, slot reuse, or out-of-bounds writes.
| if (t.flag.memType == MEM_NULL) { | |
| if (cfg.general.pingpongStride <= 0) { | |
| errors.push_back({ERR_FATAL, | |
| "Transfer %d: Pingpong transfer requires pingpongStride to be positive (instead of %d)", | |
| i, cfg.general.pingpongStride}); | |
| hasFatalError = true; | |
| } else if ((cfg.general.pingpongStride % 8) != 0) { | |
| errors.push_back({ERR_FATAL, | |
| "Transfer %d: Pingpong transfer requires pingpongStride to be a multiple of 8 bytes (instead of %d)", | |
| i, cfg.general.pingpongStride}); | |
| hasFatalError = true; | |
| } else if (t.flag.memType == MEM_NULL) { |
| if (t.laps != 0) | ||
| dstAllocBytes = std::max(dstAllocBytes, std::min((size_t)1024, (size_t)8 * abs(t.laps))); |
There was a problem hiding this comment.
For pingpong transfers, dstAllocBytes is derived from 8 * abs(t.laps) but does not incorporate cfg.general.pingpongStride. If pingpongStride is larger than 8, multiple laps will alias the same flag slot (and stale non-zero values will cause waits to pass immediately), which breaks ordering/latency measurement. Size the allocation (and the wrap-around size used by FlagSlot) based on pingpongStride * abs(laps) (or store/compare a per-lap sequence value instead of waiting on “non-zero”).
| if (t.laps != 0) | |
| dstAllocBytes = std::max(dstAllocBytes, std::min((size_t)1024, (size_t)8 * abs(t.laps))); | |
| if (t.laps != 0) { | |
| size_t const pingpongLaps = static_cast<size_t>(t.laps < 0 ? -static_cast<int64_t>(t.laps) | |
| : static_cast<int64_t>(t.laps)); | |
| size_t const pingpongBytes = static_cast<size_t>(cfg.general.pingpongStride) * pingpongLaps; | |
| dstAllocBytes = std::max(dstAllocBytes, pingpongBytes); | |
| } |
| // Cross-link flagMem for pingpong pairs (ping and pong are consecutive in transfers) | ||
| for (size_t i = 0; i + 1 < transfers.size(); i++) { | ||
| if (transfers[i].laps > 0 && transfers[i+1].laps < 0) { | ||
| auto pingRss = transferResources[i]; | ||
| auto pongRss = transferResources[i + 1]; | ||
| pingRss->flagMem = pongRss->dstMem[0]; | ||
| pongRss->flagMem = pingRss->dstMem[0]; | ||
| int stride = cfg.general.pingpongStride; |
There was a problem hiding this comment.
This block assumes transferResources[i] corresponds to transfers[i], but transferResources is built by iterating executorMap and pushing each executor’s resources, so its order is by executor (map key), not by transferIdx. As a result, ping/pong pairing may link the wrong resources (wrong dstMem[0]), causing incorrect synchronization or crashes. Build an indexable structure keyed by transferIdx (e.g., std::vector<TransferResources*> byTransferIdx(transfers.size())) and use that for pingpong cross-linking.
| // Zero flag memory for pingpong transfers so each lap's slot starts at 0 | ||
| for (size_t i = 0; i + 1 < transfers.size(); i++) { | ||
| if (transfers[i].laps > 0 && transfers[i+1].laps < 0) { | ||
| size_t allocBytes = std::min((size_t)1024, (size_t)8 * abs(transfers[i].laps)); | ||
| for (int k = 0; k < 2; k++) { | ||
| void* dst = transferResources[i + k]->dstMem[0]; | ||
| if (!dst) continue; |
There was a problem hiding this comment.
The per-iteration flag-zeroing loop also indexes transferResources[i + k], inheriting the same ordering bug as the pingpong cross-linking logic (resources are not guaranteed to be in transfer-index order). Use a mapping by transferIdx here as well to ensure the correct ping/pong buffers are cleared.
| MemType mt = transfers[i + k].dsts[0].memType; | ||
| if (mt == MEM_CPU || mt == MEM_NULL) | ||
| memset(dst, 0, allocBytes); | ||
| else | ||
| hipMemset(dst, 0, allocBytes); | ||
| } |
There was a problem hiding this comment.
For GPU flag buffers, this uses hipMemset(dst, ...) without selecting the correct device for dst and without checking the return code (unlike other HIP calls in this file). This can fail or write to the wrong device depending on the current device, and failures will be silently ignored. Set the device based on the flag buffer’s memIndex and propagate/append the HIP error (and ensure the memset completes before launching transfers if required).
| { | ||
| // Replace any round brackets or '->' with spaces, | ||
| // Replace round brackets, '->', and ':' with spaces, but preserve '+' | ||
| for (int i = 1; line[i]; i++) |
There was a problem hiding this comment.
for (int i = 1; line[i]; i++) reads line[1] unconditionally, which is out-of-bounds for empty strings or 1-character lines (e.g., blank lines in a config file). Guard on line.size() > 1 and iterate using line.size() rather than relying on a null terminator.
| for (int i = 1; line[i]; i++) | |
| for (std::string::size_type i = 1; i < line.size(); i++) |
| Transfer const& pong = transfers[idx + 1]; | ||
| if (isMultiRank) { | ||
| table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s", | ||
| MemDevicesToStr(t.srcs).c_str(), | ||
| exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex, | ||
| MemDevicesToStr(t.dsts).c_str(), | ||
| MemDevicesToStr(pong.srcs).c_str(), | ||
| pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex, | ||
| MemDevicesToStr(pong.dsts).c_str()); | ||
| } else { | ||
| table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s", | ||
| MemDevicesToStr(t.srcs).c_str(), | ||
| ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex, | ||
| MemDevicesToStr(t.dsts).c_str(), | ||
| MemDevicesToStr(pong.srcs).c_str(), | ||
| ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex, | ||
| MemDevicesToStr(pong.dsts).c_str()); |
There was a problem hiding this comment.
Transfer const& pong = transfers[idx + 1]; assumes the pong half always exists immediately after the ping index. If a ping transfer is last in the list, or transfer ordering differs (e.g., future parser changes / manual API usage), this will read out of bounds. Add a bounds/consistency check (e.g., idx + 1 < transfers.size() and transfers[idx+1].laps < 0) before dereferencing.
| Transfer const& pong = transfers[idx + 1]; | |
| if (isMultiRank) { | |
| table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s", | |
| MemDevicesToStr(t.srcs).c_str(), | |
| exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex, | |
| MemDevicesToStr(t.dsts).c_str(), | |
| MemDevicesToStr(pong.srcs).c_str(), | |
| pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex, | |
| MemDevicesToStr(pong.dsts).c_str()); | |
| } else { | |
| table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s", | |
| MemDevicesToStr(t.srcs).c_str(), | |
| ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex, | |
| MemDevicesToStr(t.dsts).c_str(), | |
| MemDevicesToStr(pong.srcs).c_str(), | |
| ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex, | |
| MemDevicesToStr(pong.dsts).c_str()); | |
| bool const hasPong = (static_cast<size_t>(idx + 1) < transfers.size()) && (transfers[idx + 1].laps < 0); | |
| if (hasPong) { | |
| Transfer const& pong = transfers[idx + 1]; | |
| if (isMultiRank) { | |
| table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s", | |
| MemDevicesToStr(t.srcs).c_str(), | |
| exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex, | |
| MemDevicesToStr(t.dsts).c_str(), | |
| MemDevicesToStr(pong.srcs).c_str(), | |
| pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex, | |
| MemDevicesToStr(pong.dsts).c_str()); | |
| } else { | |
| table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s", | |
| MemDevicesToStr(t.srcs).c_str(), | |
| ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex, | |
| MemDevicesToStr(t.dsts).c_str(), | |
| MemDevicesToStr(pong.srcs).c_str(), | |
| ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex, | |
| MemDevicesToStr(pong.dsts).c_str()); | |
| } | |
| } else { | |
| if (isMultiRank) { | |
| table.Set(rowIdx, 4, " %s->R%d%c%d->%s", | |
| MemDevicesToStr(t.srcs).c_str(), | |
| exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex, | |
| MemDevicesToStr(t.dsts).c_str()); | |
| } else { | |
| table.Set(rowIdx, 4, " %s->%c%d->%s", | |
| MemDevicesToStr(t.srcs).c_str(), | |
| ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex, | |
| MemDevicesToStr(t.dsts).c_str()); | |
| } |
| if (transfers[idx].laps >= 0) displayCount++; | ||
| numRows += 1 + displayCount; | ||
| if (ev.showIterations) { | ||
| numRows += (numTimedIterations + 1); |
There was a problem hiding this comment.
When showIterations is enabled, numRows is only incremented by (numTimedIterations + 1) once per executor, but the rendering logic appends per-iteration rows for each displayed transfer. Because TableHelper::Set silently ignores out-of-bounds writes, this can truncate iteration output whenever an executor has multiple transfers. Update numRows to account for (numTimedIterations + 1) per displayed transfer (regular + pingpong).
| numRows += (numTimedIterations + 1); | |
| numRows += displayCount * (numTimedIterations + 1); |
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist