Latency Testing by AtlantaPepsi · Pull Request #252 · ROCm/TransferBench

AtlantaPepsi · 2026-04-06T09:04:43Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Adds a ping-pong (“latency testing”) mode to TransferBench, including parsing a ping/pong pair syntax and reporting per-lap latency in the client output.

Changes:

Extend Transfer / config options to support ping-pong laps and a configurable flag stride.
Implement ping-pong execution behavior across CPU/GFX/DMA/NIC paths, including flag allocation/offsetting and timing adjustments.
Update client result tables and environment variables to display latency-oriented ping-pong results and configure stride.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File	Description
`src/header/TransferBench.hpp`	Adds ping-pong transfer fields, parsing, resource linking, flag wait/slot logic, execution changes across executors, and iteration timing adjustments.
`src/client/Utilities.hpp`	Updates result rendering to hide pong rows and show ping-pong latency rows/iteration details.
`src/client/EnvVars.hpp`	Adds `PINGPONG_STRIDE` env var and wires it into `cfg.general.pingpongStride`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-08T19:10:27Z

src/header/TransferBench.hpp

    int32_t           exeSubSlot  = 0;          ///< Executor subslot
    int               numSubExecs = 0;          ///< Number of subExecutors to use for this Transfer
+    int               laps        = 0;          ///< 0 = one-way transfer, >0 = ping (forward), <0 = pong (backward). abs(laps) = lap count
+    MemDevice         flag        = {};         ///< Memory device for pingpong signal flag (only used when laps != 0)


Transfer::flag is default-initialized with {}, which sets memType to MEM_CPU (0) rather than MEM_NULL. For pingpong transfers created via other code paths (not the parser), this makes it impossible to reliably detect “flag not set” and can cause the pingpong validation to incorrectly pass. Initialize flag.memType to MEM_NULL by default (or add an explicit boolean/optional to indicate presence).

Suggested change

MemDevice flag = {}; ///< Memory device for pingpong signal flag (only used when laps != 0)

MemDevice flag = {MEM_NULL, 0, 0}; ///< Memory device for pingpong signal flag (only used when laps != 0)

Copilot · 2026-04-08T19:10:27Z

src/header/TransferBench.hpp


+      // Pingpong flag checks
+      if (t.laps != 0) {
+        if (t.flag.memType == MEM_NULL) {


Pingpong validation only checks t.flag.memType == MEM_NULL and memRank, but it never validates cfg.general.pingpongStride (positive, multiple-of-8) even though later code assumes 8-byte alignment for atomic loads and uses the stride in modulo/address arithmetic. Add explicit validation for pingpongStride (and ideally validate it against lap count / allocation size) so invalid values don’t lead to misalignment, slot reuse, or out-of-bounds writes.

Suggested change

if (t.flag.memType == MEM_NULL) {

if (cfg.general.pingpongStride <= 0) {

errors.push_back({ERR_FATAL,

"Transfer %d: Pingpong transfer requires pingpongStride to be positive (instead of %d)",

i, cfg.general.pingpongStride});

hasFatalError = true;

} else if ((cfg.general.pingpongStride % 8) != 0) {

errors.push_back({ERR_FATAL,

"Transfer %d: Pingpong transfer requires pingpongStride to be a multiple of 8 bytes (instead of %d)",

i, cfg.general.pingpongStride});

hasFatalError = true;

} else if (t.flag.memType == MEM_NULL) {

Copilot · 2026-04-08T19:10:28Z

src/header/TransferBench.hpp

+        if (t.laps != 0)
+          dstAllocBytes = std::max(dstAllocBytes, std::min((size_t)1024, (size_t)8 * abs(t.laps)));


For pingpong transfers, dstAllocBytes is derived from 8 * abs(t.laps) but does not incorporate cfg.general.pingpongStride. If pingpongStride is larger than 8, multiple laps will alias the same flag slot (and stale non-zero values will cause waits to pass immediately), which breaks ordering/latency measurement. Size the allocation (and the wrap-around size used by FlagSlot) based on pingpongStride * abs(laps) (or store/compare a per-lap sequence value instead of waiting on “non-zero”).

Suggested change

if (t.laps != 0)

dstAllocBytes = std::max(dstAllocBytes, std::min((size_t)1024, (size_t)8 * abs(t.laps)));

if (t.laps != 0) {

size_t const pingpongLaps = static_cast<size_t>(t.laps < 0 ? -static_cast<int64_t>(t.laps)

: static_cast<int64_t>(t.laps));

size_t const pingpongBytes = static_cast<size_t>(cfg.general.pingpongStride) * pingpongLaps;

dstAllocBytes = std::max(dstAllocBytes, pingpongBytes);

}

Copilot · 2026-04-08T19:10:29Z

src/header/TransferBench.hpp

+    // Cross-link flagMem for pingpong pairs (ping and pong are consecutive in transfers)
+    for (size_t i = 0; i + 1 < transfers.size(); i++) {
+      if (transfers[i].laps > 0 && transfers[i+1].laps < 0) {
+        auto pingRss = transferResources[i];
+        auto pongRss = transferResources[i + 1];
+        pingRss->flagMem        = pongRss->dstMem[0];
+        pongRss->flagMem        = pingRss->dstMem[0];
+        int stride              = cfg.general.pingpongStride;


This block assumes transferResources[i] corresponds to transfers[i], but transferResources is built by iterating executorMap and pushing each executor’s resources, so its order is by executor (map key), not by transferIdx. As a result, ping/pong pairing may link the wrong resources (wrong dstMem[0]), causing incorrect synchronization or crashes. Build an indexable structure keyed by transferIdx (e.g., std::vector<TransferResources*> byTransferIdx(transfers.size())) and use that for pingpong cross-linking.

Copilot · 2026-04-08T19:10:29Z

src/header/TransferBench.hpp

+      // Zero flag memory for pingpong transfers so each lap's slot starts at 0
+      for (size_t i = 0; i + 1 < transfers.size(); i++) {
+        if (transfers[i].laps > 0 && transfers[i+1].laps < 0) {
+          size_t allocBytes = std::min((size_t)1024, (size_t)8 * abs(transfers[i].laps));
+          for (int k = 0; k < 2; k++) {
+            void* dst = transferResources[i + k]->dstMem[0];
+            if (!dst) continue;


The per-iteration flag-zeroing loop also indexes transferResources[i + k], inheriting the same ordering bug as the pingpong cross-linking logic (resources are not guaranteed to be in transfer-index order). Use a mapping by transferIdx here as well to ensure the correct ping/pong buffers are cleared.

Copilot · 2026-04-08T19:10:29Z

src/header/TransferBench.hpp

+            MemType mt = transfers[i + k].dsts[0].memType;
+            if (mt == MEM_CPU || mt == MEM_NULL)
+              memset(dst, 0, allocBytes);
+            else
+              hipMemset(dst, 0, allocBytes);
+          }


For GPU flag buffers, this uses hipMemset(dst, ...) without selecting the correct device for dst and without checking the return code (unlike other HIP calls in this file). This can fail or write to the wrong device depending on the current device, and failures will be silently ignored. Set the device based on the flag buffer’s memIndex and propagate/append the HIP error (and ensure the memset completes before launching transfers if required).

Copilot · 2026-04-08T19:10:29Z

src/header/TransferBench.hpp

  {
-    // Replace any round brackets or '->' with spaces,
+    // Replace round brackets, '->', and ':' with spaces, but preserve '+'
    for (int i = 1; line[i]; i++)


for (int i = 1; line[i]; i++) reads line[1] unconditionally, which is out-of-bounds for empty strings or 1-character lines (e.g., blank lines in a config file). Guard on line.size() > 1 and iterate using line.size() rather than relying on a null terminator.

Suggested change

for (int i = 1; line[i]; i++)

for (std::string::size_type i = 1; i < line.size(); i++)

Copilot · 2026-04-08T19:10:30Z

src/client/Utilities.hpp

+          Transfer const& pong = transfers[idx + 1];
+          if (isMultiRank) {
+            table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s",
+                      MemDevicesToStr(t.srcs).c_str(),
+                      exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
+                      MemDevicesToStr(t.dsts).c_str(),
+                      MemDevicesToStr(pong.srcs).c_str(),
+                      pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,
+                      MemDevicesToStr(pong.dsts).c_str());
+          } else {
+            table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s",
+                      MemDevicesToStr(t.srcs).c_str(),
+                      ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,
+                      MemDevicesToStr(t.dsts).c_str(),
+                      MemDevicesToStr(pong.srcs).c_str(),
+                      ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,
+                      MemDevicesToStr(pong.dsts).c_str());


Transfer const& pong = transfers[idx + 1]; assumes the pong half always exists immediately after the ping index. If a ping transfer is last in the list, or transfer ordering differs (e.g., future parser changes / manual API usage), this will read out of bounds. Add a bounds/consistency check (e.g., idx + 1 < transfers.size() and transfers[idx+1].laps < 0) before dereferencing.

Suggested change

Transfer const& pong = transfers[idx + 1];

if (isMultiRank) {

table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s",

MemDevicesToStr(t.srcs).c_str(),

exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,

MemDevicesToStr(t.dsts).c_str(),

MemDevicesToStr(pong.srcs).c_str(),

pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,

MemDevicesToStr(pong.dsts).c_str());

} else {

table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s",

MemDevicesToStr(t.srcs).c_str(),

ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,

MemDevicesToStr(t.dsts).c_str(),

MemDevicesToStr(pong.srcs).c_str(),

ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,

MemDevicesToStr(pong.dsts).c_str());

bool const hasPong = (static_cast<size_t>(idx + 1) < transfers.size()) && (transfers[idx + 1].laps < 0);

if (hasPong) {

Transfer const& pong = transfers[idx + 1];

if (isMultiRank) {

table.Set(rowIdx, 4, " %s->R%d%c%d->%s <+> %s->R%d%c%d->%s",

MemDevicesToStr(t.srcs).c_str(),

exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,

MemDevicesToStr(t.dsts).c_str(),

MemDevicesToStr(pong.srcs).c_str(),

pong.exeDevice.exeRank, ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,

MemDevicesToStr(pong.dsts).c_str());

} else {

table.Set(rowIdx, 4, " %s->%c%d->%s <+> %s->%c%d->%s",

MemDevicesToStr(t.srcs).c_str(),

ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,

MemDevicesToStr(t.dsts).c_str(),

MemDevicesToStr(pong.srcs).c_str(),

ExeTypeStr[pong.exeDevice.exeType], pong.exeDevice.exeIndex,

MemDevicesToStr(pong.dsts).c_str());

}

} else {

if (isMultiRank) {

table.Set(rowIdx, 4, " %s->R%d%c%d->%s",

MemDevicesToStr(t.srcs).c_str(),

exeDevice.exeRank, ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,

MemDevicesToStr(t.dsts).c_str());

} else {

table.Set(rowIdx, 4, " %s->%c%d->%s",

MemDevicesToStr(t.srcs).c_str(),

ExeTypeStr[t.exeDevice.exeType], t.exeDevice.exeIndex,

MemDevicesToStr(t.dsts).c_str());

}

Copilot · 2026-04-08T19:10:30Z

src/client/Utilities.hpp

+        if (transfers[idx].laps >= 0) displayCount++;
+      numRows += 1 + displayCount;
      if (ev.showIterations) {
        numRows += (numTimedIterations + 1);


When showIterations is enabled, numRows is only incremented by (numTimedIterations + 1) once per executor, but the rendering logic appends per-iteration rows for each displayed transfer. Because TableHelper::Set silently ignores out-of-bounds writes, this can truncate iteration output whenever an executor has multiple transfers. Update numRows to account for (numTimedIterations + 1) per displayed transfer (regular + pingpong).

Suggested change

numRows += (numTimedIterations + 1);

numRows += displayCount * (numTimedIterations + 1);

AtlantaPepsi added 3 commits April 6, 2026 08:57

correct till this point

7626ca1

rollback untested: Output

e7ed98d

rollback untested: Stride

68c164c

AtlantaPepsi requested a review from Copilot April 8, 2026 19:00

Copilot started reviewing on behalf of AtlantaPepsi April 8, 2026 19:02 View session

Copilot AI reviewed Apr 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Latency Testing#252

Latency Testing#252
AtlantaPepsi wants to merge 3 commits intoROCm:candidatefrom
AtlantaPepsi:pingpong

AtlantaPepsi commented Apr 6, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Copilot AI Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	MemDevice flag = {}; ///< Memory device for pingpong signal flag (only used when laps != 0)
	MemDevice flag = {MEM_NULL, 0, 0}; ///< Memory device for pingpong signal flag (only used when laps != 0)

-        if (t.flag.memType == MEM_NULL) {
+        if (cfg.general.pingpongStride <= 0) {
+          errors.push_back({ERR_FATAL,
+            "Transfer %d: Pingpong transfer requires pingpongStride to be positive (instead of %d)",
+            i, cfg.general.pingpongStride});
+          hasFatalError = true;
+        } else if ((cfg.general.pingpongStride % 8) != 0) {
+          errors.push_back({ERR_FATAL,
+            "Transfer %d: Pingpong transfer requires pingpongStride to be a multiple of 8 bytes (instead of %d)",
+            i, cfg.general.pingpongStride});
+          hasFatalError = true;
+        } else if (t.flag.memType == MEM_NULL) {

		if (t.laps != 0)
		dstAllocBytes = std::max(dstAllocBytes, std::min((size_t)1024, (size_t)8 * abs(t.laps)));

-        if (t.laps != 0)
-          dstAllocBytes = std::max(dstAllocBytes, std::min((size_t)1024, (size_t)8 * abs(t.laps)));
+        if (t.laps != 0) {
+          size_t const pingpongLaps = static_cast<size_t>(t.laps < 0 ? -static_cast<int64_t>(t.laps)
+                                                                     :  static_cast<int64_t>(t.laps));
+          size_t const pingpongBytes = static_cast<size_t>(cfg.general.pingpongStride) * pingpongLaps;
+          dstAllocBytes = std::max(dstAllocBytes, pingpongBytes);
+        }

	for (int i = 1; line[i]; i++)
	for (std::string::size_type i = 1; i < line.size(); i++)

	numRows += (numTimedIterations + 1);
	numRows += displayCount * (numTimedIterations + 1);

Conversation

AtlantaPepsi commented Apr 6, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants