feat: support Broadcast example with OpenMPI backend implementation #5
GordonYang1 wants to merge 5 commits into
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96de1f83bc
```cpp
#include <cuda_fp16.h>
#include <cuda_bf16.h>
```
Remove unconditional CUDA headers from OMPI AllReduce
Including cuda_fp16.h/cuda_bf16.h directly in the OMPI AllReduce implementation makes every OMPI build depend on CUDA headers, even when WITH_NVIDIA=OFF. Since OMPI headers are still pulled into the generated backend manifest for non-NVIDIA configurations, CPU-only/OpenMPI builds fail at compile time when those headers are unavailable. Please gate these includes and the CUDA-specific conversion branch behind a NVIDIA/device-specific path.
```cpp
CHECK_STATUS(Rt, Rt::Memcpy(host_buf, send_buff, total_bytes,
                            Rt::MemcpyDeviceToHost));
CHECK_STATUS(Rt, Rt::StreamSynchronize(static_cast<Rt::Stream>(stream)));
```
Free staging buffer on early-return error paths
host_buf is allocated before the CHECK_STATUS calls, but those macros return immediately on runtime copy/sync failures, skipping std::free(host_buf). If broadcast is retried after transient runtime errors, each failure leaks host memory. Consider using RAII (e.g., smart pointer/scope guard) or explicit cleanup before each error return.
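One way to apply the RAII suggestion is to own the staging buffer with a `std::unique_ptr` and a `std::free` deleter, so any early `return` (such as a `CHECK_STATUS`-style macro expanding to `return status;`) releases it automatically. The function name, parameters, and return codes below are illustrative, not the repo's actual API:

```cpp
#include <cstdlib>
#include <memory>

// Sketch: host_buf is freed on every exit path, including early error
// returns that would otherwise skip an explicit std::free(host_buf).
// simulate_copy_failure stands in for a failing Rt::Memcpy/Sync call.
int StageAndBroadcast(std::size_t total_bytes, bool simulate_copy_failure) {
    std::unique_ptr<void, decltype(&std::free)> host_buf(
        std::malloc(total_bytes), &std::free);
    if (!host_buf) return -1;  // allocation failure

    // Stand-in for the CHECK_STATUS early-return error path:
    if (simulate_copy_failure) return 1;  // host_buf still freed here

    // ... MPI_Bcast on host_buf.get(), copy back to device ...
    return 0;
}
```

A scope guard would work equally well; the point is that cleanup is tied to scope exit rather than repeated before each `return`.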
Summary
This PR adds a new collective communication operator, `infiniBroadcast`, along with its OpenMPI backend implementation, and a complete boundary-case test suite under `examples/broadcast.cc`.

In addition, this PR includes two small fixes for the existing OMPI backend that were discovered while testing on a real multi-GPU machine. They are logically independent from Broadcast and can be cherry-picked separately if needed.
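The key implementation detail in the Changes section is that `MPI_Bcast` takes an `int` element count, so transfers larger than `INT_MAX` bytes are split into chunks and broadcast with repeated calls. A minimal sketch of that loop, with a callback standing in for `MPI_Bcast` (the function and parameter names here are illustrative, not the repo's actual helpers):

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>

// Illustrative sketch of INT_MAX-byte chunking: bcast_chunk stands in
// for MPI_Bcast(buf + offset, chunk, MPI_BYTE, root, comm).
template <typename BcastFn>
int ChunkedBcast(std::size_t total_bytes, BcastFn bcast_chunk) {
    std::size_t offset = 0;
    int calls = 0;
    while (offset < total_bytes) {
        // Largest piece MPI_Bcast's `int count` can express.
        int chunk = static_cast<int>(std::min<std::size_t>(
            total_bytes - offset, static_cast<std::size_t>(INT_MAX)));
        bcast_chunk(offset, chunk);  // one MPI_Bcast per chunk
        offset += static_cast<std::size_t>(chunk);
        ++calls;
    }
    return calls;
}
```

For example, a buffer of `2 * INT_MAX + 5` bytes would be covered by three broadcast calls: two full `INT_MAX`-byte chunks plus a 5-byte remainder.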
Public API
Changes
- `Broadcast` base class in `src/base/broadcast.h` with parameter validation (root range, null pointer, `count = 0` short-circuit);
- OpenMPI backend implementation in `src/ompi/impl/broadcast.h`, using a host-staging path (consistent with the existing AllReduce implementation);
- transfers are chunked at `INT_MAX` bytes with repeated `MPI_Bcast` calls, so that arbitrarily large `count` values are supported;
- both in-place (`sendbuff == recvbuff`) and out-of-place calling conventions are supported; non-root ranks may pass `nullptr` as `sendbuff` (this convention is exercised by the test suite).
- Test suite `examples/broadcast.cc` covering 7 boundary cases:
  - `count = 0`: no-op short-circuit;
  - `root = size - 1`;
  - `sendbuff = nullptr`;
  - `root = 0`;
  - `root = size - 1`;
  - large transfer (`> INT_MAX` bytes), gated by `INFINI_BROADCAST_LARGE=1`;
  - invalid roots (`-1` and `size`) → `infiniInvalidArgument`.
- fix(ompi): handle fp16 and bf16 scaling in all_reduce. `__half` and `__nv_bfloat16` do not support C++ `operator*=`, so the `Avg` path previously failed to compile on NVIDIA. The fix converts the value to `float` for scaling and then back to the original type.
- fix(ompi): call MPI_Finalize during finalize. `FinalizeImpl` previously only called `MPI_Finalized` to query the state and never actually called `MPI_Finalize()`, which caused OpenMPI/PRRTE to print "exited improperly" warnings on process exit.

Known Issues & Future Work
- The `INT_MAX`-byte chunking pattern currently only lives in Broadcast. AllReduce could benefit from the same treatment in the future.
- `Broadcast::Execute`'s argument-validation style is kept consistent with `AllReduce::Execute`: invalid input returns `kInvalidArgument` and emits a `LOG` line. The "invalid root" test case intentionally triggers these log lines, which is expected output, not a failure.

Logs & Screenshots
Test environment:
- Rank-to-device mapping via `OMPI_COMM_WORLD_LOCAL_RANK` → `cudaSetDevice`.
- `nvidia-smi` shows the `broadcast` process on all 4 GPUs simultaneously, each occupying about 2 GB of device memory.
- The `> INT_MAX`-byte chunking path is triggered by `INFINI_BROADCAST_LARGE=1`.