feat: support Broadcast example with OpenMPI backend implementation #5
GordonYang1 wants to merge 5 commits into
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96de1f83bc
```cpp
#include <cuda_fp16.h>
#include <cuda_bf16.h>
```
Remove unconditional CUDA headers from OMPI AllReduce
Including cuda_fp16.h/cuda_bf16.h directly in the OMPI AllReduce implementation makes every OMPI build depend on CUDA headers, even when WITH_NVIDIA=OFF. Since OMPI headers are still pulled into the generated backend manifest for non-NVIDIA configurations, CPU-only/OpenMPI builds fail at compile time when those headers are unavailable. Please gate these includes and the CUDA-specific conversion branch behind a NVIDIA/device-specific path.
```cpp
CHECK_STATUS(Rt, Rt::Memcpy(host_buf, send_buff, total_bytes,
                            Rt::MemcpyDeviceToHost));
CHECK_STATUS(Rt, Rt::StreamSynchronize(static_cast<Rt::Stream>(stream)));
```
Free staging buffer on early-return error paths
host_buf is allocated before the CHECK_STATUS calls, but those macros return immediately on runtime copy/sync failures, skipping std::free(host_buf). If broadcast is retried after transient runtime errors, each failure leaks host memory. Consider using RAII (e.g., smart pointer/scope guard) or explicit cleanup before each error return.
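One way to apply the RAII suggestion is to own the staging buffer with a `std::unique_ptr` and a `std::free` deleter, so any early `return` (such as a `CHECK_STATUS`-style macro expanding to `return status;`) releases it automatically. The function name, parameters, and return codes below are illustrative, not the repo's actual API:

```cpp
#include <cstdlib>
#include <memory>

// Sketch: host_buf is freed on every exit path, including early error
// returns that would otherwise skip an explicit std::free(host_buf).
// simulate_copy_failure stands in for a failing Rt::Memcpy/Sync call.
int StageAndBroadcast(std::size_t total_bytes, bool simulate_copy_failure) {
    std::unique_ptr<void, decltype(&std::free)> host_buf(
        std::malloc(total_bytes), &std::free);
    if (!host_buf) return -1;  // allocation failure

    // Stand-in for the CHECK_STATUS early-return error path:
    if (simulate_copy_failure) return 1;  // host_buf still freed here

    // ... MPI_Bcast on host_buf.get(), copy back to device ...
    return 0;
}
```

A scope guard would work equally well; the point is that cleanup is tied to scope exit rather than repeated before each `return`.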
Summary
This PR adds a new collective communication operator, `infiniBroadcast`, along with its OpenMPI backend implementation, and a complete boundary-case test suite under `examples/broadcast.cc`.

In addition, this PR includes two small fixes for the existing OMPI backend that were discovered while testing on a real multi-GPU machine. They are logically independent from Broadcast and can be cherry-picked separately if needed.
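The key implementation detail in the Changes section is that `MPI_Bcast` takes an `int` element count, so transfers larger than `INT_MAX` bytes are split into chunks and broadcast with repeated calls. A minimal sketch of that loop, with a callback standing in for `MPI_Bcast` (the function and parameter names here are illustrative, not the repo's actual helpers):

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>

// Illustrative sketch of INT_MAX-byte chunking: bcast_chunk stands in
// for MPI_Bcast(buf + offset, chunk, MPI_BYTE, root, comm).
template <typename BcastFn>
int ChunkedBcast(std::size_t total_bytes, BcastFn bcast_chunk) {
    std::size_t offset = 0;
    int calls = 0;
    while (offset < total_bytes) {
        // Largest piece MPI_Bcast's `int count` can express.
        int chunk = static_cast<int>(std::min<std::size_t>(
            total_bytes - offset, static_cast<std::size_t>(INT_MAX)));
        bcast_chunk(offset, chunk);  // one MPI_Bcast per chunk
        offset += static_cast<std::size_t>(chunk);
        ++calls;
    }
    return calls;
}
```

For example, a buffer of `2 * INT_MAX + 5` bytes would be covered by three broadcast calls: two full `INT_MAX`-byte chunks plus a 5-byte remainder.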
Public API
Changes
- `Broadcast` base class in `src/base/broadcast.h` with parameter validation (root range, null pointer, `count = 0` short-circuit);
- OpenMPI backend implementation in `src/ompi/impl/broadcast.h`, using a host-staging path (consistent with the existing AllReduce implementation);
- transfers are chunked at `INT_MAX` bytes with repeated `MPI_Bcast` calls, so that arbitrarily large `count` values are supported;
- both in-place (`sendbuff == recvbuff`) and out-of-place calling conventions are supported; non-root ranks may pass `nullptr` as `sendbuff` (this convention is exercised by the test suite).
- Test suite `examples/broadcast.cc` covering 7 boundary cases:
  - `count = 0`: no-op short-circuit;
  - `root = size - 1`;
  - `sendbuff = nullptr`;
  - `root = 0`;
  - `root = size - 1`;
  - large transfer (`> INT_MAX` bytes), gated by `INFINI_BROADCAST_LARGE=1`;
  - invalid roots (`-1` and `size`) → `infiniInvalidArgument`.
- fix(ompi): handle fp16 and bf16 scaling in all_reduce. `__half` and `__nv_bfloat16` do not support C++ `operator*=`, so the `Avg` path previously failed to compile on NVIDIA. The fix converts the value to `float` for scaling and then back to the original type.
- fix(ompi): call MPI_Finalize during finalize. `FinalizeImpl` previously only called `MPI_Finalized` to query the state and never actually called `MPI_Finalize()`, which caused OpenMPI/PRRTE to print "exited improperly" warnings on process exit.

Known Issues & Future Work
- The `INT_MAX`-byte chunking pattern currently only lives in Broadcast. AllReduce could benefit from the same treatment in the future.
- `Broadcast::Execute`'s argument-validation style is kept consistent with `AllReduce::Execute`: invalid input returns `kInvalidArgument` and emits a `LOG` line. The "invalid root" test case intentionally triggers these log lines, which is expected output, not a failure.

Logs & Screenshots
Test environment:
- Rank-to-device mapping via `OMPI_COMM_WORLD_LOCAL_RANK` → `cudaSetDevice`.
- `nvidia-smi` shows the `broadcast` process on all 4 GPUs simultaneously, each occupying about 2 GB of device memory.
- The `> INT_MAX`-byte chunking path is triggered by `INFINI_BROADCAST_LARGE=1`.