Skip to content

Dynamic Loading IB verbs#253

Draft
AtlantaPepsi wants to merge 1 commit intoROCm:candidatefrom
AtlantaPepsi:ibdl
Draft

Dynamic Loading IB verbs#253
AtlantaPepsi wants to merge 1 commit intoROCm:candidatefrom
AtlantaPepsi:ibdl

Conversation

@AtlantaPepsi
Copy link
Copy Markdown
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an IBV_DIRECT build/runtime switch intended to support resolving libibverbs symbols either via direct linkage or via dlopen/dlsym, and adds a runtime check to block NIC executor usage when verbs aren’t available.

Changes:

  • Added IBV_DIRECT mode plumbing (Makefile + CMake option) to toggle direct symbol binding vs dlsym resolution.
  • Introduced pfn_* function pointers and updated some verbs call sites/macros to call through them.
  • Added System::IbvLoaded() and a guard to error out when NIC executor is requested but verbs aren’t loaded.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File Description
src/header/TransferBench.hpp Adds libibverbs dynamic-loading infrastructure (pfn_*, Ibvdl()), runtime loaded-state tracking, and uses pfn_* in some verbs calls.
Makefile Adds DISABLE_IBV_DIRECT flag and sets -DIBV_DIRECT=1 by default when NIC exec is enabled.
CMakeLists.txt Adds ENABLE_IBV_DIRECT option and defines IBV_DIRECT=1 when enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +745 to +768
namespace {
IBV_FN(ibv_alloc_pd, ibv_pd*, (ibv_context*));
IBV_FN(ibv_close_device, int, (ibv_context*));
IBV_FN(ibv_create_cq, ibv_cq*, (ibv_context*, int, void*, ibv_comp_channel*, int));
IBV_FN(ibv_create_qp, ibv_qp*, (ibv_pd*, ibv_qp_init_attr*));
IBV_FN(ibv_dealloc_pd, int, (ibv_pd*));
IBV_FN(ibv_dereg_mr, int, (ibv_mr*));
IBV_FN(ibv_destroy_cq, int, (ibv_cq*));
IBV_FN(ibv_destroy_qp, int, (ibv_qp*));
IBV_FN(ibv_free_device_list, void, (ibv_device**));
IBV_FN(ibv_get_device_list, ibv_device**, (int*));
IBV_FN(ibv_get_device_name, const char*, (ibv_device*));
IBV_FN(ibv_modify_qp, int, (ibv_qp*, ibv_qp_attr*, int));
IBV_FN(ibv_open_device, ibv_context*, (ibv_device*));
IBV_FN(ibv_poll_cq, int, (ibv_cq*, int, ibv_wc*));
IBV_FN(ibv_post_send, int, (ibv_qp*, ibv_send_wr*, ibv_send_wr**));
IBV_FN(ibv_query_device, int, (ibv_context*, ibv_device_attr*));
IBV_FN(ibv_query_gid, int, (ibv_context*, uint8_t, int, ibv_gid*));
IBV_FN(ibv_query_port, int, (ibv_context*, uint8_t, ibv_port_attr*));
#ifdef HAVE_DMABUF_SUPPORT
IBV_FN(ibv_reg_dmabuf_mr, ibv_mr*, (ibv_pd*, uint64_t, size_t, uint64_t, int, int));
#endif
IBV_FN(ibv_reg_mr, ibv_mr*, (ibv_pd*, void*, size_t, int));
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new IBV_FN() function-pointer declarations are not guarded by NIC_EXEC_ENABLED, but IBV_FN (and ibv_* types) only exist when NIC_EXEC_ENABLED includes <infiniband/verbs.h>. As-is, builds with NIC executor disabled will fail to compile at this block. Wrap the IBV_FN declarations (and any dependent IBV_* call macros) in #ifdef NIC_EXEC_ENABLED, or provide a no-op/alternative definition when NIC support is off.

Copilot uses AI. Check for mistakes.
Comment on lines +2821 to +2827
// Should only be called with IBV_DIRECT guard
static void* Ibvdl() {
static void* ibvLibHandle = nullptr;
if (ibvLibHandle) return ibvLibHandle;

void *handle = dlopen("libibverbs.so.1", RTLD_NOW);
if (handle != nullptr) {
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ibvdl() uses dlopen/dlsym/dlclose but is compiled unconditionally under NIC_EXEC_ENABLED. When IBV_DIRECT=1, <dlfcn.h> is not included, so these calls will be undeclared and compilation will fail. Guard Ibvdl() itself (and any dl* usage) with #if !IBV_DIRECT (or include <dlfcn.h> regardless of IBV_DIRECT).

Copilot uses AI. Check for mistakes.
Comment on lines +2853 to +2860
for (size_t i = 0; i < sizeof(symbols) / sizeof(symbols[0]); i++) {
*symbols[i].ppfn = dlsym(handle, symbols[i].name);
if (*symbols[i].ppfn == nullptr) {
// Log("[WARN] Failed to load symbol %s", symbols[i].name);
dlclose(handle);
break;
}
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ibvdl() currently treats partial symbol resolution as success: on a dlsym failure it dlclose()s the handle but still leaves handle non-null and later assigns it to the cached ibvLibHandle. This can return a closed handle and leave some pfn_* pointers non-null/invalid. Only mark the library loaded if all symbols are resolved; otherwise set handle to nullptr and reset any already-assigned function pointers.

Suggested change
for (size_t i = 0; i < sizeof(symbols) / sizeof(symbols[0]); i++) {
*symbols[i].ppfn = dlsym(handle, symbols[i].name);
if (*symbols[i].ppfn == nullptr) {
// Log("[WARN] Failed to load symbol %s", symbols[i].name);
dlclose(handle);
break;
}
}
auto resetSymbols = [&symbols]() {
for (size_t i = 0; i < sizeof(symbols) / sizeof(symbols[0]); i++) {
*symbols[i].ppfn = nullptr;
}
};
bool loadedAllSymbols = true;
for (size_t i = 0; i < sizeof(symbols) / sizeof(symbols[0]); i++) {
*symbols[i].ppfn = dlsym(handle, symbols[i].name);
if (*symbols[i].ppfn == nullptr) {
// Log("[WARN] Failed to load symbol %s", symbols[i].name);
loadedAllSymbols = false;
break;
}
}
if (!loadedAllSymbols) {
resetSymbols();
dlclose(handle);
handle = nullptr;
}

Copilot uses AI. Check for mistakes.
static vector<IbvDevice> ibvDeviceList = {};

#if !defined(IBV_DIRECT)
if (ibvLibHandle == nullptr) {
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetIbvDeviceList() checks ibvLibHandle == nullptr, but ibvLibHandle here refers to no visible variable (the System member isn’t accessible from this free function, and the Ibvdl() local static isn’t in scope). This is a compile error. Use a reachable source of truth (e.g., System::Get().IbvLoaded(), store the handle in a shared internal static with external linkage in this anonymous namespace, or call Ibvdl() and cache its result) instead of referencing an out-of-scope identifier.

Suggested change
if (ibvLibHandle == nullptr) {
if (Ibvdl() == nullptr) {

Copilot uses AI. Check for mistakes.
Comment on lines +5949 to +5952
if (ibvLibHandle) {
dlclose(ibvLibHandle);
ibvLibHandle = nullptr;
}
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System::~System() calls dlclose(ibvLibHandle) unconditionally. When IBV_DIRECT=1, <dlfcn.h> is not included, so dlclose will be undeclared and compilation can fail. Also, Ibvdl() keeps its own internal static handle, so closing only the System member can leave Ibvdl() returning a stale (already-closed) handle on future calls. Guard dlclose with #if !IBV_DIRECT and keep a single owner for the handle (either System or Ibvdl, but not both).

Suggested change
if (ibvLibHandle) {
dlclose(ibvLibHandle);
ibvLibHandle = nullptr;
}
#ifdef NIC_EXEC_ENABLED
#if !IBV_DIRECT
// Ibvdl() owns the dynamically loaded libibverbs handle. Avoid closing a
// cached copy here so future Ibvdl() calls cannot observe a stale handle.
ibvLibHandle = nullptr;
#endif
#endif

Copilot uses AI. Check for mistakes.
Comment on lines 100 to +103
LDFLAGS += -libverbs
NIC_ENABLED = 1
ifneq ($(DISABLE_IBV_DIRECT),1)
COMMON_FLAGS += -DIBV_DIRECT=1
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When DISABLE_IBV_DIRECT=1 (dlsym path), the build still unconditionally links -libverbs. Any remaining direct ibv_* calls (and even an unused -libverbs without --as-needed) can keep libibverbs as a hard runtime dependency, undermining the goal of optional runtime loading. Consider omitting -libverbs when using the dlsym path (and ensure all ibv_* usage goes through pfn_*), and add -ldl for toolchains/glibc versions where dlopen is not in libc.

Suggested change
LDFLAGS += -libverbs
NIC_ENABLED = 1
ifneq ($(DISABLE_IBV_DIRECT),1)
COMMON_FLAGS += -DIBV_DIRECT=1
NIC_ENABLED = 1
ifneq ($(DISABLE_IBV_DIRECT),1)
COMMON_FLAGS += -DIBV_DIRECT=1
LDFLAGS += -libverbs
else
LDFLAGS += -ldl

Copilot uses AI. Check for mistakes.
Comment on lines 325 to +327
target_link_libraries(TransferBench PRIVATE ${IBVERBS_LIBRARY})
target_compile_definitions(TransferBench PRIVATE NIC_EXEC_ENABLED)
if(ENABLE_IBV_DIRECT)
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENABLE_IBV_DIRECT=OFF claims to resolve symbols via dlsym at runtime, but the target still links against ${IBVERBS_LIBRARY} unconditionally when IBVERBS_FOUND. This keeps libibverbs as a hard dependency and defeats optional loading. If runtime optionality is intended, avoid linking ${IBVERBS_LIBRARY} when ENABLE_IBV_DIRECT is OFF and ensure all ibv_* usage is through the loaded function pointers.

Suggested change
target_link_libraries(TransferBench PRIVATE ${IBVERBS_LIBRARY})
target_compile_definitions(TransferBench PRIVATE NIC_EXEC_ENABLED)
if(ENABLE_IBV_DIRECT)
target_compile_definitions(TransferBench PRIVATE NIC_EXEC_ENABLED)
if(ENABLE_IBV_DIRECT)
target_link_libraries(TransferBench PRIVATE ${IBVERBS_LIBRARY})

Copilot uses AI. Check for mistakes.
Comment on lines 2880 to 2883
// Query the number of IBV devices
int numIbvDevices = 0;
ibv_device** deviceList = ibv_get_device_list(&numIbvDevices);
ibv_device** deviceList = pfn_ibv_get_device_list(&numIbvDevices);

Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within GetIbvDeviceList(), most verbs calls were converted to pfn_* (e.g., pfn_ibv_get_device_list/pfn_ibv_open_device/etc.), but the function still calls ibv_free_device_list(deviceList) directly later in the same function. In the dlsym path this reintroduces a direct libibverbs symbol reference and breaks the “all calls through pfn_*” model. Switch that cleanup call to pfn_ibv_free_device_list as well.

Copilot uses AI. Check for mistakes.
Comment on lines +2821 to +2822
// Should only be called with IBV_DIRECT guard
static void* Ibvdl() {
Copy link

Copilot AI Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above Ibvdl() says it “should only be called with IBV_DIRECT guard”, but Ibvdl() is the dlsym/dlopen path and should only be used when IBV_DIRECT is disabled. Update the comment (and ideally add a compile-time guard) to avoid confusion about which mode this helper supports.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants