[BUG] JACCL segfaults in ibv_reg_mr (null PD) when RDMA absent; distributed_config says "RDMA not enabled" despite rdma_ctl status=enabled, survives cold boot (macOS 26.5.1 / 25F80) #3777

@andrewkriley

Environment

M4 Mac mini (Mac16,10), 16 GB, macOS 26.5.1 (25F80), direct Thunderbolt cable (40 Gb/s link confirmed), MLX 0.31.2. JACCL driven in practice via EXO 1.0.71 (bundles libjaccl/libmlx/libibverbs).

(1) distributed_config reports "RDMA not enabled" despite rdma_ctl status: enabled

mlx.distributed_config --over thunderbolt --backend jaccl --hosts localhost,mac-mini-1
  ...
  [ERROR] <host> does not seem to have RDMA enabled

on both hosts, even though:

  • rdma_ctl status → enabled
  • nvram rdma-enable → 1
  • ibv_devices → empty (No IB devices found); /dev/infiniband is absent

We exhausted the local enablement path: ran rdma_ctl enable (twice) + reboot, then a full coordinated cold boot of both nodes (powered fully off with the TB cable attached, cold-powered both back up). ibv_devices still enumerates zero devices on macOS 25F80. The Thunderbolt link itself is healthy (system_profiler SPThunderboltDataType: peer connected, 40 Gb/s).
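A preflight along these lines could let a launcher refuse to start JACCL while nothing is enumerated. This is only a sketch: count_rdma_devices is a hypothetical helper (not part of MLX or JACCL), it works on text captured from ibv_devices, and it assumes device names start with "rdma_en" and appear as the first token on a line. The real output layout on macOS may differ.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper, not an MLX/JACCL API.
// Counts lines whose first whitespace-separated token starts with `prefix`.
// Assumption: verbs devices are listed one per line, name first, and are
// named rdma_enX as referred to in this report.
int count_rdma_devices(const std::string& ibv_devices_output,
                       const std::string& prefix = "rdma_en") {
  std::istringstream lines(ibv_devices_output);
  std::string line;
  int count = 0;
  while (std::getline(lines, line)) {
    std::istringstream tokens(line);
    std::string first;
    // compare() returns 0 only when `first` begins with `prefix`.
    if (tokens >> first && first.compare(0, prefix.size(), prefix) == 0) {
      ++count;
    }
  }
  return count;
}
```

Fed the "No IB devices found" text seen here, the helper returns 0, which a caller could turn into a clear "no RDMA device enumerated" message before any backend initialisation.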

Question: what is actually required for ibv_devices to enumerate rdma_enX after enabling on macOS 26.5 (25F80)? Did a point OS update change the enablement procedure (e.g. must rdma_ctl enable be re-run in Recovery, or is there a newer host requirement)? The docs say enablement can't be done over SSH, but a Recovery re-enable + cold boot here has not produced any verbs device.

(2) JACCL segfaults in this state instead of erroring cleanly

When the JACCL backend is initialised while no RDMA device is enumerated, it crashes rather than reporting the missing device:

EXC_BAD_ACCESS (SIGSEGV), KERN_INVALID_ADDRESS at 0x0
  libibverbs.dylib  ibv_reg_mr_iova2
  libjaccl.dylib    jaccl::SharedBuffer::register_to_protection_domain(ibv_pd*)
  libjaccl.dylib    jaccl::MeshGroup::allocate_buffers()
  libjaccl.dylib    jaccl::MeshGroup::initialize()
  libmlx.dylib      mlx::core::distributed::jaccl::init(bool)

This looks like ibv_alloc_pd returns null (no device) and the returned ibv_pd* is used unchecked, so ibv_reg_mr_iova2 dereferences null. A guard on the PD (and/or a clear "no RDMA device enumerated" error from jaccl::init) would turn a hard crash into a diagnosable failure.
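The kind of guard meant here might look like the sketch below. It is not JACCL's code: require_non_null, ibv_pd_stub, alloc_pd_stub and try_init are hypothetical stand-ins so the example compiles without libibverbs, and the null-PD diagnosis itself is inferred from the stack trace rather than confirmed.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: reject a null handle with a readable error
// instead of passing it on to a call that would dereference it.
template <typename T>
T* require_non_null(T* handle, const std::string& what) {
  if (handle == nullptr) {
    throw std::runtime_error("[jaccl] " + what +
                             " is null: no RDMA device enumerated?");
  }
  return handle;
}

// Stand-in for ibv_pd so the sketch builds without libibverbs.
struct ibv_pd_stub {};

// Stand-in for a PD allocation that yields null when no device exists.
ibv_pd_stub* alloc_pd_stub(bool device_present) {
  static ibv_pd_stub pd;
  return device_present ? &pd : nullptr;
}

// Stand-in for the init path: returns "ok" or the error message.
std::string try_init(bool device_present) {
  try {
    ibv_pd_stub* pd =
        require_non_null(alloc_pd_stub(device_present), "protection domain");
    (void)pd;  // buffer registration would use the checked PD here
    return "ok";
  } catch (const std::runtime_error& e) {
    return e.what();
  }
}
```

With the check placed before buffer registration, the no-device case surfaces as an exception that the caller can report, rather than a SIGSEGV inside ibv_reg_mr_iova2.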

Note on reproducibility

We could not produce a standalone pip-MLX repro: Apple gates the RDMA IOKit user client behind the private entitlement com.apple.private.IORDMAFamilyUC (carried by /usr/bin/ibv_devices), which an ad hoc-signed Python binary can't hold (AMFI rejects it). So mlx.launch --backend jaccl / mlx.distributed_config can't fully exercise RDMA outside an Apple-provisioned app. The segfault above was observed via EXO, which bundles the same libjaccl/libmlx.

Possibly related: #3162 (AppleThunderboltRDMA MR limits).
