Environment
2× M4 Mac mini (Mac16,10), 16 GB, macOS 26.5.1 (25F80), direct Thunderbolt cable (40 Gb/s link confirmed), MLX 0.31.2. JACCL driven in practice via EXO 1.0.71 (bundles libjaccl/libmlx/libibverbs).
(1) distributed_config reports "RDMA not enabled" despite rdma_ctl status: enabled
mlx.distributed_config --over thunderbolt --backend jaccl --hosts localhost,mac-mini-1
...
[ERROR] <host> does not seem to have RDMA enabled
on both hosts, even though:
rdma_ctl status → enabled
nvram rdma-enable → 1
ibv_devices → empty (No IB devices found); /dev/infiniband is absent
We exhausted the local enablement path: ran rdma_ctl enable (twice) + reboot, then a full coordinated cold boot of both nodes (both powered fully off with the TB cable attached, then cold-powered back up). ibv_devices still enumerates zero devices on macOS 26.5.1 (25F80). The Thunderbolt link itself is healthy (system_profiler SPThunderboltDataType: peer connected, 40 Gb/s).
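For anyone comparing against their own machines, the checks above can be gathered in one pass. This is only a sketch: the helper names (run, collect_rdma_state, format_report) are hypothetical, it invokes only the commands already quoted in this report, and it records their raw output without assuming any particular format.

```python
import os
import shutil
import subprocess


def run(cmd):
    """Run a command and return its combined output, or None if it is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return (proc.stdout + proc.stderr).strip()


def collect_rdma_state():
    """Gather the raw result of each check from this report, uninterpreted."""
    return {
        "rdma_ctl status": run(["rdma_ctl", "status"]),
        "nvram rdma-enable": run(["nvram", "rdma-enable"]),
        "ibv_devices": run(["ibv_devices"]),
        "/dev/infiniband exists": os.path.exists("/dev/infiniband"),
    }


def format_report(state):
    """Render the collected state as one 'name: value' line per check."""
    lines = []
    for name, value in state.items():
        shown = "<command not found>" if value is None else value
        lines.append(f"{name}: {shown}")
    return "\n".join(lines)
```

Running print(format_report(collect_rdma_state())) on each host gives a block that can be pasted into a reply as-is.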
Question: what is actually required for ibv_devices to enumerate rdma_enX after enabling on macOS 26.5.1 (25F80)? Did a point OS update change the enablement procedure (e.g. must rdma_ctl enable be re-run in Recovery, or is there a newer host requirement)? The docs say enablement can't be done over SSH, but a Recovery re-enable + cold boot here has not produced any verbs device.
(2) JACCL segfaults in this state instead of erroring cleanly
When the JACCL backend is initialised while no RDMA device is enumerated, it crashes rather than reporting the missing device:
EXC_BAD_ACCESS (SIGSEGV), KERN_INVALID_ADDRESS at 0x0
libibverbs.dylib ibv_reg_mr_iova2
libjaccl.dylib jaccl::SharedBuffer::register_to_protection_domain(ibv_pd*)
libjaccl.dylib jaccl::MeshGroup::allocate_buffers()
libjaccl.dylib jaccl::MeshGroup::initialize()
libmlx.dylib mlx::core::distributed::jaccl::init(bool)
This looks like ibv_alloc_pd returns null (no device) and the returned ibv_pd* is used unchecked, so ibv_reg_mr_iova2 dereferences null. A guard on the PD (and/or a clear "no RDMA device enumerated" error from jaccl::init) would turn a hard crash into a diagnosable failure.
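Until a guard like that exists in the library, a caller can avoid the crash by refusing to initialise JACCL when ibv_devices lists nothing. The sketch below is a caller-side workaround, not the proposed fix inside jaccl::init. It assumes that enumerated devices appear in the ibv_devices output under names of the form rdma_enX (the naming mentioned earlier in this report); the exact output format on macOS is not verified here, and NoRdmaDeviceError, parse_rdma_devices, require_rdma_device and preflight are hypothetical names. It shells out to ibv_devices rather than calling libibverbs directly because, per the reproducibility note below, that binary carries the required entitlement.

```python
import re
import subprocess


class NoRdmaDeviceError(RuntimeError):
    """Raised when no RDMA verbs device is enumerated on this host."""


def parse_rdma_devices(ibv_devices_output):
    """Return names that look like rdma_enX (assumed naming) in ibv_devices output."""
    return re.findall(r"\brdma_en\w+", ibv_devices_output)


def require_rdma_device(ibv_devices_output):
    """Return the enumerated device names, or raise a clear error if there are none."""
    devices = parse_rdma_devices(ibv_devices_output)
    if not devices:
        raise NoRdmaDeviceError(
            "no RDMA device enumerated; refusing to initialise the JACCL backend"
        )
    return devices


def preflight():
    """Run ibv_devices and fail cleanly before any JACCL initialisation is attempted."""
    try:
        proc = subprocess.run(
            ["ibv_devices"], capture_output=True, text=True, timeout=30
        )
    except FileNotFoundError as exc:
        raise NoRdmaDeviceError("ibv_devices is not installed on this host") from exc
    return require_rdma_device(proc.stdout)
```

Calling preflight() before the distributed backend is initialised turns the state described in this report into a Python exception rather than a SIGSEGV.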
Note on reproducibility
We could not produce a standalone pip-MLX repro: Apple gates the RDMA IOKit user client behind the private entitlement com.apple.private.IORDMAFamilyUC (carried by /usr/bin/ibv_devices), which an ad hoc-signed Python can't hold (AMFI rejects it). So mlx.launch --backend jaccl / mlx.distributed_config can't fully exercise RDMA outside an Apple-provisioned app. The segfault above was observed via EXO, which bundles the same libjaccl/libmlx.
Possibly related: #3162 (AppleThunderboltRDMA MR limits).