MeshService: null-check sender node before dereferencing has_user on hot RX path#10229
MeshService: null-check sender node before dereferencing has_user on hot RX path#10229nightjoker7 wants to merge 5 commits intomeshtastic:developfrom
Conversation
…hot RX path In handleFromRadio(), the "heard a new node, send our NodeInfo" branch dereferences nodeDB->getMeshNode(mp->from)->has_user without first confirming the lookup succeeded. The preceding updateFrom(*mp) call does NOT guarantee mp->from is now in the DB. It early-returns in three cases: when mp.from equals our own nodenum (local loopback / MQTT echo), when getOrCreateMeshNode fails because the DB is full and no non-favorite non-ignored non-verified entry is available for eviction, and when mp.from == 0 on an encrypted-tag packet. In any of those paths, getMeshNode returns NULL on the next line and ->has_user dereferences it, producing a null-pointer crash. The trailing !nodeDB->isFull() check does not guard this because short-circuit evaluation reaches the dereference first. Add a nullptr check to the short-circuit chain, matching the pattern already used correctly in Router.cpp, RoutingModule.cpp, and ReliableRouter.cpp. If the sender is not in the DB we simply skip the NodeInfo-send branch, which is the correct behavior.
There was a problem hiding this comment.
Pull request overview
Fixes a potential null-pointer dereference in the mesh RX pipeline by guarding nodeDB->getMeshNode(mp->from) before accessing ->has_user in MeshService::handleFromRadio(), preventing hard faults when updateFrom() doesn’t create/retain a NodeDB entry.
Changes:
- Add a
getMeshNode(mp->from) != nullptrshort-circuit check before dereferencinghas_userin the hot RX path.
| } else if (mp->which_payload_variant == meshtastic_MeshPacket_decoded_tag && nodeDB->getMeshNode(mp->from) != nullptr && | ||
| !nodeDB->getMeshNode(mp->from)->has_user && nodeInfoModule && !isPreferredRebroadcaster && !nodeDB->isFull()) { |
There was a problem hiding this comment.
getMeshNode(mp->from) is now called twice in the same condition. NodeDB::getMeshNode() performs a linear scan over numMeshNodes, so on this hot RX path this doubles the lookup work for packets where the node exists. Consider caching the pointer in a local variable (or otherwise restructuring the condition) so the node lookup happens once and the subsequent has_user check uses the same result.
|
Adding production evidence from a small deployment — this race fires in normal operation in multi-channel RF areas, not just theoretical edge cases. Observed 2026-04-22 on two Station-G2 nodes (ROUTER_LATE + CLIENT_BASE) in an urban mesh:
Loudest example: sender After patching (this PR + #10219 + #10226), same rogue packets continue to arrive (91+ from that one sender alone since patch), no crashes on the Real-world class of trigger: any mesh near |
|
Post-hoc correction — the crash rate I cited was amplified by an independent bug in our meshlogger client that I only caught after the first comment. Want to be intellectually honest about what the evidence actually supports. The Python Once we patched the client (L4 pre-probe before handshake, explicit Revised framing: this PR is not "the sole fix for DoS-over-RF." In-field crash rate depends heavily on client behavior too. But the underlying null-deref —
My data should be read as "this bug does fire in production, just not always at the rate I originally reported" — not "this bug alone causes the crash rate I observed." The null-check is still correct and the fault class is still real under normal operation; a better-behaved client just hits it less often. |
|
Client-side fix for the meshlogger bug I mentioned above — filed as meshtastic/python#918. Independent of the firmware null-deref here, but was amplifying the trigger rate in my deployment. |
…om-node-null-deref
Review feedback from @thebentern (via Copilot) on PR meshtastic#10229: the null-check and has_user read called getMeshNode(mp->from) twice in the same condition, doubling the linear-scan work on every RX packet. Cache the result in a local fromNode pointer, then do the null-check and has_user read against the cached pointer. Same behaviour; one scan instead of two. Also eliminates the TOCTOU window where a concurrent task could evict the node between calls.
|
Addressed in a6a93d281: cached |
Summary
Null-check
nodeDB->getMeshNode(mp->from)before dereferencing->has_userinMeshService::handleFromRadio().Problem
In
src/mesh/MeshService.cpp::handleFromRadio():The code assumes
updateFrom(*mp)guaranteesmp->fromis in the NodeDB on the next line, but it does not. Looking atNodeDB::updateFrom():Three paths through
updateFromleave the DB unchanged:mp.from == getNodeNum()— the packet is from ourselves (can happen on loopback / MQTT echo / local re-injection).getOrCreateMeshNode()returnsnullptr— the DB is atMAX_NUM_NODES, heap is belowMINIMUM_SAFE_FREE_HEAP, and no non-favorite / non-ignored / non-verified node exists to evict.mp.from == 0— skipped entirely.In any of these cases,
getMeshNode(mp->from)on the next line returnsNULLand dereferencing->has_useris a null-pointer dereference → hard fault / watchdog reset.The
!nodeDB->isFull()clause at the end of the sameifdoes not protect against this — short-circuit evaluation means thehas_userdereference happens first.This path runs on every inbound radio packet, so it is one of the hottest code paths in the firmware. Any node on a crowded mesh that pushes the DB to capacity is at risk.
Fix
Add the standard
nodeDB->getMeshNode(mp->from) != nullptrguard earlier in the short-circuit chain, mirroring the pattern already used correctly elsewhere (e.g.Router.cpp::perhapsSetChannel,RoutingModule::isRoutingRequest,ReliableRouter.cpp::shouldRelayPkiEncrypted):If
getMeshNodereturns null, we skip the "send NodeInfo to unknown node" path entirely — which is the correct behavior, because we have no record of the peer to respond to anyway.Testing
develop.Risk
Minimal. One-line change that adds a null-check before an existing dereference on a hot path. No behavior change on the happy path; previously-crashing path now takes the same "skip NodeInfo send" branch it would take if the node were already known.