MeshService: null-check sender node before dereferencing has_user on hot RX path #10229

Open
nightjoker7 wants to merge 5 commits into meshtastic:develop from nightjoker7:fix/meshservice-from-node-null-deref

Conversation

@nightjoker7
Contributor

Summary

Null-check nodeDB->getMeshNode(mp->from) before dereferencing ->has_user in MeshService::handleFromRadio().

Problem

In src/mesh/MeshService.cpp::handleFromRadio():

nodeDB->updateFrom(*mp); // update our DB state based off sniffing every RX packet from the radio
...
} else if (mp->which_payload_variant == meshtastic_MeshPacket_decoded_tag && !nodeDB->getMeshNode(mp->from)->has_user &&
           nodeInfoModule && !isPreferredRebroadcaster && !nodeDB->isFull()) {

The code assumes that updateFrom(*mp) guarantees mp->from is present in the NodeDB by the time the next line runs, but it does not. Looking at NodeDB::updateFrom():

void NodeDB::updateFrom(const meshtastic_MeshPacket &mp)
{
    if (mp.from == getNodeNum()) {
        LOG_DEBUG("Ignore update from self");
        return;                               // <-- early return, no insert
    }
    if (mp.which_payload_variant == meshtastic_MeshPacket_decoded_tag && mp.from) {
        ...
        meshtastic_NodeInfoLite *info = getOrCreateMeshNode(getFrom(&mp));
        if (!info) {
            return;                           // <-- early return on alloc failure
        }
        ...
    }
    // (nothing inserted for encrypted-tag packets or mp.from == 0)
}

Three paths through updateFrom leave the DB unchanged:

  1. mp.from == getNodeNum() — the packet is from ourselves (can happen on loopback / MQTT echo / local re-injection).
  2. getOrCreateMeshNode() returns nullptr — the DB is at MAX_NUM_NODES, heap is below MINIMUM_SAFE_FREE_HEAP, and no non-favorite / non-ignored / non-verified node exists to evict.
  3. Encrypted-tag packet with mp.from == 0 — skipped entirely.

In any of these cases, getMeshNode(mp->from) on the next line returns NULL and dereferencing ->has_user is a null-pointer dereference → hard fault / watchdog reset.

The !nodeDB->isFull() clause at the end of the same if does not protect against this — short-circuit evaluation means the has_user dereference happens first.

This path runs on every inbound radio packet, so it is one of the hottest code paths in the firmware. Any node on a crowded mesh that pushes the DB to capacity is at risk.
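
A minimal standalone sketch of the failure mode, with stubbed stand-ins (StubNodeDB and the sender number are illustrative only, not the firmware's real types):

#include <cstdio>

struct NodeInfoLite {
    bool has_user = false;
};

struct StubNodeDB {
    // Models the bad case: the sender never made it into the DB.
    NodeInfoLite *getMeshNode(unsigned /*from*/) { return nullptr; }
    bool isFull() { return true; }
};

int main()
{
    StubNodeDB db;
    unsigned from = 0x12345678; // hypothetical sender

    // Pre-patch shape: undefined behavior (crash) when getMeshNode returns
    // nullptr, because && evaluates strictly left to right, so ->has_user
    // runs before !db.isFull() is ever reached:
    // if (!db.getMeshNode(from)->has_user && !db.isFull()) { ... }

    // Patched shape: the nullptr test short-circuits the chain, so the
    // dereference never happens for an unknown sender.
    if (db.getMeshNode(from) != nullptr && !db.getMeshNode(from)->has_user && !db.isFull())
        puts("would send NodeInfo");
    else
        puts("sender unknown or DB full: skip NodeInfo send");
    return 0;
}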

Fix

Add the standard nodeDB->getMeshNode(mp->from) != nullptr guard earlier in the short-circuit chain, mirroring the pattern already used correctly elsewhere (e.g. Router.cpp::perhapsSetChannel, RoutingModule::isRoutingRequest, ReliableRouter.cpp::shouldRelayPkiEncrypted):

} else if (mp->which_payload_variant == meshtastic_MeshPacket_decoded_tag && nodeDB->getMeshNode(mp->from) != nullptr &&
           !nodeDB->getMeshNode(mp->from)->has_user && nodeInfoModule && !isPreferredRebroadcaster && !nodeDB->isFull()) {

If getMeshNode returns null, we skip the "send NodeInfo to unknown node" path entirely — which is the correct behavior, because we have no record of the peer to respond to anyway.

Testing

  • Patched firmware running on a 4-node fleet (RAK4631 / Heltec Station G2 / T114 / Heltec V3) for several days with a 250-node-capacity NodeDB close to saturation. No crashes on inbound traffic of any portnum.
  • Builds cleanly against develop.

Risk

Minimal. One-line change that adds a null-check before an existing dereference on a hot path. No behavior change on the happy path; previously-crashing path now takes the same "skip NodeInfo send" branch it would take if the node were already known.

…hot RX path

In handleFromRadio(), the "heard a new node, send our NodeInfo" branch
dereferences nodeDB->getMeshNode(mp->from)->has_user without first
confirming the lookup succeeded.

The preceding updateFrom(*mp) call does NOT guarantee mp->from is now
in the DB. It early-returns in three cases: when mp.from equals our
own nodenum (local loopback / MQTT echo), when getOrCreateMeshNode
fails because the DB is full and no non-favorite non-ignored
non-verified entry is available for eviction, and when mp.from == 0
on an encrypted-tag packet. In any of those paths, getMeshNode returns
NULL on the next line and ->has_user dereferences it, producing a
null-pointer crash. The trailing !nodeDB->isFull() check does not
guard this because short-circuit evaluation reaches the dereference
first.

Add a nullptr check to the short-circuit chain, matching the pattern
already used correctly in Router.cpp, RoutingModule.cpp, and
ReliableRouter.cpp. If the sender is not in the DB we simply skip the
NodeInfo-send branch, which is the correct behavior.
@github-actions github-actions Bot added the bugfix Pull request that fixes bugs label Apr 21, 2026
@thebentern thebentern requested a review from Copilot April 21, 2026 20:34

Copilot AI left a comment


Pull request overview

Fixes a potential null-pointer dereference in the mesh RX pipeline by guarding nodeDB->getMeshNode(mp->from) before accessing ->has_user in MeshService::handleFromRadio(), preventing hard faults when updateFrom() doesn’t create/retain a NodeDB entry.

Changes:

  • Add a getMeshNode(mp->from) != nullptr short-circuit check before dereferencing has_user in the hot RX path.

Comment thread on src/mesh/MeshService.cpp (outdated)
Comment on lines +97 to +98
} else if (mp->which_payload_variant == meshtastic_MeshPacket_decoded_tag && nodeDB->getMeshNode(mp->from) != nullptr &&
!nodeDB->getMeshNode(mp->from)->has_user && nodeInfoModule && !isPreferredRebroadcaster && !nodeDB->isFull()) {

Copilot AI Apr 21, 2026


getMeshNode(mp->from) is now called twice in the same condition. NodeDB::getMeshNode() performs a linear scan over numMeshNodes, so on this hot RX path this doubles the lookup work for packets where the node exists. Consider caching the pointer in a local variable (or otherwise restructuring the condition) so the node lookup happens once and the subsequent has_user check uses the same result.
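
For context, the lookup being doubled is roughly shaped like this (an assumed sketch based on the linear-scan description above, not a verbatim copy of NodeDB.cpp):

meshtastic_NodeInfoLite *NodeDB::getMeshNode(NodeNum n)
{
    // O(numMeshNodes) scan; calling it twice per RX packet doubles this work
    for (int i = 0; i < numMeshNodes; i++)
        if (meshNodes->at(i).num == n)
            return &meshNodes->at(i);
    return NULL;
}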

Contributor

@copilot apply changes based on this feedback

@nightjoker7
Contributor Author

Adding production evidence from a small deployment: this crash fires in normal operation in multi-channel RF areas, not just in theoretical edge cases.

Observed 2026-04-22 on two Station-G2 nodes (ROUTER_LATE + CLIENT_BASE) in an urban mesh:

  • Both nodes rebooted independently within 19 minutes of each other (04:55 and 05:14 UTC), after 10.25h and 3.7h of uptime respectively. A neighbor node went down ~2h later after 6.47 days up
  • NodeDB at MAX_NUM_NODES (250/250) on both — exactly the isFull() condition the PR describes
  • 12h of meshlogger captures: 417 packets (9.2% of total RX traffic) from 10+ distinct senders not in NodeDB, whose payloads fail channel-key decryption because they're on a different Meshtastic channel (channel hash 115 vs our 45). Their LoRa still demodulates, so the firmware enters handleFromRadio(), updateFrom() short-circuits because getOrCreateMeshNode() returns null (DB full + heap pressure), and execution reaches exactly the getMeshNode(mp->from)->has_user dereference this PR null-checks

Loudest example: sender 0xF67C9BA3, ~5/hr all day, peaks +10 dB SNR (1 hop away).

After patching (this PR + #10219 + #10226), same rogue packets continue to arrive (91+ from that one sender alone since patch), no crashes on the MeshService::handleFromRadio path.

Real-world class of trigger: any mesh near MAX_NUM_NODES in an RF area with multiple Meshtastic channels will hit this at a measurable rate. The null-check converts a remote-DoS-over-RF into a benign decode-fail.

@nightjoker7
Contributor Author

Post-hoc correction — the crash rate I cited was amplified by an independent bug in our meshlogger client that I only caught after the first comment. Want to be intellectually honest about what the evidence actually supports.

The Python meshtastic.tcp_interface.TCPInterface(host) call leaks a socket + reader thread every time the config handshake times out. Our meshlogger was retrying on a flat 15-second interval, and each retry triggers a full config-dump + NodeDB-dump on hw71's 250-entry DB. That's a compound reconnect storm happening on top of the cross-channel RF traffic — the crash rate in my first comment reflects both pressures combined.

Once we patched the client (L4 pre-probe before handshake, explicit connectNow=False with iface.close() on handshake timeout, 5-minute handshake-timeout backoff, exponential backoff otherwise) and flashed firmware with this PR + #10226 + #10219, post-flash window has been stable for 3+ hours under the same cross-channel senders still broadcasting.

Revised framing: this PR is not "the sole fix for DoS-over-RF." In-field crash rate depends heavily on client behavior too. But the underlying null-deref — getMeshNode(mp->from)->has_user on a nullptr return when NodeDB is full and getOrCreateMeshNode() can't evict — is still a real hard-fault path that any deployment can hit when:

  • NodeDB sits near MAX_NUM_NODES (common on routers in active mesh areas)
  • Packets arrive from senders not in local NodeDB (cross-channel, corruption, cold-boot race, MQTT downlink of previously-unheard nodes)
  • Any concurrent heap pressure is present (phone reconnect dump, traceroute burst, MQTT flood, etc.)

My data should be read as "this bug does fire in production, just not always at the rate I originally reported" — not "this bug alone causes the crash rate I observed." The null-check is still correct and the fault class is still real under normal operation; a better-behaved client just hits it less often.

@nightjoker7
Contributor Author

Client-side fix for the meshlogger bug I mentioned above — filed as meshtastic/python#918. Independent of the firmware null-deref here, but was amplifying the trigger rate in my deployment.

Review feedback from @thebentern (via Copilot) on PR meshtastic#10229: the null-check
and has_user read called getMeshNode(mp->from) twice in the same condition,
doubling the linear-scan work on every RX packet.

Cache the result in a local fromNode pointer, then do the null-check and
has_user read against the cached pointer. Same behaviour; one scan instead
of two. Also eliminates the TOCTOU window where a concurrent task could
evict the node between calls.
@nightjoker7
Contributor Author

Addressed in a6a93d281: cached getMeshNode(mp->from) in a local fromNode pointer, one scan per packet, null-check against the cached pointer. Same behaviour as before with no TOCTOU window.
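
For reference, the cached-lookup condition presumably reads along these lines (fromNode is the name from the commit message; the rest mirrors the diff quoted earlier):

meshtastic_NodeInfoLite *fromNode = nodeDB->getMeshNode(mp->from); // one linear scan per packet
...
} else if (mp->which_payload_variant == meshtastic_MeshPacket_decoded_tag && fromNode != nullptr && !fromNode->has_user &&
           nodeInfoModule && !isPreferredRebroadcaster && !nodeDB->isFull()) {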

