Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Select an option

  • Save mwolter805/ac228fb36471d722d096c82a0e980c21 to your computer and use it in GitHub Desktop.

Select an option

Save mwolter805/ac228fb36471d722d096c82a0e980c21 to your computer and use it in GitHub Desktop.
Forensics — meshcore_py Combined Report

SDK target: meshcore-dev/meshcore_py @ fbf84cb (v2.3.6, upstream main HEAD)

Firmware target: meshcore-dev/MeshCorecompanion_radio at tag companion-v1.14.1

Mode: consolidation of read-only forensic audits. No source files were modified in any audit round.

This report merges the findings from all six prior forensic audit sweeps into a single ordered list, deduplicated where rounds found the same issue, and organized by fix-group — i.e. issues that should be addressed together in the same PR. Severity is shown inline on each finding so priority within a group is still visible.

Total raw findings across all rounds: 48. After deduplication (M01 is the same bug as the re-worded F21) and after retiring one mitigated item (F15), there are 46 active unique findings.


Severity rubric

  • S1 — Critical. Crashes, unrecoverable state, reconnect storms, data corruption, or bugs that directly break downstream consumers in observable ways on normal code paths.
  • S2 — High. Resource leaks, race conditions, silent protocol drops, systemic error-handling gaps, or bugs that cause intermittent misbehavior downstream.
  • S3 — Medium. Latent bugs reachable only on specific error paths, broken error handling, silent value corruption on malformed input, feature gaps, deprecated API usage that will bite on Python upgrade.
  • Info / Low. Code quality, dead code, debug leftovers, maintainability concerns, minor asymmetries — no functional impact alone but worth cleaning up.

Executive summary

Severity Count
S1 Critical 4
S2 High 14
S3 Medium 18
Info / Low 10
Mitigated (F15) 1
Total unique active findings 46
Fix-group Findings Recommended PR order
G1 — Reader / parser crash-safety 11 1st (highest leverage)
G2 — Error-response handling 7 2nd
G3 — Reconnect path 4 3rd
G4 — Transport symmetry & fall-throughs 7 4th
G5 — Asyncio lifecycle hygiene 4 5th
G6 — Missing protocol surface / feature gaps 6 6th
G7 — Standalone bugs and cleanup 7 last (catch-all)

Top structural themes

  1. Reader dispatch chain has zero crash protection (F06). reader.py is a ~900-line unguarded elif chain. A single parser bug, a single truncated frame, or a firmware-side format change tears down handle_rx silently because the reader runs as a detached task (F05) with no try/except. Fixing F06 alone neutralizes the crash surface of F10, F11, N07, N08, R01, R02, NEW-B, and NEW-C — the single highest-leverage change in the SDK and the reason G1 leads the fix order.

  2. expected_events lists systemically omit EventType.ERROR. Command wrappers subscribe to the success event type but never to ERROR, so firmware error responses either hang for the 5 s DEFAULT_TIMEOUT or cause the caller to index into an error-dict payload and crash. This pattern shows up in F21/M01 (send_msg_with_retry), M02 (send_appstart bootstrap path), and M04 (six device.py setters). G2 fixes this systemically.

  3. Reconnect path is broken end-to-end. F01 (TCP connect() returns a Future), F02 (no send_appstart() after transport reconnect), F03 (self-overwriting reconnect task), and N11 (tautological success check) combine to produce the exact "reconnect storm" and "serialization error" symptoms previously reported as meshcore-ha issue #204. G3 fixes the entire chain in one pass.

  4. Fire-and-forget tasks everywhere (F05). Six asyncio.create_task(...) sites across the three transports and the event dispatcher do not retain task references. On Python < 3.11 these can be GC'd mid-execution; on any Python, exceptions inside them are swallowed as "Task exception was never retrieved." G5 cleans this up.

  5. Short-read silent corruption. io.BytesIO.read(N) does NOT raise on EOF — it returns whatever is left, and int.from_bytes(b"", ...) returns 0. Several parser sites rely on implicit length guards that don't actually guard (N07, NEW-C, R02, and the ADVERT branch of NEW-B). The failure mode is silent garbage, not exceptions — and therefore even harder to diagnose. All cleaned up in G1.


Findings — organized by fix-group

Each fix-group below should be addressed in a single PR (or a small cluster of related PRs). Severity tags appear inline on every finding so you can still triage within a group. Within each group, findings are listed in approximate fix-order — umbrella fixes first, then the specific items they cover.


G1 — Reader / parser crash-safety

Theme. The reader's giant elif chain has no outer crash protection, and several parser sites read past the end of buffers without raising. Wrapping reader.py:65-921 and the meshcore_parser.py entry points in catch-all try/except blocks is the single highest-leverage change in the SDK. Tighten length guards on the silent-corruption sites in the same PR.

Recommended PR order within the group: F06 (umbrella wrap) first, then the individual parser fixes. The umbrella wrap turns most of the others from crashes into logged warnings, so even partial completion of the group is safe.


F06 — handle_rx parse errors silently crash per-frame tasks with no visibility — S2

File/line: src/meshcore/reader.py:65-921

Impact. The only crash protection in the ~900-line elif dispatch is a single try/except IndexError wrapping dbuf.read(1)[0] at lines 67-71. Everything after is unguarded. IndexError, struct.error, NameError (F10, F11), KeyError, etc. all tear down the detached handle_rx task, and because the task is detached (F05), asyncio swallows the exception as "Task exception was never retrieved." A single malformed frame can silently drop any legitimate frame parsed in the same task generation.

Recommended fix. Wrap the entire handle_rx body in a try/except that logs the full traceback plus the raw frame hex. This one change neutralizes the crash surface of F10, F11, N07, N08, R01, NEW-B, and NEW-C.


F10 — reader.py:556 pbuf.read(6).hex() NameError on every PUSH_CODE_LOGIN_FAILS1

File/line: src/meshcore/reader.py:549-562

elif packet_type_value == PacketType.LOGIN_FAILED.value:
    res = {}
    attributes = {}
    dbuf.read(1)
    if len(data) > 7:
        res["pubkey_prefix"] = pbuf.read(6).hex()   # ← pbuf undefined
    ...

Impact. Firmware emits PUSH_CODE_LOGIN_FAIL as a fixed 8-byte frame, which trivially satisfies len(data) > 7. Every remote authentication failure hits the pbuf reference and raises NameError. Because handle_rx runs as a detached task (F05) with no overarching try/except (F06), the exception is swallowed silently. LOGIN_FAILED is never delivered to subscribers; any code awaiting it times out.

Recommended fix. pbufdbuf on line 556. One-character fix. The sibling LOGIN_SUCCESS handler a few lines above already uses dbuf.read(6).hex() correctly.


F11 — reader.py:718 except e: is a broken except clause — S3

File/line: src/meshcore/reader.py:705-725 (ALLOWED_REPEAT_FREQ handler)

try:
    while cont:
        min = int.from_bytes(dbuf.read(4), "little", signed=False)
        ...
except e:                 # ← 'e' is undefined
    print(e)

Impact. except e: expects an exception class, not a variable name. If the try body ever raises, Python raises NameError on the except clause and tears down handle_rx. Currently unreachable in practice (int.from_bytes(b"", ...) returns 0 without raising), but a future edit that introduces anything raisable in the try body triggers the trap.

Recommended fix. except e:except Exception as e:, and replace print(e) with a logger.warning call.


N07 — BATTERY handler produces silent value corruption on truncated frame — S3

File/line: src/meshcore/reader.py:341-347

elif packet_type_value == PacketType.BATTERY.value:
    ...
    if len(data) > 3:
        result["used_kb"] = int.from_bytes(dbuf.read(4), "little", signed=False)
        result["total_kb"] = int.from_bytes(dbuf.read(4), "little", signed=False)

Impact. Firmware emits 11 bytes for a full RESP_CODE_BATT_AND_STORAGE. The len > 3 guard passes a short 4–10 byte frame through, and io.BytesIO.read(N) does NOT raise on short reads — it returns the residual bytes, and int.from_bytes(b"", ...) returns 0. Failure mode is silent corruption, not a crash. A truncated frame produces bogus used_kb and total_kb = 0 values. (Earlier audits incorrectly said this raised — corrected mechanism.)

Recommended fix. Tighten the check to len(data) >= 11 before reading the storage fields.


N08 — CONTROL_DATA handler reads payload[0] without bounds check — Info

File/line: src/meshcore/reader.py:882

Impact. payload = dbuf.read(); payload_type = payload[0] with no prior len(payload) > 0 check. IndexError on a zero-length payload. Unguarded per F06.

Recommended fix. Add if len(payload) == 0: return.


NEW-C — STATUS_RESPONSE push path has no length guard — S3

File/line: src/meshcore/reader.py:564-575

Impact. The BINARY_RESPONSE path at reader.py:750 gates parse_status(...) with len(response_data) >= 52. The STATUS_RESPONSE push path at 565 calls parse_status(data, offset=8) with no length check. parse_status reads up to 60 bytes via slices; short frames silently produce int.from_bytes(b"", ...) = 0 for every missing field — same silent-garbage class as N07. A truncated STATUS_RESPONSE produces a fully-populated dict filled with zeros rather than raising. HA sensor telemetry silently goes to zero.

Recommended fix. Add if len(data) < 60: logger.debug(...); return in front of the parse_status call, matching the gate at line 750.


R01 — parsePacketPayload crashes on short LOG_DATA payload — S3

File/line: src/meshcore/meshcore_parser.py:47-58, called from reader.py:610

Impact. reader.py LOG_DATA branch only requires len(data) > 3 before calling parsePacketPayload(payload, log_data). With len(data) == 4, payload is exactly 1 byte. Inside the parser, header = pbuf.read(1)[0] succeeds, but the subsequent path_byte = pbuf.read(1)[0] reads nothing and raises IndexError. io.BytesIO.read(4) for transport_code returns b"" silently (same class as N07).

Recommended fix. Validate len(payload) >= minimum before parsing, or wrap parsePacketPayload in a try/except in the reader.


NEW-B — parsePacketPayload ADVERT branch IndexError on short advert — S3

File/line: src/meshcore/meshcore_parser.py:152-170

Impact. Same failure class as R01. pk_buf.read(1)[0] at line 156 (flags byte) IndexErrors when the advert payload is shorter than 101 bytes (the ADVERT header before flags). Optional lat/lon/feat1/feat2 reads at 163-168 then silently produce zeros on short reads (same silent-garbage issue as N07). One malformed advert crashes the log parser; consumers lose packet processing for the duration of the handler.

Recommended fix. Wrap the ADVERT branch in try/except (IndexError, ValueError) with a debug log, consistent with the defensive pattern elsewhere in parsePacketPayload.


R02 — MeshcorePacketParser decrypts txt_type from a zero-length slice — Info

File/line: src/meshcore/meshcore_parser.py:131

attempt = uncrypted[4] & 3
txt_type = int.from_bytes(uncrypted[4:4], "little", signed=False) >> 2

Impact. uncrypted[4:4] is the empty slice; int.from_bytes(b"") returns 0, so txt_type is always 0 in decrypted channel messages. No exception, just a silently corrupted field on every decrypted packet. Almost certainly a typo for uncrypted[4:5].

Recommended fix. uncrypted[4:4]uncrypted[4:5]. Add a parser unit test that decrypts a known plaintext with non-zero txt_type.


F12 — Leftover print(res) debug statement — Info

File/line: src/meshcore/reader.py:530 (RAW_DATA handler)

Cosmetic. Pollutes stdout during RAW_DATA events. Surrounding lines correctly use logger.debug.

Recommended fix. Remove or replace with logger.debug(res).


N10 — Reader uses magic numbers 16 and 17 instead of enum values — Info

File/line: src/meshcore/reader.py:223, 288

Impact. PacketType.CONTACT_MSG_RECV_V3 = 16 and CHANNEL_MSG_RECV_V3 = 17 are declared in packets.py:94-95, but reader.py hardcodes the integer literals 16 and 17. Drift risk if the enum is ever renumbered.

Recommended fix. Replace the literals with PacketType.X.value.


G2 — Error-response handling

Theme. Command wrappers systematically subscribe to the success event type but never to EventType.ERROR. The result is either a 5-second hang on DEFAULT_TIMEOUT or a KeyError from indexing into an error dict. This is one mechanical pattern repeated in multiple files. Sweep every await self.send(..., [EventType.X]) site, add EventType.ERROR to the list, and add .type guards before keyed payload access. F22 (the API shape) is the umbrella; the others are concrete instances.


F22 — wait_for_event can match only a single EventType; callers must inspect .typeS2

File/line: src/meshcore/events.py:238, src/meshcore/commands/base.py:88-132

Impact. wait_for_event takes one event type; wait_for_events and send return the first matched event without disambiguation. Callers who pass [X, ERROR] must inspect .type on the return value, and many don't. F21/M01, M02, M04, and N06 are direct consequences.

Recommended fix. Either extend wait_for_event to accept a list, or document the .type-check contract on every caller. The latter is smaller and pairs naturally with the call-site fixes below.


F21 — send_msg_with_retry falls through to KeyError on ERROR response (= M01) — S2

File/line: src/meshcore/commands/messaging.py:147-150

result = await self.send_msg(dst, msg)
if result.type == EventType.ERROR:
    logger.error(f"Error sending message: {result.payload}")
exp_ack = result.payload["expected_ack"].hex()   # KeyError on error path

Impact. The error-check logs but does not return. Execution falls through to result.payload["expected_ack"].hex(), which raises KeyError because the ERROR payload is {"reason": "..."}. The retry loop never runs; the caller sees an exception instead of the intended retry behavior. M01 and the re-worded F21 describe the same code at the same site — deduplicated here.

Recommended fix. After the error log, either return result or continue the retry loop with back-off.


M02 — send_appstart does not include EventType.ERROR in expected events — S2

File/line: src/meshcore/commands/device.py:15-16

return await self.send(b"\x01\x03      mccli", [EventType.SELF_INFO])

Impact. If the firmware returns RESP_CODE_ERR (version mismatch, unsupported feature flag), wait_for_event never matches and the command hangs until DEFAULT_TIMEOUT (5 s) fires. Since send_appstart is called on every initial connect, this converts a recoverable error into a 5-second blocking hang on the bootstrap path.

Recommended fix. Add EventType.ERROR to the expected-events list and check .type on return.


M04 — Six device.py setters KeyError when send_appstart returns ERROR — S2

File/line: src/meshcore/commands/device.py:131-159 (set_telemetry_mode_base, set_telemetry_mode_loc, set_telemetry_mode_env, set_manual_add_contacts, set_advert_loc_policy, set_multi_acks)

infos = (await self.send_appstart()).payload
infos["feat"] = ...    # writes succeed on an error dict
... set_other_params_from_infos(infos)   # KeyError on infos["telemetry_mode_base"] & 0b11

Impact. None of these methods check result.type. On an ERROR path the dict mutation succeeds; the crash happens one frame deeper inside set_other_params_from_infos when it bitmasks fields the error dict doesn't contain. User-visible symptom is a random KeyError from what looks like an unrelated setter.

Recommended fix. Check result.type != EventType.ERROR before accessing keyed fields. Better still, fetch self_info from cached state rather than round-tripping appstart per setter.


N06 — send_anon_req() crashes with TypeError when contact prefix not found — S3

File/line: src/meshcore/commands/base.py:266-274

Impact. After contact = self._get_contact_by_prefix(...), the code logs logger.error("No contact found") if contact is None — but doesn't return. Execution falls through to contact["out_path_len"], which raises TypeError: 'NoneType' object is not subscriptable. Per F06, the exception unwinds out of the async task and is absorbed by the event loop; the caller's await hangs for DEFAULT_TIMEOUT.

Recommended fix. return Event(EventType.ERROR, {"reason": "contact_not_found"}) immediately after the log line. Same pattern as G2's other fixes.


F14 — commands/messaging.py:258 references e outside any except block — S3

File/line: src/meshcore/commands/messaging.py:258 (else branch of path-flag dispatch in send_trace)

Impact. logger.error(f"Invalid path format: {e}") in a branch where e is not in scope. Reachable only on an unrecognized path_hash_len. Latent NameError on the error path. Same flavor of broken error-handling as F21 — logging an error but failing while doing it.

Recommended fix. Capture the path-format error locally or restructure so e is bound in the scope that references it.


F20 — Unreachable return after exception handlers — S3

File/line: src/meshcore/commands/base.py:132return Event(EventType.ERROR, {})

Impact. Placed after the try/except/except block at the end of wait_for_events. Every branch already returns; this line is dead code. Worth removing in the same pass that touches wait_for_events for the F22 / G2 work.

Recommended fix. Delete.


G3 — Reconnect path

Theme. The reconnect chain is broken end-to-end and is the source of meshcore-ha issue #204 ("Future serialization errors" + "reconnect storm"). All four findings in this group should be fixed in a single PR — splitting them risks an intermediate state where reconnect partially works but produces wrong-shaped events. Inject a reconnect callback into ConnectionManager, restructure _attempt_reconnect into a single internal loop, revert the TCP Future wrap, and unify the success contract across the three transports.


F01 — TCP connect() returns asyncio.Future instead of host string — S1

File/line: src/meshcore/tcp_cx.py:52-65

async def connect(self):
    loop = asyncio.get_running_loop()
    await loop.create_connection(
        lambda: self.MCClientProtocol(self), self.host, self.port
    )
    logger.info("TCP Connection started")
    future = asyncio.Future()
    future.set_result(self.host)
    return future

Impact. ConnectionManager.connect() and _attempt_reconnect put the returned value directly into the CONNECTED event payload. On TCP the payload contains {"result": <asyncio.Future>}, and any downstream serializer that walks it hits the C-implemented _asyncio.Future (no __dict__, no JSON encoder path) and crashes. In meshcore-ha this manifests as the recorder's sanitize_event_data() failing on the fallthrough hasattr(data, "__dict__") branch. Both BLE and serial return plain strings; only TCP wraps. Introduced as an orphan 3-line hunk in commit 6fbf158 (2025-04-14).

Recommended fix. Revert to return self.host. Four lines changed, one file.


F02 — ConnectionManager._attempt_reconnect never calls send_appstart()S1

File/line: src/meshcore/connection_manager.py:111-150

Impact. Firmware requires CMD_APP_START after every transport-level connection to initialize the session. MeshCore.connect() does this; _attempt_reconnect bypasses MeshCore and calls self.connection.connect() directly — no handshake, no SELF_INFO refresh. On TCP, TCPConnection.send() then fires tcp_no_response after 5 unanswered sends, handle_disconnect re-enters _attempt_reconnect, and the reconnect storm begins. Present since commit dabc343 (2025-06-30); every release from v2.0.0 onward carries it.

Recommended fix. Inject a reconnect callback into ConnectionManager.__init__. MeshCore supplies its own reconnect routine that calls send_appstart() after the transport layer returns. Preserves the transport-agnostic abstraction.


F03 — _reconnect_task reference overwrite race — S2

File/line: src/meshcore/connection_manager.py:100, 134, 145

Impact. _attempt_reconnect tail-recurses via asyncio.create_task and re-assigns self._reconnect_task from inside its own running coroutine. The current task becomes orphaned — disconnect() cancels only the newest pointer, leaving a previous-generation attempt in flight. That in-flight attempt can succeed and set _is_connected = True after the caller thinks the session is closed. On Python < 3.11 the orphaned task can also be GC'd mid-execution.

Recommended fix. Restructure _attempt_reconnect into a single while self._reconnect_attempts < self.max_reconnect_attempts: loop that holds one persistent task for the whole reconnect session. The G3 PR should do this at the same time as F02 — both touch the same code.


N11 — ConnectionManager success check is tautological for TCP and serial — Info

File/line: src/meshcore/connection_manager.py:122-123

Impact. _attempt_reconnect checks if result is not None: to decide whether reconnect succeeded. TCP's connect() returns a resolved Future (never None); serial's connect() returns self.port (always truthy). Only BLE can return None. Combined with F01, the success check effectively means "did this raise an exception" — not what the code reads as.

Recommended fix. Unify the return contract across the three transports — return bool, or raise on failure. Update _attempt_reconnect to use a clearly-defined success condition. This belongs in the G3 PR because it touches the same _attempt_reconnect code.


G4 — Transport symmetry & fall-throughs

Theme. TCP, BLE, and serial transports diverge on how they signal failure, how they handle transport.write() errors, and how they recover from oversize frames. Make the three transports symmetric: every send() path should fire _disconnect_callback(reason) on a dead transport, every write() should be wrapped in try/except, and the oversize-frame recovery path should not fall through to dispatching empty frames. Bound the serial connect-wait with a timeout while the transport files are open.


F04 — Transport send() failure signaling is asymmetric; transport.write() has no try/except — S2

File/line: src/meshcore/tcp_cx.py:119-140, src/meshcore/serial_cx.py:125-132, src/meshcore/ble_cx.py:171-178

Impact. Only TCP invokes _disconnect_callback("tcp_transport_lost") / _disconnect_callback("tcp_no_response") from its send() path. BLE and serial return silently on a dead transport. transport.write(pkt) is also unwrapped in all three transports, so an underlying OSError or ConnectionResetError propagates to the caller instead of triggering reconnect. Consumers watching DISCONNECTED events will never see the failure on BLE or serial when a send() is attempted against a dead transport.

Recommended fix. Wrap transport.write() in try/except and fire _disconnect_callback(reason) on failure in all three transports, matching TCP's pattern. NEW-A is the serial-specific instance of this same problem.


NEW-A — SerialConnection.send swallows transport-lost without firing disconnect — S3

File/line: src/meshcore/serial_cx.py:125-132

Impact. TCP's send() fires _disconnect_callback("tcp_transport_lost") when self.transport is falsy. Serial's send() just logs and returns. Loss of the serial transport between frames is not observed until connection_lost fires from the protocol side — which doesn't happen on all failure modes (e.g. some USB drivers on pull-out). Symmetric transports should have symmetric error signalling. Same-PR with F04.

Recommended fix. Mirror TCP's disconnect-callback behavior when self.transport is missing.


F18 — serial_cx.py:69 await self._connected_event.wait() has no timeout — S2

File/line: src/meshcore/serial_cx.py:69

Impact. After create_serial_connection, connect() awaits _connected_event.wait() with no timeout. If the serial device opens but connection_made is never called (driver bug, USB adapter glitch), connect() hangs indefinitely. Common on RPi USB-serial deployments.

Recommended fix. Add a timeout argument defaulting to ~10 s; raise asyncio.TimeoutError on expiry.


M06 — Oversize-frame recovery falls through to empty dispatch — S3

File/line: src/meshcore/tcp_cx.py:92-99, src/meshcore/serial_cx.py:98-105

if self.frame_expected_size > 300:   # invalid size
    self.header = b""
    self.inframe = b""
    self.frame_expected_size = 0
    if len(data) > 0:
        self.handle_rx(data)
        return
# falls through here with inframe = b""

Impact. When an oversize header is detected and data is empty after the header, the method resets state and falls through to upbound = self.frame_expected_size - len(self.inframe) == 0, then calls reader.handle_rx(b""). The reader's try/except IndexError at lines 67-71 absorbs this specific empty-frame case, so the current behavior is a wasted dispatch, not a crash. However, if the IndexError guard is ever moved or any branch is reordered before it, this becomes a guaranteed crash.

Recommended fix. Add a bare return after the reset when data is empty. Two-line fix in both transports.


F16 — BLE handle_disconnect resets client without re-attaching disconnect callback — S2

File/line: src/meshcore/ble_cx.py:146-155

Impact. On disconnect the handler resets self.client to self._user_provided_client (the initially passed-in BleakClient). That client's disconnected_callback is not re-registered, so a subsequent BLE disconnect will not trigger the connection manager's handle_disconnect. After one successful reconnect cycle, subsequent disconnects may be missed. Severity nuance: the next connect() call does re-wrap the client and re-install the callback, so the loss is narrower than first reported — state tracking is lost between disconnect and the next connect attempt.

Recommended fix. Re-register disconnected_callback on the reset client, or preserve a single self.client and reset its internal state instead of swapping references.


F17 — BLE pairing failure is swallowed with a warning — S3

File/line: src/meshcore/ble_cx.py:113-121

Impact. If BLE pairing fails during connect(), the exception is caught as a warning and connect() returns normally. The transport may be in a half-usable state (connected but unpaired). Subsequent sends that require paired state fail with confusing errors. Consumer sees "connected" but no data flowing.

Recommended fix. Either re-raise or return a pairing-failed sentinel that ConnectionManager surfaces in the CONNECTED event payload.


N04 — TCP disconnect detection counts TCP segments, not MeshCore frames — S3

File/line: src/meshcore/tcp_cx.py:41

Impact. _receive_count += 1 inside data_received() increments per TCP read, not per completed MeshCore frame. The threshold heuristic _send_count - _receive_count >= 5 is therefore skewed by fragmentation: bunched frames inflate imbalance (false disconnect on catch-up bursts), split frames mask real imbalance (stall not detected).

Recommended fix. Move the counter into handle_rx() so it increments per completed frame, or replace the counter with a periodic keepalive/timeout model.


G5 — Asyncio lifecycle hygiene

Theme. The SDK uses asyncio.create_task without retaining references in seven places, calls task_done() before the dispatched callback completes, binds queue/lock primitives to the wrong loop, and uses the deprecated get_event_loop(). These are all hygiene issues with a common fix pattern: track tasks in a set, defer primitive construction to start(), and replace deprecated APIs. One PR touching every transport plus events.py and commands/base.py.


F05 — Pervasive fire-and-forget asyncio.create_task with no stored reference — S2

File/line (six sites):

  1. src/meshcore/tcp_cx.py:50 — disconnect callback in connection_lost

  2. src/meshcore/tcp_cx.py:111self.reader.handle_rx(self.inframe) in handle_rx

  3. src/meshcore/ble_cx.py:158_disconnect_callback("ble_disconnect")

  4. src/meshcore/ble_cx.py:169self.reader.handle_rx(data)

  5. src/meshcore/serial_cx.py:47_disconnect_callback("serial_disconnect")

  6. src/meshcore/serial_cx.py:117self.reader.handle_rx(self.inframe)

  7. src/meshcore/events.py:200_execute_callback for coroutine subscribers

    Impact. Python's asyncio docs explicitly warn that create_task return values must be retained because the event loop holds only a weak reference. Every inbound frame is parsed by a new detached task (sites 2/4/6), and every async subscriber callback is dispatched via site 7. Under GC pressure on Python < 3.11 any of these can be silently cancelled mid-execution. Exceptions thrown inside them are also swallowed as "Task exception was never retrieved."

    Recommended fix. Introduce self._background_tasks: set[asyncio.Task] = set() on each transport and on EventDispatcher, add each created task to it, and use task.add_done_callback(self._background_tasks.discard) to auto-remove on completion. Standard pattern.


F07 — EventDispatcher marks queue items done before async callbacks complete — S2

File/line: src/meshcore/events.py:200, 207

for handler in handlers:
    if asyncio.iscoroutinefunction(handler):
        asyncio.create_task(self._execute_callback(handler, event))
    else:
        handler(event)
self.queue.task_done()   # line 207 — async callback may still be running

Impact. For async callbacks, _dispatch_loop fires a task and immediately calls queue.task_done() without waiting. The await queue.join() in EventDispatcher.stop() returns as soon as all queued items are marked done, even if their async callbacks are still executing. On shutdown, stop() can return and higher-level disconnect logic can proceed while async handlers are still touching objects that are being torn down. One source of the "Task was destroyed but it is pending" warning in downstream consumers.

Recommended fix. Track in-flight callback tasks and await them on stop, or only call task_done() after the callback completes. Pairs naturally with the F05 background-task set.


F08 — EventDispatcher / CommandHandlerBase bind asyncio.Queue / Lock to the import-time loop — S2

File/line: src/meshcore/events.py:133, src/meshcore/commands/base.py:67

Impact. On Python 3.9/3.10, asyncio.Queue() and asyncio.Lock() bind to the running event loop at construction time. If the SDK is constructed from a sync factory before an event loop exists, or from a different loop than the one that later uses it, both primitives become loop-bound to the wrong loop and raise RuntimeError: <Queue ...> is bound to a different event loop on first use. On Python 3.10+ the primitives are loop-agnostic, so this is a compat landmine rather than a present failure.

Recommended fix. Defer queue and lock construction until start() is called, or document "must be constructed from within an async context."


F19 — asyncio.get_event_loop() inside an async function (deprecated) — S3

File/line: src/meshcore/commands/base.py:173

Impact. Since Python 3.10 this emits DeprecationWarning; in Python 3.12 it raises in some contexts. The call is inside send() which is always async, so get_running_loop() is the correct replacement.

Recommended fix. get_event_loop()get_running_loop(). One-word change.


G6 — Missing protocol surface / feature gaps

Theme. The firmware emits push frames the SDK doesn't recognize, has command opcodes the SDK doesn't wrap, and accepts a request shape the SDK builds incorrectly. None of these crash anything — they're features that are silently inaccessible. Worth fixing as one PR that adds the missing enum entries, reader branches, and command wrappers in matched pairs. Verify against MyMesh.cpp before submitting.


N01 — PUSH_CODE_CONTACT_DELETED (0x8F) has no SDK handler — S2

File/line: src/meshcore/packets.py:122 (declaration), src/meshcore/reader.py (no handler branch)

Impact. Firmware emits PUSH_CODE_CONTACT_DELETED (1-byte code + 32-byte pubkey, 33 bytes total) from MyMesh::onContactOverwrite() (MyMesh.cpp:325-334) whenever the oldest contact is overwritten after the contact table fills. The enum entry exists but no elif matches it in reader.py, so the frame falls through to the final else: logger.debug("Unhandled...") branch. Home Assistant has no way to learn that a contact was deleted — the contact cache goes silently stale.

Recommended fix. Add a PacketType.CONTACT_DELETED.value handler that dispatches EventType.CONTACT_DELETED with the pubkey payload.


N02 — PUSH_CODE_CONTACTS_FULL (0x90) not defined in SDK enum, unhandled — S2

File/line: src/meshcore/packets.py (missing entry)

Impact. Firmware defines PUSH_CODE_CONTACTS_FULL = 0x90 and emits a 1-byte push from onContactsFull() (MyMesh.cpp:336) when the contacts table reaches capacity. The SDK's PacketType enum stops at CONTACT_DELETED = 0x8F; no entry for 0x90. Numeric value 0x90 also happens to match ControlType.NODE_DISCOVER_RESP in a different namespace — a maintenance hazard but not a literal conflict.

Recommended fix. Add PacketType.CONTACTS_FULL = 0x90 and a reader branch that dispatches EventType.CONTACTS_FULL.


N03 — CMD_GET_TUNING_PARAMS (43) / RESP_CODE_TUNING_PARAMS (23) unhandled — S3

File/line: src/meshcore/packets.py:63, packets.py:101, no handler in reader.py, no wrapper in commands/device.py

Impact. Firmware emits a 9-byte RESP_CODE_TUNING_PARAMS (MyMesh.cpp:1307-1313) with rx_delay and airtime_factor. The enum entries exist but the SDK has no wrapper to request and no reader branch to parse — the feature is inaccessible from Python even though both ends have the code.

Recommended fix. Add DeviceCommands.get_tuning() (send b"\x2b") and a TUNING_PARAMS reader branch.


N05 — send_trace() emits 10 bytes; firmware requires len > 10 strict — S3

File/line: src/meshcore/commands/messaging.py:218-298, MyMesh/MyMesh.cpp:1620

Impact. SDK builds cmd(1) + tag(4) + auth(4) + flags(1) = 10 bytes when path is empty. Firmware uses strict len > 10, so the 10-byte form is silently rejected — no error frame, no ACK. SDK hits DEFAULT_TIMEOUT and returns. send_trace() with no path (the default single-hop usage) never lands.

Recommended fix. Pad the SDK's trace packet by one zero byte when path is empty. Safer than changing firmware's gate.


N09 — Five firmware CMDs have no SDK command wrapper — S3

File/line: src/meshcore/packets.py:45-66, src/meshcore/commands/*.py

Impact. CommandType enum declares SEND_RAW_DATA = 25, HAS_CONNECTION = 28, GET_CONTACT_BY_KEY = 30, GET_TUNING_PARAMS = 43, FACTORY_RESET = 51. None of these have user-facing methods in any commands/*.py file. Feature-gap. FACTORY_RESET in particular is a destructive operation that deserves a confirm-style API rather than hand-crafted frames.

Recommended fix. Implement wrappers for each. GET_TUNING_PARAMS is also covered by N03 above.


R04 — commands/device.py uses literal command bytes despite CommandType enum (CMD_GET_STATS missing entirely) — Info

File/line: src/meshcore/commands/device.py (multiple), src/meshcore/packets.py:20-74

Impact. Most device commands use literal bytes (b"\x07", b"\x06", b"\x38\x00", etc.) instead of CommandType.X.value. More notably, CMD_GET_STATS = 56 (0x38) is missing entirely from the CommandType enum, but three wrappers reference it (get_stats_core, get_stats_radio, get_stats_packets). This is the inverse of N09: N09 is enum entries with no wrapper; R04 is wrappers that bypass (or contradict) the enum.

Recommended fix. Add CMD_GET_STATS = 56 to CommandType. Either replace literals with enum references everywhere or document that the enum is informational. Belongs in the G6 PR because it touches the same enum file.


G7 — Standalone bugs and cleanup

Theme. Items that don't cluster naturally with the other groups. Most are localized one-line or one-method fixes. F13 is the only S1 in this group — it's standalone because it's a deprecated public API that no other finding overlaps with. Bump DEFAULT_TIMEOUT while you're in commands/base.py for the F19 fix in G5. Do this PR last (or batch it with G5 since several touch the same files).


F13 — commands/binary.py:78 req_mma NameError on undefined start, endS1

File/line: src/meshcore/commands/binary.py:76-78

async def req_mma(self, contact, timeout=0, min_timeout=0):
    logger.error("*** please consider using req_mma_sync instead of req_mma")
    return await self.req_mma_sync(contact, start, end, timeout, min_timeout)

Impact. The legacy wrapper takes (contact, timeout, min_timeout) — no start or end — then immediately calls req_mma_sync(contact, start, end, ...). Neither name is defined. Any caller using this deprecated public API crashes with NameError: name 'start' is not defined on the first call.

Recommended fix. Either add start and end to the signature and forward them, or delete the deprecated method entirely. The migration warning suggests deletion.


F09 — DEFAULT_TIMEOUT = 5.0 too short for slow-path operations — S3

File/line: src/meshcore/commands/base.py:61

Impact. Used for every send() call path. Too short for high-latency mesh operations (path-resolving messaging, long binary responses, remote auth). Also the source of the "hanging tests" reported in an earlier handoff — test_send_appstart and test_send_advert aren't hanging; they're falling through to this 5 s timeout because their mock dispatchers don't wire up matching responses. pytest --timeout=30 runs both to completion.

Recommended fix. Bump to 15 s (or expose per-call override more prominently). Fix the test harness to wire mock responses or pass timeout=0.1.


M03 — set_flood_scope can raise UnboundLocalErrorS3

File/line: src/meshcore/commands/messaging.py:300-323

if scope is None:
    scope_key = b"\0" * 16
elif isinstance(scope, str):
    scope_key = bytes.fromhex(scope)
elif isinstance(scope, bytes):
    scope_key = bytes(scope)
# no else
...
return await self.send(b"\x36\x00" + scope_key, [...])

Impact. When scope is any other type (int, bytearray, contact dict), scope_key is never bound and the concatenation raises UnboundLocalError instead of a clearer TypeError.

Recommended fix. Add else: raise TypeError(...) branch.


M05 — update_contact path-hash-mode shift is dead code — Info

File/line: src/meshcore/commands/contact.py:119

Impact. path_hash_mode = contact["out_path_len"] >> 6 — but reader.py:112 masks out_path_len & 0x3F before storage, so the high bits are always zero and path_hash_mode is always 0. Lines 120-125 then override it via send_device_query. Dead code today; latent trap if the masking rule changes elsewhere without touching this file.

Recommended fix. Remove the line, or move the mask into a constant both sites share.


M07 — get_contacts returns None but is typed -> EventInfo

File/line: src/meshcore/commands/contact.py:20, 46, 53, 67

Impact. Three paths return None (timeout at line 46, event is None at line 53, asyncio.TimeoutError at line 67); one path returns Event(EventType.ERROR, ...); the happy path returns the event. Annotation says -> Event. The only in-tree caller is meshcore.py:486, which discards the return value, so no current crash — but downstream consumers that do result.type == ... will AttributeError on None.

Recommended fix. Annotate as Optional[Event], or normalize the None branches to Event(EventType.ERROR, {"reason": "timeout"}).


R03 — send_binary_req registers the pending request after send() returns — race window — Info

File/line: src/meshcore/commands/base.py:240-259

Impact. After send() returns MSG_SENT, the code calls self._reader.register_binary_request(...). If a BINARY_RESPONSE arrives between the send() return and the registration call, the reader logs "No tracked request found" and does NOT dispatch the type-specific event (STATUS_RESPONSE, TELEMETRY_RESPONSE, etc.). The caller's subsequent wait_for_event(...) then times out. Theoretical for radio responses (round-trip takes seconds), but a TCP-companion proxy or wired path can return BINARY_RESPONSE very quickly and hit the sub-millisecond gap.

Recommended fix. Pre-register the binary request before send(), with a placeholder tag that gets patched once MSG_SENT returns. Or hold the dispatcher lock across both steps.


R05 — MeshCore.subscribe callback type annotation conflicts with dispatcher contract — Info

File/line: src/meshcore/meshcore.py:206-211 vs. src/meshcore/events.py:138-162

Impact. meshcore.py:209 types callbacks as Callable[[Event], Coroutine[Any, Any, None]] (async only). events.py:141 types them as Callable[[Event], Union[None, asyncio.Future]] (sync or async). The dispatcher's _process_events handles both. Type-checkers (mypy, pyright) will flag any sync handler the caller passes through MeshCore.subscribe, even though it works at runtime.

Recommended fix. Widen meshcore.py's annotation to match events.py.


Mitigated (recommend closure)


F15 — commands/base.py send() — subscribe / cancel / unsubscribe now correct

File/line: src/meshcore/commands/base.py:162-227

Status. The original audit rounds flagged send() as lacking per-subscription timeout and cancellation. Reading the current source: subscriptions are created via dispatcher.subscribe at lines 174-183, pending futures are cancelled at 202-203, and subscriptions are unsubscribed in a finally block at 224-227. The single outer asyncio.wait timeout is intentional — "first matching event wins." This pattern is correct. The helper wait_for_events at lines 88-132 is a separate code path that uses dispatcher.wait_for_event (which has its own try/finally), so it is also safe.

Recommendation. Mark F15 as closed in the next pass. No fix needed.


Cross-references to meshcore-ha impact

  • meshcore-ha issue #204 — "Future serialization errors" is F01; "reconnect storm" is F02. F03 compounds both. Downstream defensive fix: add an asyncio.Future branch to sanitize_event_data() in custom_components/meshcore/utils.py.
  • "Task was destroyed but it is pending" warnings in HA logs — F05 + F07.
  • Silent missed events under load — F05 garbage-collecting detached tasks; F06 swallowing parse exceptions; N01/N02 dropping push frames the SDK doesn't know about.
  • Hang on send_trace() with default args — N05.
  • Silent zero values on telemetry sensors — N07, NEW-C.

Methodology

All six audit rounds were strict read-only against meshcore_py/ at commit fbf84cb (v2.3.6) and MeshCore/examples/companion_radio/MyMesh.cpp at tag companion-v1.14.1. No source files were modified in any round. Each successive round re-read every previously-cited file and line, noted behavioral matches or divergences from the earlier wording, and performed an additional sweep for new issues. Three findings received wording corrections across rounds — the corrected mechanics are used in this combined report:

ID Original claim Corrected mechanism
N07 "A short/corrupt frame raises in the .read() calls" io.BytesIO.read(N) returns residual bytes without raising; int.from_bytes(b"") returns 0. Failure mode is silent corruption, not exception.
M07 "HA integration meshcore.py:486 crashes with AttributeError" That call site discards the return value, so no current crash. Contract violation is real but latent.
F21 "send_msg_with_retry doesn't reset subscriptions between attempts" Subscription management is correct; the real bug is that the error-log falls through to result.payload["expected_ack"] and KeyErrors.

One finding (F15) was flagged for closure: commands/base.py:send() now subscribes before writing, cancels pending futures on timeout, and unsubscribes in a finally block.


Appendix — Finding IDs by origin

  • F01–F22: introduced in rounds 1–2 (Forensics - meshcore_py SDK audit.md and ...firmware protocol audit.md). F21/F22 re-described in round 4; F21 is now the same bug as M01.

  • N01–N11: introduced in round 3 (...reverification...).

  • M01–M07: introduced in round 4 (...Re-Reverification...). M01 deduplicated against F21.

  • R01–R05: introduced in round 5 (...Re-Re-Reverification...).

  • NEW-A, NEW-B, NEW-C: introduced in round 6 (...Re-Re-Re-Reverification...).

    Total raw: 22 + 11 + 7 + 5 + 3 = 48. Deduplicated (−M01) and with F15 retired: 46 active unique findings.


Appendix — Severity index (cross-reference)

For triage by severity, the findings in each fix-group sort to:

  • S1 Critical (4): F01, F02, F10, F13 (in groups G3, G3, G1, G7)
  • S2 High (14): F03, F04, F05, F06, F07, F08, F16, F18, F21, F22, M02, M04, N01, N02 (in groups G3, G4, G5, G1, G5, G5, G4, G4, G2, G2, G2, G2, G6, G6)
  • S3 Medium (18): F09, F11, F14, F17, F19, F20, M03, M06, N03, N04, N05, N06, N07, N09, R01, NEW-A, NEW-B, NEW-C
  • Info / Low (10): F12, M05, M07, N08, N10, N11, R02, R03, R04, R05

Addendum — Test infrastructure issue discovered during remediation

Discovered during: Remediation work on G7 branch (feature/g7-standalone-bugs-and-cleanup)

This is not a forensic finding from the original audit rounds — it is a pre-existing test infrastructure defect discovered while validating the G1–G7 fix branches.

Summary

The upstream test suite (tests/unit/test_commands.py) requires an undocumented --timeout=60 flag from the pytest-timeout plugin (not listed in pyproject.toml dev dependencies) to complete without failures. Without it, 28 of 74 tests hang indefinitely. With it, the suite takes ~8 minutes — almost entirely dead-wait time.

Root cause: The mock_dispatcher fixture's fake_subscribe function records the event handler but never calls it. When command methods call send() with expected_events, the internal asyncio.wait() blocks for the full DEFAULT_TIMEOUT (15.0s after the F09 bump, previously 5.0s) waiting for futures that never resolve. The correct pattern (setup_event_response()) already exists in the same file and is used by exactly one test — it just isn't wired into the default fixture.

Impact on submission strategy

This issue affects test run times for all branches, so the fix is submitted upstream as its own PR before G1–G7. The standalone branch (feature/fix-test-timeout-waste) touches only tests/unit/test_commands.py — zero production code, zero overlap with any G1–G7 branch. Once merged, all subsequent PRs benefit from fast CI test runs (<5 seconds vs ~8 minutes).

The updated submission order and full technical details are documented in:

  • Proposed - Fix test_commands.py Timeout Waste.md — full investigation, root cause, and fix specification
  • Proposed - meshcore_py Forensics Remediation.md §3.4 — updated submission order (test fix → G1 → G3 → G2 → G4 → G5 → G6 → G7)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment