Forensics Report — Companion Device Communication Reliability

Date: 2026-04-01
Reporter: Matt Wolter
Status: Root causes identified and fixed across three coordinated PRs
Applies to: meshcore-ha Home Assistant integration communicating with MeshCore companion firmware over TCP

Executive Summary

The MeshCore-HA integration communicates with a companion radio device over a single TCP/Serial connection using a binary frame protocol. Three distinct but interrelated issues cause unreliable command execution — most visibly, managed repeater status timeouts after HA restarts and silently dropped sent messages. All three issues stem from a mismatch between the firmware's single-request architecture and the integration's concurrent command patterns.

The core constraint: The companion firmware tracks outstanding mesh requests with five global flags (pending_login, pending_status, pending_telemetry, pending_discovery, pending_req). Every outbound request handler calls clearPendingReqs() before storing its own tag — wiping all five flags. This means the firmware can only track one outstanding mesh request at a time. If any second request is sent before the first response arrives from the mesh, the first request's tracking is erased and its response is silently dropped.

Three PRs address the three layers of this problem:

| PR | Branch | What It Fixes |
| --- | --- | --- |
| PR 1 | `feature/fix-repeater-status` | Incorrect timeout handling and missing error checks on repeater/telemetry requests |
| PR 2 | `feature/conditional-msg-poll` | Unnecessary `get_msg()` polling every 5 seconds (~17,280 commands/day) that wastes link time and increases collision probability |
| PR 3 (Future) | `feature/command-serialization` | No mutual exclusion on the shared connection; many command call sites have no lock protection |

PRs 1 and 2 are independent of each other and can be reviewed/merged in any order. PR 3 depends on both (it builds on the SDK sync methods from PR 1 and the reduced polling from PR 2) and will be submitted after PRs 1 and 2 are merged.

Table of Contents

Firmware Architecture: Why This Matters
Issue 1: Repeater Status Timeouts
Issue 2: Excessive get_msg() Polling
Issue 3: No Command Serialization
How the Three Fixes Work Together
Event Dispatch Is Not Affected
Appendix: Firmware Pending Request Flags

1. Firmware Architecture

Source: MeshCore/examples/companion_radio/MyMesh.cpp, MeshCore/src/Dispatcher.cpp, MeshCore/docs/companion_protocol.md

Single-Threaded Main Loop

The companion firmware runs a single-threaded main loop with no preemption:

MyMesh::loop() {
  1. BaseChatMesh::loop()      // Radio: receive/transmit mesh packets
  2. checkSerialInterface()    // Serial/TCP: read one command frame, dispatch it
  3. Lazy contact writes       // Save contacts to flash if dirty
  4. UI connection status      // Update display
}

Key behaviors:

One command per iteration. checkSerialInterface() reads at most one complete frame per call. Two commands sent back-to-back are processed in consecutive loop iterations, not simultaneously.
Radio before serial. Incoming mesh packets (including those that trigger PUSH_CODE_MSG_WAITING) are processed before serial commands in the same iteration.
Non-blocking binary requests. When the firmware sends a mesh request (status, telemetry, login), it stores a tracking tag and returns immediately with RESP_CODE_SENT. The actual mesh response arrives asynchronously via the radio path and is matched against the stored tag.

The clearPendingReqs() Constraint

The firmware tracks outstanding requests with five per-type flags. Every outbound request handler calls clearPendingReqs() first:

void clearPendingReqs() {
    pending_login = pending_status = pending_telemetry = pending_discovery = pending_req = 0;
}

Called by: CMD_SEND_LOGIN, CMD_SEND_ANON_REQ, CMD_REQ_STATUS, CMD_REQ_TELEMETRY, CMD_SEND_PATH_DISCOVERY_REQ, CMD_SEND_BINARY_REQ.

Consequence: Only one pending mesh request of any type can exist. Sending a login wipes a pending status. Sending a status wipes a pending telemetry. The response to the wiped request is silently dropped when it arrives — the firmware has no tag to match it against.
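The consequence can be reproduced with a toy Python model. This is only a sketch of the behavior described above, not the firmware code: the class `PendingFlags` and its method names are hypothetical, and only the five flag names are taken from the firmware.

```python
FLAG_NAMES = (
    "pending_login",
    "pending_status",
    "pending_telemetry",
    "pending_discovery",
    "pending_req",
)


class PendingFlags:
    """Hypothetical model of the firmware's five pending-request flags."""

    def __init__(self):
        self.flags = dict.fromkeys(FLAG_NAMES, 0)

    def clear_pending_reqs(self):
        # Mirrors clearPendingReqs(): every flag is zeroed, whatever its type.
        for name in FLAG_NAMES:
            self.flags[name] = 0

    def send_request(self, flag_name, tag):
        # Every outbound handler clears all flags, then stores its own tag.
        self.clear_pending_reqs()
        self.flags[flag_name] = tag

    def on_response(self, flag_name, tag):
        # A response is accepted only if its tag is still being tracked.
        if tag != 0 and self.flags[flag_name] == tag:
            self.flags[flag_name] = 0
            return True
        return False  # no matching tag: the response is dropped


fw = PendingFlags()
fw.send_request("pending_status", 0x1111)     # status request goes out
fw.send_request("pending_telemetry", 0x2222)  # telemetry sent before the status reply
status_matched = fw.on_response("pending_status", 0x1111)
telemetry_matched = fw.on_response("pending_telemetry", 0x2222)
```

In this model the status response is rejected and the telemetry response is accepted, which is the silent drop described above.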

Companion Protocol Contract

The companion protocol documentation (companion_protocol.md) explicitly states:

Command-Response Matching:
Send one command at a time
Wait for a response before sending another command
Use a timeout (typically 5 seconds)
Match response to command by type

TCP/Serial Interface

The TCP/Serial server accepts one client at a time (new connections forcefully disconnect the existing client). It uses a 4-frame send queue — if the queue is full, frames are silently dropped. This is shared between command responses and asynchronous push notifications from the radio side.
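The effect of a bounded send queue can be sketched in a few lines of Python. The class `SendQueue` is hypothetical, and the model assumes that the newest frame is the one discarded when the queue is full; the report only states that frames are silently dropped, not which ones.

```python
from collections import deque


class SendQueue:
    """Hypothetical model of a fixed-capacity outbound frame queue."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.frames = deque()
        self.dropped = 0  # the real firmware drops silently; counted here for clarity

    def push(self, frame):
        # Assumption: when the queue is full, the new frame is discarded.
        if len(self.frames) >= self.capacity:
            self.dropped += 1
            return False
        self.frames.append(frame)
        return True

    def pop(self):
        # The transport drains frames in FIFO order.
        return self.frames.popleft() if self.frames else None


queue = SendQueue()
# Command responses and push notifications share the same four slots.
accepted = [queue.push(f"frame-{i}") for i in range(6)]
```

With six frames queued before the transport drains any, the last two are lost in this model, regardless of whether they were command responses or push notifications.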

2. Repeater Status Timeouts

PR 1: feature/fix-repeater-status

Symptom

After an HA core restart, managed repeaters fail to return status data:

Error requesting status from repeater ca.cv.main-st: None

The None is a timeout — no STATUS_RESPONSE event was received. Restarting HA core a second time resolves the issue.

Root Cause: Three Contributing Factors

Factor 1: No error checking on send.

The coordinator called send_binary_req() and discarded the return value:

# Before (broken):
await self.api.mesh_core.commands.send_binary_req(contact, BinaryReqType.STATUS)
# Return value discarded — no check for ERROR (no path, write failure, etc.)

result = await self.api.mesh_core.wait_for_event(
    EventType.STATUS_RESPONSE,
    attribute_filters={"pubkey_prefix": pubkey_prefix},
)
# Waits 5 seconds for a response that will never arrive if send failed

After an HA restart, the firmware may not have re-established mesh routes. If send_binary_req returns ERROR due to no valid path, the coordinator doesn't know and waits for a response that will never come.

Factor 2: Fixed 5-second timeout instead of firmware-suggested timeout.

The firmware calculates suggested_timeout from the path length to the destination and returns it in the RESP_CODE_SENT payload, but the coordinator ignored it and always waited a fixed 5 seconds. For multi-hop repeaters, 5 seconds is insufficient.

Factor 3: Imprecise response matching.

The coordinator filtered wait_for_event by pubkey_prefix instead of the expected_ack tag returned in RESP_CODE_SENT. A prefix filter accepts any STATUS_RESPONSE from that repeater, so it could in theory match a stale response from an earlier request.
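The three factors can be illustrated together with a self-contained sketch. Nothing here is the SDK: `FakeLink`, `SendResult`, and `request_status` are hypothetical stand-ins, the timeout is expressed in seconds as an assumption, and only the field names `expected_ack` and `suggested_timeout` are borrowed from the text above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SendResult:
    ok: bool
    expected_ack: bytes = b""
    suggested_timeout: float = 0.0  # seconds in this sketch (unit is an assumption)


class FakeLink:
    """Hypothetical stand-in for the device connection."""

    def __init__(self, has_path, reply_delay=0.01):
        self.has_path = has_path
        self.reply_delay = reply_delay
        self.responses = asyncio.Queue()

    async def send_status_request(self):
        if not self.has_path:
            return SendResult(ok=False)  # models an ERROR reply (no route)
        tag = b"\x01\x02\x03\x04"
        # A leftover response from an earlier request is already queued.
        self.responses.put_nowait((b"stale", {"bat_mv": 0}))
        loop = asyncio.get_running_loop()
        loop.call_later(self.reply_delay, self.responses.put_nowait,
                        (tag, {"bat_mv": 4100}))
        return SendResult(True, tag, self.reply_delay * 10)


async def request_status(link):
    sent = await link.send_status_request()
    if not sent.ok:
        return None  # Factor 1: fail fast instead of waiting for nothing

    async def wait_for_tag():
        while True:
            tag, payload = await link.responses.get()
            if tag == sent.expected_ack:  # Factor 3: match on the exact tag
                return payload

    try:
        # Factor 2: wait as long as the send reply suggests, not a fixed 5 s.
        return await asyncio.wait_for(wait_for_tag(), sent.suggested_timeout)
    except asyncio.TimeoutError:
        return None


async def main():
    reachable = await request_status(FakeLink(has_path=True))
    unreachable = await request_status(FakeLink(has_path=False))
    return reachable, unreachable
```

Calling `asyncio.run(main())` returns the fresh payload for the reachable link (the stale entry is skipped) and `None` immediately for the link with no path.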

Fix

Replace the manual send_binary_req + wait_for_event pattern with the SDK's purpose-built sync methods:

# After (fixed):
result = await self.api.mesh_core.commands.req_status_sync(contact)

req_status_sync() handles error checking, firmware-calculated timeout, and precise tag-based matching internally.

Additional fixes in PR 1:

send_login response check: The coordinator checked the return value for LOGIN_SUCCESS, which never matched (the return is MSG_SENT — the actual login result arrives asynchronously). Fixed to check for MSG_SENT, then wait_for_event(EventType.LOGIN_SUCCESS).
Log level: Lowered "Clearing message queue..." from INFO to DEBUG (logged every 5 seconds with no actionable content).
Deprecated API: Replaced logger.warn() with logger.warning() (4 occurrences).
Unused import: Removed BinaryReqType (no longer needed with sync methods).
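The send_login fix follows a two-phase pattern: the send call only confirms that the packet was queued, and the login result arrives later as an event. A minimal sketch of that pattern, using a hypothetical `FakeRepeaterLink` and plain string constants rather than the SDK's event types:

```python
import asyncio

MSG_SENT = "MSG_SENT"
LOGIN_SUCCESS = "LOGIN_SUCCESS"
LOGIN_FAIL = "LOGIN_FAIL"


class FakeRepeaterLink:
    """Hypothetical stand-in: the send reply and the login result are separate."""

    def __init__(self):
        self.events = asyncio.Queue()

    async def send_login(self, password):
        outcome = LOGIN_SUCCESS if password == "secret" else LOGIN_FAIL
        # The login result is delivered later, as an asynchronous event.
        asyncio.get_running_loop().call_later(0.01, self.events.put_nowait, outcome)
        return MSG_SENT  # the immediate reply never carries the login result


async def login(link, password, timeout=1.0):
    sent = await link.send_login(password)
    # Comparing `sent` with LOGIN_SUCCESS here would never match.
    if sent != MSG_SENT:
        return False
    try:
        event = await asyncio.wait_for(link.events.get(), timeout)
    except asyncio.TimeoutError:
        return False
    return event == LOGIN_SUCCESS


async def main():
    good = await login(FakeRepeaterLink(), "secret")
    bad = await login(FakeRepeaterLink(), "wrong")
    return good, bad
```

The 1-second timeout is arbitrary for the sketch; a real caller would pick a value suited to the mesh path.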

Why It Works on Second Restart

During the first boot's failed attempts, send_binary_req still transmits packets over the mesh, triggering route/path updates. By the second boot, paths are fresh. The firmware's mesh state was "warmed up" by the first boot.

3. Excessive get_msg() Polling

PR 2: feature/conditional-msg-poll

Symptom

The HA log shows "Clearing message queue..." every 5 seconds continuously. The coordinator calls get_msg() on every update cycle — 17,280 times per day — regardless of whether messages are waiting.

Root Cause

The integration has two message-fetching paths running simultaneously:

Polling path: Every 5-second coordinator update cycle calls get_msg() in a while loop until NO_MORE_MSGS.
Event-driven path: Subscribes to MESSAGES_WAITING firmware push events and calls async_flush_messages() to drain the queue immediately.

The event-driven path is correct and sufficient. The polling path is redundant — almost every call returns NO_MORE_MSGS immediately, but still occupies the serial/TCP link for the round-trip.

Impact

Link contention: Each unnecessary get_msg() competes with repeater status, telemetry, battery, and contact sync commands. On a busy mesh with multiple repeaters, this increases the probability of clearPendingReqs() collisions.
Log noise: INFO-level logging every 5 seconds buries meaningful entries.
Lock contention: The _message_lock is held during polling. If a real MESSAGES_WAITING event fires during this window, message delivery is delayed.

Fix

Replace the unconditional 5-second get_msg() poll with a conditional safety-net:

Normal operation: Rely entirely on the event-driven MESSAGES_WAITING → async_flush_messages() path.
Safety net: Only poll get_msg() if no message activity has occurred in the last 60 seconds (MSG_SAFETY_NET_INTERVAL). This catches any MESSAGES_WAITING events that were missed.
One-time drain: On the first coordinator cycle after startup, drain any messages queued on the device while the integration was disconnected.

This eliminates ~17,280 unnecessary commands per day on a quiet mesh.
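The gating logic and its effect on command volume can be sketched as follows. `MessagePollGate` and `polls_on_quiet_day` are hypothetical names; only `MSG_SAFETY_NET_INTERVAL` and the 5-second cycle come from the description above. The sketch assumes that a safety-net poll itself counts as message activity and resets the 60-second timer, which the report does not state explicitly.

```python
MSG_SAFETY_NET_INTERVAL = 60.0  # seconds, from the fix description
UPDATE_TICK = 5.0               # seconds between coordinator cycles
SECONDS_PER_DAY = 86400


class MessagePollGate:
    """Hypothetical helper deciding whether a cycle should call get_msg()."""

    def __init__(self):
        self._last_activity = None  # None until the startup drain has run

    def note_activity(self, now):
        # Called when messages are fetched, by event or by poll (assumption).
        self._last_activity = now

    def should_poll(self, now):
        if self._last_activity is None:
            return True  # one-time drain on the first cycle after startup
        return now - self._last_activity >= MSG_SAFETY_NET_INTERVAL


def polls_on_quiet_day():
    """Count coordinator cycles and safety-net polls with no mesh traffic."""
    gate = MessagePollGate()
    cycles = int(SECONDS_PER_DAY / UPDATE_TICK)
    polls = 0
    for i in range(cycles):
        now = i * UPDATE_TICK
        if gate.should_poll(now):
            polls += 1
            gate.note_activity(now)
    return cycles, polls
```

Under these assumptions a fully quiet day has 17,280 coordinator cycles but only 1,440 safety-net polls, so the exact saving depends on how the timer is reset in the real implementation.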

4. No Command Serialization

PR 3: feature/command-serialization (depends on PRs 1 and 2)

Symptom

Sent messages are silently dropped. Repeater status requests time out intermittently even after the PR 1 fix. Any two commands sent concurrently risk clearPendingReqs() collision.

Root Cause

The integration has 45 command call sites across 7 files. Prior to this fix, only 2 were protected by a lock (_message_lock on get_msg() calls). The remaining 43 could execute concurrently on the shared TCP/Serial connection.

The MeshCoreAPI._device_lock was defined but never acquired anywhere in the codebase.

Concurrent command sources during normal operation:

Main coordinator update loop (every 30s after PR 3): get_bat(), ensure_contacts(), get_self_telemetry()
Repeater background tasks (spawned via asyncio.create_task): send_login(), req_status_sync(), fetch_all_neighbours()
Telemetry background tasks: req_telemetry_sync()
Service handlers (user/automation): send_msg(), send_chan_msg()
WebSocket handlers (frontend panel): set_device_config(), execute_remote_command(), add_contact(), etc.
Event-driven message fetch: async_flush_messages() → get_msg()

Any two of these running simultaneously risks a clearPendingReqs() collision at the firmware level, causing one response to be silently dropped.

Concrete Collision Examples

Message and status request overlap: A text message alone does not trigger the flag collision. If a user sends a message via automation (send_msg()) while a repeater status request is in flight, both commands reach the firmware in consecutive loop iterations, but send_msg() does not call clearPendingReqs() (it is a text message, not a binary request), so the status tag survives. With the order reversed, the status request's clearPendingReqs() is harmless because text messages don't use the pending flags. The real danger is two binary requests in flight together: a status request and a telemetry request, or two repeater logins, where the second wipes the first's tag.

Dual login race: User clicks "Execute Remote Command" on a repeater in the panel while the coordinator's failure recovery triggers send_login() to the same repeater. Two login packets are sent, and the second handler's clearPendingReqs() wipes the first's pending_login tag, so one login response is dropped.

Contact sync race: User adds a contact from the panel while ensure_contacts() runs in the coordinator's update cycle. The contact list can end up inconsistent.
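The binary-request collision can be reproduced with a small asyncio simulation. `FakeFirmware`, `request`, and `run_pair` are hypothetical, and the firmware is reduced to a single pending tag; the point is only to show how an `asyncio.Lock` held for the full request/response round trip changes the outcome.

```python
import asyncio


class FakeFirmware:
    """Hypothetical model: only the most recent request's tag is tracked."""

    def __init__(self):
        self.pending_tag = None

    def send_request(self, tag):
        self.pending_tag = tag  # overwrites whatever was pending before

    def deliver_response(self, tag):
        if self.pending_tag == tag:
            self.pending_tag = None
            return True
        return False  # tag no longer tracked: response dropped


async def request(fw, tag, lock=None):
    async def round_trip():
        fw.send_request(tag)
        await asyncio.sleep(0.01)  # stand-in for the mesh round trip
        return fw.deliver_response(tag)

    if lock is None:
        return await round_trip()
    async with lock:  # hold the lock until the response has arrived
        return await round_trip()


async def run_pair(use_lock):
    fw = FakeFirmware()
    lock = asyncio.Lock() if use_lock else None
    # Two tasks issue binary requests at the same time.
    return await asyncio.gather(request(fw, 1, lock), request(fw, 2, lock))
```

Without the lock, the first request's response is dropped because the second request overwrote its tag. With the lock, both requests complete.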

Fix

Activate _device_lock via two wrappers in meshcore_api.py:

from contextlib import asynccontextmanager

# Both wrappers are methods of MeshCoreAPI.

async def execute(self, coro):
    """Execute a single device command under the device lock."""
    gen_before = self._connection_gen
    async with self._device_lock:
        if self._connection_gen != gen_before:
            # The connection was replaced while this command waited for the lock.
            # Assuming coro is a coroutine object, close it so Python does not
            # emit a "coroutine was never awaited" warning.
            coro.close()
            raise ConnectionError("Connection changed while waiting for device lock")
        return await coro

@asynccontextmanager
async def command_session(self):
    """Hold the device lock across multiple sequential commands."""
    gen_before = self._connection_gen
    async with self._device_lock:
        if self._connection_gen != gen_before:
            raise ConnectionError("Connection changed while waiting for device lock")
        yield

execute(coro) — for single commands (e.g., await api.execute(api.mesh_core.commands.get_bat()))
command_session() — for atomic multi-step sequences (e.g., login + send_cmd, multi-setting config changes)
_connection_gen — incremented on every connect/reconnect. If the connection changed while a command was waiting for the lock, the command fails immediately rather than executing on a potentially stale or new connection.
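The generation check can be exercised in isolation. The sketch below is a cut-down, hypothetical `MiniAPI` that keeps only the lock, the generation counter, and `execute()`; `reconnect()` and `fake_command()` are invented for the demonstration, and closing the unused coroutine is an addition in this sketch to avoid a "never awaited" warning.

```python
import asyncio


class MiniAPI:
    """Hypothetical, minimal stand-in for the API object described above."""

    def __init__(self):
        self._device_lock = asyncio.Lock()
        self._connection_gen = 0

    def reconnect(self):
        self._connection_gen += 1  # incremented on every connect/reconnect

    async def execute(self, coro):
        gen_before = self._connection_gen
        async with self._device_lock:
            if self._connection_gen != gen_before:
                coro.close()  # addition in this sketch: discard the unused coroutine
                raise ConnectionError("Connection changed while waiting for device lock")
            return await coro


async def fake_command(value):
    await asyncio.sleep(0)  # stand-in for a device round trip
    return value


async def demo():
    api = MiniAPI()
    fresh = await api.execute(fake_command("battery"))

    # Hold the lock, queue a second command, then reconnect before releasing.
    await api._device_lock.acquire()
    waiter = asyncio.create_task(api.execute(fake_command("stale")))
    await asyncio.sleep(0)  # let the waiter record its generation and block
    api.reconnect()
    api._device_lock.release()

    try:
        await waiter
        stale = "executed"
    except ConnectionError:
        stale = "rejected"
    return fresh, stale
```

The first command runs normally. The second one recorded generation 0 before blocking, sees generation 1 once it gets the lock, and is rejected instead of running on the new connection.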

Additional changes in PR 3:

Removed _message_lock: Subsumed by _device_lock — both get_msg() paths now use the device lock directly.
Raised DEFAULT_UPDATE_TICK from 5s to 30s: Battery and contact sync don't need 5-second resolution, and serialized commands take longer to execute sequentially.
Independent background loops: Repeater and telemetry scheduling moved from the main coordinator cycle into independent background loops with 10-second check intervals. This prevents a slow repeater from delaying the main update.

5. How the Three Fixes Work Together

The three PRs form a layered defense:

PR 1 (correct timeouts) ensures that when a command does reach the firmware, the integration waits long enough and checks for errors properly. Without this, even perfectly serialized commands would time out on multi-hop repeaters.

PR 2 (remove polling) eliminates ~17,280 unnecessary commands per day that competed for link time and increased collision probability. It also reduces lock contention under PR 3's serialization.

PR 3 (command serialization) enforces the firmware's one-request-at-a-time constraint at the integration level. It prevents clearPendingReqs() collisions entirely by ensuring only one command is in-flight at any moment.

Without PR 1, serialization alone doesn't fix the timeout bug — the integration still wouldn't wait long enough for multi-hop responses. Without PR 2, serialization works but the unnecessary get_msg() polling adds 17,280 lock acquisitions per day. Without PR 3, the other two fixes reduce the probability of collisions but don't prevent them.

6. Event Dispatch Is Not Affected

The SDK's reader runs as an independent asyncio task that continuously reads from the TCP/Serial transport and dispatches events via the EventDispatcher. It has no awareness of any lock. A command lock on the integration side does not block, delay, or interfere with event dispatch.

When a PUSH_CODE_MSG_WAITING notification arrives from the firmware, the SDK reader dispatches it immediately. The integration's event handler schedules async_flush_messages(), which acquires the device lock before calling get_msg(). If the lock is held by another command, the message fetch waits — but the messages are safe in the firmware's 16-slot offline_queue[] and are fetched as soon as the lock is released.

Other push events (NEW_CONTACT, PATH_UPDATED, RX_LOG_DATA, ACK, LOGIN_SUCCESS/FAIL, etc.) are handled by their respective listeners without needing the lock, since they don't send commands to the device.

Worst case: during a long operation like fetch_all_neighbours (which may take 20-30 seconds for a repeater with many neighbors), incoming chat message delivery is delayed until the operation completes. Messages are not lost.
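The ordering described in this section can be checked with a short asyncio sketch. All function names are hypothetical; `device_lock` stands in for the integration's device lock, and the reader is modeled as a coroutine that never touches it.

```python
import asyncio


async def demo():
    device_lock = asyncio.Lock()
    log = []

    async def long_command():
        # Stand-in for a slow operation such as a neighbour fetch.
        async with device_lock:
            log.append("long command start")
            await asyncio.sleep(0.05)
            log.append("long command end")

    async def flush_messages():
        # The message fetch needs the device lock, so it waits its turn.
        async with device_lock:
            log.append("messages fetched")

    async def reader():
        # The reader never takes the lock; it dispatches as soon as data arrives.
        await asyncio.sleep(0.01)
        log.append("MESSAGES_WAITING dispatched")
        return asyncio.create_task(flush_messages())

    command_task = asyncio.create_task(long_command())
    flush_task = await reader()
    await asyncio.gather(command_task, flush_task)
    return log
```

The resulting log shows the push event being dispatched while the long command still holds the lock, and the message fetch running only after the lock is released.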

Appendix: Firmware Pending Request Flags

| Flag | Set By | Matched In | Cleared By |
| --- | --- | --- | --- |
| `pending_login` | `CMD_SEND_LOGIN` | `onContactResponse()` → `LOGIN_SUCCESS/FAIL` | `clearPendingReqs()` |
| `pending_status` | `CMD_REQ_STATUS` | `onContactResponse()` → `STATUS_RESPONSE` | `clearPendingReqs()` |
| `pending_telemetry` | `CMD_REQ_TELEMETRY` | `onContactResponse()` → `TELEMETRY_RESPONSE` | `clearPendingReqs()` |
| `pending_discovery` | `CMD_SEND_PATH_DISCOVERY_REQ` | `onContactResponse()` → `PATH_DISCOVERY_RESPONSE` | `clearPendingReqs()` |
| `pending_req` | `CMD_SEND_BINARY_REQ` | `onContactResponse()` → `BINARY_RESPONSE` | `clearPendingReqs()` |

clearPendingReqs() is called at the start of every handler listed above. It zeroes all five flags. Sending any request of any type cancels tracking for all previously pending requests of all types.

