Date: 2026-04-01
Reporter: Matt Wolter
Status: Root causes identified and fixed across three coordinated PRs
Applies to: meshcore-ha Home Assistant integration communicating with MeshCore companion firmware over TCP
The MeshCore-HA integration communicates with a companion radio device over a single TCP/Serial connection using a binary frame protocol. Three distinct but interrelated issues cause unreliable command execution — most visibly, managed repeater status timeouts after HA restarts and silently dropped sent messages. All three issues stem from a mismatch between the firmware's single-request architecture and the integration's concurrent command patterns.
The core constraint: The companion firmware tracks outstanding mesh requests with five global flags (pending_login, pending_status, pending_telemetry, pending_discovery, pending_req). Every outbound request handler calls clearPendingReqs() before storing its own tag — wiping all five flags. This means the firmware can only track one outstanding mesh request at a time. If any second request is sent before the first response arrives from the mesh, the first request's tracking is erased and its response is silently dropped.
Three PRs address the three layers of this problem:
| PR | Branch | What It Fixes |
|---|---|---|
| PR 1 | feature/fix-repeater-status | Incorrect timeout handling and missing error checks on repeater/telemetry requests |
| PR 2 | feature/conditional-msg-poll | Unnecessary get_msg() polling every 5 seconds (~17,280 commands/day) that wastes link time and increases collision probability |
| PR 3 (Future) | feature/command-serialization | No mutual exclusion on the shared connection; many command call sites have no lock protection |
PRs 1 and 2 are independent of each other and can be reviewed/merged in any order. PR 3 depends on both (it builds on the SDK sync methods from PR 1 and the reduced polling from PR 2) and will be submitted after PRs 1 and 2 are merged.
- Firmware Architecture — Why This Matters
- Issue 1: Repeater Status Timeouts
- Issue 2: Excessive get_msg() Polling
- Issue 3: No Command Serialization
- How the Three Fixes Work Together
- Event Dispatch Is Not Affected
- Appendix: Firmware Pending Request Flags
Source: MeshCore/examples/companion_radio/MyMesh.cpp, MeshCore/src/Dispatcher.cpp, MeshCore/docs/companion_protocol.md
The companion firmware runs a single-threaded main loop with no preemption:
MyMesh::loop() {
1. BaseChatMesh::loop() // Radio: receive/transmit mesh packets
2. checkSerialInterface() // Serial/TCP: read one command frame, dispatch it
3. Lazy contact writes // Save contacts to flash if dirty
4. UI connection status // Update display
}
Key behaviors:
- One command per iteration. checkSerialInterface() reads at most one complete frame per call. Two commands sent back-to-back are processed in consecutive loop iterations, not simultaneously.
- Radio before serial. Incoming mesh packets (including those that trigger PUSH_CODE_MSG_WAITING) are processed before serial commands in the same iteration.
- Non-blocking binary requests. When the firmware sends a mesh request (status, telemetry, login), it stores a tracking tag and returns immediately with RESP_CODE_SENT. The actual mesh response arrives asynchronously via the radio path and is matched against the stored tag.
The firmware tracks outstanding requests with five per-type flags. Every outbound request handler calls clearPendingReqs() first:
void clearPendingReqs() {
  pending_login = pending_status = pending_telemetry = pending_discovery = pending_req = 0;
}

Called by: CMD_SEND_LOGIN, CMD_SEND_ANON_REQ, CMD_REQ_STATUS, CMD_REQ_TELEMETRY, CMD_SEND_PATH_DISCOVERY_REQ, CMD_SEND_BINARY_REQ.
Consequence: Only one pending mesh request of any type can exist. Sending a login wipes a pending status. Sending a status wipes a pending telemetry. The response to the wiped request is silently dropped when it arrives — the firmware has no tag to match it against.
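The wipe-on-send behaviour can be reproduced with a small Python model. This is a sketch for illustration only, not firmware code: the class name, method names, and tag values are hypothetical, and it assumes a flag is cleared once its response has been matched.

```python
class PendingRequestModel:
    """Toy model of the firmware's five pending-request flags (not firmware code)."""

    FLAGS = ("pending_login", "pending_status", "pending_telemetry",
             "pending_discovery", "pending_req")

    def __init__(self):
        self.flags = dict.fromkeys(self.FLAGS, 0)

    def clear_pending_reqs(self):
        # Mirrors clearPendingReqs(): zero all five flags at once.
        for name in self.flags:
            self.flags[name] = 0

    def send_request(self, flag, tag):
        # Every outbound request handler clears first, then stores its own tag.
        self.clear_pending_reqs()
        self.flags[flag] = tag

    def on_response(self, flag, tag):
        # A response is delivered only if its tag is still being tracked.
        if tag != 0 and self.flags[flag] == tag:
            self.flags[flag] = 0  # assumption: flag is cleared after a match
            return True
        return False  # no matching tag: the response is silently dropped


fw = PendingRequestModel()
fw.send_request("pending_status", 0x1111)     # status request goes out
fw.send_request("pending_telemetry", 0x2222)  # telemetry sent before the status reply
status_delivered = fw.on_response("pending_status", 0x1111)
telemetry_delivered = fw.on_response("pending_telemetry", 0x2222)
print(status_delivered, telemetry_delivered)  # False True
```

In this model the status response is lost even though it arrived intact, because the telemetry request erased its tag first.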
The companion protocol documentation (companion_protocol.md) explicitly states:
Command-Response Matching:
- Send one command at a time
- Wait for a response before sending another command
- Use a timeout (typically 5 seconds)
- Match response to command by type
The TCP/Serial server accepts one client at a time; a new connection forcibly disconnects the existing client. It uses a 4-frame send queue, and frames are silently dropped when the queue is full. The queue is shared between command responses and asynchronous push notifications from the radio side.
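To make the queue limit concrete, here is a minimal Python sketch of a bounded send queue shared by responses and push notifications. The class is hypothetical, and it assumes the newest frame is the one discarded when the queue is full; the description above does not say which frame the firmware drops.

```python
from collections import deque


class BoundedFrameQueue:
    """Illustrative 4-slot send queue; drop policy is an assumption."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.frames = deque()
        self.dropped = 0

    def push(self, frame: bytes) -> bool:
        # When full, the frame is discarded without notifying the sender.
        if len(self.frames) >= self.capacity:
            self.dropped += 1
            return False
        self.frames.append(frame)
        return True

    def pop(self):
        # The transport drains frames in FIFO order.
        return self.frames.popleft() if self.frames else None


queue = BoundedFrameQueue()
# Six frames (responses and pushes mixed) arrive before the client reads any.
accepted = [queue.push(bytes([i])) for i in range(6)]
print(accepted, queue.dropped)
```

Under these assumptions, a burst of push notifications can crowd out a command response, which is another reason to keep unnecessary traffic off the link.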
PR 1: feature/fix-repeater-status
After an HA core restart, managed repeaters fail to return status data:
Error requesting status from repeater ca.cv.main-st: None
The None is a timeout — no STATUS_RESPONSE event was received. Restarting HA core a second time resolves the issue.
Factor 1: No error checking on send.
The coordinator called send_binary_req() and discarded the return value:
# Before (broken):
await self.api.mesh_core.commands.send_binary_req(contact, BinaryReqType.STATUS)
# Return value discarded — no check for ERROR (no path, write failure, etc.)
result = await self.api.mesh_core.wait_for_event(
    EventType.STATUS_RESPONSE,
    attribute_filters={"pubkey_prefix": pubkey_prefix},
)
# Waits 5 seconds for a response that will never arrive if send failed

After an HA restart, the firmware may not have re-established mesh routes. If send_binary_req returns ERROR due to no valid path, the coordinator doesn't know and waits for a response that will never come.
Factor 2: Fixed 5-second timeout instead of firmware-suggested timeout.
The SDK calculates suggested_timeout based on path length to the destination. For multi-hop repeaters, 5 seconds is insufficient. The firmware provides the correct timeout in the RESP_CODE_SENT payload, but the coordinator ignored it.
Factor 3: Imprecise response matching.
The coordinator filtered wait_for_event by pubkey_prefix instead of the precise expected_ack tag returned in RESP_CODE_SENT. This is less precise and could theoretically cross-match with stale responses.
Replace the manual send_binary_req + wait_for_event pattern with the SDK's purpose-built sync methods:
# After (fixed):
result = await self.api.mesh_core.commands.req_status_sync(contact)

req_status_sync() handles error checking, firmware-calculated timeout, and precise tag-based matching internally.
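The three behaviours attributed to the sync method (fail fast on a send error, use the suggested timeout, match on the returned tag) can be sketched generically. This is not the SDK's implementation: `SendResult`, `request_and_wait`, and the stub callables are hypothetical, and the timeout is expressed in seconds purely for the sketch.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SendResult:
    """Hypothetical stand-in for what a send call reports back."""
    ok: bool
    expected_ack: bytes = b""
    suggested_timeout: float = 0.0  # seconds in this sketch (an assumption)


async def request_and_wait(send, wait_for_tag, fallback_timeout=5.0):
    sent = await send()
    if not sent.ok:
        return None  # fail fast: no response can arrive for a failed send
    # Prefer the suggested timeout; fall back to a fixed value otherwise.
    timeout = sent.suggested_timeout if sent.suggested_timeout > 0 else fallback_timeout
    try:
        # Match on the exact tag rather than a looser pubkey prefix.
        return await asyncio.wait_for(wait_for_tag(sent.expected_ack), timeout)
    except asyncio.TimeoutError:
        return None


async def _demo():
    waited_tags = []

    async def send_ok():
        return SendResult(ok=True, expected_ack=b"\x01\x02", suggested_timeout=0.2)

    async def send_fail():
        return SendResult(ok=False)

    async def wait_for_tag(tag):
        waited_tags.append(tag)
        await asyncio.sleep(0.01)  # simulated mesh round trip
        return {"tag": tag}

    ok = await request_and_wait(send_ok, wait_for_tag)
    failed = await request_and_wait(send_fail, wait_for_tag)
    return ok, failed, waited_tags


ok_result, failed_result, waited = asyncio.run(_demo())
print(ok_result, failed_result, waited)
```

The failed send returns immediately and never enters the wait, which is the behaviour the original coordinator code lacked.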
Additional fixes in PR 1:
- send_login response check: The coordinator checked the return value for LOGIN_SUCCESS, which never matched because the return is MSG_SENT and the actual login result arrives asynchronously. Fixed to check for MSG_SENT, then wait_for_event(EventType.LOGIN_SUCCESS).
- Log level: Lowered "Clearing message queue..." from INFO to DEBUG (logged every 5 seconds with no actionable content).
- Deprecated API: Replaced logger.warn() with logger.warning() (4 occurrences).
- Unused import: Removed BinaryReqType (no longer needed with sync methods).
A second restart resolves the issue because, during the first boot's failed attempts, send_binary_req still transmits packets over the mesh, triggering route/path updates. By the second boot, paths are fresh: the firmware's mesh state was "warmed up" by the first boot.
PR 2: feature/conditional-msg-poll
The HA log shows "Clearing message queue..." every 5 seconds continuously. The coordinator calls get_msg() on every update cycle — 17,280 times per day — regardless of whether messages are waiting.
The integration has two message-fetching paths running simultaneously:
- Polling path: Every 5-second coordinator update cycle calls get_msg() in a while loop until NO_MORE_MSGS.
- Event-driven path: Subscribes to MESSAGES_WAITING firmware push events and calls async_flush_messages() to drain the queue immediately.
The event-driven path is correct and sufficient. The polling path is redundant — almost every call returns NO_MORE_MSGS immediately, but still occupies the serial/TCP link for the round-trip.
- Link contention: Each unnecessary get_msg() competes with repeater status, telemetry, battery, and contact sync commands. On a busy mesh with multiple repeaters, this increases the probability of clearPendingReqs() collisions.
- Log noise: INFO-level logging every 5 seconds buries meaningful entries.
- Lock contention: The _message_lock is held during polling. If a real MESSAGES_WAITING event fires during this window, message delivery is delayed.
Replace the unconditional 5-second get_msg() poll with a conditional safety-net:
- Normal operation: Rely entirely on the event-driven MESSAGES_WAITING → async_flush_messages() path.
- Safety net: Only poll get_msg() if no message activity has occurred in the last 60 seconds (MSG_SAFETY_NET_INTERVAL). This catches any MESSAGES_WAITING events that were missed.
- One-time drain: On the first coordinator cycle after startup, drain any messages queued on the device while the integration was disconnected.
This eliminates ~17,280 unnecessary commands per day on a quiet mesh.
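The polling decision described above can be sketched as a small gate object. This is an illustrative sketch rather than the PR's code: `MessagePollGate` and its methods are hypothetical names, and it assumes that a safety-net poll itself counts as message activity.

```python
import time

MSG_SAFETY_NET_INTERVAL = 60.0  # seconds, as described above
OLD_POLL_INTERVAL = 5           # seconds, the previous unconditional poll


class MessagePollGate:
    """Hypothetical helper deciding when the coordinator should call get_msg()."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._startup_drained = False
        self._last_activity = clock()

    def note_activity(self):
        # Call whenever messages were fetched or MESSAGES_WAITING fired.
        self._last_activity = self._clock()

    def should_poll(self):
        if not self._startup_drained:
            # One-time drain on the first cycle after startup.
            self._startup_drained = True
            self.note_activity()
            return True
        if self._clock() - self._last_activity >= MSG_SAFETY_NET_INTERVAL:
            # Safety net: nothing seen for a full interval.
            self.note_activity()  # assumption: the poll counts as activity
            return True
        return False


fake_now = [0.0]  # injectable clock so the example runs instantly
gate = MessagePollGate(clock=lambda: fake_now[0])
decisions = []
for t in (0, 5, 30, 59, 60, 65):
    fake_now[0] = float(t)
    decisions.append(gate.should_poll())

polls_per_day_before = 24 * 60 * 60 // OLD_POLL_INTERVAL
print(decisions, polls_per_day_before)
```

The last line reproduces the 17,280 figure: 86,400 seconds per day divided by the 5-second cycle.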
PR 3: feature/command-serialization (depends on PRs 1 and 2)
Sent messages are silently dropped. Repeater status requests time out intermittently even after the PR 1 fix. Any two commands sent concurrently risk clearPendingReqs() collision.
The integration has 45 command call sites across 7 files. Prior to this fix, only 2 were protected by a lock (_message_lock on get_msg() calls). The remaining 43 could execute concurrently on the shared TCP/Serial connection.
The MeshCoreAPI._device_lock was defined but never acquired anywhere in the codebase.
Concurrent command sources during normal operation:
- Main coordinator update loop (every 30s after PR 3): get_bat(), ensure_contacts(), get_self_telemetry()
- Repeater background tasks (spawned via asyncio.create_task): send_login(), req_status_sync(), fetch_all_neighbours()
- Telemetry background tasks: req_telemetry_sync()
- Service handlers (user/automation): send_msg(), send_chan_msg()
- WebSocket handlers (frontend panel): set_device_config(), execute_remote_command(), add_contact(), etc.
- Event-driven message fetch: async_flush_messages() → get_msg()
Any two of these running simultaneously risks a clearPendingReqs() collision at the firmware level, causing one response to be silently dropped.
Sent message dropped: User sends a message via automation (send_msg()) while a repeater status request is in flight. Both commands reach the firmware in consecutive loop iterations. This pairing does not collide at the pending-flag level in either order: send_msg() is a text message, so it neither calls clearPendingReqs() nor relies on the pending flags. It cannot wipe the status request's tag, and a later status request's clearPendingReqs() cannot affect it. The real danger is two binary requests in flight together, such as a status request and a telemetry request, or two repeater logins, where the second wipes the first's tag.
Dual login race: User clicks "Execute Remote Command" on a repeater in the panel while the coordinator's failure recovery triggers send_login() to the same repeater. Two login packets sent — second clearPendingReqs() wipes the first's pending_login tag. One login response is dropped.
Contact sync race: User adds a contact from the panel while ensure_contacts() runs in the coordinator's update cycle. The contact list can end up inconsistent.
Activate _device_lock via two wrappers in meshcore_api.py:
async def execute(self, coro):
    """Execute a single device command under the device lock."""
    gen_before = self._connection_gen
    async with self._device_lock:
        if self._connection_gen != gen_before:
            raise ConnectionError("Connection changed while waiting for device lock")
        return await coro

@asynccontextmanager
async def command_session(self):
    """Hold the device lock across multiple sequential commands."""
    gen_before = self._connection_gen
    async with self._device_lock:
        if self._connection_gen != gen_before:
            raise ConnectionError("Connection changed while waiting for device lock")
        yield

- execute(coro): for single commands (e.g., await api.execute(api.mesh_core.commands.get_bat()))
- command_session(): for atomic multi-step sequences (e.g., login + send_cmd, multi-setting config changes)
- _connection_gen: incremented on every connect/reconnect. If the connection changed while a command was waiting for the lock, the command fails immediately rather than executing on a potentially stale or new connection.
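The following self-contained sketch exercises the same lock-plus-generation pattern outside Home Assistant. `DeviceGate`, `mark_reconnected()`, and `fake_command()` are hypothetical stand-ins, and the `coro.close()` call is an addition for this sketch only (it avoids a "never awaited" warning when a stale command is rejected).

```python
import asyncio
from contextlib import asynccontextmanager


class DeviceGate:
    """Minimal stand-in for the lock-related parts of the API object."""

    def __init__(self):
        self._device_lock = asyncio.Lock()
        self._connection_gen = 0

    def mark_reconnected(self):
        # Hypothetical hook: bump the generation on every connect/reconnect.
        self._connection_gen += 1

    async def execute(self, coro):
        gen_before = self._connection_gen
        async with self._device_lock:
            if self._connection_gen != gen_before:
                coro.close()  # sketch-only addition, see lead-in
                raise ConnectionError("Connection changed while waiting for device lock")
            return await coro

    @asynccontextmanager
    async def command_session(self):
        gen_before = self._connection_gen
        async with self._device_lock:
            if self._connection_gen != gen_before:
                raise ConnectionError("Connection changed while waiting for device lock")
            yield


async def _demo():
    gate = DeviceGate()
    log = []

    async def fake_command(name):
        log.append(f"{name}:start")
        await asyncio.sleep(0.01)  # simulated device round trip
        log.append(f"{name}:end")
        return name

    # Two commands issued concurrently are executed one after the other.
    results = await asyncio.gather(
        gate.execute(fake_command("get_bat")),
        gate.execute(fake_command("req_status")),
    )

    # A command that queued before a reconnect is rejected, not executed.
    async with gate.command_session():
        waiter = asyncio.create_task(gate.execute(fake_command("stale")))
        await asyncio.sleep(0)  # let the waiter block on the lock
        gate.mark_reconnected()
    try:
        await waiter
        stale_error = None
    except ConnectionError as exc:
        stale_error = str(exc)
    return results, log, stale_error


results, log, stale_error = asyncio.run(_demo())
print(results, log, stale_error)
```

The log shows no interleaving between the two commands, and the stale command never starts.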
Additional changes in PR 3:
- Removed _message_lock: Subsumed by _device_lock; both get_msg() paths now use the device lock directly.
- Raised DEFAULT_UPDATE_TICK from 5s to 30s: Battery and contact sync don't need 5-second resolution, and serialized commands take longer to execute sequentially.
- Independent background loops: Repeater and telemetry scheduling moved from the main coordinator cycle into independent background loops with 10-second check intervals. This prevents a slow repeater from delaying the main update.
The three PRs form a layered defense:
PR 1 (correct timeouts) ensures that when a command does reach the firmware, the integration waits long enough and checks for errors properly. Without this, even perfectly serialized commands would time out on multi-hop repeaters.
PR 2 (remove polling) eliminates ~17,280 unnecessary commands per day that competed for link time and increased collision probability. It also reduces lock contention under PR 3's serialization.
PR 3 (command serialization) enforces the firmware's one-request-at-a-time constraint at the integration level. It prevents clearPendingReqs() collisions entirely by ensuring only one command is in-flight at any moment.
Without PR 1, serialization alone doesn't fix the timeout bug — the integration still wouldn't wait long enough for multi-hop responses. Without PR 2, serialization works but the unnecessary get_msg() polling adds 17,280 lock acquisitions per day. Without PR 3, the other two fixes reduce the probability of collisions but don't prevent them.
The SDK's reader runs as an independent asyncio task that continuously reads from the TCP/Serial transport and dispatches events via the EventDispatcher. It has no awareness of any lock. A command lock on the integration side does not block, delay, or interfere with event dispatch.
When a PUSH_CODE_MSG_WAITING notification arrives from the firmware, the SDK reader dispatches it immediately. The integration's event handler schedules async_flush_messages(), which acquires the device lock before calling get_msg(). If the lock is held by another command, the message fetch waits — but the messages are safe in the firmware's 16-slot offline_queue[] and are fetched as soon as the lock is released.
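A short asyncio sketch shows the ordering described above: dispatch happens while the lock is held, and only the follow-up fetch waits. The task names and timings are hypothetical; the reader here is a stand-in for the SDK's reader task, not its actual code.

```python
import asyncio


async def _demo():
    device_lock = asyncio.Lock()
    timeline = []

    async def long_command():
        # Stand-in for a slow command holding the device lock.
        async with device_lock:
            timeline.append("long_command:start")
            await asyncio.sleep(0.05)
            timeline.append("long_command:end")

    async def flush_messages():
        # The fetch needs the lock, so it waits for the command to finish.
        async with device_lock:
            timeline.append("flush:get_msg")

    async def reader():
        # Stand-in for the reader task: it never touches the lock.
        await asyncio.sleep(0.01)
        timeline.append("reader:MESSAGES_WAITING dispatched")
        return asyncio.create_task(flush_messages())

    cmd = asyncio.create_task(long_command())
    reader_task = asyncio.create_task(reader())
    flush_task = await reader_task
    await cmd
    await flush_task
    return timeline


timeline = asyncio.run(_demo())
print(timeline)
```

The push event is recorded in the middle of the long command, while the get_msg() step runs only after the lock is released.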
Other push events (NEW_CONTACT, PATH_UPDATED, RX_LOG_DATA, ACK, LOGIN_SUCCESS/FAIL, etc.) are handled by their respective listeners without needing the lock, since they don't send commands to the device.
Worst case: during a long operation like fetch_all_neighbours (which may take 20-30 seconds for a repeater with many neighbors), incoming chat message delivery is delayed until the operation completes. Messages are not lost.
| Flag | Set By | Matched In | Cleared By |
|---|---|---|---|
| pending_login | CMD_SEND_LOGIN | onContactResponse() → LOGIN_SUCCESS/FAIL | clearPendingReqs() |
| pending_status | CMD_REQ_STATUS | onContactResponse() → STATUS_RESPONSE | clearPendingReqs() |
| pending_telemetry | CMD_REQ_TELEMETRY | onContactResponse() → TELEMETRY_RESPONSE | clearPendingReqs() |
| pending_discovery | CMD_SEND_PATH_DISCOVERY_REQ | onContactResponse() → PATH_DISCOVERY_RESPONSE | clearPendingReqs() |
| pending_req | CMD_SEND_BINARY_REQ | onContactResponse() → BINARY_RESPONSE | clearPendingReqs() |
clearPendingReqs() is called at the start of every handler listed above. It zeroes all five flags. Sending any request of any type cancels tracking for all previously pending requests of all types.