Here's the exact pattern I'd implement.
1. Storage: where and how
File: /home/clawd/clawd/data/last_update_id (plain text, just the integer)
Atomic write via rename:

```js
const fs = require('fs');

const OFFSET_FILE = '/home/clawd/clawd/data/last_update_id';
const OFFSET_TMP = OFFSET_FILE + '.tmp';

function persistOffset(updateId) {
  fs.writeFileSync(OFFSET_TMP, String(updateId));
  fs.renameSync(OFFSET_TMP, OFFSET_FILE); // atomic on ext4
}

function loadOffset() {
  try {
    const value = parseInt(fs.readFileSync(OFFSET_FILE, 'utf8').trim(), 10);
    return Number.isNaN(value) ? 0 : value; // treat a corrupt file like a first run
  } catch {
    return 0; // first run: no file yet
  }
}
```

No fsync needed. rename() is atomic on ext4 — the file is either the old value or the new value, never corrupted. Worst case on power loss: we lose the last write, which means one re-delivery on restart. That's fine.
2. When to advance: after each update, not after the batch
Current code:

```js
for (const update of result.result || []) {
  lastUpdateId = update.update_id;
  processUpdate(update);
}
```

New code:

```js
for (const update of result.result || []) {
  processUpdate(update);
  lastUpdateId = update.update_id;
  persistOffset(lastUpdateId);
}
```

Process first, persist second. If we crash mid-batch after processing update N but before persisting:
- On restart, `loadOffset()` returns N-1
- `getUpdates({ offset: N })` re-fetches update N plus the rest of the batch
- Update N gets reprocessed (one duplicate)
- Everything after N is processed normally
This gives us at-least-once with a worst-case window of one update per crash.
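The crash window can be simulated in a few lines (a toy model, not the real handler: `runBatch` and the in-memory `persisted` variable are stand-ins for the loop and the offset file):

```js
// Toy simulation of the crash window. Processing happens before the
// offset is persisted, so a crash between the two steps causes exactly
// one re-delivery on restart, never a lost update.
const processed = [];
let persisted = 0; // stands in for the on-disk offset file

function runBatch(updates, crashAfterProcessing) {
  for (const u of updates) {
    processed.push(u);                      // process first
    if (u === crashAfterProcessing) return; // "crash" before persist
    persisted = u;                          // persist second
  }
}

const batch = [101, 102, 103];
runBatch(batch, 102);                       // crash right after processing 102
// Restart: re-fetch everything after the persisted offset (101)
runBatch(batch.filter(u => u > persisted), null);

console.log(processed.join(',')); // 101,102,102,103: one duplicate, nothing lost
```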
3. Startup: replace offset: -1 with disk read
```js
function main() {
  const persisted = loadOffset();
  if (persisted > 0) {
    lastUpdateId = persisted;
    log(`Resumed from persisted offset: ${persisted}`);
  } else {
    // First run ever: skip history, start from now
    const init = telegramApi('getUpdates', { offset: -1, limit: 1 });
    if (init?.ok && init.result.length > 0) {
      lastUpdateId = init.result[0].update_id;
      persistOffset(lastUpdateId);
    }
    log(`First run, starting from offset: ${lastUpdateId}`);
  }
  pollLoop();
}
```

The `offset: -1` path only runs on first-ever startup (no file exists). Every subsequent restart resumes exactly where we left off.
4. Dedupe: update_id as the key, not message_id
Current dedupe uses message.message_id — but that's per-chat, not globally unique. update_id is globally unique and monotonic. Switch to:
```js
const processedUpdates = new Set();

function processUpdate(update) {
  if (processedUpdates.has(update.update_id)) return;
  processedUpdates.add(update.update_id);

  // Evict old entries — only keep IDs > lastUpdateId - 200
  if (processedUpdates.size > 500) {
    const cutoff = lastUpdateId - 200;
    for (const id of processedUpdates) {
      if (id < cutoff) processedUpdates.delete(id);
    }
  }
  // ... rest of processing
}
```

Why this works: after restart, we might re-fetch a few updates (the crash window). The in-memory Set catches them within the session. Between sessions, the persisted offset handles it — we never go back further than persistedOffset + 1. The Set only needs to cover the one-update gap from persist-after-process.
We don't need to persist the Set itself. The window is so small (1 update) that even if the Set doesn't catch it, the duplicate is a redundant /run trigger or a duplicate clawdbot forward — annoying, not catastrophic.
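The eviction thresholds behave as expected; here is a quick isolated check using the same constants as above (the `remember` helper is a stand-in for the dedupe portion of `processUpdate`):

```js
// Isolated check of the Set eviction logic: same 500/200 thresholds
// as the handler, with a toy `remember` helper in place of processUpdate.
const processedUpdates = new Set();
let lastUpdateId = 0;

function remember(updateId) {
  processedUpdates.add(updateId);
  lastUpdateId = updateId;
  if (processedUpdates.size > 500) {
    const cutoff = lastUpdateId - 200;
    for (const id of processedUpdates) {
      if (id < cutoff) processedUpdates.delete(id);
    }
  }
}

for (let id = 1; id <= 600; id++) remember(id);

console.log(processedUpdates.size);     // 300: IDs 301..600 remain
console.log(processedUpdates.has(300)); // false, evicted
console.log(processedUpdates.has(301)); // true, kept
```

Memory stays bounded: the Set never grows past 501 entries regardless of how long the process runs.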
5. Multiple instance protection
Current systemd config already helps:
- `KillMode=control-group` (confirmed) — kills the entire process tree before restart
- `RestartSec=2` — 2s gap between stop and start
But there's a race: if systemd sends SIGTERM and the process doesn't exit within TimeoutStopSec (default 90s), systemd sends SIGKILL. During that 90s, the old process could still be polling.
Hardening:

```ini
# Add to [Service] section:
TimeoutStopSec=5
PIDFile=/run/reply-handler.pid
```

Plus in the code:
```js
const LOCK_FILE = '/run/reply-handler.pid';

function acquireLock() {
  try {
    const existing = fs.readFileSync(LOCK_FILE, 'utf8').trim();
    // Check if the PID is still alive (signal 0 probes without killing)
    try {
      process.kill(parseInt(existing, 10), 0);
    } catch {
      // Process dead, stale lock — take over
      fs.writeFileSync(LOCK_FILE, String(process.pid));
      return true;
    }
    log(`FATAL: Another instance running (PID ${existing})`);
    process.exit(1);
  } catch {
    // No lock file — first instance
    fs.writeFileSync(LOCK_FILE, String(process.pid));
    return true;
  }
}

process.on('SIGTERM', () => {
  try { fs.unlinkSync(LOCK_FILE); } catch {}
  process.exit(0);
});
```

This is belt-and-suspenders. The real protection is `KillMode=control-group` + `TimeoutStopSec=5`. The lockfile catches the edge case where someone manually runs `node reply_handler.js` while systemd is also running it.
Summary of guarantees:
| Property | How |
|---|---|
| At-least-once delivery | Persist offset AFTER processing; crash = re-fetch |
| No lost updates on restart | Load offset from disk, not offset: -1 |
| Duplicate protection | update_id Set in-memory + idempotent offset advance |
| Atomic persistence | write-tmp + rename (ext4 atomic) |
| Single-instance | PID lockfile + KillMode=control-group + TimeoutStopSec=5 |
| First-run bootstrap | Falls back to offset: -1 only when no persisted file |
The entire change is ~40 lines of code. Want me to spawn a /run to implement it?