Here's the exact pattern I'd implement.
1. Storage: where and how
File: /home/clawd/clawd/data/last_update_id (plain text, just the integer)
Atomic write via rename:

```js
const fs = require('fs');

const OFFSET_FILE = '/home/clawd/clawd/data/last_update_id';
const OFFSET_TMP = OFFSET_FILE + '.tmp';

function persistOffset(updateId) {
  fs.writeFileSync(OFFSET_TMP, String(updateId));
  fs.renameSync(OFFSET_TMP, OFFSET_FILE); // atomic on ext4
}

function loadOffset() {
  try {
    const value = parseInt(fs.readFileSync(OFFSET_FILE, 'utf8').trim(), 10);
    return Number.isNaN(value) ? 0 : value; // treat a corrupt file like a first run
  } catch {
    return 0; // first run: no file yet
  }
}
```

No fsync needed. rename() is atomic on ext4 — the file is either the old value or the new value, never corrupted. Worst case on power loss: we lose the last write, which means one re-delivery on restart. That's fine.
2. When to advance: after each update, not after the batch
Current code:

```js
for (const update of result.result || []) {
  lastUpdateId = update.update_id;
  processUpdate(update);
}
```

New code:

```js
for (const update of result.result || []) {
  processUpdate(update);
  lastUpdateId = update.update_id;
  persistOffset(lastUpdateId);
}
```

Process first, persist second. If we crash mid-batch after processing update N but before persisting:
- On restart, `loadOffset()` returns N-1
- `getUpdates({ offset: N })` re-fetches update N plus the rest of the batch
- Update N gets reprocessed (one duplicate)
- Everything after N is processed normally
This gives us at-least-once with a worst-case window of one update per crash.
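The crash window can be simulated in a few lines (a toy model, not the real handler: `runBatch` and the in-memory `persisted` variable are stand-ins for the loop and the offset file):

```js
// Toy simulation of the crash window. Processing happens before the
// offset is persisted, so a crash between the two steps causes exactly
// one re-delivery on restart, never a lost update.
const processed = [];
let persisted = 0; // stands in for the on-disk offset file

function runBatch(updates, crashAfterProcessing) {
  for (const u of updates) {
    processed.push(u);                      // process first
    if (u === crashAfterProcessing) return; // "crash" before persist
    persisted = u;                          // persist second
  }
}

const batch = [101, 102, 103];
runBatch(batch, 102);                       // crash right after processing 102
// Restart: re-fetch everything after the persisted offset (101)
runBatch(batch.filter(u => u > persisted), null);

console.log(processed.join(',')); // 101,102,102,103: one duplicate, nothing lost
```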
3. Startup: replace offset: -1 with disk read
```js
function main() {
  const persisted = loadOffset();
  if (persisted > 0) {
    lastUpdateId = persisted;
    log(`Resumed from persisted offset: ${persisted}`);
  } else {
    // First run ever: skip history, start from now
    const init = telegramApi('getUpdates', { offset: -1, limit: 1 });
    if (init?.ok && init.result.length > 0) {
      lastUpdateId = init.result[0].update_id;
      persistOffset(lastUpdateId);
    }
    log(`First run, starting from offset: ${lastUpdateId}`);
  }
  pollLoop();
}
```

The `offset: -1` path only runs on first-ever startup (no file exists). Every subsequent restart resumes exactly where we left off.
4. Dedupe: update_id as the key, not message_id
Current dedupe uses message.message_id — but that's per-chat, not globally unique. update_id is globally unique and monotonic. Switch to:
```js
const processedUpdates = new Set();

function processUpdate(update) {
  if (processedUpdates.has(update.update_id)) return;
  processedUpdates.add(update.update_id);

  // Evict old entries — only keep IDs > lastUpdateId - 200
  if (processedUpdates.size > 500) {
    const cutoff = lastUpdateId - 200;
    for (const id of processedUpdates) {
      if (id < cutoff) processedUpdates.delete(id);
    }
  }
  // ... rest of processing
}
```

Why this works: after restart, we might re-fetch a few updates (the crash window). The in-memory Set catches them within the session. Between sessions, the persisted offset handles it — we never go back further than persistedOffset + 1. The Set only needs to cover the one-update gap from persist-after-process.
We don't need to persist the Set itself. The window is so small (1 update) that even if the Set doesn't catch it, the duplicate is a redundant /run trigger or a duplicate clawdbot forward — annoying, not catastrophic.
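The eviction thresholds behave as expected; here is a quick isolated check using the same constants as above (the `remember` helper is a stand-in for the dedupe portion of `processUpdate`):

```js
// Isolated check of the Set eviction logic: same 500/200 thresholds
// as the handler, with a toy `remember` helper in place of processUpdate.
const processedUpdates = new Set();
let lastUpdateId = 0;

function remember(updateId) {
  processedUpdates.add(updateId);
  lastUpdateId = updateId;
  if (processedUpdates.size > 500) {
    const cutoff = lastUpdateId - 200;
    for (const id of processedUpdates) {
      if (id < cutoff) processedUpdates.delete(id);
    }
  }
}

for (let id = 1; id <= 600; id++) remember(id);

console.log(processedUpdates.size);     // 300: IDs 301..600 remain
console.log(processedUpdates.has(300)); // false, evicted
console.log(processedUpdates.has(301)); // true, kept
```

Memory stays bounded: the Set never grows past 501 entries regardless of how long the process runs.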
5. Multiple instance protection
Current systemd config already helps:
- `KillMode=control-group` (confirmed) — kills the entire process tree before restart
- `RestartSec=2` — 2s gap between stop and start
But there's a race: if systemd sends SIGTERM and the process doesn't exit within TimeoutStopSec (default 90s), systemd sends SIGKILL. During that 90s, the old process could still be polling.
Hardening:

```ini
# Add to [Service] section:
TimeoutStopSec=5
PIDFile=/run/reply-handler.pid
```

Plus in the code:
```js
const LOCK_FILE = '/run/reply-handler.pid';

function acquireLock() {
  try {
    const existing = fs.readFileSync(LOCK_FILE, 'utf8').trim();
    // Check if the PID is still alive (signal 0 probes without killing)
    try {
      process.kill(parseInt(existing, 10), 0);
    } catch {
      // Process dead, stale lock — take over
      fs.writeFileSync(LOCK_FILE, String(process.pid));
      return true;
    }
    log(`FATAL: Another instance running (PID ${existing})`);
    process.exit(1);
  } catch {
    // No lock file — first instance
    fs.writeFileSync(LOCK_FILE, String(process.pid));
    return true;
  }
}

process.on('SIGTERM', () => {
  try { fs.unlinkSync(LOCK_FILE); } catch {}
  process.exit(0);
});
```

This is belt-and-suspenders. The real protection is `KillMode=control-group` + `TimeoutStopSec=5`. The lockfile catches the edge case where someone manually runs `node reply_handler.js` while systemd is also running it.
Summary of guarantees:
| Property | How |
|---|---|
| At-least-once delivery | Persist offset AFTER processing; crash = re-fetch |
| No lost updates on restart | Load offset from disk, not offset: -1 |
| Duplicate protection | update_id Set in-memory + idempotent offset advance |
| Atomic persistence | write-tmp + rename (ext4 atomic) |
| Single-instance | PID lockfile + KillMode=control-group + TimeoutStopSec=5 |
| First-run bootstrap | Falls back to offset: -1 only when no persisted file |
The entire change is ~40 lines of code. Want me to spawn a /run to implement it?