Tech Spec: Reliable Notification System for 1M+ Users

1. Overview

Design a scalable notification platform that sends push notifications, SMS, and email to more than 1M users.

The system must:

Support push, SMS, and email.
Handle transactional and marketing notifications.
Avoid duplicate sends.
Avoid missed sends.
Retry transient failures.
Fail over between providers.
Gracefully degrade when providers are slow or unavailable.
Provide a full audit trail for support, compliance, and debugging.
Scale horizontally across channels, priorities, and workers.

Important caveat: no system can guarantee true end-to-end exactly-once delivery once third-party providers are involved. A provider can time out after accepting a message, or deliver a message while our system never receives the response. The correct target is effectively-once sending from our system, using idempotency keys, durable state, provider references, retries, reconciliation, and duplicate suppression.

2. High-Level Architecture

The system should be event-driven and queue-first.

Application Services
    |
    | 1. Notification request / domain event
    v
Notification API
    |
    | 2. Validate request, resolve template, store durable record
    v
PostgreSQL
    |
    | 3. Transactional outbox event
    v
Outbox Dispatcher
    |
    | 4. Publish jobs
    v
Queue System
    |
    | 5. Channel workers
    v
Provider Router
    |
    | 6. Send through selected provider
    v
Push / SMS / Email Providers
    |
    | 7. Delivery webhooks
    v
Webhook Processor
    |
    | 8. Status updates, audit logs, metrics
    v
PostgreSQL + Observability Stack

Example event:

{
  "event": "withdrawal_successful",
  "user_id": "user_123",
  "channels": ["push", "email", "sms"],
  "template": "withdrawal_success",
  "idempotency_key": "withdrawal_wd_789_success_notification",
  "metadata": {
    "amount": 50000,
    "currency": "NGN"
  }
}

3. Design Goals

Functional Requirements

Send notifications through push, SMS, and email.
Support single-user notifications.
Support batch and campaign notifications.
Support scheduled notifications.
Support templates and localization.
Support user preferences and opt-outs.
Support priority-based delivery.
Track every send attempt.
Track provider delivery events.
Provide admin visibility into notification status.

Non-Functional Requirements

Reliable for critical notifications such as OTPs, transaction alerts, and security alerts.
Horizontally scalable for 1M+ users.
Fault tolerant when workers, queues, databases, or providers fail.
Idempotent at API, queue, worker, and webhook layers.
Observable through metrics, logs, traces, and dashboards.
Cost-aware, especially for SMS.
Secure with PII protection and strict access control.

4. Core Principle

The API should never send notifications directly.

The API should only:

Authenticate and authorize the caller.
Validate the request.
Resolve the recipient and notification policy.
Create a durable notification record.
Create an outbox event in the same database transaction.
Return a response.

Actual delivery should happen asynchronously in background workers.

This prevents request timeouts, supports retries, allows backpressure, and gives the system a durable source of truth.

5. Main Components

5.1 Notification API

The Notification API accepts requests from internal services such as wallet, auth, orders, billing, and marketing.

Example endpoint:

POST /notifications/send

Example request:

{
  "user_id": "user_123",
  "type": "transaction_alert",
  "channels": ["push", "sms", "email"],
  "template": "transaction_alert_v1",
  "idempotency_key": "txn_txn_456_alert",
  "priority": "HIGH",
  "metadata": {
    "amount": 12000,
    "currency": "NGN",
    "merchant": "Example Store"
  }
}

The idempotency_key is required (the column is NOT NULL in the data model). For transactional notifications, the producing service should derive it from the business event ID so that every retry of the same event carries the same key.

Good examples:

withdrawal_wd_123_success_notification
otp_login_user_123_202605231020
invoice_inv_456_due_reminder_1

Bad examples:

random UUID generated on every retry
current timestamp only
user ID only

If the same request is submitted twice with the same idempotency_key, the API returns the existing notification record instead of creating a new one.
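A minimal sketch of this behavior, assuming an in-memory SQLite table as a stand-in for the PostgreSQL notifications table. The helper names open_demo_db and create_notification are hypothetical. The unique constraint does the deduplication; in PostgreSQL the same effect is commonly achieved with INSERT ... ON CONFLICT (idempotency_key) DO NOTHING followed by a SELECT.

```python
import sqlite3
import uuid
from typing import Tuple


def open_demo_db() -> sqlite3.Connection:
    """Create a throwaway table with a unique idempotency_key (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notifications ("
        " id TEXT PRIMARY KEY,"
        " user_id TEXT NOT NULL,"
        " idempotency_key TEXT NOT NULL UNIQUE,"
        " status TEXT NOT NULL)"
    )
    return conn


def create_notification(
    conn: sqlite3.Connection, user_id: str, idempotency_key: str
) -> Tuple[str, bool]:
    """Return (notification_id, created). A repeated key returns the existing row."""
    new_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notifications (id, user_id, idempotency_key, status) "
                "VALUES (?, ?, ?, 'PENDING')",
                (new_id, user_id, idempotency_key),
            )
        return new_id, True
    except sqlite3.IntegrityError:
        # The unique constraint fired: another request already created this record.
        row = conn.execute(
            "SELECT id FROM notifications WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], False
```

The second call with the same key does not insert anything; it returns the ID created by the first call, which is what the API would hand back to the caller.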

5.2 PostgreSQL

PostgreSQL is the source of truth.

The queue is not the source of truth. Redis is not the source of truth. Provider dashboards are not the source of truth.

Core tables:

notifications
notification_recipients
notification_attempts
notification_templates
notification_preferences
outbox_events
provider_delivery_events
dead_letter_notifications
provider_accounts

6. Data Model

6.1 `notifications`

Stores the logical notification request.

CREATE TABLE notifications (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    type TEXT NOT NULL,
    template_key TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    priority TEXT NOT NULL,
    status TEXT NOT NULL,
    scheduled_at TIMESTAMPTZ,
    metadata JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE UNIQUE INDEX idx_notifications_idempotency
ON notifications (idempotency_key);

CREATE INDEX idx_notifications_status_scheduled
ON notifications (status, scheduled_at);

Example statuses:

PENDING
QUEUED
PROCESSING
SENT
PARTIALLY_SENT
FAILED
CANCELLED

6.2 `notification_attempts`

Stores each channel-level send attempt.

For a notification sent through push, SMS, and email, there can be one or more attempt records per channel.

CREATE TABLE notification_attempts (
    id UUID PRIMARY KEY,
    notification_id UUID NOT NULL REFERENCES notifications(id),
    channel TEXT NOT NULL,
    provider TEXT,
    provider_message_id TEXT,
    provider_idempotency_key TEXT NOT NULL,
    status TEXT NOT NULL,
    attempt_number INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ,
    locked_by TEXT,
    locked_until TIMESTAMPTZ,
    last_error_code TEXT,
    last_error_message TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_attempts_status_channel
ON notification_attempts (status, channel, next_attempt_at);

CREATE UNIQUE INDEX idx_attempts_provider_idempotency
ON notification_attempts (provider_idempotency_key);

Attempt statuses:

PENDING
PROCESSING
SENT
DELIVERED
FAILED
RETRYING
SUPPRESSED
DEAD_LETTERED

SENT means the provider accepted the message. DELIVERED means the provider later confirmed delivery through a webhook or delivery receipt.

6.3 `outbox_events`

Stores events that must be published to the queue.

CREATE TABLE outbox_events (
    id UUID PRIMARY KEY,
    aggregate_type TEXT NOT NULL,
    aggregate_id UUID NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    status TEXT NOT NULL,
    attempts INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    dispatched_at TIMESTAMPTZ
);

CREATE INDEX idx_outbox_pending
ON outbox_events (status, next_attempt_at, created_at);

Outbox statuses:

PENDING
DISPATCHED
FAILED

6.4 `provider_delivery_events`

Stores webhook events from providers.

CREATE TABLE provider_delivery_events (
    id UUID PRIMARY KEY,
    provider TEXT NOT NULL,
    provider_event_id TEXT NOT NULL,
    provider_message_id TEXT,
    notification_attempt_id UUID REFERENCES notification_attempts(id),
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at TIMESTAMPTZ
);

CREATE UNIQUE INDEX idx_provider_events_unique
ON provider_delivery_events (provider, provider_event_id);

This makes webhook processing idempotent.

7. Idempotency Strategy

Duplicate prevention should happen at multiple layers.

API-Level Idempotency

Every notification request includes an idempotency_key.

CREATE UNIQUE INDEX idx_notifications_idempotency
ON notifications (idempotency_key);

If the same request is retried, the API returns the existing record.

Channel-Level Idempotency

Each channel attempt has a stable provider_idempotency_key.

Example:

notification:{notification_id}:channel:sms:purpose:primary

If a worker crashes and retries, it reuses the same reference instead of generating a new one.
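One way to keep the reference stable is to derive it purely from stored identifiers, never from time or randomness. A small sketch following the format above; the function name and the "fallback" purpose value are assumptions for illustration.

```python
def provider_idempotency_key(
    notification_id: str, channel: str, purpose: str = "primary"
) -> str:
    """Build a deterministic key: the same inputs always produce the same key."""
    return f"notification:{notification_id}:channel:{channel}:purpose:{purpose}"


# A retry of the primary SMS attempt reuses the same key; a deliberate
# fallback send (hypothetical purpose value) gets a distinct one.
primary_key = provider_idempotency_key("9df1a5b8", "sms")
fallback_key = provider_idempotency_key("9df1a5b8", "sms", purpose="fallback")
```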

Provider-Level Idempotency

When supported, pass the provider idempotency key or client reference to the provider.

Examples:

SendGrid custom_args.notification_id
SES message tags
Twilio statusCallback reference
FCM collapse_key or custom data reference

Not all providers offer true idempotency, but most allow custom metadata that helps with reconciliation.

Webhook Idempotency

Webhook processing must deduplicate provider event IDs.

CREATE UNIQUE INDEX idx_provider_events_unique
ON provider_delivery_events (provider, provider_event_id);

If the provider sends the same webhook multiple times, only the first one changes state.

8. Queue Architecture

For a Django/Python implementation, a practical stack is:

Django
PostgreSQL
Redis
Celery
Celery Beat

For larger scale or stricter durability, use:

Kafka
RabbitMQ
Amazon SQS
Google Pub/Sub

Recommended queues:

notifications.critical
notifications.high
notifications.normal
notifications.low
notifications.push
notifications.sms
notifications.email
notifications.retry
notifications.dead_letter

Priority examples:

CRITICAL: OTP, password reset, fraud alert
HIGH: transaction alerts, login alerts
NORMAL: account updates, reminders
LOW: marketing campaigns, newsletters

Critical queues should have more workers, stricter latency alerts, and separate provider rate limits.

9. Transactional Outbox

The transactional outbox prevents this failure:

Database insert succeeds, but queue publish fails.

Notification creation should happen like this:

BEGIN;

INSERT INTO notifications (...);
INSERT INTO notification_attempts (...);
INSERT INTO outbox_events (...);

COMMIT;

An outbox dispatcher later reads pending outbox events and publishes jobs to the queue.

After a successful publish:

outbox_events.status = DISPATCHED
outbox_events.dispatched_at = now()

If the dispatcher crashes, another dispatcher can pick up the same pending outbox event.

The queued job should contain only identifiers, not the full send payload:

{
  "notification_id": "9df1a5b8-4d8d-4a99-8b5e-7ef11c913ef8",
  "attempt_id": "441ebf44-9176-4f3f-a0f7-fd4da8c09e84"
}

Workers reload the latest state from PostgreSQL before sending.
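The dispatcher loop can be sketched as follows. This is a minimal illustration, assuming SQLite as a stand-in for PostgreSQL, epoch seconds instead of TIMESTAMPTZ, and a caller-supplied publish function; the names open_outbox_db, dispatch_pending, and retry_delay are hypothetical. A PostgreSQL version running several dispatchers would likely also claim rows with SELECT ... FOR UPDATE SKIP LOCKED.

```python
import json
import sqlite3
import time
from typing import Callable, Optional


def open_outbox_db() -> sqlite3.Connection:
    """Create a reduced outbox_events table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE outbox_events ("
        " id TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'PENDING',"
        " attempts INTEGER NOT NULL DEFAULT 0,"
        " next_attempt_at REAL NOT NULL DEFAULT 0,"
        " dispatched_at REAL)"
    )
    return conn


def dispatch_pending(
    conn: sqlite3.Connection,
    publish: Callable[[dict], None],
    now: Optional[float] = None,
    retry_delay: float = 30.0,
) -> int:
    """Publish due PENDING events; return how many were marked DISPATCHED."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT id, payload FROM outbox_events "
        "WHERE status = 'PENDING' AND next_attempt_at <= ? ORDER BY rowid",
        (now,),
    ).fetchall()
    dispatched = 0
    for event_id, payload in rows:
        try:
            publish(json.loads(payload))
        except Exception:
            # Publish failed: keep the event PENDING and try again later.
            with conn:
                conn.execute(
                    "UPDATE outbox_events "
                    "SET attempts = attempts + 1, next_attempt_at = ? WHERE id = ?",
                    (now + retry_delay, event_id),
                )
            continue
        with conn:
            conn.execute(
                "UPDATE outbox_events "
                "SET status = 'DISPATCHED', dispatched_at = ? WHERE id = ?",
                (now, event_id),
            )
        dispatched += 1
    return dispatched
```

Because the event is marked DISPATCHED only after the publish succeeds, a crash between the two steps can publish the same job twice. That is acceptable here because workers are required to be safe to run more than once for the same job.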

10. Worker Flow

A worker processes one attempt at a time.

1. Receive job from queue.
2. Load notification and attempt from PostgreSQL.
3. If attempt is already SENT or DELIVERED, exit successfully.
4. If the notification is CANCELLED, exit without sending.
5. Acquire a processing lease; exit if another worker holds it.
6. Check user preferences and legal opt-outs; mark the attempt SUPPRESSED if they block the send.
7. Check rate limits.
8. Render template.
9. Select provider.
10. Send notification.
11. Store provider response.
12. Mark attempt SENT or schedule retry.
13. Update parent notification status.
14. Emit metrics and audit logs.

The worker must be safe to run more than once for the same job.
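A condensed sketch of the guard checks that make a re-run harmless. It works on an in-memory object rather than PostgreSQL rows and omits the lease, rate limits, template rendering, and provider routing; the Attempt class, its fields, and the send callable are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    id: str
    status: str = "PENDING"
    notification_cancelled: bool = False
    user_opted_out: bool = False
    provider_message_id: Optional[str] = None


# send(attempt_id) -> (accepted, provider_message_id)
SendFn = Callable[[str], Tuple[bool, Optional[str]]]


def process_attempt(attempt: Attempt, send: SendFn) -> str:
    """Run the pre-send guards, then send at most once per accepted attempt."""
    if attempt.status in ("SENT", "DELIVERED"):
        return "already_sent"  # duplicate job delivery: nothing to do
    if attempt.notification_cancelled:
        return "cancelled"
    if attempt.user_opted_out:
        attempt.status = "SUPPRESSED"  # recorded so support can explain it
        return "suppressed"
    attempt.status = "PROCESSING"
    accepted, message_id = send(attempt.id)
    if accepted:
        attempt.status = "SENT"
        attempt.provider_message_id = message_id
        return "sent"
    attempt.status = "RETRYING"
    return "retry_scheduled"
```

Running the same job twice calls the provider once: the second run sees SENT and exits.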

11. Locking and Leases

Use database leases as the primary correctness mechanism.

Example:

UPDATE notification_attempts
SET
    status = 'PROCESSING',
    locked_by = :worker_id,
    locked_until = now() + interval '5 minutes',
    updated_at = now()
WHERE id = :attempt_id
  AND status IN ('PENDING', 'RETRYING')
  AND (locked_until IS NULL OR locked_until < now())
RETURNING *;

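If no row is returned, another worker owns the attempt or the attempt is no longer sendable. Note that this statement never reclaims an attempt left in PROCESSING by a crashed worker; the recovery jobs in section 19 find those rows once locked_until has passed and decide whether to reset them to RETRYING or hold them for reconciliation.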
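The lease check can be exercised from Python by looking at the affected row count. This sketch assumes SQLite as a stand-in for PostgreSQL and epoch seconds instead of TIMESTAMPTZ; open_attempts_db, acquire_lease, and LEASE_SECONDS are hypothetical names.

```python
import sqlite3

LEASE_SECONDS = 300  # mirrors the 5-minute interval in the SQL above


def open_attempts_db() -> sqlite3.Connection:
    """Create a reduced notification_attempts table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notification_attempts ("
        " id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " locked_by TEXT,"
        " locked_until REAL)"
    )
    return conn


def acquire_lease(
    conn: sqlite3.Connection, attempt_id: str, worker_id: str, now: float
) -> bool:
    """Atomically claim the attempt; True only for the worker whose UPDATE matched."""
    with conn:
        cur = conn.execute(
            "UPDATE notification_attempts "
            "SET status = 'PROCESSING', locked_by = ?, locked_until = ? "
            "WHERE id = ? "
            "  AND status IN ('PENDING', 'RETRYING') "
            "  AND (locked_until IS NULL OR locked_until < ?)",
            (worker_id, now + LEASE_SECONDS, attempt_id, now),
        )
    return cur.rowcount == 1
```

Only one of two competing workers sees a matched row; the other gets False and should exit without sending.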

Redis locks may also be used as a fast concurrency guard:

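import uuid

lock_id = str(uuid.uuid4())
lock_key = f"notification:attempt:{attempt_id}"

# SET with NX and EX: create the key only if absent, expiring after 300 seconds.
acquired = redis.set(lock_key, lock_id, nx=True, ex=300)

if not acquired:
    return

Release only if the same worker still owns the lock. A separate GET followed by DELETE is not atomic: the lock can expire and be acquired by another worker between the two calls, and the first worker would then delete a lock it no longer owns. Run the compare-and-delete as a single Lua script instead:

RELEASE_SCRIPT = (
    "if redis.call('get', KEYS[1]) == ARGV[1] then "
    "return redis.call('del', KEYS[1]) else return 0 end"
)
redis.eval(RELEASE_SCRIPT, 1, lock_key, lock_id)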

However, Redis locks should not be the only correctness mechanism. PostgreSQL state and idempotency keys should remain authoritative.

12. Provider Abstraction

The system should not be tightly coupled to one provider.

Use provider adapters behind a common interface:

class NotificationProvider:
    def send(self, message: ProviderMessage) -> ProviderResponse:
        raise NotImplementedError

Example providers:

Email: Amazon SES, SendGrid, Mailgun, ZeptoMail
SMS: Termii, Twilio, Africa's Talking
Push: Firebase Cloud Messaging, APNs

Provider response:

class ProviderResponse:
    provider: str
    provider_message_id: str | None
    accepted: bool
    retryable: bool
    error_code: str | None
    error_message: str | None

Each adapter should normalize provider-specific responses into a common internal result.
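As an illustration of that normalization, the sketch below maps a generic HTTP result onto the common response type. The body keys message_id and error are hypothetical; real providers each use their own response shapes, so every adapter needs its own mapping. Treating 429 and 5xx as retryable follows the failure classes listed in section 14.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderResponse:
    provider: str
    provider_message_id: Optional[str]
    accepted: bool
    retryable: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None


def normalize_http_result(provider: str, status_code: int, body: dict) -> ProviderResponse:
    """Translate a provider's HTTP status and JSON body into the internal result."""
    if 200 <= status_code < 300:
        return ProviderResponse(
            provider=provider,
            provider_message_id=body.get("message_id"),  # hypothetical field name
            accepted=True,
            retryable=False,
        )
    # Rate limits and server errors are transient; other 4xx are treated as permanent.
    retryable = status_code == 429 or status_code >= 500
    return ProviderResponse(
        provider=provider,
        provider_message_id=None,
        accepted=False,
        retryable=retryable,
        error_code=str(status_code),
        error_message=body.get("error"),  # hypothetical field name
    )
```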

13. Provider Routing

Provider routing should consider:

Channel.
Country.
Cost.
Provider health.
Provider rate limits.
Message type.
Historical delivery performance.
User segment or tenant.

Example routing:

SMS Nigeria primary: Termii
SMS Nigeria backup: Twilio
SMS Ghana primary: Africa's Talking
Email transactional primary: SES
Email transactional backup: SendGrid
Push Android: FCM
Push iOS: APNs directly, or FCM relaying to APNs

Routing policy can be stored in configuration:

sms:
  NG:
    primary: termii
    fallback: twilio
  GH:
    primary: africastalking
    fallback: twilio
email:
  transactional:
    primary: ses
    fallback: sendgrid
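A router over that configuration can be very small. The sketch below loads the same policy as a Python dict and skips providers currently marked unhealthy; select_provider and the unhealthy set are hypothetical, and a fuller router would also weigh cost, rate limits, and delivery history.

```python
from typing import Optional

ROUTING = {
    "sms": {
        "NG": {"primary": "termii", "fallback": "twilio"},
        "GH": {"primary": "africastalking", "fallback": "twilio"},
    },
    "email": {
        "transactional": {"primary": "ses", "fallback": "sendgrid"},
    },
}


def select_provider(
    channel: str, route_key: str, unhealthy: frozenset = frozenset()
) -> Optional[str]:
    """Return the first healthy provider for the route, or None if none is usable.

    route_key is the country code for SMS and the message class for email.
    """
    route = ROUTING.get(channel, {}).get(route_key)
    if route is None:
        return None
    for name in (route["primary"], route["fallback"]):
        if name not in unhealthy:
            return name
    return None
```

Returning None rather than raising lets the caller decide whether to retry later, switch channel, or dead-letter the attempt.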

14. Provider Failover

Failures should be classified before deciding whether to retry or fail over.

Retryable failures:

timeout
connection error
provider 5xx
rate limit
temporary provider outage
unknown response after request timeout

Permanent failures:

invalid phone number
invalid email address
unsubscribed email
bad device token
user opted out
blocked destination
template rejected

Example SMS failover:

Attempt 1: Termii
Attempt 2: Termii
Attempt 3: Twilio
Attempt 4: Twilio
Then dead letter

Failover should not blindly send duplicates. If the primary provider timed out, it may still have accepted the message, so the system should first check whether a provider message ID or webhook arrives before sending through a backup provider. For critical messages, the business may still choose immediate failover because the risk of a late duplicate is preferable to the risk of a missed delivery. That choice should be a per-notification policy.
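The classification and the example SMS plan can be combined into one decision function. This is a sketch under stated assumptions: the error code strings are made-up normalized names for the failure classes above, the plan tuple mirrors the Termii/Twilio example, and treating unrecognized codes as non-retryable is a conservative choice of this sketch, not something the spec prescribes.

```python
from typing import Optional, Tuple

RETRYABLE = {
    "timeout", "connection_error", "provider_5xx",
    "rate_limit", "provider_outage", "unknown_after_timeout",
}

# Provider to use for attempt 1, 2, 3, 4 (from the example SMS failover above).
SMS_PLAN = ("termii", "termii", "twilio", "twilio")


def next_action(
    error_code: str, failed_attempt_number: int, plan: Tuple[str, ...] = SMS_PLAN
) -> Tuple[str, Optional[str]]:
    """Decide what to do after attempt N (1-based) failed with error_code.

    Returns ("retry", provider), ("fail", None), or ("dead_letter", None).
    """
    if error_code not in RETRYABLE:
        # Permanent failures (and, by assumption, unknown codes) are not retried.
        return ("fail", None)
    if failed_attempt_number >= len(plan):
        return ("dead_letter", None)
    # plan is 0-indexed, so plan[N] is the provider for attempt N + 1.
    return ("retry", plan[failed_attempt_number])
```

This sketch does not implement the wait-for-webhook check described above; a policy-aware version would hold an "unknown_after_timeout" failure for reconciliation before switching providers.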

15. Retry Strategy

Use escalating (roughly exponential) backoff with jitter.

Example:

Attempt 1: immediately
Attempt 2: after 1 minute
Attempt 3: after 5 minutes
Attempt 4: after 15 minutes
Attempt 5: after 1 hour

Add jitter to avoid retry storms:

next_attempt_at = now() + base_backoff + random(0, 30 seconds)

Retry limits should vary by notification type:

OTP: short retry window, low max age
Transaction alert: moderate retry window
Marketing email: long retry window
Security alert: aggressive retry and fallback

Every retry must reuse the same internal notification and attempt history.
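The example schedule and jitter translate directly into a small helper. A sketch only: the function name, the rng parameter (there to make the jitter testable), and the choice to skip jitter for the immediate first attempt are assumptions.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional

# Delay before attempts 1..5, taken from the example schedule above.
BACKOFF = [
    timedelta(0),
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(hours=1),
]
MAX_JITTER_SECONDS = 30


def next_attempt_at(attempt_number: int, now: datetime, rng=random) -> Optional[datetime]:
    """Return when attempt N (1-based) should run, or None if retries are exhausted."""
    if attempt_number < 1 or attempt_number > len(BACKOFF):
        return None
    delay = BACKOFF[attempt_number - 1]
    if delay == timedelta(0):
        return now  # first attempt runs immediately, no jitter
    jitter = timedelta(seconds=rng.uniform(0, MAX_JITTER_SECONDS))
    return now + delay + jitter


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
third = next_attempt_at(3, now)  # between 5m00s and 5m30s after now
```

Per-type retry limits could be expressed by passing a different schedule list for OTPs than for marketing email.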

16. Dead Letter Queue

If all retries fail, move the attempt to a dead-letter state.

notification_attempts.status = DEAD_LETTERED

Also store a dead-letter record:

CREATE TABLE dead_letter_notifications (
    id UUID PRIMARY KEY,
    notification_id UUID NOT NULL,
    attempt_id UUID NOT NULL,
    channel TEXT NOT NULL,
    provider TEXT,
    reason TEXT NOT NULL,
    last_error_code TEXT,
    last_error_message TEXT,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Dead-letter records let engineering and support inspect:

Which notification failed.
Which provider was used.
How many times it was retried.
Whether the failure was permanent or temporary.
Whether manual replay is safe.

Manual replay should require permission and should create a new explicit replay record.

17. Delivery Webhooks

Providers usually send delivery events asynchronously.

Examples:

SMS delivered
SMS failed
Email delivered
Email bounced
Email opened
Push failed
Device token invalid

Webhook endpoint:

POST /notifications/webhooks/{provider}

Webhook processing:

1. Verify provider signature.
2. Parse event.
3. Insert provider_delivery_events row.
4. Ignore duplicate provider event IDs.
5. Match event to notification_attempt.
6. Update attempt status.
7. Update user contact health if needed.
8. Emit metrics.

Example updates:

email delivered -> attempt status DELIVERED
email bounced -> attempt status FAILED
sms delivered -> attempt status DELIVERED
push invalid token -> mark device token inactive
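The first steps of webhook processing can be sketched as below. Assumptions are loud here: signature schemes differ per provider, so the hex HMAC-SHA256 over the raw body is only one common pattern; the payload fields event_id, type, and attempt_id are hypothetical; and the in-memory set stands in for the unique index on (provider, provider_event_id).

```python
import hashlib
import hmac
import json

EVENT_TO_STATUS = {
    "email.delivered": "DELIVERED",
    "email.bounced": "FAILED",
    "sms.delivered": "DELIVERED",
}


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Constant-time check of a hex HMAC-SHA256 signature over the raw body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


class WebhookProcessor:
    def __init__(self, secret: bytes) -> None:
        self.secret = secret
        self.seen = set()         # stand-in for the unique index
        self.attempt_status = {}  # stand-in for notification_attempts.status

    def handle(self, provider: str, body: bytes, signature_hex: str) -> str:
        if not verify_signature(self.secret, body, signature_hex):
            return "rejected"
        event = json.loads(body)
        key = (provider, event["event_id"])
        if key in self.seen:
            return "duplicate"    # replayed webhook: no state change
        self.seen.add(key)
        new_status = EVENT_TO_STATUS.get(event["type"])
        if new_status is not None:
            self.attempt_status[event["attempt_id"]] = new_status
        return "processed"
```

In the real system the duplicate check would be the INSERT into provider_delivery_events failing on the unique index, done in the same transaction as the status update.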

18. Avoiding Duplicate Sends

Use layered duplicate protection:

Unique idempotency_key on notifications.
Stable channel-level provider_idempotency_key.
Database lease before processing an attempt.
Redis lock as an additional short-lived guard.
Worker checks attempt status before sending.
Provider idempotency or client reference where supported.
Idempotent webhook processing.
Reconciliation jobs to resolve uncertain states.

Before sending, every worker should check:

If notification is CANCELLED, do not send.
If attempt is SENT or DELIVERED, do not send.
If user is opted out, suppress the attempt.
If another worker owns the lease, do not send.

19. Avoiding Missed Sends

Use durable state and recovery processes:

Save notification records before queueing.
Use the transactional outbox pattern.
Store all attempt state in PostgreSQL.
Run an outbox dispatcher continuously.
Retry failed queue publishes.
Run scheduled recovery jobs.
Reconcile stuck PROCESSING attempts.
Reconcile provider statuses through webhooks and provider APIs.

Recovery jobs:

Find outbox_events where status = PENDING and next_attempt_at <= now()
Find attempts where status = PROCESSING and locked_until < now()
Find attempts where status = RETRYING and next_attempt_at <= now()
Find notifications stuck in QUEUED with no active attempts
Find SENT attempts with no webhook after provider-specific timeout

This handles worker crashes, queue publish failures, provider timeouts, and missed webhooks.

20. Graceful Degradation

The system should degrade based on notification importance.

Push Provider Failure

For non-critical messages:

Queue retry later.

For critical messages:

Push fails -> SMS fallback
Push delayed -> SMS fallback after threshold

SMS Provider Failure

Primary SMS provider down -> switch to backup provider.
Backup provider rate-limited -> queue retry with backoff.
Critical messages -> use alternative channel if allowed.

Email Provider Failure

Transactional email -> fail over to backup provider.
Marketing email -> delay and retry later.

Queue Backlog

Pause low-priority campaigns.
Allocate workers to critical queues.
Apply backpressure to producers.
Rate-limit bulk sends.

Database Pressure

Reduce campaign ingestion.
Batch writes where safe.
Archive old notification events.
Protect transactional notifications first.

Graceful degradation should be policy-driven. OTPs and fraud alerts deserve different behavior from newsletters.

21. User Preferences and Compliance

Before sending, check:

can_receive_sms
can_receive_email
can_receive_push
marketing_opt_in
quiet_hours
timezone
country
language
unsubscribed_at
blocked_until

Transactional messages may bypass marketing preferences, but must still respect:

Legal unsubscribe requirements.
Provider rules.
Platform rules.
User account restrictions.
Regional compliance requirements.

Suppressed messages should still be recorded:

attempt.status = SUPPRESSED
reason = USER_OPTED_OUT

That way, support can explain why a notification was not sent.

22. Rate Limiting

Rate limits should exist at several levels:

provider
channel
user
tenant
notification type
country
campaign

Examples:

Max 3 OTP SMS per user per 10 minutes
Max 10 marketing messages per user per day
Max 100 SMS per second through Provider A
Max 5 password reset emails per user per hour

Redis is useful for fast counters and sliding windows.

If a rate limit is hit:

Critical notifications can be delayed briefly or routed to a backup provider.
Low-priority notifications can be rescheduled.
Abusive requests can be rejected.
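A sliding-window check such as "max 3 OTP SMS per user per 10 minutes" can be sketched in-process as follows. This is a single-process stand-in for illustration; a shared Redis-backed limiter would be needed across workers, and the class name and key format are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the last `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.events = defaultdict(deque)  # key -> timestamps of allowed events

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        timestamps = self.events[key]
        # Drop events that have fallen out of the window.
        while timestamps and timestamps[0] <= now - self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True


otp_limiter = SlidingWindowLimiter(limit=3, window_seconds=600)
allowed = otp_limiter.allow("otp_sms:user_123", now=0.0)
```

Rejected calls are not recorded, so a blocked user regains capacity as soon as the oldest allowed event leaves the window.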

23. Templates and Personalization

Templates should be versioned.

Example:

transaction_alert_v1
withdrawal_success_v3
otp_login_v2

Template rendering should happen in workers using stored metadata.

Template records:

template_key
version
channel
locale
subject
body
variables_schema
status
created_at
updated_at

Rendering rules:

Validate required variables.
Escape user-supplied content.
Use localized templates when available.
Fall back to default locale.
Store enough rendered output or metadata for audit, depending on PII policy.

Avoid storing full sensitive message bodies forever if they contain PII or financial data.
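The first two rendering rules can be illustrated with the standard library. This sketch assumes $name-style placeholders (string.Template) and HTML escaping, which suits email bodies; SMS and push text would normally skip HTML escaping. The render function and its parameters are hypothetical.

```python
import html
from string import Template
from typing import Iterable, Mapping


def render(
    body_template: str,
    required: Iterable[str],
    variables: Mapping[str, object],
    escape_html: bool = True,
) -> str:
    """Validate required variables, escape values, and fill the template."""
    missing = sorted(set(required) - set(variables))
    if missing:
        raise ValueError(f"missing template variables: {missing}")
    values = {
        name: html.escape(str(value)) if escape_html else str(value)
        for name, value in variables.items()
    }
    # substitute() raises KeyError if the template uses an undeclared variable.
    return Template(body_template).substitute(values)


message = render(
    "Hi $name, you spent $amount $currency at $merchant.",
    required=["name", "amount", "currency", "merchant"],
    variables={"name": "Ada", "amount": 12000, "currency": "NGN", "merchant": "Example Store"},
)
```

Failing fast on a missing variable is preferable to sending a message with a blank or literal placeholder; the attempt can then be marked FAILED with a clear reason.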

24. Push Notifications

Push notifications need device token management.

Device token table:

id
user_id
platform
token_hash
encrypted_token
status
last_seen_at
created_at
updated_at

Push-specific behavior:

Send to active devices only.
Remove or deactivate invalid tokens.
Use collapse keys for replaceable messages.
Use high priority only for urgent alerts.
Respect mobile platform rules.
Track provider responses per device when needed.

For multi-device users, a single logical push notification may create multiple device-level attempts.

25. SMS Notifications

SMS-specific concerns:

Country-specific routing.
Sender ID rules.
Message length and segmentation.
Unicode cost impact.
Regulatory restrictions.
Delivery receipt availability.
Cost controls.

SMS should have strict rate limits because it is expensive.

For critical SMS:

Use primary provider first.
Retry transient failures.
Fail over to backup provider.
Stop retrying once the provider accepts the message or delivery is confirmed.

26. Email Notifications

Email-specific concerns:

Bounce handling.
Suppression lists.
Spam reputation.
Dedicated IP pools for high scale.
Transactional and marketing separation.
Open and click tracking, if needed.
Unsubscribe links for marketing email.

Transactional and marketing email should be separated:

transactional.example.com
marketing.example.com

Hard bounces should update user email health and suppress future sends to that address.

27. Observability

Metrics:

notifications_created_total
notification_attempts_total
notifications_sent_total
notifications_failed_total
notifications_suppressed_total
provider_latency_ms
provider_error_rate
provider_timeout_rate
queue_depth
queue_age_seconds
retry_count
dead_letter_count
delivery_rate
bounce_rate
sms_cost_estimate

Alerts:

OTP delivery p95 > 30 seconds
SMS provider failure rate > 5%
Email bounce rate spike
Queue depth increasing for critical queue
Dead-letter count increasing
Outbox pending events older than 2 minutes
Webhook processing failures
Provider rate limit reached

Structured logs should include:

notification_id
attempt_id
user_id
channel
provider
idempotency_key
provider_message_id
worker_id
request_id

Distributed tracing should connect:

producer service -> notification API -> outbox dispatcher -> worker -> provider adapter -> webhook

28. Admin Dashboard

The admin dashboard should answer:

Was this notification created?
Was it sent?
Which channels were attempted?
Which provider was used?
What was the provider response?
Was it delivered?
Did it bounce or fail?
Was it retried?
Why was it suppressed?
Can it be safely replayed?

Useful views:

Search by user ID, notification ID, idempotency key, provider message ID.
Timeline of notification events.
Attempt history.
Provider webhook history.
Dead-letter queue.
Retry and replay controls.
Provider health dashboard.
Queue depth dashboard.

Manual resend should:

Require elevated permission.
Record who triggered it.
Use a new replay reason.
Avoid bypassing compliance rules.
Preserve the original notification audit trail.

29. Security and Privacy

Security requirements:

Authenticate all producer services.
Authorize notification types by service.
Verify provider webhook signatures.
Encrypt sensitive contact fields.
Avoid logging raw phone numbers, emails, tokens, or message bodies.
Mask PII in dashboards.
Limit dashboard access by role.
Audit manual resend actions.
Store provider credentials in a secret manager.

PII examples:

email address
phone number
device token
message body containing financial or account data

Use tokenization, hashing, or encryption where appropriate.

30. Scaling Plan for 1M+ Users

Scaling techniques:

Queue all sends.
Separate queues by channel and priority.
Horizontally scale workers.
Use PostgreSQL indexes for status scans.
Partition large tables by time.
Archive old delivery events.
Cache user preferences in Redis.
Batch low-priority campaign creation.
Enforce provider-specific throughput limits.
Use multiple provider accounts if needed.
Keep queued jobs small and ID-based.
Use read replicas for analytics and dashboards.

Suggested indexes:

CREATE INDEX idx_notifications_status_scheduled
ON notifications (status, scheduled_at);

CREATE INDEX idx_attempts_status_channel
ON notification_attempts (status, channel, next_attempt_at);

CREATE UNIQUE INDEX idx_notifications_idempotency
ON notifications (idempotency_key);

CREATE UNIQUE INDEX idx_provider_events_unique
ON provider_delivery_events (provider, provider_event_id);

CREATE INDEX idx_outbox_pending
ON outbox_events (status, next_attempt_at, created_at);

For campaign sends to millions of users, avoid inserting all rows synchronously inside the API request. Create a campaign job, expand recipients asynchronously, and throttle generation into channel queues.

31. Example Critical Notification Flow

Example: withdrawal successful alert.

1. Wallet service emits withdrawal_successful.
2. Notification API receives the request.
3. API validates the payload and idempotency key.
4. API creates notification and attempt records in PostgreSQL.
5. API creates an outbox event in the same transaction.
6. Outbox dispatcher publishes jobs to the queue.
7. SMS, push, and email workers pick up channel attempts.
8. Each worker acquires a database lease.
9. Each worker checks user preferences and rate limits.
10. Each worker renders the correct template.
11. Provider router selects the best provider.
12. Worker sends through the provider.
13. Provider response is saved.
14. Retry or fallback is scheduled if needed.
15. Provider webhook updates delivery status.
16. Parent notification status becomes SENT, PARTIALLY_SENT, or FAILED.
17. Metrics, logs, and audit trail are updated.

32. Failure Scenarios

32.1 API Receives Duplicate Request

Unique idempotency key prevents duplicate notification creation.
Existing notification is returned.

32.2 Database Insert Succeeds but Queue Publish Fails

Transactional outbox keeps the event in PostgreSQL.
Dispatcher retries publish later.
No notification is lost.

32.3 Worker Crashes Before Sending

Attempt lease expires.
Recovery job or queue retry picks it up.
Worker sends later.

32.4 Worker Crashes After Provider Accepted Message

Attempt may remain PROCESSING until lease expires.
Recovery checks provider reference or waits for webhook.
If uncertain, policy decides whether to retry, fail over, or hold for reconciliation.

32.5 Provider Times Out

Classify as uncertain.
Retry with same idempotency reference if supported.
For critical messages, fallback may be allowed after a delay.
For non-critical messages, wait and reconcile.

32.6 Provider Sends Duplicate Webhook

Unique provider event ID prevents duplicate processing.

32.7 Marketing Campaign Overloads Queue

Low-priority queues are paused or throttled.
Critical and high-priority queues continue.

33. Technology Choice

For a Python/Django system:

Django or FastAPI: Notification API
PostgreSQL: durable source of truth
Redis: Celery broker, rate limits, short-lived locks, cache
Celery: async workers
Celery Beat: scheduled recovery jobs
Prometheus/Grafana: metrics and dashboards
OpenTelemetry: tracing
Sentry or similar: error reporting

For very high throughput:

Kafka or SQS instead of Redis-backed Celery queues
Dedicated outbox dispatcher service
Dedicated webhook processor service
Partitioned PostgreSQL tables
Separate OLAP store for analytics

34. Final Architecture Summary

I would build this as a durable, queue-first notification platform.

The Notification API writes every request to PostgreSQL and creates an outbox event in the same transaction. A dispatcher publishes outbox events to queues. Channel-specific workers process attempts, acquire leases, check preferences and rate limits, render templates, route to providers, and record every provider response. Delivery webhooks update final statuses and are processed idempotently.

Reliability comes from:

idempotency keys
transactional outbox
database leases
short-lived Redis locks
channel-level attempt records
provider references
retry queues
provider failover
dead-letter queues
webhook deduplication
reconciliation jobs
audit logs

Graceful degradation comes from:

priority queues
provider health checks
fallback providers
channel fallback for critical notifications
rate limits
backpressure
pausing low-priority campaigns
delaying non-critical sends

The system should not assume any provider is reliable. Providers will time out, rate-limit requests, send duplicate webhooks, delay delivery receipts, and occasionally accept messages without returning a clean response. The architecture handles that by making PostgreSQL the source of truth, making workers idempotent, and continuously reconciling uncertain states.

adelekecode/notification.md
