Design a scalable notification platform that sends push notifications, SMS, and email to more than 1M users.
The system must:
- Support push, SMS, and email.
- Handle transactional and marketing notifications.
- Avoid duplicate sends.
- Avoid missed sends.
- Retry transient failures.
- Fail over between providers.
- Gracefully degrade when providers are slow or unavailable.
- Provide a full audit trail for support, compliance, and debugging.
- Scale horizontally across channels, priorities, and workers.
Important caveat: no system can guarantee true end-to-end exactly-once delivery once third-party providers are involved. A provider can time out after accepting a message, or deliver a message while our system never receives the response. The correct target is effectively-once sending from our system, using idempotency keys, durable state, provider references, retries, reconciliation, and duplicate suppression.
The system should be event-driven and queue-first.
Application Services
|
| 1. Notification request / domain event
v
Notification API
|
| 2. Validate request, resolve template, store durable record
v
PostgreSQL
|
| 3. Transactional outbox event
v
Outbox Dispatcher
|
| 4. Publish jobs
v
Queue System
|
| 5. Channel workers
v
Provider Router
|
| 6. Send through selected provider
v
Push / SMS / Email Providers
|
| 7. Delivery webhooks
v
Webhook Processor
|
| 8. Status updates, audit logs, metrics
v
PostgreSQL + Observability Stack

Example event:
{
  "event": "withdrawal_successful",
  "user_id": "user_123",
  "channels": ["push", "email", "sms"],
  "template": "withdrawal_success",
  "idempotency_key": "withdrawal_wd_789_success_notification",
  "metadata": {
    "amount": 50000,
    "currency": "NGN"
  }
}

- Send notifications through push, SMS, and email.
- Support single-user notifications.
- Support batch and campaign notifications.
- Support scheduled notifications.
- Support templates and localization.
- Support user preferences and opt-outs.
- Support priority-based delivery.
- Track every send attempt.
- Track provider delivery events.
- Provide admin visibility into notification status.
- Reliable for critical notifications such as OTPs, transaction alerts, and security alerts.
- Horizontally scalable for 1M+ users.
- Fault tolerant when workers, queues, databases, or providers fail.
- Idempotent at API, queue, worker, and webhook layers.
- Observable through metrics, logs, traces, and dashboards.
- Cost-aware, especially for SMS.
- Secure with PII protection and strict access control.
The API should never send notifications directly.
The API should only:
- Authenticate and authorize the caller.
- Validate the request.
- Resolve the recipient and notification policy.
- Create a durable notification record.
- Create an outbox event in the same database transaction.
- Return a response.
Actual delivery should happen asynchronously in background workers.
This prevents request timeouts, supports retries, allows backpressure, and gives the system a durable source of truth.
The Notification API accepts requests from internal services such as wallet, auth, orders, billing, and marketing.
Example endpoint:
POST /notifications/send

Example request:
{
  "user_id": "user_123",
  "type": "transaction_alert",
  "channels": ["push", "sms", "email"],
  "template": "transaction_alert_v1",
  "idempotency_key": "txn_txn_456_alert",
  "priority": "HIGH",
  "metadata": {
    "amount": 12000,
    "currency": "NGN",
    "merchant": "Example Store"
  }
}

The idempotency_key is required for transactional notifications. It should be generated by the producing service from the business event ID.
Good examples:
withdrawal_wd_123_success_notification
otp_login_user_123_202605231020
invoice_inv_456_due_reminder_1

Bad examples:
random UUID generated on every retry
current timestamp only
user ID only

If the same request is submitted twice with the same idempotency_key, the API returns the existing notification record instead of creating a new one.
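The create-or-return behaviour described above can be sketched as follows. This is a minimal illustration, not the platform's actual API code: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the helper names `build_idempotency_key` and `create_or_get_notification` are hypothetical.

```python
import sqlite3
import uuid
from typing import Tuple


def build_idempotency_key(event_type: str, business_id: str, purpose: str) -> str:
    """Derive a stable key from the business event, never from time or randomness."""
    return f"{event_type}_{business_id}_{purpose}"


def create_or_get_notification(
    conn: sqlite3.Connection, user_id: str, key: str
) -> Tuple[str, bool]:
    """Return (notification_id, created). A repeated key returns the existing row."""
    new_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notifications (id, user_id, idempotency_key) VALUES (?, ?, ?)",
                (new_id, user_id, key),
            )
        return new_id, True
    except sqlite3.IntegrityError:
        # The unique constraint fired: this request was already accepted.
        row = conn.execute(
            "SELECT id FROM notifications WHERE idempotency_key = ?", (key,)
        ).fetchone()
        return row[0], False


# SQLite stands in for PostgreSQL here (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications ("
    "id TEXT PRIMARY KEY, user_id TEXT NOT NULL, idempotency_key TEXT NOT NULL UNIQUE)"
)

key = build_idempotency_key("withdrawal", "wd_123", "success_notification")
first_id, first_created = create_or_get_notification(conn, "user_123", key)
second_id, second_created = create_or_get_notification(conn, "user_123", key)
```

The second call returns the same ID with `created` set to false, which is what lets a producer retry safely after a network error.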
PostgreSQL is the source of truth.
The queue is not the source of truth. Redis is not the source of truth. Provider dashboards are not the source of truth.
Core tables:
notifications
notification_recipients
notification_attempts
notification_templates
notification_preferences
outbox_events
provider_delivery_events
dead_letter_notifications
provider_accounts

The notifications table stores the logical notification request.
CREATE TABLE notifications (
    id UUID PRIMARY KEY,
    user_id UUID NOT NULL,
    type TEXT NOT NULL,
    template_key TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    priority TEXT NOT NULL,
    status TEXT NOT NULL,
    scheduled_at TIMESTAMPTZ,
    metadata JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE UNIQUE INDEX idx_notifications_idempotency
    ON notifications (idempotency_key);
CREATE INDEX idx_notifications_status_scheduled
    ON notifications (status, scheduled_at);

Example statuses:
PENDING
QUEUED
PROCESSING
SENT
PARTIALLY_SENT
FAILED
CANCELLED

The notification_attempts table stores each channel-level send attempt.
For a notification sent through push, SMS, and email, there can be one or more attempt records per channel.
CREATE TABLE notification_attempts (
    id UUID PRIMARY KEY,
    notification_id UUID NOT NULL REFERENCES notifications(id),
    channel TEXT NOT NULL,
    provider TEXT,
    provider_message_id TEXT,
    provider_idempotency_key TEXT NOT NULL,
    status TEXT NOT NULL,
    attempt_number INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ,
    locked_by TEXT,
    locked_until TIMESTAMPTZ,
    last_error_code TEXT,
    last_error_message TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_attempts_status_channel
    ON notification_attempts (status, channel, next_attempt_at);
CREATE UNIQUE INDEX idx_attempts_provider_idempotency
    ON notification_attempts (provider_idempotency_key);

Attempt statuses:
PENDING
PROCESSING
SENT
DELIVERED
FAILED
RETRYING
SUPPRESSED
DEAD_LETTERED

SENT means the provider accepted the message. DELIVERED means the provider later confirmed delivery through a webhook or delivery receipt.

The outbox_events table stores events that must be published to the queue.
CREATE TABLE outbox_events (
    id UUID PRIMARY KEY,
    aggregate_type TEXT NOT NULL,
    aggregate_id UUID NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    status TEXT NOT NULL,
    attempts INT NOT NULL DEFAULT 0,
    next_attempt_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    dispatched_at TIMESTAMPTZ
);
CREATE INDEX idx_outbox_pending
    ON outbox_events (status, next_attempt_at, created_at);

Outbox statuses:
PENDING
DISPATCHED
FAILED

The provider_delivery_events table stores webhook events from providers.
CREATE TABLE provider_delivery_events (
    id UUID PRIMARY KEY,
    provider TEXT NOT NULL,
    provider_event_id TEXT NOT NULL,
    provider_message_id TEXT,
    notification_attempt_id UUID REFERENCES notification_attempts(id),
    event_type TEXT NOT NULL,
    payload JSONB NOT NULL,
    received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    processed_at TIMESTAMPTZ
);
CREATE UNIQUE INDEX idx_provider_events_unique
    ON provider_delivery_events (provider, provider_event_id);

The unique index makes webhook processing idempotent.
Duplicate prevention should happen at multiple layers.
Every notification request includes an idempotency_key.
CREATE UNIQUE INDEX idx_notifications_idempotency
    ON notifications (idempotency_key);

If the same request is retried, the API returns the existing record.
Each channel attempt has a stable provider_idempotency_key.
Example:
notification:{notification_id}:channel:sms:purpose:primary

If a worker crashes and retries, it reuses the same reference instead of generating a new one.
When supported, pass the provider idempotency key or client reference to the provider.
Examples:
SendGrid custom_args.notification_id
SES message tags
Twilio statusCallback reference
FCM collapse_key or custom data reference

Not all providers offer true idempotency, but most allow custom metadata that helps with reconciliation.
Webhook processing must deduplicate provider event IDs.
CREATE UNIQUE INDEX idx_provider_events_unique
    ON provider_delivery_events (provider, provider_event_id);

If the provider sends the same webhook multiple times, only the first one changes state.
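A rough sketch of how the unique index can drive webhook deduplication is shown below. It assumes an in-memory SQLite database in place of PostgreSQL, a trimmed-down schema, and a hypothetical event shape (`id`, `message_id`, `type`); real provider payloads differ.

```python
import json
import sqlite3

# SQLite stands in for PostgreSQL; the tables are simplified for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE attempts (
        id TEXT PRIMARY KEY, provider_message_id TEXT, status TEXT NOT NULL
    );
    CREATE TABLE provider_delivery_events (
        provider TEXT NOT NULL,
        provider_event_id TEXT NOT NULL,
        provider_message_id TEXT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        UNIQUE (provider, provider_event_id)
    );
    INSERT INTO attempts VALUES ('attempt_1', 'msg_1', 'SENT');
    """
)

# Hypothetical mapping from provider event types to internal attempt statuses.
EVENT_TO_STATUS = {"delivered": "DELIVERED", "bounced": "FAILED"}


def process_webhook(provider: str, event: dict) -> bool:
    """Return True if the event was recorded for the first time, False if duplicate."""
    try:
        with conn:  # the event insert and the status update commit together
            conn.execute(
                "INSERT INTO provider_delivery_events VALUES (?, ?, ?, ?, ?)",
                (provider, event["id"], event["message_id"], event["type"],
                 json.dumps(event)),
            )
            new_status = EVENT_TO_STATUS.get(event["type"])
            if new_status is not None:
                conn.execute(
                    "UPDATE attempts SET status = ? WHERE provider_message_id = ?",
                    (new_status, event["message_id"]),
                )
        return True
    except sqlite3.IntegrityError:
        # Same (provider, provider_event_id) seen before: ignore the redelivery.
        return False


event = {"id": "evt_1", "message_id": "msg_1", "type": "delivered"}
first = process_webhook("example_provider", event)
second = process_webhook("example_provider", event)  # provider redelivery
```

Because the insert and the status update share one transaction, a duplicate event rolls back before it can touch the attempt row.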
For a Django/Python implementation, a practical stack is:
Django
PostgreSQL
Redis
Celery
Celery Beat

For larger scale or stricter durability, use:
Kafka
RabbitMQ
Amazon SQS
Google Pub/Sub

Recommended queues:
notifications.critical
notifications.high
notifications.normal
notifications.low
notifications.push
notifications.sms
notifications.email
notifications.retry
notifications.dead_letter

Priority examples:
CRITICAL: OTP, password reset, fraud alert
HIGH: transaction alerts, login alerts
NORMAL: account updates, reminders
LOW: marketing campaigns, newsletters

Critical queues should have more workers, stricter latency alerts, and separate provider rate limits.
The transactional outbox prevents this failure:
Database insert succeeds, but queue publish fails.

Notification creation should happen like this:
BEGIN;
INSERT INTO notifications (...);
INSERT INTO notification_attempts (...);
INSERT INTO outbox_events (...);
COMMIT;

An outbox dispatcher later reads pending outbox events and publishes jobs to the queue.
After a successful publish:
outbox_events.status = DISPATCHED
outbox_events.dispatched_at = now()

If the dispatcher crashes, another dispatcher can pick up the same pending outbox event.
The queued job should contain only identifiers, not the full send payload:
{
  "notification_id": "9df1a5b8-4d8d-4a99-8b5e-7ef11c913ef8",
  "attempt_id": "441ebf44-9176-4f3f-a0f7-fd4da8c09e84"
}

Workers reload the latest state from PostgreSQL before sending.
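The outbox flow above can be illustrated with a small sketch. It is not a production dispatcher: SQLite replaces PostgreSQL, the schema is reduced to the columns the example needs, and `create_notification`, `dispatch_pending`, and the `publish` callable are hypothetical names.

```python
import json
import sqlite3
import uuid

# SQLite stands in for PostgreSQL; only the columns used here are modelled.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE notifications (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox_events (
        id TEXT PRIMARY KEY,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'PENDING',
        attempts INTEGER NOT NULL DEFAULT 0
    );
    """
)


def create_notification() -> str:
    """Write the notification and its outbox event in one transaction."""
    notification_id = str(uuid.uuid4())
    payload = json.dumps({"notification_id": notification_id})
    with conn:  # both inserts commit together or not at all
        conn.execute(
            "INSERT INTO notifications VALUES (?, 'PENDING')", (notification_id,)
        )
        conn.execute(
            "INSERT INTO outbox_events (id, payload) VALUES (?, ?)",
            (str(uuid.uuid4()), payload),
        )
    return notification_id


def dispatch_pending(publish) -> int:
    """Publish every PENDING outbox event; return how many were dispatched."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox_events WHERE status = 'PENDING'"
    ).fetchall()
    dispatched = 0
    for event_id, payload in rows:
        try:
            publish(json.loads(payload))
        except Exception:
            # Publish failed: leave the event PENDING so a later run retries it.
            with conn:
                conn.execute(
                    "UPDATE outbox_events SET attempts = attempts + 1 WHERE id = ?",
                    (event_id,),
                )
            continue
        with conn:
            conn.execute(
                "UPDATE outbox_events SET status = 'DISPATCHED' WHERE id = ?",
                (event_id,),
            )
        dispatched += 1
    return dispatched


def broken_publish(job: dict) -> None:
    raise ConnectionError("queue unavailable")


queue = []  # a plain list stands in for the real queue
notification_id = create_notification()
failed_round = dispatch_pending(broken_publish)  # queue is down, event stays PENDING
ok_round = dispatch_pending(queue.append)        # queue is back, event is published
```

Publishing before marking the row DISPATCHED means a crash between the two steps can publish the same job twice, which is why workers must be safe to run more than once.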
A worker processes one attempt at a time.
1. Receive job from queue.
2. Load notification and attempt from PostgreSQL.
3. If attempt is already SENT or DELIVERED, exit successfully.
4. Check notification status.
5. Check user preferences and legal opt-outs.
6. Check rate limits.
7. Acquire a processing lease.
8. Render template.
9. Select provider.
10. Send notification.
11. Store provider response.
12. Mark attempt SENT or schedule retry.
13. Update parent notification status.
14. Emit metrics and audit logs.

The worker must be safe to run more than once for the same job.
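A condensed sketch of the guard checks in the worker flow follows. It is an in-memory illustration only: the `Attempt` and `Context` classes, the `send` callable, and the simple status assignment that stands in for the database lease are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Statuses after which a worker must never send again.
TERMINAL = {"SENT", "DELIVERED", "SUPPRESSED", "DEAD_LETTERED"}


@dataclass
class Attempt:
    id: str
    channel: str
    status: str = "PENDING"
    provider_message_id: Optional[str] = None


@dataclass
class Context:
    notification_status: str = "QUEUED"
    opted_out_channels: Set[str] = field(default_factory=set)


def process_attempt(attempt: Attempt, ctx: Context, send) -> str:
    """Run the guard checks, then send at most once. Returns the resulting status."""
    if attempt.status in TERMINAL:
        return attempt.status  # duplicate job delivery: nothing to do
    if ctx.notification_status == "CANCELLED":
        return attempt.status
    if attempt.channel in ctx.opted_out_channels:
        attempt.status = "SUPPRESSED"
        return attempt.status
    attempt.status = "PROCESSING"  # stands in for acquiring the database lease
    try:
        attempt.provider_message_id = send(attempt)
        attempt.status = "SENT"
    except TimeoutError:
        attempt.status = "RETRYING"  # transient failure: schedule a retry
    return attempt.status


calls = []


def fake_send(attempt: Attempt) -> str:
    calls.append(attempt.id)
    return "msg_1"


attempt = Attempt(id="a1", channel="sms")
first = process_attempt(attempt, Context(), fake_send)
second = process_attempt(attempt, Context(), fake_send)  # same job delivered twice
```

The second invocation exits at the first guard, so the fake provider is called exactly once even though the job ran twice.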
Use database leases as the primary correctness mechanism.
Example:
UPDATE notification_attempts
SET
status = 'PROCESSING',
locked_by = :worker_id,
locked_until = now() + interval '5 minutes',
updated_at = now()
WHERE id = :attempt_id
AND status IN ('PENDING', 'RETRYING')
AND (locked_until IS NULL OR locked_until < now())
RETURNING *;

If no row is returned, another worker owns the attempt or the attempt is no longer sendable.
Redis locks may also be used as a fast concurrency guard:
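import uuid

# Fragment from inside a worker function; redis is a redis-py client.
lock_id = str(uuid.uuid4())
lock_key = f"notification:attempt:{attempt_id}"
acquired = redis.set(lock_key, lock_id, nx=True, ex=300)
if not acquired:
    return

Release only if the same worker still owns the lock. A GET followed by a DELETE is not atomic (the lock can expire and be taken by another worker in between), and redis-py returns bytes by default, so comparing against a str would never match. Do the compare-and-delete in one Lua script instead:
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""
redis.eval(RELEASE_SCRIPT, 1, lock_key, lock_id)

However, Redis locks should not be the only correctness mechanism. PostgreSQL state and idempotency keys should remain authoritative.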
The system should not be tightly coupled to one provider.
Use provider adapters behind a common interface:
class NotificationProvider:
    def send(self, message: ProviderMessage) -> ProviderResponse:
        raise NotImplementedError

Example providers:
Email: Amazon SES, SendGrid, Mailgun, ZeptoMail
SMS: Termii, Twilio, Africa's Talking
Push: Firebase Cloud Messaging, APNs

Provider response:
class ProviderResponse:
    provider: str
    provider_message_id: str | None
    accepted: bool
    retryable: bool
    error_code: str | None
    error_message: str | None

Each adapter should normalize provider-specific responses into a common internal result.
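As an illustration of that normalization, here is a sketch of one adapter. The provider is entirely hypothetical: `ExampleSmsAdapter`, the injected `http_post` callable returning `(status_code, body_dict)`, and the payload field names are assumptions, not any real provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProviderMessage:
    to: str
    body: str
    reference: str  # the stable provider_idempotency_key for this attempt


@dataclass
class ProviderResponse:
    provider: str
    provider_message_id: Optional[str]
    accepted: bool
    retryable: bool
    error_code: Optional[str] = None
    error_message: Optional[str] = None


class ExampleSmsAdapter:
    """Adapter for a hypothetical SMS provider with an injected HTTP client."""

    name = "example_sms"

    def __init__(self, http_post):
        self._http_post = http_post  # callable(payload) -> (status_code, body_dict)

    def send(self, message: ProviderMessage) -> ProviderResponse:
        payload = {"to": message.to, "text": message.body, "ref": message.reference}
        try:
            status, body = self._http_post(payload)
        except TimeoutError as exc:
            # Outcome unknown: the provider may still have accepted the message.
            return ProviderResponse(self.name, None, False, True, "timeout", str(exc))
        if 200 <= status < 300:
            return ProviderResponse(self.name, body.get("message_id"), True, False)
        # Rate limits and server errors are transient; other 4xx are permanent.
        retryable = status == 429 or status >= 500
        return ProviderResponse(
            self.name, None, False, retryable, str(status), body.get("error")
        )


def timing_out(payload: dict):
    raise TimeoutError("no response")


msg = ProviderMessage("+2348000000000", "Your code is 123456", "notification:n1:channel:sms")
ok = ExampleSmsAdapter(lambda p: (200, {"message_id": "m1"})).send(msg)
limited = ExampleSmsAdapter(lambda p: (429, {"error": "slow down"})).send(msg)
invalid = ExampleSmsAdapter(lambda p: (400, {"error": "invalid number"})).send(msg)
timed_out = ExampleSmsAdapter(timing_out).send(msg)
```

Workers can then decide on retry or failover from `accepted` and `retryable` alone, without knowing which provider produced the response.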
Provider routing should consider:
- Channel.
- Country.
- Cost.
- Provider health.
- Provider rate limits.
- Message type.
- Historical delivery performance.
- User segment or tenant.
Example routing:
SMS Nigeria primary: Termii
SMS Nigeria backup: Twilio
SMS Ghana primary: Africa's Talking
Email transactional primary: SES
Email transactional backup: SendGrid
Push Android: FCM
Push iOS: APNs or FCM-to-APNs

Routing policy can be stored in configuration:
sms:
  NG:
    primary: termii
    fallback: twilio
  GH:
    primary: africastalking
    fallback: twilio
email:
  transactional:
    primary: ses
    fallback: sendgrid

Failures should be classified before deciding whether to retry or fail over.
Retryable failures:
timeout
connection error
provider 5xx
rate limit
temporary provider outage
unknown response after request timeout

Permanent failures:
invalid phone number
invalid email address
unsubscribed email
bad device token
user opted out
blocked destination
template rejected

Example SMS failover:
Attempt 1: Termii
Attempt 2: Termii
Attempt 3: Twilio
Attempt 4: Twilio
Then dead letter

Failover should not blindly send duplicates. If the primary provider timed out after accepting the message, the system should check whether a provider message ID or webhook arrives before immediately sending through a backup provider. For critical messages, the business may still choose failover because late duplicate risk is better than missed delivery risk. That should be a per-notification policy.
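The classification and failover sequence can be sketched as a small decision function. The error-code strings, the `ROUTING` table shape, and the choice to dead-letter unrecognized codes are assumptions for illustration; the sketch also leaves out the wait-for-webhook check described above.

```python
from typing import Optional, Tuple

# Error codes are hypothetical normalized names for the failure lists above.
RETRYABLE = {
    "timeout", "connection_error", "provider_5xx",
    "rate_limit", "provider_outage", "unknown_response",
}
PERMANENT = {
    "invalid_phone_number", "invalid_email_address", "unsubscribed_email",
    "bad_device_token", "user_opted_out", "blocked_destination", "template_rejected",
}

ROUTING = {
    ("sms", "NG"): {"primary": "termii", "fallback": "twilio"},
    ("sms", "GH"): {"primary": "africastalking", "fallback": "twilio"},
}
ATTEMPTS_PER_PROVIDER = 2  # two tries on the primary, then two on the fallback


def next_action(
    channel: str, country: str, attempts_made: int, error_code: str
) -> Tuple[str, Optional[str]]:
    """Return (action, provider); action is 'retry', 'fail', or 'dead_letter'."""
    if error_code in PERMANENT:
        return ("fail", None)  # retrying cannot help
    if error_code not in RETRYABLE:
        # Assumption: unrecognized codes go to the dead-letter queue for inspection.
        return ("dead_letter", None)
    route = ROUTING[(channel, country)]
    plan = ([route["primary"]] * ATTEMPTS_PER_PROVIDER
            + [route["fallback"]] * ATTEMPTS_PER_PROVIDER)
    if attempts_made >= len(plan):
        return ("dead_letter", None)
    return ("retry", plan[attempts_made])


after_first_failure = next_action("sms", "NG", 1, "timeout")
after_second_failure = next_action("sms", "NG", 2, "timeout")
after_fourth_failure = next_action("sms", "NG", 4, "timeout")
```

With two attempts per provider this reproduces the Termii, Termii, Twilio, Twilio, dead-letter sequence from the SMS example.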
Use exponential backoff with jitter.
Example:
Attempt 1: immediately
Attempt 2: after 1 minute
Attempt 3: after 5 minutes
Attempt 4: after 15 minutes
Attempt 5: after 1 hour

Add jitter to avoid retry storms:
next_attempt_at = now() + base_backoff + random(0, 30 seconds)

Retry limits should vary by notification type:
OTP: short retry window, low max age
Transaction alert: moderate retry window
Marketing email: long retry window
Security alert: aggressive retry and fallback

Every retry must reuse the same internal notification and attempt history.
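A minimal sketch of the retry schedule with jitter, using the example delays listed above. The function name `next_attempt_at`, the 1-based `attempt_number` convention, and the injected random generator are assumptions made so the example is deterministic and testable.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Optional

# Delay before each attempt, matching the example schedule above.
BACKOFF_SCHEDULE = [
    timedelta(0),
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=15),
    timedelta(hours=1),
]
MAX_JITTER_SECONDS = 30


def next_attempt_at(
    attempt_number: int, now: datetime, rng: random.Random
) -> Optional[datetime]:
    """Return when the given 1-based attempt should run, or None if exhausted."""
    if attempt_number < 1 or attempt_number > len(BACKOFF_SCHEDULE):
        return None  # no attempts left: the caller should dead-letter
    delay = BACKOFF_SCHEDULE[attempt_number - 1]
    if not delay:
        return now  # the first attempt runs immediately, without jitter
    jitter = timedelta(seconds=rng.uniform(0, MAX_JITTER_SECONDS))
    return now + delay + jitter


rng = random.Random(0)  # seeded only to make the example repeatable
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
second_attempt = next_attempt_at(2, now, rng)
exhausted = next_attempt_at(6, now, rng)
```

Per-type limits (for example a short window for OTPs) could be modelled by passing a different schedule for each notification type.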
If all retries fail, move the attempt to a dead-letter state.
notification_attempts.status = DEAD_LETTERED

Also store a dead-letter record:
CREATE TABLE dead_letter_notifications (
    id UUID PRIMARY KEY,
    notification_id UUID NOT NULL,
    attempt_id UUID NOT NULL,
    channel TEXT NOT NULL,
    provider TEXT,
    reason TEXT NOT NULL,
    last_error_code TEXT,
    last_error_message TEXT,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Dead-letter records let engineering and support inspect:
- Which notification failed.
- Which provider was used.
- How many times it was retried.
- Whether the failure was permanent or temporary.
- Whether manual replay is safe.
Manual replay should require permission and should create a new explicit replay record.
Providers usually send delivery events asynchronously.
Examples:
SMS delivered
SMS failed
Email delivered
Email bounced
Email opened
Push failed
Device token invalid

Webhook endpoint:
POST /notifications/webhooks/{provider}

Webhook processing:
1. Verify provider signature.
2. Parse event.
3. Insert provider_delivery_events row.
4. Ignore duplicate provider event IDs.
5. Match event to notification_attempt.
6. Update attempt status.
7. Update user contact health if needed.
8. Emit metrics.

Example updates:
email delivered -> attempt status DELIVERED
email bounced -> attempt status FAILED
sms delivered -> attempt status DELIVERED
push invalid token -> mark device token inactive

Use layered duplicate protection:
- Unique idempotency_key on notifications.
- Stable channel-level provider_idempotency_key.
- Database lease before processing an attempt.
- Redis lock as an additional short-lived guard.
- Worker checks attempt status before sending.
- Provider idempotency or client reference where supported.
- Idempotent webhook processing.
- Reconciliation jobs to resolve uncertain states.
Before sending, every worker should check:
If notification is CANCELLED, do not send.
If attempt is SENT or DELIVERED, do not send.
If user is opted out, suppress the attempt.
If another worker owns the lease, do not send.

Use durable state and recovery processes:
- Save notification records before queueing.
- Use the transactional outbox pattern.
- Store all attempt state in PostgreSQL.
- Run an outbox dispatcher continuously.
- Retry failed queue publishes.
- Run scheduled recovery jobs.
- Reconcile stuck PROCESSING attempts.
- Reconcile provider statuses through webhooks and provider APIs.
Recovery jobs:
Find outbox_events where status = PENDING and next_attempt_at <= now()
Find attempts where status = PROCESSING and locked_until < now()
Find attempts where status = RETRYING and next_attempt_at <= now()
Find notifications stuck in QUEUED with no active attempts
Find SENT attempts with no webhook after provider-specific timeout

This handles worker crashes, queue publish failures, provider timeouts, and missed webhooks.
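Two of the recovery scans above (expired PROCESSING leases and due RETRYING attempts) might look like the following in-memory sketch. The `AttemptRow` class and `recover_attempts` function are hypothetical, and resetting an expired lease to RETRYING is an assumption made so that the lease query, which only accepts PENDING or RETRYING rows, can pick the attempt up again.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class AttemptRow:
    id: str
    status: str
    locked_until: Optional[datetime] = None
    next_attempt_at: Optional[datetime] = None


def recover_attempts(attempts: List[AttemptRow], now: datetime) -> List[str]:
    """Return ids to re-enqueue; expired PROCESSING leases are reset to RETRYING."""
    to_enqueue = []
    for attempt in attempts:
        lease_expired = (
            attempt.status == "PROCESSING"
            and attempt.locked_until is not None
            and attempt.locked_until < now
        )
        if lease_expired:
            # The worker holding the lease is presumed dead.
            attempt.status = "RETRYING"
            attempt.locked_until = None
            attempt.next_attempt_at = now
        retry_due = (
            attempt.status == "RETRYING"
            and attempt.next_attempt_at is not None
            and attempt.next_attempt_at <= now
        )
        if retry_due:
            to_enqueue.append(attempt.id)
    return to_enqueue


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
attempts = [
    AttemptRow("stuck", "PROCESSING", locked_until=now - timedelta(minutes=1)),
    AttemptRow("active", "PROCESSING", locked_until=now + timedelta(minutes=4)),
    AttemptRow("due", "RETRYING", next_attempt_at=now - timedelta(seconds=5)),
    AttemptRow("later", "RETRYING", next_attempt_at=now + timedelta(minutes=5)),
]
recovered = recover_attempts(attempts, now)
```

A stuck PROCESSING attempt may already have reached the provider, so a real recovery job would check the provider reference or wait for a webhook before allowing a resend.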
The system should degrade based on notification importance.
For non-critical messages:
Queue retry later.

For critical messages:
Push fails -> SMS fallback
Push delayed -> SMS fallback after threshold

Primary SMS provider down -> switch to backup provider.
Backup provider rate-limited -> queue retry with backoff.
Critical messages -> use alternative channel if allowed.

Transactional email -> fail over to backup provider.
Marketing email -> delay and retry later.

Pause low-priority campaigns.
Allocate workers to critical queues.
Apply backpressure to producers.
Rate-limit bulk sends.

Reduce campaign ingestion.
Batch writes where safe.
Archive old notification events.
Protect transactional notifications first.

Graceful degradation should be policy-driven. OTPs and fraud alerts deserve different behavior from newsletters.
Before sending, check:
can_receive_sms
can_receive_email
can_receive_push
marketing_opt_in
quiet_hours
timezone
country
language
unsubscribed_at
blocked_until

Transactional messages may bypass marketing preferences, but must still respect:
- Legal unsubscribe requirements.
- Provider rules.
- Platform rules.
- User account restrictions.
- Regional compliance requirements.
Suppressed messages should still be recorded:
attempt.status = SUPPRESSED
reason = USER_OPTED_OUT

That way, support can explain why a notification was not sent.
Rate limits should exist at several levels:
provider
channel
user
tenant
notification type
country
campaign

Examples:
Max 3 OTP SMS per user per 10 minutes
Max 10 marketing messages per user per day
Max 100 SMS per second through Provider A
Max 5 password reset emails per user per hour

Redis is useful for fast counters and sliding windows.
If a rate limit is hit:
- Critical notifications can be delayed briefly or routed to a backup provider.
- Low-priority notifications can be rescheduled.
- Abusive requests can be rejected.
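To make the sliding-window idea concrete, here is a small in-process sketch. It keeps timestamps in memory rather than in Redis, and the class name `SlidingWindowLimiter` and the key format are assumptions; a shared deployment would need the same logic backed by a store that all workers can reach.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per key within the trailing window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of allowed events

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# "Max 3 OTP SMS per user per 10 minutes"
otp_limiter = SlidingWindowLimiter(limit=3, window_seconds=600)
key = "otp_sms:user_123"
results = [otp_limiter.allow(key, now=t) for t in (0, 10, 20, 30)]
allowed_after_window = otp_limiter.allow(key, now=601)
```

The fourth request inside the window is refused, and a request after the oldest timestamp expires is allowed again.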
Templates should be versioned.
Example:
transaction_alert_v1
withdrawal_success_v3
otp_login_v2

Template rendering should happen in workers using stored metadata.
Template records:
template_key
version
channel
locale
subject
body
variables_schema
status
created_at
updated_at

Rendering rules:
- Validate required variables.
- Escape user-supplied content.
- Use localized templates when available.
- Fall back to default locale.
- Store enough rendered output or metadata for audit, depending on PII policy.
Avoid storing full sensitive message bodies forever if they contain PII or financial data.
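The rendering rules could be sketched like this, using Python's standard `string.Template`. The template text, the `REQUIRED` variable sets, and the `render` function are illustrative assumptions, and HTML escaping is shown on the assumption that the output is an HTML email body; an SMS body would need different handling.

```python
import html
from string import Template

DEFAULT_LOCALE = "en"

# Hypothetical template store keyed by (template_key, locale).
TEMPLATES = {
    ("transaction_alert_v1", "en"): Template("You spent $currency $amount at $merchant."),
    ("transaction_alert_v1", "fr"): Template(
        "Vous avez dépensé $amount $currency chez $merchant."
    ),
}
REQUIRED = {"transaction_alert_v1": {"amount", "currency", "merchant"}}


def render(template_key: str, locale: str, variables: dict) -> str:
    """Validate variables, pick a locale with fallback, escape, and render."""
    missing = REQUIRED[template_key] - set(variables)
    if missing:
        raise ValueError(f"missing template variables: {sorted(missing)}")
    template = TEMPLATES.get((template_key, locale))
    if template is None:
        template = TEMPLATES[(template_key, DEFAULT_LOCALE)]  # locale fallback
    # Escape every value so user-supplied content cannot inject markup.
    safe = {name: html.escape(str(value)) for name, value in variables.items()}
    return template.substitute(safe)


metadata = {"amount": 12000, "currency": "NGN", "merchant": "Tom & <b>Co</b>"}
rendered = render("transaction_alert_v1", "de", metadata)  # no "de": falls back to "en"
```

Validating before rendering turns a missing variable into an explicit error that can be recorded on the attempt, rather than a half-rendered message.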
Push notifications need device token management.
Device token table:
id
user_id
platform
token_hash
encrypted_token
status
last_seen_at
created_at
updated_at

Push-specific behavior:
- Send to active devices only.
- Remove or deactivate invalid tokens.
- Use collapse keys for replaceable messages.
- Use high priority only for urgent alerts.
- Respect mobile platform rules.
- Track provider responses per device when needed.
For multi-device users, a single logical push notification may create multiple device-level attempts.
SMS-specific concerns:
- Country-specific routing.
- Sender ID rules.
- Message length and segmentation.
- Unicode cost impact.
- Regulatory restrictions.
- Delivery receipt availability.
- Cost controls.
SMS should have strict rate limits because it is expensive.
For critical SMS:
Use primary provider first.
Retry transient failures.
Fail over to backup provider.
Stop retrying once delivery is confirmed.

Email-specific concerns:
- Bounce handling.
- Suppression lists.
- Spam reputation.
- Dedicated IP pools for high scale.
- Transactional and marketing separation.
- Open and click tracking, if needed.
- Unsubscribe links for marketing email.
Transactional and marketing email should be separated:
transactional.example.com
marketing.example.com

Hard bounces should update user email health and suppress future sends to that address.
Metrics:
notifications_created_total
notification_attempts_total
notifications_sent_total
notifications_failed_total
notifications_suppressed_total
provider_latency_ms
provider_error_rate
provider_timeout_rate
queue_depth
queue_age_seconds
retry_count
dead_letter_count
delivery_rate
bounce_rate
sms_cost_estimate

Alerts:
OTP delivery p95 > 30 seconds
SMS provider failure rate > 5%
Email bounce rate spike
Queue depth increasing for critical queue
Dead-letter count increasing
Outbox pending events older than 2 minutes
Webhook processing failures
Provider rate limit reached

Structured logs should include:
notification_id
attempt_id
user_id
channel
provider
idempotency_key
provider_message_id
worker_id
request_id

Distributed tracing should connect:
producer service -> notification API -> outbox dispatcher -> worker -> provider adapter -> webhook

The admin dashboard should answer:
Was this notification created?
Was it sent?
Which channels were attempted?
Which provider was used?
What was the provider response?
Was it delivered?
Did it bounce or fail?
Was it retried?
Why was it suppressed?
Can it be safely replayed?

Useful views:
- Search by user ID, notification ID, idempotency key, provider message ID.
- Timeline of notification events.
- Attempt history.
- Provider webhook history.
- Dead-letter queue.
- Retry and replay controls.
- Provider health dashboard.
- Queue depth dashboard.
Manual resend should:
- Require elevated permission.
- Record who triggered it.
- Use a new replay reason.
- Avoid bypassing compliance rules.
- Preserve the original notification audit trail.
Security requirements:
- Authenticate all producer services.
- Authorize notification types by service.
- Verify provider webhook signatures.
- Encrypt sensitive contact fields.
- Avoid logging raw phone numbers, emails, tokens, or message bodies.
- Mask PII in dashboards.
- Limit dashboard access by role.
- Audit manual resend actions.
- Store provider credentials in a secret manager.
PII examples:
email address
phone number
device token
message body containing financial or account data

Use tokenization, hashing, or encryption where appropriate.
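A few helper functions show what masking and hashing might look like for logs and dashboards. The function names and masking formats are assumptions for illustration, and the secret would come from a secret manager rather than being hard-coded as it is here.

```python
import hashlib
import hmac


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"
    return f"{local[0]}***@{domain}"


def mask_phone(phone: str) -> str:
    """Show only the last four digits."""
    digits = [ch for ch in phone if ch.isdigit()]
    if len(digits) <= 4:
        return "***"
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def token_fingerprint(token: str, secret: bytes) -> str:
    """Keyed hash so logs can correlate a device token without exposing it."""
    return hmac.new(secret, token.encode("utf-8"), hashlib.sha256).hexdigest()


SECRET = b"example-secret"  # placeholder only; load from a secret manager in practice
masked_email = mask_email("ada@example.com")
masked_phone = mask_phone("+234 801 234 5678")
fingerprint = token_fingerprint("device-token-abc", SECRET)
```

A keyed hash is used for the token rather than a plain digest so that someone with log access cannot confirm a guessed token without also holding the secret.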
Scaling techniques:
- Queue all sends.
- Separate queues by channel and priority.
- Horizontally scale workers.
- Use PostgreSQL indexes for status scans.
- Partition large tables by time.
- Archive old delivery events.
- Cache user preferences in Redis.
- Batch low-priority campaign creation.
- Enforce provider-specific throughput limits.
- Use multiple provider accounts if needed.
- Keep queued jobs small and ID-based.
- Use read replicas for analytics and dashboards.
Suggested indexes:
CREATE INDEX idx_notifications_status_scheduled
ON notifications (status, scheduled_at);
CREATE INDEX idx_attempts_status_channel_next
ON notification_attempts (status, channel, next_attempt_at);
CREATE UNIQUE INDEX idx_notifications_idempotency
ON notifications (idempotency_key);
CREATE UNIQUE INDEX idx_provider_events_unique
ON provider_delivery_events (provider, provider_event_id);
CREATE INDEX idx_outbox_pending
    ON outbox_events (status, next_attempt_at, created_at);

For campaign sends to millions of users, avoid inserting all rows synchronously inside the API request. Create a campaign job, expand recipients asynchronously, and throttle generation into channel queues.
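Asynchronous campaign expansion can be sketched as a batching loop. The names `batched` and `expand_campaign`, the job shape, and the batch size are assumptions; throttling between batches (for example a sleep or a token bucket) is left out to keep the example short.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items without loading everything."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def expand_campaign(
    campaign_id: str,
    user_ids: Iterable[str],
    enqueue: Callable[[dict], None],
    batch_size: int = 1000,
) -> int:
    """Enqueue one small job per batch of recipients; return recipients processed."""
    total = 0
    for batch in batched(user_ids, batch_size):
        enqueue({"campaign_id": campaign_id, "user_ids": batch})
        total += len(batch)
    return total


jobs = []  # a plain list stands in for the low-priority queue
recipients = (f"user_{i}" for i in range(2500))  # streamed, never fully in memory
total = expand_campaign("campaign_1", recipients, jobs.append, batch_size=1000)
```

Because recipients are consumed lazily, the same loop works whether the source is a generator, a database cursor, or a file of user IDs.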
Example: withdrawal successful alert.
1. Wallet service emits withdrawal_successful.
2. Notification API receives the request.
3. API validates the payload and idempotency key.
4. API creates notification and attempt records in PostgreSQL.
5. API creates an outbox event in the same transaction.
6. Outbox dispatcher publishes jobs to the queue.
7. SMS, push, and email workers pick up channel attempts.
8. Each worker acquires a database lease.
9. Each worker checks user preferences and rate limits.
10. Each worker renders the correct template.
11. Provider router selects the best provider.
12. Worker sends through the provider.
13. Provider response is saved.
14. Retry or fallback is scheduled if needed.
15. Provider webhook updates delivery status.
16. Parent notification status becomes SENT, PARTIALLY_SENT, or FAILED.
17. Metrics, logs, and audit trail are updated.

Failure scenarios and how the design handles them:

Duplicate request:
Unique idempotency key prevents duplicate notification creation.
Existing notification is returned.

Queue publish fails after the database commit:
Transactional outbox keeps the event in PostgreSQL.
Dispatcher retries publish later.
No notification is lost.

Worker crashes before sending:
Attempt lease expires.
Recovery job or queue retry picks it up.
Worker sends later.

Worker crashes after sending but before recording the result:
Attempt may remain PROCESSING until lease expires.
Recovery checks provider reference or waits for webhook.
If uncertain, policy decides whether to retry, fail over, or hold for reconciliation.

Provider times out:
Classify as uncertain.
Retry with same idempotency reference if supported.
For critical messages, fallback may be allowed after a delay.
For non-critical messages, wait and reconcile.

Duplicate webhook:
Unique provider event ID prevents duplicate processing.

Queue overload:
Low-priority queues are paused or throttled.
Critical and high-priority queues continue.

For a Python/Django system:
Django or FastAPI: Notification API
PostgreSQL: durable source of truth
Redis: Celery broker, rate limits, short-lived locks, cache
Celery: async workers
Celery Beat: scheduled recovery jobs
Prometheus/Grafana: metrics and dashboards
OpenTelemetry: tracing
Sentry or similar: error reporting

For very high throughput:
Kafka or SQS instead of Redis-backed Celery queues
Dedicated outbox dispatcher service
Dedicated webhook processor service
Partitioned PostgreSQL tables
Separate OLAP store for analytics

I would build this as a durable, queue-first notification platform.
The Notification API writes every request to PostgreSQL and creates an outbox event in the same transaction. A dispatcher publishes outbox events to queues. Channel-specific workers process attempts, acquire leases, check preferences and rate limits, render templates, route to providers, and record every provider response. Delivery webhooks update final statuses and are processed idempotently.
Reliability comes from:
idempotency keys
transactional outbox
database leases
short-lived Redis locks
channel-level attempt records
provider references
retry queues
provider failover
dead-letter queues
webhook deduplication
reconciliation jobs
audit logs

Graceful degradation comes from:
priority queues
provider health checks
fallback providers
channel fallback for critical notifications
rate limits
backpressure
pausing low-priority campaigns
delaying non-critical sends

The system should not assume any provider is reliable. Providers will time out, rate-limit, duplicate webhooks, delay delivery receipts, and occasionally accept messages without returning a clean response. The architecture handles that by making PostgreSQL the source of truth, making workers idempotent, and continuously reconciling uncertain states.