Level 3 · 25 min
Outbox Pattern
The Outbox Pattern solves one of the hardest problems in distributed systems: reliably publishing events to a message broker as part of a database transaction. It eliminates the dual-write problem — the risk of saving to the DB but failing to publish (or vice versa).
The Dual-Write Problem
The naive approach: (1) save Order to DB, (2) publish OrderPlaced to Kafka. These are two separate operations. If the app crashes between steps 1 and 2, the order is saved but the event is never published — downstream services never know about it. Inventory is never reserved. The fulfillment service never starts. At a logistics company, this pattern caused ~300 lost order events per month during peak traffic (GC pauses killing pods between the two writes). Only detected during nightly reconciliation against the carrier API. If you reverse the order (publish first, then save), the event is published but the order save might fail — downstream services process an event for an order that doesn't exist. No matter which order you choose, there's a window of inconsistency. The dual-write problem: atomically committing to two separate systems (relational DB + message broker) requires 2PC — which most teams rightly avoid. You need a different approach that uses only one durable system as the source of truth.
Transactional Outbox
The Outbox Pattern solution: in the same DB transaction that saves the Order, also write the OrderPlaced event to an outbox table. Since both writes are in the same DB transaction, they're atomic — either both commit or both rollback. The outbox table schema: `(id UUID PRIMARY KEY, aggregate_type VARCHAR, aggregate_id VARCHAR, event_type VARCHAR, payload JSONB, created_at TIMESTAMPTZ DEFAULT NOW(), published_at TIMESTAMPTZ)`. Index: `CREATE INDEX ON outbox_events (published_at NULLS FIRST, created_at)` — the relay queries `WHERE published_at IS NULL ORDER BY created_at LIMIT 100`. A separate relay process reads unpublished events and publishes them to Kafka. After a successful Kafka publish acknowledge (acks=all), the relay updates `published_at = NOW()`. The DB is the single source of truth — the outbox row exists if and only if the business entity was saved. Key design choice: the `published_at` column approach (vs a `published BOOLEAN`) allows point-in-time queries for operational debugging (`SELECT * FROM outbox_events WHERE published_at IS NULL AND created_at < NOW() - INTERVAL '5 minutes'` — events stuck for more than 5 minutes indicate a relay problem). Add a `last_error` VARCHAR column to record relay failures for observability. Key insight from Designing Data-Intensive Applications: "Dual writes have some serious problems, one of which is a race condition... two clients concurrently want to update an item X. Both clients first write the new value to the database, then write it to the search index. Due to unlucky timing, the requests are interleaved: the database first sees write from client 1, the search index first sees write from client 2. The two systems are now permanently inconsistent with each other, even though no error occurred." The Outbox Pattern eliminates this race by making the database the single writer: both the business entity and the outbox event are committed in one transaction, giving the database's serialization order as the definitive sequence of events. Kleppmann describes this as "keeping one system as the source of truth and making all others derived" — exactly what the outbox achieves by treating the event log as a derivative of the database, not a parallel write target.
Relay Strategies and Guarantees
Two relay strategies: Polling relay — a background thread queries `SELECT * FROM outbox_events WHERE published_at IS NULL ORDER BY created_at LIMIT 100` on a schedule (e.g., every 500ms). Simple to implement, no additional infrastructure. Downsides: polling adds constant DB load even when idle; minimum latency equals the poll interval; at high throughput, the polling query can miss events if the batch processes slowly. Best for: <10K events/day, teams without Kafka expertise, or databases without CDC support. CDC-based relay (Change Data Capture) — Debezium captures every INSERT into the outbox table directly from the PostgreSQL WAL (Write-Ahead Log) and streams it to Kafka as a Kafka Connect source connector. Near-zero latency (milliseconds from commit to Kafka). No polling load on the DB. Scales to millions of events/day without degrading query performance. Debezium risk: it maintains a replication slot in PostgreSQL. If Debezium falls behind (e.g., connector paused for hours), the WAL accumulates unboundedly — the PostgreSQL primary disk can fill up. Monitor `pg_replication_slots` lag. Set `max_slot_wal_keep_size` (Postgres 13+) to cap WAL retention. The Inbox Pattern (consumer side): an inbox table on the consumer side (`processed_events(event_id UUID PRIMARY KEY, processed_at TIMESTAMPTZ)`) records processed event IDs. Before processing: `INSERT INTO processed_events(id) ON CONFLICT DO NOTHING` — if it returns 0 rows inserted, the event was already processed. Combine with database-level deduplication to make any consumer exactly-once from its own perspective.
Code example
// Transactional write + polling relay
@Transactional
void createOrder(CreateOrderCommand cmd) {
Order order = Order.from(cmd);
orderRepository.save(order);
// Same DB transaction: atomic — both commit or both rollback
outboxRepository.save(OutboxEvent.builder()
.id(UUID.randomUUID()) // idempotency key for consumers
.aggregateType("Order")
.aggregateId(order.getId().toString())
.eventType("OrderPlaced")
.payload(json.write(new OrderPlaced(order)))
.build());
}
// Polling relay (runs every 500ms in a separate thread)
@Scheduled(fixedDelay = 500)
void relayOutboxEvents() {
List<OutboxEvent> pending = outboxRepo.findUnpublished(100); // LIMIT 100
for (OutboxEvent evt : pending) {
try {
kafkaTemplate.send(evt.getEventType(), evt.getAggregateId(), evt.getPayload())
.get(5, TimeUnit.SECONDS); // synchronous send with timeout
outboxRepo.markPublished(evt.getId()); // only mark after confirmed ack
} catch (Exception e) {
outboxRepo.recordError(evt.getId(), e.getMessage()); // observability
log.error("Relay failed for event {}", evt.getId(), e);
}
}
}