TL;DR
- Commands are imperative: do X. Events are facts: X happened. Keep them distinct.
- For cross‑service, long‑running workflows, use a Saga with orchestration (central controller) or choreography (event‑driven).
- Make every step idempotent and compensatable. Use outbox/inbox, correlation IDs, retries + backoff, and DLQs.
- Event ordering is per key at best; encode version/sequence and design for at‑least‑once delivery.
- Don’t adopt full event sourcing unless you need an audit log and rebuildable state; integration events are enough for most systems.
1) Commands vs Events (use the right tool)
| Concept | What it is | Who consumes it | When emitted | Naming tips |
|---|---|---|---|---|
| Command | A request to perform an action (“CreateOrder”) | Exactly one handler (the owner of the capability) | Incoming at the boundary | VerbImperative (CreateOrder, ReserveStock) |
| Event | A fact about something that happened (“OrderCreated”) | 0..N subscribers | After state changes commit | NounPastTense (OrderCreated, StockReserved) |
Rule: a service owns its domain; it handles commands for that domain and emits events after durable changes.
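The rule above can be sketched in TypeScript as follows. This is a minimal illustration that assumes an in-memory store; `OrderService` and the field names are hypothetical and not tied to any framework.

```typescript
// Commands are requests: imperative names, exactly one handler.
type CreateOrder = { type: "CreateOrder"; orderId: string; total: number };

// Events are facts: past-tense names, zero or more subscribers.
type OrderCreated = { type: "OrderCreated"; orderId: string; total: number };

// Hypothetical owner of the "orders" capability.
class OrderService {
  private readonly orders = new Map<string, { status: string; total: number }>();

  // Handles the command, changes state, then returns the events that resulted.
  handle(cmd: CreateOrder): OrderCreated[] {
    if (this.orders.has(cmd.orderId)) return []; // duplicate command: nothing new happened
    this.orders.set(cmd.orderId, { status: "created", total: cmd.total });
    return [{ type: "OrderCreated", orderId: cmd.orderId, total: cmd.total }];
  }
}

const service = new OrderService();
const first = service.handle({ type: "CreateOrder", orderId: "o-1", total: 42 });
const second = service.handle({ type: "CreateOrder", orderId: "o-1", total: 42 });
```

Here `first` holds one `OrderCreated` event and `second` is empty, because repeating the command did not change state.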
2) Event sourcing vs “events for integration”
- Integration events (most common): you store state normally (tables/documents) and publish events to inform other services.
- Event sourcing (specialized): the event stream is the source of truth; state is rebuilt by folding events. Brings great auditability and temporal queries, but adds complexity (migrations, replay, large streams).
Start with integration events + outbox. Adopt event sourcing only when you specifically need its superpowers.
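To make "state is rebuilt by folding events" concrete, here is a small sketch. The event and state shapes are assumptions chosen for illustration; a real event-sourced system would also deal with snapshots and schema migration.

```typescript
type OrderEvent =
  | { type: "OrderCreated"; orderId: string; total: number }
  | { type: "PaymentCaptured"; orderId: string }
  | { type: "OrderCancelled"; orderId: string };

type OrderState = { status: "none" | "created" | "paid" | "cancelled"; total: number };

const initial: OrderState = { status: "none", total: 0 };

// Pure transition function: current state + one event -> next state.
function apply(state: OrderState, e: OrderEvent): OrderState {
  switch (e.type) {
    case "OrderCreated":
      return { status: "created", total: e.total };
    case "PaymentCaptured":
      return { ...state, status: "paid" };
    case "OrderCancelled":
      return { ...state, status: "cancelled" };
  }
}

// Rebuild state by replaying the stream from the beginning.
function rebuild(stream: OrderEvent[]): OrderState {
  return stream.reduce(apply, initial);
}

const stream: OrderEvent[] = [
  { type: "OrderCreated", orderId: "o-1", total: 42 },
  { type: "PaymentCaptured", orderId: "o-1" },
];
const current = rebuild(stream);               // state after all events
const asOfFirst = rebuild(stream.slice(0, 1)); // temporal query: state after the first event
```

Replaying a prefix of the stream is what makes temporal queries possible, and it is also why long streams get expensive without snapshots.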
3) Sagas — long‑running business workflows
A Saga coordinates a set of local transactions across services, each with a compensation to undo/mitigate if later steps fail.
Two styles:
Orchestration (central brain)
Client → OrderService
└── Orchestrator (process manager)
1) Send ReserveStock
2) On StockReserved → Send ChargePayment
3) On PaymentCaptured → Send CreateShipment
4) On ShipmentCreated → Mark Order Completed
X) On failure → run compensations in reverse
Choreography (event‑driven dance)
OrderCreated ─► Inventory reserves stock ─► StockReserved
StockReserved ─► Payments charges ─► PaymentCaptured
PaymentCaptured ─► Shipping creates shipment ─► ShipmentCreated
ShipmentCreated ─► Orders marks the order ─► OrderCompleted
- Orchestration gives visibility and control (timeouts, retries) but adds a central component.
- Choreography is loosely coupled but can become hard to reason about as steps grow.
4) Reliability patterns you must use
Outbox (publish events reliably)
-- In the same DB transaction as your state change
BEGIN;
INSERT INTO orders(id, status, total) VALUES ($1, 'created', $2);
INSERT INTO outbox(id, topic, key, payload) VALUES ($3, 'OrderCreated', $1, $4);
COMMIT;
-- A relay publishes outbox rows and marks them sent
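The relay mentioned in the last comment could look roughly like this. It is a sketch over an in-memory array; `OutboxRow`, `Publish`, and `relayOnce` are hypothetical names, and a real relay would poll the table and call an asynchronous broker client.

```typescript
type OutboxRow = { id: string; topic: string; key: string; payload: string; sentAt: number | null };

// Stand-in for the broker client; a real one would be asynchronous.
type Publish = (topic: string, key: string, payload: string) => void;

// One relay pass: publish unsent rows in insertion order, then mark them sent.
// If publish throws, the row stays unsent and is retried on the next pass,
// so consumers may see duplicates (at-least-once) and must deduplicate.
function relayOnce(outbox: OutboxRow[], publish: Publish, now: number): number {
  let sent = 0;
  for (const row of outbox) {
    if (row.sentAt !== null) continue;
    try {
      publish(row.topic, row.key, row.payload);
    } catch {
      break; // stop here to preserve order; retry on the next pass
    }
    row.sentAt = now;
    sent++;
  }
  return sent;
}

const outbox: OutboxRow[] = [
  { id: "e-1", topic: "OrderCreated", key: "o-1", payload: "{}", sentAt: null },
  { id: "e-2", topic: "OrderCreated", key: "o-2", payload: "{}", sentAt: null },
];
const published: string[] = [];
const publish: Publish = (_topic, key) => { published.push(key); };

const sentFirst = relayOnce(outbox, publish, 1000);
const sentSecond = relayOnce(outbox, publish, 2000); // nothing left to send
```

Marking a row sent only after a successful publish is what makes this at-least-once: a crash between the two steps republishes the row rather than losing it.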
Inbox (dedupe)
CREATE TABLE inbox (msg_id text PRIMARY KEY, processed_at timestamptz default now());
-- On handler start, in the same transaction as the handler's own writes:
INSERT INTO inbox (msg_id) VALUES ($1) ON CONFLICT DO NOTHING;
-- If a row was inserted → proceed; otherwise it is a duplicate → no-op
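On the application side, the same check can wrap any handler. The sketch below uses an in-memory set in place of the `inbox` table; `Inbox`, `handleOnce`, and `forget` are hypothetical names, and `forget` stands in for the rollback a real database transaction would give you.

```typescript
type InboxMessage = { id: string; type: string };

class Inbox {
  private readonly seen = new Set<string>();

  // Mirrors INSERT ... ON CONFLICT DO NOTHING: true only the first time an id is recorded.
  tryRecord(msgId: string): boolean {
    if (this.seen.has(msgId)) return false;
    this.seen.add(msgId);
    return true;
  }

  // Stands in for a transaction rollback when the handler fails.
  forget(msgId: string): void {
    this.seen.delete(msgId);
  }
}

// Runs the handler at most once per message id; returns false for duplicates.
function handleOnce(inbox: Inbox, msg: InboxMessage, handler: (m: InboxMessage) => void): boolean {
  if (!inbox.tryRecord(msg.id)) return false;
  try {
    handler(msg);
  } catch (err) {
    inbox.forget(msg.id); // let a redelivery try again
    throw err;
  }
  return true;
}

const inbox = new Inbox();
let handled = 0;
const delivery: InboxMessage = { id: "m-1", type: "StockReserved" };
const firstDelivery = handleOnce(inbox, delivery, () => { handled++; });
const redelivery = handleOnce(inbox, delivery, () => { handled++; });
```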
Idempotency
- Identify operations by natural keys (`order_id`, `payment_id`) and use UPSERT so repeats are safe.
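As an illustration, an operation keyed by `payment_id` might behave like the sketch below. `PaymentStore` is a hypothetical in-memory stand-in for a table with a unique key on `payment_id`.

```typescript
type Payment = { paymentId: string; orderId: string; amount: number; status: "captured" };

class PaymentStore {
  private readonly byId = new Map<string, Payment>();

  // Upsert keyed by paymentId: repeating the call leaves one row and returns the same result.
  capture(paymentId: string, orderId: string, amount: number): Payment {
    const existing = this.byId.get(paymentId);
    if (existing) return existing;
    const payment: Payment = { paymentId, orderId, amount, status: "captured" };
    this.byId.set(paymentId, payment);
    return payment;
  }

  totalCaptured(): number {
    let sum = 0;
    for (const p of this.byId.values()) sum += p.amount;
    return sum;
  }
}

const payments = new PaymentStore();
const attempt1 = payments.capture("p-1", "o-1", 42);
const attempt2 = payments.capture("p-1", "o-1", 42); // retry of the same operation
```

After both calls `payments.totalCaptured()` is still 42, so a redelivered command does not charge twice.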
Correlation & causality
- Include `correlation_id` (ties all events in a saga) and `causation_id` (the message that triggered this handler).
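One way to propagate both IDs is a small envelope helper. The names `Envelope`, `startOf`, and `causedBy` are hypothetical, message IDs are assumed to be generated by the caller (for example as UUIDs), and reusing the first message's ID as the correlation ID is one common convention rather than a requirement.

```typescript
type Envelope<T> = {
  messageId: string;
  correlationId: string;      // same for every message in one saga
  causationId: string | null; // id of the message that triggered this one
  type: string;
  data: T;
};

// First message of a saga: it starts the correlation and has no cause.
function startOf<T>(messageId: string, type: string, data: T): Envelope<T> {
  return { messageId, correlationId: messageId, causationId: null, type, data };
}

// Any message produced while handling `cause`.
function causedBy<T>(cause: Envelope<unknown>, messageId: string, type: string, data: T): Envelope<T> {
  return { messageId, correlationId: cause.correlationId, causationId: cause.messageId, type, data };
}

const created = startOf("m-1", "OrderCreated", { orderId: "o-1" });
const reserved = causedBy(created, "m-2", "StockReserved", { orderId: "o-1" });
const captured = causedBy(reserved, "m-3", "PaymentCaptured", { orderId: "o-1" });
```

Every message shares `correlationId` "m-1", while each `causationId` points one hop back, which lets you reconstruct the causal chain from logs.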
Timeouts & retries
- Retries with exponential backoff + jitter. Use DLQ for poison messages; alert and build replay tooling.
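The retry decision can be sketched as a pure function. This uses the "full jitter" variant (a random delay between zero and the exponential ceiling); `RetryPolicy`, `backoffDelay`, and `onFailure` are hypothetical names, and the random source is injectable so the behavior is testable.

```typescript
type RetryPolicy = { baseMs: number; maxMs: number; maxAttempts: number };

// Full jitter: a random delay in [0, min(maxMs, baseMs * 2^attempt)).
// `attempt` is the zero-based index of the attempt that just failed.
function backoffDelay(policy: RetryPolicy, attempt: number, random: () => number = Math.random): number {
  const ceiling = Math.min(policy.maxMs, policy.baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

type Decision = { action: "retry"; delayMs: number } | { action: "dead-letter" };

// Permanent errors and exhausted budgets go to the DLQ instead of retrying forever.
function onFailure(
  policy: RetryPolicy,
  attempt: number,
  permanent: boolean,
  random: () => number = Math.random,
): Decision {
  if (permanent || attempt + 1 >= policy.maxAttempts) return { action: "dead-letter" };
  return { action: "retry", delayMs: backoffDelay(policy, attempt, random) };
}

const policy: RetryPolicy = { baseMs: 100, maxMs: 10_000, maxAttempts: 5 };
const half = () => 0.5; // fixed "random" value for a predictable example

const retryDecision = onFailure(policy, 1, false, half);
const exhausted = onFailure(policy, 4, false, half);
const poison = onFailure(policy, 0, true, half);
```

Classifying errors as permanent (for example, a validation failure) versus transient is up to the handler; the sketch only shows where that flag changes the outcome.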
5) Designing a Saga (state machine)
Define states, steps, timeouts, and compensations.
Example: Order → Reserve stock → Capture payment → Create shipment
| Step | Command/Event | Service | Compensation on failure |
|---|---|---|---|
| Reserve stock | ReserveStock → StockReserved | Inventory | ReleaseStock |
| Capture payment | ChargePayment → PaymentCaptured | Payments | RefundPayment |
| Create shipment | CreateShipment → ShipmentCreated | Shipping | CancelShipment + RefundPayment + ReleaseStock |
Timeouts: if StockReserved not received in, say, 30s, compensate and fail. Persist saga state so it survives restarts.
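The table above can also be kept as data, which makes the compensation order easy to derive. In this sketch `Step` and `compensationsFor` are hypothetical names; the 30s and 60s timeouts are the illustrative values used in this article, not recommendations.

```typescript
type Step = {
  step: string;
  command: string;
  successEvent: string;
  compensation: string;
  timeoutMs: number;
};

const orderSteps: Step[] = [
  { step: "Reserve stock", command: "ReserveStock", successEvent: "StockReserved", compensation: "ReleaseStock", timeoutMs: 30_000 },
  { step: "Capture payment", command: "ChargePayment", successEvent: "PaymentCaptured", compensation: "RefundPayment", timeoutMs: 30_000 },
  { step: "Create shipment", command: "CreateShipment", successEvent: "ShipmentCreated", compensation: "CancelShipment", timeoutMs: 60_000 },
];

// Compensations to send when the step at `failedIndex` fails or times out:
// the failed step itself (it may have partially run) and every earlier step, newest first.
function compensationsFor(steps: Step[], failedIndex: number): string[] {
  return steps
    .slice(0, failedIndex + 1)
    .map((s) => s.compensation)
    .reverse();
}

const afterShipmentFailure = compensationsFor(orderSteps, 2);
const afterStockFailure = compensationsFor(orderSteps, 0);
```

Compensating the failed step as well only works if each compensation is an idempotent no-op when there is nothing to undo, which matters most after a timeout, when the outcome of the step is unknown.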
6) Orchestrator sketch (TypeScript‑ish)
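type SagaState = "new" | "reserving" | "charging" | "shipping" | "done" | "failed";
type Event = { type: string; id: string; correlationId: string; data: any };
type Meta = { correlationId: string };
// Minimal ports the saga depends on; back them with your broker and database.
interface Bus { send(type: string, data: unknown, meta: Meta): Promise<void>; }
interface Repo { saveSaga(saga: OrderSaga): Promise<void>; }

export class OrderSaga {
  state: SagaState = "new";
  orderId = "";
  constructor(readonly id: string, readonly bus: Bus, readonly repo: Repo) {}

  async start(cmd: { orderId: string }) {
    if (this.state !== "new") return; // duplicate start → no-op
    this.orderId = cmd.orderId;
    await this.advance("reserving", "ReserveStock", "reserve", 30_000);
  }

  async on(e: Event) {
    if (e.correlationId !== this.id) return; // not ours
    switch (e.type) {
      case "StockReserved":
        if (this.state !== "reserving") return;
        await this.advance("charging", "ChargePayment", "charge", 30_000);
        break;
      case "PaymentCaptured":
        if (this.state !== "charging") return;
        await this.advance("shipping", "CreateShipment", "ship", 60_000);
        break;
      case "ShipmentCreated":
        if (this.state !== "shipping") return;
        this.state = "done";
        await this.repo.saveSaga(this);
        break;
      case "StockReserveFailed":
      case "PaymentFailed":
      case "ShipmentFailed":
        await this.compensate();
        break;
      default:
        // *_TimedOut events from armTimeout; ignore stale ones armed for an earlier state
        if (e.type.endsWith("TimedOut") && e.data.armedFor === this.state) await this.compensate();
    }
  }

  // Persist the new state and its timeout before sending, so a crash cannot lose progress.
  private async advance(next: SagaState, command: string, timeout: string, ms: number) {
    this.state = next;
    await this.repo.saveSaga(this);
    this.armTimeout(timeout, ms);
    await this.bus.send(command, { orderId: this.orderId }, { correlationId: this.id });
  }

  async compensate() {
    // Nothing to undo, or already finished: ignore late or duplicate failure events.
    if (this.state === "new" || this.state === "done" || this.state === "failed") return;
    const data = { orderId: this.orderId };
    const meta = { correlationId: this.id };
    // Undo in reverse order. The step in flight is compensated too, because after a
    // timeout its outcome is unknown; each compensation must therefore be idempotent
    // and a no-op when there is nothing to undo.
    if (this.state === "shipping") await this.bus.send("CancelShipment", data, meta);
    if (this.state !== "reserving") await this.bus.send("RefundPayment", data, meta);
    await this.bus.send("ReleaseStock", data, meta);
    this.state = "failed";
    await this.repo.saveSaga(this);
  }

  armTimeout(kind: string, ms: number) {
    // persist a wakeup; on expiry, publish a *_TimedOut event with data.armedFor = the
    // state at arming time (a hypothetical field), which on() checks above
  }
}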
Notes
- Persist saga state and resume on restart.
- Commands and compensations must be idempotent.
7) Choreography example (topics)
orders → OrderCreated{order_id}
inventory listens:
on OrderCreated → try ReserveStock → emit StockReserved/StockReserveFailed
payments listens:
on StockReserved → try ChargePayment → emit PaymentCaptured/PaymentFailed
shipping listens:
on PaymentCaptured → CreateShipment → emit ShipmentCreated/ShipmentFailed
orders listens:
on ShipmentCreated → mark Order Completed
on PaymentFailed/StockReserveFailed → mark Order Failed
- Add timeouts via a scheduler that emits `StockReserveTimedOut` if no result in time.
- Keep topics per domain and keys per aggregate (e.g., `order_id`) for per‑key ordering.
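Per-key ordering is easier to rely on when each event carries a sequence number the consumer can check. The sketch below assumes sequences start at 1 for every key; `Sequenced` and `SequenceGate` are hypothetical names, and what to do on a gap (buffer, refetch, or park the message) is left to the consumer.

```typescript
type Sequenced = { key: string; seq: number; type: string };

class SequenceGate {
  private readonly lastSeen = new Map<string, number>();

  // "apply" for the next expected event of a key, "duplicate" for one already seen,
  // "gap" when an earlier event for that key has not arrived yet.
  check(e: Sequenced): "apply" | "duplicate" | "gap" {
    const last = this.lastSeen.get(e.key) ?? 0;
    if (e.seq <= last) return "duplicate";
    if (e.seq > last + 1) return "gap";
    this.lastSeen.set(e.key, e.seq);
    return "apply";
  }
}

const gate = new SequenceGate();
const incoming: Sequenced[] = [
  { key: "o-1", seq: 1, type: "OrderCreated" },
  { key: "o-1", seq: 3, type: "PaymentCaptured" }, // arrives before seq 2
  { key: "o-1", seq: 2, type: "StockReserved" },
  { key: "o-1", seq: 2, type: "StockReserved" },   // redelivery
  { key: "o-2", seq: 1, type: "OrderCreated" },    // other keys are independent
];
const results = incoming.map((e) => gate.check(e));
```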
8) Read models & UX
- Users hate “pending forever.” Show immediate acceptance and a pending state while the saga runs.
- Build query/read models (CQRS) that join data via events into denormalized tables for fast UIs.
- For admin ops, expose a saga view: state, last event, retries, correlation ID.
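A read model for the pending UX can be a simple projection over the saga's events. This is a sketch with an in-memory map standing in for a denormalized table; `OrderView`, `project`, and the `trackingCode` field are assumptions made for the example.

```typescript
type OrderViewEvent =
  | { type: "OrderCreated"; orderId: string; total: number }
  | { type: "StockReserved"; orderId: string }
  | { type: "PaymentCaptured"; orderId: string }
  | { type: "ShipmentCreated"; orderId: string; trackingCode: string }
  | { type: "PaymentFailed"; orderId: string };

type OrderView = {
  orderId: string;
  total: number;
  status: "pending" | "completed" | "failed";
  lastEvent: string;
  trackingCode: string | null;
};

// Applies one event to the denormalized view the UI reads from.
function project(views: Map<string, OrderView>, e: OrderViewEvent): void {
  if (e.type === "OrderCreated") {
    if (!views.has(e.orderId)) {
      views.set(e.orderId, { orderId: e.orderId, total: e.total, status: "pending", lastEvent: e.type, trackingCode: null });
    }
    return;
  }
  const view = views.get(e.orderId);
  if (!view) return; // event for an unknown order: ignored in this sketch
  view.lastEvent = e.type;
  if (e.type === "ShipmentCreated") {
    view.status = "completed";
    view.trackingCode = e.trackingCode;
  }
  if (e.type === "PaymentFailed") view.status = "failed";
}

const views = new Map<string, OrderView>();
project(views, { type: "OrderCreated", orderId: "o-1", total: 42 });
project(views, { type: "StockReserved", orderId: "o-1" });
const whilePending = { ...views.get("o-1") }; // what the UI shows mid-saga
project(views, { type: "PaymentCaptured", orderId: "o-1" });
project(views, { type: "ShipmentCreated", orderId: "o-1", trackingCode: "T-123" });
```

The UI reads only from `views`, so it can show "pending" with the last completed step immediately, without querying each service.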
9) Observability
- Tracing: propagate `trace_id`, `correlation_id`, and include `causation_id` on every message. Spans per step with events: sent, received, retry, timeout, compensation.
- Metrics: `saga_started_total`, `saga_completed_total`, `saga_failed_total`, per‑step latency histograms, retries, DLQ size.
- Logging: structured logs with `saga`, `order_id`, `event`, `attempt`, `state`.
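As a rough sketch of the counters and log fields listed above: `Counters` and `logLine` are hypothetical stand-ins for whatever metrics and logging libraries you already use, and the field values are made up.

```typescript
type SagaLogEntry = {
  saga: string;
  order_id: string;
  event: string;
  attempt: number;
  state: string;
  correlation_id: string;
  causation_id: string | null;
};

// Minimal counter registry; a real system would export these to its metrics backend.
class Counters {
  private readonly values = new Map<string, number>();
  inc(metric: string): void {
    this.values.set(metric, this.get(metric) + 1);
  }
  get(metric: string): number {
    return this.values.get(metric) ?? 0;
  }
}

// One JSON object per line keeps logs searchable by correlation_id.
function logLine(entry: SagaLogEntry): string {
  return JSON.stringify(entry);
}

const counters = new Counters();
counters.inc("saga_started_total");
counters.inc("saga_started_total");
counters.inc("saga_failed_total");

const logged = logLine({
  saga: "OrderSaga",
  order_id: "o-1",
  event: "PaymentFailed",
  attempt: 1,
  state: "failed",
  correlation_id: "m-1",
  causation_id: "m-3",
});
```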
10) Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| Using events to tell others what to do | Tight coupling creeps back | Use commands for requests; events for facts |
| No outbox/inbox | Drops/dupes cause chaos | Add outbox (emit) + inbox (dedupe) |
| Global ordering assumptions | Doesn’t exist | Order is per key; include sequence/version |
| No compensations | Irreversible side‑effects | Design undo/mitigate steps upfront |
| Orchestrators that keep all state in memory | Lost progress on crash | Persist saga state; resume on restart |
| Chatty events with breaking schema | Subscriber breakage | Version events; use additive changes |
| Infinite retries on permanent errors | System thrash | Backoff + jitter, DLQ + alerting |
| Hard deletes of events | Lost audit/replay | Retain events (compact only when safe) |
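For the "version events; use additive changes" fix, one option is to upcast older versions on read so handlers only deal with the newest shape. The `version` field, the added `currency` field, and its "USD" default are all assumptions made for this sketch.

```typescript
type OrderCreatedV1 = { type: "OrderCreated"; version: 1; orderId: string; total: number };
// V2 only adds a field, so V1 consumers that ignore unknown fields keep working.
type OrderCreatedV2 = { type: "OrderCreated"; version: 2; orderId: string; total: number; currency: string };

// Upcast any known version to the newest shape before handling it.
function upcast(e: OrderCreatedV1 | OrderCreatedV2): OrderCreatedV2 {
  if (e.version === 2) return e;
  return { ...e, version: 2, currency: "USD" }; // assumed default for events that predate the field
}

const legacy: OrderCreatedV1 = { type: "OrderCreated", version: 1, orderId: "o-1", total: 42 };
const upgraded = upcast(legacy);
const alreadyNew = upcast({ type: "OrderCreated", version: 2, orderId: "o-2", total: 10, currency: "EUR" });
```

Removing or renaming a field is a breaking change; in that case publish a new event type or version alongside the old one until subscribers have migrated.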
Quick checklist
- [ ] Distinguish commands (requests) vs events (facts).
- [ ] Pick orchestration or choreography per workflow; document steps, timeouts, compensations.
- [ ] Implement outbox/inbox, idempotency, correlation/causation IDs.
- [ ] Key streams by aggregate ID; include version/sequence.
- [ ] Add DLQ, backoff, and saga dashboards.
- [ ] Version events and publish schemas/contracts.
One‑minute adoption plan
- Map one real workflow (e.g., order‑to‑cash) into steps, success criteria, compensations, and timeouts.
- Implement a minimal orchestrator or choreography with outbox/inbox and idempotent handlers.
- Add correlation IDs end‑to‑end and basic metrics/tracing.
- Provide a pending UX and a saga status endpoint.
- Iterate: measure failures/timeouts, tune retries, and refine compensations.