caduh

Events, Commands, and Sagas — coordinate cross‑service work without tight coupling

6 min read

Commands say 'do X', events say 'X happened'. Use sagas (orchestration or choreography) for long‑running, cross‑service workflows with compensations, idempotency, and reliable messaging.

TL;DR

  • Commands are imperative: do X. Events are facts: X happened. Keep them distinct.
  • For cross‑service, long‑running workflows, use a Saga with orchestration (central controller) or choreography (event‑driven).
  • Make every step idempotent and compensatable. Use outbox/inbox, correlation IDs, retries + backoff, and DLQs.
  • Event ordering is per key at best; encode version/sequence and design for at‑least‑once delivery.
  • Don’t adopt full event sourcing unless you need an audit log and rebuildable state; integration events are enough for most systems.

1) Commands vs Events (use the right tool)

| Concept | What it is | Who consumes it | When emitted | Naming tips | |---|---|---|---|---| | Command | A request to perform an action (“CreateOrder”) | Exactly one handler (the owner of the capability) | Incoming at the boundary | VerbImperative (CreateOrder, ReserveStock) | | Event | A fact about something that happened (“OrderCreated”) | 0..N subscribers | After state changes commit | NounPastTense (OrderCreated, StockReserved) |

Rule: a service owns its domain; it handles commands for that domain and emits events after durable changes.


2) Event sourcing vs “events for integration”

  • Integration events (most common): you store state normally (tables/documents) and publish events to inform other services.
  • Event sourcing (specialized): the event stream is the source of truth; state is rebuilt by folding events. Brings great auditability and temporal queries, but adds complexity (migrations, replay, large streams).

Start with integration events + outbox. Adopt event sourcing only when you specifically need its superpowers.


3) Sagas — long‑running business workflows

A Saga coordinates a set of local transactions across services, each with a compensation to undo/mitigate if later steps fail.

Two styles:

Orchestration (central brain)

Client → OrderService
          └── Orchestrator (process manager)
              1) Send ReserveStock
              2) On StockReserved → Send ChargePayment
              3) On PaymentCaptured → Send CreateShipment
              4) On ShipmentCreated → Mark Order Completed
              X) On failure → run compensations in reverse

Choreography (event‑driven dance)

OrderCreated ─► Inventory reserves
   │                        └─► StockReserved
   ▼                                │
 Payment service listens            ▼
 PaymentCaptured ◄──────────────── ShipmentCreated
   │
   └─► OrderCompleted
  • Orchestration gives visibility and control (timeouts, retries) but adds a central component.
  • Choreography is loosely coupled but can become hard to reason about as steps grow.

4) Reliability patterns you must use

Outbox (publish events reliably)

-- In the same DB transaction as your state change
BEGIN;
  INSERT INTO orders(id, status, total) VALUES ($1, 'created', $2);
  INSERT INTO outbox(id, topic, key, payload) VALUES ($2, 'OrderCreated', $1, $3);
COMMIT;
-- A relay publishes outbox rows and marks them sent

Inbox (dedupe)

CREATE TABLE inbox (msg_id text PRIMARY KEY, processed_at timestamptz default now());
-- On handler start:
INSERT INTO inbox (msg_id) VALUES ($msg_id) ON CONFLICT DO NOTHING;
-- If inserted → proceed; else duplicate → no‑op

Idempotency

  • Identify operations by natural keys (order_id, payment_id) and use UPSERT so repeats are safe.

Correlation & causality

  • Include correlation_id (ties all events in a saga) and causation_id (the message that triggered this handler).

Timeouts & retries

  • Retries with exponential backoff + jitter. Use DLQ for poison messages; alert and build replay tooling.

5) Designing a Saga (state machine)

Define states, steps, timeouts, and compensations.

Example: Order → Reserve stock → Capture payment → Create shipment

| Step | Command/Event | Service | Compensation on failure | |---|---|---|---| | Reserve stock | ReserveStockStockReserved | Inventory | ReleaseStock | | Capture payment | ChargePaymentPaymentCaptured | Payments | RefundPayment | | Create shipment | CreateShipmentShipmentCreated | Shipping | CancelShipment + RefundPayment + ReleaseStock |

Timeouts: if StockReserved not received in, say, 30s, compensate and fail. Persist saga state so it survives restarts.


6) Orchestrator sketch (TypeScript‑ish)

type SagaState = "new"|"reserving"|"charging"|"shipping"|"done"|"failed";
type Event = { type:string, id:string, correlationId:string, data:any };

export class OrderSaga {
  state: SagaState = "new";
  constructor(readonly id: string, readonly bus: Bus, readonly repo: Repo) {}

  async start(cmd: { orderId: string }) {
    this.state = "reserving";
    await this.bus.send("ReserveStock", { orderId: cmd.orderId }, { correlationId: this.id });
    await this.repo.saveSaga(this);
    this.armTimeout("reserve", 30_000);
  }

  async on(e: Event) {
    if (e.correlationId !== this.id) return; // ignore
    switch (e.type) {
      case "StockReserved":
        if (this.state !== "reserving") return;
        this.state = "charging";
        await this.bus.send("ChargePayment", { orderId: e.data.orderId }, { correlationId: this.id });
        this.armTimeout("charge", 30_000);
        break;
      case "PaymentCaptured":
        if (this.state !== "charging") return;
        this.state = "shipping";
        await this.bus.send("CreateShipment", { orderId: e.data.orderId }, { correlationId: this.id });
        this.armTimeout("ship", 60_000);
        break;
      case "ShipmentCreated":
        this.state = "done";
        await this.repo.saveSaga(this);
        break;
      case "StockReserveFailed":
      case "PaymentFailed":
      case "ShipmentFailed":
        await this.compensate(e);
        break;
    }
  }

  async compensate(e: Event) {
    // Orchestrated compensations; each should be idempotent
    if (this.state === "shipping") await this.bus.send("CancelShipment", { orderId: this.id });
    if (this.state !== "reserving") await this.bus.send("RefundPayment", { orderId: this.id });
    await this.bus.send("ReleaseStock", { orderId: this.id });
    this.state = "failed";
    await this.repo.saveSaga(this);
  }

  armTimeout(kind: string, ms: number) {
    // persist a wakeup; on expiry, publish a *_TimedOut event the saga listens for
  }
}

Notes

  • Persist saga state and resume on restart.
  • Commands and compensations must be idempotent.

7) Choreography example (topics)

orders → OrderCreated{order_id}
inventory listens:
  on OrderCreated → try ReserveStock → emit StockReserved/StockReserveFailed
payments listens:
  on StockReserved → try ChargePayment → emit PaymentCaptured/PaymentFailed
shipping listens:
  on PaymentCaptured → CreateShipment → emit ShipmentCreated/ShipmentFailed
orders listens:
  on ShipmentCreated → mark Order Completed
  on PaymentFailed/StockReserveFailed → mark Order Failed
  • Add timeouts via a scheduler that emits StockReserveTimedOut if no result in time.
  • Keep topics per domain and keys per aggregate (e.g., order_id) for per‑key ordering.

8) Read models & UX

  • Users hate “pending forever.” Show immediate acceptance and a pending state while the saga runs.
  • Build query/read models (CQRS) that join data via events into denormalized tables for fast UIs.
  • For admin ops, expose a saga view: state, last event, retries, correlation ID.

9) Observability

  • Tracing: propagate trace_id, correlation_id, and include causation_id on every message. Spans per step with events: sent, received, retry, timeout, compensation.
  • Metrics: saga_started_total, saga_completed_total, saga_failed_total, per‑step latency histograms, retries, DLQ size.
  • Logging: structured logs with saga, order_id, event, attempt, state.

10) Pitfalls & fast fixes

| Pitfall | Why it hurts | Fix | |---|---|---| | Using events to tell others what to do | Tight coupling creeps back | Use commands for requests; events for facts | | No outbox/inbox | Drops/dupes cause chaos | Add outbox (emit) + inbox (dedupe) | | Global ordering assumptions | Doesn’t exist | Order is per key; include sequence/version | | No compensations | Irreversible side‑effects | Design undo/mitigate steps upfront | | Orchestrators that keep all state in memory | Lost progress on crash | Persist saga state; resume on restart | | Chatty events with breaking schema | Subscriber breakage | Version events; use additive changes | | Infinite retries on permanent errors | System thrash | Backoff + jitter, DLQ + alerting | | Hard deletes of events | Lost audit/replay | Retain events (compact only when safe) |


Quick checklist

  • [ ] Distinguish commands (requests) vs events (facts).
  • [ ] Pick orchestration or choreography per workflow; document steps, timeouts, compensations.
  • [ ] Implement outbox/inbox, idempotency, correlation/causation IDs.
  • [ ] Key streams by aggregate ID; include version/sequence.
  • [ ] Add DLQ, backoff, and saga dashboards.
  • [ ] Version events and publish schemas/contracts.

One‑minute adoption plan

  1. Map one real workflow (e.g., order‑to‑cash) into steps, success criteria, compensations, and timeouts.
  2. Implement a minimal orchestrator or choreography with outbox/inbox and idempotent handlers.
  3. Add correlation IDs end‑to‑end and basic metrics/tracing.
  4. Provide a pending UX and a saga status endpoint.
  5. Iterate: measure failures/timeouts, tune retries, and refine compensations.