TL;DR
- Circuit breaker: stop calling a failing dependency for a short time. States: closed → open → half‑open. Trip when error rate or timeouts spike; probe a few requests before closing.
- Bulkhead: limit concurrency per dependency so one slow/pinned service doesn’t drown your threads. Use semaphores/queues with back‑pressure and timeouts.
- Pair with timeouts + retries (with jitter) and degraded fallbacks (cache, stale data, partial page/API). Instrument attempts, open state, throttles.
1) Why these exist (the failure cascade)
When an upstream slows or fails: requests pile up → threads/CPU get stuck → queue growth → everything slows → your whole app tips over.
Circuit breakers make failing calls cheap (fail fast). Bulkheads cap how many calls can be in flight so other features keep working.
2) Circuit breaker basics
States
- Closed: normal; count successes/failures in a sliding window.
- Open: fail fast without calling the dependency for openDuration (cool‑down).
- Half‑open: allow a small number of probes; if they succeed → close; if not → open again.
Trip criteria (pick one or combine)
- Failure rate ≥ X% over last N calls (e.g., ≥50% over 50 calls).
- Slow call rate ≥ Y% (timeouts > threshold).
- Consecutive failures ≥ K (quick reaction for rare paths).
Reset: after openDuration, permit M trial calls (half‑open).
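If you want to combine all three trip criteria in one place, the decision looks roughly like the sketch below; the names and thresholds are illustrative, not taken from any particular library.

```ts
type WindowStats = {
  total: number;               // calls observed in the sliding window
  failures: number;            // errors + timeouts
  slow: number;                // calls slower than the slow-call threshold
  consecutiveFailures: number; // current failure streak
};

function shouldTrip(
  s: WindowStats,
  opts = { minCalls: 20, failureRate: 0.5, slowRate: 0.5, consecutive: 10 }
): boolean {
  if (s.consecutiveFailures >= opts.consecutive) return true; // quick reaction for rare paths
  if (s.total < opts.minCalls) return false;                  // avoid noisy % math on tiny samples
  return (
    s.failures / s.total >= opts.failureRate ||
    s.slow / s.total >= opts.slowRate
  );
}
```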
3) Bulkheads (bounded concurrency)
- Per dependency, set a maximum number of concurrent calls (e.g., 50). Extra calls queue (bounded) or fail fast with 429/503.
- Use separate pools per hot dependency (DB, cache, payments API).
- For CPU‑bound pools, size work queues by cores and latency; for I/O, size a semaphore to the upstream's limits.
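A tiny sketch of "separate pools per hot dependency", using the same p-limit semaphore as recipe A below; the dependency names, limits, and `callPaymentsApi` helper are illustrative.

```ts
import pLimit from "p-limit";

// One bulkhead per hot dependency: a pile-up in payments can't consume
// the concurrency reserved for the DB or cache.
const pools = {
  db: pLimit(100),
  cache: pLimit(200),
  payments: pLimit(20),
};

declare function callPaymentsApi(req: unknown): Promise<unknown>; // hypothetical client

// Callers pick their pool; extra payments calls wait here, not in the DB pool.
const chargeCard = (req: unknown) => pools.payments(() => callPaymentsApi(req));
```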
4) Recipes (drop‑in snippets)
A) TypeScript (simple bulkhead + breaker)
// npm i p-limit
import pLimit from "p-limit";

type State = "closed" | "open" | "half";

export function breaker({ failureRate = 0.5, window = 50, openMs = 10_000, halfProbes = 5 } = {}) {
  let state: State = "closed";
  let openedAt = 0;
  let outcomes: boolean[] = []; // true=ok, false=fail/timeout
  let probesLeft = 0;

  function record(ok: boolean) {
    outcomes.push(ok);
    if (outcomes.length > window) outcomes.shift();
    const fails = outcomes.filter(x => !x).length;
    const rate = outcomes.length ? fails / outcomes.length : 0;
    if (state === "closed" && outcomes.length >= Math.min(window, 10) && rate >= failureRate) {
      state = "open"; openedAt = Date.now(); probesLeft = halfProbes;
    } else if (state === "half") {
      probesLeft -= 1;
      if (!ok) { state = "open"; openedAt = Date.now(); probesLeft = halfProbes; }
      else if (probesLeft <= 0) { state = "closed"; outcomes = []; }
    }
  }

  async function exec<T>(fn: () => Promise<T>): Promise<T> {
    // State gate
    if (state === "open") {
      if (Date.now() - openedAt < openMs) throw new Error("circuit_open");
      state = "half"; // try probing
    }
    try {
      const res = await fn();
      record(true);
      return res;
    } catch (e) {
      record(false);
      throw e;
    }
  }

  return { exec, get state() { return state; } };
}

// Bulkhead: limit concurrency
export function bulkhead(limitN: number) {
  const limit = pLimit(limitN);
  return <T>(fn: () => Promise<T>) => limit(fn);
}

// Usage
const bhFetch = bulkhead(50);
const brk = breaker({ failureRate: 0.5, window: 100, openMs: 15_000, halfProbes: 3 });

async function getUser(id: string) {
  return brk.exec(() =>
    bhFetch(() => fetch(`https://api.example.com/users/${id}`, { signal: AbortSignal.timeout(1200) }))
      .then(r => { if (!r.ok) throw new Error(`status_${r.status}`); return r.json(); })
  );
}
B) Go (semaphore bulkhead + gobreaker)
import (
    "context"
    "errors"
    "net/http"
    "time"

    gb "github.com/sony/gobreaker"
)

// Bulkhead via buffered channel
type Bulkhead struct{ sem chan struct{} }

func NewBulkhead(n int) *Bulkhead { return &Bulkhead{sem: make(chan struct{}, n)} }

func (b *Bulkhead) Do(ctx context.Context, f func() error) error {
    select {
    case b.sem <- struct{}{}:
        defer func() { <-b.sem }()
        return f()
    case <-ctx.Done():
        return ctx.Err()
    }
}

var st = gb.Settings{
    Name:     "users-api",
    Interval: 30 * time.Second, // rolling window reset
    Timeout:  10 * time.Second, // open state duration
    ReadyToTrip: func(counts gb.Counts) bool {
        // trip on >=50% failures once we have at least 20 calls
        total := counts.Requests
        if total < 20 {
            return false
        }
        failRate := float64(counts.TotalFailures) / float64(total)
        return failRate >= 0.5
    },
}

var cb = gb.NewCircuitBreaker(st)
var bh = NewBulkhead(50)

func getUser(ctx context.Context, id string) (*http.Response, error) {
    ctx, cancel := context.WithTimeout(ctx, 1500*time.Millisecond)
    defer cancel()
    var resp *http.Response
    err := bh.Do(ctx, func() error {
        _, err := cb.Execute(func() (any, error) {
            req, _ := http.NewRequestWithContext(ctx, "GET", "https://api/users/"+id, nil)
            r, e := http.DefaultClient.Do(req)
            if e == nil && r.StatusCode >= 500 {
                e = errors.New("upstream_5xx") // count 5xx as breaker failures
            }
            resp = r
            return nil, e
        })
        return err
    })
    return resp, err
}
C) Java/Resilience4j
CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                              // trip at >=50% failures
    .slowCallDurationThreshold(Duration.ofMillis(1000))
    .slowCallRateThreshold(50)                             // timeouts/slow calls count too
    .waitDurationInOpenState(Duration.ofSeconds(15))
    .permittedNumberOfCallsInHalfOpenState(3)
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)
    .minimumNumberOfCalls(20)
    .build();

BulkheadConfig bhConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(50)
    .maxWaitDuration(Duration.ofMillis(0))                 // fail fast
    .build();

var cb = CircuitBreaker.of("users", cbConfig);
var bh = Bulkhead.of("users", bhConfig);

// Resilience4j's TimeLimiter applies to async calls (CompletionStage/Future);
// for a plain Supplier like this, enforce the ~1.2s timeout in the HTTP client itself.
Supplier<String> call = () -> httpGet("/users/123");
String body = Decorators.ofSupplier(call)
    .withBulkhead(bh)
    .withCircuitBreaker(cb)
    .withFallback(ex -> cacheGetOrDefault())
    .get();
D) .NET (Polly, policy wrap)
// Outermost → innermost: overall timeout, then breaker, then bulkhead.
var timeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(1.2));
var breaker = Policy<HttpResponseMessage>.Handle<Exception>()
    .OrResult(r => (int)r.StatusCode >= 500)
    .CircuitBreakerAsync(handledEventsAllowedBeforeBreaking: 20, durationOfBreak: TimeSpan.FromSeconds(15));
var bulkhead = Policy.BulkheadAsync<HttpResponseMessage>(maxParallelization: 50, maxQueuingActions: 0);

var policy = Policy.WrapAsync(timeout, breaker, bulkhead);
var res = await policy.ExecuteAsync(ct => httpClient.GetAsync(url, ct), CancellationToken.None);
E) Envoy/Proxy (edge protection)
# Circuit-ish caps (connection/request) + outlier detection (eject bad hosts)
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 2000
      max_pending_requests: 1000
      max_requests: 3000
outlier_detection:
  consecutive_5xx: 5
  interval: 10s
  base_ejection_time: 30s
  max_ejection_percent: 50
retry_policy:
  retry_on: "5xx,reset,connect-failure,refused-stream,unavailable"
  num_retries: 2
  per_try_timeout: 1.2s
5) Tuning numbers (start here)
- Timeouts: downstream timeout < caller timeout; e.g., edge 2.0s, service 1.5s.
- Breaker: failureRate=50%, window=100, minCalls=20, open=15s, halfOpenProbes=3.
- Bulkhead: set maxConcurrent = RPS × p95_latency × safety. Example: 100 RPS × 0.1s × 2 ≈ 20.
- Prefer fail‑fast (no queue) for user‑facing paths; small bounded queues can be OK for background jobs.
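The maxConcurrent formula is just Little's law with padding; a throwaway helper to compute it (names are illustrative):

```ts
// concurrent ≈ arrival rate × latency, padded by a safety factor.
function bulkheadSize(rps: number, p95Seconds: number, safety = 2): number {
  return Math.max(1, Math.ceil(rps * p95Seconds * safety));
}

bulkheadSize(100, 0.1); // ≈ 20, matching the example above
```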
6) Fallbacks & degradation
- Cache/serve stale (stale-if-error), return partial data, or display a friendly “try later”.
- For writes, prefer queue + outbox to avoid user‑visible errors; notify asynchronously.
- Always tag responses with degraded: true in logs/headers/metrics so you can see impact.
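A hedged sketch of a degraded read path: serve stale data from cache on failure and tag the response. `fetchUser` and `cache` are hypothetical stand-ins for your wrapped call and cache client.

```ts
declare function fetchUser(id: string): Promise<unknown>; // breaker + bulkhead live inside this call
declare const cache: {
  get(key: string): Promise<unknown | undefined>;
  set(key: string, value: unknown): Promise<void>;
};

async function getUserDegradable(id: string): Promise<{ user: unknown; degraded: boolean }> {
  try {
    const user = await fetchUser(id);
    await cache.set(`user:${id}`, user);
    return { user, degraded: false };             // fresh data
  } catch {
    const stale = await cache.get(`user:${id}`);  // stale-if-error behaviour
    if (stale !== undefined) return { user: stale, degraded: true };
    throw new Error("dependency_unavailable");    // nothing to degrade to: fail fast
  }
}
```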
7) Observability (make it obvious)
Metrics (per dependency):
- circuit_state{closed|open|half}, failure_rate, slow_rate, tripped_total
- bulkhead_inflight, bulkhead_rejected_total
- request_latency_ms, timeout_total, retry_attempts_total
Logs: include dependency, attempt, circuit_state, open_until.
Tracing: span events on trip, probe allow/deny, and bulkhead reject.
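A sketch of emitting those signals at the call sites, with a hypothetical `metrics` client (the metric names follow the list above):

```ts
declare const metrics: {
  increment(name: string, tags?: Record<string, string>): void;
  gauge(name: string, value: number, tags?: Record<string, string>): void;
  timing(name: string, ms: number, tags?: Record<string, string>): void;
};

function onBreakerTrip(dep: string) {
  metrics.increment("tripped_total", { dep });
  metrics.gauge("circuit_state", 1, { dep, state: "open" }); // 1=open, 0=closed
}

function onBulkheadReject(dep: string) {
  metrics.increment("bulkhead_rejected_total", { dep });
}

function onCallFinished(dep: string, ms: number, timedOut: boolean) {
  metrics.timing("request_latency_ms", ms, { dep });
  if (timedOut) metrics.increment("timeout_total", { dep });
}
```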
8) Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| Breaker trips on tiny traffic | Noisy % math | Require min calls in window (e.g., 20) |
| No half‑open probes | Stays open too long | Allow limited probes before closing |
| Big unbounded queues | Latency explosion | Use bounded queues or fail fast |
| Breaker without timeouts | Slow calls count as success | Count slow/timeout as failures |
| One global pool | Contagion across features | Per‑dependency bulkheads |
| Hidden retries everywhere | Load amplification | Centralize retry policy, cap attempts |
Quick checklist
- [ ] Timeouts set per hop; shorter downstream.
- [ ] Circuit breakers per dependency with min calls, failure/slow thresholds, half‑open probes.
- [ ] Bulkheads: max concurrent + bounded queue (or fail fast).
- [ ] Safe fallbacks for degraded mode; log degraded=true.
- [ ] Metrics/tracing for state, rejects, trips.
- [ ] Test in staging with chaos (latency/5xx) to validate tuning.
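For that last item, a toy fault injector you can wrap around outbound calls in staging; the CHAOS env flag and rates are made up, and real fault-injection tooling works just as well.

```ts
// Injects extra latency and synthetic failures so you can watch breakers trip
// and bulkheads reject before production does it for you.
async function withChaos<T>(
  fn: () => Promise<T>,
  failRate = 0.2,
  extraLatencyMs = 1000
): Promise<T> {
  if (process.env.CHAOS === "1") {
    await new Promise(res => setTimeout(res, Math.random() * extraLatencyMs));
    if (Math.random() < failRate) throw new Error("chaos_injected_failure");
  }
  return fn();
}
```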
One‑minute adoption plan
- Add timeouts and a shared retry helper (2–3 attempts, jitter; a minimal sketch follows this list).
- Wrap each hot dependency with a breaker + bulkhead (start with suggested numbers).
- Add a fallback path (stale cache/partial) and tag responses as degraded.
- Ship metrics & alerts for open circuits and bulkhead rejects.
- Run a staging game day: inject 5xx/1s latency; tune thresholds until UX is stable.
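The shared retry helper from step 1, as a minimal sketch with capped attempts and full jitter (names and defaults are illustrative):

```ts
async function withRetry<T>(fn: () => Promise<T>, attempts = 3, baseMs = 100): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i === attempts - 1) break;           // out of attempts
      const backoff = baseMs * 2 ** i;         // exponential backoff
      const jitter = Math.random() * backoff;  // full jitter: 0..backoff
      await new Promise(res => setTimeout(res, jitter));
    }
  }
  throw lastErr;
}

// Usage: keep retries outside the breaker so they stop as soon as the circuit opens, e.g.
// await withRetry(() => brk.exec(() => bhFetch(...)));
```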