TL;DR
- Circuit breaker: stop calling a failing dependency for a short time. States: closed → open → half‑open. Trip when error rate or timeouts spike; probe a few requests before closing.
- Bulkhead: limit concurrency per dependency so one slow/pinned service doesn’t drown your threads. Use semaphores/queues with back‑pressure and timeouts.
- Pair with timeouts + retries (with jitter) and degraded fallbacks (cache, stale data, partial page/API). Instrument attempts, open state, throttles.
1) Why these exist (the failure cascade)
When an upstream slows or fails: requests pile up → threads/CPU get stuck → queue growth → everything slows → your whole app tips over.
Circuit breakers make failing calls cheap (fail fast). Bulkheads cap how many calls can be in flight so other features keep working.
2) Circuit breaker basics
States
- Closed: normal; count successes/failures in a sliding window.
- Open: fail fast without calling the dependency for openDuration (cool‑down).
- Half‑open: allow a small number of probes; if they succeed → close; if not → open again.
Trip criteria (pick one or combine)
- Failure rate ≥ X% over last N calls (e.g., ≥50% over 50 calls).
- Slow call rate ≥ Y% (timeouts > threshold).
- Consecutive failures ≥ K (quick reaction for rare paths).
Reset: after openDuration, permit M trial calls (half‑open).
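If you want to combine all three trip criteria in one place, the decision looks roughly like the sketch below; the names and thresholds are illustrative, not taken from any particular library.

```ts
type WindowStats = {
  total: number;               // calls observed in the sliding window
  failures: number;            // errors + timeouts
  slow: number;                // calls slower than the slow-call threshold
  consecutiveFailures: number; // current failure streak
};

function shouldTrip(
  s: WindowStats,
  opts = { minCalls: 20, failureRate: 0.5, slowRate: 0.5, consecutive: 10 }
): boolean {
  if (s.consecutiveFailures >= opts.consecutive) return true; // quick reaction for rare paths
  if (s.total < opts.minCalls) return false;                  // avoid noisy % math on tiny samples
  return (
    s.failures / s.total >= opts.failureRate ||
    s.slow / s.total >= opts.slowRate
  );
}
```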
3) Bulkheads (bounded concurrency)
- Per dependency, set a maximum number of concurrent calls (e.g., 50). Extra calls queue (bounded) or fail fast with 429/503.
- Use separate pools per hot dependency (DB, cache, payments API).
- For CPU‑bound pools, size work queues by cores and latency; for I/O, size a semaphore to the upstream's limits.
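A tiny sketch of "separate pools per hot dependency", using the same p-limit semaphore as recipe A below; the dependency names, limits, and `callPaymentsApi` helper are illustrative.

```ts
import pLimit from "p-limit";

// One bulkhead per hot dependency: a pile-up in payments can't consume
// the concurrency reserved for the DB or cache.
const pools = {
  db: pLimit(100),
  cache: pLimit(200),
  payments: pLimit(20),
};

declare function callPaymentsApi(req: unknown): Promise<unknown>; // hypothetical client

// Callers pick their pool; extra payments calls wait here, not in the DB pool.
const chargeCard = (req: unknown) => pools.payments(() => callPaymentsApi(req));
```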
4) Recipes (drop‑in snippets)
A) TypeScript (simple bulkhead + breaker)
// npm i p-limit
import pLimit from "p-limit";

type State = "closed" | "open" | "half";

export function breaker({ failureRate = 0.5, window = 50, openMs = 10_000, halfProbes = 5 } = {}) {
  let state: State = "closed";
  let openedAt = 0;
  let outcomes: boolean[] = []; // true=ok, false=fail/timeout
  let probesLeft = 0;

  function record(ok: boolean) {
    outcomes.push(ok);
    if (outcomes.length > window) outcomes.shift();
    const fails = outcomes.filter(x => !x).length;
    const rate = outcomes.length ? fails / outcomes.length : 0;
    if (state === "closed" && outcomes.length >= Math.min(window, 10) && rate >= failureRate) {
      state = "open"; openedAt = Date.now(); probesLeft = halfProbes;
    } else if (state === "half") {
      probesLeft -= 1;
      if (!ok) { state = "open"; openedAt = Date.now(); probesLeft = halfProbes; }
      else if (probesLeft <= 0) { state = "closed"; outcomes = []; }
    }
  }

  async function exec<T>(fn: () => Promise<T>): Promise<T> {
    // State gate
    if (state === "open") {
      if (Date.now() - openedAt < openMs) throw new Error("circuit_open");
      state = "half"; // try probing
    }
    try {
      const res = await fn();
      record(true);
      return res;
    } catch (e) {
      record(false);
      throw e;
    }
  }

  return { exec, get state() { return state; } };
}

// Bulkhead: limit concurrency
export function bulkhead(limitN: number) {
  const limit = pLimit(limitN);
  return <T>(fn: () => Promise<T>) => limit(fn);
}

// Usage
const bhFetch = bulkhead(50);
const brk = breaker({ failureRate: 0.5, window: 100, openMs: 15_000, halfProbes: 3 });

async function getUser(id: string) {
  return brk.exec(() =>
    bhFetch(() => fetch(`https://api.example.com/users/${id}`, { signal: AbortSignal.timeout(1200) }))
      .then(r => { if (!r.ok) throw new Error(`status_${r.status}`); return r.json(); })
  );
}
B) Go (semaphore bulkhead + gobreaker)
import (
    "context"
    "errors"
    "net/http"
    "time"

    gb "github.com/sony/gobreaker"
)

// Bulkhead via buffered channel
type Bulkhead struct{ sem chan struct{} }

func NewBulkhead(n int) *Bulkhead { return &Bulkhead{sem: make(chan struct{}, n)} }

func (b *Bulkhead) Do(ctx context.Context, f func() error) error {
    select {
    case b.sem <- struct{}{}:
        defer func() { <-b.sem }()
        return f()
    case <-ctx.Done():
        return ctx.Err()
    }
}

var st = gb.Settings{
    Name:     "users-api",
    Interval: 30 * time.Second, // rolling window reset
    Timeout:  10 * time.Second, // open state duration
    ReadyToTrip: func(counts gb.Counts) bool {
        // trip on >=50% failures once we have at least 20 calls
        total := counts.Requests
        if total < 20 {
            return false
        }
        failRate := float64(counts.TotalFailures) / float64(total)
        return failRate >= 0.5
    },
}

var cb = gb.NewCircuitBreaker(st)
var bh = NewBulkhead(50)

func getUser(ctx context.Context, id string) (*http.Response, error) {
    ctx, cancel := context.WithTimeout(ctx, 1500*time.Millisecond)
    defer cancel()
    var resp *http.Response
    err := bh.Do(ctx, func() error {
        _, err := cb.Execute(func() (any, error) {
            req, _ := http.NewRequestWithContext(ctx, "GET", "https://api/users/"+id, nil)
            r, e := http.DefaultClient.Do(req)
            if e == nil && r.StatusCode >= 500 {
                e = errors.New("upstream_5xx") // count 5xx as breaker failures
            }
            resp = r
            return nil, e
        })
        return err
    })
    return resp, err
}
C) Java/Resilience4j
CircuitBreakerConfig cbConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                              // trip at >=50% failures
    .slowCallDurationThreshold(Duration.ofMillis(1000))
    .slowCallRateThreshold(50)                             // timeouts/slow calls count too
    .waitDurationInOpenState(Duration.ofSeconds(15))
    .permittedNumberOfCallsInHalfOpenState(3)
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(100)
    .minimumNumberOfCalls(20)
    .build();

BulkheadConfig bhConfig = BulkheadConfig.custom()
    .maxConcurrentCalls(50)
    .maxWaitDuration(Duration.ofMillis(0))                 // fail fast
    .build();

var cb = CircuitBreaker.of("users", cbConfig);
var bh = Bulkhead.of("users", bhConfig);

// Resilience4j's TimeLimiter applies to async calls (CompletionStage/Future);
// for a plain Supplier like this, enforce the ~1.2s timeout in the HTTP client itself.
Supplier<String> call = () -> httpGet("/users/123");
String body = Decorators.ofSupplier(call)
    .withBulkhead(bh)
    .withCircuitBreaker(cb)
    .withFallback(ex -> cacheGetOrDefault())
    .get();
D) .NET (Polly, policy wrap)
// Outermost → innermost: overall timeout, then breaker, then bulkhead.
var timeout = Policy.TimeoutAsync<HttpResponseMessage>(TimeSpan.FromSeconds(1.2));
var breaker = Policy<HttpResponseMessage>.Handle<Exception>()
    .OrResult(r => (int)r.StatusCode >= 500)
    .CircuitBreakerAsync(handledEventsAllowedBeforeBreaking: 20, durationOfBreak: TimeSpan.FromSeconds(15));
var bulkhead = Policy.BulkheadAsync<HttpResponseMessage>(maxParallelization: 50, maxQueuingActions: 0);

var policy = Policy.WrapAsync(timeout, breaker, bulkhead);
var res = await policy.ExecuteAsync(ct => httpClient.GetAsync(url, ct), CancellationToken.None);
E) Envoy/Proxy (edge protection)
# Circuit-ish caps (connection/request) + outlier detection (eject bad hosts)
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 2000
      max_pending_requests: 1000
      max_requests: 3000
outlier_detection:
  consecutive_5xx: 5
  interval: 10s
  base_ejection_time: 30s
  max_ejection_percent: 50
retry_policy:
  retry_on: "5xx,reset,connect-failure,refused-stream,unavailable"
  num_retries: 2
  per_try_timeout: 1.2s
5) Tuning numbers (start here)
- Timeouts: downstream timeout < caller timeout; e.g., edge 2.0s, service 1.5s.
- Breaker: failureRate=50%, window=100, minCalls=20, open=15s, halfOpenProbes=3.
- Bulkhead: set maxConcurrent = RPS × p95_latency × safety. Example: 100 RPS × 0.1s × 2 ≈ 20.
- Prefer fail‑fast (no queue) for user‑facing paths; small bounded queues can be OK for background jobs.
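The maxConcurrent formula is just Little's law with padding; a throwaway helper to compute it (names are illustrative):

```ts
// concurrent ≈ arrival rate × latency, padded by a safety factor.
function bulkheadSize(rps: number, p95Seconds: number, safety = 2): number {
  return Math.max(1, Math.ceil(rps * p95Seconds * safety));
}

bulkheadSize(100, 0.1); // ≈ 20, matching the example above
```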
6) Fallbacks & degradation
- Cache/serve stale (stale-if-error), return partial data, or display a friendly “try later”.
- For writes, prefer queue + outbox to avoid user‑visible errors; notify asynchronously.
- Always tag responses with degraded: true in logs/headers/metrics so you can see impact.
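A hedged sketch of a degraded read path: serve stale data from cache on failure and tag the response. `fetchUser` and `cache` are hypothetical stand-ins for your wrapped call and cache client.

```ts
declare function fetchUser(id: string): Promise<unknown>; // breaker + bulkhead live inside this call
declare const cache: {
  get(key: string): Promise<unknown | undefined>;
  set(key: string, value: unknown): Promise<void>;
};

async function getUserDegradable(id: string): Promise<{ user: unknown; degraded: boolean }> {
  try {
    const user = await fetchUser(id);
    await cache.set(`user:${id}`, user);
    return { user, degraded: false };             // fresh data
  } catch {
    const stale = await cache.get(`user:${id}`);  // stale-if-error behaviour
    if (stale !== undefined) return { user: stale, degraded: true };
    throw new Error("dependency_unavailable");    // nothing to degrade to: fail fast
  }
}
```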
7) Observability (make it obvious)
Metrics (per dependency):
- circuit_state{closed|open|half}, failure_rate, slow_rate, tripped_total
- bulkhead_inflight, bulkhead_rejected_total
- request_latency_ms, timeout_total, retry_attempts_total
Logs: include dependency, attempt, circuit_state, open_until.
Tracing: span events on trip, probe allow/deny, and bulkhead reject.
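A sketch of emitting those signals at the call sites, with a hypothetical `metrics` client (the metric names follow the list above):

```ts
declare const metrics: {
  increment(name: string, tags?: Record<string, string>): void;
  gauge(name: string, value: number, tags?: Record<string, string>): void;
  timing(name: string, ms: number, tags?: Record<string, string>): void;
};

function onBreakerTrip(dep: string) {
  metrics.increment("tripped_total", { dep });
  metrics.gauge("circuit_state", 1, { dep, state: "open" }); // 1=open, 0=closed
}

function onBulkheadReject(dep: string) {
  metrics.increment("bulkhead_rejected_total", { dep });
}

function onCallFinished(dep: string, ms: number, timedOut: boolean) {
  metrics.timing("request_latency_ms", ms, { dep });
  if (timedOut) metrics.increment("timeout_total", { dep });
}
```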
8) Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| Breaker trips on tiny traffic | Noisy % math | Require min calls in window (e.g., 20) |
| No half‑open probes | Stays open too long | Allow limited probes before closing |
| Big unbounded queues | Latency explosion | Use bounded queues or fail fast |
| Breaker without timeouts | Slow calls count as success | Count slow/timeout as failures |
| One global pool | Contagion across features | Per‑dependency bulkheads |
| Hidden retries everywhere | Load amplification | Centralize retry policy, cap attempts |
Quick checklist
- [ ] Timeouts set per hop; shorter downstream.
- [ ] Circuit breakers per dependency with min calls, failure/slow thresholds, half‑open probes.
- [ ] Bulkheads: max concurrent + bounded queue (or fail fast).
- [ ] Safe fallbacks for degraded mode; log degraded=true.
- [ ] Metrics/tracing for state, rejects, trips.
- [ ] Test in staging with chaos (latency/5xx) to validate tuning.
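For that last item, a toy fault injector you can wrap around outbound calls in staging; the CHAOS env flag and rates are made up, and real fault-injection tooling works just as well.

```ts
// Injects extra latency and synthetic failures so you can watch breakers trip
// and bulkheads reject before production does it for you.
async function withChaos<T>(
  fn: () => Promise<T>,
  failRate = 0.2,
  extraLatencyMs = 1000
): Promise<T> {
  if (process.env.CHAOS === "1") {
    await new Promise(res => setTimeout(res, Math.random() * extraLatencyMs));
    if (Math.random() < failRate) throw new Error("chaos_injected_failure");
  }
  return fn();
}
```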
One‑minute adoption plan
- Add timeouts and a shared retry helper (2–3 attempts, jitter; a minimal sketch follows this list).
- Wrap each hot dependency with a breaker + bulkhead (start with suggested numbers).
- Add a fallback path (stale cache/partial) and tag responses as degraded.
- Ship metrics & alerts for open circuits and bulkhead rejects.
- Run a staging game day: inject 5xx/1s latency; tune thresholds until UX is stable.
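The shared retry helper from step 1, as a minimal sketch with capped attempts and full jitter (names and defaults are illustrative):

```ts
async function withRetry<T>(fn: () => Promise<T>, attempts = 3, baseMs = 100): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (i === attempts - 1) break;           // out of attempts
      const backoff = baseMs * 2 ** i;         // exponential backoff
      const jitter = Math.random() * backoff;  // full jitter: 0..backoff
      await new Promise(res => setTimeout(res, jitter));
    }
  }
  throw lastErr;
}

// Usage: keep retries outside the breaker so they stop as soon as the circuit opens, e.g.
// await withRetry(() => brk.exec(() => bhFetch(...)));
```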