TL;DR
- Give every request a total deadline, and each hop a shorter timeout than its caller.
- Retry only transient failures (timeouts, connection resets, 502/503/504/429). Don’t retry non‑idempotent work unless you implement Idempotency‑Key.
- Use exponential backoff + jitter, cap by time budget (not huge counts), and stop after 2–3 attempts.
- Propagate a trace ID; log attempt, backoff_ms, and cause. Send Retry-After on 429/503.
- Servers must be stricter than clients: shorter timeouts, bounded concurrency, and circuit breakers. Hedged requests only for idempotent reads.
1) The model: deadline → budgets per hop
Client total deadline: 2500ms
├─ Edge/Gateway: per‑try timeout 2000ms
├─ Service A: timeout 1500ms
└─ Database: statement_timeout 800ms
If a hop can’t finish within its budget, fail fast → caller still has time to retry or degrade.
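A minimal sketch of this budget math in TypeScript (the helper name and 50ms safety margin are illustrative, not from any particular library): the edge sets one absolute deadline, and each hop derives its per-try timeout from whatever remains.
// Hypothetical budget helper: the edge sets one absolute deadline (epoch ms),
// and every downstream hop derives its per-try timeout from what's left.
function perTryTimeout(deadlineAt: number, hopCapMs: number, marginMs = 50): number {
  const remaining = deadlineAt - Date.now() - marginMs; // leave the caller room to react
  if (remaining <= 0) throw new Error("deadline_exceeded"); // fail fast, don't start work
  return Math.min(hopCapMs, remaining); // never exceed the hop's own cap
}

// Example: edge sets a 2500ms deadline; Service A caps each try at 1500ms.
const deadlineAt = Date.now() + 2500;
const serviceTimeoutMs = perTryTimeout(deadlineAt, 1500); // ≤ 1500ms, shrinks as time passes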
2) Retry decision table (what, when, how)
| Condition | Safe to retry? | Notes |
|---|---|---|
| Network error / connection reset / TLS handshake error | Yes | Immediate retry with jitter |
| Timeout (no response) | Yes | Caller didn’t observe success; safe for idempotent ops |
| 502 Bad Gateway / 503 Unavailable / 504 Gateway Timeout | Yes | Transient upstream trouble |
| 429 Too Many Requests | Yes | Respect Retry-After (seconds or HTTP date) |
| 408 Request Timeout | Yes | Treat like a network timeout |
| 500 Internal Server Error | Maybe | Only if known transient; otherwise bubble up |
| 4xx (400/401/403/404/422) | No | Client/action issue—don’t retry |
| POST without idempotency key | No | Could double‑charge / double‑create |
| POST with idempotency key | Yes | Server must dedupe by key |
| GET/PUT/DELETE | Usually | Idempotent by spec; still cap attempts |
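The table above notes that Retry-After can carry either delta-seconds or an HTTP date. A small sketch (TypeScript; the function name and fallback value are illustrative) that handles both and clamps the wait to the remaining budget:
// Parse Retry-After as delta-seconds ("120") or an HTTP date; clamp to the budget.
function retryAfterMs(header: string | null, remainingMs: number, fallbackMs = 300): number {
  if (!header) return Math.min(fallbackMs, remainingMs);
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return Math.min(seconds * 1000, remainingMs);
  const at = Date.parse(header); // NaN unless a valid HTTP date
  if (!Number.isNaN(at)) return Math.min(Math.max(0, at - Date.now()), remainingMs);
  return Math.min(fallbackMs, remainingMs);
}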
3) Backoff with jitter (don’t stampede)
Full‑jitter exponential
base = 100ms, factor = 2, cap = 2s
sleep = random(0, min(cap, base * 2^attempt))
Decorrelated jitter (avoids lock‑step once warmed)
sleep = min(cap, random(base, prev_sleep * 3))
Default schedule: try at t=0, then ~200–400ms, optionally ~1s. Stop when the deadline is near.
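Both strategies as a TypeScript sketch (constants mirror the defaults above; the caller is expected to re-check the deadline before sleeping):
// Full jitter: sleep = random(0, min(cap, base * 2^attempt))
function fullJitter(attempt: number, baseMs = 100, capMs = 2000): number {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}

// Decorrelated jitter: sleep = min(cap, random(base, prevSleep * 3)); start with prevSleep = base
function decorrelatedJitter(prevSleepMs: number, baseMs = 100, capMs = 2000): number {
  const upper = Math.max(baseMs, prevSleepMs * 3);
  return Math.min(capMs, baseMs + Math.random() * (upper - baseMs));
}

// Usage: await new Promise(r => setTimeout(r, fullJitter(attempt)));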
4) Idempotency keys (server pattern)
Accept a header like Idempotency-Key: <uuid> for mutating operations.
// Pseudo‑Node: dedupe POST /payments
const key = req.header("Idempotency-Key");
const scope = `${req.tenant}:${req.path}`;
const id = `${scope}:${key}`;
// 1) Atomically reserve the key (unique constraint):
// idempotency(id PRIMARY KEY, status, response, created_at)
try { insert(id, "processing", null); }
catch (e) {
  // Key already exists: look up the stored record and replay its response
  // (or return 409 if the first request is still "processing").
  const row = select(id);
  return send(row.response);
}
// 2) Process once
const result = await chargeCard(req.body);
// 3) Store final response and return it
update(id, "done", serialize(result), now());
return send(result);
TTL the record (e.g., 24–72h). Keep scope narrow (tenant + route) to avoid collisions.
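On the client side, generate the key once per logical operation and reuse it on every retry; a fresh key per attempt defeats the dedupe. A hedged sketch (TypeScript on Node 18+; the endpoint URL and retry constants are made up):
import { randomUUID } from "node:crypto";

async function createPayment(body: unknown, maxAttempts = 3): Promise<Response> {
  const key = randomUUID(); // generated ONCE per logical payment, reused across retries
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch("https://api.example.com/payments", { // illustrative endpoint
        method: "POST",
        headers: { "content-type": "application/json", "Idempotency-Key": key },
        body: JSON.stringify(body),
      });
      // Retry only transient statuses, and only while attempts remain.
      if (![502, 503, 504, 429].includes(res.status) || attempt >= maxAttempts - 1) return res;
    } catch (e) { // network error / reset: safe to retry because the key dedupes server-side
      if (attempt >= maxAttempts - 1) throw e;
    }
    await new Promise(r => setTimeout(r, Math.random() * 200 * 2 ** attempt)); // jittered backoff
  }
}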
5) Language & platform recipes
A) HTTP client (TypeScript + fetch)
type TransientCause = "network" | "timeout" | "502" | "503" | "504" | "429";
function isTransient(e: any): { ok: boolean, cause?: TransientCause } {
  if (e.name === "AbortError") return { ok: true, cause: "timeout" };
  if (!e.response) return { ok: true, cause: "network" };
  const s = e.response?.status;
  if ([502, 503, 504, 429].includes(s)) return { ok: true, cause: String(s) as TransientCause };
  return { ok: false };
}
export async function getWithRetry(url: string, deadlineMs = 2500) {
  const started = Date.now();
  let sleep = 200; // backoff base; doubled after each retried attempt
  for (let attempt = 0; ; attempt++) {
    const remaining = deadlineMs - (Date.now() - started);
    if (remaining <= 0) throw new Error("deadline_exceeded");
    // Per-try timeout: never longer than what is left of the total deadline.
    const ctrl = new AbortController();
    const t = setTimeout(() => ctrl.abort(), Math.min(remaining, 1500));
    try {
      const res = await fetch(url, { signal: ctrl.signal });
      if (res.status === 429) {
        // Honor Retry-After (delta-seconds); fall back to a short pause.
        const ra = parseInt(res.headers.get("retry-after") || "0", 10) * 1000;
        await new Promise(r => setTimeout(r, Math.min(ra || 300, remaining)));
        continue;
      }
      // Treat 502/503/504 as transient and hand them to the catch block below.
      if (!res.ok && [502, 503, 504].includes(res.status)) throw { response: res };
      return res;
    } catch (e: any) {
      const { ok } = isTransient(e);
      if (!ok || attempt >= 2) throw e; // non-transient, or attempt cap reached
      // Full-jitter backoff, clamped to the re-measured remaining budget.
      const left = deadlineMs - (Date.now() - started);
      const wait = Math.min(2000, Math.random() * sleep, left - 50);
      await new Promise(r => setTimeout(r, Math.max(0, wait)));
      sleep = Math.min(2000, sleep * 2);
    } finally { clearTimeout(t); }
  }
}
B) Go (context deadlines)
// Total deadline for the whole operation (all tries) lives on the context.
ctx, cancel := context.WithTimeout(context.Background(), 2500*time.Millisecond)
defer cancel()
req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
// client.Timeout caps each individual try; the context caps the total.
client := &http.Client{Timeout: 1500 * time.Millisecond}
resp, err := client.Do(req)
C) gRPC (client retry policy)
{
  "methodConfig": [{
    "name": [{ "service": "payments.Service" }],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "0.2s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
    },
    "timeout": "1.5s"
  }]
}
D) Envoy/Proxy (per‑try timeout + retries)
route:
  retry_policy:
    retry_on: "5xx,reset,connect-failure,refused-stream,unavailable"
    num_retries: 2
    per_try_timeout: 1.5s
6) Server‑side resilience (be a good neighbor)
- Shorter timeouts than clients; set DB statement_timeout below the service timeout.
- Bound concurrency (semaphores/queues); shed load early with 429/503 + Retry-After (see the sketch after this list).
- Circuit breaker on hot upstreams; bulkheads per dependency.
- Avoid automatic retries inside the service for non‑idempotent ops.
- Maintain an idempotency result cache for POSTs with keys; dedupe via a unique index to handle races.
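As referenced above, a minimal load-shedding sketch (TypeScript with an Express-style middleware signature assumed; the concurrency limit and Retry-After hint are illustrative):
// Counting semaphore: reject excess requests early instead of queueing them.
const MAX_IN_FLIGHT = 64; // tune per route / dependency
let inFlight = 0;

function shedLoad(req: any, res: any, next: () => void) {
  if (inFlight >= MAX_IN_FLIGHT) {
    res.set("Retry-After", "1"); // hint: back off roughly a second
    return res.status(429).send("overloaded");
  }
  inFlight++;
  res.on("finish", () => { inFlight--; }); // release when the response completes
  next();
}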
7) Observability (make retries visible)
Log fields (JSON): trace_id, attempt, backoff_ms, deadline_ms, per_try_timeout_ms, status, cause, idempotency_key.
Metrics:
- requests_total{status}
- retried_requests_total{cause}
- retry_attempts_total
- deadline_exceeded_total
- throttled_total (429)
- circuit_open gauge
- p95/p99 latency by route and upstream.
Tracing: add a span event per retry with cause and sleep_ms.
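For instance, a retry event could be emitted as one structured record carrying the fields above (a sketch; the logger wiring and field values are illustrative):
// Illustrative structured log for one retry decision.
function logRetry(fields: {
  trace_id: string; attempt: number; backoff_ms: number; deadline_ms: number;
  per_try_timeout_ms: number; status?: number; cause: string; idempotency_key?: string;
}) {
  console.log(JSON.stringify({ msg: "retrying", ...fields }));
}

// logRetry({ trace_id: "abc123", attempt: 1, backoff_ms: 240, deadline_ms: 2500,
//            per_try_timeout_ms: 1500, status: 503, cause: "503" });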
8) Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| No timeouts | Threads/sockets pile up | Deadlines + per‑hop timeouts |
| Retrying non‑idempotent POSTs | Double‑charges, dupes | Require Idempotency‑Key; dedupe |
| Synchronized retries | Thundering herd | Jittered backoff; small attempt count |
| Retrying 4xx blindly | Wasted work | Restrict to transients |
| Many small timeouts but huge total | Tail latency balloons | Cap by deadline |
| Hidden retries in layers | Unpredictable load | Centralize policy; instrument attempts |
Quick checklist
- [ ] Total deadline per request; hop timeouts strictly less.
- [ ] Retry only transient errors; 2–3 attempts max.
- [ ] Exponential + jitter backoff; respect Retry‑After.
- [ ] Implement Idempotency‑Key for POST.
- [ ] Circuit breakers + bounded concurrency.
- [ ] Log attempt/backoff and expose metrics/traces.
One‑minute adoption plan
- Choose a deadline (e.g., 2.5s) and set hop timeouts: edge 2.0s, service 1.5s, DB 0.8s.
- Ship a retry helper with jitter and attempt cap ≤3.
- Add Idempotency‑Key handling for mutating routes.
- Configure proxy/gRPC retry policies; set per‑try timeouts.
- Instrument retries and watch p95/p99 + retried_requests_total to tune.