TL;DR
- Give every request a total deadline, and each hop a shorter timeout than its caller.
- Retry only transient failures (timeouts, connection resets, 502/503/504/429). Don’t retry non‑idempotent work unless you implement Idempotency‑Key.
- Use exponential backoff + jitter, cap by time budget (not huge counts), and stop after 2–3 attempts.
- Propagate a trace ID; log attempt, backoff_ms, and cause. Send Retry-After on 429/503.
- Servers must be stricter than clients: shorter timeouts, bounded concurrency, and circuit breakers. Hedged requests only for idempotent reads.
1) The model: deadline → budgets per hop
Client total deadline: 2500ms
├─ Edge/Gateway: per‑try timeout 2000ms
├─ Service A: timeout 1500ms
└─ Database: statement_timeout 800ms
If a hop can’t finish within its budget, fail fast → caller still has time to retry or degrade.
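A minimal sketch of this budget math in TypeScript (the helper name and 50ms safety margin are illustrative, not from any particular library): the edge sets one absolute deadline, and each hop derives its per-try timeout from whatever remains.
// Hypothetical budget helper: the edge sets one absolute deadline (epoch ms),
// and every downstream hop derives its per-try timeout from what's left.
function perTryTimeout(deadlineAt: number, hopCapMs: number, marginMs = 50): number {
  const remaining = deadlineAt - Date.now() - marginMs; // leave the caller room to react
  if (remaining <= 0) throw new Error("deadline_exceeded"); // fail fast, don't start work
  return Math.min(hopCapMs, remaining); // never exceed the hop's own cap
}

// Example: edge sets a 2500ms deadline; Service A caps each try at 1500ms.
const deadlineAt = Date.now() + 2500;
const serviceTimeoutMs = perTryTimeout(deadlineAt, 1500); // ≤ 1500ms, shrinks as time passes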
2) Retry decision table (what, when, how)
| Condition | Safe to retry? | Notes |
|---|---|---|
| Network error / connection reset / TLS handshake error | Yes | Immediate retry with jitter |
| Timeout (no response) | Yes | Caller didn’t observe success; safe for idempotent ops |
| 502 Bad Gateway / 503 Unavailable / 504 Gateway Timeout | Yes | Transient upstream trouble |
| 429 Too Many Requests | Yes | Respect Retry-After (seconds or HTTP date) |
| 408 Request Timeout | Yes | Treat like a network timeout |
| 500 Internal Server Error | Maybe | Only if known transient; otherwise bubble up |
| 4xx (400/401/403/404/422) | No | Client/action issue—don’t retry |
| POST without idempotency key | No | Could double‑charge / double‑create |
| POST with idempotency key | Yes | Server must dedupe by key |
| GET/PUT/DELETE | Usually | Idempotent by spec; still cap attempts |
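The table above notes that Retry-After can carry either delta-seconds or an HTTP date. A small sketch (TypeScript; the function name and fallback value are illustrative) that handles both and clamps the wait to the remaining budget:
// Parse Retry-After as delta-seconds ("120") or an HTTP date; clamp to the budget.
function retryAfterMs(header: string | null, remainingMs: number, fallbackMs = 300): number {
  if (!header) return Math.min(fallbackMs, remainingMs);
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return Math.min(seconds * 1000, remainingMs);
  const at = Date.parse(header); // NaN unless a valid HTTP date
  if (!Number.isNaN(at)) return Math.min(Math.max(0, at - Date.now()), remainingMs);
  return Math.min(fallbackMs, remainingMs);
}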
3) Backoff with jitter (don’t stampede)
Full‑jitter exponential
base = 100ms, factor = 2, cap = 2s
sleep = random(0, min(cap, base * 2^attempt))
Decorrelated jitter (avoids lock‑step once warmed)
sleep = min(cap, random(base, prev_sleep * 3))
Default schedule: try at t=0, then ~200–400ms, optionally ~1s. Stop when the deadline is near.
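Both strategies as a TypeScript sketch (constants mirror the defaults above; the caller is expected to re-check the deadline before sleeping):
// Full jitter: sleep = random(0, min(cap, base * 2^attempt))
function fullJitter(attempt: number, baseMs = 100, capMs = 2000): number {
  return Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
}

// Decorrelated jitter: sleep = min(cap, random(base, prevSleep * 3)); start with prevSleep = base
function decorrelatedJitter(prevSleepMs: number, baseMs = 100, capMs = 2000): number {
  const upper = Math.max(baseMs, prevSleepMs * 3);
  return Math.min(capMs, baseMs + Math.random() * (upper - baseMs));
}

// Usage: await new Promise(r => setTimeout(r, fullJitter(attempt)));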
4) Idempotency keys (server pattern)
Accept a header like Idempotency-Key: <uuid> for mutating operations.
// Pseudo‑Node: dedupe POST /payments
const key = req.header("Idempotency-Key");
const scope = `${req.tenant}:${req.path}`;
const id = `${scope}:${key}`;
// 1) Atomically reserve the key (unique constraint):
// idempotency(id PRIMARY KEY, status, response, created_at)
try { insert(id, "processing", null); }
catch (e) {
  // Key already exists: look up the stored record and replay its response
  // (or return 409 if the first request is still "processing").
  const row = select(id);
  return send(row.response);
}
// 2) Process once
const result = await chargeCard(req.body);
// 3) Store final response and return it
update(id, "done", serialize(result), now());
return send(result);
TTL the record (e.g., 24–72h). Keep scope narrow (tenant + route) to avoid collisions.
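On the client side, generate the key once per logical operation and reuse it on every retry; a fresh key per attempt defeats the dedupe. A hedged sketch (TypeScript on Node 18+; the endpoint URL and retry constants are made up):
import { randomUUID } from "node:crypto";

async function createPayment(body: unknown, maxAttempts = 3): Promise<Response> {
  const key = randomUUID(); // generated ONCE per logical payment, reused across retries
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch("https://api.example.com/payments", { // illustrative endpoint
        method: "POST",
        headers: { "content-type": "application/json", "Idempotency-Key": key },
        body: JSON.stringify(body),
      });
      // Retry only transient statuses, and only while attempts remain.
      if (![502, 503, 504, 429].includes(res.status) || attempt >= maxAttempts - 1) return res;
    } catch (e) { // network error / reset: safe to retry because the key dedupes server-side
      if (attempt >= maxAttempts - 1) throw e;
    }
    await new Promise(r => setTimeout(r, Math.random() * 200 * 2 ** attempt)); // jittered backoff
  }
}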
5) Language & platform recipes
A) HTTP client (TypeScript + fetch)
type TransientCause = "network" | "timeout" | "502" | "503" | "504" | "429";
function isTransient(e: any): { ok: boolean, cause?: TransientCause } {
  if (e.name === "AbortError") return { ok: true, cause: "timeout" };
  if (!e.response) return { ok: true, cause: "network" };
  const s = e.response?.status;
  if ([502, 503, 504, 429].includes(s)) return { ok: true, cause: String(s) as TransientCause };
  return { ok: false };
}
export async function getWithRetry(url: string, deadlineMs = 2500) {
  const started = Date.now();
  let sleep = 200; // backoff base; doubled after each retried attempt
  for (let attempt = 0; ; attempt++) {
    const remaining = deadlineMs - (Date.now() - started);
    if (remaining <= 0) throw new Error("deadline_exceeded");
    // Per-try timeout: never longer than what is left of the total deadline.
    const ctrl = new AbortController();
    const t = setTimeout(() => ctrl.abort(), Math.min(remaining, 1500));
    try {
      const res = await fetch(url, { signal: ctrl.signal });
      if (res.status === 429) {
        // Honor Retry-After (delta-seconds); fall back to a short pause.
        const ra = parseInt(res.headers.get("retry-after") || "0", 10) * 1000;
        await new Promise(r => setTimeout(r, Math.min(ra || 300, remaining)));
        continue;
      }
      // Treat 502/503/504 as transient and hand them to the catch block below.
      if (!res.ok && [502, 503, 504].includes(res.status)) throw { response: res };
      return res;
    } catch (e: any) {
      const { ok } = isTransient(e);
      if (!ok || attempt >= 2) throw e; // non-transient, or attempt cap reached
      // Full-jitter backoff, clamped to the re-measured remaining budget.
      const left = deadlineMs - (Date.now() - started);
      const wait = Math.min(2000, Math.random() * sleep, left - 50);
      await new Promise(r => setTimeout(r, Math.max(0, wait)));
      sleep = Math.min(2000, sleep * 2);
    } finally { clearTimeout(t); }
  }
}
B) Go (context deadlines)
// Total deadline for the whole operation (all tries) lives on the context.
ctx, cancel := context.WithTimeout(context.Background(), 2500*time.Millisecond)
defer cancel()
req, _ := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
// client.Timeout caps each individual try; the context caps the total.
client := &http.Client{Timeout: 1500 * time.Millisecond}
resp, err := client.Do(req)
C) gRPC (client retry policy)
{
  "methodConfig": [{
    "name": [{ "service": "payments.Service" }],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "0.2s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE", "RESOURCE_EXHAUSTED"]
    },
    "timeout": "1.5s"
  }]
}
D) Envoy/Proxy (per‑try timeout + retries)
route:
  retry_policy:
    retry_on: "5xx,reset,connect-failure,refused-stream,unavailable"
    num_retries: 2
    per_try_timeout: 1.5s
6) Server‑side resilience (be a good neighbor)
- Shorter timeouts than clients; set DB statement_timeout below the service timeout.
- Bound concurrency (semaphores/queues); shed load early with 429/503 + Retry-After (see the sketch after this list).
- Circuit breaker on hot upstreams; bulkheads per dependency.
- Avoid automatic retries inside the service for non‑idempotent ops.
- Maintain an idempotency result cache for POSTs with keys; dedupe via a unique index to handle races.
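As referenced above, a minimal load-shedding sketch (TypeScript with an Express-style middleware signature assumed; the concurrency limit and Retry-After hint are illustrative):
// Counting semaphore: reject excess requests early instead of queueing them.
const MAX_IN_FLIGHT = 64; // tune per route / dependency
let inFlight = 0;

function shedLoad(req: any, res: any, next: () => void) {
  if (inFlight >= MAX_IN_FLIGHT) {
    res.set("Retry-After", "1"); // hint: back off roughly a second
    return res.status(429).send("overloaded");
  }
  inFlight++;
  res.on("finish", () => { inFlight--; }); // release when the response completes
  next();
}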
7) Observability (make retries visible)
Log fields (JSON): trace_id, attempt, backoff_ms, deadline_ms, per_try_timeout_ms, status, cause, idempotency_key.
Metrics:
- requests_total{status}
- retried_requests_total{cause}
- retry_attempts_total
- deadline_exceeded_total
- throttled_total (429)
- circuit_open gauge
- p95/p99 latency by route and upstream.
Tracing: add a span event per retry with cause and sleep_ms.
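For instance, a retry event could be emitted as one structured record carrying the fields above (a sketch; the logger wiring and field values are illustrative):
// Illustrative structured log for one retry decision.
function logRetry(fields: {
  trace_id: string; attempt: number; backoff_ms: number; deadline_ms: number;
  per_try_timeout_ms: number; status?: number; cause: string; idempotency_key?: string;
}) {
  console.log(JSON.stringify({ msg: "retrying", ...fields }));
}

// logRetry({ trace_id: "abc123", attempt: 1, backoff_ms: 240, deadline_ms: 2500,
//            per_try_timeout_ms: 1500, status: 503, cause: "503" });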
8) Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| No timeouts | Threads/sockets pile up | Deadlines + per‑hop timeouts |
| Retrying non‑idempotent POSTs | Double‑charges, dupes | Require Idempotency‑Key; dedupe |
| Synchronized retries | Thundering herd | Jittered backoff; small attempt count |
| Retrying 4xx blindly | Wasted work | Restrict to transients |
| Many small timeouts but huge total | Tail latency balloons | Cap by deadline |
| Hidden retries in layers | Unpredictable load | Centralize policy; instrument attempts |
Quick checklist
- [ ] Total deadline per request; hop timeouts strictly less.
- [ ] Retry only transient errors; 2–3 attempts max.
- [ ] Exponential + jitter backoff; respect Retry‑After.
- [ ] Implement Idempotency‑Key for POST.
- [ ] Circuit breakers + bounded concurrency.
- [ ] Log attempt/backoff and expose metrics/traces.
One‑minute adoption plan
- Choose a deadline (e.g., 2.5s) and set hop timeouts: edge 2.0s, service 1.5s, DB 0.8s.
- Ship a retry helper with jitter and attempt cap ≤3.
- Add Idempotency‑Key handling for mutating routes.
- Configure proxy/gRPC retry policies; set per‑try timeouts.
- Instrument retries and watch p95/p99 + retried_requests_total to tune.