TL;DR
- Choose your key (IP, user, token, route, tenant) and policy (hard vs soft, burst allowance) before picking an algorithm.
- Token bucket (or GCRA) gives smooth limits with bursts and is the best default for public APIs.
- Sliding window gives fairer “per‑minute” semantics than fixed windows; log variant is precise, counter is cheaper.
- Implement atomically in Redis (Lua scripts) or use NGINX/Envoy/cloud gateways for simple, fast edge enforcement.
- Return `429 Too Many Requests` with `Retry-After` and the modern `RateLimit-*` headers. Expose limits per route/tenant.
- In multi‑node/region setups, combine local fast limiters with a global budget (hierarchical) and design for clock skew.
1) What are you limiting?
- Dimensions: per IP, user/account, API key, route (e.g., `/login` tighter), tenant, or combinations (`tenant:user`); see the key‑composition sketch after this list.
- Burst policy: allow short bursts (token bucket with capacity) or enforce strict, evenly‑spaced requests (leaky bucket/GCRA)?
- Hard vs soft: hard = drop at 429; soft = queue (bounded) or degrade work. Most public APIs → hard.
- Separate auth endpoints (login/password reset) from data APIs; much lower thresholds on auth to deter brute force.
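To make the key/policy choice concrete, here is a minimal TypeScript sketch of per‑route policies and key composition; the `Policy` type, route strings, and `limitKey` helper are illustrative, not from any particular library:

```ts
// Illustrative policy shape: one entry per route pattern.
type Policy = {
  algorithm: "token_bucket" | "sliding_window";
  limit: number;        // tokens or requests
  windowSec: number;    // refill period or window length
  burst?: number;       // extra capacity for token bucket
};

// Example per-route policies (tighter on auth endpoints).
const policies: Record<string, Policy> = {
  "POST /login": { algorithm: "sliding_window", limit: 5,   windowSec: 60 },
  "GET /api/*":  { algorithm: "token_bucket",   limit: 120, windowSec: 60, burst: 20 },
};

// Key = route + tenant + strongest identity available (user > API key > IP).
function limitKey(route: string, tenant: string, userId?: string, apiKey?: string, ip?: string): string {
  const who = userId ?? apiKey ?? ip ?? "anon";
  return `rl:${route}:${tenant}:${who}`;
}
```

The identity precedence (user over API key over IP) keeps authenticated traffic from being penalized for a shared NAT address.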
2) Algorithms in 60 seconds
| Algorithm | How it works | Pros | Cons | Good for |
|---|---|---|---|---|
| Fixed window (N/min) | Counter per minute resets at boundary | Simple, cheap | Boundary burst (2× near edges), unfair | small internal tools |
| Sliding window counter | Split into current+prev window weighted by time | Cheap, fewer spikes | Approximate, still spiky | most “/min” limits |
| Sliding window log | Store timestamps; evict older than window | Precise fairness | Memory O(requests), heavier | login, abuse‑sensitive |
| Token bucket | Tokens refill at rate; spend to proceed | Bursts allowed, smooth | Needs atomicity | public APIs, mobile |
| Leaky bucket | Queue drains at fixed rate | Smooth, back‑pressure | Adds latency, queue mgmt | background ingestion |
| GCRA | Math version of leaky bucket | Precise, low memory | Trickier to grok | gateways, OSS libs |
Defaults: token bucket (API), sliding counter (cheap per‑minute), log (critical fairness), GCRA (gateways/proxies).
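GCRA only gets a table row above, so here is a minimal single‑process sketch using the common emission‑interval/burst‑offset formulation (class and parameter names are illustrative):

```ts
// Minimal in-memory GCRA: `rate` requests per `periodMs`, with a burst allowance.
class Gcra {
  private tat = 0; // theoretical arrival time (ms since epoch)
  constructor(private rate: number, private periodMs: number, private burst: number) {}

  check(nowMs: number = Date.now()): { allowed: boolean; retryAfterMs: number } {
    const emissionInterval = this.periodMs / this.rate; // ms between "ideal" requests
    const burstOffset = emissionInterval * this.burst;  // how far ahead of schedule we may run
    const newTat = Math.max(this.tat, nowMs) + emissionInterval;
    const allowAt = newTat - burstOffset;
    if (nowMs < allowAt) {
      return { allowed: false, retryAfterMs: Math.ceil(allowAt - nowMs) };
    }
    this.tat = newTat;
    return { allowed: true, retryAfterMs: 0 };
  }
}

// e.g. 10 requests/second with a burst of 20:
const limiter = new Gcra(10, 1000, 20);
limiter.check(); // { allowed: true, retryAfterMs: 0 }
```

The only stored state is a single timestamp per key, which is why GCRA is attractive for gateways: no token counter to refill.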
3) Redis patterns (atomic)
A) Token bucket (Lua; 1 script, atomic)
Refill continuously by time delta (capacity C, fill rate R tokens/sec).
```lua
-- KEYS[1]=bucket key, ARGV[1]=now_ms, ARGV[2]=rate_per_sec, ARGV[3]=capacity, ARGV[4]=cost (tokens)
-- Returns: allowed (0/1), remaining, retry_after_ms
local key  = KEYS[1]
local now  = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local cap  = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local data   = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1]) or cap
local ts     = tonumber(data[2]) or now

-- refill
local delta = math.max(0, now - ts) / 1000.0
tokens = math.min(cap, tokens + delta * rate)

local allowed = 0
local retry = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  retry = math.ceil(((cost - tokens) / rate) * 1000)
end

redis.call("HMSET", key, "tokens", tokens, "ts", now)
redis.call("PEXPIRE", key, math.ceil((cap / rate) * 1000)) -- idle expiry ~ full refill time
return {allowed, math.floor(tokens), retry}
```
Node wrapper (example)
```js
// evalsha(sha, numKeys, key, now_ms, rate_per_sec, capacity, cost)
const result = await redis.evalsha(SHA, 1, key, Date.now(), ratePerSec, capacity, 1);
const [allowed, remaining, retryMs] = result.map(Number);
if (!allowed) return tooMany(res, retryMs); // res = the HTTP response; tooMany sets 429 + Retry-After
```
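If you'd rather not manage `SHA` by hand, ioredis (assumed here) can register the Lua above as a custom command and handle EVALSHA/EVAL fallback itself; the file name and key are illustrative:

```ts
import { readFileSync } from "node:fs";
import Redis from "ioredis"; // assumed client

// Assumes the token-bucket Lua from 3A is saved as token_bucket.lua.
const TOKEN_BUCKET_LUA = readFileSync("token_bucket.lua", "utf8");
const redis = new Redis();

// ioredis caches the script and falls back to EVAL when the SHA is missing.
redis.defineCommand("tokenBucket", { numberOfKeys: 1, lua: TOKEN_BUCKET_LUA });

async function checkTokenBucket(key: string): Promise<[number, number, number]> {
  // rate 2 tokens/sec, capacity 120, cost 1
  return (await (redis as any).tokenBucket(key, Date.now(), 2, 120, 1)) as [number, number, number];
}
```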
B) Sliding window log (ZSET)
```lua
-- KEYS[1]=key, ARGV[1]=now_ms, ARGV[2]=window_ms, ARGV[3]=limit
local key   = KEYS[1]
local now   = tonumber(ARGV[1])
local win   = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

redis.call("ZREMRANGEBYSCORE", key, 0, now - win)
local count = redis.call("ZCARD", key)
if count < limit then
  -- member only needs to be unique per request
  redis.call("ZADD", key, now, tostring(now) .. ":" .. math.random())
  redis.call("PEXPIRE", key, win)
  return {1, limit - count - 1, 0}
else
  -- blocked until the oldest timestamp falls out of the window
  local oldest = tonumber(redis.call("ZRANGE", key, 0, 0, "WITHSCORES")[2])
  local retry  = (oldest and (oldest - (now - win))) or win
  return {0, 0, retry}
end
```
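The same log algorithm in a single process (no Redis) takes only a few lines; a minimal in‑memory sketch for comparison:

```ts
// In-memory sliding window log: allow `limit` requests per `windowMs`.
class SlidingWindowLog {
  private timestamps: number[] = [];
  constructor(private limit: number, private windowMs: number) {}

  check(nowMs: number = Date.now()): { allowed: boolean; retryAfterMs: number } {
    // Evict entries older than the window.
    const cutoff = nowMs - this.windowMs;
    while (this.timestamps.length && this.timestamps[0] <= cutoff) this.timestamps.shift();

    if (this.timestamps.length < this.limit) {
      this.timestamps.push(nowMs);
      return { allowed: true, retryAfterMs: 0 };
    }
    // Blocked until the oldest entry leaves the window.
    return { allowed: false, retryAfterMs: this.timestamps[0] + this.windowMs - nowMs };
  }
}
```

Memory grows with the number of requests in the window, which is exactly the trade-off flagged in the table above.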
C) Sliding window counter (approx per minute)
```lua
-- KEYS[1]=key, ARGV[1]=now_ms, ARGV[2]=window_ms, ARGV[3]=limit
local k   = KEYS[1]
local now = tonumber(ARGV[1])
local win = tonumber(ARGV[2])
local lim = tonumber(ARGV[3])

local cur  = math.floor(now / win) * win   -- start of the current window
local prev = cur - win
local c1 = tonumber(redis.call("GET", k..":"..cur)) or 0
local c0 = tonumber(redis.call("GET", k..":"..prev)) or 0

-- weight the previous window by how much of it still overlaps the sliding window
local weight = (now - cur) / win
local est = c0 * (1 - weight) + c1
if est < lim then
  redis.call("INCR", k..":"..cur)
  redis.call("PEXPIRE", k..":"..cur, win * 2)
  return {1, lim - math.ceil(est) - 1, 0}
else
  local retry = math.ceil(cur + win - now)   -- until the current window ends
  return {0, 0, retry}
end
```
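A quick worked example of the weighted estimate, with made‑up numbers: 40% into the current minute, 100 requests counted in the previous window and 30 so far in this one:

```ts
// est = previous * (1 - elapsedFraction) + current
const prev = 100, current = 30, elapsedFraction = 0.4;
const est = prev * (1 - elapsedFraction) + current;   // 100 * 0.6 + 30 = 90
console.log(est < 120 ? "allow" : "limit");           // "allow" against a 120/min limit
```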
4) Edge configs (NGINX & Envoy)
NGINX (leaky bucket with burst)
```nginx
# limit_req_zone must be defined in the http {} context.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    location /api/ {
        limit_req zone=perip burst=20 nodelay;   # absorb bursts, reject the excess
        limit_req_status 429;                    # limit_req returns 503 by default
        add_header Retry-After 1 always;
    }
}
```
`limit_req` is NGINX's leaky‑bucket limiter: `burst` sets the bucket depth and `nodelay` serves burst requests immediately instead of pacing them.
Envoy (global rate limit service)
```yaml
# HTTP filter: ship descriptors to an external rate limit service (RLS).
http_filters:
- name: envoy.filters.http.ratelimit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
    domain: apigw
    stage: 0
    request_type: external
    rate_limited_as_resource_exhausted: true
    rate_limit_service:
      transport_api_version: V3
      grpc_service: { envoy_grpc: { cluster_name: rate_limit_cluster } }

# On the route (or virtual host), build the descriptors (e.g., per-route, per-user):
rate_limits:
- actions:
  - generic_key: { descriptor_key: route, descriptor_value: users_read }
  - request_headers: { header_name: x-user-id, descriptor_key: user }
```
The per‑descriptor limits themselves (e.g., `route=users_read, user=<id>` → 120/min) live in the RLS configuration.
5) Response headers (be a good citizen)
Prefer the IETF draft/standard RateLimit response fields (supported by many gateways):
```http
RateLimit-Limit: 120;w=60
RateLimit-Remaining: 17
RateLimit-Reset: 13
Retry-After: 13
```
Also widely used legacy headers:
```http
X-RateLimit-Limit: 120
X-RateLimit-Remaining: 17
X-RateLimit-Reset: 1696346400   # epoch seconds
```
On 429 Too Many Requests, include Retry-After (seconds or HTTP date) and a machine‑readable error body.
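A minimal sketch of emitting these headers from an Express handler (Express and the helper name are assumptions, not part of any standard):

```ts
import type { Response } from "express"; // assumed framework

// Illustrative helper: attach rate-limit headers and, when blocked, a 429 body.
function applyRateLimitHeaders(res: Response, limit: number, remaining: number, resetSec: number, allowed: boolean): void {
  res.setHeader("RateLimit-Limit", String(limit));
  res.setHeader("RateLimit-Remaining", String(Math.max(0, remaining)));
  res.setHeader("RateLimit-Reset", String(resetSec));
  if (!allowed) {
    res.setHeader("Retry-After", String(resetSec));
    res.status(429).json({ error: "rate_limited", retry_after: resetSec });
  }
}
```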
6) Multi‑node & multi‑region
- Single Redis: simplest; beware cross‑region latency.
- Per‑node local limiter + global budget: a fast local token bucket on each node, periodically synced with (or drawing batches of tokens from) a shared global pool for expensive "write" actions; see the sketch after this list.
- Sticky hashing: route the same key to the same limiter node/shard for consistency.
- Clock skew: base decisions on server time; for multi‑region Redis, use token bucket (less sensitive than exact windows).
- Idempotency keys on write endpoints reduce the impact of retry storms.
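A minimal sketch of the "local limiter + global budget" idea above, assuming ioredis and a global counter key that something else resets each window; the class and the coarse batching policy are illustrative:

```ts
import Redis from "ioredis"; // assumed client

// Each node keeps a small local allowance and tops it up from a shared global
// budget in batches, so most requests never touch Redis.
class HierarchicalLimiter {
  private local = 0;
  constructor(private redis: Redis, private globalKey: string, private batch = 50) {}

  async allow(): Promise<boolean> {
    if (this.local > 0) { this.local--; return true; }
    // Draw a batch from the global pool; DECRBY is atomic.
    const left = await this.redis.decrby(this.globalKey, this.batch);
    if (left < 0) {
      // Over budget: return the whole batch and deny (coarse; a partial draw is not split).
      await this.redis.incrby(this.globalKey, this.batch);
      return false;
    }
    this.local = this.batch - 1;
    return true;
  }
}
```

Batch size trades accuracy for Redis load: larger batches mean fewer round trips but more unused budget stranded on idle nodes.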
7) Abuse & product considerations
- Combine IP and user/account keys for endpoints reachable anonymously; apply aggressive limits to /login, /signup, and password reset.
- After a soft threshold, add CAPTCHA/device fingerprinting, or deliberately slow responses (tarpitting).
- Per‑tenant scaling: tiers/quotas; rate limit by tenant plan. Burst pooling (token top‑ups) is fine if fair.
- 429 UX: show clear message and time to retry; SDKs should implement exponential backoff + jitter.
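A minimal client‑side sketch of exponential backoff with full jitter that honors `Retry-After` (uses the global `fetch`; the cap and retry count are arbitrary):

```ts
// Retry 429s with exponential backoff + full jitter, honoring Retry-After.
async function fetchWithBackoff(url: string, maxRetries = 5): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429 || attempt >= maxRetries) return res;

    const retryAfterSec = Number(res.headers.get("Retry-After"));
    const baseMs = Number.isFinite(retryAfterSec) && retryAfterSec > 0
      ? retryAfterSec * 1000
      : Math.min(30_000, 1000 * 2 ** attempt);  // cap at 30s
    const delay = Math.random() * baseMs;       // full jitter
    await new Promise((r) => setTimeout(r, delay));
  }
}
```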
8) Observability & tuning
Metrics per key/policy:
- `allowed_total`, `limited_total`
- `retry_after_ms` histogram
- `remaining` gauges (sampled)
- `burst_exhausted_total`
- P95 latency added by the limiter; Redis timeouts/errors
Logs: include key, policy, algorithm, capacity, remaining, retry_after.
Dashboards: top limited keys, hot routes, 429 rate vs traffic, edge vs origin enforcement split.
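A minimal sketch of the counters and histogram above using prom-client (an assumption; any metrics library works), with illustrative metric and label names:

```ts
import { Counter, Histogram } from "prom-client"; // assumed metrics library

const allowedTotal = new Counter({
  name: "ratelimit_allowed_total",
  help: "Requests allowed by the limiter",
  labelNames: ["route", "policy"],
});
const limitedTotal = new Counter({
  name: "ratelimit_limited_total",
  help: "Requests rejected with 429",
  labelNames: ["route", "policy"],
});
const retryAfterMs = new Histogram({
  name: "ratelimit_retry_after_ms",
  help: "Suggested retry delay in ms",
  buckets: [50, 100, 250, 500, 1000, 5000, 15000, 60000],
});

// Call in the request path after each limiter decision.
function record(route: string, policy: string, allowed: boolean, retryMs: number): void {
  if (allowed) {
    allowedTotal.inc({ route, policy });
  } else {
    limitedTotal.inc({ route, policy });
    retryAfterMs.observe(retryMs);
  }
}
```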
9) Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| Fixed window only | Boundary burst & unfairness | Use sliding or token bucket |
| Non‑atomic updates | Race → burst holes | Use Lua/transactions or proxy limiters |
| Global queueing | Latency spikes, memory | Prefer drop or small per‑key queues |
| One size for all routes | Abusable endpoints unprotected | Different policies per route |
| Ignoring clock skew | Inconsistent limits | Use token bucket; base on server time |
| No headers on 429 | Poor client behavior | Send RateLimit‑* + Retry‑After |
| Single point Redis | Outage = no traffic | Fallback to local fail‑open/close by route |
Quick checklist
- [ ] Pick keys/policies per route (IP, user, tenant).
- [ ] Choose algorithm: token bucket (default), sliding for “per‑minute”, log for sensitive flows.
- [ ] Implement atomically (Redis Lua) or via NGINX/Envoy/gateway.
- [ ] Emit RateLimit‑* and Retry‑After; return 429.
- [ ] Add metrics/logging; test with load + burst traffic.
- [ ] Plan for multi‑region (local + global budget, sticky hashing).
One‑minute adoption plan
- Set per‑route policies (e.g., `/login` 5/min per IP; `/api/*` 120/min per user; `/search` token bucket at 10 r/s with burst 20).
- Ship a Redis token bucket (Lua) as your default enforcement.
- Add RateLimit‑* + `Retry-After` to responses; update SDKs to back off with jitter.
- Create dashboards for 429s and top keys; run a burst test in staging.
- For multi‑region, add a local limiter and test fail‑open/close modes.