caduh

A Practical Guide to API Error Handling


Status codes you should actually use, consistent error shapes (Problem Details, JSend), and production‑ready logging/observability—so clients get clear signals and you get actionable telemetry.

TL;DR

  • Use clear HTTP status codes with one canonical error shape across all endpoints.
  • Prefer Problem Details (RFC 9457, which obsoletes RFC 7807); JSend is fine if you already use it, as long as you are consistent.
  • Include actionable fields: code, message, details, trace_id, docs link, and where useful a hint or errors[] per‑field.
  • Log structured JSON with correlation IDs; never leak secrets. Sample noisy errors; page only on SLO‑relevant ones.

1) Status codes that pull their weight

| Situation | Status | Notes |
|---|---|---|
| Success | 200 OK, 201 Created, 202 Accepted, 204 No Content | 201 for new resources with Location header |
| Client error (bad input) | 400 Bad Request | Invalid JSON, schema violation (or use 422) |
| Validation errors | 422 Unprocessable Content | Body is well‑formed but semantically invalid |
| Authn/Authz | 401 Unauthorized, 403 Forbidden | 401 ⇒ unauthenticated; 403 ⇒ authenticated but not allowed |
| Not found | 404 Not Found | Don’t leak existence in sensitive contexts |
| Method/Media | 405 Method Not Allowed, 415 Unsupported Media Type, 406 Not Acceptable | |
| Conflict | 409 Conflict | Version mismatch, unique constraint conflicts |
| Rate limiting | 429 Too Many Requests | Add Retry-After (seconds or HTTP date) |
| Upstream timeout | 504 Gateway Timeout | Client can retry |
| Server overload/maintenance | 503 Service Unavailable | Add Retry-After; keep brief |
| Generic server error | 500 Internal Server Error | Catch‑all; instrument and fix root cause |

Pick either 400 or 422 for validation failures and stick to it. Many teams reserve 400 for malformed payloads (for example, unparseable JSON) and use 422 for field‑level validation.
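One way to keep the mapping honest is a single lookup table that every handler goes through. The sketch below assumes a small, hypothetical exception hierarchy (DomainError and its subclasses are illustrative names, not from any framework); adapt the classes and codes to your own domain.

```python
class DomainError(Exception):
    """Hypothetical base class for errors the API raises on purpose."""


class ValidationFailed(DomainError):
    pass


class NotAuthenticated(DomainError):
    pass


class PermissionDenied(DomainError):
    pass


class ResourceMissing(DomainError):
    pass


class VersionConflict(DomainError):
    pass


class RateLimited(DomainError):
    pass


# One table, reviewed once, instead of status codes scattered across handlers.
STATUS_BY_ERROR: dict[type, int] = {
    ValidationFailed: 422,
    NotAuthenticated: 401,
    PermissionDenied: 403,
    ResourceMissing: 404,
    VersionConflict: 409,
    RateLimited: 429,
}


def status_for(exc: BaseException) -> int:
    """Return the mapped status, walking up the class hierarchy; unmapped errors become 500."""
    for cls in type(exc).__mro__:
        if cls in STATUS_BY_ERROR:
            return STATUS_BY_ERROR[cls]
    return 500
```

Because the lookup walks the class hierarchy, a subclass of VersionConflict still maps to 409 without a new table entry, and anything unexpected falls through to 500.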


2) One consistent error format

Option A: Problem Details (RFC 9457, formerly RFC 7807)

HTTP/1.1 422 Unprocessable Content
Content-Type: application/problem+json
Traceparent: 00-2c3e...-9f7c...-01

{
  "type": "https://docs.example.com/problems/validation-error",
  "title": "Your request parameters didn't validate",
  "status": 422,
  "detail": "The 'email' field must be a valid address.",
  "instance": "/users",
  "trace_id": "01HZX2C9Q5C3PZ3M1E0K7",
  "errors": [
    { "field": "email", "code": "invalid_email", "message": "Must be a valid email" },
    { "field": "age", "code": "min", "message": "Must be at least 18" }
  ]
}
  • type is a URI that identifies the problem type; ideally it resolves to documentation (one URI per class of problem is enough).
  • instance can echo the request path or a unique occurrence id.
  • Add custom members (trace_id, errors) as needed.
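To make the shape concrete, here is a minimal sketch of a builder for such a body. The Problem class and its extensions field are hypothetical names for illustration; the only behavior assumed from the spec is that extension members sit at the top level next to the standard ones.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Problem:
    """Illustrative container for a Problem Details body (not a library type)."""

    type: str
    title: str
    status: int
    detail: Optional[str] = None
    instance: Optional[str] = None
    # Custom members such as trace_id or errors.
    extensions: dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        body: dict[str, Any] = {
            "type": self.type,
            "title": self.title,
            "status": self.status,
        }
        if self.detail is not None:
            body["detail"] = self.detail
        if self.instance is not None:
            body["instance"] = self.instance
        for key, value in self.extensions.items():
            # Extensions are added at the top level but never overwrite a standard member.
            body.setdefault(key, value)
        return body
```

A handler would then build `Problem(type=..., title=..., status=422, extensions={"trace_id": ..., "errors": [...]})` and serialize `to_dict()` with the application/problem+json content type.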

Option B: JSend (if that’s your house style)

{
  "status": "fail",
  "data": {
    "errors": [
      { "field": "email", "code": "invalid_email", "message": "Must be a valid email" }
    ]
  },
  "code": "VALIDATION_ERROR",
  "trace_id": "01HZX2..."
}

Whatever you choose, document it once and reuse everywhere (including 401/403/404/500/503/429).
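If you go the JSend route, a single helper keeps the envelope uniform across status codes. This is a sketch under assumptions: jsend_envelope is a hypothetical name, and carrying code and trace_id on "fail" responses follows the example above rather than base JSend, which only defines code for "error" responses.

```python
from typing import Any, Optional


def jsend_envelope(
    http_status: int,
    code: str,
    trace_id: str,
    message: str = "",
    errors: Optional[list] = None,
) -> dict[str, Any]:
    """Build a JSend body: 'fail' for 4xx (client's fault), 'error' for everything else."""
    if 400 <= http_status < 500:
        return {
            "status": "fail",
            "data": {"errors": errors or []},
            "code": code,
            "trace_id": trace_id,
        }
    return {
        "status": "error",
        "message": message or "Internal Server Error",
        "code": code,
        "trace_id": trace_id,
    }
```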


3) Headers that help clients (and you)

  • Correlation: include/propagate Traceparent (W3C) and echo an X-Request-Id or your own trace_id in the body.
  • Retry guidance: send Retry-After on 429/503, and document that a retry is not guaranteed to succeed and that non‑idempotent operations should not be retried blindly.
  • Caching: Cache-Control: no-store on sensitive error bodies.
  • Versioning: Content-Type should be explicit (application/problem+json).
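The header rules above can be sketched as two small helpers. Both function names are hypothetical; the traceparent parsing assumes the four-field W3C layout (version, 32-hex trace id, 16-hex parent id, flags) and simply returns None for anything else.

```python
import re
from typing import Optional

# version - trace-id - parent-id - flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_id_from_traceparent(header: Optional[str]) -> Optional[str]:
    """Extract the trace id from a traceparent header, or None if it is missing or invalid."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id = match.group(1), match.group(2)
    # Version "ff" and an all-zero trace id are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32:
        return None
    return trace_id


def error_headers(status: int, request_id: str, retry_after_s: Optional[int] = None) -> dict[str, str]:
    """Headers for an error response; Retry-After is only attached to 429/503."""
    headers = {
        "Content-Type": "application/problem+json",
        "Cache-Control": "no-store",
        "X-Request-Id": request_id,
    }
    if status in (429, 503) and retry_after_s is not None:
        headers["Retry-After"] = str(retry_after_s)
    return headers
```

Applying no-store to every error response is a simplification here; the guidance above only calls for it on sensitive error bodies.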

4) Logging & observability (production‑grade)

Log shape (JSON)

{
  "ts": "2025-09-07T10:20:30.123Z",
  "level": "ERROR",
  "service": "users-api",
  "method": "POST",
  "path": "/users",
  "status": 422,
  "duration_ms": 37,
  "client_ip": "203.0.113.10",
  "user_id": "u_123", 
  "tenant_id": "t_acme",
  "trace_id": "01HZX2C9Q5C3...",
  "error": { "type": "validation", "code": "invalid_email", "message": "Must be a valid email" }
}

Principles

  • Structured logs only; no free‑form stack traces without fields.
  • Redact secrets/PII; never log full tokens or passwords.
  • Sample noisy 4xx; always log 5xx.
  • Emit metrics (requests_total{status}, errors_total{code}, latency_bucket) and traces (OpenTelemetry).
  • Define SLOs (e.g., 99.9% non‑5xx) and alert on error budgets, not every spike.
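Redaction and sampling are easy to get subtly wrong, so here is a minimal sketch of both. The key list, the 10% default sample rate, and the function names are assumptions for illustration; tune them to your own traffic and data.

```python
import json
import random
from typing import Any, Callable

# Assumed list of sensitive keys; extend it for your own payloads.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "secret"}


def redact(value: Any) -> Any:
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def should_log_error(
    status: int,
    sample_rate_4xx: float = 0.1,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Always log 5xx; log a sampled fraction of 4xx; ignore non-errors here."""
    if status >= 500:
        return True
    if status >= 400:
        return rng() < sample_rate_4xx
    return False


def log_line(event: dict) -> str:
    """Serialize one structured log event as compact JSON after redaction."""
    return json.dumps(redact(event), separators=(",", ":"), sort_keys=True)
```

Passing the random source as a parameter keeps the sampling decision testable; in production the default random.random is enough.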

5) Tiny middleware examples

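Express (Node, TypeScript)

import { randomUUID } from "node:crypto";
import express, { NextFunction, Request, Response } from "express";

const app = express();
// Stand-in structured logger; swap in your real logger (pino, winston, ...).
const logger = { error: (fields: object) => console.error(JSON.stringify(fields)) };

// Error shape; `code` doubles as the slug of the problem type URI.
class ApiError extends Error {
  status = 500;
  code = "internal-error";
  details?: unknown;
  constructor(status: number, code: string, message: string, details?: unknown) {
    super(message); this.status = status; this.code = code; this.details = details;
  }
}

// Central handler: register it after all routes. Express treats a
// four-argument middleware as an error handler.
app.use((err: Partial<ApiError>, req: Request, res: Response, next: NextFunction) => {
  if (res.headersSent) return next(err); // response already started; let Express finish it
  const traceId = req.get("x-request-id") || randomUUID();
  const status = err.status ?? 500;
  const body = {
    type: "https://docs.example.com/problems/" + (err.code || "internal-error"),
    // Only echo messages for client errors; unexpected 5xx messages may leak internals.
    title: status < 500 && err.message ? err.message : "Internal Server Error",
    status,
    trace_id: traceId,
    ...(err.details ? { errors: err.details } : {})
  };
  // log (structured)
  logger.error({ trace_id: traceId, status, path: req.path, error: { code: err.code, message: err.message } });
  res.setHeader("Content-Type", "application/problem+json");
  res.status(status).json(body);
});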

FastAPI (Python)

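import json
import logging
from uuid import uuid4

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from starlette.exceptions import HTTPException as StarletteHTTPException

app = FastAPI()
logger = logging.getLogger("api")

# HTTPException has its own default handler, so register it explicitly
# alongside the catch-all for unexpected exceptions.
@app.exception_handler(StarletteHTTPException)
@app.exception_handler(Exception)
async def problem_details_handler(request: Request, exc: Exception):
    trace_id = request.headers.get("x-request-id") or uuid4().hex
    status = getattr(exc, "status_code", 500)
    # Echo the message only for client errors; 5xx messages may leak internals.
    title = str(getattr(exc, "detail", exc)) if status < 500 else "Internal Server Error"
    body = {
        "type": f"https://docs.example.com/problems/{getattr(exc, 'code', 'internal-error')}",
        "title": title,
        "status": status,
        "trace_id": trace_id
    }
    # Structured log line; the full exception stays server-side.
    logger.error(json.dumps({"trace_id": trace_id, "status": status, "path": request.url.path, "error": repr(exc)}))
    return JSONResponse(content=body, status_code=status, media_type="application/problem+json")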

6) Validation & per‑field errors

  • Validate early (schema/DTO) and return all field issues in one response.
  • Include errors[] with field, code, message, and optional hint (e.g., allowed range).
  • Keep messages developer‑friendly (end‑users get localized messages elsewhere).
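Returning every field issue at once means the validator collects errors instead of raising on the first one. A minimal sketch, assuming a hypothetical validate_user function and the email/age rules from the earlier example payload (the "@" check is a placeholder, not real email validation):

```python
from typing import Any


def validate_user(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Check every field and return all problems; an empty list means the payload is valid."""
    errors: list[dict[str, str]] = []

    email = payload.get("email")
    if not isinstance(email, str) or "@" not in email:
        errors.append({
            "field": "email",
            "code": "invalid_email",
            "message": "Must be a valid email",
            "hint": "Example: name@example.com",
        })

    age = payload.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool) or age < 18:
        errors.append({
            "field": "age",
            "code": "min",
            "message": "Must be at least 18",
            "hint": "Minimum: 18",
        })

    return errors
```

A handler would return 422 with the list under errors[] whenever it is non-empty. In practice a schema library does this collection for you; the point is that the response carries all issues in one round trip.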

7) Retries & idempotency

  • Safe methods (GET, HEAD) are idempotent, as are PUT and DELETE by definition; clients may retry these on network errors/5xx.
  • For mutating endpoints, support idempotency keys (header like Idempotency-Key) to avoid duplicate side effects on retry.
  • Clients should back off exponentially with jitter; don’t rely on them to be polite, and enforce limits with 429.
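Both halves of this section fit in a short sketch: a client-side delay calculation using "full jitter", and a server-side replay of stored responses keyed by Idempotency-Key. The names, the 0.5 s base, and the 30 s cap are assumptions; the in-memory store is illustrative only, since a real one needs expiry, atomic writes, and a check that the replayed request matches the original.

```python
import random
from typing import Any, Callable


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


class IdempotencyStore:
    """In-memory sketch: the first call per key runs the handler, later calls replay the result."""

    def __init__(self) -> None:
        self._responses: dict[str, Any] = {}

    def run(self, key: str, handler: Callable[[], Any]) -> Any:
        if key in self._responses:
            return self._responses[key]  # retry: no second side effect
        result = handler()
        self._responses[key] = result
        return result
```

A client sleeps for `backoff_delay(attempt)` between tries (or for Retry-After when the server sends one), and the server wraps each mutating handler in `store.run(idempotency_key, ...)`.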

8) Documentation (one page to rule them all)

  • Document the error envelope once (Problem Details/JSend).
  • In OpenAPI, reference a shared schema for errors and link named type URIs:
components:
  schemas:
    ProblemDetails:
      type: object
      required: [type, title, status]
      properties:
        type: { type: string, format: uri }
        title: { type: string }
        status: { type: integer }
        detail: { type: string }
        instance: { type: string }
        trace_id: { type: string }
        errors:
          type: array
          items:
            type: object
            properties:
              field: { type: string }
              code: { type: string }
              message: { type: string }

9) Pitfalls & fast fixes

| Pitfall | Why it hurts | Fix |
|---|---|---|
| Inconsistent shapes per endpoint | Clients need custom parsers | One schema; SDKs parse once |
| Returning stacks to clients | Leaks internals/PII | Log stacks; send clean messages only |
| Overusing 500/400 for everything | No signal | Map to correct codes; add code/type |
| Missing correlation IDs | Hard to debug | Propagate Traceparent/X-Request-Id |
| Noisy alerts on 4xx | Pager fatigue | Alert on 5xx & SLOs; sample 4xx |
| Vague validation errors | Support tickets | Include errors[] with field codes & hints |


10) Quick checklist

  • [ ] Pick Problem Details or JSend; document and enforce.
  • [ ] Map the 10–12 status codes you’ll use; lint in CI.
  • [ ] Add correlation IDs and structured logs everywhere.
  • [ ] Return field‑level errors; keep messages action‑oriented.
  • [ ] Use Retry‑After on 429/503; support idempotency keys.
  • [ ] Define SLOs & alerts for 5xx; sample the rest.

One‑minute adoption plan

  1. Add a shared error response to your OpenAPI and generate clients.
  2. Implement a global error handler (middleware) that emits Problem Details with trace_id.
  3. Centralize logging (structured JSON) and propagate Traceparent.
  4. Review endpoints; fix status codes and add Retry-After where needed.
  5. Document the error model; link type URIs to a single help page.