
Structured Logging That Survives Production


How to make logs useful under real traffic: correlation IDs, JSON event shape, sampling rules, and PII redaction that does not depend on everyone remembering.


Correlation IDs, JSON logs, sampling, and PII redaction that keep working when traffic gets messy

Goal: make logs searchable, safe, and cheap enough to keep, without turning production debugging into archaeology.


TL;DR

  • Log events, not sentences. Use JSON with stable fields.
  • Put a correlation ID on every request, job, message, and downstream call.
  • Keep one small set of required fields: ts, level, service, env, event, trace_id, and the ID of the thing being acted on (for example invoice_id or job_id).
  • Log high-cardinality details only when they help debugging. Do not turn every log line into a metrics label soup.
  • Sample noisy success and expected-client-failure logs. Do not sample rare failures blindly.
  • Redact sensitive data centrally. Do not rely on every developer remembering which fields are dangerous.
  • Logs are not a place for full payloads, access tokens, passwords, cookies, payment data, or "temporary" dumps.

1) Logs are an interface, not a text file

Plain text logs feel convenient until you need to ask production questions:

  • What happened to this request?
  • Which tenant saw the failures?
  • Did the retry happen before or after the downstream timeout?
  • Is this one user broken, or is the whole route degraded?
  • Can support safely look at this without seeing private data?

If logs are free-form strings, every answer becomes a fragile search query.

Instead, treat each log line as an event with a stable schema.

{
  "ts": "2026-04-24T10:20:30.123Z",
  "level": "INFO",
  "service": "billing-api",
  "env": "prod",
  "event": "invoice.created",
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
  "request_id": "req_9b65df",
  "tenant_id": "t_acme",
  "user_id": "u_123",
  "invoice_id": "inv_456",
  "duration_ms": 84
}

The exact fields will vary by system, but the principle should not: machines should be able to filter, group, and join logs without parsing English.


2) Use one baseline event shape

A boring schema is a good schema.

| Field | Required? | Notes |
|---|---:|---|
| ts | yes | ISO timestamp in UTC |
| level | yes | DEBUG, INFO, WARN, ERROR |
| service | yes | Stable service name |
| env | yes | prod, staging, dev |
| event | yes | Machine-readable name like payment.failed |
| trace_id | yes | End-to-end correlation across services |
| request_id | recommended | One inbound HTTP request or job attempt |
| span_id | useful | If you emit traces |
| tenant_id | useful | Essential in multi-tenant systems |
| user_id | useful | Prefer internal IDs, not email addresses |
| duration_ms | useful | For completed operations |
| error | on failure | Structured error details |
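The baseline shape can also be written down as a type, so that a missing required field is caught before the log line ships. This is a minimal sketch: the field names come from the table above, while `BaseLogEvent`, `REQUIRED_FIELDS`, and `missingRequiredFields` are illustrative names, not part of any library.

```typescript
// Field names follow the table above; type and helper names are hypothetical.
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

interface BaseLogEvent {
  ts: string; // ISO timestamp in UTC
  level: Level;
  service: string;
  env: string;
  event: string; // machine-readable name, e.g. "payment.failed"
  trace_id: string;
  request_id?: string;
  span_id?: string;
  tenant_id?: string;
  user_id?: string;
  duration_ms?: number;
  error?: { type: string; code: string; message: string };
}

const REQUIRED_FIELDS = ["ts", "level", "service", "env", "event", "trace_id"] as const;

// Returns the names of required fields that are missing or empty strings.
function missingRequiredFields(entry: Partial<BaseLogEvent>): string[] {
  return REQUIRED_FIELDS.filter((name) => {
    const value = entry[name];
    return typeof value !== "string" || value.length === 0;
  });
}
```

A check like this fits naturally in a unit test or a CI lint step, where a failure is cheap, rather than on the hot logging path.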

Good event names are specific enough to search and stable enough to alert on:

auth.login.failed
checkout.payment.authorized
checkout.payment.declined
worker.email.delivered
worker.email.retry_scheduled

Avoid event names that are just severity in disguise:

error
failed
something_bad_happened

Those names force every query to depend on other fields.
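A naming convention is easier to keep when something checks it. The sketch below assumes one possible convention: at least two lowercase segments separated by dots, with underscores allowed inside a segment. The pattern and the `isValidEventName` helper are assumptions for illustration. Adjust them to whatever convention your team agrees on.

```typescript
// Assumed convention: "domain.action" or "domain.object.action", lowercase only.
const EVENT_NAME = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$/;

// True when the name has two or more dot-separated lowercase segments.
function isValidEventName(name: string): boolean {
  return EVENT_NAME.test(name);
}

const candidates = ["auth.login.failed", "worker.email.retry_scheduled", "error", "something_bad_happened"];
for (const name of candidates) {
  console.log(name, isValidEventName(name) ? "ok" : "rejected");
}
```

Under this rule the two single-word names are rejected, because a name with no dot carries no domain to filter on.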


3) Correlation IDs: the thread through the system

A correlation ID is the handle you use to follow one unit of work across boundaries.

For HTTP, accept an incoming ID if it is valid. Otherwise create one. Then propagate it everywhere:

  • response headers
  • application logs
  • outbound HTTP calls
  • queue messages
  • background jobs
  • error reports
  • traces

Prefer the W3C traceparent header if you are already using tracing. If you are not, an X-Request-Id header is still far better than nothing.

GET /checkout HTTP/1.1
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
X-Request-Id: req_9b65df

The practical rule:

Every log created while handling the same request, job, or message should contain the same trace_id.

If a customer gives support a request ID, an engineer should be able to paste it into the log search and reconstruct the story in minutes.
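Propagation on outbound calls can be sketched as follows, assuming the W3C traceparent layout of version, 32-hex trace ID, 16-hex parent ID, and 2-hex flags. The helper names `parseTraceparent` and `outboundTraceparent` are hypothetical. The sketch accepts only version 00 and defaults new traces to the sampled flag 01, and both of those are simplifying assumptions.

```typescript
import { randomBytes } from "node:crypto";

interface TraceContext {
  traceId: string;
  parentId: string;
  flags: string;
}

// Version 00 only: 00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed header, or null if it is missing or malformed.
function parseTraceparent(header: string | undefined): TraceContext | null {
  const match = header ? TRACEPARENT.exec(header.trim()) : null;
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero IDs are not valid trace or parent IDs.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, flags };
}

// Keeps the incoming trace ID (or creates one) and issues a new span ID for the next hop.
function outboundTraceparent(incoming: string | undefined): { traceparent: string } {
  const parent = parseTraceparent(incoming);
  const traceId = parent?.traceId ?? randomBytes(16).toString("hex");
  const spanId = randomBytes(8).toString("hex");
  const flags = parent?.flags ?? "01";
  return { traceparent: `00-${traceId}-${spanId}-${flags}` };
}
```

The trace ID stays constant across hops while the span ID changes, which is what lets one search by trace_id return every service's part of the story. If you already run a tracing SDK, let it own this header instead of hand-rolling it.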


4) JSON logs: structured, boring, and queryable

This is useful:

{
  "ts": "2026-04-24T10:21:03.300Z",
  "level": "ERROR",
  "service": "checkout-api",
  "env": "prod",
  "event": "payment.authorize.failed",
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
  "tenant_id": "t_acme",
  "user_id": "u_123",
  "payment_provider": "stripe",
  "status": 502,
  "duration_ms": 1200,
  "error": {
    "type": "upstream_timeout",
    "code": "PAYMENT_PROVIDER_TIMEOUT",
    "message": "Payment provider timed out"
  }
}

This is much less useful:

Payment failed for acme user 123 because stripe timed out after 1200ms

The text line is readable, but it is hard to aggregate safely. Is "acme" a tenant? Is "123" a user, request, or invoice? Is "stripe" always lowercase? Does "timed out" match "timeout" in another service?

Use structured fields. Keep the human-friendly message if your logger supports it, but make it secondary.

{
  "level": "WARN",
  "event": "job.retry_scheduled",
  "message": "Email delivery retry scheduled",
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
  "job_id": "job_789",
  "attempt": 2,
  "next_attempt_in_ms": 30000
}

5) Levels should mean something

If everything is ERROR, nothing is.

| Level | Use it for |
|---|---|
| DEBUG | Local or short-lived diagnostic detail |
| INFO | Important state changes and completed business events |
| WARN | Unexpected but handled conditions |
| ERROR | Failed operation needing investigation or action |

Examples:

  • A user entering a wrong password is usually INFO or WARN, not ERROR.
  • A downstream dependency timing out after retries is ERROR.
  • A validation error from a client is not automatically an ERROR.
  • A job that failed once but will retry can be WARN; final exhaustion is ERROR.

Good levels make alerts quieter and logs easier to scan.
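Level rules drift unless they live in one place. The sketch below encodes the examples above as two small functions. The function names and the exact status-code mapping (401 and 403 as WARN, other 4xx as INFO) are assumptions, not a standard.

```typescript
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

// A failed attempt that will be retried is WARN; the final attempt is ERROR.
function levelForJobFailure(attempt: number, maxAttempts: number): Level {
  return attempt < maxAttempts ? "WARN" : "ERROR";
}

// Assumed mapping: server faults are ERROR, auth rejections WARN, other client errors INFO.
function levelForHttpStatus(status: number): Level {
  if (status >= 500) return "ERROR";
  if (status === 401 || status === 403) return "WARN";
  return "INFO";
}

console.log(levelForJobFailure(2, 5), levelForJobFailure(5, 5));
console.log(levelForHttpStatus(422), levelForHttpStatus(502));
```

Centralizing the decision means an alert on ERROR volume reflects operations that actually need attention, not whichever level a developer happened to type.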


6) Sampling: reduce noise without lying to yourself

Production traffic can produce a crushing amount of logs. Sampling keeps costs under control, but careless sampling hides the events you need most.

Good defaults:

  • Keep all ERROR logs unless volume is truly extreme.
  • Keep all security-sensitive events: auth failures, privilege changes, key creation, data exports.
  • Sample high-volume INFO events like health checks, successful reads, and expected 404s.
  • Sample by stable key when you need a complete story for a subset, such as trace_id or user_id.
  • Record the sampling rate in the log event.

Example:

{
  "level": "INFO",
  "event": "request.completed",
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
  "method": "GET",
  "route": "/products/:id",
  "status": 200,
  "duration_ms": 31,
  "sample_rate": 0.05
}

That sample_rate matters. At a rate of 0.05, each kept line stands for roughly 20 real requests, so any later analysis that ignores the rate undercounts reality by a factor of 20.
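The correction is simple arithmetic: weight every kept event by the inverse of its sample rate. A minimal sketch follows, where `estimateTotal` is a hypothetical helper and a missing or non-positive rate is treated as unsampled.

```typescript
interface SampledEvent {
  event: string;
  sample_rate?: number; // absent means the event was not sampled
}

// Each kept event stands for 1 / sample_rate original events.
function estimateTotal(events: SampledEvent[]): number {
  return events.reduce((total, e) => {
    const rate = e.sample_rate !== undefined && e.sample_rate > 0 ? e.sample_rate : 1;
    return total + 1 / rate;
  }, 0);
}

const kept: SampledEvent[] = [
  { event: "request.completed", sample_rate: 0.05 },
  { event: "request.completed", sample_rate: 0.05 },
  { event: "request.completed", sample_rate: 0.05 },
  { event: "payment.authorize.failed" } // kept at 100%
];
console.log(Math.round(estimateTotal(kept)));
```

The result is an estimate, not a count. For exact numbers, emit a metric.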

A simple sampling policy

| Event | Policy |
|---|---|
| 5xx responses | keep 100% |
| failed jobs after final retry | keep 100% |
| auth/security/audit events | keep 100% |
| 4xx validation errors | sample 1-10%, depending on volume |
| health checks | drop or sample very heavily |
| successful high-volume reads | sample by route and traffic level |

Sampling is a cost control. It is not a substitute for metrics. Use metrics for counts and rates; use sampled logs for examples and debugging detail.
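A policy like this can be sketched as a rate lookup plus a deterministic keep decision keyed on trace_id, so that a sampled request keeps all of its lines or none of them. The rates, the event-name prefixes, and the names `sampleRateFor`, `bucket`, and `shouldKeep` are assumptions for illustration only.

```typescript
import { createHash } from "node:crypto";

interface Candidate {
  level: "DEBUG" | "INFO" | "WARN" | "ERROR";
  event: string;
  trace_id: string;
  status?: number;
}

// Illustrative rates only; tune them per route and traffic level.
function sampleRateFor(c: Candidate): number {
  const status = c.status ?? 0;
  if (c.level === "ERROR" || status >= 500) return 1;
  if (c.event.startsWith("auth.") || c.event.startsWith("audit.")) return 1;
  if (c.event === "healthcheck.completed") return 0;
  if (status >= 400) return 0.1;
  return 0.05;
}

// Maps a trace ID to a stable number in [0, 1) using the first 4 bytes of its SHA-256.
function bucket(traceId: string): number {
  const digest = createHash("sha256").update(traceId).digest();
  return digest.readUInt32BE(0) / 2 ** 32;
}

// The same trace_id always gets the same decision at a given rate.
function shouldKeep(c: Candidate): { keep: boolean; sample_rate: number } {
  const sample_rate = sampleRateFor(c);
  return { keep: bucket(c.trace_id) < sample_rate, sample_rate };
}
```

Write the returned sample_rate into the event when you keep it. Hashing the ID rather than calling a random number generator is what makes every service reach the same decision for the same trace.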


7) PII redaction has to be automatic

The safest logging policy is not "please remember not to log secrets."

It is:

  • values under sensitive field names are redacted by the logger, not by the caller
  • sensitive headers are blocked by default
  • large request and response bodies are not logged by default
  • allowlists are preferred over denylists for payload logging
  • tests cover the redaction rules

Redact or avoid:

| Data | Safer alternative |
|---|---|
| Passwords | never log |
| Access tokens | log token prefix or hash only if needed |
| Cookies | never log raw |
| Authorization headers | never log raw |
| Email addresses | internal user_id |
| Phone numbers | internal user_id or masked value |
| Postal addresses | internal address ID |
| Payment card data | never log |
| Full request bodies | selected allowlisted fields |
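Two of the safer alternatives in the table can be sketched briefly: copying only allowlisted fields out of a request body, and logging a short fingerprint of a token instead of the token. `pickAllowed` and `tokenFingerprint` are hypothetical helpers, and the 12-character truncation is an arbitrary choice.

```typescript
import { createHash } from "node:crypto";

// Copies only the named top-level fields; anything not listed never reaches the log.
function pickAllowed(
  body: Record<string, unknown>,
  allowlist: readonly string[]
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of allowlist) {
    if (Object.prototype.hasOwnProperty.call(body, key)) out[key] = body[key];
  }
  return out;
}

// A short, stable fingerprint lets you correlate uses of a token without storing it.
function tokenFingerprint(token: string): string {
  const hex = createHash("sha256").update(token).digest("hex");
  return "sha256:" + hex.slice(0, 12);
}

const body = { plan: "pro", seats: 3, password: "correct-horse-battery-staple" };
console.log(pickAllowed(body, ["plan", "seats", "coupon"]));
```

The allowlist fails safe: a new sensitive field added to the payload next month stays out of the logs until someone deliberately lists it, which a denylist cannot promise.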

This is the kind of log line you want:

{
  "event": "profile.update.rejected",
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
  "user_id": "u_123",
  "fields": ["email", "phone"],
  "reason": "validation_failed"
}

Not this:

{
  "event": "profile.update.rejected",
  "email": "jane@example.com",
  "phone": "+15551234567",
  "body": {
    "password": "correct-horse-battery-staple"
  }
}

When there is a real need to debug payload shape, log a schema version, field names, sizes, hashes, or a short-lived diagnostic event with explicit approval and expiry.
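Describing a payload without logging it might look like the sketch below, which records the schema version, field names, size, and a hash. `describePayload` is a hypothetical helper. The hash is only useful for comparing two payloads, and for small or guessable bodies it can be brute-forced, so treat it as a debugging aid rather than anonymization.

```typescript
import { createHash } from "node:crypto";

interface PayloadShape {
  schema_version: string;
  field_names: string[];
  size_bytes: number;
  body_sha256: string;
}

// Summarizes a payload's shape; no field values appear in the result.
function describePayload(body: Record<string, unknown>, schemaVersion: string): PayloadShape {
  const json = JSON.stringify(body);
  return {
    schema_version: schemaVersion,
    field_names: Object.keys(body).sort(),
    size_bytes: Buffer.byteLength(json, "utf8"),
    body_sha256: createHash("sha256").update(json).digest("hex")
  };
}

console.log(describePayload({ phone: "+15551234567", email: "jane@example.com" }, "v2").field_names);
```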


8) Minimal implementation pattern

In most web services, logging should be wired in at the boundary:

  1. Read or create a correlation ID.
  2. Attach request context to async-local storage or request scope.
  3. Emit one request completion log.
  4. Let application code add event-specific fields without manually passing IDs everywhere.
  5. Redact at the logger sink.

Tiny TypeScript sketch:

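import crypto from "node:crypto";
import { AsyncLocalStorage } from "node:async_hooks";
import type { NextFunction, Request, Response } from "express";

type LogContext = {
  trace_id: string;
  request_id: string;
  service: string;
  env: string;
};

// Per-request context that follows async calls without being passed by hand.
const context = new AsyncLocalStorage<LogContext>();

const REDACTED = "[REDACTED]";
const SENSITIVE_KEYS = new Set([
  "authorization",
  "cookie",
  "password",
  "token",
  "access_token",
  "refresh_token",
  "secret",
  "api_key"
]);

// Recursively replace the value of any sensitive key, at any nesting depth.
function redact(value: unknown): unknown {
  if (!value || typeof value !== "object") return value;
  if (Array.isArray(value)) return value.map(redact);

  return Object.fromEntries(
    Object.entries(value as Record<string, unknown>).map(([key, child]) => {
      if (SENSITIVE_KEYS.has(key.toLowerCase())) return [key, REDACTED];
      return [key, redact(child)];
    })
  );
}

function log(
  level: "INFO" | "WARN" | "ERROR",
  event: string,
  fields: Record<string, unknown> = {}
) {
  const base = context.getStore();
  const entry = redact({
    ts: new Date().toISOString(),
    level,
    event,
    ...base,
    ...fields
  });

  // One JSON object per line on stdout.
  process.stdout.write(JSON.stringify(entry) + "\n");
}

// W3C traceparent is version-traceid-parentid-flags; capture the 32-hex trace ID.
const TRACEPARENT = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/;
// Assumed shape for a caller-supplied X-Request-Id; tighten it to match your own IDs.
const SAFE_ID = /^[A-Za-z0-9_-]{1,64}$/;

// Returns a usable incoming ID, or undefined if none is present and well-formed.
function incomingTraceId(req: Request): string | undefined {
  const traceparent = req.headers["traceparent"];
  if (typeof traceparent === "string") {
    const match = TRACEPARENT.exec(traceparent);
    if (match) return match[1];
  }
  const requestId = req.headers["x-request-id"];
  if (typeof requestId === "string" && SAFE_ID.test(requestId)) return requestId;
  return undefined;
}

function requestMiddleware(req: Request, res: Response, next: NextFunction) {
  const started = performance.now();
  const store: LogContext = {
    // Accept a valid incoming ID; otherwise create one.
    trace_id: incomingTraceId(req) ?? crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    service: "checkout-api",
    env: process.env.NODE_ENV || "dev"
  };

  res.setHeader("X-Request-Id", store.request_id);

  // "finish" can fire outside this request's async context, so re-enter it explicitly.
  res.on("finish", () => {
    context.run(store, () => {
      log("INFO", "request.completed", {
        method: req.method,
        route: req.route?.path || req.path,
        status: res.statusCode,
        duration_ms: Math.round(performance.now() - started)
      });
    });
  });

  context.run(store, () => next());
}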

This is not a full logging library. It shows the shape: context is automatic, output is JSON, and redaction happens in one place, inside the logger, just before the line is written.
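The same pattern extends to queues and background jobs: carry the trace ID inside the message, then re-enter the context on the consumer side. A minimal sketch, assuming a message envelope with trace_id and job_id fields; `QueueMessage`, `toMessage`, `handleMessage`, and `currentTraceId` are hypothetical names, not a real queue client's API.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type JobContext = { trace_id: string; job_id: string };

const jobContext = new AsyncLocalStorage<JobContext>();

// Assumed envelope: the correlation fields travel next to the payload.
interface QueueMessage<T> {
  trace_id: string;
  job_id: string;
  payload: T;
}

// Producer side: stamp the message with the trace ID of the work that created it.
function toMessage<T>(traceId: string, jobId: string, payload: T): QueueMessage<T> {
  return { trace_id: traceId, job_id: jobId, payload };
}

// Consumer side: run the handler inside the context carried by the message.
function handleMessage<T, R>(msg: QueueMessage<T>, handler: (payload: T) => R): R {
  return jobContext.run({ trace_id: msg.trace_id, job_id: msg.job_id }, () => handler(msg.payload));
}

// Anything called from the handler, sync or async, can read the context.
function currentTraceId(): string | undefined {
  return jobContext.getStore()?.trace_id;
}
```

With this in place, a log line written by the worker carries the same trace_id as the HTTP request that enqueued the job, without the handler passing IDs by hand.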


9) What to log for common operations

HTTP request completed

{
  "event": "request.completed",
  "method": "POST",
  "route": "/orders",
  "status": 201,
  "duration_ms": 92,
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6"
}

Downstream call failed

{
  "event": "downstream.request.failed",
  "dependency": "inventory-api",
  "operation": "reserve_stock",
  "status": 503,
  "duration_ms": 800,
  "retryable": true,
  "attempt": 1,
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6"
}

Job exhausted retries

{
  "level": "ERROR",
  "event": "job.retries_exhausted",
  "job": "send_receipt_email",
  "job_id": "job_789",
  "attempts": 5,
  "last_error_code": "SMTP_TIMEOUT",
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6"
}

Security-sensitive action

{
  "level": "INFO",
  "event": "user.mfa_disabled",
  "actor_user_id": "u_admin",
  "target_user_id": "u_123",
  "tenant_id": "t_acme",
  "ip_hash": "sha256:8ad3...",
  "trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6"
}

Notice what is missing: raw tokens, cookies, passwords, full request bodies, and email addresses.
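The ip_hash field deserves one caution: the IPv4 address space is small enough that a plain hash of an IP can be reversed by trying every address. One option, sketched below, swaps in a keyed hash (HMAC-SHA256) so the value stays stable for correlation but cannot be recomputed without the key. The `hashIp` helper, the output prefix, and how the key is stored are assumptions, not what the example above necessarily used.

```typescript
import { createHmac } from "node:crypto";

// Keyed hash of an IP address; the key must live outside the logs (assumed: a secret store).
function hashIp(ip: string, key: string): string {
  return "hmac-sha256:" + createHmac("sha256", key).update(ip).digest("hex");
}

const key = "example-key-do-not-use";
console.log(hashIp("203.0.113.7", key) === hashIp("203.0.113.7", key));
```

Rotating the key breaks correlation across the rotation boundary, which can be a feature if your retention policy limits how long activity should be linkable.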


10) Anti-patterns

| Anti-pattern | Why it hurts | Better |
|---|---|---|
| Logging strings only | Hard to query and aggregate | JSON fields |
| New field names per service | Queries break across systems | Shared schema |
| Missing correlation IDs | Cannot reconstruct a request | Generate and propagate IDs |
| Logging full payloads | Leaks sensitive data and explodes cost | Allowlist fields |
| Sampling all logs equally | Rare failures disappear | Keep critical events |
| Treating logs as metrics | Expensive and incomplete | Emit real metrics |
| High-cardinality everything | Costly and hard to query | Use stable dimensions |
| Redaction in application code only | Easy to bypass | Redact in logger/sink |


11) Rollout checklist

  • [ ] Pick required fields and document them.
  • [ ] Add request/job/message correlation IDs at the boundary.
  • [ ] Emit JSON logs in every production service.
  • [ ] Standardize event names for important flows.
  • [ ] Add a redaction layer for sensitive keys and headers.
  • [ ] Stop logging full request and response bodies by default.
  • [ ] Define sampling rules by event type and severity.
  • [ ] Include sample_rate on sampled logs.
  • [ ] Keep security and audit events unsampled.
  • [ ] Add tests that prove secrets are redacted.
  • [ ] Verify support can find all logs for one trace_id.

The practical standard

Structured logging does not need to be fancy. The durable version is almost boring:

  • one JSON object per event
  • one correlation ID per unit of work
  • one shared field vocabulary
  • one central redaction path
  • one explicit sampling policy

That is enough to turn logs from a last-ditch text search into a production debugging tool you can trust.