Structured Logging That Survives Production
Correlation IDs, JSON logs, sampling, and PII redaction that keep working when traffic gets messy
Goal: make logs searchable, safe, and cheap enough to keep, without turning production debugging into archaeology.
TL;DR
- Log events, not sentences. Use JSON with stable fields.
- Put a correlation ID on every request, job, message, and downstream call.
- Keep one small set of required fields: ts, level, service, env, event, trace_id, and the thing being acted on.
- Log high-cardinality details only when they help debugging. Do not turn every log line into a metrics label soup.
- Sample noisy success and expected-client-failure logs. Do not sample rare failures blindly.
- Redact sensitive data centrally. Do not rely on every developer remembering which fields are dangerous.
- Logs are not a place for full payloads, access tokens, passwords, cookies, payment data, or "temporary" dumps.
1) Logs are an interface, not a text file
Plain text logs feel convenient until you need to ask production questions:
- What happened to this request?
- Which tenant saw the failures?
- Did the retry happen before or after the downstream timeout?
- Is this one user broken, or is the whole route degraded?
- Can support safely look at this without seeing private data?
If logs are free-form strings, every answer becomes a fragile search query.
Instead, treat each log line as an event with a stable schema.
{
"ts": "2026-04-24T10:20:30.123Z",
"level": "INFO",
"service": "billing-api",
"env": "prod",
"event": "invoice.created",
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
"request_id": "req_9b65df",
"tenant_id": "t_acme",
"user_id": "u_123",
"invoice_id": "inv_456",
"duration_ms": 84
}
The exact fields will vary by system, but the principle should not: machines should be able to filter, group, and join logs without parsing English.
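To make "filter, group, and join" concrete, here is a minimal sketch that groups newline-delimited JSON log lines by a field. The helper names `parseLogLines` and `countByField` and the sample lines are illustrative assumptions, not part of any particular log tool.

```typescript
type ParsedEvent = Record<string, unknown>;

// Parse newline-delimited JSON, keeping only lines that decode to an object.
function parseLogLines(raw: string): ParsedEvent[] {
  const events: ParsedEvent[] = [];
  for (const line of raw.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
        events.push(parsed as ParsedEvent);
      }
    } catch {
      // Skip lines that are not valid JSON (for example, stray stack traces).
    }
  }
  return events;
}

// Hypothetical helper: count events by any top-level field.
function countByField(events: ParsedEvent[], field: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const key = String(e[field] ?? "(missing)");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const sampleLines = [
  '{"event":"invoice.created","tenant_id":"t_acme","level":"INFO"}',
  '{"event":"invoice.failed","tenant_id":"t_acme","level":"ERROR"}',
  '{"event":"invoice.failed","tenant_id":"t_globex","level":"ERROR"}',
  "not json at all",
].join("\n");

// "Which tenant saw the failures?" becomes a field filter plus a group-by.
const failedInvoices = parseLogLines(sampleLines).filter((e) => e["event"] === "invoice.failed");
const failuresByTenant = countByField(failedInvoices, "tenant_id");
console.log(Array.from(failuresByTenant.entries()));
```

No regular expression over English text is involved: the question maps directly onto field names.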
2) Use one baseline event shape
A boring schema is a good schema.
| Field | Required? | Notes |
|---|---:|---|
| ts | yes | ISO 8601 timestamp in UTC |
| level | yes | DEBUG, INFO, WARN, ERROR |
| service | yes | Stable service name |
| env | yes | prod, staging, dev |
| event | yes | Machine-readable name like payment.failed |
| trace_id | yes | End-to-end correlation across services |
| request_id | recommended | One inbound HTTP request or job attempt |
| span_id | useful | If you emit traces |
| tenant_id | useful | Essential in multi-tenant systems |
| user_id | useful | Prefer internal IDs, not email addresses |
| duration_ms | useful | For completed operations |
| error | on failure | Structured error details |
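
A shared schema is easier to keep when something checks it. The sketch below validates the required fields from the table; the function name `checkBaselineShape` and the exact rules (non-empty strings, a trailing "Z" on the timestamp) are assumptions you would adapt to your own schema.

```typescript
// Required fields taken from the table above.
const REQUIRED_FIELDS = ["ts", "level", "service", "env", "event", "trace_id"] as const;
const KNOWN_LEVELS = ["DEBUG", "INFO", "WARN", "ERROR"];

// Returns a list of problems; an empty list means the event has the baseline shape.
function checkBaselineShape(event: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    const value = event[field];
    if (typeof value !== "string" || value === "") {
      problems.push(`missing or non-string field: ${field}`);
    }
  }
  const level = event["level"];
  if (typeof level === "string" && !KNOWN_LEVELS.includes(level)) {
    problems.push(`unknown level: ${level}`);
  }
  // Assumed rule: UTC timestamps end in "Z" and must be parseable by Date.
  const ts = event["ts"];
  if (typeof ts === "string" && !(ts.endsWith("Z") && !Number.isNaN(Date.parse(ts)))) {
    problems.push(`ts is not a UTC ISO 8601 timestamp: ${ts}`);
  }
  return problems;
}

const goodEvent = {
  ts: "2026-04-24T10:20:30.123Z",
  level: "INFO",
  service: "billing-api",
  env: "prod",
  event: "invoice.created",
  trace_id: "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
};
console.log(checkBaselineShape(goodEvent)); // no problems reported
console.log(checkBaselineShape({ level: "FATAL", event: "error" })); // four missing fields plus an unknown level
```

A check like this fits naturally in a unit test or a CI lint step rather than on the hot path.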
Good event names are specific enough to search and stable enough to alert on:
auth.login.failed
checkout.payment.authorized
checkout.payment.declined
worker.email.delivered
worker.email.retry_scheduled
Avoid event names that are just severity in disguise:
error
failed
something_bad_happened
Those names force every query to depend on other fields.
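One way to keep names in this style is a small lint rule. The pattern below is an assumed convention (two or more lowercase, dot-separated segments), not a standard; adjust it to whatever your team agrees on.

```typescript
// Assumed convention: at least two lowercase segments separated by dots,
// for example "checkout.payment.declined". Single words like "error" fail.
const EVENT_NAME_PATTERN = /^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*)+$/;

function isGoodEventName(name: string): boolean {
  return EVENT_NAME_PATTERN.test(name);
}

for (const name of ["auth.login.failed", "worker.email.retry_scheduled", "error", "something_bad_happened"]) {
  console.log(name, isGoodEventName(name));
}
```

Requiring a dotted namespace rejects the bare-severity names above, because they carry no subject to search on.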
3) Correlation IDs: the thread through the system
A correlation ID is the handle you use to follow one unit of work across boundaries.
For HTTP, accept an incoming ID if it is valid. Otherwise create one. Then propagate it everywhere:
- response headers
- application logs
- outbound HTTP calls
- queue messages
- background jobs
- error reports
- traces
Prefer the W3C traceparent header if you are already using tracing. If you are not, an X-Request-Id header is still far better than nothing.
GET /checkout HTTP/1.1
Traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
X-Request-Id: req_9b65df
The practical rule:
Every log created while handling the same request, job, or message should contain the same trace_id.
If a customer gives support a request ID, an engineer should be able to paste it into the log search and reconstruct the story in minutes.
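"Accept an incoming ID if it is valid" needs a definition of valid. The sketch below takes the trace-id field out of a traceparent header, falls back to a well-formed X-Request-Id, and otherwise generates a new ID. The function name `resolveTraceId`, the X-Request-Id pattern, and the injected `generate` callback are assumptions for illustration; the traceparent check covers only the basic field layout and does not implement the whole W3C specification.

```typescript
type IncomingHeaders = Record<string, string | undefined>;

// traceparent layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex).
const TRACEPARENT_PATTERN = /^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/;
// Assumed policy for X-Request-Id: short tokens made of URL-safe characters.
const REQUEST_ID_PATTERN = /^[A-Za-z0-9_.-]{1,64}$/;

function resolveTraceId(headers: IncomingHeaders, generate: () => string): string {
  const match = TRACEPARENT_PATTERN.exec(headers["traceparent"] ?? "");
  // An all-zero trace-id or parent-id is treated as invalid.
  if (match && !/^0+$/.test(match[1]) && !/^0+$/.test(match[2])) return match[1];
  const requestId = headers["x-request-id"] ?? "";
  if (REQUEST_ID_PATTERN.test(requestId)) return requestId;
  return generate();
}

const newId = () => "generated-id"; // stand-in for a real ID generator

console.log(
  resolveTraceId({ traceparent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" }, newId)
); // the 32-hex trace-id field only
console.log(resolveTraceId({ "x-request-id": "req_9b65df" }, newId)); // the inbound request ID
console.log(resolveTraceId({ "x-request-id": "bad id\n{injected}" }, newId)); // rejected, so a new ID is generated
```

Validating before accepting matters because an inbound header is attacker-controlled text that will end up in every log line for that request.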
4) JSON logs: structured, boring, and queryable
This is useful:
{
"ts": "2026-04-24T10:21:03.300Z",
"level": "ERROR",
"service": "checkout-api",
"env": "prod",
"event": "payment.authorize.failed",
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
"tenant_id": "t_acme",
"user_id": "u_123",
"payment_provider": "stripe",
"status": 502,
"duration_ms": 1200,
"error": {
"type": "upstream_timeout",
"code": "PAYMENT_PROVIDER_TIMEOUT",
"message": "Payment provider timed out"
}
}
This is much less useful:
Payment failed for acme user 123 because stripe timed out after 1200ms
The text line is readable, but it is hard to aggregate safely. Is "acme" a tenant? Is "123" a user, request, or invoice? Is "stripe" always lowercase? Does "timed out" match "timeout" in another service?
Use structured fields. Keep the human-friendly message if your logger supports it, but make it secondary.
{
"level": "WARN",
"event": "job.retry_scheduled",
"message": "Email delivery retry scheduled",
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
"job_id": "job_789",
"attempt": 2,
"next_attempt_in_ms": 30000
}
5) Levels should mean something
If everything is ERROR, nothing is.
| Level | Use it for |
|---|---|
| DEBUG | Local or short-lived diagnostic detail |
| INFO | Important state changes and completed business events |
| WARN | Unexpected but handled conditions |
| ERROR | Failed operation needing investigation or action |
Examples:
- A user entering a wrong password is usually INFO or WARN, not ERROR.
- A downstream dependency timing out after retries is ERROR.
- A validation error from a client is not automatically an ERROR.
- A job that failed once but will retry can be WARN; final exhaustion is ERROR.
Good levels make alerts quieter and logs easier to scan.
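Levels stay consistent when they are derived from the outcome rather than chosen ad hoc at each call site. This is one possible mapping, assuming client errors are WARN and server errors are ERROR; the function names are hypothetical and the thresholds are a judgment call.

```typescript
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

// Client errors (4xx) are expected noise; server errors (5xx) need attention.
function levelForHttpStatus(status: number): Level {
  if (status >= 500) return "ERROR";
  if (status >= 400) return "WARN";
  return "INFO";
}

// A failed attempt that will be retried is WARN; the final failed attempt is ERROR.
function levelForJobFailure(attempt: number, maxAttempts: number): Level {
  return attempt >= maxAttempts ? "ERROR" : "WARN";
}

console.log(levelForHttpStatus(422), levelForHttpStatus(503), levelForJobFailure(2, 5), levelForJobFailure(5, 5));
```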
6) Sampling: reduce noise without lying to yourself
Production traffic can produce a crushing amount of logs. Sampling keeps costs under control, but careless sampling hides the events you need most.
Good defaults:
- Keep all ERROR logs unless volume is truly extreme.
- Keep all security-sensitive events: auth failures, privilege changes, key creation, data exports.
- Sample high-volume INFO events like health checks, successful reads, and expected 404s.
- Sample by stable key when you need a complete story for a subset, such as trace_id or user_id.
- Record the sampling rate in the log event.
Example:
{
"level": "INFO",
"event": "request.completed",
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
"method": "GET",
"route": "/products/:id",
"status": 200,
"duration_ms": 31,
"sample_rate": 0.05
}
That sample_rate matters. At a rate of 0.05, each kept line stands for roughly 20 real requests; without the recorded rate, later analysis undercounts reality by that factor of 20.
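The correction is simple arithmetic: weight each kept event by the inverse of its sample rate. A minimal sketch, assuming events without a sample_rate were not sampled (the function name `estimateTotal` is illustrative):

```typescript
type SampledEvent = { event: string; sample_rate?: number };

// Each kept event stands in for 1 / sample_rate original events.
function estimateTotal(events: SampledEvent[]): number {
  let total = 0;
  for (const e of events) {
    const rate = e.sample_rate ?? 1; // assumption: a missing rate means "kept 100%"
    if (rate > 0 && rate <= 1) total += 1 / rate;
  }
  return total;
}

const kept: SampledEvent[] = [
  { event: "request.completed", sample_rate: 0.05 },
  { event: "request.completed", sample_rate: 0.05 },
  { event: "request.completed", sample_rate: 0.05 },
  { event: "request.failed" }, // errors are kept at 100%
];
// Three sampled lines at 5% plus one unsampled line: about 3 * 20 + 1 events.
console.log(Math.round(estimateTotal(kept)));
```

This is an estimate with real variance at low volumes, which is one reason counts and rates belong in metrics.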
A simple sampling policy
| Event | Policy |
|---|---|
| 5xx responses | keep 100% |
| failed jobs after final retry | keep 100% |
| auth/security/audit events | keep 100% |
| 4xx validation errors | sample 1-10%, depending on volume |
| health checks | drop or sample very heavily |
| successful high-volume reads | sample by route and traffic level |
Sampling is a cost control. It is not a substitute for metrics. Use metrics for counts and rates; use sampled logs for examples and debugging detail.
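A policy like this can be sketched as a rate lookup plus a deterministic decision keyed on trace_id. Everything specific here is an assumption: the rates, the event-name prefixes, the `healthcheck.completed` name, and the choice of FNV-1a, which is a small non-cryptographic hash used only to bucket IDs.

```typescript
// FNV-1a 32-bit over UTF-16 code units; adequate for bucketing ASCII IDs.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hypothetical policy mirroring the table above; the rates are placeholders.
function sampleRateFor(e: { level: string; event: string; status?: number }): number {
  if (e.level === "ERROR") return 1;
  if (e.status !== undefined && e.status >= 500) return 1;
  if (e.event.startsWith("auth.") || e.event.startsWith("audit.")) return 1;
  if (e.event === "healthcheck.completed") return 0.001;
  if (e.status !== undefined && e.status >= 400) return 0.1;
  return 0.05;
}

// The same trace_id always maps to the same bucket in [0, 1), so the decision is stable.
function shouldKeep(traceId: string, rate: number): boolean {
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  return fnv1a32(traceId) / 0x100000000 < rate;
}

const okRead = { level: "INFO", event: "request.completed", status: 200 };
const rate = sampleRateFor(okRead);
if (shouldKeep("01HW2Y7BR8R9K2VQ4C4N3XW9M6", rate)) {
  console.log(JSON.stringify({ ...okRead, sample_rate: rate }));
}
```

Because the bucket depends only on the ID, a trace kept at a low rate is also kept at every higher rate, so the sampled subset still tells a complete story for the traces it contains.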
7) PII redaction has to be automatic
The safest logging policy is not "please remember not to log secrets."
It is:
- sensitive field names are redacted by the logger
- sensitive headers are blocked by default
- large request and response bodies are not logged by default
- allowlists are preferred over denylists for payload logging
- tests cover the redaction rules
Redact or avoid:
| Data | Safer alternative |
|---|---|
| Passwords | never log |
| Access tokens | log token prefix or hash only if needed |
| Cookies | never log raw |
| Authorization headers | never log raw |
| Email addresses | internal user_id |
| Phone numbers | internal user_id or masked value |
| Postal addresses | internal address ID |
| Payment card data | never log |
| Full request bodies | selected allowlisted fields |
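
Two of the rules above, blocked headers and allowlisted payload fields, fit in a few lines. This is a sketch under assumptions: the header list and the helper names `safeHeaders` and `pickAllowed` are illustrative, and a real list would be longer.

```typescript
// Assumed default blocklist; extend it for your own auth schemes.
const BLOCKED_HEADERS = new Set(["authorization", "cookie", "set-cookie", "x-api-key"]);

// Header names are case-insensitive, so compare in lowercase.
function safeHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) {
    out[name] = BLOCKED_HEADERS.has(name.toLowerCase()) ? "[REDACTED]" : value;
  }
  return out;
}

// Allowlist: copy only the fields explicitly named; everything else is dropped.
function pickAllowed(body: Record<string, unknown>, allowed: readonly string[]): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of allowed) {
    if (Object.prototype.hasOwnProperty.call(body, key)) out[key] = body[key];
  }
  return out;
}

console.log(safeHeaders({ Authorization: "Bearer abc123", Accept: "application/json" }));
console.log(pickAllowed({ plan: "pro", password: "hunter2", email: "jane@example.com" }, ["plan", "seats"]));
```

The allowlist fails safe: a new sensitive field added to the payload next month is dropped without anyone updating a denylist.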
This is the kind of log line you want:
{
"event": "profile.update.rejected",
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6",
"user_id": "u_123",
"fields": ["email", "phone"],
"reason": "validation_failed"
}
Not this:
{
"event": "profile.update.rejected",
"email": "user@example.com",
"phone": "+15551234567",
"body": {
"password": "correct-horse-battery-staple"
}
}
When there is a real need to debug payload shape, log a schema version, field names, sizes, hashes, or a short-lived diagnostic event with explicit approval and expiry.
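Logging the shape of a payload rather than its values might look like the sketch below. The name `describeShape` and the choice of JSON-encoded length as the size measure are assumptions; it reports only top-level field names, types, and sizes.

```typescript
type ShapeSummary = { field: string; type: string; size: number };

// Describes a payload without copying any of its values.
function describeShape(payload: Record<string, unknown>): ShapeSummary[] {
  return Object.keys(payload)
    .sort()
    .map((field) => {
      const value = payload[field];
      const type = value === null ? "null" : Array.isArray(value) ? "array" : typeof value;
      // Size is the length of the JSON encoding, a rough proxy for bytes.
      const encoded: string | undefined = JSON.stringify(value);
      const size = encoded === undefined ? 0 : encoded.length;
      return { field, type, size };
    });
}

const shape = describeShape({
  email: "jane@example.com",
  password: "hunter2",
  tags: ["a", "b"],
});
console.log(JSON.stringify(shape));
```

The output answers "was the field present, and was it suspiciously large or empty?" without putting the password or email into the log stream.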
8) Minimal implementation pattern
In most web services, logging should be wired in at the boundary:
- Read or create a correlation ID.
- Attach request context to async-local storage or request scope.
- Emit one request completion log.
- Let application code add event-specific fields without manually passing IDs everywhere.
- Redact at the logger sink.
Tiny TypeScript sketch:
import crypto from "node:crypto";
import { AsyncLocalStorage } from "node:async_hooks";
type LogContext = {
trace_id: string;
request_id: string;
service: string;
env: string;
};
const context = new AsyncLocalStorage<LogContext>();
const REDACTED = "[REDACTED]";
const SENSITIVE_KEYS = new Set([
"authorization",
"cookie",
"password",
"token",
"access_token",
"refresh_token",
"secret",
"api_key"
]);
function redact(value: unknown): unknown {
if (!value || typeof value !== "object") return value;
if (Array.isArray(value)) return value.map(redact);
return Object.fromEntries(
Object.entries(value as Record<string, unknown>).map(([key, child]) => {
if (SENSITIVE_KEYS.has(key.toLowerCase())) return [key, REDACTED];
return [key, redact(child)];
})
);
}
function log(level: "INFO" | "WARN" | "ERROR", event: string, fields = {}) {
const base = context.getStore();
const entry = redact({
ts: new Date().toISOString(),
level,
event,
...base,
...fields
});
process.stdout.write(JSON.stringify(entry) + "\n");
}
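function requestMiddleware(req, res, next) {
const started = performance.now();
// A traceparent value is "version-traceid-parentid-flags"; only the 32-hex trace-id field is the correlation ID.
const traceparent = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/.exec(String(req.headers["traceparent"] ?? ""));
// Accept an inbound X-Request-Id only if it looks like a plain ID, so callers cannot inject arbitrary text into logs.
const inbound = String(req.headers["x-request-id"] ?? "");
const trace_id = traceparent?.[1] ?? (/^[\w.-]{1,64}$/.test(inbound) ? inbound : crypto.randomUUID());
const request_id = crypto.randomUUID();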
context.run(
{
trace_id,
request_id,
service: "checkout-api",
env: process.env.NODE_ENV || "dev"
},
() => {
res.setHeader("X-Request-Id", request_id);
res.on("finish", () => {
log("INFO", "request.completed", {
method: req.method,
route: req.route?.path || req.path,
status: res.statusCode,
duration_ms: Math.round(performance.now() - started)
});
});
next();
}
);
}
This is not a full logging library. It shows the shape: context is automatic, output is JSON, and redaction happens inside the logger, just before the line is written.
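Redaction rules deserve a test that fails loudly when a secret slips through. One approach is a canary: plant a known string under sensitive keys, run the redactor, and assert the string never appears in the serialized output. The names `findLeaks` and `standInRedact` are hypothetical; the stand-in uses the same key-based idea as the sketch above so that this example runs on its own.

```typescript
type Redactor = (value: unknown) => unknown;

const CANARY = "canary-secret-do-not-log";

// Returns the names of fixtures whose redacted output still contains the canary.
function findLeaks(redactFn: Redactor): string[] {
  const fixtures: Record<string, unknown> = {
    "top-level password": { password: CANARY },
    "mixed-case header": { headers: { Authorization: `Bearer ${CANARY}` } },
    "inside an array": { items: [{ api_key: CANARY }] },
  };
  const leaks: string[] = [];
  for (const [name, fixture] of Object.entries(fixtures)) {
    if (JSON.stringify(redactFn(fixture)).includes(CANARY)) leaks.push(name);
  }
  return leaks;
}

// Stand-in redactor for the example; in a real test, pass your logger's own redact function.
const STAND_IN_KEYS = new Set(["authorization", "password", "api_key"]);
const standInRedact: Redactor = (value) => {
  if (!value || typeof value !== "object") return value;
  if (Array.isArray(value)) return value.map((item) => standInRedact(item));
  return Object.fromEntries(
    Object.entries(value as Record<string, unknown>).map(([k, v]) =>
      STAND_IN_KEYS.has(k.toLowerCase()) ? [k, "[REDACTED]"] : [k, standInRedact(v)]
    )
  );
};

console.log(findLeaks(standInRedact)); // no fixtures leak
console.log(findLeaks((v) => v)); // an identity "redactor" leaks every fixture
```

Key-name matching only catches secrets stored under known keys, so fixtures are worth extending whenever a new field or header carries credentials.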
9) What to log for common operations
HTTP request completed
{
"event": "request.completed",
"method": "POST",
"route": "/orders",
"status": 201,
"duration_ms": 92,
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6"
}
Downstream call failed
{
"event": "downstream.request.failed",
"dependency": "inventory-api",
"operation": "reserve_stock",
"status": 503,
"duration_ms": 800,
"retryable": true,
"attempt": 1,
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6"
}
Job exhausted retries
{
"level": "ERROR",
"event": "job.retries_exhausted",
"job": "send_receipt_email",
"job_id": "job_789",
"attempts": 5,
"last_error_code": "SMTP_TIMEOUT",
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6"
}
Security-sensitive action
{
"level": "INFO",
"event": "user.mfa_disabled",
"actor_user_id": "u_admin",
"target_user_id": "u_123",
"tenant_id": "t_acme",
"ip_hash": "sha256:8ad3...",
"trace_id": "01HW2Y7BR8R9K2VQ4C4N3XW9M6"
}
Notice what is missing: raw tokens, cookies, passwords, full request bodies, and email addresses.
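The retry events above can come from one wrapper instead of scattered log calls. This is a simplified synchronous sketch with hypothetical names (`runWithRetries`, `emit`); a real worker would await the operation and wait between attempts.

```typescript
type RetryLog = { level: "WARN" | "ERROR"; event: string; attempt: number; error_code: string };

// Runs `operation` up to `maxAttempts` times and emits one structured event per failure:
// WARN while a retry remains, ERROR once retries are exhausted.
function runWithRetries<T>(
  operation: (attempt: number) => T,
  maxAttempts: number,
  emit: (entry: RetryLog) => void
): T | undefined {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return operation(attempt);
    } catch (err) {
      // Assumption: the thrown Error's message carries a stable error code.
      const error_code = err instanceof Error ? err.message : "UNKNOWN";
      const exhausted = attempt === maxAttempts;
      emit({
        level: exhausted ? "ERROR" : "WARN",
        event: exhausted ? "job.retries_exhausted" : "job.retry_scheduled",
        attempt,
        error_code,
      });
    }
  }
  return undefined;
}

const emitted: RetryLog[] = [];
runWithRetries(
  () => {
    throw new Error("SMTP_TIMEOUT");
  },
  3,
  (entry) => emitted.push(entry)
);
console.log(emitted.map((e) => `${e.level} ${e.event}`));
```

Only the final failure is an ERROR, which keeps alerting tied to outcomes that need action.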
10) Anti-patterns
| Anti-pattern | Why it hurts | Better |
|---|---|---|
| Logging strings only | Hard to query and aggregate | JSON fields |
| New field names per service | Queries break across systems | Shared schema |
| Missing correlation IDs | Cannot reconstruct a request | Generate and propagate IDs |
| Logging full payloads | Leaks sensitive data and explodes cost | Allowlist fields |
| Sampling all logs equally | Rare failures disappear | Keep critical events |
| Treating logs as metrics | Expensive and incomplete | Emit real metrics |
| High-cardinality everything | Costly and hard to query | Use stable dimensions |
| Redaction in application code only | Easy to bypass | Redact in logger/sink |
11) Rollout checklist
- [ ] Pick required fields and document them.
- [ ] Add request/job/message correlation IDs at the boundary.
- [ ] Emit JSON logs in every production service.
- [ ] Standardize event names for important flows.
- [ ] Add a redaction layer for sensitive keys and headers.
- [ ] Stop logging full request and response bodies by default.
- [ ] Define sampling rules by event type and severity.
- [ ] Include sample_rate on sampled logs.
- [ ] Keep security and audit events unsampled.
- [ ] Add tests that prove secrets are redacted.
- [ ] Verify support can find all logs for one trace_id.
The practical standard
Structured logging does not need to be fancy. The durable version is almost boring:
- one JSON object per event
- one correlation ID per unit of work
- one shared field vocabulary
- one central redaction path
- one explicit sampling policy
That is enough to turn logs from a last-ditch text search into a production debugging tool you can trust.