OpenTelemetry for Mortals

Which signal answers which question, where baggage helps, where it leaks, and how one bad metric label can quietly torch your observability budget.

Traces, metrics, logs, baggage, and the cardinality traps that quietly hurt later

Goal: help you choose the right OpenTelemetry signal for the right job, avoid the attribute mistakes that make dashboards useless, and stop treating OTel like a magical backend-shaped blob.


TL;DR

  • OpenTelemetry is not a backend. It standardizes how telemetry is generated, propagated, collected, and exported.
  • Traces answer: what happened in this one request?
  • Metrics answer: how often, how much, how bad, and is it getting worse?
  • Logs answer: what exactly was recorded at this moment?
  • Baggage answers: what tiny bit of request context needs to travel downstream?
  • Before you obsess over signals, set your resource attributes correctly, especially service.name.
  • Most teams hurt themselves with metrics cardinality, not missing spans. Raw user IDs, UUIDs, full URLs, and other unbounded values are where telemetry costs and query quality go to die.
  • Baggage is propagated data, not magic metadata. It is separate from signal attributes and should stay small, intentional, and non-sensitive.
  • If you are getting started, direct export can be fine. At scale, use the Collector so apps can offload retries, batching, filtering, and export concerns.

1) First: what OpenTelemetry is and isn’t

OpenTelemetry is an observability framework and toolkit for generating, exporting, collecting, and processing telemetry such as traces, metrics, and logs.

It is not:

  • your storage layer
  • your query UI
  • your alerting product
  • your tracing backend

That distinction matters because teams often install OTel and then ask, “Where do I look at the data?”

The answer is:

your app -> OpenTelemetry SDK / auto-instrumentation -> optional Collector -> backend

If you are small, you can often send telemetry directly to a backend and get value quickly. The OpenTelemetry docs explicitly say that is fine for development or small-scale use. In general, though, they recommend using a Collector alongside your service so the app can offload retries, batching, encryption, and filtering.


2) The shortest useful signal map

| Signal | Best question | Unit of analysis | Great for | Bad at |
|---|---|---|---|---|
| Trace | Why was this request slow or broken? | One request / flow | Latency debugging, service dependencies, outliers | Cheap long-term counting |
| Metric | Is the system healthy over time? | Aggregated time series | SLOs, alerting, trends, capacity | Explaining one weird request |
| Log | What happened at this moment? | One recorded event | Exact details, audits, errors, edge cases | Aggregate health on its own |
| Baggage | What tiny context should travel downstream? | Propagated key-value context | Passing request context across boundaries | Storing business state or secrets |

If you remember one thing, remember this:

  • traces are for paths
  • metrics are for trends
  • logs are for details
  • baggage is for propagation

3) Set resources first or the rest gets messy

OpenTelemetry resources describe the entity producing telemetry. The official docs call out things like process name, pod name, namespace, container, and especially service.name.

That matters because resource data lands on everything produced by that provider.

Practical rule:

  • if an attribute describes the thing emitting telemetry, it probably belongs on the resource
  • if it describes one request / one event / one measurement, it probably belongs on the signal itself

Minimum sane setup

Set these early:

  • service.name
  • deployment environment
  • service version
  • cloud / host / container / k8s identity if relevant

OpenTelemetry’s docs explicitly recommend setting service.name yourself because SDKs otherwise default it to unknown_service.

Example

export OTEL_SERVICE_NAME=checkout
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment.name=prod,service.version=1.8.4
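# 4317 is the conventional OTLP/gRPC port; use 4318 if your SDK exports over http/protobuf
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317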

If you skip resource hygiene, you will end up trying to reconstruct identity from ad hoc span, log, and metric attributes later. That usually gets ugly fast.
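
To make the environment variables above less abstract, here is a minimal sketch of how they could combine into one resource-attribute set. It uses only the Python standard library, not the OpenTelemetry SDK, and the function name resource_from_env is made up for illustration. It assumes OTEL_SERVICE_NAME wins over a service.name entry in OTEL_RESOURCE_ATTRIBUTES and that values may be percent-encoded; check your SDK for its exact behavior.

```python
import os
from urllib.parse import unquote


def resource_from_env(env=None):
    """Build a resource-attribute dict from OTEL_* environment variables."""
    env = os.environ if env is None else env
    attrs = {}
    # OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs.
    for pair in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = unquote(value.strip())
    # Assumption: OTEL_SERVICE_NAME takes precedence over service.name above.
    if env.get("OTEL_SERVICE_NAME"):
        attrs["service.name"] = env["OTEL_SERVICE_NAME"]
    # Fallback mirrors the default the docs warn about.
    attrs.setdefault("service.name", "unknown_service")
    return attrs


example = resource_from_env({
    "OTEL_SERVICE_NAME": "checkout",
    "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment.name=prod,service.version=1.8.4",
})
print(example)
```

Calling resource_from_env({}) returns only service.name=unknown_service, which is exactly the identity you do not want showing up in a backend.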


4) Traces: use them to explain a path, not count the universe

OpenTelemetry describes traces as the path of a request through your application.

That is the right framing.

Traces are where you answer questions like:

  • Why is checkout slow for some users?
  • Which downstream dependency added 800ms?
  • Did this queue publish happen before or after the DB write?
  • Which span failed first?

What to put on spans

Good span data:

  • operation name
  • route or RPC method
  • downstream system name
  • status and error type
  • a few stable attributes that explain the operation

Bad span data:

  • giant payloads
  • secrets
  • every request header
  • opaque blobs that belong in logs instead

One useful distinction

Span attributes describe the operation.

Span events are better for meaningful points in time inside that operation.

If you are choosing between “add a second span” and “add a span event,” the test is simple:

  • duration with a meaningful start and end -> likely a span
  • important moment during a span -> likely an event
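
A toy model can make the span-versus-event test concrete. The dataclasses below are hypothetical and are not the OpenTelemetry API; they only show that a span owns a start and an end, while an event is a single timestamp recorded inside a span.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """A named moment in time; it has no duration of its own."""
    name: str
    timestamp_ms: int


@dataclass
class Span:
    """An operation with a meaningful start and end."""
    name: str
    start_ms: int
    end_ms: int
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


# The request and the DB write each have a duration, so both are spans.
checkout = Span("POST /checkout", 0, 950, attributes={"http.route": "/checkout"})
db_write = Span("orders.insert", 100, 400)

# A cache miss is a moment inside the request, so it is an event on that span.
checkout.events.append(Event("cache.miss", 40))
```

If you find yourself wanting a duration for something you modeled as an event, that is usually the signal to promote it to a child span.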

Traces are not your uptime chart

Tracing is amazing for causality and terrible as your only aggregate signal. If your whole observability plan is “we have traces,” you are one cardinality or sampling decision away from not being able to answer the boring-but-critical question:

“Is the system getting worse?”

That is a metrics question.


5) Metrics: aggregate on purpose or regret it later

The OpenTelemetry docs define a metric as a measurement captured at runtime. The important part is not the measurement itself. It is the aggregation.

Metrics are for:

  • request rate
  • error rate
  • latency distributions
  • queue depth
  • active sessions
  • CPU and memory usage
  • business counters that truly matter over time

Use the right instrument shape

  • Counter for things that only go up
  • UpDownCounter for values that can rise and fall
  • Gauge for current-state values
  • Histogram for latency, size, and other distributions

If you care about percentiles or latency shape, start with histograms, not hand-rolled averages.
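
Here is a small standard-library sketch of why that advice matters. The bucket boundaries are invented for the example and are not SDK defaults, and the percentile function only returns the upper bound of the bucket a quantile falls into, which is a rough estimate rather than what any particular backend computes.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in milliseconds (not SDK defaults).
BOUNDS_MS = [50, 100, 250, 500, 1000, 2500]


def record(counts, value_ms):
    """Increment the first bucket whose upper bound is >= the value."""
    counts[bisect_left(BOUNDS_MS, value_ms)] += 1


def percentile_upper_bound(counts, q):
    """Return the upper bound of the bucket that contains quantile q."""
    target = q * sum(counts)
    seen = 0
    for i, count in enumerate(counts):
        seen += count
        if seen >= target:
            return BOUNDS_MS[i] if i < len(BOUNDS_MS) else float("inf")
    return float("inf")


latencies = [40] * 95 + [2000] * 5   # 95 fast requests, 5 very slow ones
counts = [0] * (len(BOUNDS_MS) + 1)  # last slot catches values above 2500
for value in latencies:
    record(counts, value)

mean_ms = sum(latencies) / len(latencies)   # 138.0
p50 = percentile_upper_bound(counts, 0.50)  # 50
p99 = percentile_upper_bound(counts, 0.99)  # 2500
```

The mean of 138 ms describes no real request: most finished in under 50 ms and a few took around two seconds. The histogram keeps that shape, the average throws it away.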

Metrics are where cardinality hurts most

OpenTelemetry’s metrics SDK spec defines cardinality as the number of unique combinations of attributes and says SDKs should support a cardinality limit. If nothing is configured, the default is 2000.

That does not mean 2000 is safe.

It means:

  • the limit counts attribute combinations per metric, so several instruments sitting near it can still add up to an expensive number of series
  • your backend may struggle long before you hit the SDK default
  • once you overflow, the SDK may aggregate excess measurements into an overflow bucket instead of giving you clean per-attribute breakdowns
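
The overflow behavior is easier to reason about with a toy model. This is not the SDK implementation; LimitedCounter is a hypothetical class, and it assumes the overflow series is labeled otel.metric.overflow=true and counts toward the limit, which is how the spec describes it as far as I can tell.

```python
# Assumed attribute set for the overflow series.
OVERFLOW_KEY = (("otel.metric.overflow", True),)


class LimitedCounter:
    """Toy counter that caps the number of distinct attribute sets."""

    def __init__(self, limit):
        self.limit = limit
        self.series = {}

    def add(self, value, attributes):
        key = tuple(sorted(attributes.items()))
        # Keep one slot free for the overflow series; new attribute sets
        # beyond that point are folded into it.
        if key not in self.series and len(self.series) >= self.limit - 1:
            key = OVERFLOW_KEY
        self.series[key] = self.series.get(key, 0) + value


counter = LimitedCounter(limit=3)
for user_id in ["u1", "u2", "u3", "u4", "u5"]:
    counter.add(1, {"user.id": user_id})

# Two real series (u1, u2) plus one overflow series holding u3, u4, u5.
print(counter.series)
```

Nothing is lost from the total, but the breakdown for u3 through u5 is gone. That is the degraded state you land in when an unbounded attribute meets a cardinality limit.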

The dangerous labels list

Avoid these on metrics unless you have a very strong reason:

  • user.id
  • email
  • session.id
  • request_id
  • trace_id
  • full_url
  • full_path
  • arbitrary search queries
  • raw error messages
  • order IDs, cart IDs, invoice IDs

Safer replacements

| Dangerous metric attribute | Safer replacement |
|---|---|
| full path like /users/123/orders/456 | route template like /users/{userId}/orders/{orderId} |
| user ID | user tier, cohort, plan, or omit entirely |
| raw error message | stable error code or error class |
| trace ID | exemplar support or logs/traces correlation instead |
| order ID | count failures by operation type, not entity ID |

Views are your friend

The metrics docs call out Views as the place where you customize aggregation and which attributes are reported. The metrics SDK spec also says cardinality limits should be enforced after attribute filtering, which is exactly why views matter: you can drop bad attributes before they poison the stream.

If your metrics backend feels noisy, expensive, or weirdly slow, views and attribute filtering should be one of the first places you look.
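
The effect of filtering attributes before aggregation can be sketched without any SDK. The allowlist and helper names below are hypothetical; a real setup would express the same idea through a View's attribute configuration rather than hand-written code.

```python
from collections import Counter

# Hypothetical allowlist: the only attributes this metric may keep.
ALLOWED_KEYS = {"http.route", "http.response.status_code"}


def filter_attributes(attributes, allowed):
    """Drop every attribute that is not explicitly allowed."""
    return {k: v for k, v in attributes.items() if k in allowed}


def aggregate(measurements, allowed=None):
    """Sum values per attribute set, optionally filtering attributes first."""
    totals = Counter()
    for value, attributes in measurements:
        if allowed is not None:
            attributes = filter_attributes(attributes, allowed)
        totals[tuple(sorted(attributes.items()))] += value
    return totals


measurements = [
    (1, {
        "http.route": "/users/{userId}",
        "http.response.status_code": 200,
        "user.id": f"u{i}",
    })
    for i in range(500)
]

raw = aggregate(measurements)                      # 500 series, one per user
filtered = aggregate(measurements, ALLOWED_KEYS)   # 1 series with value 500
```

Same measurements, same total, 500 times fewer series. Because filtering happens before the series are counted, the dropped user.id never gets a chance to eat into a cardinality limit.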


6) Logs: keep them structured and correlated

OpenTelemetry logs are still just logs. OTel does not make bad logs good.

What it can do well is:

  • model logs consistently
  • correlate logs with active traces and spans
  • route logs through the same pipeline as other telemetry

The official docs note that when you enable auto-instrumentation or activate an SDK, OpenTelemetry can automatically correlate logs with the active trace and span by including their IDs.

That is useful because it lets you move from:

  • a failed request in a trace

to:

  • the precise log event that explains it

without inventing yet another homegrown correlation scheme.
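
To show what that correlation amounts to, here is a standard-library sketch that stamps trace and span IDs onto log records. The current_span context variable is a hypothetical stand-in for the active span an SDK would track for you; with real instrumentation you would not write this filter yourself.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span context an SDK would manage.
current_span = contextvars.ContextVar("current_span", default=None)


class TraceContextFilter(logging.Filter):
    """Attach the active trace and span IDs to every log record."""

    def filter(self, record):
        span = current_span.get()
        record.trace_id = span["trace_id"] if span else "-"
        record.span_id = span["span_id"] if span else "-"
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
))
handler.addFilter(TraceContextFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Pretend a request is in flight with this trace context.
current_span.set({
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "00f067aa0ba902b7",
})
logger.error("payment declined")
output = stream.getvalue()
```

Once the IDs are fields on the record, jumping from a failed span to its log lines is a lookup on trace_id rather than a guess based on timestamps.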

Good OTel log habits

  • structured fields
  • stable severity levels
  • trace/span correlation
  • resource attributes for service identity
  • attributes for event details

Bad OTel log habits

  • dumping giant JSON payloads into every log body
  • putting secrets in log attributes
  • using logs as substitute metrics
  • indexing every free-form field like it is cheap

Logs are where details live. They are not where you should do all your counting.

If you need help deciding what errors should look like and what should be logged, see API Error Handling That Helps Clients Recover.


7) Baggage: tiny propagated context, not a distributed junk drawer

OpenTelemetry baggage is a key-value store that travels alongside context across service boundaries.

That sounds abstract, so here is the plain-English version:

If some small bit of request context is known early and needs to be available downstream, baggage is one way to carry it.

Examples:

  • account or tenant identifier
  • customer plan tier
  • traffic cohort
  • support-session marker

Baggage is not the same as attributes

This matters enough to say clearly:

  • baggage is propagated context
  • attributes are data attached to a span, metric, log, or resource

The official baggage docs explicitly note that baggage is a separate store and is not automatically attached to spans, metrics, or logs unless you explicitly read it and add it.

Baggage rules that save pain

  • keep it small
  • keep it stable
  • keep it non-sensitive
  • keep the key set intentional

The official docs also warn that baggage can be propagated to downstream services and can show up in network headers. Do not put credentials, API keys, or PII there.
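
Since baggage ends up in a header, it helps to see what actually goes on the wire. This is a simplified sketch of the key=value list format, not a complete implementation of the W3C Baggage spec. The allowlist, the 512-byte budget, and the function names are all assumptions made for the example.

```python
from urllib.parse import quote, unquote

ALLOWED_BAGGAGE_KEYS = {"tenant.tier", "traffic.cohort"}  # hypothetical allowlist
MAX_HEADER_BYTES = 512  # self-imposed budget, well under what the spec permits


def encode_baggage(entries):
    """Serialize allowed entries as a comma-separated key=value header."""
    members = [
        f"{key}={quote(str(value), safe='')}"
        for key, value in sorted(entries.items())
        if key in ALLOWED_BAGGAGE_KEYS
    ]
    header = ",".join(members)
    if len(header.encode("utf-8")) > MAX_HEADER_BYTES:
        raise ValueError("baggage exceeds the configured size budget")
    return header


def decode_baggage(header):
    """Parse a baggage header, ignoring optional ';property' metadata."""
    entries = {}
    for member in header.split(","):
        key, sep, value = member.split(";", 1)[0].partition("=")
        if sep and key.strip():
            entries[key.strip()] = unquote(value.strip())
    return entries


header = encode_baggage({
    "tenant.tier": "enterprise",
    "traffic.cohort": "canary 5%",
    "user.email": "a@example.com",  # not on the allowlist, so it never leaves
})
print(header)  # tenant.tier=enterprise,traffic.cohort=canary%205%25
```

The point of the allowlist is that the email never reaches the header at all. Anything that does reach it is readable by every downstream hop and every proxy that logs headers.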

When not to use baggage

Do not use baggage for:

  • auth decisions you would not trust from the caller
  • large payload fragments
  • secrets
  • mutable workflow state
  • “it might be useful someday” data

If the information is critical and trusted, it usually belongs in your actual request model, token model, or service data, not just in propagated baggage.


8) The cardinality traps almost everyone hits once

Here is the failure pattern:

  1. a team adds “helpful” attributes everywhere
  2. the backend cost climbs
  3. dashboards get slower
  4. alerting gets noisier
  5. nobody trusts the breakdowns anymore

The classic traps

Raw URLs on metrics

/product/123, /product/124, /product/125 is not one route. It is an unbounded series explosion.

Use route templates instead.
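
If your framework does not hand you the matched route, a fallback normalizer might look like the sketch below. The patterns and the {id} placeholder are assumptions; the router's own template is the better source whenever it is available, because a regex cannot know which segments are really parameters.

```python
import re

# Hypothetical patterns for segments that look like identifiers.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def route_template(path):
    """Replace ID-like path segments with a placeholder and drop the query."""
    segments = path.split("?", 1)[0].split("/")
    normalized = [
        "{id}" if _NUMERIC.match(segment) or _UUID.match(segment) else segment
        for segment in segments
    ]
    return "/".join(normalized)


paths = ["/product/123", "/product/124", "/product/125"]
templates = {route_template(p) for p in paths}
print(templates)  # {'/product/{id}'}
```

Three raw paths collapse into one bounded label value, and the next million product IDs collapse into the same one.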

User IDs on counters and histograms

This feels analytically powerful and is usually a terrible default for metrics.

If you need per-user debugging, traces or logs are the better home.

Error message text as an attribute

A stable error code is good.

A raw exception message is cardinality roulette.
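
One cheap way to get a stable value is to label by exception class and leave the message for logs. The helper name error_type is made up for this sketch; the idea is simply that class names form a small bounded set while messages usually embed user input.

```python
def error_type(exc):
    """Return a bounded label: the exception class name, never its message."""
    return type(exc).__qualname__


errors = []
for raw in ["abc", "12x", "9.5.1"]:
    try:
        int(raw)
    except ValueError as exc:
        errors.append(exc)

messages = {str(e) for e in errors}       # three distinct strings
labels = {error_type(e) for e in errors}  # {'ValueError'}
```

Each message contains the offending input, so every new bad value would mint a new series. The class name stays at one value no matter how creative the input gets.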

Request IDs and trace IDs on metrics

That turns every metric point into basically one-off data.

If you want point-in-time correlation, use:

  • logs with trace correlation
  • exemplars where supported
  • tracing links

not a metric label that creates a brand-new series per request.

Baggage copied blindly into metric attributes

This is a very human mistake:

  • baggage carries client.id
  • middleware blindly adds baggage to all signal attributes
  • metrics cardinality explodes

Just because baggage is available everywhere does not mean it belongs everywhere.
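
A per-signal allowlist is one way to keep that middleware honest. The policy table and function below are hypothetical, and the specific keys are only examples; the shape is what matters: copying from baggage is opt-in, and metrics get the strictest list.

```python
# Hypothetical policy: which baggage keys may be copied onto each signal.
COPY_POLICY = {
    "span": {"client.id", "tenant.tier"},
    "log": {"client.id", "tenant.tier"},
    "metric": {"tenant.tier"},  # bounded values only
}


def attributes_from_baggage(baggage, signal):
    """Return only the baggage entries allowed on the given signal type."""
    allowed = COPY_POLICY.get(signal, set())
    return {key: value for key, value in baggage.items() if key in allowed}


baggage = {"client.id": "c-98231", "tenant.tier": "enterprise", "debug.flag": "1"}

span_attrs = attributes_from_baggage(baggage, "span")
metric_attrs = attributes_from_baggage(baggage, "metric")
print(span_attrs)    # {'client.id': 'c-98231', 'tenant.tier': 'enterprise'}
print(metric_attrs)  # {'tenant.tier': 'enterprise'}
```

An unknown signal name gets an empty set, so a new code path copies nothing until someone decides what it should copy.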

The practical filter

Before adding any attribute to a metric, ask:

  1. Does this value come from a small bounded set?
  2. Will I still be happy if I have one time series per distinct value?
  3. Would route template, tier, status class, or region answer the same question more cheaply?

If any answer feels shaky, do not put it on the metric.
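
Question 2 is easy to turn into arithmetic. The sketch below multiplies the number of distinct values per attribute to get a worst-case series count for one metric. All the counts and the budget are invented numbers, and real traffic rarely hits every combination, so treat the result as an upper bound.

```python
from math import prod


def worst_case_series(distinct_values):
    """Upper bound on series: product of distinct values per attribute."""
    return prod(distinct_values.values())


# Hypothetical distinct-value counts for one request metric.
bounded = {"http.route": 40, "http.request.method": 5, "status_class": 5, "region": 4}
with_user = {**bounded, "user.id": 100_000}

budget = 10_000  # hypothetical per-metric series budget

print(worst_case_series(bounded))    # 4000
print(worst_case_series(with_user))  # 400000000
```

Four bounded attributes fit comfortably inside the budget. Adding one unbounded attribute multiplies the ceiling by the size of your user base, which is why a single label can dominate the bill.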


9) A sane starter architecture

If you are a small team, this is enough:

app
  -> OTLP exporter
  -> OpenTelemetry Collector
  -> backend

Start with:

  • traces for inbound HTTP and key downstream calls
  • request/error/latency metrics
  • structured logs with trace correlation
  • service.name and environment resources

Do not start with:

  • every possible instrumentation package
  • twenty custom business metrics
  • baggage everywhere
  • tail sampling rules you do not understand
  • “let’s tag every metric by customer”

What small teams should actually ship first

  1. set service.name
  2. get one service sending traces
  3. add latency/error/request metrics
  4. make logs trace-correlated
  5. only then add custom attributes and business telemetry carefully

That order gets you usable telemetry fast without creating an expensive taxonomy problem before you even have working dashboards.


10) FAQ

Do I need all four signals on day one?

No. Most teams get value fastest from traces plus a few core metrics, then add log correlation, then use baggage only where propagation solves a real problem.

Is the Collector mandatory?

No. OpenTelemetry’s own docs say direct export is fine for development and small-scale use. But in general they recommend a Collector because it handles retries, batching, filtering, and export concerns more cleanly.

Should I put customer IDs in baggage?

Only if you truly need that context downstream and it is safe to propagate. Even then, be careful about where it gets copied afterward, especially into metrics.

Why are my dashboards suddenly expensive or noisy?

Usually because you added high-cardinality attributes to metrics or indexed too many free-form log fields.

Are traces enough without metrics?

No. Traces are for paths and outliers. Metrics are for aggregate health and alerting. You usually need both.


11) Final recommendation

If you want the shortest durable OpenTelemetry rule set:

  • resources identify the emitter
  • traces explain a path
  • metrics summarize behavior
  • logs keep the details
  • baggage carries a little context
  • cardinality is the bill waiting to happen

That is the version of OpenTelemetry most teams actually need.

For adjacent topics, see Profiling in Production and API Error Handling That Helps Clients Recover.