Profiling in Production: CPU & Heap for Node, Python, and Go — and How to Read Flamegraphs
Production-safe tactics, copy‑paste commands, and a flamegraph field guide
Goal: Find the biggest wins with minimal risk. Prefer sampling profilers and short capture windows. Keep profiles internal, add auth, and avoid long-running synchronous profilers in hot paths.
TL;DR
- Use sampling in prod (e.g., Go pprof, py-spy, Node inspector/0x/Clinic Flame). Keep captures 15–60s.
- Tighten scope: profile one service, one pod, reproduce, capture again.
- CPU flamegraph tells you where compute time goes; heap/alloc flamegraph tells you where memory is held/allocated.
- For continuous profiling, consider eBPF-based systems (Grafana Pyroscope/Parca).
- Lock down access: internal-only, auth, and rotate artifacts out of the cluster after analysis.
Quick Reference (Cheat Sheet)
Go (pprof)
// main.go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	go func() {
		log.Println("pprof on :6060")
		log.Println(http.ListenAndServe("0.0.0.0:6060", nil))
	}()
	// ... start app
}
Capture CPU (30s) and heap snapshots:
# CPU profile (30s sampling window)
curl -s "http://SERVICE:6060/debug/pprof/profile?seconds=30" -o cpu.pb.gz
# Heap profile (in-use objects/bytes)
curl -s http://SERVICE:6060/debug/pprof/heap -o heap.pb.gz
# View interactively
go tool pprof -http=:0 cpu.pb.gz
go tool pprof -http=:0 heap.pb.gz
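If you only need a quick text summary (for a ticket or a CI log), pprof can also print a top listing non-interactively:
# Text top listing instead of the web UI
go tool pprof -top -nodecount=15 cpu.pb.gz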
Python (py-spy, tracemalloc)
# Top-like view of hottest Python functions (attach to PID, no restarts)
py-spy top -p <PID>
# 30s CPU flamegraph (SVG)
py-spy record -p <PID> -o cpu.svg --duration 30 --rate 100
# Speedscope JSON (for speedscope.app)
py-spy record -p <PID> -o cpu.speedscope.json --format speedscope --duration 30
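For a process that looks stuck rather than hot, a one-shot stack dump of every thread is often enough:
# Print the current stack of each thread once (deadlocks, hung I/O)
py-spy dump -p <PID>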
Enable/compare heap snapshots (in code):
# tracemalloc_example.py
import tracemalloc, time

tracemalloc.start()
# snapshot A
snap1 = tracemalloc.take_snapshot()
# ... run endpoint / reproduce load ...
time.sleep(10)
# snapshot B & compare
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, 'lineno')[:20]:
    print(stat)
Deep dives: pympler, objgraph, memray (alloc flamegraphs).
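A sketch of the memray route (assumes memray is installed in the app's environment; app.py is a placeholder):
# Record allocations, then render an HTML allocation flamegraph
memray run -o allocs.bin app.py
memray flamegraph allocs.bin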
Node.js (Inspector, 0x, Clinic Flame, heap snapshots)
One-off CPU capture via inspector (attach):
# If process started with --inspect or --inspect=0.0.0.0:9229
# Use Chrome DevTools > Performance to record 20–30s CPU
# or programmatically (see below).
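If the process was not started with --inspect, Node can usually be told to open the inspector after the fact (Linux/macOS):
# Activate the inspector on a running Node process; it listens on 127.0.0.1:9229 by default
kill -USR1 <PID>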
Programmatic CPU/Heap capture (safe trigger):
// profiler.js
import fs from 'node:fs';
import inspector from 'node:inspector';

const session = new inspector.Session();
session.connect();

// Promisify session.post so every inspector-protocol command can be awaited.
const post = (method, params = {}) =>
  new Promise((resolve, reject) =>
    session.post(method, params, (err, result) => (err ? reject(err) : resolve(result))));

export async function captureCpu(seconds = 20, out = 'cpu.cpuprofile') {
  await post('Profiler.enable');
  await post('Profiler.start');
  await new Promise(r => setTimeout(r, seconds * 1000));
  const { profile } = await post('Profiler.stop');
  fs.writeFileSync(out, JSON.stringify(profile));
}

export async function captureHeap(out = 'heap.heapsnapshot') {
  await post('HeapProfiler.enable');
  const chunks = [];
  session.on('HeapProfiler.addHeapSnapshotChunk', m => chunks.push(m.params.chunk));
  await post('HeapProfiler.takeHeapSnapshot', { reportProgress: false });
  fs.writeFileSync(out, chunks.join(''));
}
CLI tools (lower friction):
# Clinic Flame (collect & open flamegraph)
npx clinic flame -- node server.js
# 0x must launch the process itself (it cannot attach to an already running PID)
npx 0x -- node server.js
CPU vs Heap: What You’re Measuring
| Profile | What it shows | Typical uses | Common pitfalls |
|---|---|---|---|
| CPU (sampling) | Where CPU time is spent (hot paths) | Slow endpoints, high CPU, timeouts | Sampling bias, missing async edges |
| Allocations | Where memory is allocated | GC pressure, short‑lived churn | Focus on bytes vs objects wisely |
| Heap (in-use) | What retains memory now | Leaks, bloaty caches, large objects | Snapshots capture now, not lifetime |
| Wall-time | Time including I/O wait | Thread pool saturation, blocking I/O | Interpret vs CPU carefully |
For short spikes and intermittent stalls, do multiple short CPU captures around the incident.
Production Safety Checklist
- [ ] Restrict profiling endpoints to internal networks; add auth.
- [ ] Use short capture windows (15–60s) at predictable low overhead.
- [ ] Prefer sampling profilers (pprof, py-spy, 0x/Clinic).
- [ ] Version & label artifacts (service, commit, pod, ts).
- [ ] Offload artifacts to secure storage; set retention.
- [ ] Document a kill switch (env flag / feature toggle).
- [ ] In k8s, use kubectl port-forward instead of exposing pprof publicly.
Go in Production (pprof)
1) Enable
// import side-effect registers handlers at /debug/pprof/*
import _ "net/http/pprof"
Add an HTTP server (separate port preferred).
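A minimal sketch of that separate server, gated behind a hypothetical ENABLE_PPROF flag and bound to localhost so the handlers never reach the public listener:
// pprof_server.go (sketch; ENABLE_PPROF is a made-up flag name)
package main

import (
	"log"
	"net/http"
	"net/http/pprof" // NOTE: this import's init() also registers on http.DefaultServeMux, so keep the public listener on its own mux
	"os"
)

func startPprof() {
	if os.Getenv("ENABLE_PPROF") != "1" {
		return // kill switch: profiling endpoints stay off unless explicitly enabled
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.Handle("/debug/pprof/heap", pprof.Handler("heap"))
	go func() { log.Println(http.ListenAndServe("127.0.0.1:6060", mux)) }()
}

func main() {
	startPprof()
	select {} // stand-in for the real app
}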
2) Capture
# CPU (30s)
curl -s "http://localhost:6060/debug/pprof/profile?seconds=30" -o cpu.pb.gz
# Heap
curl -s http://localhost:6060/debug/pprof/heap -o heap.pb.gz
# Goroutines / Mutex / Block
curl -s http://localhost:6060/debug/pprof/goroutine -o gr.pb.gz
curl -s http://localhost:6060/debug/pprof/mutex -o mutex.pb.gz
curl -s http://localhost:6060/debug/pprof/block -o block.pb.gz
3) Analyze
go tool pprof -http=:0 cpu.pb.gz
# or CLI
go tool pprof cpu.pb.gz <<'PP'
top
list mypkg.Function
web
quit
PP
Reading: Wide frames near the top indicate heavy leaf hotspots; follow callers downwards to locate the entry path. Check mutex and block profiles if CPU isn’t the bottleneck.
Heap: Open heap.pb.gz and toggle inuse_space vs alloc_space; the former shows memory retained right now (where leaks hide), the latter cumulative allocations (churn and GC pressure).
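The same toggle is available as a flag if you prefer to open the profile with one view preselected:
go tool pprof -sample_index=inuse_space -http=:0 heap.pb.gz   # memory retained right now
go tool pprof -sample_index=alloc_space -http=:0 heap.pb.gz   # everything allocated since start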
Tip: If CPU time is low but latency is high, try block, mutex, and goroutine profiles—CPU flamegraphs won’t show you waiting.
Python in Production
Option A: py-spy (attach; no restart)
# Find PID, then:
py-spy top -p <PID>
# CPU flamegraph (30s)
py-spy record -p <PID> -o cpu.svg --duration 30 --rate 100
# High-level profile for a single process container:
kubectl exec -it deploy/api -- sh -lc 'py-spy record -p "$(pidof python)" -o /tmp/cpu.svg --duration 30'
kubectl cp default/api-pod:/tmp/cpu.svg ./cpu.svg
Pros: zero code changes, minimal overhead, safe for prod.
Limit: pure-Python view by default (native/C-extension frames only appear with --native, and only if symbols are available).
Option B: cProfile / yappi (in-app, toggleable)
# toggle_profiler.py (use a signal or admin endpoint)
import cProfile, pstats, io

PROF = None

def start():
    global PROF
    PROF = cProfile.Profile()
    PROF.enable()

def stop(dump="cpu.pstats"):
    PROF.disable()
    s = io.StringIO()
    pstats.Stats(PROF, stream=s).sort_stats("tottime").print_stats(50)
    with open("cpu.txt", "w") as f:
        f.write(s.getvalue())
    PROF.dump_stats(dump)
Then render a call graph with gprof2dot (Graphviz) or convert the profile to a format speedscope understands.
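One common conversion path, assuming gprof2dot and Graphviz are installed:
# Render the pstats dump as a call-graph SVG
gprof2dot -f pstats cpu.pstats | dot -Tsvg -o cpu-graph.svg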
Memory (tracemalloc)
import tracemalloc

tracemalloc.start()

# later, on an admin endpoint:
def heap_report():
    snap = tracemalloc.take_snapshot()
    top = snap.statistics("lineno")[:20]
    return "\n".join(str(s) for s in top)
Use two snapshots around a load test and compare_to to find growth deltas. For retention trees, objgraph.show_backrefs() can reveal who is holding references (requires graphviz).
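A sketch of the objgraph angle (MyCacheEntry is a hypothetical class name; requires objgraph plus Graphviz):
import objgraph

objgraph.show_most_common_types(limit=10)               # quick census of live objects
suspects = objgraph.by_type('MyCacheEntry')[:3]         # grab a few instances of the suspect type
objgraph.show_backrefs(suspects, max_depth=4, filename='backrefs.png')  # who keeps them alive?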
Node.js in Production
CPU
Attach via the inspector (Chrome DevTools or programmatically): lowest friction if the process was started with --inspect, and on Linux/macOS the inspector can usually be enabled after the fact with kill -USR1 <PID> (see the Quick Reference). 0x and Clinic Flame cannot attach to a running PID; they launch the process themselves:
# Start the app under 0x and generate the flamegraph HTML
npx 0x -- node server.js
Controlled programmatic capture (admin endpoint triggers captureCpu(20) from earlier snippet). Keep the window short to reduce overhead.
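A sketch of such a trigger, assuming the profiler.js module from the Quick Reference and an Express app (route name and port are arbitrary):
// admin.js (sketch)
import express from 'express';
import { captureCpu } from './profiler.js';

const admin = express();
admin.post('/admin/cpu-profile', async (req, res) => {
  const file = `cpu-${Date.now()}.cpuprofile`;
  await captureCpu(20, file);            // 20s sampling window
  res.json({ file });
});
admin.listen(3001, '127.0.0.1');         // internal-only listener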
Heap
- v8.writeHeapSnapshot() (Node >= 11.13) writes a .heapsnapshot file readable in Chrome DevTools Memory.
import v8 from 'node:v8';
const file = v8.writeHeapSnapshot(); // returns filename
- Or via the inspector HeapProfiler.takeHeapSnapshot (see snippet above). Compare two snapshots to isolate growth.
Detecting leaks: Watch for monotonically growing retained size in specific classes/arrays, timers without clearTimeout, caches without eviction, accumulating event listeners, or long-lived closures capturing large objects.
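For illustration, the kind of pattern that shows up as a steadily growing Map in snapshot diffs (everything below is hypothetical; the fix is eviction/TTL, not more profiling):
// Unbounded per-URL cache: every entry is retained forever.
const cache = new Map();
const expensiveLookup = (key) => ({ key, payload: Buffer.alloc(64 * 1024) }); // stand-in for real work

export function handler(req) {
  if (!cache.has(req.url)) cache.set(req.url, expensiveLookup(req.url)); // never evicted
  return cache.get(req.url);
}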
Reading Flamegraphs (Field Guide)
Flamegraphs aggregate stack samples:
- X-axis = total time (or memory) across samples; width equals weight. Order is arbitrary; width, not position, matters.
- Y-axis = stack depth (bottom = roots/callers, top = current function).
- Leaf frames near the top are where time is spent; wide blocks = hot spots.
- Colors are usually non-semantic (except some tools); don’t over-interpret.
CPU flamegraph workflow
- Find the widest block near the top. That's your hottest leaf (e.g., json.Marshal, regex, bcrypt).
- Walk down to see who called it; the bottommost wide paths show request entry points.
- If CPU time is low but requests are slow, switch profiles (mutex/block/gc/wall-time).
- Confirm with a second capture after a change.
Heap/alloc flamegraph workflow
- Toggle in-use vs allocated (Go) or select retained size (Node DevTools).
- Identify growth by comparing two snapshots (Python tracemalloc.compare_to, Node DevTools "Compare").
- Pay attention to maps/objects/byte slices and buffers/strings; large retained sets often hide in caches or queues.
- Follow backrefs (who keeps it alive): collectors, singletons, global maps, event bus.
Interpreting common shapes
- Tall, narrow towers: deep recursion or generic frameworks; not necessarily hot.
- Short, very wide plateaus: hot tight loops or heavy leaf functions.
- “Comb” patterns: many similar narrow leaves—often fan-out across handlers or middleware chains.
- GC/allocator bars: frequent allocations; look for churn and object pooling opportunities.
Pitfalls
- Sampling bias: default rates (e.g., 100 Hz) can miss ultra-short functions. Raise rate cautiously.
- Inlined functions: may fold into callers; use symbolized builds / -gcflags=all=-l for Go during experiments (not prod).
- Async stacks: Node async edges may need async stack collection; Python awaits split stacks; Go goroutines multiplex stacks; interpret within tool limits.
eBPF & Continuous Profiling (Advanced)
For always-on, low-overhead visibility:
- Pyroscope/Grafana: agents for Go/Node/Python and eBPF mode to collect without app changes.
- Parca: open-source eBPF continuous profiler.
Pros: fleet-wide comparisons, regressions over time, per-commit deltas.
Cons: infra complexity, symbolization, and storage costs—start small.
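To get a feel for it before committing to infra, both projects publish container images; for example (image name and port taken from Pyroscope's docs and may change, so check the current docs):
# Run a local Pyroscope server; UI on http://localhost:4040
docker run --rm -p 4040:4040 grafana/pyroscope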
Triage Recipes
- CPU 80–100%: 30s CPU capture → identify top leaf(s) → ship a micro-fix (cache, batch, pool) → re-capture.
- Latent but low CPU: collect block/mutex (Go) or wall-time (Node DevTools), inspect thread pools and I/O waits.
- OOM / rising RSS: take 2 heap snapshots minutes apart → diff → find retaining path → fix cache/refs → verify with new snapshots.
- GC pressure: allocation flamegraph (Go alloc_space, Python memray, Node Allocation Timeline) → reduce churn / pool objects.
Kubernetes Handy Commands
# Port-forward a single pod's pprof (Go)
kubectl -n prod port-forward pod/api-xyz 6060:6060
# Exec py-spy inside a pod (Python)
kubectl -n prod exec -it deploy/pyapi -- sh -lc 'apk add py3-pip && pip install py-spy && py-spy top -p "$(pidof python)"'
# Copy artifacts
kubectl cp prod/api-xyz:/cpu.pb.gz ./cpu.pb.gz
What Good Looks Like (Outcomes)
- A linked incident doc with: profile artifact, screen grab of hotspot, fix diff, before/after metrics.
- SLO-aligned runbook: when to capture, how long, where to store, how to read.
- One graph per change that proves we burned the hottest coal.
Appendix: Programmatic Toggles
Go: on-demand CPU profile
import (
	"os"
	"runtime/pprof"
	"time"
)

func CaptureCPU(path string, d time.Duration) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	time.Sleep(d)
	pprof.StopCPUProfile()
	return nil
}
Python: on-demand cProfile
from contextlib import contextmanager
import cProfile, pstats, io

@contextmanager
def cpu_profile(out_txt="cpu.txt"):
    pr = cProfile.Profile()
    pr.enable()
    try:
        yield
    finally:
        pr.disable()
        s = io.StringIO()
        pstats.Stats(pr, stream=s).sort_stats("tottime").print_stats(50)
        with open(out_txt, "w") as f:
            f.write(s.getvalue())
Node: admin route to trigger heap snapshot
import express from 'express';
import v8 from 'node:v8';
const app = express();
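// NOTE: v8.writeHeapSnapshot() is synchronous; it pauses the event loop while the snapshot is written.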
app.post('/admin/heap-snapshot', (req, res) => {
  const file = v8.writeHeapSnapshot();
  res.json({ file });
});