Profiling in Production: CPU & Heap for Node, Python, and Go — and How to Read Flamegraphs
Production-safe tactics, copy‑paste commands, and a flamegraph field guide
Goal: Find the biggest wins with minimal risk. Prefer sampling profilers and short capture windows. Keep profiles internal, add auth, and avoid long-running synchronous profilers in hot paths.
TL;DR
- Use sampling in prod (e.g., Go pprof, py-spy, Node inspector/0x/Clinic Flame). Keep captures 15–60s.
- Tighten scope: profile one service, one pod, reproduce, capture again.
- CPU flamegraph tells you where compute time goes; heap/alloc flamegraph tells you where memory is held/allocated.
- For continuous profiling, consider eBPF-based systems (Grafana Pyroscope/Parca).
- Lock down access: internal-only, auth, and rotate artifacts out of the cluster after analysis.
Quick Reference (Cheat Sheet)
Go (pprof)
// main.go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	go func() {
		log.Println("pprof on :6060")
		log.Println(http.ListenAndServe("0.0.0.0:6060", nil))
	}()
	// ... start app
}
Capture CPU (30s) and heap snapshots:
# CPU profile (30s sampling window)
curl -s "http://SERVICE:6060/debug/pprof/profile?seconds=30" -o cpu.pb.gz
# Heap profile (in-use objects/bytes)
curl -s http://SERVICE:6060/debug/pprof/heap -o heap.pb.gz
# View interactively
go tool pprof -http=:0 cpu.pb.gz
go tool pprof -http=:0 heap.pb.gz
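If you only need a quick text summary (for a ticket or a CI log), pprof can also print a top listing non-interactively:
# Text top listing instead of the web UI
go tool pprof -top -nodecount=15 cpu.pb.gz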
Python (py-spy, tracemalloc)
# Top-like view of hottest Python functions (attach to PID, no restarts)
py-spy top -p <PID>
# 30s CPU flamegraph (SVG)
py-spy record -p <PID> -o cpu.svg --duration 30 --rate 100
# Speedscope JSON (for speedscope.app)
py-spy record -p <PID> -o cpu.speedscope.json --format speedscope --duration 30
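For a process that looks stuck rather than hot, a one-shot stack dump of every thread is often enough:
# Print the current stack of each thread once (deadlocks, hung I/O)
py-spy dump -p <PID>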
Enable/compare heap snapshots (in code):
# tracemalloc_example.py
import tracemalloc, time

tracemalloc.start()
# snapshot A
snap1 = tracemalloc.take_snapshot()
# ... run endpoint / reproduce load ...
time.sleep(10)
# snapshot B & compare
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, 'lineno')[:20]:
    print(stat)
Deep dives: pympler, objgraph, memray (alloc flamegraphs).
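A sketch of the memray route (assumes memray is installed in the app's environment; app.py is a placeholder):
# Record allocations, then render an HTML allocation flamegraph
memray run -o allocs.bin app.py
memray flamegraph allocs.bin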
Node.js (Inspector, 0x, Clinic Flame, heap snapshots)
One-off CPU capture via inspector (attach):
# If process started with --inspect or --inspect=0.0.0.0:9229
# Use Chrome DevTools > Performance to record 20–30s CPU
# or programmatically (see below).
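If the process was not started with --inspect, Node can usually be told to open the inspector after the fact (Linux/macOS):
# Activate the inspector on a running Node process; it listens on 127.0.0.1:9229 by default
kill -USR1 <PID>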
Programmatic CPU/Heap capture (safe trigger):
// profiler.js
import fs from 'node:fs';
import inspector from 'node:inspector';

const session = new inspector.Session();
session.connect();

// Promisify session.post so every inspector-protocol command can be awaited.
const post = (method, params = {}) =>
  new Promise((resolve, reject) =>
    session.post(method, params, (err, result) => (err ? reject(err) : resolve(result))));

export async function captureCpu(seconds = 20, out = 'cpu.cpuprofile') {
  await post('Profiler.enable');
  await post('Profiler.start');
  await new Promise(r => setTimeout(r, seconds * 1000));
  const { profile } = await post('Profiler.stop');
  fs.writeFileSync(out, JSON.stringify(profile));
}

export async function captureHeap(out = 'heap.heapsnapshot') {
  await post('HeapProfiler.enable');
  const chunks = [];
  session.on('HeapProfiler.addHeapSnapshotChunk', m => chunks.push(m.params.chunk));
  await post('HeapProfiler.takeHeapSnapshot', { reportProgress: false });
  fs.writeFileSync(out, chunks.join(''));
}
CLI tools (lower friction):
# Clinic Flame (collect & open flamegraph)
npx clinic flame -- node server.js
# 0x must launch the process itself (it cannot attach to an already running PID)
npx 0x -- node server.js
CPU vs Heap: What You’re Measuring
| Profile | What it shows | Typical uses | Common pitfalls |
|---|---|---|---|
| CPU (sampling) | Where CPU time is spent (hot paths) | Slow endpoints, high CPU, timeouts | Sampling bias, missing async edges |
| Allocations | Where memory is allocated | GC pressure, short‑lived churn | Focus on bytes vs objects wisely |
| Heap (in-use) | What retains memory now | Leaks, bloaty caches, large objects | Snapshots capture now, not lifetime |
| Wall-time | Time including I/O wait | Thread pool saturation, blocking I/O | Interpret vs CPU carefully |
For short spikes and intermittent stalls, do multiple short CPU captures around the incident.
Production Safety Checklist
- [ ] Restrict profiling endpoints to internal networks; add auth.
- [ ] Use short capture windows (15–60s) at predictable low overhead.
- [ ] Prefer sampling profilers (pprof, py-spy, 0x/Clinic).
- [ ] Version & label artifacts (service, commit, pod, ts).
- [ ] Offload artifacts to secure storage; set retention.
- [ ] Document a kill switch (env flag / feature toggle).
- [ ] In k8s, use kubectl port-forward instead of exposing pprof publicly.
Go in Production (pprof)
1) Enable
// import side-effect registers handlers at /debug/pprof/*
import _ "net/http/pprof"
Add an HTTP server (separate port preferred).
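A minimal sketch of that separate server, gated behind a hypothetical ENABLE_PPROF flag and bound to localhost so the handlers never reach the public listener:
// pprof_server.go (sketch; ENABLE_PPROF is a made-up flag name)
package main

import (
	"log"
	"net/http"
	"net/http/pprof" // NOTE: this import's init() also registers on http.DefaultServeMux, so keep the public listener on its own mux
	"os"
)

func startPprof() {
	if os.Getenv("ENABLE_PPROF") != "1" {
		return // kill switch: profiling endpoints stay off unless explicitly enabled
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.Handle("/debug/pprof/heap", pprof.Handler("heap"))
	go func() { log.Println(http.ListenAndServe("127.0.0.1:6060", mux)) }()
}

func main() {
	startPprof()
	select {} // stand-in for the real app
}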
2) Capture
# CPU (30s)
curl -s "http://localhost:6060/debug/pprof/profile?seconds=30" -o cpu.pb.gz
# Heap
curl -s http://localhost:6060/debug/pprof/heap -o heap.pb.gz
# Goroutines / Mutex / Block
curl -s http://localhost:6060/debug/pprof/goroutine -o gr.pb.gz
curl -s http://localhost:6060/debug/pprof/mutex -o mutex.pb.gz
curl -s http://localhost:6060/debug/pprof/block -o block.pb.gz
3) Analyze
go tool pprof -http=:0 cpu.pb.gz
# or CLI
go tool pprof cpu.pb.gz <<'PP'
top
list mypkg.Function
web
quit
PP
Reading: Wide frames near the top indicate heavy leaf hotspots; follow callers downwards to locate the entry path. Check mutex and block profiles if CPU isn’t the bottleneck.
Heap: Open heap.pb.gz and toggle inuse_space vs alloc_space; the former shows memory retained right now (where leaks hide), the latter cumulative allocations (churn and GC pressure).
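The same toggle is available as a flag if you prefer to open the profile with one view preselected:
go tool pprof -sample_index=inuse_space -http=:0 heap.pb.gz   # memory retained right now
go tool pprof -sample_index=alloc_space -http=:0 heap.pb.gz   # everything allocated since start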
Tip: If CPU time is low but latency is high, try block, mutex, and goroutine profiles—CPU flamegraphs won’t show you waiting.
Python in Production
Option A: py-spy (attach; no restart)
# Find PID, then:
py-spy top -p <PID>
# CPU flamegraph (30s)
py-spy record -p <PID> -o cpu.svg --duration 30 --rate 100
# High-level profile for a single process container:
kubectl exec -it deploy/api -- sh -lc 'py-spy record -p "$(pidof python)" -o /tmp/cpu.svg --duration 30'
kubectl cp default/api-pod:/tmp/cpu.svg ./cpu.svg
Pros: zero code changes, minimal overhead, safe for prod.
Limit: pure-Python view by default (native/C-extension frames only appear with --native, and only if symbols are available).
Option B: cProfile / yappi (in-app, toggleable)
# toggle_profiler.py (use a signal or admin endpoint)
import cProfile, pstats, io

PROF = None

def start():
    global PROF
    PROF = cProfile.Profile()
    PROF.enable()

def stop(dump="cpu.pstats"):
    PROF.disable()
    s = io.StringIO()
    pstats.Stats(PROF, stream=s).sort_stats("tottime").print_stats(50)
    with open("cpu.txt", "w") as f:
        f.write(s.getvalue())
    PROF.dump_stats(dump)
Then render a call graph with gprof2dot (Graphviz) or convert the profile to a format speedscope understands.
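One common conversion path, assuming gprof2dot and Graphviz are installed:
# Render the pstats dump as a call-graph SVG
gprof2dot -f pstats cpu.pstats | dot -Tsvg -o cpu-graph.svg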
Memory (tracemalloc)
import tracemalloc

tracemalloc.start()

# later, on an admin endpoint:
def heap_report():
    snap = tracemalloc.take_snapshot()
    top = snap.statistics("lineno")[:20]
    return "\n".join(str(s) for s in top)
Use two snapshots around a load test and compare_to to find growth deltas. For retention trees, objgraph.show_backrefs() can reveal who is holding references (requires graphviz).
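A sketch of the objgraph angle (MyCacheEntry is a hypothetical class name; requires objgraph plus Graphviz):
import objgraph

objgraph.show_most_common_types(limit=10)               # quick census of live objects
suspects = objgraph.by_type('MyCacheEntry')[:3]         # grab a few instances of the suspect type
objgraph.show_backrefs(suspects, max_depth=4, filename='backrefs.png')  # who keeps them alive?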
Node.js in Production
CPU
Attach via the inspector (Chrome DevTools or programmatically): lowest friction if the process was started with --inspect, and on Linux/macOS the inspector can usually be enabled after the fact with kill -USR1 <PID> (see the Quick Reference). 0x and Clinic Flame cannot attach to a running PID; they launch the process themselves:
# Start the app under 0x and generate the flamegraph HTML
npx 0x -- node server.js
Controlled programmatic capture (admin endpoint triggers captureCpu(20) from earlier snippet). Keep the window short to reduce overhead.
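A sketch of such a trigger, assuming the profiler.js module from the Quick Reference and an Express app (route name and port are arbitrary):
// admin.js (sketch)
import express from 'express';
import { captureCpu } from './profiler.js';

const admin = express();
admin.post('/admin/cpu-profile', async (req, res) => {
  const file = `cpu-${Date.now()}.cpuprofile`;
  await captureCpu(20, file);            // 20s sampling window
  res.json({ file });
});
admin.listen(3001, '127.0.0.1');         // internal-only listener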
Heap
- v8.writeHeapSnapshot() (Node >= 11.13) writes a .heapsnapshot file readable in Chrome DevTools Memory.
import v8 from 'node:v8';
const file = v8.writeHeapSnapshot(); // returns filename
- Or via the inspector HeapProfiler.takeHeapSnapshot (see snippet above). Compare two snapshots to isolate growth.
Detecting leaks: Watch for monotonically growing retained size in specific classes/arrays, timers without clearTimeout, caches without eviction, accumulating event listeners, or long-lived closures capturing large objects.
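For illustration, the kind of pattern that shows up as a steadily growing Map in snapshot diffs (everything below is hypothetical; the fix is eviction/TTL, not more profiling):
// Unbounded per-URL cache: every entry is retained forever.
const cache = new Map();
const expensiveLookup = (key) => ({ key, payload: Buffer.alloc(64 * 1024) }); // stand-in for real work

export function handler(req) {
  if (!cache.has(req.url)) cache.set(req.url, expensiveLookup(req.url)); // never evicted
  return cache.get(req.url);
}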
Reading Flamegraphs (Field Guide)
Flamegraphs aggregate stack samples:
- X-axis = total time (or memory) across samples; width equals weight. Order is arbitrary; width, not position, matters.
- Y-axis = stack depth (bottom = roots/callers, top = current function).
- Leaf frames near the top are where time is spent; wide blocks = hot spots.
- Colors are usually non-semantic (except some tools); don’t over-interpret.
CPU flamegraph workflow
- Find the widest block near the top. That's your hottest leaf (e.g., json.Marshal, regex, bcrypt).
- Walk down to see who called it; the bottommost wide paths show request entry points.
- If CPU time is low but requests are slow, switch profiles (mutex/block/gc/wall-time).
- Confirm with a second capture after a change.
Heap/alloc flamegraph workflow
- Toggle in-use vs allocated (Go) or select retained size (Node DevTools).
- Identify growth by comparing two snapshots (Python tracemalloc.compare_to, Node DevTools "Compare").
- Pay attention to maps/objects/byte slices and buffers/strings; large retained sets often hide in caches or queues.
- Follow backrefs (who keeps it alive): collectors, singletons, global maps, event bus.
Interpreting common shapes
- Tall, narrow towers: deep recursion or generic frameworks; not necessarily hot.
- Short, very wide plateaus: hot tight loops or heavy leaf functions.
- “Comb” patterns: many similar narrow leaves—often fan-out across handlers or middleware chains.
- GC/allocator bars: frequent allocations; look for churn and object pooling opportunities.
Pitfalls
- Sampling bias: default rates (e.g., 100 Hz) can miss ultra-short functions. Raise rate cautiously.
- Inlined functions: may fold into callers; use symbolized builds / -gcflags=all=-l for Go during experiments (not prod).
- Async stacks: Node async edges may need async stack collection; Python awaits split stacks; Go goroutines multiplex stacks; interpret within tool limits.
eBPF & Continuous Profiling (Advanced)
For always-on, low-overhead visibility:
- Pyroscope/Grafana: agents for Go/Node/Python and eBPF mode to collect without app changes.
- Parca: open-source eBPF continuous profiler.
Pros: fleet-wide comparisons, regressions over time, per-commit deltas.
Cons: infra complexity, symbolization, and storage costs—start small.
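To get a feel for it before committing to infra, both projects publish container images; for example (image name and port taken from Pyroscope's docs and may change, so check the current docs):
# Run a local Pyroscope server; UI on http://localhost:4040
docker run --rm -p 4040:4040 grafana/pyroscope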
Triage Recipes
- CPU 80–100%: 30s CPU capture → identify top leaf(s) → ship a micro-fix (cache, batch, pool) → re-capture.
- Latent but low CPU: collect block/mutex (Go) or wall-time (Node DevTools), inspect thread pools and I/O waits.
- OOM / rising RSS: take 2 heap snapshots minutes apart → diff → find retaining path → fix cache/refs → verify with new snapshots.
- GC pressure: allocation flamegraph (Go alloc_space, Python memray, Node Allocation Timeline) → reduce churn / pool objects.
Kubernetes Handy Commands
# Port-forward a single pod's pprof (Go)
kubectl -n prod port-forward pod/api-xyz 6060:6060
# Exec py-spy inside a pod (Python)
kubectl -n prod exec -it deploy/pyapi -- sh -lc 'apk add py3-pip && pip install py-spy && py-spy top -p "$(pidof python)"'
# Copy artifacts
kubectl cp prod/api-xyz:/cpu.pb.gz ./cpu.pb.gz
What Good Looks Like (Outcomes)
- A linked incident doc with: profile artifact, screen grab of hotspot, fix diff, before/after metrics.
- SLO-aligned runbook: when to capture, how long, where to store, how to read.
- One graph per change that proves we burned the hottest coal.
Appendix: Programmatic Toggles
Go: on-demand CPU profile
import (
	"os"
	"runtime/pprof"
	"time"
)

func CaptureCPU(path string, d time.Duration) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	time.Sleep(d)
	pprof.StopCPUProfile()
	return nil
}
Python: on-demand cProfile
from contextlib import contextmanager
import cProfile, pstats, io

@contextmanager
def cpu_profile(out_txt="cpu.txt"):
    pr = cProfile.Profile()
    pr.enable()
    try:
        yield
    finally:
        pr.disable()
        s = io.StringIO()
        pstats.Stats(pr, stream=s).sort_stats("tottime").print_stats(50)
        with open(out_txt, "w") as f:
            f.write(s.getvalue())
Node: admin route to trigger heap snapshot
import express from 'express';
import v8 from 'node:v8';
const app = express();
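// NOTE: v8.writeHeapSnapshot() is synchronous; it pauses the event loop while the snapshot is written.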
app.post('/admin/heap-snapshot', (req, res) => {
  const file = v8.writeHeapSnapshot();
  res.json({ file });
});