
Zero‑Downtime Migrations — expand/contract without breaking prod


Change schemas safely with backward‑compatible steps, dual writes, background backfills, and rolling deploys. Postgres‑flavored, but patterns apply to MySQL and friends.

TL;DR

  • Migrate in three phases: expand → migrate → contract. Ship code that works with old + new before you flip anything.
  • Avoid long locks: add columns NULL, build indexes online (e.g., CONCURRENTLY/ONLINE/INSTANT), and backfill in batches.
  • Use dual writes (or triggers) during the move, then validate and eventually enforce (NOT NULL/unique).
  • Roll out with feature flags and a rollback path; drop old schema only after a confidence window.

1) The pattern (rename column example)

Goal: users.full_name → users.display_name (no downtime).

  1. Expand – add the new shape, keep old working.
ALTER TABLE users ADD COLUMN display_name text NULL;
  2. Dual write – app writes both fields (or use a trigger).
-- App layer pseudo-code
UPDATE users SET full_name=$1, display_name=$1 WHERE id=$2;
  3. Backfill – copy historical data in small chunks.
-- Postgres example: 1k at a time. UPDATE has no ORDER BY/LIMIT,
-- so batch via a subquery on the PK.
UPDATE users
SET display_name = full_name
WHERE id IN (
  SELECT id FROM users
  WHERE display_name IS NULL
  ORDER BY id
  LIMIT 1000
);
  4. Flip reads – read display_name with fallback to full_name.
SELECT COALESCE(display_name, full_name) AS name FROM users WHERE id=$1;
  5. Enforce & contract – stop dual writes, set constraints, drop old.
ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;
ALTER TABLE users DROP COLUMN full_name;

For table renames/splits, keep a compat view or routing layer in front while you backfill and flip traffic.
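At the app layer, the dual-write and fallback-read steps above fit into two small helpers. A sketch, assuming a generic `db.query(sql, params) → rows` client (the interface and names are illustrative, not a specific library):

```javascript
// During the migration window, reads prefer display_name and fall back
// to full_name, so rows the backfill has not reached yet still resolve.
async function readName(db, id) {
  const rows = await db.query(
    "SELECT COALESCE(display_name, full_name) AS name FROM users WHERE id = $1",
    [id]
  );
  return rows[0]?.name ?? null;
}

// Writes hit both columns until the contract phase drops full_name.
async function writeName(db, id, name) {
  await db.query(
    "UPDATE users SET full_name = $1, display_name = $1 WHERE id = $2",
    [name, id]
  );
}
```

Keeping both paths in named helpers makes the contract step mechanical: delete the fallback and the second column, and the call sites don't change.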


2) Safe DDL by platform (quick notes)

  • Postgres: use CREATE INDEX CONCURRENTLY / DROP INDEX CONCURRENTLY; add constraints with NOT VALID then VALIDATE CONSTRAINT; add new columns NULL, backfill, then SET DEFAULT and SET NOT NULL.
  • MySQL/InnoDB: prefer ALGORITHM=INPLACE/INSTANT (where available) or online DDL; for older versions or tricky ops, use tools like pt-online-schema-change.
  • Any DB: check docs for which operations rewrite the table or take exclusive locks; when in doubt, perform in off‑peak and benchmark on staging.
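One operational detail worth encoding in your migration runner: Postgres rejects CREATE INDEX CONCURRENTLY inside a transaction block, so expand-phase DDL has to run statement by statement, not wrapped in BEGIN/COMMIT. A sketch (the `client` interface and statement list are illustrative):

```javascript
// Expand-phase DDL, run one statement at a time with no wrapping
// transaction, because CREATE INDEX CONCURRENTLY cannot run inside one.
const EXPAND_STEPS = [
  "ALTER TABLE users ADD COLUMN IF NOT EXISTS display_name text NULL",
  "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_display_name ON users (display_name)",
];

async function runExpand(client) {
  const applied = [];
  for (const sql of EXPAND_STEPS) {
    await client.query(sql); // no BEGIN/COMMIT around the loop
    applied.push(sql);
  }
  return applied;
}
```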

3) Backfills that don’t hurt

  • Process in batches (e.g., 500–5000 rows), commit between batches, and sleep briefly to keep p95 latency stable.
  • Order by PK to be friendly to the buffer cache; avoid random I/O.
  • Add a resume checkpoint (e.g., last processed id) so jobs restart safely.
  • Throttle by observability: slow down if DB CPU or read/write latency spikes.

Pseudo worker (Postgres)

-- Track last_id in a control table.
WITH b AS (
  SELECT id FROM users WHERE id > $last_id AND display_name IS NULL ORDER BY id LIMIT 1000
)
UPDATE users u
SET display_name = u.full_name
FROM b
WHERE u.id = b.id
RETURNING u.id;
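The same loop at the app layer, with the batching, checkpointing, and pacing rules from the list above. A sketch; the `store` interface is hypothetical and stands in for the control table and the batched UPDATE:

```javascript
// Checkpointed, batched backfill loop. store.fetchBatch(lastId, size)
// returns PK-ordered ids above lastId that still need the copy;
// store.applyBatch(ids) performs the UPDATE; store.saveCheckpoint(id)
// persists resume state so a restart picks up where it left off.
async function backfill(store, { batchSize = 1000, sleepMs = 50 } = {}) {
  let lastId = await store.loadCheckpoint(); // resume point
  for (;;) {
    const ids = await store.fetchBatch(lastId, batchSize);
    if (ids.length === 0) break;             // nothing left: done
    await store.applyBatch(ids);
    lastId = ids[ids.length - 1];
    await store.saveCheckpoint(lastId);      // safe restart point
    // Brief pause between batches keeps p95 latency stable.
    await new Promise(r => setTimeout(r, sleepMs));
  }
  return lastId;
}
```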

4) Dual writes (choose your weapon)

App-level (simpler to reason about; code owns consistency):

// in write path
await db.tx(async t => {
  await t.query("UPDATE users SET full_name=$1, display_name=$1 WHERE id=$2", [name, id]);
});

Trigger-based (DB enforces mirroring during migration):

CREATE OR REPLACE FUNCTION mirror_full_to_display() RETURNS trigger AS $$
BEGIN
  NEW.display_name := COALESCE(NEW.display_name, NEW.full_name);
  RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER users_mirror
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION mirror_full_to_display();

Triggers add latency; use them temporarily and drop after the transition.


5) Indexes & constraints without downtime

Build online, then enforce:

-- Postgres: unique constraint without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email_unique ON users(email);
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE USING INDEX idx_users_email_unique;

Validate later (Postgres):

ALTER TABLE users ADD CONSTRAINT users_email_format CHECK (email ~* '^[^@]+@[^@]+\.[^@]+$') NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_email_format;

Pattern: create (non‑blocking) → validate → enforce, so you can roll back cleanly.


6) Rollout choreography (two+ deploys)

  1. Deploy A – Code handles both schemas (reads new with fallback; writes both).
  2. Expand – Run DDL to add columns/indexes online.
  3. Backfill – Start background job; monitor DB and app metrics.
  4. Flip reads – Feature flag to prefer the new path; watch error/latency.
  5. Validate – Check row counts, add/validate constraints.
  6. Contract – Stop dual writes, remove old columns/indexes.
  7. Clean up – Remove flags/triggers and delete backfill code.
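Step 4's read flip can be a one-line flag check, which also doubles as the rollback switch in step 7 of the plan below. A sketch; the flag name and `flags` lookup are illustrative:

```javascript
// Flag-gated read path: rolling back is a flag change, not a deploy.
function nameFrom(row, flags) {
  if (flags.isEnabled("read_display_name")) {
    // New path, with fallback while the backfill completes.
    return row.display_name ?? row.full_name;
  }
  return row.full_name; // old path stays intact until contract
}
```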

7) Rollback plan (don’t skip)

  • Keep old read path behind a feature flag until after the confidence window.
  • Don’t drop old columns/indexes until you can replay back (or you’re sure you won’t need to).
  • Record backfill progress so you can resume or undo partial changes.
  • If a release goes bad: flip flag back, stop backfill, keep system healthy, investigate offline.

8) Observability during migrations

  • Dashboards: DB CPU, locks, replication lag, buffer cache hit ratio, statement latency.
  • App: p50/p95/p99 latency, error rates, timeouts, saturation (queue depth, thread pool usage).
  • Logs/Traces: tag writes with migration_phase, include rows processed per batch.
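Feeding these metrics back into the backfill closes the loop: "throttle by observability" can be as simple as an adaptive sleep driven by your latency signal. A sketch with illustrative numbers; tune the budget and bounds against your own dashboards:

```javascript
// Adaptive throttle: back off fast when observed p95 latency exceeds
// the budget, recover slowly when it is healthy, clamp to sane bounds.
function nextSleepMs(currentSleepMs, p95Ms, { budgetMs = 20, min = 10, max = 5000 } = {}) {
  const next = p95Ms > budgetMs ? currentSleepMs * 2 : currentSleepMs * 0.9;
  return Math.min(max, Math.max(min, Math.round(next)));
}
```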

9) Pitfalls & fast fixes

| Pitfall | Why it hurts | Fix |
|---|---|---|
| Adding NOT NULL on a huge table | Full scan + lock | Add column NULL → backfill → then SET NOT NULL |
| Rebuilding index blocking writes | Long outage | Build CONCURRENTLY/ONLINE, swap later |
| “Just rename the column” | Old code 500s | Add new, dual‑write, fallback reads, then drop |
| Backfill overloads DB | User latency spikes | Batch + sleep; throttle by latency/CPU |
| No rollback path | Risky deploy | Keep flags & old paths until confident |
| Trigger left on forever | Hidden perf cost | Remove triggers after migration |


Quick checklist

  • [ ] Plan expand → migrate → contract steps with owners and timings.
  • [ ] Add new columns NULL; build indexes online.
  • [ ] Dual‑write during the cutover; backfill in batches with checkpoints.
  • [ ] Flip reads behind a flag; watch latency/errors.
  • [ ] Validate & enforce constraints; then drop old schema.
  • [ ] Keep a rollback switch until the confidence window passes.

One‑minute adoption plan

  1. Write the step list (expand/migrate/contract) with a rollback for each step.
  2. Ship a release that supports both schemas.
  3. Run online DDL and start a batched backfill with checkpoints.
  4. Flip reads via feature flag; monitor closely.
  5. Validate constraints, then remove dual writes and old columns.