
Zero‑Downtime Migrations — expand/contract without breaking prod


Change schemas safely with backward‑compatible steps, dual writes, background backfills, and rolling deploys. Postgres‑flavored, but patterns apply to MySQL and friends.

TL;DR

  • Migrate in three phases: expand → migrate → contract. Ship code that works with old + new before you flip anything.
  • Avoid long locks: add columns NULL, build indexes online (e.g., CONCURRENTLY/ONLINE/INSTANT), and backfill in batches.
  • Use dual writes (or triggers) during the move, then validate and eventually enforce (NOT NULL/unique).
  • Roll out with feature flags and a rollback path; drop old schema only after a confidence window.

1) The pattern (rename column example)

Goal: users.full_name → users.display_name (no downtime).

  1. Expand – add the new shape, keep old working.
ALTER TABLE users ADD COLUMN display_name text NULL;
  2. Dual write – app writes both fields (or use a trigger).
-- App layer pseudo-code
UPDATE users SET full_name=$1, display_name=$1 WHERE id=$2;
  3. Backfill – copy historical data in small chunks.
-- Postgres example: 1k at a time. UPDATE has no ORDER BY/LIMIT,
-- so batch via a subquery on the PK.
UPDATE users
SET display_name = full_name
WHERE id IN (
  SELECT id FROM users
  WHERE display_name IS NULL
  ORDER BY id
  LIMIT 1000
);
  4. Flip reads – read display_name with fallback to full_name.
SELECT COALESCE(display_name, full_name) AS name FROM users WHERE id=$1;
  5. Enforce & contract – stop dual writes, set constraints, drop old.
ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;
ALTER TABLE users DROP COLUMN full_name;

For table renames/splits, keep a compat view or routing layer in front while you backfill and flip traffic.
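At the app layer, the dual-write and fallback-read steps above fit into two small helpers. A sketch, assuming a generic `db.query(sql, params) → rows` client (the interface and names are illustrative, not a specific library):

```javascript
// During the migration window, reads prefer display_name and fall back
// to full_name, so rows the backfill has not reached yet still resolve.
async function readName(db, id) {
  const rows = await db.query(
    "SELECT COALESCE(display_name, full_name) AS name FROM users WHERE id = $1",
    [id]
  );
  return rows[0]?.name ?? null;
}

// Writes hit both columns until the contract phase drops full_name.
async function writeName(db, id, name) {
  await db.query(
    "UPDATE users SET full_name = $1, display_name = $1 WHERE id = $2",
    [name, id]
  );
}
```

Keeping both paths in named helpers makes the contract step mechanical: delete the fallback and the second column, and the call sites don't change.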


2) Safe DDL by platform (quick notes)

  • Postgres: use CREATE INDEX CONCURRENTLY / DROP INDEX CONCURRENTLY; add constraints with NOT VALID then VALIDATE CONSTRAINT; add new columns NULL, backfill, then SET DEFAULT and SET NOT NULL.
  • MySQL/InnoDB: prefer ALGORITHM=INPLACE/INSTANT (where available) or online DDL; for older versions or tricky ops, use tools like pt-online-schema-change.
  • Any DB: check docs for which operations rewrite the table or take exclusive locks; when in doubt, perform in off‑peak and benchmark on staging.
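One operational detail worth encoding in your migration runner: Postgres rejects CREATE INDEX CONCURRENTLY inside a transaction block, so expand-phase DDL has to run statement by statement, not wrapped in BEGIN/COMMIT. A sketch (the `client` interface and statement list are illustrative):

```javascript
// Expand-phase DDL, run one statement at a time with no wrapping
// transaction, because CREATE INDEX CONCURRENTLY cannot run inside one.
const EXPAND_STEPS = [
  "ALTER TABLE users ADD COLUMN IF NOT EXISTS display_name text NULL",
  "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_display_name ON users (display_name)",
];

async function runExpand(client) {
  const applied = [];
  for (const sql of EXPAND_STEPS) {
    await client.query(sql); // no BEGIN/COMMIT around the loop
    applied.push(sql);
  }
  return applied;
}
```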

3) Backfills that don’t hurt

  • Process in batches (e.g., 500–5000 rows), commit between batches, and sleep briefly to keep p95 latency stable.
  • Order by PK to be friendly to the buffer cache; avoid random I/O.
  • Add a resume checkpoint (e.g., last processed id) so jobs restart safely.
  • Throttle by observability: slow down if DB CPU or read/write latency spikes.

Pseudo worker (Postgres)

-- Track last_id in a control table.
WITH b AS (
  SELECT id FROM users WHERE id > $last_id AND display_name IS NULL ORDER BY id LIMIT 1000
)
UPDATE users u
SET display_name = u.full_name
FROM b
WHERE u.id = b.id
RETURNING u.id;
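The same loop at the app layer, with the batching, checkpointing, and pacing rules from the list above. A sketch; the `store` interface is hypothetical and stands in for the control table and the batched UPDATE:

```javascript
// Checkpointed, batched backfill loop. store.fetchBatch(lastId, size)
// returns PK-ordered ids above lastId that still need the copy;
// store.applyBatch(ids) performs the UPDATE; store.saveCheckpoint(id)
// persists resume state so a restart picks up where it left off.
async function backfill(store, { batchSize = 1000, sleepMs = 50 } = {}) {
  let lastId = await store.loadCheckpoint(); // resume point
  for (;;) {
    const ids = await store.fetchBatch(lastId, batchSize);
    if (ids.length === 0) break;             // nothing left: done
    await store.applyBatch(ids);
    lastId = ids[ids.length - 1];
    await store.saveCheckpoint(lastId);      // safe restart point
    // Brief pause between batches keeps p95 latency stable.
    await new Promise(r => setTimeout(r, sleepMs));
  }
  return lastId;
}
```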

4) Dual writes (choose your weapon)

App-level (simpler to reason about; code owns consistency):

// in write path
await db.tx(async t => {
  await t.query("UPDATE users SET full_name=$1, display_name=$1 WHERE id=$2", [name, id]);
});

Trigger-based (DB enforces mirroring during migration):

CREATE OR REPLACE FUNCTION mirror_full_to_display() RETURNS trigger AS $$
BEGIN
  NEW.display_name := COALESCE(NEW.display_name, NEW.full_name);
  RETURN NEW;
END $$ LANGUAGE plpgsql;

CREATE TRIGGER users_mirror
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION mirror_full_to_display();

Triggers add latency; use them temporarily and drop after the transition.


5) Indexes & constraints without downtime

Build online, then enforce:

-- Postgres: unique constraint without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email_unique ON users(email);
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE USING INDEX idx_users_email_unique;

Validate later (Postgres):

ALTER TABLE users ADD CONSTRAINT users_email_format CHECK (email ~* '^[^@]+@[^@]+\.[^@]+$') NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_email_format;

Pattern: create (non‑blocking) → validate → enforce, so you can roll back cleanly.


6) Rollout choreography (two+ deploys)

  1. Deploy A – Code handles both schemas (reads new with fallback; writes both).
  2. Expand – Run DDL to add columns/indexes online.
  3. Backfill – Start background job; monitor DB and app metrics.
  4. Flip reads – Feature flag to prefer the new path; watch error/latency.
  5. Validate – Check row counts, add/validate constraints.
  6. Contract – Stop dual writes, remove old columns/indexes.
  7. Clean up – Remove flags/triggers and delete backfill code.
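Step 4's read flip can be a one-line flag check, which also doubles as the rollback switch in step 7 of the plan below. A sketch; the flag name and `flags` lookup are illustrative:

```javascript
// Flag-gated read path: rolling back is a flag change, not a deploy.
function nameFrom(row, flags) {
  if (flags.isEnabled("read_display_name")) {
    // New path, with fallback while the backfill completes.
    return row.display_name ?? row.full_name;
  }
  return row.full_name; // old path stays intact until contract
}
```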

7) Rollback plan (don’t skip)

  • Keep old read path behind a feature flag until after the confidence window.
  • Don’t drop old columns/indexes until you can replay back (or you’re sure you won’t need to).
  • Record backfill progress so you can resume or undo partial changes.
  • If a release goes bad: flip flag back, stop backfill, keep system healthy, investigate offline.

8) Observability during migrations

  • Dashboards: DB CPU, locks, replication lag, buffer cache hit ratio, statement latency.
  • App: p50/p95/p99 latency, error rates, timeouts, saturation (queue depth, thread pool usage).
  • Logs/Traces: tag writes with migration_phase, include rows processed per batch.
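Feeding these metrics back into the backfill closes the loop: "throttle by observability" can be as simple as an adaptive sleep driven by your latency signal. A sketch with illustrative numbers; tune the budget and bounds against your own dashboards:

```javascript
// Adaptive throttle: back off fast when observed p95 latency exceeds
// the budget, recover slowly when it is healthy, clamp to sane bounds.
function nextSleepMs(currentSleepMs, p95Ms, { budgetMs = 20, min = 10, max = 5000 } = {}) {
  const next = p95Ms > budgetMs ? currentSleepMs * 2 : currentSleepMs * 0.9;
  return Math.min(max, Math.max(min, Math.round(next)));
}
```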

9) Pitfalls & fast fixes

| Pitfall | Why it hurts | Fix |
|---|---|---|
| Adding NOT NULL on a huge table | Full scan + lock | Add column NULL → backfill → then SET NOT NULL |
| Rebuilding index blocking writes | Long outage | Build CONCURRENTLY/ONLINE, swap later |
| “Just rename the column” | Old code 500s | Add new, dual‑write, fallback reads, then drop |
| Backfill overloads DB | User latency spikes | Batch + sleep; throttle by latency/CPU |
| No rollback path | Risky deploy | Keep flags & old paths until confident |
| Trigger left on forever | Hidden perf cost | Remove triggers after migration |


Quick checklist

  • [ ] Plan expand → migrate → contract steps with owners and timings.
  • [ ] Add new columns NULL; build indexes online.
  • [ ] Dual‑write during the cutover; backfill in batches with checkpoints.
  • [ ] Flip reads behind a flag; watch latency/errors.
  • [ ] Validate & enforce constraints; then drop old schema.
  • [ ] Keep a rollback switch until the confidence window passes.

One‑minute adoption plan

  1. Write the step list (expand/migrate/contract) with a rollback for each step.
  2. Ship a release that supports both schemas.
  3. Run online DDL and start a batched backfill with checkpoints.
  4. Flip reads via feature flag; monitor closely.
  5. Validate constraints, then remove dual writes and old columns.