TL;DR
- Migrate in three phases: expand → migrate → contract. Ship code that works with old + new before you flip anything.
- Avoid long locks: add columns NULL, build indexes online (e.g., CONCURRENTLY/ONLINE/INSTANT), and backfill in batches.
- Use dual writes (or triggers) during the move, then validate and eventually enforce (NOT NULL/unique).
- Roll out with feature flags and a rollback path; drop old schema only after a confidence window.
1) The pattern (rename column example)
Goal: users.full_name → users.display_name (no downtime).
- Expand – add the new shape, keep old working.
ALTER TABLE users ADD COLUMN display_name text NULL;
- Dual write – app writes both fields (or use a trigger).
-- App layer pseudo-code
UPDATE users SET full_name=$1, display_name=$1 WHERE id=$2;
- Backfill – copy historical data in small chunks.
-- Postgres example: 1k at a time
-- (UPDATE has no ORDER BY/LIMIT, so pick the batch in a subquery)
UPDATE users
SET display_name = full_name
WHERE id IN (
  SELECT id FROM users
  WHERE display_name IS NULL
  ORDER BY id
  LIMIT 1000
);
- Flip reads – read `display_name` with a fallback to `full_name`.
SELECT COALESCE(display_name, full_name) AS name FROM users WHERE id=$1;
- Enforce & contract – stop dual writes, set constraints, drop old.
ALTER TABLE users ALTER COLUMN display_name SET NOT NULL;
ALTER TABLE users DROP COLUMN full_name;
For table renames/splits, keep a compat view or routing layer in front while you backfill and flip traffic.
2) Safe DDL by platform (quick notes)
- Postgres: use `CREATE INDEX CONCURRENTLY`/`DROP INDEX CONCURRENTLY`; add constraints with `NOT VALID` then `VALIDATE CONSTRAINT`; add new columns NULL, backfill, then `SET DEFAULT` and `SET NOT NULL`.
- MySQL/InnoDB: prefer `ALGORITHM=INPLACE`/`ALGORITHM=INSTANT` (where available) or online DDL; for older versions or tricky operations, use tools like pt-online-schema-change.
- Any DB: check the docs for which operations rewrite the table or take exclusive locks; when in doubt, run during off-peak hours and benchmark on staging.
3) Backfills that don’t hurt
- Process in batches (e.g., 500–5000 rows), commit between batches, and sleep briefly to keep p95 latency stable.
- Order by PK to be friendly to the buffer cache; avoid random I/O.
- Add a resume checkpoint (e.g., last processed `id`) so jobs restart safely.
- Throttle by observability: slow down if DB CPU or read/write latency spikes.
Pseudo worker (Postgres)
-- Track last_id in a control table.
WITH b AS (
  SELECT id FROM users
  WHERE id > $last_id AND display_name IS NULL
  ORDER BY id
  LIMIT 1000
)
UPDATE users u
SET display_name = u.full_name
FROM b
WHERE u.id = b.id
RETURNING u.id;
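The batch query above needs a driver loop that checkpoints and pauses between batches. A minimal sketch in Python, using sqlite3 purely so it runs anywhere (in production this would be your Postgres driver, and `backfill_ckpt` is a hypothetical one-row control table):

```python
import sqlite3
import time

def backfill(conn: sqlite3.Connection, batch: int = 1000, pause: float = 0.0) -> int:
    """Copy full_name -> display_name in PK-ordered batches, checkpointing last_id."""
    done = 0
    # Resume from the stored checkpoint (0 on first run).
    last_id = conn.execute("SELECT last_id FROM backfill_ckpt").fetchone()[0]
    while True:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM users WHERE id > ? AND display_name IS NULL "
            "ORDER BY id LIMIT ?", (last_id, batch))]
        if not ids:
            return done
        conn.executemany(
            "UPDATE users SET display_name = full_name WHERE id = ?",
            [(i,) for i in ids])
        last_id = ids[-1]
        conn.execute("UPDATE backfill_ckpt SET last_id = ?", (last_id,))
        conn.commit()        # commit per batch keeps transactions short
        done += len(ids)
        time.sleep(pause)    # breathing room between batches

# Demo on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT, display_name TEXT)")
conn.execute("CREATE TABLE backfill_ckpt (last_id INTEGER)")
conn.execute("INSERT INTO backfill_ckpt VALUES (0)")
conn.executemany("INSERT INTO users (id, full_name) VALUES (?, ?)",
                 [(i, f"user {i}") for i in range(1, 2501)])
migrated = backfill(conn, batch=1000)
```

Because the checkpoint is committed with each batch, killing and restarting the job picks up where it left off instead of rescanning from row one.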
4) Dual writes (choose your weapon)
App-level (simpler to reason about; code owns consistency):
// in write path
await db.tx(async t => {
  await t.query("UPDATE users SET full_name=$1, display_name=$1 WHERE id=$2", [name, id]);
});
Trigger-based (DB enforces mirroring during migration):
CREATE OR REPLACE FUNCTION mirror_full_to_display() RETURNS trigger AS $$
BEGIN
  NEW.display_name := COALESCE(NEW.display_name, NEW.full_name);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER users_mirror
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION mirror_full_to_display();
Triggers add latency; use them temporarily and drop after the transition.
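The point of the app-level variant is that both columns change in one transaction or not at all. A runnable sketch in Python (sqlite3 stands in for the real database; `set_name` is a hypothetical write-path helper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT, display_name TEXT)")
conn.execute("INSERT INTO users (id, full_name) VALUES (1, 'old name')")

def set_name(user_id: int, name: str) -> None:
    """Write path during the migration: update old and new columns atomically."""
    with conn:  # one transaction: both columns change or neither does
        conn.execute(
            "UPDATE users SET full_name = ?, display_name = ? WHERE id = ?",
            (name, name, user_id))

set_name(1, "new name")
old, new = conn.execute(
    "SELECT full_name, display_name FROM users WHERE id = 1").fetchone()
```

If the two writes were separate transactions, a crash between them would leave the columns disagreeing, which is exactly what the validate step later has to catch.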
5) Indexes & constraints without downtime
Build online, then enforce:
-- Postgres: unique constraint without blocking writes
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email_unique ON users(email);
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE USING INDEX idx_users_email_unique;
Validate later (Postgres):
ALTER TABLE users ADD CONSTRAINT users_email_format CHECK (email ~* '^[^@]+@[^@]+\.[^@]+$') NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_email_format;
Pattern: create (non‑blocking) → validate → enforce, so you can roll back cleanly.
6) Rollout choreography (two+ deploys)
- Deploy A – Code handles both schemas (reads new with fallback; writes both).
- Expand – Run DDL to add columns/indexes online.
- Backfill – Start background job; monitor DB and app metrics.
- Flip reads – Feature flag to prefer the new path; watch error/latency.
- Validate – Check row counts, add/validate constraints.
- Contract – Stop dual writes, remove old columns/indexes.
- Clean up – Remove flags/triggers and delete backfill code.
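The "flip reads" and rollback steps above reduce to a flag check in the read path. A minimal sketch, assuming a hypothetical in-process `FLAGS` store (real systems would use their feature-flag service):

```python
FLAGS = {"read_display_name": False}  # flip to True to prefer the new column

def user_name(row: dict) -> str:
    """Read path that can be flipped on (and rolled back) without a deploy."""
    if FLAGS["read_display_name"]:
        # New path, with a fallback while the backfill is still running.
        return row.get("display_name") or row["full_name"]
    return row["full_name"]  # old path stays intact as the rollback target

row = {"full_name": "Ada Lovelace", "display_name": None}
before_flip = user_name(row)        # old column
FLAGS["read_display_name"] = True   # the "flip reads" step
during_backfill = user_name(row)    # not backfilled yet: falls back
row["display_name"] = "Ada L."
after_backfill = user_name(row)     # new column wins
```

Flipping the flag back is the entire rollback, which is why the old read path survives until the contract step.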
7) Rollback plan (don’t skip)
- Keep old read path behind a feature flag until after the confidence window.
- Don’t drop old columns/indexes until you could still replay writes back into them (or you’re sure you won’t need to).
- Record backfill progress so you can resume or undo partial changes.
- If a release goes bad: flip flag back, stop backfill, keep system healthy, investigate offline.
8) Observability during migrations
- Dashboards: DB CPU, locks, replication lag, buffer cache hit ratio, statement latency.
- App: p50/p95/p99 latency, error rates, timeouts, saturation (queue depth, thread pool usage).
- Logs/Traces: tag writes with `migration_phase`; include rows processed per batch.
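"Throttle by observability" can start as simply as pacing each batch by how long the previous one took; a sketch (the 0.5 ratio and 5-second cap are arbitrary starting points, not recommendations):

```python
def pacing_delay(batch_seconds: float, ratio: float = 0.5, cap: float = 5.0) -> float:
    """Sleep for a fraction of the last batch's runtime, capped.

    With ratio=0.5 the worker idles ~1/3 of the time, and a struggling
    database (longer batches) automatically earns longer pauses.
    """
    return min(batch_seconds * ratio, cap)
```

A fancier version would also poll DB CPU or replication lag from your metrics store and back off harder when they spike.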
9) Pitfalls & fast fixes
| Pitfall | Why it hurts | Fix |
|---|---|---|
| Adding NOT NULL on a huge table | Full scan + lock | Add column NULL → backfill → then SET NOT NULL |
| Rebuilding index blocking writes | Long outage | Build CONCURRENTLY/ONLINE, swap later |
| “Just rename the column” | Old code 500s | Add new, dual‑write, fallback reads, then drop |
| Backfill overloads DB | User latency spikes | Batch + sleep; throttle by latency/CPU |
| No rollback path | Risky deploy | Keep flags & old paths until confident |
| Trigger left on forever | Hidden perf cost | Remove triggers after migration |
Quick checklist
- [ ] Plan expand → migrate → contract steps with owners and timings.
- [ ] Add new columns NULL; build indexes online.
- [ ] Dual‑write during the cutover; backfill in batches with checkpoints.
- [ ] Flip reads behind a flag; watch latency/errors.
- [ ] Validate & enforce constraints; then drop old schema.
- [ ] Keep a rollback switch until the confidence window passes.
One‑minute adoption plan
- Write the step list (expand/migrate/contract) with a rollback for each step.
- Ship a release that supports both schemas.
- Run online DDL and start a batched backfill with checkpoints.
- Flip reads via feature flag; monitor closely.
- Validate constraints, then remove dual writes and old columns.