ALTER TABLE, 5 Million Rows, and the Deploy That Took Down the Site
A migration that takes 50ms on your dev database can lock a production table for 20 minutes. Here's how to find out before your users do.
It was a Thursday afternoon deploy. The migration looked harmless: add a NOT NULL column with a default value to the orders table. It passed review. It ran fine on staging. The deploy pipeline showed green.
Then production went quiet. Not the good kind of quiet. The orders table — 8 million rows — was locked. Every API endpoint that touched orders started queuing. Load balancers began returning 502s. The on-call Slack channel lit up. Forty minutes later, the migration finished, the locks released, and the incident retro began.
The migration wasn't wrong. The schema change was exactly what the team needed. The problem was that nobody tested it against 8 million rows before running it against 8 million rows.
The Migration Time Bomb
Schema migrations have a property that most code changes don't: their execution time is a function of data volume. A migration that alters a table's structure doesn't just update metadata — on many operations, PostgreSQL has to rewrite or scan every row.
This creates a class of problems that are completely invisible in development:
| Operation | 100 rows | 1M rows | 5M rows |
| --- | --- | --- | --- |
| ADD COLUMN ... DEFAULT (pre-PG 11) | < 1ms | 8s | 42s |
| ALTER COLUMN TYPE (int to bigint) | < 1ms | 12s | 65s |
| CREATE INDEX (single column) | < 1ms | 4s | 22s |
| ADD COLUMN NOT NULL DEFAULT (PG 11+) | < 1ms | < 1ms | < 1ms |
| CREATE INDEX CONCURRENTLY | < 1ms | 6s | 35s |
That fourth row is interesting: PostgreSQL 11 made ADD COLUMN ... DEFAULT a metadata-only operation, as long as the default is non-volatile (a constant, not something like random()). But the other operations still rewrite or scan the table. And even "fast" operations have nuances that bite at scale.
The point: your dev database with 50 rows will never tell you which category your migration falls into.
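To see the scaling property concretely, here's a toy benchmark using Python's bundled sqlite3 as a stand-in for a real database (Postgres locking behavior is very different, but the time-grows-with-rows property is the same). The schema and the migration are illustrative:

```python
import sqlite3
import time

def time_backfill(n_rows: int) -> float:
    """Seed n_rows, then time an add-column-plus-backfill migration."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    db.executemany(
        "INSERT INTO orders (total) VALUES (?)",
        ((i * 0.5,) for i in range(n_rows)),
    )
    db.commit()

    start = time.perf_counter()
    # The "harmless" migration: add a column, then backfill every row.
    db.execute("ALTER TABLE orders ADD COLUMN region TEXT")
    db.execute("UPDATE orders SET region = 'us-east-1' WHERE region IS NULL")
    db.commit()
    return time.perf_counter() - start

for n in (100, 50_000, 500_000):
    print(f"{n:>7} rows: {time_backfill(n):.3f}s")
```

Run it and the same migration goes from effectively free to clearly measurable as the row count climbs; on a real Postgres table with indexes, triggers, and concurrent traffic, the curve is far steeper.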
5 Migrations That Look Harmless (Until They're Not)
1. Adding a NOT NULL Column With Backfill
-- Step 1: Looks fine
ALTER TABLE orders ADD COLUMN region VARCHAR(50);

-- Step 2: The backfill that locks the table
UPDATE orders SET region = 'us-east-1' WHERE region IS NULL;

-- Step 3: Now add the constraint
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;
Step 2 is the problem. A single UPDATE touching 5 million rows acquires row locks across the entire table. Concurrent writes block. Auto-vacuum can't run. If you have foreign key references, those tables might lock too.
At scale: A 5M-row backfill UPDATE can take 30+ seconds and block all concurrent writes to the table for the duration.
The safer pattern:
-- Batch the backfill
UPDATE orders SET region = 'us-east-1'
WHERE id IN (
  SELECT id FROM orders WHERE region IS NULL LIMIT 10000
);
-- Repeat until done
But you won't know you need the safer pattern unless you've tested the naive one against real data volumes.
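The batching loop itself is simple. Here's a minimal sketch using Python's sqlite3 as a stand-in (against Postgres you'd run each batch in its own short transaction, usually with a brief pause between batches so autovacuum and concurrent writers can keep up):

```python
import sqlite3

def batched_backfill(db: sqlite3.Connection, batch_size: int = 10_000) -> int:
    """Backfill orders.region in small batches; returns the batch count."""
    batches = 0
    while True:
        cur = db.execute(
            """
            UPDATE orders SET region = 'us-east-1'
            WHERE id IN (
                SELECT id FROM orders WHERE region IS NULL LIMIT ?
            )
            """,
            (batch_size,),
        )
        db.commit()  # end the transaction, releasing locks between batches
        if cur.rowcount == 0:
            break  # nothing left to backfill
        batches += 1
    return batches
```

Each iteration locks at most batch_size rows for a fraction of a second instead of the whole table for the whole duration. The total wall-clock time is longer, but concurrent traffic never stalls.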
2. Creating an Index on a Large Table
-- Acquires a SHARE lock — blocks writes for the entire build
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
On 100 rows, this is instant. On 10 million rows, this holds a SHARE lock for the duration of the index build. Every INSERT, UPDATE, and DELETE on that table queues behind it.
The safer pattern:
-- Builds the index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
CONCURRENTLY doesn't block writes, but it takes roughly 2-3x longer and can fail under concurrent activity. A failed concurrent build leaves behind an INVALID index that you must drop manually before retrying. It also can't run inside a transaction block, which means most migration frameworks need special handling.
3. Changing a Column Type
-- "Just" widening a varchar fieldALTERTABLE users ALTERCOLUMN name TYPE VARCHAR(500)
-- "Just" widening a varchar fieldALTERTABLE users ALTERCOLUMN name TYPE VARCHAR(500)
-- "Just" widening a varchar fieldALTERTABLE users ALTERCOLUMN name TYPE VARCHAR(500)
This looks like a metadata change. It isn't. PostgreSQL rewrites the entire table when you change a column type — even when the new type is strictly wider. For a 5M-row table, that's a full table rewrite while holding an ACCESS EXCLUSIVE lock. Nothing reads. Nothing writes. The table is gone from the application's perspective until the rewrite finishes.
The safer pattern: Add a new column, backfill in batches, swap with a rename, drop the old column. More steps, but no extended lock.
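The choreography of that expand-and-swap pattern can be sketched end to end. This uses sqlite3 so it's runnable as-is; the table and column names are illustrative, and in production each step is a separate deploy with the backfill batched:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
db.commit()

# Step 1: add the new column (cheap: no rewrite, no long lock)
db.execute("ALTER TABLE users ADD COLUMN name_v2 TEXT")

# Step 2: backfill (collapsed to one statement here; batch it for real)
db.execute("UPDATE users SET name_v2 = name WHERE name_v2 IS NULL")
db.commit()

# Step 3: swap via renames (each rename is a fast metadata change)
db.execute("ALTER TABLE users RENAME COLUMN name TO name_old")
db.execute("ALTER TABLE users RENAME COLUMN name_v2 TO name")
db.commit()

# Step 4 (a later cleanup deploy): DROP COLUMN name_old
```

More moving parts than a single ALTER, but no step holds an exclusive lock for longer than a metadata update.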
4. Adding a Foreign Key
Adding a foreign key requires PostgreSQL to validate every existing row. On 5 million orders, that's a full table scan while holding a lock on both orders and customers. If customers is also a large table, you've just locked two critical tables simultaneously.
The safer pattern:
-- Add the constraint without validating existing rows
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

-- Validate in a separate step (holds a weaker lock)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;
NOT VALID adds the constraint for new rows immediately, then VALIDATE checks existing rows with a less aggressive lock. But again — you need real data to know whether the VALIDATE step takes 2 seconds or 20 minutes.
5. Dropping a Column (Yes, Really)
-- Column name illustrative
ALTER TABLE orders DROP COLUMN legacy_status;
In PostgreSQL, DROP COLUMN doesn't actually rewrite the table; it just marks the column as dropped in the catalog. Fast, right? Usually. But it still acquires an ACCESS EXCLUSIVE lock. If a long-running query is reading from the table, the DROP waits for it to finish, and while it waits, every new query queues behind it. A 1ms metadata operation can block the table for minutes because of one slow SELECT. The standard mitigation is to run SET lock_timeout = '2s'; first so the DDL fails fast instead of stalling all traffic, then retry once the long query is gone.
At scale: The lock acquisition time becomes unpredictable based on concurrent query workload — something you can only test with realistic data and realistic query patterns.
The Seedfast Workflow: Seed, Migrate, Measure
The fix is straightforward: run your migration against production-scale data before running it in production.
Step 1: Seed Production-Scale Data
# Match your production table sizes
seedfast seed --scope "seed 5 million orders with customers, order items, and payments"
Seedfast analyzes your schema, resolves foreign keys, and generates realistic data with proper distributions. You get the right proportions automatically:
Seeding Plan:
  public.customers   —    500,000 records
  public.orders      —  5,000,000 records
  public.order_items — 12,000,000 records
  public.payments    —  4,800,000 records

Total: 22,300,000 records across 4 tables
Approve? (Y/n)
Step 2: Run the Migration and Measure
# Time the migration
time psql $DATABASE_URL -f migrations/20260225_add_region_column.sql

# Or with your migration framework
time flyway migrate
time rails db:migrate
time alembic upgrade head
Step 3: Check for Lock Contention
While the migration runs, open another terminal and monitor locks:
-- See what's locked and what's waiting
SELECT
  blocked.pid AS blocked_pid,
  blocked_activity.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking_activity.query AS blocking_query,
  now() - blocked_activity.query_start AS waiting_time
FROM pg_catalog.pg_locks blocked
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked.pid = blocked_activity.pid
JOIN pg_catalog.pg_locks blocking
  ON blocked.transactionid = blocking.transactionid
  AND blocked.pid != blocking.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking.pid = blocking_activity.pid
WHERE NOT blocked.granted;
If you see queries stacking up behind your migration, that's exactly what will happen in production — except with real user traffic behind them.
Step 4: Benchmark the Safe Alternative
If the naive migration locks the table for too long, implement the safer pattern and measure again:
# Reseed a fresh database (or use a separate test database)
seedfast seed --scope "seed 5 million orders with customers"

# Run the batched migration
time psql $DATABASE_URL -f migrations/20260225_add_region_column_safe.sql
Now you have concrete numbers: "The naive migration locks orders for 38 seconds. The batched version takes 2 minutes total but never holds a lock for more than 200ms."
That's the data your team needs to make the right call.
CI/CD: Automated Migration Benchmarking
Don't make this a manual exercise. Put it in your pipeline.
name: Migration Benchmark

on:
  pull_request:
    paths:
      - 'migrations/**'

jobs:
  benchmark-migration:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: bench
          POSTGRES_USER: bench
          POSTGRES_PASSWORD: bench
        ports:
          - 5432:5432
    steps:
      - uses: actions/checkout@v4

      - name: Apply base schema
        run: |
          for f in migrations/*.sql; do
            psql $DATABASE_URL -f "$f" 2>/dev/null || true
          done
        env:
          DATABASE_URL: postgres://bench:bench@localhost:5432/bench

      - name: Seed test data
        run: seedfast seed --scope "seed 1 million orders with customers and payments" --output plain
        env:
          SEEDFAST_API_KEY: ${{ secrets.SEEDFAST_API_KEY }}
          DATABASE_URL: postgres://bench:bench@localhost:5432/bench

      - name: Benchmark new migration
        run: |
          NEW_MIGRATIONS=$(git diff --name-only origin/main -- migrations/ | sort)
          for f in $NEW_MIGRATIONS; do
            echo "--- Benchmarking: $f ---"
            time psql $DATABASE_URL -f "$f"
          done
        env:
          DATABASE_URL: postgres://bench:bench@localhost:5432/bench

      - name: Check migration duration
        run: |
          echo "Review migration timing above. Migrations over 30s need batching or CONCURRENTLY."
Now every PR that adds a migration gets benchmarked against realistic data. No more "it ran fine on staging" — because staging had 200 rows.
For Smaller Teams
Even without full CI integration, you can add a one-liner to your PR template:
## Migration Checklist
- [ ] Tested against production-scale data (`seedfast seed --scope "..."`)
- [ ] Migration completes in under 30 seconds on 1M+ rows
- [ ] No ACCESS EXCLUSIVE locks held for more than 5 seconds
- [ ] Uses CONCURRENTLY for index creation (if applicable)
- [ ] Backfills are batched (if applicable)
PostgreSQL Migration Gotchas Worth Knowing
A few additional things that catch teams off guard at scale:
Transaction-wrapped migrations lock longer than you think. Most frameworks run each migration inside a transaction. That means the lock acquired at the start isn't released until the entire migration file finishes — including any backfills, constraint validations, or index builds inside the same transaction.
CONCURRENTLY can't run in a transaction. If your framework wraps migrations in transactions (Rails, Flyway, Alembic by default), CREATE INDEX CONCURRENTLY will fail. You need to configure your framework to run specific migrations outside a transaction block.
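In Alembic, for instance, the usual recipe is an autocommit block. This is a sketch of a migration file, not something runnable on its own; the index and table names are illustrative:

```python
# Alembic migration sketch: build an index concurrently, outside the
# transaction Alembic normally wraps around each migration.
from alembic import op


def upgrade():
    # autocommit_block() temporarily exits the migration's transaction,
    # which CREATE INDEX CONCURRENTLY requires.
    with op.get_context().autocommit_block():
        op.create_index(
            "idx_orders_customer_id",
            "orders",
            ["customer_id"],
            postgresql_concurrently=True,
        )


def downgrade():
    with op.get_context().autocommit_block():
        op.drop_index("idx_orders_customer_id", table_name="orders")
```

Rails has an equivalent escape hatch (disable_ddl_transaction!), and Flyway skips the wrapping transaction for statements it detects as non-transactional.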
pg_repack is your friend for zero-downtime rewrites. When you need to rewrite a table (e.g., changing a column type), pg_repack can do it with minimal locking by rebuilding the table in the background. But you need to test it against real data volumes first — because pg_repack on a 50M-row table still takes significant time and I/O.
Autovacuum matters. A large UPDATE for backfill generates dead tuples. If autovacuum can't keep up, table bloat causes subsequent queries to slow down. Test the migration and then check table bloat with pg_stat_user_tables.
Before You Run That Migration in Production
Checklist for any migration touching a table with more than 100K rows:
Seed production-scale data locally
seedfast seed --scope "seed [your production row count] [table] with related records"
Time the migration. If it takes more than 10 seconds, consider batching or alternative approaches.
Monitor locks during execution. An ACCESS EXCLUSIVE lock held for more than a few seconds will impact production traffic.
Test the rollback too. A 30-second migration with a 5-minute rollback is a risky deploy.
Run it during low traffic if the lock duration is unavoidable. But know exactly how long "unavoidable" is — in seconds, not "it should be quick."
Check CONCURRENTLY support in your migration framework for index operations.
Validate constraints separately using NOT VALID + VALIDATE CONSTRAINT for foreign keys and check constraints.
Batch backfills — never UPDATE millions of rows in a single statement.
The common thread: every item on this list requires production-scale data to validate. You can't measure lock duration on 50 rows. You can't benchmark a backfill on an empty table. You can't discover that CONCURRENTLY fails inside your framework's transaction wrapper without actually running it.
Stop discovering migration problems in production. Discover them on your laptop.