All posts

ALTER TABLE, 5 Million Rows, and the Deploy That Took Down the Site

By Mikhail Shytsko, Founder at Seedfast · · Updated

A migration that takes 50ms on your dev database can lock a production table for 20 minutes. Here's how to find out before your users do.

It was a Thursday afternoon deploy. The migration looked harmless: add a NOT NULL column with a default value to the orders table. It passed review. It ran fine on staging. The deploy pipeline showed green.

Then production went quiet. Not the good kind of quiet. The orders table — 8 million rows — was locked. Every API endpoint that touched orders started queuing. Load balancers began returning 502s. The on-call Slack channel lit up. Forty minutes later, the migration finished, the locks released, and the incident retro began.

The migration wasn't wrong. The schema change was exactly what the team needed. The problem was that nobody tested it against 8 million rows before running it against 8 million rows.

This article is about migration performance — lock duration, rewrite time, and the row counts that turn a 50ms change into a 40-minute outage. Its companion is migration correctness: whether the migration writes the right data at all. For that side — catching NULL concatenations, failed constraints, and silently rounded values before they ship — see how to review a migration against realistic data.

  • Migration time scales with data volume. A change that runs in 50ms on 50 development rows can hold a lock for minutes on millions of production rows. The SQL is identical; only the data is different.
  • The dangerous operations rewrite or scan the whole table — backfilling a new column, a non-concurrent CREATE INDEX, an integer-to-bigint type change, and validating a new FOREIGN KEY. Each one holds a lock proportional to row count.
  • Some changes are metadata-only and safe at any size — adding a column with a constant default (PostgreSQL 11+), widening a varchar, or varchar to text. You cannot tell which class a migration falls into by reading the SQL; you have to run it against production-scale data.
  • Lock duration, not total runtime, is what takes the site down. A two-minute migration that never holds a lock longer than 200ms is safer than a 30-second one that locks the table the entire time.
  • Seedfast seeds production-scale, foreign-key-valid data from your schema in one command, so you can time any migration and watch for lock contention before it ships. Its correctness cousin — catching data-shape bugs at small volume — is migration review.

Schema migrations have a property that most code changes don't: their execution time is a function of data volume. A migration that alters a table's structure doesn't just update metadata — on many operations, PostgreSQL has to rewrite or scan every row.

This creates a class of problems that are completely invisible in development:

Operation100 rows1M rows5M rows
ADD COLUMN ... DEFAULT (pre-PG 11)< 1ms8s42s
ALTER COLUMN TYPE (integer to bigint)< 1ms12s65s
CREATE INDEX (single column)< 1ms4s22s
ADD COLUMN NOT NULL DEFAULT (PG 11+)< 1ms< 1ms< 1ms
CREATE INDEX CONCURRENTLY< 1ms6s35s

Figures are illustrative orders of magnitude on commodity hardware — the point is the shape of the curve, not the exact seconds. Your real numbers depend on hardware, table width, indexes, and concurrent load.

That fourth row is interesting — PostgreSQL 11 made ADD COLUMN ... DEFAULT a metadata-only operation for constant defaults. But the other operations still rewrite or scan the table. And even "fast" operations have nuances that bite at scale.

The point: your dev database with 50 rows will never tell you which category your migration falls into.

-- Step 1: Looks fine
ALTER TABLE orders ADD COLUMN region VARCHAR(50);

-- Step 2: The backfill that locks the table
UPDATE orders SET region = 'us-east-1' WHERE region IS NULL;

-- Step 3: Now add the constraint
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;

Step 2 is the problem. A single UPDATE touching 5 million rows acquires row locks across the entire table. Concurrent writes block. Autovacuum can't reclaim the dead tuples the backfill creates while the transaction stays open. If you have foreign key references, those tables might lock too.

At scale: A 5M-row backfill UPDATE can take 30+ seconds and block all concurrent writes to the table for the duration.

The safer pattern:

-- Batch the backfill
UPDATE orders SET region = 'us-east-1'
WHERE id IN (SELECT id FROM orders WHERE region IS NULL LIMIT 10000);
-- Repeat until no rows remain

But you won't know you need the safer pattern unless you've tested the naive one against real data volumes.

-- Acquires a SHARE lock — blocks writes for the entire build
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

On 100 rows, this is instant. On 10 million rows, this holds a SHARE lock for the duration of the index build. Every INSERT, UPDATE, and DELETE on that table queues behind it.

The safer pattern:

-- Builds the index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

CONCURRENTLY doesn't block writes, but it takes roughly 2-3x longer and can fail if there are concurrent schema changes. It also can't run inside a transaction block — which means most migration frameworks need special handling.

-- Widening an integer primary key that's about to overflow
ALTER TABLE events ALTER COLUMN id TYPE BIGINT;

This is the migration every growing team eventually runs — an integer ID column approaching its ~2.1 billion ceiling. It looks like a one-line metadata tweak. It isn't. Because integer and bigint have different on-disk representations, PostgreSQL has to rewrite the entire table while holding an ACCESS EXCLUSIVE lock. For a 5M-row table, nothing reads and nothing writes until the rewrite finishes.

Here's the trap: not every type change rewrites. Widening a varchar, or going from varchar to text, is binary-coercible — PostgreSQL changes only the catalog and returns instantly, even on 50 million rows. So name VARCHAR(255) to name VARCHAR(500) is free, while id INTEGER to id BIGINT rewrites the whole table. Reading the SQL won't tell you which one you wrote; only running it against real volume will.

The safer pattern for a true rewrite: Add a new column, backfill in batches, swap with a rename, drop the old column. More steps, but no extended lock.

ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customers (id);

Adding a foreign key requires PostgreSQL to validate every existing row. On 5 million orders, that's a full table scan while holding a lock on both orders and customers. If customers is also a large table, you've just locked two critical tables simultaneously — and it only works if every order already points at a real customer, the kind of referential integrity that production data drifts away from over time.

The safer pattern:

-- Add the constraint without validating existing rows
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;

-- Validate in a separate step (holds a weaker lock)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;

NOT VALID adds the constraint for new rows immediately, then VALIDATE checks existing rows with a less aggressive lock. But again — you need real data to know whether the VALIDATE step takes 2 seconds or 20 minutes.

ALTER TABLE orders DROP COLUMN legacy_status;

In PostgreSQL, DROP COLUMN doesn't actually rewrite the table — it marks the column as invisible. Fast, right? Usually. But it still acquires an ACCESS EXCLUSIVE lock. If there are long-running queries reading from the table, the DROP waits for them to finish. While it waits, every new query queues behind it. A 1ms metadata operation can block the table for minutes if there's a slow SELECT running.

At scale: The lock acquisition time becomes unpredictable based on concurrent query workload — something you can only test with realistic data and realistic query patterns.

The fix is straightforward: run your migration against production-scale data before running it in production. Seedfast generates realistic, relational test data directly from your database schema — without a copy of your production rows and without seed scripts to maintain — so you can stand up a table with the same row counts as production and time the migration against it.

# Match your production table sizes
seedfast seed --scope "seed 5 million orders with customers, order items, and payments"

Seedfast reads your live schema, resolves foreign keys into a valid insert order, and generates realistic data with proper distributions. You get the right proportions automatically (the large-volume seeding guide covers batching and scope tuning for the highest row counts):

Seeding Plan:
  public.customers    — 500,000 records
  public.orders       — 5,000,000 records
  public.order_items  — 12,000,000 records
  public.payments     — 4,800,000 records

Total: 22,300,000 records across 4 tables

Approve? (Y/n)
# Time the migration
time psql $DATABASE_URL -f migrations/20260225_add_region_column.sql

# Or with your migration framework
time flyway migrate
time rails db:migrate
time alembic upgrade head

While the migration runs, open another terminal and monitor locks:

-- See which queries are blocked and what's blocking them
SELECT
  blocked.pid          AS blocked_pid,
  blocked.query        AS blocked_query,
  blocking.pid         AS blocking_pid,
  blocking.query       AS blocking_query,
  now() - blocked.query_start AS waiting_time
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid));

If you see queries stacking up behind your migration, that's exactly what will happen in production — except with real user traffic behind them. PostgreSQL's explicit locking documentation explains which lock modes conflict.

If the naive migration locks the table for too long, implement the safer pattern and measure again:

# Reseed a fresh database (or use a separate test database)
seedfast seed --scope "seed 5 million orders with customers"

# Run the batched migration
time psql $DATABASE_URL -f migrations/20260225_add_region_column_safe.sql

Now you have concrete numbers: "The naive migration locks orders for 38 seconds. The batched version takes 2 minutes total but never holds a lock for more than 200ms."

That's the data your team needs to make the right call.

Seedfast generates realistic rows that satisfy your foreign-key constraints, so you can run both candidate migrations against a populated copy and see which lock pattern actually holds up; the seeding guide covers a first run.

Don't make this a manual exercise. Put it in your pipeline.

name: Migration Benchmark

on:
  pull_request:
    paths:
      - 'migrations/**'

jobs:
  benchmark-migration:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: bench
          POSTGRES_USER: bench
          POSTGRES_PASSWORD: bench
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4

      - name: Apply base schema
        run: |
          for f in migrations/*.sql; do
            psql $DATABASE_URL -f "$f" 2>/dev/null || true
          done
        env:
          DATABASE_URL: postgres://bench:bench@localhost:5432/bench

      - name: Seed test data
        run: seedfast seed --scope "seed 1 million orders with customers and payments" --output plain
        env:
          SEEDFAST_API_KEY: ${{ secrets.SEEDFAST_API_KEY }}
          DATABASE_URL: postgres://bench:bench@localhost:5432/bench

      - name: Benchmark new migration
        run: |
          NEW_MIGRATIONS=$(git diff --name-only origin/main -- migrations/ | sort)
          for f in $NEW_MIGRATIONS; do
            echo "--- Benchmarking: $f ---"
            time psql $DATABASE_URL -f "$f"
          done
        env:
          DATABASE_URL: postgres://bench:bench@localhost:5432/bench

      - name: Check migration duration
        run: |
          echo "Review migration timing above. Migrations over 30s need batching or CONCURRENTLY."

Now every PR that adds a migration gets benchmarked against realistic data. No more "it ran fine on staging" — because staging had 200 rows. The CI/CD database seeding guide covers the full setup, including ephemeral per-PR databases and non-interactive mode for the CLI.

Even without full CI integration, you can add a one-liner to your PR template:

## Migration Checklist
- [ ] Tested against production-scale data (`seedfast seed --scope "..."`)
- [ ] Migration completes in under 30 seconds on 1M+ rows
- [ ] No ACCESS EXCLUSIVE locks held for more than 5 seconds
- [ ] Uses CONCURRENTLY for index creation (if applicable)
- [ ] Backfills are batched (if applicable)

A migration that scans or rewrites a large table is just one of the bugs that only real test data catches — the same volume blind spot that hides slow queries and N+1s also hides slow migrations. A few additional things that catch teams off guard at scale:

Transaction-wrapped migrations lock longer than you think. Most frameworks run each migration inside a transaction. That means the lock acquired at the start isn't released until the entire migration file finishes — including any backfills, constraint validations, or index builds inside the same transaction.

CONCURRENTLY can't run in a transaction. If your framework wraps migrations in transactions (Rails, Flyway, Alembic by default), CREATE INDEX CONCURRENTLY will fail. You need to configure your framework to run specific migrations outside a transaction block.

pg_repack is your friend for zero-downtime rewrites. When you need to rewrite a table (e.g., changing a column type), pg_repack can do it with minimal locking by rebuilding the table in the background. But you need to test it against real data volumes first — because pg_repack on a 50M-row table still takes significant time and I/O.

Autovacuum matters. A large UPDATE for backfill generates dead tuples. If autovacuum can't keep up, table bloat causes subsequent queries to slow down. Test the migration and then check table bloat with pg_stat_user_tables.

Checklist for any migration touching a table with more than 100K rows:

  • Seed production-scale data locally
seedfast seed --scope "seed [your production row count] [table] with related records"
  • Time the migration. If it takes more than 10 seconds, consider batching or alternative approaches.
  • Monitor locks during execution. An ACCESS EXCLUSIVE lock held for more than a few seconds will impact production traffic.
  • Test the rollback too. A 30-second migration with a 5-minute rollback is a risky deploy.
  • Run it during low traffic if the lock duration is unavoidable. But know exactly how long "unavoidable" is — in seconds, not "it should be quick."
  • Check CONCURRENTLY support in your migration framework for index operations.
  • Validate constraints separately using NOT VALID + VALIDATE CONSTRAINT for foreign keys and check constraints.
  • Batch backfills — never UPDATE millions of rows in a single statement.

The common thread: every item on this list requires production-scale data to validate. You can't measure lock duration on 50 rows. You can't benchmark a backfill on an empty table. You can't discover that CONCURRENTLY fails inside your framework's transaction wrapper without actually running it.

You test a migration by running it against a database seeded with production-scale data and measuring how long it locks each table. Stand up a database on the current schema, seed it to the same row counts as production, run the migration with a timer, and monitor pg_locks in a second session while it executes. The lock duration you observe there is the lock duration production will see — except production has live traffic queuing behind it.

Because most migration cost scales with row count, and a development database has almost no rows. A backfill UPDATE, a non-concurrent CREATE INDEX, or an integer-to-bigint rewrite touches every row, so the same statement that finishes in under a millisecond on 50 dev rows can hold a lock for 40 seconds on 8 million production rows. Nothing about the statement changed between the two environments — only the number of rows it has to touch.

Changing a column to an incompatible type (such as integer to bigint), adding a column with a volatile default, and some SET/DROP operations rewrite the whole table under an ACCESS EXCLUSIVE lock. Operations that are binary-coercible do not rewrite: widening a varchar, varchar to text, and — since PostgreSQL 11 — adding a column with a constant default are metadata-only and instant at any size.

Use CREATE INDEX CONCURRENTLY, which builds the index without blocking writes. It takes roughly 2–3× longer than a plain CREATE INDEX and cannot run inside a transaction block, so most migration frameworks (Rails, Flyway, Alembic) need that specific migration configured to run outside their default transaction wrapper. A plain CREATE INDEX holds a SHARE lock that queues every write for the entire build.

You can, but a production copy carries PII and compliance exposure, and it drifts from the schema between refreshes. Generating production-scale data from the schema gives you the same row counts and lock behavior without real records in a lower environment. One caveat for regulated codebases: Seedfast reads your schema — the table and column definitions, not the rows — and uses an external AI service to generate from it, so confirm that schema handling fits your own data-governance rules before relying on it.

Seedfast reads your live schema, resolves foreign-key order, and seeds production-scale, FK-valid data in one command, so you can time any migration against realistic volume before deploying. It connects to a database to read the schema — the structure, not your production rows — so you point it at a local or throwaway CI database. It's free to start with no credit card, and because it re-reads the schema on every run, the same command keeps working after the migration changes the table.

Get Started | Documentation | Pricing

Seedfast generates production-scale test data from your schema description. Seed millions of rows, test your migration, and deploy with confidence.