All posts

Review SQL Migrations in 30 Seconds: Seed, Migrate, Compare

By Mikhail Shytsko, Founder at Seedfast · · Updated

You review application code with tests. You review migrations by reading SQL and hoping for the best. Here's how to actually validate what a migration does before you approve it.

It's Wednesday morning. You have three PRs to review before standup. Two are application code — you check the tests, read the logic, leave a comment about a missing null check, approve the rest. The third PR adds a migration: ALTER TABLE users ALTER COLUMN email TYPE TEXT. You read the SQL. It looks fine. You approve.

On Thursday, the migration runs against production. The users table has 4 million rows. The column type change triggers a full table rewrite. That's not the problem — the team expected that. The problem is what the change quietly removes: the old VARCHAR(255) column capped every email at 255 characters, and a downstream billing export to a partner system relies on that cap to fit its own fixed-width email field. Once the column becomes TEXT with no length limit, the next batch of sign-ups with longer addresses flows straight through and the partner export starts rejecting payloads it was never built to hold.

The migration was syntactically correct. The schema change made sense. But the effect of that migration on real data was something nobody looked at. You approved a diff. You didn't approve an outcome.

  • Reading migration SQL checks syntax and intent. It cannot show you what the migration does to real data — NULLs, duplicates, edge-case strings, and out-of-range values that only exist once a table has been in production a while.
  • A migration review needs three things a diff can't give you: the before state, the after state, and a diff of the data, not just the schema.
  • The bug is usually in the data, not the SQL. A syntactically valid UPDATE ... = a || b still produces NULLs, a valid ADD CONSTRAINT UNIQUE still fails on existing duplicates, a valid type cast still rounds money.
  • Correctness testing (production-shaped data: NULLs, edge cases, ~500 rows) is a different question from performance testing (production-scale data: millions of rows). This article is about correctness, the part that belongs in code review. Its performance cousin is migration testing at scale.
  • Seedfast seeds FK-valid, production-shaped data from your live schema in one command, so a reviewer can run any migration against realistic variety before approving it.

Code review has a well-established loop: read the change, check the tests, verify the behavior matches the intent. When someone adds a new endpoint, you can look at the test suite and see exactly what inputs produce what outputs. When someone refactors a service, the tests prove that existing behavior is preserved.

Migrations have none of this. A typical migration PR looks like:

-- migrations/20260226_consolidate_user_names.sql

ALTER TABLE users ADD COLUMN full_name VARCHAR(200);
UPDATE users SET full_name = first_name || ' ' || last_name;
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;

As a reviewer, you read this and think: "Okay, concatenating first and last name into a single column. Makes sense." You check for obvious mistakes — missing WHERE clause, wrong column name, bad syntax. You approve.

What you didn't check:

  • Are there rows where first_name is NULL? The concatenation produces NULL || ' ' || 'Smith' = NULL. The full_name column now has NULLs where you expected names.
  • Are there rows where first_name or last_name contain leading or trailing spaces? You now have " John " || " " || "Smith" = " John Smith".
  • Does the combined length of first_name + ' ' + last_name ever exceed 200 characters? If it does, the UPDATE silently truncates or throws — depending on your database settings.

These aren't edge cases. They're the normal state of any database that's been in production for more than six months. But you can't see them by reading SQL. You can only see them by running the migration against data that has the same shape as production.

When you review application code, you don't just read the source — you read the test output. The tests tell you: "Given this input, this is what happens." You're reviewing behavior, not just syntax.

Migration reviews should work the same way. What a reviewer actually needs is:

  1. Before state: What does the data look like right now?
  2. After state: What does the data look like after the migration?
  3. Diff: What changed, and does the change match the stated intent?

If a migration says "consolidate first and last name into full_name," the reviewer should be able to see a sample of actual rows before and after. Not imagine them. See them.

The problem is that this requires data. Your local development database has 12 rows, all with perfectly formatted names like "John Smith" and "Jane Doe." That data will never reveal the NULL concatenation bug, the trailing spaces, or the truncation. It's too clean, too small, and too synthetic.

The fix is a three-step workflow that takes about 30 seconds of hands-on time:

Start with a database that has your current schema (before the migration) and fill it with data that has realistic variety — NULLs, edge-case strings, varying lengths, the kind of messiness that accumulates in production.

# Seed users with realistic variation
seedfast seed --scope "500 users with varied names, some with NULL first or last names, international characters"

Seedfast generates data that reflects real-world patterns: some names are short, some are long, some fields are NULL, unicode characters show up where they would in production. This is the critical difference from hand-crafted test fixtures — you get the variety you didn't think to test for. You describe the variety you want in a plain-English scope, and because it reads your live schema each run, the seed keeps producing FK-valid rows even after the migration you're reviewing changes the shape of a table.

Capture the before state, run the migration, capture the after state.

# Snapshot the before state
psql $DATABASE_URL -c "SELECT first_name, last_name FROM users LIMIT 20" > before.txt

# Run the migration
psql $DATABASE_URL -f migrations/20260226_consolidate_user_names.sql

# Snapshot the after state
psql $DATABASE_URL -c "SELECT full_name FROM users LIMIT 20" > after.txt

Now look at what actually happened:

# Check for NULLs that shouldn't be there
psql $DATABASE_URL -c "SELECT count(*) FROM users WHERE full_name IS NULL"

# Check for unexpected whitespace
psql $DATABASE_URL -c "SELECT full_name FROM users WHERE full_name LIKE '%  %' OR full_name LIKE ' %' LIMIT 10"

# Check for truncation
psql $DATABASE_URL -c "SELECT full_name, length(full_name) FROM users ORDER BY length(full_name) DESC LIMIT 5"

If the migration is correct, these queries return clean results. If it's not, you just caught a production bug from your laptop in under a minute.

The migration:

UPDATE users SET full_name = first_name || ' ' || last_name;

The bug: Any row where first_name or last_name is NULL produces a NULL full_name. In PostgreSQL, NULL concatenated with anything is NULL — the string operators short-circuit to NULL the moment one operand is NULL.

What the reviewer sees after seeding and running:

 count
-------
   47
(1 row)

47 users just lost their names. The fix is obvious once you see it — wrap each column in COALESCE so a NULL becomes an empty string instead of poisoning the whole expression:

UPDATE users SET full_name = COALESCE(first_name, '') || ' ' || COALESCE(last_name, '');

But you'd never catch this by reading the original SQL — because the SQL is syntactically valid and logically reasonable. The bug is in the data, not the code.

The migration:

ALTER TABLE orders ALTER COLUMN shipping_address SET NOT NULL

The reviewer reads this and thinks: "Good, shipping address should be required." They approve. In production, 8% of orders are digital-only — no shipping address. The migration fails on deploy.

After seeding 500 orders with realistic variety:

psql $DATABASE_URL -c "SELECT count(*) FROM orders WHERE shipping_address IS NULL"
 count
-------
    41
(1 row)

The reviewer now knows: this migration needs a backfill step first, or it needs to handle digital orders differently.

The migration:

ALTER TABLE products ALTER COLUMN price TYPE INTEGER USING price::INTEGER;

The intent: "We don't need decimal precision, let's simplify." The reviewer checks the cast syntax — looks correct. Approved. But ALTER COLUMN ... TYPE rewrites every value through the USING cast, and casting numeric to integer rounds each price to a whole number, dropping the cents.

After seeding:

  price
---------
  29.99
  14.50
  99.95
  7.25
  149.99

Every one of those prices is rounded to the nearest whole number — $29.99 becomes $30, $7.25 becomes $7. The cents are gone. Across thousands of products, financial reports drift by thousands of dollars. The reviewer catches this instantly by looking at the data, not the DDL.

The migration:

ALTER TABLE users ADD COLUMN role VARCHAR(20) NOT NULL DEFAULT 'user'

Seems fine. But after seeding and inspecting existing rows:

   role   | count
----------+-------
 user     |   412
 admin    |    23
 manager  |    65

Wait — all existing rows now have role = 'user', including the 23 admins and 65 managers. If the application was previously storing role information in a different table or field, this migration just demoted every admin to a regular user. The DEFAULT applies to existing rows because the column is new, and every row gets the default.

The fix: add the column as nullable first, backfill from the authoritative source, then add the NOT NULL constraint.

The migration:

ALTER TABLE customers ADD CONSTRAINT unique_email UNIQUE (email)

After seeding:

         email          | count
------------------------+-------
 john.smith@example.com |     3
 info@company.org       |     2

The migration will fail on deploy. Duplicate emails exist — maybe from a legacy import, maybe from a bulk account creation that didn't enforce uniqueness, maybe from a bug that was fixed years ago but left dirty data behind. The constraint is correct, but the data isn't ready for it.

That's five different bugs with five different fixes — a COALESCE, a backfill, a column redesign, a data cleanup. What they share isn't the fix; it's that none of them are visible in the SQL. The only way to find them is to seed realistic data, run the migration, and look at what actually changed before it reaches production.

Wrap the workflow into a script that any reviewer can run:

#!/bin/bash
# review-migration.sh — Validate a migration against realistic data
set -e

MIGRATION_FILE=$1
DATABASE_URL=${DATABASE_URL:-"postgres://dev:dev@localhost:5432/migration_review"}

if [ -z "$MIGRATION_FILE" ]; then
  echo "Usage: ./review-migration.sh <migration-file>"
  exit 1
fi

echo "=== Step 1: Reset database ==="
dropdb --if-exists migration_review
createdb migration_review

echo "=== Step 2: Apply base schema ==="
for f in migrations/*.sql; do
  if [ "$f" != "$MIGRATION_FILE" ]; then
    psql $DATABASE_URL -f "$f" 2>/dev/null || true
  fi
done

echo "=== Step 3: Seed realistic data ==="
seedfast seed --scope "Realistic data across all tables, include NULLs and edge cases" --output plain

echo "=== Step 4: Capture before state ==="
pg_dump $DATABASE_URL --data-only --inserts > /tmp/before_data.sql
psql $DATABASE_URL -c "\d+" > /tmp/before_schema.txt

echo "=== Step 5: Run migration ==="
echo "Running: $MIGRATION_FILE"
if ! psql $DATABASE_URL -f "$MIGRATION_FILE" 2>/tmp/migration_errors.txt; then
  echo "MIGRATION FAILED:"
  cat /tmp/migration_errors.txt
  exit 1
fi

echo "=== Step 6: Capture after state ==="
psql $DATABASE_URL -c "\d+" > /tmp/after_schema.txt

echo "=== Step 7: Show schema diff ==="
diff /tmp/before_schema.txt /tmp/after_schema.txt || true

echo "=== Step 8: Validate data integrity ==="
psql $DATABASE_URL -c "
  SELECT table_name, column_name
  FROM information_schema.columns
  WHERE table_schema = 'public'
    AND is_nullable = 'NO'
    AND column_default IS NULL
  ORDER BY table_name, column_name;
"

echo "=== Done. Review the output above. ==="

Now any reviewer can run ./review-migration.sh migrations/20260226_consolidate_user_names.sql and see the actual effect of the migration before approving the PR.

Don't rely on reviewers remembering to run the script. Make it part of your PR pipeline.

name: Migration Review Check

on:
  pull_request:
    paths:
      - 'migrations/**'

jobs:
  validate-migration:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: review
          POSTGRES_USER: review
          POSTGRES_PASSWORD: review
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Apply base schema
        run: |
          for f in migrations/*.sql; do
            psql $DATABASE_URL -f "$f" 2>/dev/null || true
          done
        env:
          DATABASE_URL: postgres://review:review@localhost:5432/review

      - name: Seed realistic data
        run: seedfast seed --scope "Realistic dataset across all tables, include NULLs and varied data" --output plain
        env:
          SEEDFAST_API_KEY: ${{ secrets.SEEDFAST_API_KEY }}
          DATABASE_URL: postgres://review:review@localhost:5432/review

      - name: Apply new migration(s)
        run: |
          NEW_MIGRATIONS=$(git diff --name-only origin/main -- migrations/ | sort)
          for f in $NEW_MIGRATIONS; do
            echo "--- Applying: $f ---"
            psql $DATABASE_URL -f "$f"
          done
        env:
          DATABASE_URL: postgres://review:review@localhost:5432/review

      - name: Validate post-migration state
        run: |
          echo "=== Row counts after migration ==="
          for table in $(psql $DATABASE_URL -t -c "SELECT tablename FROM pg_tables WHERE schemaname='public'"); do
            psql $DATABASE_URL -c "SELECT '$table' as table_name, count(*) as rows FROM $table"
          done

          echo "=== Check for unexpected NULLs ==="
          psql $DATABASE_URL -c "
            SELECT c.table_name, c.column_name, c.is_nullable
            FROM information_schema.columns c
            WHERE c.table_schema = 'public'
            ORDER BY c.table_name, c.ordinal_position;
          "
        env:
          DATABASE_URL: postgres://review:review@localhost:5432/review

When a PR adds a migration, this pipeline seeds realistic data, runs the migration, and reports the outcome. Reviewers get concrete evidence in the PR checks instead of relying on their ability to mentally execute SQL. The CI/CD database seeding guide covers the full pipeline setup, including ephemeral per-PR databases so each migration is reviewed against a fresh, isolated copy.

There's an important distinction here. Testing migrations against production-scale data (millions of rows) answers the question: "Will this migration be fast enough?" That's about lock duration, rewrite time, and index build speed — the failure mode behind the ALTER TABLE that locked an 8-million-row table for 40 minutes.

Testing migrations against production-shaped data (realistic variety, NULLs, edge cases, messy strings) answers a different question: "Will this migration produce correct results?" That's about data integrity, constraint violations, and silent data loss. It's the same volume-and-variety blind spot that hides the bugs only real test data catches — except here the cost isn't a slow query, it's wrong data written to every row.

You don't need 5 million rows to catch a NULL concatenation bug. You need 500 rows with realistic NULL patterns. You don't need production scale to discover that a unique constraint will fail on duplicates. You need a dataset that contains duplicates.

This article is about the second kind of testing. It's about giving reviewers the ability to see what a migration does to data, not just how long it takes. Both are important. Both require realistic data. But they answer different questions, and they fit into different parts of the development process.

Performance testing happens before deploy. Correctness testing happens during code review.

Before approving any PR that includes a schema migration:

  • Seed realistic data and run the migration locally (or check the CI results)
  • Verify no unexpected NULLs were introduced
  • Verify no data was silently truncated or rounded
  • Verify row counts are unchanged (unless rows are intentionally removed)
  • Check that new DEFAULT values make sense for existing rows, not just new ones
  • If adding NOT NULL: confirm no existing rows violate the constraint
  • If adding UNIQUE: confirm no existing duplicates
  • If adding a FOREIGN KEY: confirm all referenced rows exist and referential integrity holds across the data you already have
  • If changing a column type: confirm all existing values fit the new type
  • String operations handle NULLs (use COALESCE or explicit NULL checks)
  • Backfill queries produce correct results on a sample of real-shaped data
  • The migration is idempotent or has a working rollback
  • PR includes what the migration does AND why (not just the SQL)
  • If the migration is multi-step, each step's purpose is documented
  • CI migration validation check passed

You review a migration by running it against realistic data and comparing the before state, the after state, and the diff — not just reading the SQL. Start a database on the current schema, seed it with production-shaped data (NULLs, edge-case strings, duplicates, varied lengths), snapshot the affected rows, run the migration, and snapshot again. The difference between the two snapshots is the thing you're actually approving. Reading the DDL only tells you the migration is syntactically valid; running it tells you what it does.

Because most migration bugs live in the data, not the syntax. UPDATE users SET full_name = first_name || ' ' || last_name is valid SQL that silently writes NULL for every row with a NULL name. ADD CONSTRAINT ... UNIQUE is valid SQL that fails on deploy if duplicates already exist. A type cast to INTEGER is valid SQL that rounds every price. The SQL reads correctly in all three cases; the defect only appears when the statement meets real-shaped data.

Performance testing uses production-scale data (millions of rows) to answer "will this migration be fast enough?"; correctness testing uses production-shaped data (realistic variety at small volume) to answer "will this migration produce the right data?" Lock duration and rewrite time scale with row count, so performance testing needs volume. Constraint violations and silent data loss depend on the shape of the data — a NULL, a duplicate, an out-of-range value — so correctness testing only needs a few hundred rows with realistic variety. The two are separate checks: performance belongs before deploy, correctness belongs in code review.

A few hundred rows with realistic variety is usually enough. You don't need 5 million rows to catch a NULL concatenation bug — you need ~500 rows that include the NULLs, duplicates, and edge-case strings production accumulates. The requirement is variety, not volume: a dataset that actually contains a duplicate email will surface a failing UNIQUE constraint, and one that contains NULL names will surface a NULL concatenation. Clean fixtures with 12 perfectly-formatted rows will surface neither.

You can, but copying production data into dev or CI carries PII and compliance exposure that often outweighs the benefit. Moving real production records into lower environments runs into data-minimization and access-control requirements under regimes like GDPR, HIPAA, PCI DSS, and SOC 2, and the dump drifts from the schema between refreshes. Generating production-shaped data from the schema gives you the same correctness signal — the NULLs, duplicates, and edge cases — without real records in the pipeline. One thing to know before you reach for it in a regulated codebase: Seedfast reads your schema (the table and column definitions, not the rows) and uses an external AI service to generate from it, so confirm that schema handling fits your own data-governance rules. See staging without production data for the full trade-off.

Seedfast reads your live schema, resolves foreign-key order, and seeds FK-valid, production-shaped data in one command, so a reviewer can run any migration against realistic variety before approving it. It connects to a database to read the schema — the structure, not your production rows — so you point it at a local review database or a throwaway CI database, not production. It's free to start with no credit card. Because it re-reads the schema on every run, the same command keeps working after the migration changes a table — you reseed, run the migration, and compare, without maintaining a fixture script that drifts.

We don't approve application code by reading the source and imagining what it does. We run the tests. We look at the output. We verify behavior.

Migrations deserve the same rigor. The SQL might look correct. The intent might be sound. But the only way to know what a migration does to your data is to run it against data that looks like yours.

Seed. Migrate. Compare. It takes about 30 seconds of work, and it catches the data-shape bugs that reading the SQL alone keeps letting through.

Get Started | Documentation | Pricing

Seedfast generates realistic, FK-valid test data from your schema — so you can review what a migration does, not just what it says.