Synthetic Data for CI/CD: Seed Fresh, Valid Rows on Every Run

By Mikhail Shytsko, Founder at Seedfast

Every CI pipeline that touches a database needs rows to run against, and there are only three ways to get them: restore a copy of production, replay a checked-in seed.sql, or generate the data inside the run. Synthetic data for CI/CD is the third option: generate fresh, referentially valid rows from the schema on every build, so no production data is copied and the data can't go stale between migrations the way a static seed file does. For most teams it's the one option that both stays correct as the schema changes and stays clean of real customer data.

This guide covers why pipeline-time generation beats the other two, what a generator needs to be usable in CI, and the newer pressure behind it: AI coding agents that open pull requests and need a real database to test against.

Generating synthetic data in the pipeline beats the two older approaches (copying production and replaying a static seed.sql) on the two things CI cares about: it moves no real customer data, and it stays in sync with the schema. Each older approach fails one of those.

Copying production drags real customer data into a build environment. That's a compliance problem the moment the schema holds anything personal, and it usually means a masking pipeline bolted on top, plus a slow restore of a database far larger than any test needs. You inherit production's size and its risk to test a pull request.

A checked-in seed.sql avoids the privacy problem but rots. The day someone adds a NOT NULL column or a new foreign key, the file is wrong: it fails, or worse, quietly seeds an incomplete row. Static fixtures drift because the schema moves and the file doesn't. The seed file maintenance problem is exactly this, and every migration charges the tax.
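The drift is easy to reproduce. The sketch below uses Python's built-in sqlite3 as a stand-in for the real database; the users table, its columns, and the seed statement are hypothetical, chosen only to show a seed written for yesterday's schema failing against today's.

```python
import sqlite3

# Hypothetical checked-in seed, written before the migration.
SEED_SQL = "INSERT INTO users (id, email) VALUES (1, 'a@example.com')"

OLD_SCHEMA = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
# The same table after a migration adds a NOT NULL column.
NEW_SCHEMA = (
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, email TEXT NOT NULL, country TEXT NOT NULL)"
)


def apply_seed(schema_sql):
    """Create a fresh database with the given schema and replay the static seed."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(schema_sql)
        conn.execute(SEED_SQL)
        return "ok"
    except sqlite3.IntegrityError as exc:
        # The seed no longer satisfies the table's constraints.
        return f"failed: {exc}"
    finally:
        conn.close()


print(apply_seed(OLD_SCHEMA))
print(apply_seed(NEW_SCHEMA))
```

The first call succeeds; the second fails on the new column's NOT NULL constraint. Had the column been nullable, the insert would have succeeded and left the column empty, which is the quieter failure mode.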

Generating in the pipeline sidesteps both. The data is synthetic, so there are no production records to move or mask, and because it's built from the live schema each run, a new migration is picked up automatically: the pipeline-time seeding step re-reads the schema and writes rows that fit it. The catch is that the generator has to produce valid, connected data unattended, which is where most tools fall down. Seedfast is a schema-aware generator built for that: point it at a connection string and it reads the live schema and foreign-key graph itself.

On the axes that decide whether a seed source survives in CI:

| Approach | Moves production PII? | Stays valid after a schema change? | Sized for the test? | Runs unattended in CI? |
| --- | --- | --- | --- | --- |
| Copy / restore production | Yes (real records) | Only after a fresh copy and re-mask | No (production-sized) | Not without a masking step first |
| Checked-in seed.sql | No | No, breaks when the schema moves | Yes | Until the next migration breaks it |
| Generate in the pipeline | No | Yes, re-reads the live schema each run | Yes | Only with a generator good enough to run on its own |

The bottom row's last column is the honest cost of generating: it works only if the generator holds integrity without a human watching. Run your first seed free to put that on your own schema.

A CI seed step has to clear five bars a point-and-click generator never faces. Miss one and the step either won't run unattended or won't fail loudly enough to trust.

  • A non-interactive command. It runs from a connection string and an API key in an environment variable, with no UI, no login prompt, and no paste step. If a human has to click, it isn't a CI step.
  • A schema read at run time. The seed runs after migrations apply, so the generator must read the current schema then, not a snapshot from when someone configured it.
  • Referential integrity across the foreign-key graph. Data that violates constraints fails the insert; data that skips relationships makes integration tests lie. The generator has to insert parents before children and handle nullable circular foreign keys without hand-holding.
  • Machine-readable output and honest exit codes. A JSON output mode and a non-zero exit on failure are how the build gate knows whether the seed worked.
  • A cost that doesn't scale with run frequency. Seeds fire on every push, every branch, across a job matrix, and a per-row or per-token price turns that volume into a bill that climbs with your commit rate. A flat plan doesn't move.
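To make the "schema read at run time" requirement concrete, here is a minimal sketch, assuming SQLite (via Python's sqlite3) as a stand-in for the real database. The users table and the value rules are hypothetical, and this is not how Seedfast itself works; the point is only that the column list is discovered when the seed runs, so a column added by a migration is filled without touching the seed code.

```python
import sqlite3


def generate_row(conn, table, n):
    """Build one row from the table's columns as they exist right now."""
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk) per column.
    # The table name is interpolated directly, which is acceptable only for a
    # trusted, hard-coded name in a sketch like this one.
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    row = {}
    for _cid, name, col_type, _notnull, _default, _pk in columns:
        if "INT" in col_type.upper():
            row[name] = n  # integer columns get the row number
        else:
            row[name] = f"{name}-{n}"  # everything else gets a labeled string
    return row


def seed(conn, table, count):
    """Insert `count` generated rows that fit the current schema."""
    for n in range(1, count + 1):
        row = generate_row(conn, table, n)
        cols = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        conn.execute(
            f"INSERT INTO {table} ({cols}) VALUES ({marks})", list(row.values())
        )


conn = sqlite3.connect(":memory:")
# Post-migration schema: the seed code never mentions `country` by name.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL, country TEXT NOT NULL)"
)
seed(conn, "users", 3)
rows = conn.execute("SELECT id, email, country FROM users ORDER BY id").fetchall()
print(rows)
```

A real generator would also need realistic values and foreign-key handling; this sketch covers only the timing of the schema read.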

AI coding agents now open pull requests, and agentic test runners drive an app end to end on every PR. Neither one seeds the database it depends on. Both assume realistic, connected rows are already there: the agent that wrote the code and the agent that tests it each expect something else to have populated the database.

That gap closes cleanly when the generator is both a CLI step and an MCP tool. Seedfast exposes the same seed run over the Model Context Protocol, so an agent in Claude Code or Cursor calls seedfast_run to seed the branch database, then lets the tests (human-written or agent-driven) run against it. The seeded-data prerequisite collapses into one tool call. As more of the pipeline runs on agents, "the test data is generated, valid, and current" becomes the assumption everything else rests on.
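For orientation, an MCP tool invocation travels as a JSON-RPC 2.0 request with the method tools/call. The sketch below builds such a message for the seedfast_run tool named above. The arguments payload is an assumption: the scope key simply mirrors the CLI's --scope flag and may not match the tool's actual input schema, which an agent would discover from the server's tool listing.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for an MCP tools/call invocation."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "seedfast_run",
    # Hypothetical argument name, mirroring the CLI's --scope flag.
    {"scope": "realistic accounts, orders, and line items"},
)
print(json.dumps(request, indent=2))
```

In practice the MCP client inside the agent constructs and sends this message; nobody writes it by hand.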

Seedfast runs as a single step after migrations: one command points at the database, reads the live schema and foreign-key graph, and writes connected rows.

# .github/workflows/test.yml (excerpt)
- name: Seed test database
  run: npx seedfast seed --scope "realistic accounts, orders, and line items" --output json
  env:
    SEEDFAST_API_KEY: ${{ secrets.SEEDFAST_API_KEY }}
    SEEDFAST_DSN: ${{ secrets.TEST_DATABASE_URL }}

The --scope is plain English: Seedfast generates connected accounts, orders, and line items with valid foreign keys between them, sized to the test rather than to production. --output json returns a machine-readable result, and the command exits non-zero on failure so the build gate can read it. Add a table or column in a later migration and the next run picks it up, with nothing to reconfigure. The SEEDFAST_API_KEY comes from a free Seedfast account, and the CI/CD database seeding guide has the full GitHub Actions and GitLab CI setup, including per-environment keys and exit-code handling.
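If the seed runs from a wrapper script rather than directly in the workflow, the gate logic might look like the sketch below. It assumes only what is stated above: a non-zero exit on failure and JSON on stdout. The shape of that JSON isn't specified here, so the check confirms that the output parses and does not look for particular fields; the function names are hypothetical.

```python
import json
import subprocess


def seed_succeeded(returncode, stdout):
    """Trust the exit code first, then confirm the output is parseable JSON."""
    if returncode != 0:
        return False
    try:
        # Assumption: stdout holds a single JSON document and nothing else.
        json.loads(stdout)
    except json.JSONDecodeError:
        return False
    return True


def run_seed(scope):
    """Run the seed command from the workflow excerpt and report pass/fail."""
    result = subprocess.run(
        ["npx", "seedfast", "seed", "--scope", scope, "--output", "json"],
        capture_output=True,
        text=True,
    )
    return seed_succeeded(result.returncode, result.stdout)
```

A wrapper would end with something like `sys.exit(0 if run_seed("realistic accounts, orders, and line items") else 1)`, so the build gate sees the failure as its own non-zero exit.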

For ephemeral databases this pairs with branch-per-PR workflows: the Neon branching seed data guide covers seeding a fresh branch database per run, and E2E test fixtures covers generating data for Playwright and Cypress.

Not every test-data tool clears the bars a CI seed step has to meet. Web-based column generators like Mockaroo are built for clicking, so they don't drop into an unattended command. ORM-coupled seeders like Drizzle's and Prisma's seed scripts need your application code running and only work inside one framework. Production-copy tools like pg_dump and Tonic hand back the privacy and size problems you came here to avoid. Even Faker-based scripts generate columns in isolation, with no foreign-key graph to keep the rows connected. Seedfast takes a connection string, reads the live schema, and holds referential integrity with nobody watching.

If you're cross-shopping on that axis, the best Postgres test data generator comparison ranks tools on schema-awareness and CI fit, and the data seeding tools guide covers the regulated-industry angle where copying production isn't allowed at all. For the broader strategy of where test data comes from and how it stays valid, see test data management.

Synthetic data for CI/CD is test data generated inside the pipeline from your database schema, after migrations apply, rather than copied from production or replayed from a static seed file. Because it's regenerated from the live schema each run, it stays valid as the schema changes, and because it's synthetic it carries no real customer data.

A copied dump is production-sized, so a multi-gigabyte restore can take minutes on every job, and it usually needs a masking step before it's safe to use. A generated seed is sized to the test and skips masking entirely, because there's no real data in it to mask. The dump also reflects whatever the schema looked like when it was taken; the generator re-reads the current schema instead.

A generator has to keep foreign keys valid, or the seed is useless. A generator built for CI reads the foreign-key graph, inserts parent rows before child rows, and handles circular references. Seedfast resolves the graph topologically and, for nullable circular foreign keys, fills the back-reference in a second pass, so referentially valid data lands in one command without manual ordering.
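The parents-before-children ordering and the second pass can be sketched with Python's standard graphlib module. This is an illustration of the general technique, not Seedfast's implementation. The table names and the nullable back-reference are hypothetical, and the sketch takes a shortcut: it defers every nullable foreign key, not only those that take part in a cycle.

```python
from graphlib import TopologicalSorter

TABLES = ["accounts", "orders", "line_items"]

# Each foreign key is (child_table, parent_table, nullable).
FOREIGN_KEYS = [
    ("orders", "accounts", False),
    ("line_items", "orders", False),
    # Hypothetical nullable back-reference, e.g. accounts.last_order_id.
    ("accounts", "orders", True),
]


def plan_inserts(tables, foreign_keys):
    """Return (insert_order, deferred): parents first, nullable FKs filled later."""
    required = {table: set() for table in tables}
    deferred = []
    for child, parent, nullable in foreign_keys:
        if nullable:
            # First pass inserts NULL; a second pass updates it once the parent exists.
            deferred.append((child, parent))
        else:
            required[child].add(parent)
    # static_order() yields each table after all of its required parents,
    # and raises graphlib.CycleError if the non-nullable FKs still form a cycle.
    order = list(TopologicalSorter(required).static_order())
    return order, deferred


order, deferred = plan_inserts(TABLES, FOREIGN_KEYS)
print(order)
print(deferred)
```

With the nullable edge set aside, the remaining graph is acyclic: accounts are inserted first, then orders, then line items, and the back-reference on accounts is filled afterwards.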

No rows leave your database. The generator reads the schema's shape (table and column names, types, and constraints) to plan the data, then generates the values itself; your production records are never involved, because they aren't in the test database to begin with. If a table or column name is itself sensitive, confirm that path fits your policy before wiring it into CI.

Whether seeding on every run gets expensive depends on the pricing model. A four-job matrix on a twenty-push day is eighty seed runs; a per-row or per-token generator bills all eighty, scaling with your commit rate. Seedfast charges a flat monthly rate with unlimited seeds, so eighty runs cost the same as one. The price tracks your table count, not how often the pipeline fires.
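The arithmetic can be worked through in a few lines. The rows-per-run figure and the per-row price below are made-up numbers for illustration only, not any vendor's actual pricing; amounts are kept in integer cents to avoid rounding noise.

```python
def daily_runs(jobs_per_push, pushes_per_day):
    """Every push triggers one seed run per matrix job."""
    return jobs_per_push * pushes_per_day


def usage_bill_cents(runs, rows_per_run, cents_per_1000_rows):
    """Usage-based billing grows linearly with the number of runs."""
    return runs * rows_per_run * cents_per_1000_rows // 1000


runs = daily_runs(jobs_per_push=4, pushes_per_day=20)

# Hypothetical figures: 5,000 rows per seed at 10 cents per 1,000 rows.
one_run = usage_bill_cents(1, rows_per_run=5000, cents_per_1000_rows=10)
all_runs = usage_bill_cents(runs, rows_per_run=5000, cents_per_1000_rows=10)

print(runs, one_run, all_runs)
```

Under these assumed numbers one run costs 50 cents and the day's eighty runs cost 4,000 cents. A flat plan's cost is a constant that does not take the run count as an input at all.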

Seedfast connects to your PostgreSQL database, reads the schema and foreign-key graph, and generates connected, realistic data in one command that runs as a CLI step or an MCP tool. The data is fresh and referentially valid on every pipeline run, with no copied production data and no seed.sql to patch. Set up CI/CD seeding or run your first seed in about two minutes, free to start. See pricing for flat plans.
