All posts

Synthetic Test Data: Schema-Aware Generation on PostgreSQL

By the Seedfast team · Updated

  • Hand-written seed.sql, Faker scripts, and factory libraries break on every migration that adds a NOT NULL column or tightens a constraint — the maintenance cost scales with schema churn, not with team size
  • Synthetic test data is data generated from a database schema's structure and constraints, not copied from production and not random strings stitched together by hand
  • Schema-aware tools like Seedfast read PostgreSQL's foreign-key graph on every run, generate a valid insertion order, and handle circular references and CHECK constraints without a fixture file to update
  • Generating from schema metadata instead of cloning production typically removes a category of compliance scope under HIPAA, GDPR, and PCI DSS — though the generation pipeline itself still belongs in your security review

If the issue is fixture maintenance (every migration breaks seed.sql), schema-aware generation removes the file. If the issue is compliance scope (PII in staging, cross-border transfer concerns, BAA chain), generating from schema metadata sidesteps the source-of-data question. If the issue is realistic distributions for query-planner-faithful tests, no generator gets you all the way — you supplement a synthetic baseline with hand-curated edge cases. Most teams hit the first two together, and that's where Seedfast slots in: connect once, seed on demand against the live schema.

Synthetic test data is data generated to satisfy a database schema — its tables, columns, types, foreign keys, uniqueness rules, and check constraints — without sourcing rows from production. The contrast that matters in practice is with two adjacent approaches. A pg_dump of production gives you real shape and real distributions, but it also copies every row, which is the part regulated teams cannot keep doing. A Faker-style script gives you control over individual columns, but column-level randomness does not respect cross-table relationships, so the generated rows fail at the first foreign key. Synthetic test data, when generated by a tool that reads the schema, sits in the middle: realistic shape, no production records, FK-valid out of the box.

If your team runs integration tests against PostgreSQL, you've felt the pain: fragile seed scripts, broken foreign keys after migrations, and test data that doesn't look like reality.

Three problems compound as your schema grows.

Referential integrity isn't optional. When a dataset breaks relationships, tests stop reflecting real application behavior. Foreign key errors, missing dependencies, or uniqueness collisions turn test runs into debugging sessions — not of your product, but of your fixtures.

Migrations silently break fixtures. Each migration can change what "valid data" means: new NOT NULL fields, tighter CHECK constraints, additional UNIQUE rules, or new tables in dependency chains. Handwritten seed scripts and generators like Faker encode assumptions the schema has already outgrown.

A common example: your seed.sql inserts users, then a migration adds team_id NOT NULL REFERENCES teams(id), and the seed file breaks — someone has to fix it for every such change. Multiply that across 50+ tables and fixture repair becomes a recurring tax on every migration.

CI/CD requires predictability. Modern pipelines create and refresh environments constantly — PR workflows, nightly runs, regression suites, migration testing. Whether you're seeding a dev database, staging environment, or ephemeral test database in CI/CD, the process must be predictable or flakiness becomes normal.

A useful synthetic dataset has to satisfy four things at once, and most traditional approaches miss at least one. It has to be valid — every row clears NOT NULL, CHECK, and UNIQUE. It has to be connected: inserts come in an order that respects the foreign-key graph, even when the graph contains a cycle. The values have to be realistic enough to read, so a failed test log isn't a decoding exercise. And it has to be safe — no production rows in the pipeline at any point, which is the part pg_dump + anonymization can't credibly claim.

Realism is the property worth unpacking, because it's where Seedfast diverges most from traditional generators. Here's the same products table seeded two ways.

A Faker script with lorem.sentence() and lorem.words() returns prose like:

name: "Versatile actuating framework"
description: "Billion mouth support feeling prove town cold own firm stuff."
review: "Story leader choice despite building church. Mind wait less able."

That output is what Faker's text providers produce by default. It's not strawmanned — it's what you get unless every column has a custom provider attached. Seedfast, run against the same schema, returns:

name: "Sonic Q1 Wireless Headphones"
description: "Noise-cancelling over-ear headphones with 30-hour battery life and multipoint connection"
review: "Great sound quality, but the ear cups could be more breathable for long sessions"

Faker's text is technically valid. But when a test fails and the log shows "Versatile actuating framework" at $29.99, the row tells you nothing — you're decoding before you can debug. With "Sonic Q1 Wireless Headphones" at $348, the failure mode is obvious at a glance. Readability is a side effect of the underlying difference: Seedfast does an LLM-based domain-inference pass on the schema (table names, column names, types, constraints, comments) before generation, so column intent flows into row content. There's no seed file or factory definition to keep in sync with migrations — the schema is the input.

That domain-inference pass is the mechanism behind the next three properties, so it's worth being explicit about what it does.

Column intent. Seedfast looks at the table-and-column context together — users.bio is a bio, reviews.review_comment is a product review, companies.legal_name is a company name. A TEXT column without semantic naming gets a generic string; a TEXT column with a recognizable name gets context-appropriate prose. Faker, by contrast, doesn't know the column is called bio — it just generates whatever the provider you wired up returns.

The --scope flag is where this gets interesting in practice. The flag accepts natural language, and it lets you steer the entire dataset toward a domain:

seedfast seed --scope "seed an e-commerce store with electronics products"

Run that scope and the generated products, categories, prices, and reviews stay in the same domain — laptop descriptions land under electronics pricing, not "Servers & Networking" with a kitchen blender. You get test scenarios that mirror real business domains without writing domain-specific factory code per table.

Cross-column coherence falls out of the same pass. When users.country is "Japan", Seedfast picks a phone number in the right format and a culturally plausible name. When a product sits in an "Electronics" category, its description reads like electronics, not produce. Traditional generators populate each column independently, which produces rows that pass constraint checks but fail visual inspection — and that's where dashboards built on seeded data become unreadable in code review.
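Mechanically, "coherence" means later columns are sampled conditional on earlier ones rather than independently. A toy sketch makes this concrete; the lookup table and names below are hypothetical illustrations, not Seedfast's actual model:

```python
import random

# Hypothetical locale table standing in for the domain-inference pass.
LOCALES = {
    "Japan":  {"phone_prefix": "+81", "names": ["Yuki Tanaka", "Haruto Sato"]},
    "France": {"phone_prefix": "+33", "names": ["Camille Martin", "Louis Bernard"]},
}

def coherent_user(country: str, rng=random) -> dict:
    """Sample name and phone conditional on country, so the row reads as one person."""
    locale = LOCALES[country]
    return {
        "country": country,
        "name": rng.choice(locale["names"]),
        "phone": f"{locale['phone_prefix']}-{rng.randint(10_000_000, 99_999_999)}",
    }

print(coherent_user("Japan"))  # name and phone format both match the country
```

An independent-columns generator draws country, name, and phone from three unrelated providers, which is exactly how "Japan" ends up paired with a +33 phone number.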

Distribution shape matters too, even though it's the property generators most often skip. Real tables don't have uniformly distributed values: a users table has a few heavy-posters and many dormant accounts; an orders table clusters around a handful of popular products. Generators that ignore these skews leave PostgreSQL's query planner with statistics that don't reflect production, so EXPLAIN ANALYZE on a seeded database tells you very little about how the query will behave under real load. The deeper rabbit hole on this is covered in the data volume guide; the short version is that uniform synthetic data and small-volume datasets share the same failure mode.
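The gap shows up even in a toy simulation. Below, the same 10,000 orders are spread across 100 products first uniformly, then with a rough 1/rank (Zipf-style) weighting; only the second gives the planner hot rows to notice. This is a sketch of the idea, not Seedfast's sampler:

```python
import random
from collections import Counter

random.seed(7)
products = [f"product_{i}" for i in range(100)]

# Uniform: every product equally likely, the default for most generators.
uniform_orders = random.choices(products, k=10_000)

# Zipf-style: weight ~ 1/rank, so a handful of products dominate.
weights = [1 / (rank + 1) for rank in range(100)]
skewed_orders = random.choices(products, weights=weights, k=10_000)

def top5_share(orders):
    """Fraction of all orders captured by the five most-ordered products."""
    return sum(n for _, n in Counter(orders).most_common(5)) / len(orders)

print(f"uniform top-5 share: {top5_share(uniform_orders):.2f}")  # ~0.06: no hot rows
print(f"skewed  top-5 share: {top5_share(skewed_orders):.2f}")   # ~0.44: clear hot rows
```

After ANALYZE, PostgreSQL's pg_stats view records those hot products in most_common_vals/most_common_freqs for the skewed table, which is what steers the planner; the uniform table records nothing notable.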

The contrast with pg_dump + anonymization is structural, not stylistic. A masked dump gives you real shape but the underlying record is a customer's — incomplete anonymization is a recurring real-world failure mode, and it's why compliance risk in staging keeps showing up in audits. A hand-written seed.sql file gives you control over content, but creates a maintenance surface that grows with every schema change. Seedfast sidesteps both: data is created from the schema definition, so there's nothing to mask and nothing to babysit. Try it on a test schema in about two minutes — 30-day free trial, no credit card.

The hardest part of generating multi-table data isn't the values — it's the order. Consider a minimal e-commerce schema:

users         ──┐
                ├──► orders ──► order_items ──► products
addresses ◄─────┘                                 ▲
                                                  │
reviews ──────────────────────────────────────────┘

To insert a single row into order_items, the database needs an existing orders.id and products.id. To insert into orders, it needs an existing users.id and addresses.id. A valid insertion order has to respect every one of these edges.

Seedfast reads the foreign-key graph from PostgreSQL on each run, then runs a topological sort to pick an insertion order that never violates an FK. When the graph contains a cycle — users.primary_address_id → addresses.id and addresses.user_id → users.id is the canonical example — Seedfast inserts one side with the FK temporarily NULL, inserts the other side, then updates the first row once both IDs exist. Handwritten seed.sql files break here because the author encodes the topological sort by hand and has to re-encode it on every migration that adds a new edge.
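The ordering step is classic topological sorting, and it fits in a few lines. The sketch below (an illustration of the technique, not Seedfast's implementation) orders the example schema parents-first and breaks the users/addresses cycle by dropping the nullable primary_address_id edge from the sort:

```python
def insertion_order(tables, fk_edges, deferred=frozenset()):
    """Kahn-style topological sort. fk_edges are (child, parent) pairs;
    edges in `deferred` (nullable FKs) are left out of the sort and
    backfilled with UPDATEs once both rows exist."""
    deps = {t: set() for t in tables}
    for child, parent in fk_edges:
        if (child, parent) not in deferred and child != parent:
            deps[child].add(parent)
    order = []
    while deps:
        # Tables whose parents are all inserted are ready; sort for determinism.
        ready = sorted(t for t, parents in deps.items() if not parents)
        if not ready:
            raise ValueError(f"cycle with no nullable FK to defer: {sorted(deps)}")
        for t in ready:
            order.append(t)
            del deps[t]
        for parents in deps.values():
            parents.difference_update(ready)
    return order

tables = ["users", "addresses", "orders", "order_items", "products", "reviews"]
edges = [
    ("orders", "users"), ("orders", "addresses"),
    ("order_items", "orders"), ("order_items", "products"),
    ("reviews", "users"), ("reviews", "products"),
    ("addresses", "users"),   # addresses.user_id
    ("users", "addresses"),   # users.primary_address_id (the cycle)
]
print(insertion_order(tables, edges, deferred={("users", "addresses")}))
# ['products', 'users', 'addresses', 'reviews', 'orders', 'order_items']
```

After all inserts run, the deferred edge becomes a single UPDATE pass that backfills users.primary_address_id, now that both IDs exist.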

Faker-based scripts face a different version of the same gap: faker.name() doesn't know that a user_id in posts must match an existing row in users. You either hand-roll the ordering in JavaScript or TypeScript, or you lean on an ORM-level factory that handles direct associations but still needs maintenance when the schema shifts. Seedfast reads the edges fresh on every run and does the ordering for you, including the edge that yesterday's migration added.

The same discipline applies to unique constraints (generate-then-check, regenerate on collision), to check constraints (respect CHECK (price > 0) when generating numbers), and to composite keys across tables that represent many-to-many joins. None of this is exotic — it's what every production schema looks like, and it's also what every fixture script has to track by hand.
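Generate-then-check is itself only a few lines. The sketch below (illustrative, not Seedfast's code) regenerates on UNIQUE collisions and rejects values that would fail a CHECK (price > 0)-style predicate:

```python
import random

def generate_satisfying(generate, seen, check=lambda v: True, max_tries=100):
    """Draw until the value is unseen (UNIQUE) and passes check (CHECK);
    give up loudly rather than loop forever on a saturated value space."""
    for _ in range(max_tries):
        value = generate()
        if value not in seen and check(value):
            seen.add(value)
            return value
    raise RuntimeError("constraint space exhausted; widen the generator's range")

random.seed(1)

seen_skus = set()
skus = [generate_satisfying(lambda: f"SKU-{random.randint(1, 40)}", seen_skus)
        for _ in range(20)]          # 20 distinct SKUs despite a small value space

seen_prices = set()
prices = [generate_satisfying(lambda: round(random.uniform(-10, 100), 2),
                              seen_prices, check=lambda p: p > 0)
          for _ in range(10)]        # negatives rejected, mimicking CHECK (price > 0)
```

The design choice worth noting is the bounded retry: against a saturated UNIQUE space (say, a BOOLEAN column with three rows requested), failing fast with a clear error beats spinning forever.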

For a growing class of teams, copying production into staging isn't a matter of convenience — it's blocked by regulation or policy.

  • GDPR and Schrems II. EU personal data on a developer laptop in a non-adequate country can constitute a cross-border transfer. Synthetic rows generated from schema metadata typically contain no personal data, which substantially reduces — and in many designs removes — the transfer question.
  • HIPAA. PHI in a test environment typically brings that environment into HIPAA scope — covered entities, BAAs, audit logs, the full surface. Removing real PHI is usually the main lever for keeping the environment out of scope, alongside the usual access and logging controls.
  • PCI DSS. Storing cardholder data in non-production environments pulls those environments into PCI scope. PCI DSS v4.0 (Req. 6.5.5) restricts use of live PANs in pre-production environments.
  • SOC 2. Auditors increasingly flag "production data in dev" as a control weakness under Common Criteria 6 (logical access) and CC 7 (system operations).

Synthetic data doesn't eliminate compliance work — you still need controls around how data is generated, stored, and destroyed, and your compliance team will want to verify the generation pipeline. There's also a real caveat about what counts as "metadata only": column names like ssn or tax_id, embedded comments quoting customers, and DEFAULT values can carry identifiers, so the generation pipeline still belongs in your security review even when no production rows leave the database. What synthetic data does is remove real customer records from a category of environments where they were always an awkward fit. The staging-without-production-data guide goes deeper on the compliance-specific workflows.

For teams in fintech, healthcare, or any cross-border SaaS, this is often the deciding factor — the fixture-maintenance argument is a bonus. See how Seedfast fits in regulated stacks — 30-day free trial, no credit card.

| Approach | Pros | Cons | Cost | Best for |
|---|---|---|---|---|
| Static seed.sql | Simple, version-controlled, no external dependencies | Breaks on schema changes, manual maintenance scales with table count | Free | Tiny schemas that rarely change |
| pg_dump + anonymize | Realistic structure and volumes | Slow, risk of incomplete masking, ongoing export process, regulated data still in scope | Free (your time) | One-off staging fills where policy allows |
| Data masking | Preserves production distributions | Still uses real data as source, potential masking gaps, keeps environment in compliance scope | Tooling + ongoing review | Regulated environments with strict distribution requirements |
| Faker / factory libraries | Flexible, code-controlled; ORM-based factories can handle associations | Manual maintenance, can break on schema changes, no cross-column coherence by default | Free (OSS) | Unit tests, simple integration tests |
| Synthetic (Seedfast) | Schema-aware, no fixture file, no production data in pipeline | Sends schema metadata to the Seedfast service for planning (what crosses the wire); PostgreSQL only; requires network; non-deterministic by default | 30-day free trial, paid plans after | Integration tests, CI/CD, staging, microservice seeding |

The comparison comes down to where you want to spend time — and where you can afford to spend compliance budget. Maintaining fixtures is a cost that scales with schema complexity. Masking production data is a cost that scales with regulatory scope. Generating from the schema is a cost that scales with… almost nothing on the maintenance side, because there's no source data to track and no fixture to rewrite — the trade is moving the line item from "engineer-hours per migration" to "subscription, plus a security review of the generation pipeline." For a wider comparison of generation methods including tools and libraries, see the test data generation pillar. Try Seedfast on a test schema to see which side of that trade fits your team.

Every fixture-based approach — seed.sql, Faker scripts, factory libraries — has the same operational problem: someone has to update it when the schema changes. Seedfast reads the current schema on every run instead. When a developer adds a new table or column, the next seedfast seed picks it up — no seed file to update, no fixture to rewrite. You steer what gets seeded with natural language:

seedfast seed --scope "seed only the checkout flow: carts, cart_items, products, users"

Whether the target is e2e test fixtures, CI/CD integration, or load testing, the test setup collapses to a single command that stays current with the schema. Connect Seedfast to your Postgres database — about two minutes, 30-day free trial, no credit card.

What's the difference between test data and synthetic data?

Test data is any data used during testing — it could be a production copy, a hand-written fixture, or generated values. Synthetic data specifically means the data was created artificially rather than derived from real records. Synthetic test data combines both: artificially generated data purpose-built for testing.

How do you generate synthetic test data for a PostgreSQL database?

The simplest approach is to read the database schema (tables, columns, constraints, foreign keys) and generate rows that satisfy all the rules. Seedfast does this by connecting to your PostgreSQL database, analysing the schema, and producing a valid, connected dataset. You run seedfast connect once, then seedfast seed whenever you need data.

Is synthetic data good enough to replace production data in testing?

For most integration testing, staging, and CI/CD scenarios — yes. Synthetic data covers schema structure, constraint validation, and relationship integrity without involving production records. The main exception is when you need to test against specific production data distributions or volumes, in which case synthetic data works well as a baseline that you supplement with targeted test cases.

How is synthetic data different from masked or anonymized data?

Masked data starts with real production records and transforms sensitive fields — redacting names, shuffling emails, hashing identifiers. Synthetic data never touches production in the first place; every row is generated from the schema. See the staging-without-production-data guide for why masking pipelines often fail in practice.

Is synthetic test data compliant with GDPR and HIPAA?

Synthetic data generated from schema metadata alone is generally not considered personal data under GDPR or protected health information under HIPAA, provided the generation pipeline does not reintroduce identifiers sampled from production. Compliance teams should still validate this for their specific environment — what the pipeline reads, what it logs, and where the generated rows are stored. Seedfast sends the schema definition (table names, column names, types, and constraints) to its planning service for generation, but row data does not leave your database. Schema metadata can itself contain identifiers in certain shapes (column names like ssn, comments quoting customers, default values), so the pipeline still belongs in your security review.

Can synthetic data generators handle foreign keys and circular references?

Yes — this is the core technical difference between schema-aware generators and value-level generators. A schema-aware tool reads the foreign-key graph, topologically sorts the tables, and inserts parents before children. Circular references are handled by inserting one side with a temporarily NULL FK and backfilling once both IDs exist. Faker doesn't do this; Mockaroo supports referenced datasets between tables, but you still wire each relationship by hand in the UI. Seedfast resolves the entire graph automatically on every run.

Does Seedfast work with my ORM or migration tool?

Yes. Seedfast connects directly to your PostgreSQL database and reads the schema — it doesn't depend on your ORM or migration tool. Whether you manage migrations with Prisma Migrate, TypeORM, or raw SQL, Seedfast reads the current database state and generates matching data.

How much does Seedfast cost?

Seedfast offers a 30-day free trial with no credit card required to connect and run the first seed — small schemas typically fit within trial limits. Past the trial, paid plans are listed on the Seedfast pricing page. The CLI surfaces which limit you hit if a schema exceeds the trial allowance.

If your team is still maintaining seed files, hand-rolled Faker scripts, or a pg_dump-and-mask pipeline, Seedfast can generate a valid dataset from your current PostgreSQL schema — without touching production rows. See the getting started guide to seed your first database in under five minutes — 30-day free trial, no credit card.