Enterprise Database Test Data: What It Actually Looks Like

By Mikhail Shytsko, Founder at Seedfast

TL;DR: Enterprise database test data is test data managed as a system rather than a script — the point where an organization's environment count, schema size, and compliance footprint outgrow seed scripts and pg_dump snapshots. Teams solve it three ways: subset production, mask production, or generate from scratch. Seedfast generates relational data from the live schema, so regulated teams populate dev, CI, and staging without moving a single production row.

This is the practical companion to the broader test data management category: if that piece defines the discipline, this one shows what enterprise database test data actually looks like in working Postgres at the scale and sensitivity a regulated mid-size company runs. Not the platonic vendor diagram. The real one — the one your CI runs against on Tuesday morning, the one your platform team is tired of fixing.

By "enterprise" we don't mean Fortune 500. We mean the moment your environment count, schema size, and compliance footprint cross a threshold where seed scripts and pg_dump snapshots stop being viable, and you have to think about test data as a system rather than a script.

  • Enterprise database test data isn't just bigger fixture data — it's a different problem with different constraints: scale variance, regulated sensitivity, multi-environment refresh, organizational coordination
  • The three levers — subset production, mask production, synthesize from scratch — solve overlapping problems with different trade-offs; none is the right answer everywhere
  • Different environments deserve different data shapes: a unit test wants a dozen rows, perf wants production-shaped distribution, demo wants a story. One-size-fits-all is the silent source of "works in CI, breaks in staging" bugs
  • Broken tenant isolation and unrealistic timestamps or lifecycle states cause more demo embarrassments and flaky tests than any data-volume problem
  • Between fragile DIY scripts and six-figure enterprise TDM procurement sits a third category — schema-aware generation — that fits regulated mid-size teams who can't touch production but also can't run a multi-quarter rollout

Three things separate enterprise test data from a single team's fixtures: scale, sensitivity, and environment count.

Scale. Postgres tables in production grow into the millions of rows. Schemas grow into the hundreds of tables. Foreign-key chains run six or seven hops deep — a payment_event references a payment references an order references a cart references a customer references an account references a tenant. Hand-rolled seed scripts hit a wall around 30 tables. Past that, somebody has to think about insert order, dependency cycles, and how many order_items should hang off a typical order to look real.

Sensitivity. Production data in a regulated company isn't just sensitive — it's regulated. A copy of prod on a developer laptop can be a finding in your next SOC 2 review, a potential HIPAA violation if the schema includes patient records, or a PCI DSS compliance problem if cardholder fields are involved. The classic move — pg_dump prod, restore to staging, UPDATE users SET email = ... over the top — is the staging-without-prod-data trap most teams already know is broken but haven't replaced.

Environment count. A small team has local + CI + staging. A mid-size regulated company has local-dev, CI, integration, staging, perf, and demo at minimum. Each one wants different data: tiny in CI, representative in staging, production-shaped in perf, narrative in demo. The "one giant snapshot for everything" approach quietly creates more bugs than it solves — tests that pass in CI and fail in staging because the data shape is different.

Four constraints shape every enterprise team's test-data approach. None is optional; the question is how you handle each.

Volume that varies by purpose. Unit tests want a dozen rows. Integration tests want hundreds. Staging wants something representative — five-figure rows for most tables, six-figure for hot ones. Perf wants production-shaped distribution at production-shaped volume. Most teams pick one volume and force every environment to live with it. Bugs follow.

Referential integrity across many related tables. Postgres enforces foreign keys; that's the database's job. The tool's job is figuring out insert order: which tables to fill before which others, how to handle circular references, and what to do when a payment has an FK to an order and the order has an FK back to payment for its latest payment. A 30-table fintech schema has dozens of these dependencies. Hand-rolled scripts handle three or four; past that they collapse.
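The insert-order problem can be sketched as a topological sort over the foreign-key graph. This is a minimal illustration, not how any particular tool works: the table names, the `REQUIRED_FKS` mapping, and the nullable `orders.latest_payment_id` column are all assumptions made up for the example. The cycle is handled by leaving the nullable FK out of the ordering and filling it in a second pass.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical schema: table -> tables it references through NOT NULL FKs.
REQUIRED_FKS = {
    "tenant": set(),
    "account": {"tenant"},
    "customer": {"account"},
    "orders": {"customer"},
    "payment": {"orders"},
    "payment_event": {"payment"},
}

# Hypothetical nullable FK that closes a cycle (orders.latest_payment_id -> payment).
# It stays NULL during the first pass and is filled with an UPDATE afterwards.
DEFERRED_FKS = [("orders", "latest_payment_id", "payment")]


def insert_order(fks):
    """Return tables ordered so every referenced table comes first."""
    # TopologicalSorter takes node -> predecessors, which matches
    # table -> referenced tables. It raises CycleError on a cycle.
    return list(TopologicalSorter(fks).static_order())


def has_cycle(fks):
    """Report whether the FK graph contains a dependency cycle."""
    try:
        insert_order(fks)
    except CycleError:
        return True
    return False


ORDER = insert_order(REQUIRED_FKS)

# Adding the deferred edge back in reproduces the cycle a naive script hits.
WITH_CYCLE = {table: set(parents) for table, parents in REQUIRED_FKS.items()}
for child, _column, parent in DEFERRED_FKS:
    WITH_CYCLE[child].add(parent)
```

With the nullable edge excluded, `ORDER` lists parents before children; with it included, `has_cycle(WITH_CYCLE)` is true, which is the point where a script has to choose which reference to defer.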

Sensitivity gravity. Once a schema includes any regulated field — PHI, cardholder data, account numbers, EU resident PII — the cost of putting production rows anywhere outside production climbs sharply. HIPAA, PCI DSS, SOC 2, and GDPR data minimization all push toward the same outcome: don't move production rows into dev environments at all. The pull of regulated data is gravitational; everything else in your test-data strategy rotates around it. Seedfast was built around exactly this constraint — generate the data fresh from the schema so the regulated rows never leave production in the first place.

Refresh cadence. Staging data refreshed monthly is stale. Refreshed nightly is expensive. Refreshed on-demand requires the refresh to be cheap, fast, and reliable. The cadence question is downstream of the volume question: small data refreshes cheaply and often; production-sized snapshots refresh slowly and rarely. Teams who tie all environments to the same refresh process end up with stale staging and expensive perf runs.

Three working approaches to enterprise database test data. Each one solves part of the problem; none of them solves all of it.

Subset production. Take a slice of prod — a few thousand customers and their related rows — and copy it down. Subsetting tools handle the FK closure: pull customer 1234 and you also pull their orders, payments, addresses, audit rows. The data is real, so your tests behave like production. The trade-off is that subsetted data is still production data. It carries the same regulatory weight as the full database, just with fewer rows. For a HIPAA shop or a PCI-scope team, subsetting alone isn't enough — you'd still have PHI on a developer laptop.
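The FK closure mentioned above can be illustrated with a small in-memory sketch. The `CHILDREN` mapping and the toy `ROWS` are hypothetical stand-ins for catalog metadata and production tables; a real subsetting tool would also follow references upward (for example, from an order to the product it points at), which this sketch leaves out.

```python
from collections import deque

# Hypothetical child relationships: parent table -> [(child table, fk column)].
CHILDREN = {
    "customers": [("orders", "customer_id"), ("addresses", "customer_id")],
    "orders": [("payments", "order_id")],
}

# Toy rows standing in for production tables.
ROWS = {
    "customers": [{"id": 1234}, {"id": 9999}],
    "orders": [{"id": 1, "customer_id": 1234}, {"id": 2, "customer_id": 9999}],
    "addresses": [{"id": 7, "customer_id": 1234}],
    "payments": [{"id": 50, "order_id": 1}, {"id": 51, "order_id": 2}],
}


def subset_closure(root_table, root_ids):
    """Collect ids of every row reachable from the root rows via child FKs."""
    selected = {root_table: set(root_ids)}
    queue = deque([root_table])
    while queue:
        parent = queue.popleft()
        for child, fk in CHILDREN.get(parent, []):
            # Child rows whose FK points at an already-selected parent row.
            hits = {r["id"] for r in ROWS[child] if r[fk] in selected[parent]}
            new = hits - selected.setdefault(child, set())
            if new:
                selected[child] |= new
                queue.append(child)
    return selected


SUBSET = subset_closure("customers", [1234])
```

Pulling customer 1234 selects one order, one address, and one payment, and nothing belonging to customer 9999. Every selected row is still a production row, which is the regulatory point the paragraph makes.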

Mask production. Take that subset (or the full snapshot) and run masking rules over the sensitive columns. Names get replaced with fake names; emails get hashed; card numbers get tokenized. Done well, masking produces realistic-looking data without the regulated values. Done poorly — and it's often done poorly because masking config drifts behind schema changes — masking produces obviously-broken data: emails that don't validate, addresses that don't match cities, foreign keys that reference rows the subsetting step dropped. Masking also requires production database access for the masking step, which is its own access-control problem.
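One way to avoid the "emails that don't validate" failure is deterministic, keyed masking: the same input always maps to the same fake value, so joins on the masked column still line up, and the output keeps a valid shape. The sketch below is an assumption-laden illustration, not a description of any masking product; `MASKING_KEY` and `mask_email` are hypothetical names, and the key would need to live outside the masked environment.

```python
import hashlib
import hmac

# Assumption: a secret stored outside the masked environment, so masked
# values cannot be recomputed from guessed inputs.
MASKING_KEY = b"example-key-rotate-me"


def mask_email(email: str) -> str:
    """Map an email to a stable, syntactically valid fake address."""
    normalized = email.strip().lower().encode()
    digest = hmac.new(MASKING_KEY, normalized, hashlib.sha256).hexdigest()
    # example.com is reserved for documentation, so no real inbox is hit.
    return f"user_{digest[:16]}@example.com"


MASKED = mask_email("Linda@Corp.example")
```

Because the function is deterministic, two tables that both store the same address mask to the same value. It does nothing about the drift problem: a new sensitive column added by a migration stays unmasked until someone adds a rule for it.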

Synthesize from scratch. Generate the data without ever touching production. Schema-aware generators read the schema, walk the foreign-key graph, and produce realistic rows that satisfy constraints. Schema-aware test data generation of this kind needs no access to production data at all: no subset, no mask, no production connection to request. The trade-off is that synthetic data doesn't reproduce production's exact shape: tests never stumble onto a real customer's edge case, because the generator has never seen one. Some teams treat that as a feature; others want at least one perf environment that mirrors production-shaped distributions.

Each lever has a legitimate niche. Synthesizing is the default for environments that should never see production data at all — local-dev, CI, demo. Masking-after-subset still fits perf testing for teams already running a TDM platform. Subsetting fits a narrow case where production access is allowed and the tool's masking is trusted. Most regulated mid-size teams end up with a mix, and the mix is what makes test data hard, not any single lever.

Here's the section the rest of the test-data discourse mostly skips. Most articles talk about "production-like data" as if every environment wants the same thing. The opposite is true. A working enterprise test-data setup recognizes four distinct scales, each with a different answer.

Unit-test scale (1–50 rows per relevant table). Unit tests want the smallest data that exercises the code path. Twelve users, three orders each, one payment status. Fast to set up, easy to assert against. Generated synthetically per-test is fine; copying any version of prod here is slow and brings risk that doesn't pay off.

CI/integration scale (hundreds to a few thousand rows). CI runs against a populated database that resembles a real one but stays cheap to provision. The shape is more important than the volume — every status code represented at least once, every FK chain populated, edge-case dates included. This is where most teams over-provision; CI databases sized like staging slow pipelines down without catching more bugs.
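"Shape over volume" can be made concrete with a generator that guarantees every status appears at least once before filling the rest at random. The status list and function name below are hypothetical; the idea is only a sketch of the coverage rule described above.

```python
import random

# Hypothetical order statuses for illustration.
ORDER_STATUSES = ["pending", "paid", "shipped", "delivered", "refunded", "cancelled"]


def statuses_with_full_coverage(n_rows, statuses, seed=0):
    """Return n_rows statuses in which every status occurs at least once."""
    if n_rows < len(statuses):
        raise ValueError("need at least one row per status")
    rng = random.Random(seed)  # fixed seed keeps CI runs identical
    picked = list(statuses) + rng.choices(statuses, k=n_rows - len(statuses))
    rng.shuffle(picked)
    return picked


CI_STATUSES = statuses_with_full_coverage(200, ORDER_STATUSES)
```

Two hundred rows with guaranteed coverage exercise every branch that switches on status, which a purely random sample of the same size might not.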

Staging/demo scale (tens of thousands of rows). Staging needs realistic-looking data because humans interact with it — internal users, sales demos, support reps. A dashboard with 50,000 plausibly-distributed records reads as "real" the same way one with 500,000 would. Demo-specific data often gets even more narrative shaping ("the persona Linda has 14 orders across 6 months, last one delivered yesterday") so screenshots tell the right story.

Perf scale (production-shaped distribution at production-shaped volume). Perf is the only environment where matching production matters. Query plans, index hit rates, autovacuum behavior, lock contention — all of these depend on data distribution and volume. Synthetic data at perf scale is achievable but requires careful distribution control: the long tail matters, hot rows matter, skew matters. Under-size this environment and you ship the bugs that only real test data catches — missing indexes, N+1 queries, and memory blowups that never appear against a few dozen rows.
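The distribution control mentioned above can be approximated with rank-based weights, so a few "hot" customers own most of the orders and a long tail owns very few. This is a rough sketch under assumed parameters: the Zipf-like exponent of 1.2 is arbitrary and would need tuning against whatever aggregate statistics a team is allowed to observe.

```python
import random
from collections import Counter


def skewed_customer_ids(n_orders, n_customers, exponent=1.2, seed=0):
    """Assign each order to a customer with weight rank ** -exponent."""
    rng = random.Random(seed)
    ids = range(1, n_customers + 1)
    # Customer 1 is the hottest; weights fall off as a power law.
    weights = [rank ** -exponent for rank in ids]
    return rng.choices(ids, weights=weights, k=n_orders)


ORDER_OWNERS = skewed_customer_ids(10_000, 1_000)
ORDERS_PER_CUSTOMER = Counter(ORDER_OWNERS)
```

A uniform assignment would give every customer roughly ten orders, and index and lock behavior on the hot rows would never show up. With the skew, the top-ranked customers dominate while many tail customers have one order or none.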

The argument: one tool that can hit any of these four scales from the same schema specification is more valuable than one that does any single scale well. Most teams own four different test-data pipelines (a fixtures library for unit, a Faker script for CI, a pg_dump for staging, a vendor TDM for perf) and pay coordination cost across all four. The win isn't more data; it's the same data engine producing different shapes on demand.
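As a sketch of "same engine, different shapes", the four scales can be expressed as profiles applied to one schema description. Everything here is hypothetical: the `ScaleProfile` type, the fan-out numbers, and especially the perf row count, which is a placeholder rather than a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleProfile:
    name: str
    root_rows: int  # rows in the root table (customers in this example)
    skewed: bool    # whether child counts should follow a long tail


# Hypothetical profiles; counts follow the rough ranges discussed above.
PROFILES = {
    "unit": ScaleProfile("unit", 12, skewed=False),
    "ci": ScaleProfile("ci", 500, skewed=False),
    "staging": ScaleProfile("staging", 50_000, skewed=False),
    "perf": ScaleProfile("perf", 5_000_000, skewed=True),  # placeholder volume
}

# Hypothetical average fan-out: child rows per parent row.
FAN_OUT = {"orders": 3, "order_items": 4}


def planned_rows(profile):
    """Derive per-table row targets for one environment from its profile."""
    customers = profile.root_rows
    orders = customers * FAN_OUT["orders"]
    return {
        "customers": customers,
        "orders": orders,
        "order_items": orders * FAN_OUT["order_items"],
    }


UNIT_PLAN = planned_rows(PROFILES["unit"])
```

The schema and fan-out stay fixed; only the profile changes between environments, which is what removes the coordination cost of maintaining four separate pipelines.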

Two practical issues most TDM articles ignore — and both produce bugs that survive into production.

Tenant isolation in test data. Multi-tenant SaaS schemas have a tenant_id (or org_id, or account_id) column on most rows, and isolation is enforced with tenant-scoped query filters or Postgres row-level security so tenant 1 only sees tenant 1's data. When test data is generated naively, foreign keys cross tenant boundaries: a tenant_1 user owns an order whose product_id belongs to tenant_2's catalog. That order will never appear in production — the application wouldn't allow it — but it shows up in tests, causes assertion failures, and worst case slips into a demo environment where a sales rep notices that one tenant can see another tenant's data. Multi-tenant database seeding has its own constraints; the rule is that every tenant-scoped FK has to stay within tenant.
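The "stay within tenant" rule can be sketched in two parts: pick referenced rows only from the same tenant when generating, and run a check that flags any reference that crosses the boundary. The row shapes and function names below are hypothetical.

```python
import random

# Toy catalog: each product belongs to exactly one tenant.
PRODUCTS = [
    {"id": 1, "tenant_id": 1},
    {"id": 2, "tenant_id": 2},
    {"id": 3, "tenant_id": 2},
]


def pick_product_for(tenant_id, products, rng):
    """Choose a product for an order, restricted to the order's own tenant."""
    candidates = [p for p in products if p["tenant_id"] == tenant_id]
    return rng.choice(candidates)


def cross_tenant_violations(orders, products):
    """Return ids of orders whose product belongs to a different tenant."""
    product_tenant = {p["id"]: p["tenant_id"] for p in products}
    return [
        o["id"] for o in orders
        if product_tenant[o["product_id"]] != o["tenant_id"]
    ]


# A naively generated order (id 10) points at tenant 2's catalog.
NAIVE_ORDERS = [
    {"id": 10, "tenant_id": 1, "product_id": 2},
    {"id": 11, "tenant_id": 1, "product_id": 1},
]
VIOLATIONS = cross_tenant_violations(NAIVE_ORDERS, PRODUCTS)
```

Running the check as a post-seed assertion in CI catches the cross-tenant order before it reaches a demo environment.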

Audit-trail and lifecycle consistency. Real records have a story: a user account is created, then email_verified, then active, eventually suspended or closed. The created_at, updated_at, verified_at, and last_login_at timestamps line up with that story. Most generated data ignores this; you end up with users whose verified_at predates created_at, accounts with last_login_at after they were closed, and rows with status active whose closed_at is set. None of this is wrong by the database's lights, since the constraints are satisfied, but it produces bugs nobody catches in tests because the data itself is lying. Demo embarrassments are the visible version of this; the silent version is integration tests that pass against impossible state.
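Lifecycle consistency can be sketched by deriving each timestamp from the previous one instead of sampling them independently. The three statuses used here (`created`, `active`, `closed`) are a simplification of the lifecycle above, and the time ranges are arbitrary assumptions.

```python
import random
from datetime import datetime, timedelta, timezone


def lifecycle_timestamps(status, now, rng):
    """Build timestamps with created <= verified <= last_login <= closed."""
    created = now - timedelta(days=rng.randint(30, 720))
    row = {
        "status": status,
        "created_at": created,
        "verified_at": None,
        "last_login_at": None,
        "closed_at": None,
    }
    if status == "created":  # never verified, so no later events exist
        return row

    row["verified_at"] = created + timedelta(hours=rng.randint(1, 48))

    # Activity ends at closure for closed accounts, otherwise at "now".
    end = now
    if status == "closed":
        end = row["verified_at"] + (now - row["verified_at"]) * rng.random()
        row["closed_at"] = end

    # The last login falls somewhere between verification and the end.
    row["last_login_at"] = row["verified_at"] + (end - row["verified_at"]) * rng.random()
    return row


EXAMPLE_ROW = lifecycle_timestamps(
    "closed", datetime.now(timezone.utc), random.Random(0)
)
```

Only closed rows get a `closed_at`, and no login can land after closure, so the impossible states listed above cannot be produced by construction.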

These are problems schema-aware generation has the chance to solve and hand-rolled scripts almost never do. They're also what turns "we have synthetic data" into "we have synthetic data that actually behaves like production data."

Most existing content presents two paths for enterprise test data: hand-rolled seed scripts at the small end, six-figure enterprise TDM at the high end. There's a gap in the middle, and a lot of teams live in it.

The gap is shaped like this. A regulated mid-size company — fintech, insurtech, healthtech, B2B SaaS in SOC 2 scope — has more than fifty engineers, a Postgres schema in the high two-digit table count, and a hard rule that production data cannot leave production. Hand-rolled seed scripts can't keep up with the schema; they break on every migration, and nobody owns them. But the team can't justify a six-figure procurement cycle either. Procurement takes two quarters. The platform takes two more quarters to roll out. By the time the TDM is online, the schema has moved twice and the data-shape problem is half-solved at six-figure cost.

This isn't a "smaller customer for the enterprise tool" problem. It's a different category. The team doesn't need anonymization-of-production; they need data that was never production in the first place. They don't need a six-month rollout; they need something to drop into CI this sprint. The enterprise TDM platforms aren't a discount tier away from the right answer — they're answering a different question (anonymize what we have) than the one the team is asking (generate what we don't).

Schema-aware generation fills that gap. Tools in this category read the schema, generate from scratch, and never require access to production data. They're not a discount enterprise TDM; they're a different lever, and they happen to fit the regulated-mid-size shape better than either of the alternatives. The data seeding tools spectrum lays out where each category lands on the price/capability axis if you want a side-by-side.

Seedfast generates relational test data from your Postgres schema, with no production rows involved. You point it at a connection string, describe the scenario in plain English, and review the plan before anything is written:

seedfast seed --scope "fintech app: 5,000 accounts with transactions, payments, and varied balances"
  → Connected to PostgreSQL
  ✓ Found 41 tables, 78 foreign keys
  ✓ Resolved insert order
  → Review the plan
  ? Apply this plan? (y/N) y
  ✓ Seeded 61,240 rows across 41 tables

Only schema metadata — table and column shapes — crosses the wire to Seedfast's generation service so the model knows what to build; row values never do. If schema names are themselves sensitive in your shop, that data path is the thing to review against your security policy, the same way you would any new vendor. For a team that's already decided "no production data leaves production" and isn't ready to run a multi-quarter procurement, that's the third option that's been missing from most of the writing on this topic.

The operational closing question: which approach belongs where? A working enterprise test-data setup matches each environment to the data shape it actually needs.

  • Local-dev: synthesized data, small, regenerated per-developer. No production access ever. Refresh on demand from the schema; if the schema changes, the next run picks it up.
  • CI: synthesized data, small-to-medium, generated per-test-run or cached as a template. The CI database should be cheap to provision and identical across runs. This is where flaky tests pile up if the data shape drifts.
  • Integration / preview environments: synthesized data, medium scale, refreshable on demand. When a developer pushes a branch and gets a preview environment, the data should be there in minutes, not hours.
  • Staging: synthesized data at staging scale, refreshable nightly or on-demand. If sales and support use staging for demos, the data needs narrative shaping — named personas, recognizable scenarios.
  • Perf: the only environment where matching production-shaped distribution matters. For most regulated teams, this is also the only environment where masked-after-subset still earns its keep, if the team already has a TDM platform. Otherwise, schema-aware generation at perf scale, with care taken on distribution control.
  • Demo: synthesized data, scenario-shaped. Demo data should tell a story; sales reps and customer success teams should be able to walk through a known flow without surprises.

The pattern across all six environments is the same input, your schema, producing different output shapes. Teams that get this right stop coupling every environment to one fragile script, and they stop pretending one snapshot fits everything. If you want to see how scenario-based generation lines up with your own schema, seedfa.st is where to start.

Enterprise database test data is test data managed as a system rather than a script. Once an organization's environment count, schema size, and compliance footprint cross a threshold, seed scripts and pg_dump snapshots stop being viable, and the team has to treat test data as infrastructure with its own scale, sensitivity, and refresh constraints.

Copying production into dev or CI is not a safe default for a regulated team. It puts regulated rows (PHI, cardholder data, EU resident PII) outside production, which auditors and regulators generally treat as a serious problem: a likely SOC 2 finding, a potential HIPAA violation where PHI is involved, or a PCI DSS scope issue where cardholder data is present. GDPR data minimization points the same way: don't move production rows into dev environments at all.

Subsetting copies a real production slice but keeps full regulatory weight. Masking transforms sensitive columns but needs production access and drifts behind schema changes. Schema-aware generation synthesizes from scratch without reading production rows. Most regulated mid-size teams mix all three — synthesize for dev, CI, and demo; reserve masked-after-subset for perf if a TDM platform already exists.

Valid foreign keys at scale require resolving insert order across the dependency graph: which tables to fill first, how to handle circular references, and how many child rows hang off each parent. Hand-rolled scripts cope with three or four circular references; a schema-aware generator orders the whole graph topologically and has to break the cycles it finds, since a cyclic graph has no valid topological order, so that every reference resolves.

Seedfast reads a live Postgres schema and generates realistic relational data from a plain-English scope, with no production rows in the pipeline. For a team that has ruled out moving production data but can't justify a multi-quarter, six-figure TDM procurement, Seedfast is the schema-aware option that drops into CI without a long rollout. Only schema metadata crosses the wire to the generation service, so review that data path against your security policy the way you would any vendor.