
Enterprise Database Test Data: What It Actually Looks Like

By the Seedfast team

This is the practical companion to the broader test data management category: if that piece defines the discipline, this one shows what enterprise database test data actually looks like in working Postgres at the scale and sensitivity a regulated mid-size company runs. Not the platonic vendor diagram. The real one — the one your CI runs against on Tuesday morning, the one your platform team is tired of fixing.

By "enterprise" we don't mean Fortune 500. We mean the moment your environment count, schema size, and compliance footprint cross a threshold where seed scripts and pg_dump snapshots stop being viable, and you have to think about test data as a system rather than a script.

Key Takeaways

  • Enterprise database test data isn't just bigger fixture data — it's a different problem with different constraints: scale variance, regulated sensitivity, multi-environment refresh, organizational coordination
  • The three levers — subset production, mask production, synthesize from scratch — solve overlapping problems with different trade-offs; none is the right answer everywhere
  • Different environments deserve different data shapes: a unit test wants a dozen rows, perf wants production-shaped distribution, demo wants a story. One-size-fits-all is the silent source of "works in CI, breaks in staging" bugs
  • Tenant isolation in test data and timestamp/lifecycle realism cause more demo embarrassments and test flakiness than any data-volume problem
  • Between fragile DIY scripts and $200K+ enterprise TDM procurement sits a third category — schema-aware generation — that fits regulated mid-size teams who can't touch production but also can't run a multi-quarter rollout

Why enterprise database test data is a different problem

Three things separate enterprise test data from a single team's fixtures: scale, sensitivity, and environment count.

Scale. Postgres tables in production grow into the millions of rows. Schemas grow into the hundreds of tables. Foreign-key chains run six or seven hops deep — a payment_event references a payment references an order references a cart references a customer references an account references a tenant. Hand-rolled seed scripts hit a wall around 30 tables. Past that, somebody has to think about insert order, dependency cycles, and how many order_items should hang off a typical order to look real.
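
The insert-order problem above is a topological sort over the foreign-key graph. A minimal sketch using Kahn's algorithm (the table names and FK edges below are hypothetical, echoing the chain in the paragraph, not taken from any real schema):

```python
from collections import defaultdict, deque

# Hypothetical FK edges: each table lists the tables it references,
# which must therefore be inserted first.
FK_DEPS = {
    "tenant": [],
    "account": ["tenant"],
    "customer": ["account"],
    "cart": ["customer"],
    "order": ["cart"],
    "payment": ["order"],
    "payment_event": ["payment"],
}

def insert_order(deps):
    """Kahn's algorithm: emit a table only after everything it references."""
    indegree = {t: len(parents) for t, parents in deps.items()}
    children = defaultdict(list)
    for child, parents in deps.items():
        for parent in parents:
            children[parent].append(child)
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        table = ready.popleft()
        order.append(table)
        for child in sorted(children[table]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("cycle in FK graph; needs a nullable edge or deferred constraint")
    return order

print(insert_order(FK_DEPS))  # tenant first, payment_event last
```

Hand-rolled seed scripts encode this ordering implicitly in the order of their INSERT statements, which is exactly why they break when a migration adds a new edge to the graph.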

Sensitivity. Production data in a regulated company isn't just sensitive — it's regulated. A copy of prod on a developer laptop is a finding in your next SOC 2 review, a HIPAA violation if the schema includes patient records, a PCI DSS breach if cardholder fields are involved. The classic move — pg_dump prod, restore to staging, UPDATE users SET email = ... over the top — is the staging-without-prod-data trap most teams already know is broken but haven't replaced.

Environment count. A small team has local + CI + staging. A mid-size regulated company has local-dev, CI, integration, staging, perf, and demo at minimum. Each one wants different data: tiny in CI, representative in staging, production-shaped in perf, narrative in demo. The "one giant snapshot for everything" approach quietly creates more bugs than it solves — tests that pass in CI and fail in staging because the data shape is different.

The constraints that shape it

Four constraints shape every enterprise team's test-data approach. None of them are optional; the question is how you handle each.

Volume that varies by purpose. Unit tests want a dozen rows. Integration tests want hundreds. Staging wants something representative — five-figure rows for most tables, six-figure for hot ones. Perf wants production-shaped distribution at production-shaped volume. Most teams pick one volume and force every environment to live with it. Bugs follow.

Referential integrity across many related tables. Postgres enforces foreign keys; that's the database's job. The tool's job is figuring out insert order — which tables to fill before which others, how to handle circular references, what to do when a payment has an FK to an order and the order has an FK back to payment_id for the latest payment. A 30-table fintech schema has dozens of these. Hand-rolled scripts handle three or four; past that they collapse.
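
The payment/order cycle mentioned above is the canonical hard case: no insert order can satisfy both edges. The standard fix is to prune the nullable edge, load in the pruned order, then backfill with a second UPDATE pass. A sketch, with illustrative table names:

```python
# Hypothetical cycle: payments reference orders, and orders point back at
# their latest payment via a nullable latest_payment_id column.
CYCLE_DEPS = {"orders": ["payments"], "payments": ["orders"]}
NULLABLE_EDGES = {("orders", "payments")}  # orders.latest_payment_id is nullable

def plan(deps, nullable_edges):
    """Prune nullable FK edges, topo-sort what remains, plan a backfill pass."""
    pruned = {t: [p for p in ps if (t, p) not in nullable_edges]
              for t, ps in deps.items()}
    order, remaining = [], dict(pruned)
    while remaining:
        ready = sorted(t for t, ps in remaining.items()
                       if all(p not in remaining for p in ps))
        if not ready:
            raise ValueError("cycle survives pruning; no nullable edge breaks it")
        for table in ready:
            order.append(table)
            del remaining[table]
    backfill = [f"backfill {child} -> {parent} FK after both tables load"
                for child, parent in sorted(nullable_edges)]
    return order, backfill

order, backfill = plan(CYCLE_DEPS, NULLABLE_EDGES)
print(order)     # orders load first (with NULL latest_payment_id)
print(backfill)  # then the deferred edge gets an UPDATE pass
```

Postgres also supports `DEFERRABLE INITIALLY DEFERRED` constraints as an alternative to the two-pass load; either way, somebody has to know which edges are breakable, and that knowledge is what hand-rolled scripts lose first.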

Sensitivity gravity. Once a schema includes any regulated field — PHI, cardholder data, account numbers, EU resident PII — the cost of putting production rows anywhere outside production climbs sharply. HIPAA, PCI DSS, SOC 2, and GDPR data minimization all push toward the same outcome: don't move production rows into dev environments at all. The pull of regulated data is gravitational; everything else in your test-data strategy rotates around it. Seedfast was built around exactly this constraint — generate the data fresh from the schema so the regulated rows never leave production in the first place.

Refresh cadence. Staging data refreshed monthly is stale. Refreshed nightly is expensive. Refreshed on-demand requires the refresh to be cheap, fast, and reliable. The cadence question is downstream of the volume question: small data refreshes cheaply and often; production-sized snapshots refresh slowly and rarely. Teams who tie all environments to the same refresh process end up with stale staging and expensive perf runs.

The three levers: subset, mask, synthesize

There are three working approaches to enterprise database test data. Each solves part of the problem; none solves all of it.

Subset production. Take a slice of prod — a few thousand customers and their related rows — and copy it down. Subsetting tools handle the FK closure: pull customer 1234 and you also pull their orders, payments, addresses, audit rows. The data is real, so your tests behave like production. The trade-off is that subsetted data is still production data. It carries the same regulatory weight as the full database, just with fewer rows. For a HIPAA shop or a PCI-scope team, subsetting alone isn't enough — you'd still have PHI on a developer laptop.
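
The FK closure that subsetting tools compute can be sketched in miniature: start from a few seed customers and keep every row that transitively references them. The tables, columns, and rows below are hypothetical in-memory stand-ins, not a real subsetting tool's API:

```python
# Toy tables: each row is a dict; child tables carry an FK to their parent.
ROWS = {
    "customer": [{"id": 1}, {"id": 2}, {"id": 3}],
    "order":    [{"id": 10, "customer_id": 1}, {"id": 11, "customer_id": 3}],
    "payment":  [{"id": 100, "order_id": 10}, {"id": 101, "order_id": 11}],
}
# Child table -> (FK column, parent table), listed in dependency order.
CHILD_FKS = {
    "order":   ("customer_id", "customer"),
    "payment": ("order_id", "order"),
}

def subset(seed_customer_ids):
    """Compute the FK closure of a set of seed customers."""
    keep = {"customer": {r["id"] for r in ROWS["customer"]
                         if r["id"] in seed_customer_ids}}
    for child, (fk, parent) in CHILD_FKS.items():
        keep[child] = {r["id"] for r in ROWS[child] if r[fk] in keep[parent]}
    return keep

print(subset({1}))  # customer 1, their order, and that order's payment
```

Real subsetting tools walk this closure in both directions (children and required parents) across hundreds of tables, but the principle is the same, and so is the problem: every row in the closure is still a production row.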

Mask production. Take that subset (or the full snapshot) and run masking rules over the sensitive columns. Names get replaced with fake names; emails get hashed; card numbers get tokenized. Done well, masking produces realistic-looking data without the regulated values. Done poorly — and it's often done poorly because masking config drifts behind schema changes — masking produces obviously-broken data: emails that don't validate, addresses that don't match cities, foreign keys that reference rows the subsetting step dropped. Masking also requires production database access for the masking step, which is its own access-control problem.
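
The drift failure mode above is cheap to guard against in CI: fail the build when a column is neither masked nor explicitly allow-listed. A sketch, with hypothetical column names and rule formats (not any particular masking tool's config):

```python
# Schema as introspected from the database vs. the masking config as written.
SCHEMA_COLUMNS = {"users": ["id", "email", "full_name", "phone", "created_at"]}
MASK_RULES = {"users": {"email": "hash", "full_name": "fake_name"}}
ALLOW_LIST = {"users": {"id", "created_at"}}  # columns known to be non-sensitive

def unaccounted_columns(schema, rules, allow):
    """Return every column the masking config has no opinion about."""
    missing = []
    for table, cols in schema.items():
        covered = set(rules.get(table, {})) | allow.get(table, set())
        missing += [f"{table}.{c}" for c in cols if c not in covered]
    return missing

print(unaccounted_columns(SCHEMA_COLUMNS, MASK_RULES, ALLOW_LIST))  # ['users.phone']
```

A migration that adds `phone` without touching the masking config is exactly how regulated values leak into staging; the check turns that silent drift into a loud build failure.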

Synthesize from scratch. Generate the data without ever touching production. Schema-aware generators read the schema, walk the foreign-key graph, and produce realistic rows that satisfy constraints. Synthetic test data of this kind has no production access at all — no subset, no mask, no procurement of prod connections. The trade-off is that synthetic data doesn't reproduce production's exact shape: you won't accidentally hit a real customer's edge case because you've never seen it. Some teams treat that as a feature; others want at least one perf environment that mirrors production-shaped distributions.
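
Schema-aware synthesis in miniature: parents are generated first, and every child row samples its FK from ids that already exist, so constraints hold by construction without reading production. The shapes below are illustrative, not any specific generator's output:

```python
import random

random.seed(0)  # deterministic runs make CI data reproducible

def synthesize(n_customers=5, n_orders=12):
    """Generate customers, then orders whose FKs resolve by construction."""
    customers = [{"id": i, "email": f"user{i}@example.test"}
                 for i in range(1, n_customers + 1)]
    customer_ids = [c["id"] for c in customers]
    orders = [{"id": i,
               "customer_id": random.choice(customer_ids),
               "status": random.choice(["pending", "paid", "shipped"])}
              for i in range(1, n_orders + 1)]
    return customers, orders

customers, orders = synthesize()
valid = {c["id"] for c in customers}
assert all(o["customer_id"] in valid for o in orders)  # no dangling FKs possible
```

Note what's absent: there is no production connection string anywhere in this pipeline, which is the whole point of the lever.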

Each lever has a legitimate niche. Synthesizing is the default for environments that should never see production data at all: local-dev, CI, demo. Masking-after-subset still fits perf testing for teams already running a TDM platform. Subsetting alone fits only the narrow case where production data is permitted outside production. Most regulated mid-size teams end up with a mix, and the mix, not any single lever, is what makes test data hard.

Scale variance across environments

Here's the section the rest of the test-data discourse mostly skips. Most articles talk about "production-like data" as if every environment wants the same thing. The opposite is true. A working enterprise test-data setup recognizes four distinct scales, each with a different answer.

Unit-test scale (1–50 rows per relevant table). Unit tests want the smallest data that exercises the code path. Twelve users, three orders each, one payment status. Fast to set up, easy to assert against. Generated synthetically per-test is fine; copying any version of prod here is slow and brings risk that doesn't pay off.

CI/integration scale (hundreds to a few thousand rows). CI runs against a populated database that resembles a real one but stays cheap to provision. The shape is more important than the volume — every status code represented at least once, every FK chain populated, edge-case dates included. This is where most teams over-provision; CI databases sized like staging slow pipelines down without catching more bugs.
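
"Shape over volume" is checkable. A tiny assertion helper that fails CI when any enum value has zero rows (the statuses are hypothetical examples, not a prescribed set):

```python
ORDER_STATUSES = {"pending", "paid", "shipped", "delivered", "refunded"}

def missing_statuses(rows, all_statuses=ORDER_STATUSES):
    """Return every status the seed data never exercises."""
    return sorted(all_statuses - {r["status"] for r in rows})

rows = [{"status": s} for s in ("pending", "paid", "shipped")]
print(missing_statuses(rows))  # ['delivered', 'refunded']
```

Five hundred rows that cover every status catch more bugs than five million that are all `paid`; this check is how you enforce the former without paying for the latter.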

Staging/demo scale (tens of thousands of rows). Staging needs realistic-looking data because humans interact with it — internal users, sales demos, support reps. A dashboard with 50,000 plausibly-distributed records reads as "real" the same way one with 500,000 would. Demo-specific data often gets even more narrative shaping ("the persona Linda has 14 orders across 6 months, last one delivered yesterday") so screenshots tell the right story.

Perf scale (production-shaped distribution at production-shaped volume). Perf is the only environment where matching production matters. Query plans, index hit rates, autovacuum behavior, lock contention — all of these depend on data distribution and volume. Synthetic data at perf scale is achievable but requires careful distribution control: the long tail matters, hot rows matter, skew matters.
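
Distribution control at perf scale usually means skew: a few hot customers own most of the orders. A rough Zipf-style sketch; the exponent here is a made-up tuning knob you would calibrate against production statistics, not a measured value:

```python
import random

random.seed(42)

def skewed_owner_ids(n_orders, n_customers, s=1.2):
    """Assign orders to customers with Zipf-like skew (rank 1 is hottest)."""
    weights = [1 / (rank ** s) for rank in range(1, n_customers + 1)]
    return random.choices(range(1, n_customers + 1), weights=weights, k=n_orders)

owners = skewed_owner_ids(10_000, 1_000)
top_10_share = sum(1 for o in owners if o <= 10) / len(owners)
print(f"top 10 customers own {top_10_share:.0%} of orders")
```

Uniformly random FKs produce flat distributions that make every index look healthy and every query plan look fast; skewed data is what surfaces the hot-row contention and bad plans perf environments exist to find.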

The argument: one tool that can hit any of these four scales from the same schema specification is more valuable than one that does any single scale well. Most teams own four different test-data pipelines (a fixtures library for unit, a Faker script for CI, a pg_dump for staging, a vendor TDM for perf) and pay coordination cost across all four. The win isn't more data; it's the same data engine producing different shapes on demand.

Tenancy boundaries and audit-trail consistency

Two practical issues most TDM articles ignore — and both produce bugs that survive into production.

Tenant isolation in test data. Multi-tenant SaaS schemas have a tenant_id (or org_id, or account_id) column on most rows, and the application enforces "tenant 1 only sees tenant 1's data" via row-level filters. When test data is generated naively, foreign keys cross tenant boundaries: a tenant_1 user owns an order whose product_id belongs to tenant_2's catalog. That order will never appear in production — the application wouldn't allow it — but it shows up in tests, causes assertion failures, and worst case slips into a demo environment where a sales rep notices that one tenant can see another tenant's data. Multi-tenant database seeding has its own constraints; the rule is that every tenant-scoped FK has to stay within tenant.
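
The fix is structural: sample every tenant-scoped FK from that tenant's own pool, never from a global one. A sketch with hypothetical tenants and catalogs:

```python
import random

random.seed(1)
# tenant_id -> that tenant's product catalog (illustrative data)
PRODUCTS = {1: [101, 102], 2: [201, 202, 203]}

def make_orders(n_per_tenant):
    """Generate orders whose product FK never crosses a tenant boundary."""
    orders = []
    for tenant_id, catalog in PRODUCTS.items():
        for _ in range(n_per_tenant):
            orders.append({
                "tenant_id": tenant_id,
                "product_id": random.choice(catalog),  # same-tenant pool only
            })
    return orders

orders = make_orders(5)
assert all(o["product_id"] in PRODUCTS[o["tenant_id"]] for o in orders)
```

The naive version samples `product_id` from a flat list of all products, which is precisely how a tenant_1 order ends up pointing into tenant_2's catalog.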

Audit-trail and lifecycle consistency. Real records have a story: a user account is created, then email_verified, then active, eventually suspended or closed. The created_at, updated_at, verified_at, last_login_at timestamps line up with that story. Most generated data ignores this; you end up with users whose verified_at predates created_at, accounts whose last_login_at falls after they were closed, and active rows whose closed_at is set. None of this is wrong by the database's lights (the constraints are satisfied), but it produces bugs nobody catches in tests because the data itself is the lie. Demo embarrassments are the visible version of this; the silent version is integration tests that pass against impossible state.
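
Lifecycle consistency is cheap to enforce at generation time: advance a single clock through the record's story so each event lands strictly after the one before it. A sketch with illustrative field names:

```python
import random
from datetime import datetime, timedelta

random.seed(7)

def user_lifecycle(start=datetime(2024, 1, 1)):
    """Generate timestamps that follow the record's lifecycle in order."""
    t = start
    record = {}
    for field in ("created_at", "verified_at", "last_login_at"):
        record[field] = t
        t += timedelta(hours=random.randint(1, 72))  # always move forward
    return record

u = user_lifecycle()
assert u["created_at"] < u["verified_at"] < u["last_login_at"]  # impossible to violate
```

Generating each timestamp independently (the Faker default) is what produces verified_at before created_at; threading one clock through the lifecycle makes the impossible states unrepresentable.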

These are problems schema-aware generation has the chance to solve and hand-rolled scripts almost never do. They're also what turns "we have synthetic data" into "we have synthetic data that actually behaves like production data."

The procurement gap and how teams fill it

Most existing content presents two paths for enterprise test data: hand-rolled seed scripts at the small end, $200K+ enterprise TDM at the high end. There's a gap in the middle, and a lot of teams live in it.

The gap is shaped like this. A regulated mid-size company — fintech, insurtech, healthtech, B2B SaaS in SOC 2 scope — has more than fifty engineers, a Postgres schema in the high two-digit table count, and a hard rule that production data cannot leave production. Hand-rolled seed scripts can't keep up with the schema; they break on every migration, and nobody owns them. But the team can't justify a $200K procurement cycle either. Procurement takes two quarters. The platform takes two more quarters to roll out. By the time the TDM is online, the schema has moved twice and the data-shape problem is half-solved at six-figure cost.

This isn't a "smaller customer for the enterprise tool" problem. It's a different category. The team doesn't need anonymization-of-production; they need data that was never production in the first place. They don't need a six-month rollout; they need something to drop into CI this sprint. The enterprise TDM platforms aren't a discount tier away from the right answer — they're answering a different question (anonymize what we have) than the one the team is asking (generate what we don't).

Schema-aware generation fills that gap. Tools in this category read the schema, generate from scratch, and never require production access. They're not a discount enterprise TDM; they're a different lever, and they happen to fit the regulated-mid-size shape better than either of the alternatives. The data seeding tools spectrum lays out where each category lands on the price/capability axis if you want a side-by-side.

Seedfast generates relational test data from your Postgres schema, with no production rows involved. For a team that's already decided "no production data leaves production" and isn't ready to run a multi-quarter procurement, that's the third option that's been missing from most of the writing on this topic.

Where to put what across the SDLC

The operational closing question: which approach belongs where? A working enterprise test-data setup matches each environment to the data shape it actually needs.

  • Local-dev: synthesized data, small, regenerated per-developer. No production access ever. Refresh on demand from the schema; if the schema changes, the next run picks it up.
  • CI: synthesized data, small-to-medium, generated per-test-run or cached as a template. The CI database should be cheap to provision and identical across runs. This is where flaky tests pile up if the data shape drifts.
  • Integration / preview environments: synthesized data, medium scale, refreshable on demand. When a developer pushes a branch and gets a preview environment, the data should be there in minutes, not hours.
  • Staging: synthesized data at staging scale, refreshable nightly or on-demand. If sales and support use staging for demos, the data needs narrative shaping — named personas, recognizable scenarios.
  • Perf: the only environment where matching production-shaped distribution matters. For most regulated teams, this is also the only environment where masked-after-subset still earns its keep, if the team already has a TDM platform. Otherwise, schema-aware generation at perf scale, with care taken on distribution control.
  • Demo: synthesized data, scenario-shaped. Demo data should tell a story; sales reps and customer success teams should be able to walk through a known flow without surprises.
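
The table above reduces to one schema driving several shape profiles. A sketch of what such a profile map might look like; the keys, scales, and refresh values are illustrative, not a real Seedfast (or any tool's) config format:

```python
# One schema, many shapes: a per-environment profile a generator could read.
PROFILES = {
    "local":   {"rows_per_table": 25,        "refresh": "on_demand"},
    "ci":      {"rows_per_table": 500,       "refresh": "per_run"},
    "preview": {"rows_per_table": 5_000,     "refresh": "on_demand"},
    "staging": {"rows_per_table": 50_000,    "refresh": "nightly"},
    "perf":    {"rows_per_table": 5_000_000, "refresh": "weekly",
                "skew": "production_shaped"},
    "demo":    {"rows_per_table": 10_000,    "refresh": "on_demand",
                "scenario": "named_personas"},
}

def profile_for(env):
    return PROFILES[env]

assert profile_for("ci")["rows_per_table"] < profile_for("staging")["rows_per_table"]
```

The point is not the specific numbers but the shape of the interface: one generator, many profiles, so changing an environment's scale is a config edit rather than a new pipeline.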

The pattern across all six environments is the same input — your schema — producing different output shapes. Teams that get this right stop coupling every environment to one fragile script, and they stop pretending one snapshot fits everything. If you want to see how scenario-based generation lines up with your own schema, seedfa.st is where to start.