Compliant Test Data: Why Generated Beats Masked

By Mikhail Shytsko, Founder at Seedfast

A compliant test data tool generates rows directly from your database schema, so no production PII is ever copied or masked. Because the data was never personal to begin with, it falls outside regulations like GDPR by construction. That keeps development and staging environments out of scope, with no masking pipeline to build and no real records in lower environments to defend in an audit.

If you work somewhere regulated, you already know that "we need realistic test data" is not a lawful basis for parking real customer records in a dev database. The usual response is to mask: copy production, scrub the columns that matter, and keep the scrubbing rules in step with a schema that never stops moving. There is a less brittle option. Generate the rows from the schema and there is nothing to scrub in the first place, because no value in the dataset ever belonged to a real person. That single shift, from copying to constructing, is what the rest of this guide unpacks, first under GDPR and then at the point of choosing a tool.

  • "Compliant" hinges on one question: was the data ever personal? GDPR Recital 26 puts truly anonymous, never-personal data outside scope, while pseudonymised (masked) data is still personal data and stays in scope.
  • Generating from the schema is compliant by construction. It reads the schema and builds fresh rows, so there's no production access, no PII in the output, and no masking rules to maintain as the schema changes.

Under GDPR, whether test data counts as "compliant" comes down to a single distinction in Recital 26, and the two halves of that recital draw it cleanly.

Recital 26 says the principles of data protection "should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable." Generated test data that never derived from a real person is the cleanest case of "does not relate to an identified or identifiable natural person." There is no data subject behind the row.

Pseudonymised data is the mirror image. Masking a production export — tokenising names, scrambling emails — is pseudonymisation, and Recital 26 addresses it head-on: "Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information should be considered to be information on an identifiable natural person." Because the original stays recoverable with the right additional information, the masked copy remains personal data, and every environment holding it stays in scope.
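To make the "additional information" point concrete, here is a minimal Python sketch of deterministic tokenisation. It is an illustration under stated assumptions, not how any particular masking product works: the key, the sample rows, and the `tokenise` and `reidentify` helpers are all hypothetical, and the people in the sample are invented.

```python
import hashlib
import hmac
from typing import List, Optional

# Assumption: a secret key held by whoever runs the masking pipeline.
SECRET_KEY = b"hypothetical-masking-key"

def tokenise(value: str) -> str:
    """Deterministically replace a value with a keyed 16-character token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# A made-up "production" export used only for this example.
production_rows = [
    {"id": 1, "email": "ada@example.com"},
    {"id": 2, "email": "grace@example.com"},
]

# The masked copy that would land in a lower environment.
masked_rows = [
    {"id": row["id"], "email": tokenise(row["email"])} for row in production_rows
]

def reidentify(token: str, candidates: List[str]) -> Optional[str]:
    """With the key and a candidate list, a token maps back to a person."""
    for candidate in candidates:
        if hmac.compare_digest(tokenise(candidate), token):
            return candidate
    return None

recovered = reidentify(
    masked_rows[0]["email"], ["grace@example.com", "ada@example.com"]
)
```

The masked email no longer looks like an email, yet anyone holding the key and a list of candidates can link the token to its owner. That is the sense in which the masked copy stays attributable.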

That distinction is the data minimization principle in practice. GDPR Article 5(1)(c) requires personal data to be limited to what is necessary for the purposes for which it is processed, and a full production copy sitting in a lower environment is rarely necessary for testing. Masking shrinks the exposure, but the copy is still real data you have to govern. Generating the rows removes the personal data altogether, which is about as minimal as a dataset gets.

For a regulated team, the realistic choices come down to two: mask a production copy, or generate the data from the schema. Everything else is a variation on one of those, and they sit on opposite sides of the one line that actually matters here — whether a real person's data is ever present in the lower environment at all.

| Approach | Production access | PII in the output | In GDPR scope | Pipeline to maintain |
| --- | --- | --- | --- | --- |
| Mask / anonymize production | Required | Pseudonymised (still personal) | Yes (pseudonymised data is in scope) | Per-column masking rules, refreshed on schema change |
| Generate from schema | Not required | None (never personal) | No (never-personal data is out of scope) | None; reads the live schema each run |

Masking production data is a real, established approach. Tonic Structural connects to your production database and applies masking, deterministic tokenization, and format-preserving encryption while preserving referential integrity; Delphix pairs irreversible masking with data virtualization for non-production environments, and K2View uses an entity-based model with format-preserving tokens (as of June 2026). Their shared strength is fidelity — the output inherits the real structure, distributions, and edge cases of production, because it is your production data, transformed. The cost is a production connection plus the security review it triggers, a masking pipeline to update on every schema change, and output that, being pseudonymised, remains personal data under Recital 26.

Generating from the schema works the other way around. The tool reads the structure — tables, columns, constraints, foreign keys — and builds valid relational rows from nothing, so no row ever began as a real person's record and there is nothing in the output anyone could re-identify. That trade suits a team whose goal is keeping dev and staging out of scope more than mirroring production exactly. Where Faker, ORM seeders, and the rest of the field fit is mapped in the data seeding tools comparison.
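The idea of building rows from structure alone can be sketched in a few lines of Python. This is a toy under stated assumptions, not any vendor's implementation: the `SCHEMA` dictionary, its type names, and the `generate` function are hypothetical, and the placeholder text values are deliberately plain rather than realistic.

```python
import random

# Hypothetical schema description: only metadata, no rows.
SCHEMA = {
    "accounts": {
        "columns": {"id": "serial", "name": "text"},
        "foreign_keys": {},
    },
    "users": {
        "columns": {"id": "serial", "account_id": "int", "email": "text"},
        "foreign_keys": {"account_id": ("accounts", "id")},
    },
}

def generate(schema, table_order, rows_per_table, seed=0):
    """Build rows table by table; parents must appear before children."""
    rng = random.Random(seed)  # seeded so repeated runs are reproducible
    data = {}
    for table in table_order:
        spec = schema[table]
        rows = []
        for i in range(1, rows_per_table[table] + 1):
            row = {}
            for column, col_type in spec["columns"].items():
                if column in spec["foreign_keys"]:
                    # Point at a parent row that was already generated.
                    parent_table, parent_col = spec["foreign_keys"][column]
                    row[column] = rng.choice(data[parent_table])[parent_col]
                elif col_type == "serial":
                    row[column] = i
                elif col_type == "int":
                    row[column] = rng.randint(0, 1000)
                else:
                    # Placeholder text; a real generator would shape this.
                    row[column] = f"{table}_{column}_{i}"
            rows.append(row)
        data[table] = rows
    return data

data = generate(SCHEMA, ["accounts", "users"], {"accounts": 3, "users": 10})
```

Every `users.account_id` here refers to an account that exists, and no input to the function was ever a real record. The only inputs are the schema description, row counts, and a random seed.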

Masking wins when testing depends on the exact distributions and edge cases of production — fraud-model validation, or analytics that has to match prod. Generated data is realistic and correctly shaped, though it isn't your actual users' data. For development, CI, and demos, generation is the compliant default.

Once you've settled on generating over masking, the tool still has to deliver it. These are the criteria that decide whether it keeps you out of scope and out of maintenance:

  • No production access. The tool should never need a connection to your production database. If it does, you're back to a pipeline that processes real data and needs a security review.
  • Schema-aware, foreign-key-level. It should read the live schema and resolve the foreign-key graph, so the generated data has valid relationships across tables rather than plausible values sitting in isolated cells. Column-level tools like Faker produce values but leave referential integrity for you to wire up by hand.
  • No masking rules to maintain. Generation from the live schema adapts to migrations automatically. A masking config, by contrast, drifts the moment someone adds a PII column nobody flagged.
  • CLI/CI-native. Compliant data you can only produce by hand isn't sustainable. The tool should run as a step in your pipeline so every ephemeral environment gets fresh, compliant data without a person in the loop.
  • Audit-defensible provenance. "Our dev and staging environments do not contain customer data; we generate it from the schema" is a clean answer for an auditor. "We copy production and run a masking script" invites questions about the script's coverage.
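The drift risk in the masking-rules criterion above can be shown with a short Python check. It is a sketch under assumptions: `SCHEMA_COLUMNS` and `MASKING_RULES` are hypothetical structures, and the scenario assumes a migration has just added `users.phone` without anyone updating the masking config.

```python
# Hypothetical snapshot of the live schema after a migration added users.phone.
SCHEMA_COLUMNS = {
    "users": ["id", "email", "full_name", "phone"],
    "orders": ["id", "user_id", "total_cents"],
}

# Hypothetical masking config: an explicit decision for every known column.
MASKING_RULES = {
    "users": {"id": "keep", "email": "mask", "full_name": "mask"},
    "orders": {"id": "keep", "user_id": "keep", "total_cents": "keep"},
}

def undecided_columns(schema_columns, masking_rules):
    """List (table, column) pairs that exist in the schema but have no rule."""
    return sorted(
        (table, column)
        for table, columns in schema_columns.items()
        for column in columns
        if column not in masking_rules.get(table, {})
    )

# Columns that would pass through a masking run untouched.
drift = undecided_columns(SCHEMA_COLUMNS, MASKING_RULES)
```

A team that masks would need a check like this to fail the build whenever `drift` is non-empty. With generation there is no per-column rule set to fall behind, so the check has nothing to guard.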

Seedfast is a CLI that points at a live Postgres database and generates relational data from a plain-language scope, re-reading the schema on every run. Only the metadata is ever read — table and column names, types, foreign keys, never the rows — so there's no production access to security-review and no masking step to maintain, because nothing in the pipeline was real to begin with.

seedfast seed --scope "B2B SaaS app with 5,000 accounts, users, and 90 days of activity"
  → Connected to PostgreSQL
  → Found 28 tables, 54 foreign keys
  → Generating data...
  → Done. Seeded 41,200 rows in 9.1s

From there, Seedfast walks the foreign-key graph so that every child row points at a parent that actually exists, resolving insert order on its own wherever the schema permits it — through nullable back-edges or deferrable constraints. The values it produces are domain-shaped: plausible names, amounts in sensible ranges, distributions that don't look machine-stamped. And when next sprint's migration adds a table, the following run just picks it up, with no masking config left encoding last week's schema and quietly drifting out of date.
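Resolving insert order over a foreign-key graph is a topological sort. The Python sketch below shows the general technique with the standard library's `graphlib`. It is not Seedfast's implementation: the table names, the `insert_order` helper, and the way nullable edges are modelled are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical foreign-key graph: table -> set of tables it references.
FOREIGN_KEYS = {
    "accounts": set(),
    "users": {"accounts"},
    "projects": {"accounts", "users"},
    "events": {"users", "projects"},
}

# A cycle: users reference teams, and teams reference a lead user.
CYCLIC_KEYS = {
    "teams": {"users"},
    "users": {"teams"},
}

def insert_order(dependencies, nullable_edges=frozenset()):
    """Return tables ordered so every parent is inserted before its children.

    nullable_edges holds (child, parent) pairs whose foreign-key column may
    be NULL. Those edges are skipped when ordering, on the assumption that
    the column is inserted as NULL and filled in by a later update.
    """
    pruned = {
        table: {p for p in parents if (table, p) not in nullable_edges}
        for table, parents in dependencies.items()
    }
    # static_order() raises CycleError if a cycle remains after pruning.
    return list(TopologicalSorter(pruned).static_order())

order = insert_order(FOREIGN_KEYS)
cycle_broken = insert_order(CYCLIC_KEYS, nullable_edges={("teams", "users")})
```

Calling `insert_order(CYCLIC_KEYS)` without the nullable edge raises `CycleError`, which is the case where the schema does not permit automatic ordering and something has to give, such as a deferrable constraint.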

Treat Seedfast as a technical control and nothing grander. By building data fresh, the CLI keeps production PII out of lower environments — real value, but only one line item in a program that still rests on scoping, access control, and audit logging. It is not a certification, and on its own it makes you neither GDPR nor SOC 2 compliant. If your program requires a signed BAA or a SOC 2 report from a sub-processor, check Seedfast's current status directly; don't infer it from this page, and keep treating the tool as one control among the rest. One data-path detail belongs in that same sub-processor review: to generate the data, Seedfast sends the schema definition — table and column names, types, constraints — to an AI provider, while the row values themselves never leave your database. If a column name is itself sensitive, make sure that path fits your policy. None of this is legal advice; for your own obligations, talk to qualified counsel.

If this is purely a Postgres-stack decision, the best Postgres test data generator comparison covers the wider tool field.

The same generate-from-schema logic shows up under each framework, with a different name on the obligation.

GDPR. Under GDPR, never-personal data is outside scope (Recital 26), and copying production into lower environments works against data minimization (Article 5). Generating from the schema satisfies minimization for non-production environments by removing the personal data entirely.

SOC 2. A Type II audit asks where your test data comes from, and reducing PII in dev and staging supports the confidentiality and privacy criteria an auditor evaluates (AICPA SOC 2). "We generate it from the schema" is a cleaner control narrative than documenting a masking pipeline's coverage.

HIPAA and PCI. The same construction keeps the test-data path clear of protected health information and cardholder data; the framework-specific mechanics live in HIPAA compliant test data (§164.514 de-identification) and test data for fintech (PCI DSS). For the staging-specific case of dropping the production copy, see staging without production data.

Anonymized and generated data get conflated, but they are different under GDPR. Anonymized data starts as real personal data and is processed to strip identifiers. Under Recital 26, if it can still be re-identified with additional information, it is only pseudonymised and remains personal data. Data that was generated and never personal has nothing to re-identify, which is the cleaner footing for GDPR test data.

The most defensible route to GDPR-safe test data generation is to build the rows from your schema instead of copying and masking production. A schema-aware generator reads your tables, columns, and foreign keys, then emits realistic relational rows with no real personal data in the pipeline. Before committing to one approach, see the data seeding tools guide, which weighs Faker, ORM seeders, masking, and schema-aware generation side by side.

Masking on its own does not make test data compliant. Masked or tokenized production data is pseudonymised under Recital 26, so it stays personal data and its environment stays in scope, and you still carry a production connection plus the pipeline behind it. Masking earns its place on fidelity: choose it when a test genuinely depends on production's real distributions, like fraud-model validation, and choose generation for development, CI, and demos. The comparison table above shows which side of the scope line each one lands on.

Generated test data is generally not personal data, provided the generator never touches production and the output cannot be linked back to real individuals. Data built from a schema describes no actual person, so there is no data subject behind the row, and it falls under Recital 26's anonymous-information exception rather than the pseudonymisation rule. The catch worth checking is provenance: if a generator quietly reads production rows to "learn" distributions, that assumption breaks.

Compliance is not something a single tool confers. A compliant test data tool removes one specific risk — production PII in non-production environments — and supports principles like data minimization. Test data compliance is one piece of a wider picture that also covers scoping, access controls, and audit logging. Generating from the schema is a strong control inside that program, and it doesn't stand in for the rest of it.

Seedfast is not affiliated with, endorsed by, or sponsored by the products compared here. All product names, logos, and brands are the property of their respective owners and are used for identification purposes only. Comparisons reflect publicly available information as of the date shown.

Tonic, Delphix, K2View are trademarks of their respective owners.