HIPAA Test Data: What Compliance Really Requires in Your Dev Environment
By the Seedfast team
Search "hipaa test data" and you get two SERPs collapsed into one: marketing pages about verifying that your software meets HIPAA technical safeguards, and a much smaller pocket of writing about populating dev databases for healthcare apps. This article is about the second one. If you're an engineer maintaining a Postgres schema with patients, encounters, and prescriptions, and your compliance lead has banned production copies in non-prod environments, the practical question is: what should the data in your dev and staging databases actually look like, and what does HIPAA require about it?
This isn't a HIPAA primer. It's the engineering view: what the regulation says about non-production environments, how that maps to columns in a real Postgres schema, and what changes when the proposed Security Rule update is finalized.
Key Takeaways
- HIPAA does not define the term "test data", but the Security Rule (45 CFR §164.308–§164.312) and the Privacy Rule's de-identification standard (§164.514) effectively decide what your dev and staging databases can hold — any environment with PHI is in HIPAA scope.
- The 18 Safe Harbor identifiers map directly onto columns in a typical healthcare schema: full_name, email, phone, dob, address, mrn, IP address, device ID, and free-text fields like notes and chief_complaint.
- Copying production into staging and "anonymizing" the obvious columns leaves PHI in free-text fields, JSON blobs, audit trails, and FK-linked context, and pulls the entire staging environment under BAA, audit-log, and MFA obligations.
- The proposed May 2026 Security Rule update raises the cost of PHI-in-dev sharply: MFA and encryption become required, annual penetration tests and vulnerability scans every six months become explicit, and network segmentation gets a mandate. Whatever environment holds PHI gets all of that.
- Schema-aware generation produces realistic relational data from your schema definition without reading any production rows — which keeps dev and staging out of HIPAA scope rather than trying to mask their way out of it.
What HIPAA actually says about test data
HIPAA never uses the phrase "test data". Two of the regulations that implement the statute matter for engineers: the Security Rule (45 CFR §164.308–§164.312), which sets administrative, physical, and technical safeguards for ePHI, and the Privacy Rule's de-identification standard (§164.514), which describes when patient data is no longer protected.
§164.514 defines two paths to de-identification:
- Safe Harbor: remove 18 specific identifiers and have no actual knowledge that the remaining data could identify a person. The 18 identifiers are listed verbatim in HHS guidance — names, dates more granular than year, geographic subdivisions smaller than state (with the 3-digit ZIP rule), telephone, fax, email, SSN, MRN, account numbers, biometric identifiers, full-face photos, and so on.
- Expert Determination: a qualified statistician documents that the risk of re-identification is "very small".
If your dev database contains data that satisfies one of these, it's not PHI. If it doesn't, it is — even if you call it "test data" in a sprint planning doc. Naming has no force here; the data shape decides.
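To make the Safe Harbor path concrete, here is a minimal sketch of two of its generalization rules: truncating ZIP codes to three digits and reducing birth dates to a year, with ages over 89 folded into a single category. The function names and the `RESTRICTED_ZIP3` set are assumptions made for illustration; the real list of sparsely populated ZIP prefixes comes from current Census data, and the other identifiers still have to be handled separately.

```python
from datetime import date

# Hypothetical placeholder: 3-digit ZIP prefixes whose combined area holds
# 20,000 or fewer people. Replace with the list derived from Census data.
RESTRICTED_ZIP3 = {"036", "059", "102"}


def generalize_zip(zip5: str, restricted: set[str] = RESTRICTED_ZIP3) -> str:
    """Keep only the first three ZIP digits, or '000' for sparse areas."""
    zip3 = zip5[:3]
    return "000" if zip3 in restricted else zip3


def generalize_dob(dob: date, today: date) -> str:
    """Reduce a birth date to its year; fold ages over 89 into '90+'."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    age = today.year - dob.year - (0 if had_birthday else 1)
    return "90+" if age > 89 else str(dob.year)
```

Passing this transformation is necessary for Safe Harbor but nowhere near sufficient: names, contact details, record numbers, free text, and the "no actual knowledge" condition all still apply.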
This sits on top of a structural rule: any environment that stores, processes, or transmits PHI is in HIPAA scope. Scope brings business associate agreements (BAAs) for every cloud service touching that environment, audit logging on access, breach notification obligations, encryption at rest and in transit, and access controls aligned with §164.308(a)(4) and §164.312(a). None of that goes away because an environment is "just staging".
The practical consequence: deciding what your test data looks like is, on a HIPAA-bound system, a regulatory decision about scope. Either your dev database holds something that meets §164.514 (and the environment is out of scope), or it doesn't (and the environment is in).
What counts as PHI in your Postgres database
Most engineers know HIPAA covers names and SSNs. The 18 identifiers list is wider, and several of them ride quietly inside columns nobody flags during code review. Mapped to a typical healthcare Postgres schema, here's what the Safe Harbor identifiers look like — and a working CREATE TABLE of the kind of schema this article keeps coming back to:
CREATE TABLE patients (
    id            uuid PRIMARY KEY,
    mrn           text UNIQUE NOT NULL,  -- PHI: medical record number
    full_name     text NOT NULL,         -- PHI: name
    email         text NOT NULL,         -- PHI: email
    phone         text,                  -- PHI: phone
    dob           date NOT NULL,         -- PHI: date more granular than year
    address       text,                  -- PHI: street address
    zip5          text,                  -- PHI: ZIP unless reduced to 3 digits
    ip_last_seen  inet,                  -- PHI: device IP
    created_at    timestamptz NOT NULL
);

CREATE TABLE providers (
    id         uuid PRIMARY KEY,
    full_name  text NOT NULL,
    npi        text
);

CREATE TABLE encounters (
    id               uuid PRIMARY KEY,
    patient_id       uuid REFERENCES patients(id),
    provider_id      uuid REFERENCES providers(id),
    occurred_at      timestamptz NOT NULL,  -- PHI: date more granular than year
    chief_complaint  text,                  -- free-text: usually PHI
    diagnosis_code   text                   -- code alone is not PHI; in context it can be
);

CREATE TABLE prescriptions (
    id            uuid PRIMARY KEY,
    encounter_id  uuid REFERENCES encounters(id),
    patient_id    uuid REFERENCES patients(id),
    drug_code     text NOT NULL,
    written_at    timestamptz NOT NULL,
    notes         text                   -- free-text: usually PHI
);
Things that surprise people on first audit:
- dob is PHI. Any date more granular than year, attached to an individual, counts. A date_of_birth column with full dates fails Safe Harbor.
- ip_last_seen is PHI. IP addresses are explicitly in the 18 identifiers.
- Free-text fields are usually PHI. chief_complaint, notes, provider_comments, and any description column tend to leak names, dates, and addresses dictated by clinicians or pasted from prior records. A clean schema with dirty free-text is still a PHI-holding environment.
- Audit and event tables. audit_log.actor_email, events.payload (JSON), and feature_flags.user_id are easy to forget when building a "no PHI in dev" inventory.
- The FK chain itself can re-identify. A 6-row patients table that ties to a 30-row encounters table that ties to a 12-row prescriptions table can identify someone even if no individual column does, when the FK pattern is rare enough. Generating synthetic data from the schema addresses this by construction. The FK chain is what makes referential integrity across patient records genuinely matter for HIPAA, not just for tests passing.
A useful exercise before you decide what your dev database should hold: open your schema and tag every column as PHI / not-PHI / depends-on-content. The columns that surprise you are usually where the leak is. Seedfast was built around the assumption that this list is long, that it grows with every migration, and that the safest place for those columns in a non-prod environment is to be filled with values that don't come from a real person at all.
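As a starting point for that tagging exercise, here is a rough sketch of a first-pass classifier driven by column name and type. The patterns and the `tag_column` helper are illustrative assumptions, not a complete or authoritative list; the output is a draft for a human to review, since a name-based heuristic will both over-flag (a provider's name) and under-flag (an innocuously named column full of clinical notes).

```python
import re

# Illustrative name patterns only; a real inventory needs human review.
PHI_NAMES = re.compile(
    r"name|email|phone|fax|dob|birth|address|zip|ssn|mrn|ip_|device", re.I
)
FREE_TEXT_NAMES = re.compile(r"note|comment|complaint|description|payload", re.I)


def tag_column(column: str, data_type: str) -> str:
    """Return 'PHI', 'depends-on-content', or 'not-PHI' for one column."""
    if PHI_NAMES.search(column):
        return "PHI"
    if data_type in {"json", "jsonb"} or FREE_TEXT_NAMES.search(column):
        return "depends-on-content"
    if data_type.startswith(("date", "timestamp")):
        # Conservative default: dates more granular than year count.
        return "PHI"
    return "not-PHI"
```

The inputs could come from `information_schema.columns` (`column_name`, `data_type`), where Postgres reports `timestamptz` as `timestamp with time zone`; the prefix check above is written to cover both spellings.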
Why production copies in development are the default mistake
The most common pattern for getting "realistic" data into a dev database is a sanitized production dump. It looks responsible: take a pg_dump, run an anonymize.sql script that overwrites obvious PHI columns, restore. People feel they did something rigorous.
The pattern fails in four ways that compound:
1. Anonymization scripts only cover columns somebody remembered. Six months in, three new tables exist, two columns were renamed, and a feature added a JSON metadata column that now contains free-text from clinicians. The script doesn't update itself. It also doesn't reach into JSON. It doesn't anonymize free-text fields like chief_complaint (you can't blindly rewrite clinical narrative without breaking tests). The script is a partial fix that the team treats as complete.
2. Free-text and FK-linked context survive masking. Suppose patient.full_name is overwritten with 'Test User ' || id. The notes column on the corresponding prescription row still says "Patient and her husband John mentioned the move to 412 Elm Street." Real masking of free-text is an NLP project, not a SQL UPDATE. Most teams ship without it.
3. The non-prod environment is now in HIPAA scope, fully. PHI in staging means staging needs a BAA chain, audit logging, MFA, encryption at rest, encryption in transit, breach-notification readiness, the works. That holds even when the team says "it's only there for an afternoon", and especially when one developer restores a snapshot to staging "for a few hours" to debug a billing bug, which tends to be the moment the compliance lead finds out. The environment was in scope while the data sat there. A 3-hour exposure is a 3-hour HIPAA event.
4. The auditor sees the dump. A SOC 2 auditor reviewing test-data sources will ask. The honest answer "we copy production and run a script that the team wrote in 2023" earns a finding. The team that says "our dev environments do not contain PHI" needs to demonstrate the path that data takes — and "we generate from the schema" is the cleanest version of that demonstration. (Day-to-day, this is what staging environments built without production data try to make feasible.)
The deeper issue: production-copy-then-mask treats PHI as something you reduce to manageable levels in non-prod. The regulation doesn't operate that way. Either an environment has PHI (in scope, with the full obligations) or it doesn't (out of scope). There is no middle tier called "lightly anonymized". The audit happens at the binary.
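The second failure mode above is easy to demonstrate. The sketch below assumes a hypothetical `find_leaks` helper that searches every column of a masked row for identifiers known from the original record. It is a verification aid, not a masking technique, and because it needs the original identifiers it can only run inside an environment that is already in scope.

```python
def find_leaks(
    original_identifiers: list[str], masked_rows: list[dict]
) -> list[tuple[int, str, str]]:
    """Report (row index, column, identifier) where an original value survives."""
    leaks = []
    for index, row in enumerate(masked_rows):
        for column, value in row.items():
            text = str(value).lower()
            for identifier in original_identifiers:
                if identifier.lower() in text:
                    leaks.append((index, column, identifier))
    return leaks


# The name column was overwritten, but the clinical narrative was not.
masked = [{
    "full_name": "Test User 17",
    "notes": "Patient and her husband John mentioned the move to 412 Elm Street.",
}]
leaks = find_leaks(["John", "412 Elm Street"], masked)
# leaks == [(0, "notes", "John"), (0, "notes", "412 Elm Street")]
```

An empty result from a check like this does not prove the data is de-identified; substring matching misses misspellings, nicknames, and context that identifies someone indirectly.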
The proposed May 2026 Security Rule update: what changes for dev environments
A status note before specifics: as of writing, OCR has not yet issued a final rule updating the HIPAA Security Rule. The proposed rule (NPRM) was released on December 27, 2024 and published in the Federal Register on January 6, 2025. May 2026 is the target finalization date on the regulatory agenda, but timing is not guaranteed; teams we've talked to are planning against the proposed text rather than waiting for the final version. Cite the NPRM, hedge the timeline, and revisit when OCR confirms.
What the NPRM proposes, with the dev-environment translation in plain terms:
- MFA becomes required, not "addressable". Every login to an environment storing or processing ePHI needs MFA. If staging holds PHI, every developer's path into staging — bastion, VPN, IAM console, database admin tool — gets an MFA gate. If staging holds only generated data, this requirement applies to production but not to that environment.
- Encryption at rest and in transit become required. Same scope question: if PHI never enters the dev disk, the rule applies upstream of dev rather than to it.
- Annual penetration tests and vulnerability scans at least every six months. Both must cover any system handling ePHI. A staging environment with anonymized-prod data is in that category. A staging environment seeded from the schema isn't.
- Network segmentation mandate. Production networks holding PHI need to be segmented from non-PHI environments. This is much easier to draw on a network diagram when dev and staging are decisively in the non-PHI category.
- Stricter audit cadence and incident-response obligations. Audit logs covering ePHI access become mandatory at higher granularity. Staging access counts.
- 180-day compliance window. The NPRM proposes an effective date 60 days after the final rule is published and a compliance date 180 days after that. If the rule lands in mid-to-late 2026, the obligation hits production systems in 2027.
There's a conclusion engineering teams are reaching when reading the NPRM that compliance teams sometimes reach later: the cheapest way to prepare for the rule is to remove PHI from environments where it doesn't need to be. Production stays in scope (it has to). Dev and staging don't have to be, and the proposed rule makes it much more expensive when they are.
Three approaches teams use to populate non-production databases
The decision about what to put in your dev database almost always reduces to one of three approaches. They are not three flavors of the same thing — they map to different scope outcomes under HIPAA.
| Approach | What it does | HIPAA scope effect | Honest tradeoffs |
|---|---|---|---|
| Production copy (with or without masking) | pg_dump from prod, restore to dev/staging, optionally run a masking script. | Dev/staging is in scope. Even with masking, free-text and JSON columns and audit trails carry PHI; the environment has obligations. | High realism. Fast to set up the first time. Ongoing scope cost: BAAs, MFA, audit logs, encryption, breach readiness for a non-prod environment. |
| Masked or de-identified subset | Pull a slice of prod, push through an anonymization tool that rewrites flagged columns. Enterprise tools (Tonic Structural, Delphix, K2View) sit here. | Can move dev/staging out of scope if the de-identification meets §164.514, but only when the masking is exhaustive enough to satisfy Safe Harbor or is backed by an Expert Determination. The pipeline itself is in HIPAA scope. | Solves the masking gap if done thoroughly. Still requires a connection to production and a security review for that connection. Setup is weeks; pricing starts at $200K/year. |
| Schema-aware generation (no production access) | Read the live schema, generate realistic relational data from scratch, write it to dev/staging. | Dev/staging can stay out of scope because no PHI enters from the generator. The generation tool reads schema metadata, not patient rows. | No production access at all. Predictable cost. Realism depends on the generator's domain understanding — generated data is shaped right but is not your actual users' data, which matters if a bug only repros against real production patterns. PostgreSQL-only for now if you're using Seedfast (other generators support broader databases with their own tradeoffs). |
The phrase "no production access" carries more weight than it sounds like. It cuts the BAA conversations for the masking vendor, the security review for that connection, the row-count negotiations between data engineering and compliance, and the audit question of who at the masking vendor has access to PHI. Seedfast is built that way: it reads your schema, generates the patients → encounters → prescriptions chain in coherent order, and never asks for a production credential.
Day 02 in this series covered the broader category — test data management — without anchoring on regulation. Here the comparison is HIPAA-specific, and the column that matters is "scope effect", not "how realistic does the data look in screenshots".
What HIPAA test data looks like in practice
If you choose the third approach, the next question is what the data should actually look like. "No PHI" is a binary; "realistic enough that integration tests find real bugs" is a target. Both have to be true.
A generated row in patients for a healthcare app might look like:
id: a4c1... (random UUID, not from prod)
mrn: MRN-1857293 (synthetic format, no collision with real MRN range)
full_name: Priya Adesanya (plausible name, not a real patient)
email: priya.adesanya@example.test (reserved test domain)
phone: +1-202-555-0148 (555-0100 through 555-0199 are reserved for fictional use)
dob: 1979-04-22 (plausible age, not a real DOB)
address: 14 Maple Court, Springfield (fictional)
zip5: 45203 (real ZIP shape; not tied to a real person)
ip_last_seen: 198.51.100.42 (TEST-NET-2 reserved range)
Several things are intentionally true about this row:
- It's individually realistic — first name distribution, surname plausibility, email format, age cohort, phone format. Tests that touch these fields exercise the same code paths they would in production.
- None of it traces back to a real person. The 555-01XX phone range, the example.test domain, and the TEST-NET reserved IP range exist precisely so engineers can use them without colliding with real numbers, domains, or addresses.
- The MRN format is internal-system-shaped without being real. If your production MRNs follow a pattern, generated MRNs follow the same pattern but in a non-production range.
- It's reproducible if the generator is seeded, and disposable if it isn't. Either is a choice.
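The properties in this list can be sketched with nothing more than a seeded random number generator and the reserved ranges. This is an illustration under stated assumptions, not how Seedfast generates data: the name lists, the `MRN-` format, the 202 area code, and the `generate_patient` helper are all made up for the example.

```python
import random
import uuid
from datetime import date, timedelta

FIRST_NAMES = ["Priya", "Marcus", "Elena", "Tomas"]        # illustrative
LAST_NAMES = ["Adesanya", "Lindqvist", "Okafor", "Reyes"]  # illustrative


def generate_patient(rng: random.Random) -> dict:
    """Build one synthetic patient row using only reserved ranges."""
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    dob = date(1940, 1, 1) + timedelta(days=rng.randrange(25_000))
    return {
        "id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "mrn": f"MRN-{rng.randrange(1_000_000, 10_000_000)}",
        "full_name": f"{first} {last}",
        "email": f"{first}.{last}@example.test".lower(),    # reserved .test TLD
        "phone": f"+1-202-555-01{rng.randrange(100):02d}",  # fictional-use block
        "dob": dob.isoformat(),
        "ip_last_seen": f"198.51.100.{rng.randrange(1, 255)}",  # TEST-NET-2
    }


# Same seed, same row: reproducible. Drop the seed and the row is disposable.
row_a = generate_patient(random.Random(42))
row_b = generate_patient(random.Random(42))
```

Passing the `random.Random` instance in, rather than using the module-level functions, is what keeps the choice between reproducible and disposable data in the caller's hands.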
A generated row in encounters for the same patient might look like:
id: e8a2...
patient_id: a4c1... (FK to the patient above)
provider_id: 91c5... (FK to a generated provider)
occurred_at: 2026-02-14 09:30:00+00 (within plausible visit window)
chief_complaint: "Routine follow-up for blood pressure check; patient reports good adherence."
diagnosis_code: I10 (illustrative code; verify against your code set)
The chief_complaint is a generated narrative — clinical-shaped, no real person attached. The occurred_at falls inside a window that's coherent with the patient's created_at. The provider_id resolves to a generated provider row, not a NULL or a dangling integer.
A generated row in prescriptions continues the chain:
id: rx40...
encounter_id: e8a2... (FK to the encounter above)
patient_id: a4c1... (FK to the patient above)
drug_code: DRUG-PLACEHOLDER-A12 (illustrative; map to your code set)
written_at: 2026-02-14 09:35:00+00 (after encounter.occurred_at)
notes: "Continue current regimen; recheck in 3 months."
Three properties this row has that a Faker output would not:
- Seedfast respects insert order from the FK graph: encounter_id, patient_id, and the chain back to provider_id are all valid references to rows that exist. Faker scripts get this wrong roughly every time someone adds a table.
- Temporal coherence. prescriptions.written_at is later than encounters.occurred_at, which is later than patients.created_at. A plain CHECK constraint in Postgres cannot reference another table, so most schemas do not enforce this ordering at all; a generator that respects narrative time avoids the kind of test bug where a prescription appears to have been written before its encounter happened.
- Domain-shaped values. The clinician-shaped narrative in notes, the format of mrn, and the consistency of phone with country are the things that turn a generated row from "passes constraint checks" into "passes a sniff test from a domain expert".
The point isn't that this beats Faker on a feature checklist. It's that Seedfast produces rows with these properties from your schema alone, so your dev database can do the work it needs to do without putting the environment in scope.
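The temporal-coherence property is the easiest of the three to get wrong in a hand-written script, so here is a minimal sketch of one way to build it in: derive each timestamp from the one it depends on instead of drawing them independently. The function names and the specific windows (a year, 180 days, 30 minutes) are arbitrary assumptions for illustration.

```python
import random
from datetime import datetime, timedelta, timezone


def coherent_timeline(rng: random.Random, start: datetime) -> dict:
    """Draw timestamps so each event follows the one it depends on."""
    created_at = start + timedelta(days=rng.randrange(365))
    occurred_at = created_at + timedelta(days=rng.randrange(1, 180))
    written_at = occurred_at + timedelta(minutes=rng.randrange(1, 30))
    return {
        "patients.created_at": created_at,
        "encounters.occurred_at": occurred_at,
        "prescriptions.written_at": written_at,
    }


def is_coherent(timeline: dict) -> bool:
    """True when registration precedes the visit, which precedes the script."""
    return (
        timeline["patients.created_at"]
        < timeline["encounters.occurred_at"]
        < timeline["prescriptions.written_at"]
    )


timeline = coherent_timeline(
    random.Random(7), datetime(2025, 1, 1, tzinfo=timezone.utc)
)
```

A check like `is_coherent` is also usable on its own as a test-suite assertion against whatever seeding approach a team already has.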
How schema-aware generation removes PHI from the development surface
The framing in the SERP today is that synthetic data is a kind of anonymization. It isn't. Anonymization starts with PHI and reduces it. Schema-aware generation starts with no PHI and produces something realistic. The two approaches end up at different points on the regulatory map.
Anonymization-based pipelines have a permanent connection to production. They have to: that's where the input data lives. That connection is a path PHI travels along, however briefly, even if the output is masked. The masking vendor has access. The pipeline is in scope. The dev environment may or may not be in scope depending on how clean the masking is.
Schema-aware generation has no such connection. It needs the schema definition — table list, column types, constraint set — but not a single row of patient data. Seedfast reads the database schema directly, builds a dependency graph from foreign keys, and generates realistic rows in a coherent order. The patients table gets filled before the encounters that reference it. The free-text fields get domain-shaped values rather than lorem ipsum or random strings. The whole operation reads no PHI because there is no PHI to read.
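The ordering step described above is a topological sort of the foreign-key graph. The sketch below shows the general technique with Python's standard-library `graphlib`, using the four tables from this article; it is a generic illustration, not Seedfast's implementation, and the hard-coded `FOREIGN_KEYS` mapping stands in for whatever a tool would read from the schema.

```python
from graphlib import TopologicalSorter

# Each table maps to the tables its foreign keys reference (its parents).
FOREIGN_KEYS = {
    "patients": set(),
    "providers": set(),
    "encounters": {"patients", "providers"},
    "prescriptions": {"encounters", "patients"},
}


def insert_order(foreign_keys: dict[str, set[str]]) -> list[str]:
    """Return tables ordered so every parent precedes its children."""
    return list(TopologicalSorter(foreign_keys).static_order())


order = insert_order(FOREIGN_KEYS)
# patients and providers come first (in either order), then encounters,
# then prescriptions.
```

`static_order()` raises `graphlib.CycleError` when tables reference each other in a loop, which is a case a real generator has to handle explicitly, for example by inserting one side with a NULL reference and updating it afterwards.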
The operational result for a HIPAA-bound team: dev and staging stop being environments-with-masked-PHI and start being environments-with-no-PHI. Seedfast keeps PHI from reaching dev or staging in the first place — obligations keyed to PHI presence apply to production rather than following data into environments that never received it.
If a healthcare schema like the one in this article — patients, encounters, prescriptions, providers, with the FK chains they imply — sounds like the kind of schema your team would point a generator at for the first time, Seedfast handles that shape. The CLI reads the live schema, you describe the scenario you need (a clinic with a few hundred patients, a year of visit history, prescriptions clustered around chronic-care patterns, whatever the integration test calls for), and the data lands. Setup is one CLI command against a live schema, with no production connection to provision.
FAQ
Can we use a small subset of real PHI in dev if we de-identify it ourselves?
Yes, if the de-identification meets §164.514: either Safe Harbor (all 18 identifiers removed and no actual knowledge that the remainder could identify a person) or Expert Determination. In practice, most ad-hoc de-identification scripts miss free-text fields, JSON columns, and FK-pattern re-identification risk, and don't meet the standard. Many compliance leads treat any subset originating from production as de facto in scope until proven otherwise.
Is synthetic test data automatically HIPAA-compliant?
HIPAA doesn't certify data sets, and no tool is HIPAA-compliant on its own; only the system as a whole can be. Generated data that contains no information about real individuals is not individually identifiable health information, and therefore not PHI, under the definitions in 45 CFR §160.103. That makes the environment that holds it eligible to be out of HIPAA scope, which is the practical effect teams care about. It does not replace the rest of your HIPAA program for the systems that do hold PHI.
Does the proposed Security Rule update apply to development environments?
The Security Rule applies to any environment that creates, receives, maintains, or transmits ePHI. If your dev environment holds PHI, the proposed MFA, encryption, audit-log, and pen-test requirements would apply to it. If it holds none, those requirements would not reach it. The rule is keyed to PHI presence, not to environment label.
Do we need a BAA with our test-data tool?
You need a BAA with anyone handling PHI on your behalf. A schema-aware generator that reads schema metadata (and not patient rows) is generally not handling PHI; an anonymization tool that connects to production typically is. Your compliance lead is the right reviewer, but the question to put in front of them is "does the tool ever see patient data", and for schema-aware generation the answer is no.
Where does this leave SOC 2?
SOC 2 auditors will ask about test-data sources during a Type II audit. "Our dev/staging environments do not contain customer or patient data; we generate test data from the schema" is a clean answer. "Our dev/staging environments contain anonymized production data" is an answer that opens further questions about the anonymization process, which means more documentation and more controls.
Related guides
- Referential integrity across patient records — why the FK chain is doing more work in healthcare schemas than in most domains, and what generators have to get right.
- See how Seedfast handles a healthcare schema — point it at your schema, describe the clinical scenario you need, and seed dev or staging without production access.