
Test Data for Healthcare Apps: What Your Database Actually Needs

By the Seedfast team

Key Takeaways

  • Healthcare schemas carry temporal and relational dependencies that generic test data tools miss: a prescription must follow a diagnosis, lab results must reference a specific encounter, and visit patterns need to reflect clinical reality
  • Column-level generators like Faker produce valid cells but not valid patient timelines — the data passes constraints but doesn't test real workflows
  • File-based tools like Synthea generate standardized formats (HL7 FHIR, CSV), not SQL for your specific PostgreSQL schema. Bridging that gap is real engineering work that usually gets hand-rolled
  • Cross-row temporal consistency separates useful test data for healthcare from rows that merely fill tables — visit clustering during acute episodes, prescription chains that follow diagnoses, lab values that track physiological trajectories
  • Seedfast reads your FK graph and produces records with these relational properties, removing the need for hand-rolled orchestration or production data access

Healthcare applications don't fail on bad data in obvious ways. They fail when a prescription references an encounter that hasn't happened yet, when lab results appear for a patient with no visit history, or when a demo environment shows five identical patients all born on January 1st, 1970. Test data for healthcare apps isn't hard because of volume — it's hard because healthcare data carries structural dependencies that most generation approaches ignore. This article is about what realistic healthcare test data looks like at the PostgreSQL schema level — not about which regulation requires it (that's a separate guide) and not a primer on which fields count as PHI.

Why test data for healthcare is a different problem

Every development team that handles healthcare data runs into the same baseline constraint: production data stays in production. Patient records carry PHI — protected health information — and copying them to development or staging environments pulls those environments into HIPAA scope. Most healthcare engineering teams have already internalized this. Their compliance consultant has already said it. What's left to figure out isn't whether they need generated data — it's what that data actually has to look like.

Anonymization is the traditional workaround — copy production, mask the sensitive fields, use the result. But anonymization still starts with production data. If the masking pipeline misses a column (a free-text notes field, a JSON blob with embedded patient identifiers), real records leak into environments they were never meant to reach. Maintaining that pipeline also requires production access and ongoing upkeep as the schema evolves. Generating from the schema definition is a different approach: Seedfast reads your PostgreSQL schema and produces data from scratch, no production data involved. That removes the masking-completeness risk but shifts the problem: the generated data has to be realistic enough to actually test your application.

And here's where healthcare diverges from, say, an e-commerce schema or a SaaS CRM. Sensitivity alone isn't the issue. Structure is.

Healthcare databases carry temporal dependencies between tables that don't exist in most application domains. An encounter references a patient and a date. Prescriptions reference that encounter and must have been written on or after the encounter date. Lab results reference the same encounter and carry values that, in a realistic dataset, track a physiological trajectory across multiple visits. Insurance coverage records have effective dates that must overlap with the encounters they cover.

These aren't exotic edge cases. They're the baseline shape of healthcare data. And most test data generation methods don't account for them.

What healthcare data looks like in your database

To make this concrete, here's a simplified version of what a healthcare app's PostgreSQL schema looks like — not a reference architecture, just the tables a typical telehealth or clinic management platform ends up with after six months of development:

patients
  ├─► encounters (patient_id FK, visit_date, type, provider)
  │     ├─► prescriptions (encounter_id FK, medication, dosage, start_date)
  │     ├─► lab_results (encounter_id FK, test_name, value, units, result_date)
  │     └─► diagnoses (encounter_id FK, icd_code, description, diagnosed_at)
  └─► insurance_coverage (patient_id FK, plan_name, effective_start, effective_end)

Five tables, four foreign keys, and a set of implicit temporal rules that no FK constraint can encode. PostgreSQL enforces that a prescription belongs to an encounter (encounter_id is a valid FK) — but nothing in the schema enforces that the prescription's start_date falls on or after the encounter's visit_date, that lab results show a plausible progression across visits, or that insurance coverage periods overlap with the encounters they're supposed to cover.
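
To make that gap concrete, here's a minimal sketch in plain Python — hypothetical row dicts shaped like the simplified schema above, not tied to any tool — of the kind of cross-table check that FK constraints alone can't express:

```python
from datetime import date

# Hypothetical rows mirroring the simplified schema above.
encounters = {
    1: {"patient_id": 10, "visit_date": date(2024, 3, 5)},
    2: {"patient_id": 10, "visit_date": date(2024, 6, 12)},
}
prescriptions = [
    {"encounter_id": 1, "medication": "metformin", "start_date": date(2024, 3, 5)},
    # Valid FK, invalid timeline: written before the visit it references.
    {"encounter_id": 2, "medication": "atorvastatin", "start_date": date(2024, 6, 1)},
]

def timeline_violations(encounters, prescriptions):
    """Find prescriptions dated before their encounter's visit_date.

    Every row here passes the FK constraint; only this cross-table
    check catches the second prescription."""
    return [
        rx for rx in prescriptions
        if rx["start_date"] < encounters[rx["encounter_id"]]["visit_date"]
    ]

bad = timeline_violations(encounters, prescriptions)
```

Both prescriptions would insert cleanly into PostgreSQL; only the explicit cross-table check flags the second one.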

This is where test data for PostgreSQL in the healthcare domain diverges from generic approaches. FK structure is the foundation. On top of it sits a layer of domain rules that are implicit in any real patient record but absent from the schema's constraint definitions.

Seedfast reads your schema's FK graph and generates records that respect these relationships — and it goes further, producing cross-row temporal consistency for the domain constraints that sit above the schema. What those constraints look like in healthcare is the subject of the next two sections.

And this is a five-table simplification. Production telehealth platforms add providers, appointments, referrals, and billing tables — each with more FK edges and more implicit temporal rules. By the time a schema reaches 20 tables, the cross-table dependency web is dense enough that hand-rolling test data becomes its own engineering project.

Where column-level generators break

Faker, Mockaroo, and similar column-level generators are good at producing realistic-looking values for individual fields. Patient names that look like names. Medication fields containing real medication names. Dates that fall within a plausible range.

Where they break is at the row level. Each prescription row gets a random medication, a random dosage, and a random date — with no awareness that this prescription is supposed to be connected to a specific encounter for a specific patient on a specific date.

Every constraint passes. FK values are valid. NOT NULL columns are filled. CHECK constraints are satisfied. But the dataset doesn't test anything meaningful, because relationships between rows don't reflect how healthcare data actually behaves.
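
A stdlib-only sketch of what column-level generation amounts to: every field drawn independently, so the FK is valid but the start date has no relationship to the encounter it references. The field names mirror the simplified schema above; the generator itself is illustrative, not any particular library's API.

```python
import random
from datetime import date, timedelta

random.seed(0)  # deterministic for illustration

encounter = {"id": 1, "visit_date": date(2024, 6, 15)}

def fake_prescription(encounter_id):
    """Column-level generation: each field drawn independently.
    Nothing ties start_date to the encounter's visit_date."""
    return {
        "encounter_id": encounter_id,
        "medication": random.choice(["metformin", "lisinopril", "atorvastatin"]),
        "start_date": date(2024, 1, 1) + timedelta(days=random.randrange(365)),
    }

rows = [fake_prescription(encounter["id"]) for _ in range(50)]

# With dates uniform over the year, some inevitably precede the June visit —
# valid FKs, impossible timelines.
before_visit = [r for r in rows if r["start_date"] < encounter["visit_date"]]
```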

Consider a simple test scenario: a patient with Type 2 diabetes. In a realistic dataset, you'd expect an initial encounter with a diabetes diagnosis, followed by a metformin prescription, followed by quarterly lab results tracking HbA1c levels, followed by dosage adjustments if the levels don't improve. Four tables, ten or fifteen rows, all temporally ordered and clinically linked.

Column-level generators produce rows in all four tables — but the diabetes diagnosis might appear six months after the metformin prescription. HbA1c values might be random floats between 0 and 100 rather than plausible readings between 4.0 and 14.0. Dosage adjustments might precede the lab results that would have motivated them.
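
For contrast, here's an illustrative sketch of the same scenario generated in clinical order — hand-picked values and a hypothetical helper, not any tool's actual output:

```python
from datetime import date, timedelta

def diabetes_timeline(start: date):
    """Build a temporally ordered Type 2 diabetes scenario:
    diagnosis first, prescription the same day, then quarterly
    HbA1c labs that improve after treatment starts."""
    encounters = [{"id": 1, "visit_date": start}]
    diagnoses = [{"encounter_id": 1, "icd_code": "E11.9", "diagnosed_at": start}]
    prescriptions = [{"encounter_id": 1, "medication": "metformin", "start_date": start}]
    labs = []
    hba1c = 8.9  # elevated at diagnosis, trending down on treatment
    for quarter in range(1, 4):
        visit = start + timedelta(days=91 * quarter)
        eid = quarter + 1
        encounters.append({"id": eid, "visit_date": visit})
        hba1c = round(hba1c - 0.5, 1)
        labs.append({"encounter_id": eid, "test_name": "HbA1c",
                     "value": hba1c, "units": "%", "result_date": visit})
    return encounters, diagnoses, prescriptions, labs

enc, dx, rx, labs = diabetes_timeline(date(2024, 1, 10))
```

Ten rows across four tables, and every ordering rule from the scenario holds: the prescription follows the diagnosis, each lab follows an encounter, and the values form a trajectory instead of noise.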

Data fills the schema but doesn't exercise the application. A separate guide covers in more detail why synthetic test data generated from the schema avoids this class of problem.

Where file-based tools leave you stranded

Synthea — the most widely used open-source tool for synthetic patient data — addresses the clinical realism problem well. Patient life histories get simulated with disease progressions, medication regimens, and encounter sequences that follow clinical logic. Records come out temporally consistent and medically plausible. For research and analytics, Synthea is a strong choice.

Where it falls short is format. Synthea exports HL7 FHIR bundles, C-CDA documents, and flat CSVs — but doesn't connect to your PostgreSQL database or know about your encounters table, your prescriptions table, or your custom insurance_coverage schema. What you get is standardized healthcare data in a standardized format, not data shaped to your application's relational model.

Public healthcare datasets have the same limitation. CMS Synthetic Public Use Files provide claims-level data in fixed-width or CSV format. MIMIC-III and MIMIC-IV offer rich clinical datasets for research — valuable for data science and ML training, but not designed to populate a development database with a custom schema.

Bridging this gap — downloading a dataset or generating Synthea output, then writing an ETL pipeline to transform it into your schema, mapping FHIR resources to your table structure, handling FK ordering, and updating the pipeline every time your schema changes — is a real engineering project. Teams that attempt it usually spend a week building the pipeline and then abandon it after two or three migrations because the maintenance cost isn't justified.

When it comes to test data for healthcare applications, the actual need isn't a dataset: it's data that fits your schema. The difference between a downloadable CSV and a populated PostgreSQL database with valid relationships is the same gap that separates staging environments built on production copies from staging environments that never touch production data.

Cross-row consistency: what makes healthcare test data useful

This is the piece that neither column-level generators nor file-based tools address for your specific schema: cross-row temporal consistency.

Realistic healthcare data isn't just valid foreign keys and plausible column values. Patient timelines have to behave like patient timelines.

Visit patterns cluster. Someone with a chronic condition visits quarterly during stable periods and weekly during acute episodes. Encounters should reflect this — not uniform random dates spread across the year, but clusters and gaps that mirror how patients actually interact with a healthcare system.

Prescription chains follow diagnoses. A statin prescription follows a lipid panel with elevated LDL. An antibiotic prescription follows an encounter with an infection diagnosis. Start dates follow encounter dates. Refills follow the previous prescription's end date. These aren't random associations — they're sequences with causal and temporal order.

Lab values track trajectories. Blood glucose readings across six months should show a pattern — improving after a medication change, gradually worsening if untreated, or stable during maintenance therapy. Random float values in the lab_results table don't test the application logic that flags deteriorating trends or triggers alerts on out-of-range values.

Insurance coverage overlaps with encounters. If a patient has a gap in coverage, their encounters during that gap should be either absent or flagged differently. Datasets where every patient has continuous coverage and visits distributed uniformly don't exercise the edge cases that billing and eligibility logic needs to handle.
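
As an illustration of the first pattern, here's a sketch of clustered visit dates — a hypothetical helper with hand-picked episode timing, shown only to make the shape of the data concrete:

```python
from datetime import date, timedelta

def clustered_visits(start: date):
    """Sketch of realistic visit clustering: quarterly maintenance
    visits, plus a weekly cluster during a hypothetical acute
    episode starting around day 110."""
    visits = [start + timedelta(days=91 * q) for q in range(3)]       # stable: quarterly
    episode_start = start + timedelta(days=110)                       # acute episode begins
    visits += [episode_start + timedelta(weeks=w) for w in range(4)]  # weekly follow-ups
    return sorted(visits)

visits = clustered_visits(date(2024, 1, 2))
# Gaps between consecutive visits: long during stable periods,
# exactly 7 days inside the acute cluster.
gaps = [(b - a).days for a, b in zip(visits, visits[1:])]
```

Uniform random dates would give gaps scattered around one mean; the clustered version gives the long-gap/short-gap rhythm that scheduling and follow-up logic actually has to handle.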

Cross-row consistency is what distinguishes test data for healthcare that exercises your application's logic from data that merely fills rows. Fintech schemas face the same class of problem: domain-specific temporal relationships encoded in the data, not just in the schema.

Seedfast produces this. It reads your FK graph, and when a developer describes a scenario — "telehealth app with 200 patients, 6 months of encounter history" — generation is designed to respect the domain logic: a prescription's start date follows its encounter's visit date, lab results track a progression rather than noise, and insurance coverage periods align with the encounters they cover.

You can't bolt this onto a column-level generator by writing better factory code. Consistency has to be generated across tables simultaneously, with awareness of how rows relate to each other — not just which FK they reference, but when the referenced event occurred.

Generating test data for your healthcare schema

Seedfast collapses the gap between "I need realistic patient data" and "I need data in my specific PostgreSQL schema".

Point the tool at your database, describe the scenario, and the FK graph plus domain context handle the rest. A typical session looks like this:

$ seedfast seed --scope "telehealth app with 200 patients, 6 months of encounter history, mixed chronic and acute cases"
  → Connected to PostgreSQL
  → Found 22 tables, 31 foreign keys
  → Generating data...
  → Done. Seeded 14,200 rows in 8.1s

No production data was accessed. No FHIR-to-SQL pipeline was written. No seed script needs updating after the next migration. Seedfast reads the current schema on every run, so when the team adds a referrals table next sprint, the next seed picks it up automatically.

Try it on your own schema — get started here.

Generation starts from the schema itself — not from a factory definition, not from a fixture file, not from a production export. Describe a scenario, and Seedfast translates that into records that respect both the FK constraints the database enforces and the temporal patterns the domain implies.

For teams approaching their first SOC 2 audit or working under HIPAA test data requirements, removing production data from development environments is often the first action item the compliance consultant recommends. Seedfast makes that actionable without sacrificing the data realism that developers need for test data management across local, CI, and staging environments.