Test Data for Healthcare Apps: What Your Database Actually Needs

By Mikhail Shytsko, Founder at Seedfast

This page covers the relational and temporal shape healthcare schemas need. If you're here for the HIPAA compliance angle — §164.514 de-identification, what counts as PHI in a Postgres schema, and keeping dev or staging out of HIPAA scope — see HIPAA test data.

TL;DR: Test data for healthcare apps needs three things a generic generator won't produce: foreign-key chains that resolve in the right order (patients → encounters → prescriptions), temporal ordering so a prescription never predates its encounter, and cross-row trajectories like an HbA1c series that descends across visits. Compliance rules are a separate concern.

  • Healthcare schemas carry temporal and relational dependencies that generic test data tools miss: a prescription must follow a diagnosis, lab results must reference a specific encounter, and visit patterns need to reflect clinical reality
  • Cross-row temporal consistency separates useful healthcare test data from rows that merely fill tables — visit clustering during acute episodes, prescription chains that follow diagnoses, lab values that track physiological trajectories
  • Schema-aware generation reads your live Postgres schema and produces rows in FK-dependency order — patients before encounters, encounters before prescriptions — without ever touching production data

Healthcare applications don't fail on bad data in obvious ways. They fail when a prescription references an encounter that hasn't happened yet, or when lab results appear for a patient with no visit history. Healthcare test data isn't hard because of volume — it's hard because healthcare data carries structural dependencies that most generation approaches ignore. This article is about what realistic healthcare test data looks like at the relational schema level — not about which regulation requires it (that's a separate guide) and not about what PHI fields are.

Healthcare databases carry temporal dependencies between tables that don't exist in most application domains. An encounter references a patient and a date. Prescriptions reference that encounter and must have been written on or after the encounter date. Lab results reference the same encounter and carry values that, in a realistic dataset, track a physiological trajectory across multiple visits. Insurance coverage records have effective dates that must overlap with the encounters they cover.

These aren't exotic edge cases. They're the baseline shape of healthcare data. And most test data generation methods don't account for them.

To make this concrete, here's a simplified version of what a healthcare app's relational schema looks like — not a reference architecture, just the tables a typical telehealth or clinic management platform ends up with after six months of development:

patients
  ├─► encounters (patient_id FK, visit_date, type, provider)
  │     ├─► prescriptions (encounter_id FK, medication, dosage, start_date)
  │     ├─► lab_results (encounter_id FK, test_name, value, units, result_date)
  │     └─► diagnoses (encounter_id FK, icd_code, description, diagnosed_at)
  └─► insurance_coverage (patient_id FK, plan_name, effective_start, effective_end)

Six tables, five foreign keys, and a set of implicit temporal rules that no FK constraint can encode. The database enforces that a prescription belongs to an encounter (encounter_id is a valid FK), but nothing in the schema enforces that the prescription's start_date falls on or after the encounter's visit_date, that lab results show a plausible progression across visits, or that insurance coverage periods overlap with the encounters they're supposed to cover.
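One way to surface a rule the schema can't enforce is a validation query run after seeding. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres, with a two-table cut of the schema above. The sample rows and the find_early_prescriptions helper are illustrative assumptions, not part of any tool.

```python
import sqlite3

def find_early_prescriptions(conn):
    """Return (prescription_id, start_date, visit_date) for prescriptions dated before their encounter."""
    return conn.execute(
        """
        SELECT p.id, p.start_date, e.visit_date
        FROM prescriptions p
        JOIN encounters e ON e.id = p.encounter_id
        WHERE p.start_date < e.visit_date
        ORDER BY p.id
        """
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE encounters (
        id INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL,
        visit_date TEXT NOT NULL
    );
    CREATE TABLE prescriptions (
        id INTEGER PRIMARY KEY,
        encounter_id INTEGER NOT NULL REFERENCES encounters(id),
        medication TEXT NOT NULL,
        start_date TEXT NOT NULL
    );
    -- Both prescriptions satisfy the FK; only the first satisfies the temporal rule.
    INSERT INTO encounters VALUES (201, 47, '2024-03-12');
    INSERT INTO prescriptions VALUES (1, 201, 'Metformin', '2024-03-12');
    INSERT INTO prescriptions VALUES (2, 201, 'Metformin', '2023-11-02');
    """
)

violations = find_early_prescriptions(conn)
print(violations)
```

ISO-formatted date strings compare correctly as text here; against Postgres the same comparison would run on real date columns.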

This is where healthcare test data diverges from generic approaches — and where the gap between test data for PostgreSQL and truly domain-aware generation becomes visible. FK structure is the foundation. On top of it sits a layer of domain rules that are implicit in any real patient record but absent from the schema's constraint definitions.

And this is a six-table simplification. Production telehealth platforms add providers, appointments, referrals, and billing tables, each with more FK edges and more implicit temporal rules. By the time a schema reaches 20 tables, the cross-table dependency web is dense enough that hand-rolling test data becomes its own engineering project.
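The insert order implied by that dependency web can be derived rather than hand-maintained. A minimal sketch, assuming the FK edges from the simplified schema above are written out by hand (a real tool would read them from the database catalog), using the standard library's graphlib (Python 3.9+):

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it references through a foreign key.
fk_parents = {
    "patients": set(),
    "encounters": {"patients"},
    "prescriptions": {"encounters"},
    "lab_results": {"encounters"},
    "diagnoses": {"encounters"},
    "insurance_coverage": {"patients"},
}

# static_order() yields every table after all of the tables it depends on.
insert_order = list(TopologicalSorter(fk_parents).static_order())
print(insert_order)
```

The order among sibling tables (for example prescriptions versus lab_results) is not significant; only the parent-before-child guarantee matters for inserts.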

Faker, Mockaroo, and similar column-level generators are good at producing realistic-looking values for individual fields. Patient names that look like names. Medication fields containing real medication names. Dates that fall within a plausible range.

Where they break is at the row level. Each prescription row gets a random medication, a random dosage, and a random date — with no awareness that this prescription is supposed to be connected to a specific encounter for a specific patient on a specific date.

Every constraint passes. FK values are valid. NOT NULL columns are filled. CHECK constraints are satisfied. But the dataset doesn't test anything meaningful, because relationships between rows don't reflect how healthcare data actually behaves.

Consider a simple test scenario: a patient with Type 2 diabetes. In a realistic dataset, you'd expect an initial encounter with a diabetes diagnosis, followed by a metformin prescription, followed by quarterly lab results tracking HbA1c levels, followed by dosage adjustments if the levels don't improve. Four tables, ten or fifteen rows, all temporally ordered and clinically linked.

Column-level generators produce rows in all four tables — but the diabetes diagnosis might appear six months after the metformin prescription. HbA1c values might be random floats between 0 and 100 rather than plausible readings between 4.0 and 14.0. Dosage adjustments might precede the lab results that would have motivated them.
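To see why independent sampling breaks ordering, the sketch below imitates a column-level generator: it draws an encounter date and a prescription date separately, then counts how often the prescription lands first. The random_date helper and the 2024 date range are assumptions made for illustration.

```python
import random
from datetime import date, timedelta

def random_date(rng, start, end):
    """Pick a uniformly random date in [start, end], with no knowledge of other rows."""
    return start + timedelta(days=rng.randrange((end - start).days + 1))

rng = random.Random(0)
year_start, year_end = date(2024, 1, 1), date(2024, 12, 31)

# Each pair is (encounter visit_date, prescription start_date), sampled independently.
pairs = [
    (random_date(rng, year_start, year_end), random_date(rng, year_start, year_end))
    for _ in range(1000)
]
out_of_order = sum(1 for visit, start in pairs if start < visit)
print(f"{out_of_order} of {len(pairs)} prescriptions predate their encounter")
```

With two independent uniform draws, roughly half of the pairs come out in the wrong order, and every one of them would still pass the FK constraint.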

Data fills the schema but doesn't exercise the application. The guide on generating test data from a schema covers in detail why that approach avoids this class of problem.

Synthea — the most widely used open-source tool for synthetic patient data — addresses the clinical realism problem well. Patient life histories get simulated with disease progressions, medication regimens, and encounter sequences that follow clinical logic. Records come out temporally consistent and medically plausible. For research and analytics, Synthea is a strong choice.

Where it falls short is format. Synthea exports HL7 FHIR bundles, C-CDA documents, and flat CSVs — but doesn't connect to your database or know about your encounters table, your prescriptions table, or your custom insurance_coverage schema. What you get is standardized healthcare data in a standardized format, not data shaped to your application's relational model.

Public healthcare datasets have the same limitation. CMS Synthetic Public Use Files provide claims-level data in fixed-width or CSV format. MIMIC-III and MIMIC-IV offer rich clinical datasets for research — valuable for data science and ML training, but not designed to populate a development database with a custom schema.

Bridging this gap — downloading a dataset or generating Synthea output, then writing an ETL pipeline to transform it into your schema, mapping FHIR resources to your table structure, handling FK ordering, and updating the pipeline every time your schema changes — is a real engineering project. Teams that attempt it often find the pipeline needs revisiting after every migration, and the maintenance cost quietly accumulates.
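A single step of such a pipeline might look like the sketch below, which maps one FHIR MedicationRequest onto a row for the prescriptions table. The field names follow the FHIR R4 MedicationRequest resource as commonly documented; the sample resource is hand-written and trimmed, and the medication_request_to_row helper and the encounter_ids lookup are hypothetical. Real Synthea output should be checked against this shape before relying on it.

```python
def medication_request_to_row(resource, encounter_ids):
    """Map one FHIR MedicationRequest dict onto a row for the prescriptions table.

    encounter_ids maps FHIR encounter references to local encounters.id values,
    so encounters must already be loaded (this is the FK-ordering requirement).
    """
    dosage = resource.get("dosageInstruction") or [{}]
    return {
        "encounter_id": encounter_ids[resource["encounter"]["reference"]],
        "medication": resource["medicationCodeableConcept"]["text"],
        "dosage": dosage[0].get("text"),
        "start_date": resource["authoredOn"][:10],  # keep only the date part of the timestamp
    }

# Hand-written sample, trimmed to the fields used here.
sample = {
    "resourceType": "MedicationRequest",
    "medicationCodeableConcept": {"text": "Metformin"},
    "encounter": {"reference": "urn:uuid:enc-201"},
    "authoredOn": "2024-03-12T09:30:00-05:00",
    "dosageInstruction": [{"text": "500mg"}],
}

row = medication_request_to_row(sample, {"urn:uuid:enc-201": 201})
print(row)
```

Every table in the target schema needs a mapping like this one, and each mapping has to change whenever a column is renamed or added, which is where the maintenance cost comes from.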

Teams building healthcare applications don't actually need a dataset; they need data that fits their schema. That difference between a downloadable CSV and a populated relational database with valid relationships is the same gap that separates staging with production copies from staging without them.

This is the piece that neither column-level generators nor file-based tools address for your specific schema: cross-row temporal consistency.

Realistic healthcare data isn't just valid foreign keys and plausible column values. Patient timelines have to behave like patient timelines.

Visit patterns cluster. Someone with a chronic condition visits quarterly during stable periods and weekly during acute episodes. Encounters should reflect this — not uniform random dates spread across the year, but clusters and gaps that mirror how patients actually interact with a healthcare system.
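The clustering described above can be sketched as a small date generator. The 91-day and 7-day gaps, the visit_dates name, and the example episode are assumptions chosen to mirror "quarterly when stable, weekly when acute"; they are not clinical guidance.

```python
from datetime import date, timedelta

def visit_dates(start, end, acute_episodes, stable_gap_days=91, acute_gap_days=7):
    """Yield visit dates: roughly quarterly when stable, weekly inside an acute episode.

    acute_episodes is a list of (first_day, last_day) date pairs.
    """
    current = start
    while current <= end:
        yield current
        in_episode = any(first <= current <= last for first, last in acute_episodes)
        next_stable = current + timedelta(days=stable_gap_days)
        if in_episode:
            current = current + timedelta(days=acute_gap_days)
        else:
            # If an episode starts before the next routine visit, the patient comes in when it starts.
            upcoming = [first for first, _ in acute_episodes if current < first < next_stable]
            current = min(upcoming) if upcoming else next_stable

episode = (date(2024, 5, 6), date(2024, 5, 27))
dates = list(visit_dates(date(2024, 1, 8), date(2024, 12, 31), [episode]))
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
print(dates)
print(gaps)
```

The resulting gap list mixes long quarterly intervals with a run of weekly ones around the episode, which is the shape a uniform random sampler never produces.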

Prescription chains follow diagnoses. A statin prescription follows a lipid panel with elevated LDL. An antibiotic prescription follows an encounter with an infection diagnosis. Start dates follow encounter dates. Refills follow the previous prescription's end date. These aren't random associations — they're sequences with causal and temporal order.
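The refill rule in particular is easy to state in code. In this sketch each refill starts on the previous fill's end date, and the first fill starts on the encounter date. The refill_chain helper, the fill counter, and the end_date field are assumptions for illustration; the simplified schema above only stores start_date.

```python
from datetime import date, timedelta

def refill_chain(visit_date, medication, days_supply, refills):
    """Build an initial prescription plus refills, each starting when the previous one ends."""
    rows = []
    start = visit_date  # the first fill never predates the encounter
    for fill_number in range(refills + 1):
        end = start + timedelta(days=days_supply)
        rows.append(
            {"medication": medication, "fill": fill_number, "start_date": start, "end_date": end}
        )
        start = end  # the next refill begins on the previous end date
    return rows

chain = refill_chain(date(2024, 3, 12), "Metformin", days_supply=30, refills=2)
for fill in chain:
    print(fill["fill"], fill["start_date"], fill["end_date"])
```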

Lab values track trajectories. Blood glucose readings across six months should show a pattern — improving after a medication change, gradually worsening if untreated, or stable during maintenance therapy. Random float values in the lab_results table don't test the application logic that flags deteriorating trends or triggers alerts on out-of-range values.
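A trajectory differs from independent samples in that each reading depends on its position in the series. The sketch below generates a descending HbA1c series with bounded noise; the hba1c_series name, the per-visit drop, and the noise width are illustrative assumptions, while the 4.0 to 14.0 clamp matches the plausible range mentioned earlier.

```python
import random

def hba1c_series(baseline, visits, drop_per_visit=0.5, noise=0.15, seed=None):
    """Return one HbA1c reading per visit that trends downward from baseline.

    With the defaults, the noise is well under half the per-visit drop,
    so the rounded series still descends from visit to visit.
    """
    rng = random.Random(seed)
    readings = []
    for visit in range(visits):
        value = baseline - drop_per_visit * visit + rng.uniform(-noise, noise)
        readings.append(round(min(max(value, 4.0), 14.0), 1))  # clamp to a plausible range
    return readings

series = hba1c_series(8.7, visits=4, seed=1)
print(series)
```

Flipping the sign of drop_per_visit would give the "gradually worsening if untreated" case, and setting it to zero approximates stable maintenance therapy.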

Insurance coverage overlaps with encounters. If a patient has a gap in coverage, their encounters during that gap should be either absent or flagged differently. Datasets where every patient has continuous coverage and visits distributed uniformly don't exercise the edge cases that billing and eligibility logic needs to handle.
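The coverage rule reduces to an interval check. This sketch flags encounters that fall in a coverage gap; the is_covered helper and the sample dates are assumptions, with periods treated as inclusive (effective_start, effective_end) pairs.

```python
from datetime import date

def is_covered(visit_date, coverage_periods):
    """True if visit_date falls inside any (effective_start, effective_end) period, inclusive."""
    return any(start <= visit_date <= end for start, end in coverage_periods)

# Two coverage periods with a gap in July and August.
coverage = [
    (date(2024, 1, 1), date(2024, 6, 30)),
    (date(2024, 9, 1), date(2024, 12, 31)),
]
visits = [date(2024, 3, 12), date(2024, 7, 15), date(2024, 9, 24)]

uncovered = [visit for visit in visits if not is_covered(visit, coverage)]
print(uncovered)
```

A seeded dataset that includes a few such uncovered visits gives billing and eligibility code something to reject or flag.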

Cross-row consistency is what distinguishes test data for healthcare that exercises your application's logic from data that merely fills rows.

Claims are easy to make. Here's what cross-row temporal consistency looks like in practice against a real schema.

Take the diabetes example from above: initial diagnosis → metformin prescription → quarterly HbA1c labs → dosage adjustment when levels don't improve. A schema-aware seed run with the scope "telehealth app, 200 patients, 6 months of encounter history, mixed chronic and acute cases" produced this patient timeline:

-- Patient encounters, ordered by visit date
SELECT e.id, e.visit_date, e.type
FROM encounters e
WHERE e.patient_id = 47
ORDER BY e.visit_date;

 id  | visit_date | type
-----+------------+-----------
 201 | 2024-03-12 | initial
 208 | 2024-06-18 | follow-up
 215 | 2024-09-24 | follow-up
 223 | 2024-12-10 | follow-up

-- Diagnosis and prescriptions tied to each encounter
SELECT d.diagnosed_at, d.icd_code, d.description,
       p.medication, p.dosage, p.start_date
FROM diagnoses d
JOIN prescriptions p ON p.encounter_id = d.encounter_id
WHERE d.encounter_id IN (201, 208, 215, 223)
ORDER BY d.diagnosed_at;

 diagnosed_at | icd_code | description          | medication  | dosage  | start_date
--------------+----------+----------------------+-------------+---------+------------
 2024-03-12   | E11.9    | Type 2 diabetes      | Metformin   | 500mg   | 2024-03-12
 2024-09-24   | E11.9    | Type 2 diabetes      | Metformin   | 1000mg  | 2024-09-24

Diagnosis date and prescription start_date match the encounter visit_date. That is not a coincidence or a lucky seed: the temporal ordering is enforced across rows. The dosage adjustment on the third visit appears only after two prior encounters, the way a real clinical workflow would produce it.

-- HbA1c trajectory across the four visits
SELECT lr.result_date, lr.value, lr.units
FROM lab_results lr
WHERE lr.encounter_id IN (201, 208, 215, 223)
  AND lr.test_name = 'HbA1c'
ORDER BY lr.result_date;

 result_date | value | units
-------------+-------+-------
 2024-03-12  |  8.7  | %
 2024-06-18  |  8.2  | %
 2024-09-24  |  7.8  | %
 2024-12-10  |  7.1  | %

Values descend across visits — consistent with a patient responding to metformin therapy, not four random floats sampled independently. The application logic that checks for improving HbA1c trends, or that flags a patient who isn't improving after two quarters, now has data that will actually exercise those code paths.
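As an example of the kind of application logic this data exercises, here is a hypothetical trend check. The needs_review name, the 7.0 target, and the 0.3 minimum drop are assumptions made for illustration, not clinical thresholds; the improving series is the one returned by the HbA1c query above.

```python
def needs_review(readings, target=7.0, min_drop=0.3, window=2):
    """Flag a patient above target whose HbA1c has not dropped by min_drop over the last `window` intervals."""
    if len(readings) <= window or readings[-1] <= target:
        return False
    return readings[-1 - window] - readings[-1] < min_drop

improving = [8.7, 8.2, 7.8, 7.1]  # quarterly readings from the seeded patient
stalled = [8.7, 8.6, 8.7, 8.6]    # hand-written counterexample: no real change over two quarters

print(needs_review(improving))
print(needs_review(stalled))
```

Four independently sampled floats would trip or miss this check at random, so neither branch would be tested deliberately.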

This is what column-level generators can't produce and Synthea doesn't target: data shaped to your specific schema, with temporal relationships intact.

Seedfast reads your PostgreSQL schema and produces records with these relational properties, removing the need for hand-rolled orchestration or production data access. No production data was accessed — though note Seedfast does send your schema definition (table and column names, types, constraints) to an AI provider to generate the data; for what that means in a regulated shop, see the HIPAA guide. No FHIR-to-SQL pipeline was written. No seed script needs updating after the next migration. Seedfast reads the current schema on every run, so when the team adds a referrals table next sprint, the next seed picks it up automatically.

If your team is also under HIPAA, the compliance angle is here. The realism developers need carries over to test data management across local, CI, and staging environments.

Try it on your own schema — get started here.

Healthcare test data needs three things a generic generator won't produce: foreign-key chains that resolve in dependency order (patients → encounters → prescriptions and lab_results), temporal ordering so a prescription's start_date never predates its encounter's visit_date, and cross-row trajectories like an HbA1c series that descends across visits as a patient responds to therapy. Generic tools fill columns; healthcare needs the rows to behave like a patient timeline.

Generic tools like Faker and Mockaroo generate plausible column values but ignore cross-row temporal and FK dependencies — a prescription can land before its encounter, lab values are random floats, and free-text columns get lorem ipsum. Healthcare-aware approaches enforce the patients → encounters → prescriptions chain, keep dates and FK references coherent, and use reserved ranges (TEST-NET IPs, 555-01XX phones, example.test domain) so generated data is never confusable with a real person.
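The reserved ranges mentioned above are straightforward to generate from. A minimal sketch, assuming hypothetical email, phone, and last_login_ip columns (none of which appear in the simplified schema): IPs come from the three TEST-NET documentation blocks, phone numbers from the fictional 555-0100 to 555-0199 range, and email addresses use the reserved .test top-level domain.

```python
import ipaddress
import random

# Documentation-only IPv4 blocks (TEST-NET-1, TEST-NET-2, TEST-NET-3).
TEST_NETS = [
    ipaddress.ip_network(block)
    for block in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]

def fake_contact(rng, patient_id):
    """Return contact fields drawn only from ranges reserved for documentation or fiction."""
    network = rng.choice(TEST_NETS)
    host = network.network_address + rng.randrange(1, 255)  # host part 1..254
    return {
        "email": f"patient{patient_id}@example.test",
        "phone": f"555-01{rng.randrange(100):02d}",
        "last_login_ip": str(host),
    }

contact = fake_contact(random.Random(7), 47)
print(contact)
```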

Synthea generates clinically plausible patient histories in HL7 FHIR, C-CDA, or flat CSV — formats designed for research, not for a custom application schema. To use Synthea output in your dev database, you write an ETL pipeline that maps FHIR resources onto your encounters, prescriptions, and lab_results tables, handles FK ordering, and re-runs after every schema migration. The clinical realism is excellent; the schema-fit work is the part that becomes its own engineering project.

A telehealth or clinic-management schema typically has six core tables: patients, encounters (with patient_id, visit_date, type, provider_id), prescriptions (with encounter_id, medication, dosage, start_date), lab_results (with encounter_id, test_name, value, units, result_date), diagnoses (with encounter_id, icd_code), and insurance_coverage per patient. Production systems add providers, appointments, referrals, and billing, and the FK graph quickly reaches 20+ tables.