
Test Data Generation: 7 Ways to Fill Your Database

By Mikhail Shytsko, Founder at Seedfast

Test data generation is the practice of creating data — synthetic, sampled, or hand-written — to populate a database for development, testing, or staging environments.

Test data generation gets hard twice. First, if your team is covered by HIPAA, PCI-DSS, GDPR, or SOC 2, you can't just clone production — a masked copy is still a copy, and every environment that holds one extends your compliance perimeter. Second, even when production data is on the table, foreign keys, constraints, and cross-table dependencies mean you can't just randomize columns and hope they line up. This guide compares seven ways to fill your database with test data, from a hand-written INSERT to a schema-aware generator that reads your schema on every run, and shows which ones survive real engineering workflows. If you specifically need PostgreSQL test data with generate_series and psql-flavored examples, see PostgreSQL Test Data: What Actually Works — this page is the cross-stack methods comparison.

TL;DR: The seven main methods of test data generation are handwritten SQL, online generators, Faker libraries, production copies, data masking, property-based testing, and schema-aware generation. Choose by compliance posture and schema complexity: regulated teams with 30+ tables need different tooling than small prototypes.

  • Online generators (Mockaroo and similar) work for flat CSV/JSON but can't resolve foreign keys across tables
  • Faker libraries give you code-level control, but you wire up table dependencies by hand — and rewire them after every migration
  • Production copies (pg_dump) give perfect realism, but put PII in every environment that gets a copy — an immediate problem for HIPAA, PCI-DSS, GDPR, and SOC 2 scope
  • Schema-aware generators connect to your database, read the schema, and handle constraints and foreign keys automatically — without needing production access at all
  • The right method depends on compliance posture and schema complexity, not team size: regulated teams with 30+ tables need a different approach than a two-person prototype

| Method | Foreign keys | Maintenance | No production data | Setup |
| --- | --- | --- | --- | --- |
| Handwritten SQL | You wire it | High: update after every migration | Yes | Low |
| Online generators (Mockaroo) | No: manual coordination needed | Reconfigure columns on schema changes | Yes | Medium: configure columns manually |
| Faker / factories | Partial: ORM handles direct associations, you wire the rest | High: update factories after migrations | Yes | Medium |
| pg_dump + restore | Yes: inherited from production | Re-dump periodically | No | Low |
| Data masking | Yes: inherited from production | High: maintain field catalog + re-mask | Partial: miss a field and PII leaks | High |
| Property-based | No: function inputs only | Low: tests are schema-independent | Yes | Medium |
| Schema-aware (Seedfast) | Auto: reads FK graph | None: reads live schema each run | Yes: synthetic from scratch | Low |

For a deeper comparison of synthetic data vs fixtures and pg_dump, see Synthetic Test Data. For the full landscape of tooling categories — DIY scripts, web generators, enterprise anonymization, and schema-aware generators — see Data Seeding Tools.

A seed.sql file checked into your repo with INSERT statements for every table — the classic starting point.

INSERT INTO teams (id, name) VALUES (1, 'Engineering');
INSERT INTO users (id, email, team_id) VALUES
  (1, 'alice@example.com', 1),
  (2, 'bob@example.com', 1);
INSERT INTO projects (id, name, owner_id) VALUES (1, 'Backend API', 1);

Straightforward. You know exactly what data exists. The problem starts when a migration adds NOT NULL columns, renames a table, or tightens a constraint — and someone has to fix the seed file by hand. It's the simplest form of test data generation, but the maintenance cost compounds fast as your schema grows.
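
To make the maintenance cost concrete, here is a minimal sketch of a seed statement breaking after a migration. It uses Python's built-in sqlite3 module as a stand-in for your real database, and the `role` column and the `seed_result` helper are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical seed statement, written against the original schema.
SEED_SQL = "INSERT INTO users (id, email, team_id) VALUES (1, 'alice@example.com', 1)"


def seed_result(migrated: bool) -> str:
    """Build the schema (optionally in its post-migration shape) and run the seed."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO teams (id, name) VALUES (1, 'Engineering')")
    # The "migration" adds a required column the seed file knows nothing about.
    role_column = ", role TEXT NOT NULL" if migrated else ""
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, "
        "team_id INTEGER NOT NULL REFERENCES teams(id)" + role_column + ")"
    )
    try:
        conn.execute(SEED_SQL)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"failed: {exc}"
    finally:
        conn.close()


if __name__ == "__main__":
    print(seed_result(migrated=False))  # seed matches the schema
    print(seed_result(migrated=True))   # seed is now stale
```

The seed statement itself never changed; only the schema did. That is the failure someone has to notice and repair by hand after each migration.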

Best for: small schemas (under 10 tables) that rarely change.

Online generators like Mockaroo offer browser-based test data generation — define columns in a web UI and download CSV, JSON, or SQL.

They're the fastest way to get a flat file of realistic-looking data. The catch: they generate one table at a time. There's no way to say "generate orders that reference valid user IDs from the users table I just created." If your schema has foreign keys — and it does — you're stitching files together manually and hoping the IDs line up.

For anything with foreign keys, you'll need a tool that understands the dependency graph — see schema-aware generation below.
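
The stitching problem can be sketched in a few lines. The example below simulates two independently generated files (all function names are hypothetical, and the id ranges are arbitrary assumptions), counts the orders that point at users who do not exist, and then shows the manual relinking step you end up writing.

```python
import random


def generate_users(count: int, rng: random.Random) -> list[dict]:
    """One 'file' of users, with ids picked by the generator."""
    ids = rng.sample(range(1, 10_000), count)
    return [{"id": i, "email": f"user{i}@example.com"} for i in ids]


def generate_orders(count: int, rng: random.Random) -> list[dict]:
    """A second 'file', generated with no knowledge of the users file."""
    return [{"id": n, "user_id": rng.randint(1, 10_000)} for n in range(1, count + 1)]


def orphaned_orders(users: list[dict], orders: list[dict]) -> list[dict]:
    """Orders whose user_id does not match any generated user."""
    known = {u["id"] for u in users}
    return [o for o in orders if o["user_id"] not in known]


def relink(users: list[dict], orders: list[dict], rng: random.Random) -> list[dict]:
    """The manual fix: overwrite every user_id with an id that really exists."""
    ids = [u["id"] for u in users]
    return [{**o, "user_id": rng.choice(ids)} for o in orders]


if __name__ == "__main__":
    rng = random.Random(42)
    users = generate_users(50, rng)
    orders = generate_orders(200, rng)
    print(len(orphaned_orders(users, orders)), "orphaned orders before relinking")
    print(len(orphaned_orders(users, relink(users, orders, rng))), "after relinking")
```

With two tables the relinking step is a one-liner. With a deep foreign-key graph it becomes a script of its own that has to follow every schema change.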

Best for: quick prototyping, populating a single table, or generating test CSVs that don't touch a relational database.

Faker and its cousins are the developer's go-to for test data generation in code — define factories, then generate data programmatically.

Python (Faker):

from faker import Faker
fake = Faker()

user = {
    "name": fake.name(),           # "Jennifer Brown"
    "email": fake.email(),         # "jbrown@example.net"
    "address": fake.address(),     # multi-line string: street, then city/state/postcode
    "created_at": fake.date_time_this_year()
}

JavaScript (Faker.js):

import { faker } from "@faker-js/faker";

const product = {
  name: faker.commerce.productName(), // "Handcrafted Granite Chips"
  price: faker.commerce.price(), // "29.99"
  description: faker.lorem.sentence(), // lorem ipsum...
};

Ruby (Factory Bot):

FactoryBot.define do
  factory :user do
    name { Faker::Name.name }
    email { Faker::Internet.email }
    association :team
  end
end

Faker gives you fine-grained control, and factory libraries like Factory Bot handle direct associations (like :team above). The tradeoff: complex multi-level dependency chains are still defined in code, and when the schema changes, the factory definitions break just like seed files. The output also tends to be nonsensical — "Handcrafted Granite Chips" doesn't help you debug a failing checkout test, and "Billion mouth support feeling prove town cold" in a product_description column is noise that hides real bugs instead of exposing them.
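
What "defined in code" looks like in practice can be sketched with the teams, users, and projects tables from the seed file example. This is a stdlib-only illustration, not Faker or Factory Bot code: `build_dataset` is a hypothetical helper, the placeholder values stand in for Faker calls, and it assumes at least one user is generated.

```python
import random


def build_dataset(n_teams: int, users_per_team: int, seed: int = 0) -> dict[str, list[dict]]:
    """Hand-wired generation: parents first, children pick from existing parent ids."""
    rng = random.Random(seed)

    # Level 1: teams have no dependencies.
    teams = [{"id": t, "name": f"Team {t}"} for t in range(1, n_teams + 1)]

    # Level 2: every user must reference a team that was already created.
    users = []
    for team in teams:
        for _ in range(users_per_team):
            uid = len(users) + 1
            users.append({"id": uid, "email": f"user{uid}@example.com", "team_id": team["id"]})

    # Level 3: every project must reference an existing user as its owner.
    projects = [
        {"id": p, "name": f"Project {p}", "owner_id": rng.choice(users)["id"]}
        for p in range(1, n_teams + 1)
    ]
    return {"teams": teams, "users": users, "projects": projects}


if __name__ == "__main__":
    data = build_dataset(n_teams=3, users_per_team=4)
    print({table: len(rows) for table, rows in data.items()})
```

The ordering of the three levels is the part you maintain. Add a table between users and projects, or make `owner_id` point somewhere else, and this function has to be rewritten to match.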

If you'd rather not maintain factory definitions that break on every migration, schema-aware generation reads your live schema and handles the full dependency chain automatically.

Best for: unit tests and simple integration tests where you control a small slice of the schema.

# Dump production
pg_dump -h prod-host -d myapp > prod_dump.sql

# Restore to staging
psql -h staging-host -d myapp_staging < prod_dump.sql

You get the most realistic data possible — because it is real data. Query plans, index usage, data distributions — everything matches production exactly.

You've also just put PII in every environment that has a copy. Names, emails, phone numbers, payment history — all accessible to every developer and CI pipeline with database access. Under HIPAA, GDPR, SOC 2, or PCI-DSS, that sprawl of production data pulls every copy into your compliance scope: each environment now needs the same access controls, audit logs, retention rules, and breach-notification posture as production. Dumping and restoring a multi-gigabyte database is also slow, and you can't easily parameterize the dataset for specific test scenarios.

If you need production-level schema coverage without the PII, synthetic generation gives you the structure without the liability.

Best for: one-off performance investigations where exact production distributions matter. Not for routine test data generation.

A production copy with sensitive fields replaced:

UPDATE users SET
  email = 'user_' || id || '@masked.test',
  name = 'User ' || id,
  -- ids above 999 wrap around, so masked phone numbers are not unique
  phone = '555-0' || lpad((id % 1000)::text, 3, '0');

This preserves real distributions and relationships, but masking is only as good as your field catalog. Miss one column (a name embedded in a notes JSON field, an address in a metadata blob, a phone number in a free-text customer_support_log) and unmasked PII reaches non-production environments. The source is still production, which means every refresh restarts the compliance clock and re-runs the breach-risk argument. Under GDPR, masked data that can still be linked back to a person is pseudonymized rather than anonymized, so a masked staging snapshot can still be subject to Article 17 right-to-be-forgotten requests; under PCI-DSS, any environment that still holds cardholder data remains inside PCI scope, so the masking has to be verified rather than assumed.
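
One way to catch a missed column is to scan the masked output for anything that still looks like contact data. The sketch below is deliberately naive: the regular expressions, the `@masked.test` suffix, and the `555-0` prefix are assumptions taken from the masking query above, and a real audit would need far broader patterns (names, addresses, identifiers).

```python
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")
MASKED_EMAIL_SUFFIXES = ("@masked.test",)  # what the masking step is expected to emit
MASKED_PHONE_PREFIX = "555-0"


def residual_pii(row: dict) -> list[tuple[str, str]]:
    """Return (column, value) pairs that look like unmasked emails or phone numbers."""
    findings = []
    for column, value in row.items():
        # Serialize nested JSON so values buried in blobs are scanned too.
        text = value if isinstance(value, str) else json.dumps(value)
        for match in EMAIL_RE.findall(text):
            if not match.endswith(MASKED_EMAIL_SUFFIXES):
                findings.append((column, match))
        for match in PHONE_RE.findall(text):
            if not match.startswith(MASKED_PHONE_PREFIX):
                findings.append((column, match))
    return findings


if __name__ == "__main__":
    masked_row = {
        "id": 7,
        "email": "user_7@masked.test",
        "name": "User 7",
        "phone": "555-0007",
        # The field catalog missed this column.
        "notes": {"comment": "Call Jane at jane.doe@corp.example or +1 202 555 0143"},
    }
    for column, value in residual_pii(masked_row):
        print(f"unmasked value in {column!r}: {value}")
```

A scanner like this only finds the patterns it knows about, which is the underlying weakness of masking: the check is as incomplete as the catalog it is checking.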

The alternative is generating data that was never real in the first place — no fields to miss, no compliance clock to restart. That's the schema-aware approach.

Best for: regulated environments with strict distribution requirements and a dedicated data governance team, where an enterprise anonymization pipeline is already paid for.

Tools like Hypothesis (Python), fast-check (JS), or QuickCheck (Haskell) generate inputs satisfying declared properties:

from hypothesis import given
from hypothesis.strategies import text, integers
# create_user is the application function under test; import it from your own code.

@given(name=text(min_size=1), age=integers(min_value=0, max_value=150))
def test_user_creation(name, age):
    user = create_user(name=name, age=age)
    assert user.name == name

Excellent for finding edge cases in pure functions. Scoped to individual function inputs, though — not database-level datasets. Getting a property-based generator to produce a valid, connected dataset across 50 tables means reimplementing schema awareness from scratch.

Best for: unit tests and algorithmic edge cases. Not designed for database test data generation.

The schema drives everything — tables, columns, and constraints define what can be inserted. Schema-aware generators use language models only to fill realistic values inside those rails, so a product_description gets an actual product description instead of lorem ipsum. That's the opposite of prompting a chatbot directly for seed SQL, which tends to break down once the foreign-key graph gets deep and the model runs out of working context to track every constraint at once.

For a healthcare or fintech schema with 30+ tables, the practical AI pattern is: let the model plan the data (semantic intent, distributions, scenario shape) while the tool enforces the structural rails (FK ordering, types, constraints). That's the split the schema-aware section below explains in detail.

This approach to test data generation flips the script: instead of generating data and hoping it fits the schema, the generator reads the schema and produces data that fits by definition.

Seedfast connects to your database, inspects every table, column, constraint, and foreign key, resolves the full dependency graph (including circular references), and produces inserts in topological order. When the schema changes, the next run picks it up — no seed file to update.
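
The idea of inserting in topological order can be illustrated with Python's standard graphlib module. This is a sketch of the general technique, not Seedfast's implementation, and the table names in `FK_GRAPH` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical FK graph: each table maps to the set of tables it references.
FK_GRAPH = {
    "teams": set(),
    "users": {"teams"},
    "projects": {"users"},
    "tasks": {"projects", "users"},
}


def insert_order(fk_graph: dict[str, set[str]]) -> list[str]:
    """Return tables ordered so that every parent comes before its children."""
    return list(TopologicalSorter(fk_graph).static_order())


def find_cycle(fk_graph: dict[str, set[str]]):
    """Return the tables forming a circular reference, or None if there is none."""
    try:
        insert_order(fk_graph)
    except CycleError as exc:
        # graphlib reports the cycle as a list whose first and last nodes match.
        return exc.args[1]
    return None


if __name__ == "__main__":
    print(insert_order(FK_GRAPH))
    circular = {"departments": {"employees"}, "employees": {"departments"}}
    print(find_cycle(circular))
```

A plain topological sort stops at the first cycle, which is why circular references need separate handling such as deferred constraints or a nullable column that is filled in afterwards.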

# Connect once
$ seedfast connect

# Generate data from the current schema (example session — your numbers will differ)
$ seedfast seed --scope "fintech app with 100 accounts, transactions, and varied balances"
  → Connected to PostgreSQL
  → Schema inspected: tables, foreign keys, and circular dependencies resolved
  → Generating data...
  → Done.

One command. No production database on the other end of a connection string, no anonymization pipeline, no seed script to maintain. Seedfast reads your PostgreSQL schema and generates contextually realistic data — a product in the "Electronics" category gets an electronics description, not a kitchen blender, and a user with country = "Japan" gets a Japanese-format phone number and a culturally plausible name.

The --scope flag accepts natural language, so you control the scenario without writing factory code:

# Minimal dataset for integration tests
seedfast seed --scope "3 users with 2 orders each"

# Large dataset for load testing
seedfast seed --scope "e-commerce store with 10000 orders across 500 users"

# Targeted for a specific flow
seedfast seed --scope "seed only checkout: carts, cart_items, products, users"

# Regulated context — no production data involved
seedfast seed --scope "healthcare portal with 50 synthetic patients, appointments, and prescriptions"

No seed file. No factory definitions. No maintenance when the schema changes. And because Seedfast reads the live PostgreSQL catalog directly, it works regardless of whether you use Prisma, Drizzle, TypeORM, or raw SQL migrations — the database schema is the source of truth, not the ORM.

Strengths: no seed files to maintain, handles arbitrary schema complexity including circular FK dependencies, produces connected and realistic data, and generates everything from scratch — no production data is involved, so there's no PII pipeline to secure, document, or justify to auditors. Only schema structure is read; generated data is inserted into the database you point Seedfast at. See the data handling docs for details on what leaves your machine.

Best for: integration tests, E2E tests, CI/CD pipelines, staging, load testing, migration testing, and any team in fintech, healthcare, or another regulated industry that simply can't connect a tool to production.

Seed your first database in under 5 minutes →

It depends on what you're testing — and on whether production data is even an option:

| If you need... | Use |
| --- | --- |
| Quick CSV/JSON for a prototype | Online generator (Mockaroo) |
| Data for a unit test | Faker + your language's factory library |
| A full database for integration/E2E tests | Seedfast (schema-aware generation) |
| Realistic volumes for load testing | Seedfast or production copy |
| A staging environment without PII | Seedfast (synthetic generation) |
| To validate a migration against real-sized data | Seedfast with a large --scope |
| Data for a fintech or healthcare app that can't touch production | Seedfast (schema-aware generation) |
| A fundamentals primer on how seeding actually works | The database seeding guide |
| A side-by-side comparison of seeder tools and per-ORM syntax | Database seeder tools compared |

The pattern: as schemas get bigger and testing gets more serious, you need a database test data generator that understands foreign keys, survives migrations, and doesn't require a seed file that someone has to maintain. Online generators and Faker work great at the bottom of the testing pyramid. They break down when you need a connected, multi-table dataset at the top.

If your team maintains separate fixture systems for unit tests, integration tests, E2E, and staging — that's four test data generation pipelines for the same schema. Small datasets also hide entire categories of bugs that only surface at realistic volumes. A single generator that adapts by scope is simpler.

In a pipeline, your seeding approach needs to be automated, fast, and isolated — no shared state between parallel runs, no manual steps, no fixtures drifting from the schema.

# GitHub Actions example
- name: Seed test database
  run: seedfast seed --scope "minimal checkout flow"

If your CI data setup is more than two lines, that's maintenance surface. Every extra step is a potential red build that has nothing to do with your application code.

For microservice architectures where multiple databases reference each other, schema-aware generators can seed them in dependency order — keeping cross-service IDs consistent without coordinating seed scripts across repos.

See the full CI/CD setup guide →

Production copies in non-production environments create compliance exposure under GDPR, HIPAA, SOC 2, and PCI-DSS. Each framework treats the data the same way regardless of which environment it lives in: if developer laptops, staging servers, or CI runners hold real PII, they inherit the same access controls, audit trails, retention rules, and breach-notification obligations as production.

Data masking narrows that scope but doesn't eliminate it. The masking pipeline still processes production records, the output still carries residual re-identification risk, and every missed column — a name in a JSON blob, a phone number in a support log — reopens the audit question. Schema-aware generation avoids the question entirely: no real records ever leave production, because no real records were involved in the first place. A staging environment filled with synthetic data sits outside the compliance perimeter by construction. See the data handling documentation for the details of what leaves your machine and what doesn't.

Connect a generator to your database and let it read the schema. Seedfast inspects tables, columns, foreign keys, and constraints, then produces INSERT statements in topological order — no seed files to write or maintain. For a detailed comparison of approaches, see Synthetic Test Data.

Test data generation is any process that creates data for software testing. It ranges from hand-written INSERT statements to schema-aware tools that read your database and produce thousands of valid, connected rows automatically. The goal: realistic data that surfaces real bugs without using production records.

It depends on the scope. For flat files, any online generator will do. For code-level factories: Faker.js, Python Faker, Factory Bot. For full database seeding with FK resolution, constraint handling, and migration compatibility, Seedfast reads your PostgreSQL schema and generates everything in one command.

Use a tool that reads the schema, not one that generates columns in isolation. Foreign keys define an ordering: child rows can only be inserted after their parents exist. Tables with circular references cannot be ordered that way, so they need either a single transaction with deferred constraints or, when one of the columns is nullable, an insert with NULL followed by an UPDATE. Faker and Mockaroo leave this ordering to you. Schema-aware generators build the dependency graph automatically and insert rows in topological order, including the harder case of two tables that reference each other.
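
The deferred-constraint case can be sketched as follows. SQLite is used here only so the example runs without a server; PostgreSQL accepts the same `DEFERRABLE INITIALLY DEFERRED` clause. The `departments` and `employees` tables and the helper names are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # isolation_level=None means autocommit mode, so the explicit BEGIN/COMMIT
    # statements below are the only transaction boundaries.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.execute(
        "CREATE TABLE departments ("
        "id INTEGER PRIMARY KEY, "
        "manager_id INTEGER NOT NULL "
        "REFERENCES employees(id) DEFERRABLE INITIALLY DEFERRED)"
    )
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, "
        "department_id INTEGER NOT NULL "
        "REFERENCES departments(id) DEFERRABLE INITIALLY DEFERRED)"
    )
    return conn


def seed_cycle(conn: sqlite3.Connection) -> None:
    """Insert both sides of the cycle; the FK checks run at COMMIT, not per row."""
    conn.execute("BEGIN")
    conn.execute("INSERT INTO departments (id, manager_id) VALUES (1, 10)")
    conn.execute("INSERT INTO employees (id, department_id) VALUES (10, 1)")
    conn.execute("COMMIT")


def seed_half(conn: sqlite3.Connection) -> str:
    """Insert only one side; the dangling reference is caught when committing."""
    conn.execute("BEGIN")
    conn.execute("INSERT INTO departments (id, manager_id) VALUES (2, 99)")
    try:
        conn.execute("COMMIT")
        return "committed"
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")
        return "rejected at commit"


if __name__ == "__main__":
    conn = make_db()
    seed_cycle(conn)
    print(seed_half(conn))
```

Deferring the check does not weaken it. The database still refuses a half-finished cycle; it just waits until the end of the transaction to decide.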

Enough that realistic queries hit realistic code paths. A handful of rows is fine for unit tests and fast feedback loops, but integration and end-to-end suites need volumes close to what production looks like — otherwise you miss the class of bugs that only appear at scale: TOAST decompression, stale query planner statistics, OFFSET pagination cliffs. With a schema-aware generator you can run the same scope description at 100, 10,000, and 1,000,000 rows just by changing the count in natural language.

Seedfast handles test data generation for your PostgreSQL schema — connect once, and it generates a valid, connected dataset in one command. No seed files, no factories, no PII pipeline to secure, and no production database to connect to.

Connect Seedfast to your schema → Installs via Homebrew or npm. Free trial, no credit card required.