
Test Data Generation: 7 Ways to Fill Your Database

By the Seedfast team

Test data generation for a relational database is harder than it looks — foreign keys, constraints, and cross-table dependencies mean you can't just randomize columns. Here are seven ways to fill your database with test data, from a hand-written INSERT to an AI that reads your schema and handles all of it automatically.

Key Takeaways

  • Online generators (Mockaroo and similar) work for flat CSV/JSON but can't resolve foreign keys across tables
  • Faker libraries give you code-level control but you wire up table dependencies by hand — and rewire them after every migration
  • Production copies (pg_dump) give perfect realism but put PII in every environment that gets a copy
  • Schema-aware generators connect to your database, read the schema, and handle constraints and foreign keys automatically

Which method handles what?

| Method | Foreign keys | Maintenance | No production data | Setup |
|---|---|---|---|---|
| Handwritten SQL | You wire it | High — update after every migration | Yes | Low |
| Online generators (Mockaroo) | No — manual coordination needed | Reconfigure columns on schema changes | Yes | Medium — configure columns manually |
| Faker / factories | Partial — ORM handles direct associations, you wire the rest | High — update factories after migrations | Yes | Medium |
| pg_dump + restore | Yes — inherited from production | Re-dump periodically | No | Low |
| Data masking | Yes — inherited from production | High — maintain field catalog + re-mask | Partial — miss a field and PII leaks | High |
| Property-based | No — function inputs only | Low — tests are schema-independent | Yes | Medium |
| Schema-aware (Seedfast) | Auto — reads FK graph | None — reads live schema each run | Yes — synthetic from scratch | Low |

For a deeper comparison of synthetic data vs fixtures and pg_dump, see Synthetic Test Data.

1. Handwritten seed files

The classic. A seed.sql checked into your repo with INSERT statements for every table.

INSERT INTO teams (id, name) VALUES (1, 'Engineering');
INSERT INTO users (id, email, team_id) VALUES
  (1, 'alice@example.com', 1),
  (2, 'bob@example.com', 1);
INSERT INTO projects (id, name, owner_id) VALUES (1, 'Backend API', 1);

Straightforward. You know exactly what data exists. The problem starts when a migration adds NOT NULL columns, renames a table, or tightens a constraint — and someone has to fix the seed file by hand. It's the simplest form of test data generation, but the maintenance cost compounds fast as your schema grows.

Best for: small schemas (under 10 tables) that rarely change.

2. Online generators

Online generators like Mockaroo offer browser-based test data generation — define columns in a web UI and download CSV, JSON, or SQL.

They're the fastest way to get a flat file of realistic-looking data. The catch: they generate one table at a time. There's no way to say "generate orders that reference valid user IDs from the users table I just created." If your schema has foreign keys — and it does — you're stitching files together manually and hoping the IDs line up.
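What "stitching files together manually" looks like in practice: a minimal Python sketch, with hypothetical file contents and column names, that repairs order rows whose user IDs don't exist in the separately generated users file.

```python
import csv
import io
import random

# Two flat files generated independently by an online tool. The order rows
# reference user IDs the generator knew nothing about.
users_csv = "id,email\n1,alice@example.com\n2,bob@example.com\n"
orders_csv = "id,user_id,total\n1,17,9.99\n2,42,24.50\n"

users = list(csv.DictReader(io.StringIO(users_csv)))
orders = list(csv.DictReader(io.StringIO(orders_csv)))

# Manual FK repair: overwrite each dangling user_id with a real one.
valid_ids = [u["id"] for u in users]
for order in orders:
    if order["user_id"] not in valid_ids:
        order["user_id"] = random.choice(valid_ids)
```

This script itself is now a maintenance artifact: add a third table, or a second foreign key, and the repair logic grows with it.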

For anything with foreign keys, you'll need a tool that understands the dependency graph — see schema-aware generation below.

Best for: quick prototyping, populating a single table, or generating test CSVs that don't touch a relational database.

3. Faker and factory libraries

The developer's go-to for test data generation in code. Define factories, generate data programmatically.

Python (Faker):

from faker import Faker
fake = Faker()

user = {
    "name": fake.name(),           # "Jennifer Brown"
    "email": fake.email(),         # "jbrown@example.net"
    "address": fake.address(),     # "123 Main St, Apt 4"
    "created_at": fake.date_time_this_year()
}

JavaScript (Faker.js):

import { faker } from '@faker-js/faker';

const product = {
  name: faker.commerce.productName(),  // "Handcrafted Granite Chips"
  price: faker.commerce.price(),       // "29.99"
  description: faker.lorem.sentence()  // lorem ipsum...
};

Ruby (Factory Bot):

FactoryBot.define do
  factory :user do
    name { Faker::Name.name }
    email { Faker::Internet.email }
    association :team
  end
end

Faker gives you fine-grained control, and factory libraries like Factory Bot handle direct associations (like :team above). The tradeoff: complex multi-level dependency chains are still defined in code, and when the schema changes, the factory definitions break just like seed files. The output also tends to be nonsensical — "Handcrafted Granite Chips" doesn't help you debug a failing checkout test.
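What "defined in code" means for a multi-level chain, as a dependency-free sketch: the Faker calls are stubbed with literals here, and the table and column names are hypothetical, because the shape of the hand-wiring is the point.

```python
import itertools

# Hand-wired three-level chain: team -> user -> project.
_ids = itertools.count(1)

def make_team():
    return {"id": next(_ids), "name": "Engineering"}

def make_user(team):
    # The FK is threaded through by hand. A migration that changes this
    # relationship breaks every factory that touches it.
    return {"id": next(_ids), "email": "alice@example.com", "team_id": team["id"]}

def make_project(owner):
    return {"id": next(_ids), "name": "Backend API", "owner_id": owner["id"]}

team = make_team()
user = make_user(team)
project = make_project(user)
```

Each function encodes one edge of the FK graph. Fifty tables means fifty of these, all of which must agree with the live schema.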

If you'd rather not maintain factory definitions that break on every migration, schema-aware generation reads your live schema and handles the full dependency chain automatically.

Best for: unit tests and simple integration tests where you control a small slice of the schema.

4. Production database copies

# Dump production
pg_dump -h prod-host -d myapp > prod_dump.sql

# Restore to staging
psql -h staging-host -d myapp_staging < prod_dump.sql

The most realistic data you can get, because it is real data. Query plans, index usage, data distributions — everything matches production exactly.

You've also just put PII in every environment that has a copy. Names, emails, phone numbers, payment history — all accessible to every developer and CI pipeline with database access. Dumping and restoring a multi-gigabyte database is also slow, and you can't easily parameterize the dataset for specific test scenarios.

If you need production-level schema coverage without the PII, synthetic generation gives you the structure without the liability.

Best for: one-off performance investigations where exact production distributions matter. Not for routine test data generation.

5. Data masking and anonymization

A production copy with sensitive fields replaced:

UPDATE users SET
  email = 'user_' || id || '@masked.test',
  name = 'User ' || id,
  phone = '555-0' || lpad(id::text, 3, '0');

Preserves real distributions and relationships. Masking is only as good as your field catalog, though. Miss one column — a name embedded in a notes JSON field, an address in a metadata blob — and unmasked PII reaches non-production environments. The source is still production, so every refresh restarts the compliance clock.
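To make the "missed field" failure mode concrete, here is a small sketch (the row shape and `notes` field are hypothetical) where the column-level mask above has done its job, yet an email survives inside a serialized blob:

```python
import json
import re

# A row after column-level masking: the name and email columns are clean,
# but a JSON notes blob still carries raw PII the field catalog missed.
row = {
    "name": "User 7",
    "email": "user_7@masked.test",
    "notes": json.dumps({"support_log": "Called Jane Roe at jane.roe@example.com"}),
}

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def leaked_emails(row):
    # Scan every value, including serialized blobs, for unmasked addresses.
    return [m for v in row.values() for m in EMAIL.findall(str(v))
            if not m.endswith("@masked.test")]

print(leaked_emails(row))  # ['jane.roe@example.com']
```

Scanners like this catch some leaks after the fact, but the catalog-maintenance burden is exactly the "High" maintenance cost in the table above.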

The alternative is generating data that was never real in the first place — no fields to miss, no compliance clock to restart. That's the schema-aware approach.

Best for: regulated environments with strict distribution requirements and a dedicated data governance team.

6. Property-based generators

Tools like Hypothesis (Python), fast-check (JS), or QuickCheck (Haskell) generate inputs satisfying declared properties:

from hypothesis import given
from hypothesis.strategies import text, integers

@given(name=text(min_size=1), age=integers(min_value=0, max_value=150))
def test_user_creation(name, age):
    user = create_user(name=name, age=age)
    assert user.name == name

Excellent for finding edge cases in pure functions. Scoped to individual function inputs, though — not database-level datasets. Getting a property-based generator to produce a valid, connected dataset across 50 tables means reimplementing schema awareness from scratch.

Best for: unit tests and algorithmic edge cases. Not designed for database test data generation.

7. Schema-aware generation

This approach to test data generation flips the script: instead of generating data and hoping it fits the schema, you read the schema and generate data that fits by definition.

Seedfast connects to your database, inspects every table, column, constraint, and foreign key, resolves the full dependency graph (including circular references), and produces inserts in topological order. When the schema changes, the next run picks it up — no file to update.
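The ordering step reduces to a topological sort of the FK graph. A minimal stdlib sketch, with hypothetical table names, of that one piece:

```python
from graphlib import TopologicalSorter

# FK graph: each table maps to the tables it references, which must be
# populated first.
fk_deps = {
    "teams": set(),
    "users": {"teams"},
    "projects": {"users"},
    "orders": {"users", "projects"},
}

# static_order() yields parents before children, so every INSERT lands
# after the rows it references already exist. A true cycle (e.g. a
# self-referencing users.manager_id) raises CycleError here; handling it
# takes deferred or nullable FK updates, which is why circular
# references are the hard part.
insert_order = list(TopologicalSorter(fk_deps).static_order())
print(insert_order)  # ['teams', 'users', 'projects', 'orders']
```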

# Connect once
seedfast connect

# Generate data from the current schema
seedfast seed --scope "e-commerce store with electronics products"

Seedfast reads your PostgreSQL schema and uses AI to produce contextually realistic data. The --scope flag tells it what domain to generate for — "e-commerce with electronics" produces products, categories, prices, and reviews that all belong to the same business context, not random values scattered across unrelated tables.

The --scope flag accepts natural language, so you control the scenario without writing factory code:

# Minimal dataset for integration tests
seedfast seed --scope "3 users with 2 orders each"

# Large dataset for load testing
seedfast seed --scope "e-commerce store with 10000 orders across 500 users"

# Targeted for a specific flow
seedfast seed --scope "seed only checkout: carts, cart_items, products, users"

No seed file. No factory definitions. No maintenance when the schema changes.

Seedfast reads the live PostgreSQL catalog directly, so it works regardless of whether you use Prisma, Drizzle, TypeORM, or raw SQL migrations. Your ORM doesn't matter — the database schema is the source of truth.

Strengths: no seed files to maintain, handles arbitrary schema complexity including circular FK dependencies, produces connected and realistic data, fully synthetic — no production data involved. Only schema structure is read; generated data is inserted locally. See the data handling docs for details.

Best for: integration tests, E2E tests, CI/CD pipelines, staging, load testing, migration testing.

Seed your first database in under 5 minutes →

Which method should you use?

It depends on what you're testing:

| If you need... | Use |
|---|---|
| Quick CSV/JSON for a prototype | Online generator (Mockaroo) |
| Data for a unit test | Faker + your language's factory library |
| A full database for integration/E2E tests | Seedfast (schema-aware generation) |
| Realistic volumes for load testing | Seedfast or production copy |
| A staging environment without PII | Seedfast (synthetic generation) |
| To validate a migration against real-sized data | Seedfast with a large --scope |

The pattern: as schemas get bigger and testing gets more serious, you need a database test data generator that understands foreign keys, survives migrations, and doesn't require a seed file that someone has to maintain. Online generators and Faker work great at the bottom of the testing pyramid. They break down when you need a connected, multi-table dataset at the top.

If your team maintains separate fixture systems for unit tests, integration tests, E2E, and staging — that's four test data generation pipelines for the same schema. Small datasets also hide entire categories of bugs that only surface at realistic volumes. A single generator that adapts by scope is simpler.

Test data generation in CI/CD

In a pipeline, your test data generation approach needs to be automated, fast, and isolated — no shared state between parallel runs, no manual steps, no fixtures drifting from the schema.

# GitHub Actions example
- name: Seed test database
  run: seedfast seed --scope "minimal checkout flow"

If your CI data setup is more than two lines, that's maintenance surface. Every extra step is a potential red build that has nothing to do with your application code.
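For context, a sketch of a full job around that step. Everything here except the seed command is a placeholder: the Postgres service container, the migration script, and the test runner stand in for whatever your pipeline actually uses.

```yaml
# Hypothetical GitHub Actions job; service config and step names are placeholders.
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/migrate.sh        # your migration tool here
      - name: Seed test database
        run: seedfast seed --scope "minimal checkout flow"
      - run: npm test
```

Because the seed step reads the live schema, the migration step can change freely without a matching fixture update.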

For microservice architectures where multiple databases reference each other, schema-aware generators can seed them in dependency order — keeping cross-service IDs consistent without coordinating seed scripts across repos.

See the full CI/CD setup guide →

Test data and compliance

Production copies in non-production environments create compliance exposure under GDPR, HIPAA, and SOC 2. Even masked data carries risk — miss one column and PII leaks into your staging environment.

Seedfast generates fully synthetic data — no real records are involved, so there's nothing to mask or trace back to a real person. A staging environment filled with synthetic data avoids the PII exposure that triggers most compliance obligations. See the data handling documentation for details on what leaves your machine.

Frequently asked questions

How do you generate test data for a database?

Connect a generator to your database and let it read the schema. Seedfast inspects tables, columns, foreign keys, and constraints, then produces INSERT statements in topological order — no seed files to write or maintain. For a detailed comparison of approaches, see Synthetic Test Data.

What is test data generation?

Test data generation is any process that creates data for software testing. It ranges from hand-written INSERT statements to AI tools that read your database schema and produce thousands of valid, connected rows automatically. The goal: realistic data that surfaces real bugs without using production records.

What's the best test data generation tool?

It depends on the scope. For flat files, any online generator will do. For code-level factories: Faker.js, Python Faker, Factory Bot. For full database seeding with FK resolution, constraint handling, and migration compatibility: Seedfast reads your PostgreSQL schema and generates everything in one command.

Try it on your schema

Seedfast handles test data generation for your PostgreSQL schema — connect once, and it generates a valid, connected dataset in one command. No seed files, no factories, no maintenance.

Seed your first database in under 5 minutes →

Installs via Homebrew or npm. Free trial, no credit card required.