Data Seeding Tools: How Development Teams Actually Get Test Data

By the Seedfast team

Every development team needs a populated database. For most teams, the answer is a seed script or a Faker-based fixture — good enough until the schema grows, migrations arrive weekly, and the seeder becomes dead weight nobody wants to own.

For teams in regulated industries, the answer is harder. HIPAA, PCI-DSS, GDPR, and SOC 2 restrict how production data can be used outside production environments. Copying the prod database to a developer's laptop or to a staging server isn't just a bad practice in these environments — it's a compliance violation. These teams need data that was never real to begin with.

This guide covers the full spectrum of data seeding tools: what each approach is, where it holds up, and where it breaks. The decision tree at the end will help you pick the right one for your situation.

Key Takeaways

  • The choice isn't really between seeding tools — it's between four different approaches: DIY scripts, web generators, enterprise anonymization, and schema-aware generators
  • Teams in regulated industries (fintech, healthcare, insurance, etc.) can't use production data at all — anonymization tools and schema-aware generators are the only viable paths
  • Enterprise anonymization tools (Tonic Structural, Delphix, K2View) require custom enterprise contracts and production database access — schema-aware generators fill the gap for teams that don't need or want that overhead
  • Schema-aware generators that read the live schema on every run eliminate the maintenance cost that makes seed scripts fragile: no scripts to update after migrations, no FK wiring to maintain by hand

The real problem with test data

The hard part of test data isn't generating fake names and emails. Any library does that. The hard parts are:

Relational integrity. Inserting an order_item requires an existing order, product, and price. A user might need a team, a role, and a subscription. Most data seeding tools ignore foreign keys entirely — they produce rows that fail on insert or create orphaned records that make your app behave unpredictably.

Schema evolution. Schemas change constantly. A new NOT NULL column, an added foreign key, a renamed table — any of these breaks a seed script written against last week's schema. The maintenance cost compounds with every migration.
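
A minimal sketch of that failure mode (column names invented for illustration): a seed row written before a migration added a NOT NULL column fails the same check the database performs on insert.

```typescript
// Why schema drift breaks seed scripts: this row was written against
// last week's schema and lacks the column a new migration made required.
// All column names here are invented for illustration.

const requiredColumns = ["id", "email", "created_at", "tenant_id"]; // after migration

const seedRow: Record<string, unknown> = {
  id: 1,
  email: "alice@example.com",
  created_at: "2024-01-01",
  // tenant_id was added by this week's migration and is missing here
};

// The NOT NULL check the database performs on insert.
function missingColumns(row: Record<string, unknown>): string[] {
  return requiredColumns.filter((c) => row[c] === undefined);
}
```

Every migration that adds a required column reproduces this gap somewhere in the seed codebase, and someone has to find and patch it.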

Realistic distributions. Uniform random data looks obviously fake. Real databases have skewed distributions — a few customers make most purchases, transaction amounts cluster around common price points, activity varies by time of day. Dashboards, reports, and business logic behave differently against realistic data than against uniform noise.
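
To make the contrast concrete, here's a minimal sketch of weighted sampling in which a handful of customers receive most of the generated orders; the weights, counts, and customer ids are all invented for illustration:

```typescript
// Skewed sampling sketch: a few "heavy" customers get most orders,
// mirroring the long-tail distributions found in real databases.

// Deterministic pseudo-random generator so runs are reproducible.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rand = mulberry32(42);

// Power-law-style weights: customer i gets weight 1/(i+1)^2.
const customers = Array.from({ length: 20 }, (_, i) => ({
  id: i + 1,
  weight: 1 / Math.pow(i + 1, 2),
}));
const totalWeight = customers.reduce((s, c) => s + c.weight, 0);

// Draw one customer id according to the weights.
function sampleCustomerId(): number {
  let r = rand() * totalWeight;
  for (const c of customers) {
    r -= c.weight;
    if (r <= 0) return c.id;
  }
  return customers[customers.length - 1].id;
}

// Generate 1000 orders; the first few customers dominate, as in production.
const orderCounts = new Map<number, number>();
for (let i = 0; i < 1000; i++) {
  const id = sampleCustomerId();
  orderCounts.set(id, (orderCounts.get(id) ?? 0) + 1);
}
```

Swapping the weighted draw for `Math.floor(rand() * 20)` gives the uniform noise most tools produce, and the difference is immediately visible in any dashboard built on the data.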

Compliance constraints. Teams in fintech, healthcare, and other regulated industries can't use production data in development. For these teams, "just copy prod" isn't an option — they need data that was never real to begin with.

How teams actually solve this (the full spectrum)

Data seeding tools are only one category in a broader landscape. Here's how it actually breaks down:

  • Faker / faker.js — No schema awareness, manual FK wiring, high maintenance, free
  • ORM seeders (Prisma, Laravel, EF Core) — Manual via ORM API, medium-high maintenance, free
  • Web generators (Mockaroo) — No FK handling, medium maintenance, $0–$7500/yr
  • Copy + anonymize production (Tonic Structural, Delphix) — Full FK preservation, realistic data, weeks of setup, enterprise/custom pricing
  • Schema-aware generators (Seedfast, Tonic Fabricate, Snaplet Seed) — Automatic FK handling, no production access required

Let's walk through each.

Faker libraries

Faker (available in JavaScript, Python, PHP, Ruby, and Java) generates random values — names, emails, addresses, dates. You call faker.person.fullName() and get a plausible-looking string.

import { faker } from '@faker-js/faker';

const users = Array.from({ length: 50 }, () => ({
  name: faker.person.fullName(),
  email: faker.internet.email(),
  createdAt: faker.date.past(),
}));

Faker is the most common starting point. It's fast, zero-config, and works with any database. The limitation is fundamental: Faker generates values for columns you specify. It doesn't know your schema, doesn't resolve foreign keys, and doesn't insert anything. You write the insert logic, wire up table dependencies by hand, and update that code when the schema changes.

Where Faker works: Flat tables with few dependencies. Quick prototypes. Unit tests that need a handful of records.

Where Faker breaks down: Schemas with 10+ related tables. The FK wiring becomes its own codebase. Every migration that touches a seeded table requires a script update. At 20+ tables with weekly migrations, maintaining the Faker-based seeder becomes a recurring sprint tax. For teams in regulated environments, Faker solves the wrong problem entirely — you need not just random data, but data your compliance team has approved.
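
Here's a hedged sketch of what that hand-maintained wiring looks like. Table shapes and value helpers are invented; a real seeder would pull values from Faker and write rows through a database client:

```typescript
// Manual FK wiring sketch: rows must be created in dependency order
// (teams -> users -> orders), threading parent ids through by hand.

type Team = { id: number; name: string };
type User = { id: number; email: string; teamId: number };
type Order = { id: number; userId: number; total: number };

let nextId = 1;
const id = () => nextId++;

// 1. Parents first: teams have no dependencies.
const teams: Team[] = ["Engineering", "Sales"].map((name) => ({ id: id(), name }));

// 2. Users reference a team id that must already exist.
const users: User[] = teams.flatMap((team) =>
  Array.from({ length: 3 }, () => {
    const userId = id();
    return { id: userId, email: `user${userId}@example.com`, teamId: team.id };
  })
);

// 3. Orders reference a user id. Every new FK in the schema adds another
//    layer of wiring like this, and a migration that adds a required
//    column breaks this file until someone updates it.
const orders: Order[] = users.flatMap((user) =>
  Array.from({ length: 2 }, () => ({ id: id(), userId: user.id, total: 19.99 }))
);
```

Three tables is manageable. Thirty tables of this, revisited after every migration, is the sprint tax described above.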

ORM built-in seeders

Every major ORM ships with a seeding mechanism. Prisma uses seed.ts with prisma.$transaction. Laravel has DatabaseSeeder.php with factories. EF Core offers HasData. Rails has db/seeds.rb.

// Prisma seed.ts
const team = await prisma.team.create({ data: { name: 'Engineering' } });
await prisma.user.create({
  data: { email: 'alice@example.com', teamId: team.id },
});

ORM seeders are a step up — the ORM handles insert logic, and relationships can be defined through its API. But the core constraint is the same: you define the data manually. A new NOT NULL column breaks the seeder until someone updates the script.

Where ORM seeders work: Stable schemas with small teams. Deterministic reference data (specific roles, feature flags, config) that must match exactly across environments.

Where ORM seeders break down: Complex, evolving schemas where the seeder becomes maintenance overhead. The ORM adds convenience but doesn't eliminate the fundamental problem of manually specifying data that drifts from the schema.

Web-based generators

Mockaroo, GenerateData, and similar tools let you define columns through a web form and export synthetic data as CSV, JSON, or SQL.

These are fast for one-off exports and useful when you need data without writing code. The ceiling: they don't connect to your database, don't understand foreign keys, and produce flat rows. You export, then manually import in dependency order after resolving relationships yourself.

Where web generators work: Simple tables. Mock API responses. Quick datasets for demos or presentations.

Where web generators break down: Any relational database with foreign keys. The export-then-import workflow doesn't scale past a few tables.

Production data anonymization

At the opposite end of the spectrum, enterprise anonymization tools — Tonic Structural, Delphix, K2View, Informatica — connect to your production database, mask or replace sensitive fields, and produce a de-identified copy.

This approach has a clear advantage: the resulting data inherits the real schema structure, real distributions, real edge cases, and real volumes of your production system. Business logic that depends on specific data patterns works correctly because the data came from a system where that logic was already running.

The constraints are equally clear:

  • It requires production access. Someone has to approve a connection from the anonymization tool to your production database. In many organizations, this requires security review, VPN configuration, and ongoing credential management.
  • It's enterprise-priced. These tools are sold through sales cycles with custom contracts — there are no self-serve plans or published prices. The purchasing process alone can take longer than solving the original problem would.
  • Setup takes weeks. Configuring masking rules per column, validating that anonymized data preserves referential integrity, setting up refresh schedules — this is a project, not a feature toggle.
  • It still processes production data. Even though the output is anonymized, the pipeline processes real PII. For some compliance frameworks, the existence of that pipeline is itself a risk that needs to be managed.
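
To picture the per-column setup work, here's a generic sketch of masking rules as a column-to-function map. This is not the configuration format of Tonic, Delphix, or any real product, only an illustration of the shape of the work:

```typescript
// Hypothetical per-column masking rules, illustrating the kind of
// configuration anonymization tools require: every sensitive column
// needs an explicit decision. Table/column names are invented.

type MaskFn = (value: string) => string;

const maskingRules: Record<string, MaskFn> = {
  // Replace emails with a fake that keeps roughly the same shape.
  "users.email": (v) => `user-${v.length}@example.invalid`,
  // Keep only the last 4 digits of a card number.
  "payments.card_number": (v) => v.slice(-4).padStart(v.length, "*"),
  // Drop free-text notes entirely.
  "support_tickets.notes": () => "[redacted]",
};

// Apply the rules to one row of a named table; unlisted columns pass through.
function maskRow(table: string, row: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [column, value] of Object.entries(row)) {
    const rule = maskingRules[`${table}.${column}`];
    out[column] = rule ? rule(value) : value;
  }
  return out;
}
```

Multiply this decision-making across hundreds of columns, then add validation that the masked output still satisfies every foreign key, and the weeks-of-setup estimate stops looking pessimistic.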

Where anonymization works: Large enterprises with existing production databases, dedicated data teams, and regulatory requirements that specifically demand prod-derived test data.

Where anonymization is overkill: Teams that need a working database for local development, CI/CD, and demos without the overhead of a production pipeline. Companies that don't have production data yet. Teams where the security process to approve a prod connection takes longer than the project itself.

Schema-aware generators

A newer category occupies the gap between free DIY tools and enterprise anonymization platforms: tools that connect to your database, read the schema directly, and generate data that satisfies its tables, types, and constraints automatically — without ever touching production data.

The key distinction from anonymization tools: they don't need a production database to connect to. They work from the schema alone. For teams in regulated industries — fintech, healthcare, insurance — this is the critical difference. There's no PII in the pipeline because there's no real data in the pipeline.

Snaplet Seed (@snaplet/seed) is an open-source TypeScript library in this space. It introspects your PostgreSQL schema and generates a type-safe seed client. You write seed plans that describe the shape of the data:

import { createSeedClient } from "@snaplet/seed";
const seed = await createSeedClient();

await seed.users((x) =>
  x(3, () => ({
    orders: (x) => x(2, () => ({
      order_items: (x) => x(3),
    })),
  }))
);

Snaplet Seed handles FK resolution — you declare nested relationships and it inserts rows in dependency order. The values it generates are deterministic (via the copycat library) but not domain-realistic: expect placeholder-style text for names and emails rather than values that look like real business data. Note that Snaplet the company shut down in 2024; the library is now maintained as open source under the Supabase community.

Tonic Fabricate is Tonic.ai's synthetic data generation product. Like Seedfast, it generates data from scratch without production access. The key differences are in the workflow: Fabricate is a web UI where you describe or paste your schema manually. When your schema changes, you update the definition yourself — it doesn't re-read a live database. Data is generated on Tonic's servers and exported for you to import, which means database triggers, generated columns, and functions that fire on insert won't run during generation. CI/CD integration is possible through their API and a Python SDK, but the schema must be pre-configured in Fabricate — you can't point it at a live database in your pipeline and have it pick up the current schema automatically.

Seedfast takes a different approach. Instead of writing seed plans in code, you describe the business scenario you need:

# Local development
seedfast seed --scope "fintech app with 100 accounts, transactions, and varied balances"

# CI/CD pipeline (see the full CI/CD guide: https://seedfa.st/docs/cicd-database-seeding)
seedfast seed --scope "2 users with completed orders and one pending"

# Load testing (see: https://seedfa.st/blog/load-testing-data)
seedfast seed --scope "realistic store with 500 products, reviews, and varied order history"

Seedfast reads your live schema on every run — no client generation step, no sync command, no seed plans to update after migrations. It generates realistic, relational data entirely from the schema: no production access required, no PII pipeline to manage, no security review to request. For teams that need populated databases without ever connecting to production, this is the third option that didn't exist before.

Which approach fits your situation

You're in a regulated industry and can't use production data. This is the scenario that free tools can't solve and enterprise anonymization platforms are overkill for. Schema-aware generators — Tonic Fabricate and Seedfast — both generate from scratch without production access. Fabricate works through a web UI where you describe or paste your schema — when it changes, you update the definition manually. Seedfast connects to your live database and reads the current schema on every run — one CLI command, no manual sync, adapts to migrations automatically.

You have a complex schema and no production data yet. You can't anonymize what doesn't exist. Faker covers simple prototyping. For anything with relational complexity, a schema-aware generator gets you a working database from day one.

You have a simple schema (under 10 tables, few FKs) and it rarely changes. An ORM seeder or Faker can work here — though even simple schemas tend to grow, and the switch to a schema-aware generator gets harder the longer you wait.

You need deterministic reference data (roles, feature flags, config). Use a version-controlled SQL file or ORM seeder for the fixed data — roles, flags, and config should be exact and reproducible. Then use Seedfast to fill the rest of your database with realistic relational data on top of that foundation — you can exclude tables that already have the data you need.
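
As a sketch of the fixed-data half (table and column names invented), the reference rows can live in version-controlled code that renders idempotent inserts, so re-running the seed never duplicates them:

```typescript
// Deterministic reference data sketch: roles are listed in code
// (version-controlled) and rendered as idempotent SQL so the seed can
// run any number of times. Assumes a unique constraint on roles.name.

const roles = ["admin", "editor", "viewer"] as const;

// One idempotent insert per role; ON CONFLICT makes re-runs no-ops.
function renderRoleInserts(): string[] {
  return roles.map(
    (name) => `INSERT INTO roles (name) VALUES ('${name}') ON CONFLICT (name) DO NOTHING;`
  );
}
```

The fixed rows stay exact and reviewable in version control, while the generator fills in the bulk data around them.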

You need to preserve exact production data patterns for analytics or ML. Use an anonymization tool. Schema-aware generators produce realistic data and can approximate production distributions, but if your testing depends on exact patterns and edge cases from your specific production system, prod-derived data will be more precise.

You need test data in CI/CD that survives schema changes. Any tool that requires manually defined seed plans or scripts will break when migrations run. Schema-aware generators that read the live schema on every run handle this automatically — no "who broke the seeder" tickets in your sprint.

The maintenance cost most teams underestimate

When evaluating data seeding tools, teams focus on setup time and ignore maintenance cost. Setup happens once. Maintenance happens every sprint.

When a migration adds a required column, every seed script that touches that table breaks. Every factory needs updating. Every Faker-based script needs a new field. At 20+ tables with weekly migrations, this becomes hours per sprint — spread across the team, invisible in planning, but real in velocity.

Schema-aware generators that read the live schema eliminate most of this cost. The trade-off is that you give up fine-grained control over every value. For most development and testing use cases, that trade-off is the right one.

Related guides: