PostgreSQL Test Data: What Actually Works in Practice
By the Seedfast team
## Key Takeaways
- generate_series() is the fastest path for simple numeric-heavy test data, but it can't satisfy FK constraints across tables without manual orchestration
- Copying production data is the easiest way to get realistic data and the fastest way to create a compliance incident
- Faker-based scripts give you control but become a maintenance burden the moment your schema has more than a handful of tables with relationships
- The FK constraint problem is the one that kills every hand-rolled test data approach at scale — insert order matters, circular dependencies break everything, and nothing enforces it for you
- If your schema changes regularly, any script you write is already becoming stale
There's a specific kind of afternoon every PostgreSQL developer has had at least once.
You're writing a query, or testing a new index, or trying to reproduce a performance issue that only shows up with real data volumes. Your development database has three users and one order. Production has 4 million rows across 60 tables, with realistic distributions and edge cases baked in over years of actual use.
You need test data. The question is how much you want to invest in getting it.
Here's what the options actually look like in practice.
## generate_series() — fast, limited, good for the right job
Every PostgreSQL developer reaches for generate_series() first, and usually rightly so:
```sql
INSERT INTO users (email, created_at)
SELECT
  'user' || n || '@example.com',
  NOW() - (random() * interval '365 days')
FROM generate_series(1, 100000) AS n;
```
Ten seconds. A hundred thousand users. No dependencies, no setup, runs anywhere you have psql access.
This works remarkably well when the problem is simple: you need volume in a single table, the data doesn't need to look realistic, and you're not testing anything that depends on how the data actually relates to other tables. Performance benchmarks, index tuning, capacity planning — this is where generate_series() earns its keep.
Where it breaks down is FK constraints. The moment you have an orders table that references users, and order_items that reference both orders and products, you're orchestrating insert order by hand. Add a nullable FK, a circular reference between two tables, or a schema migration that introduces a new required column, and the script needs updating.
That's not a knock on generate_series() — it's just the wrong tool for relational test data. It generates rows, not datasets.
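To make the limitation concrete, here is what orchestrating a dependent table by hand looks like. This is a sketch only: it assumes an orders table exists with these columns, and that users.id is a SERIAL key populated 1 through 100,000 by the insert above.

```sql
-- Sketch: assumes orders(user_id, total, created_at) exists and that
-- users.id is a SERIAL column holding ids 1..100000 with no gaps.
INSERT INTO orders (user_id, total, created_at)
SELECT
  (1 + floor(random() * 100000))::int,   -- fails with an FK violation if ids have gaps
  round((random() * 500)::numeric, 2),
  NOW() - (random() * interval '90 days')
FROM generate_series(1, 500000) AS n;
```

The fragility is the point: the id-range assumption, the insert order, and the column list all live in your head, and a single migration can invalidate any of them.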
## The production copy temptation
Before going further: copying production data into your development environment looks like the obvious solution. It's realistic, it's complete, it requires almost no effort.
The problems are well-documented: GDPR and similar regulations treat development environments as data processing contexts. A developer laptop with a copy of your user table isn't a theoretical risk. It's an actual one, and "we needed it for testing" isn't a defense that has held up well in enforcement actions.
Beyond compliance, there's the practical issue that production data ages into your development environment. Last week's dump becomes the baseline people are accustomed to. Queries get tuned against it. Features get tested against the specific edge cases that happen to be in that dump. Then the dump becomes stale and nobody wants to refresh it because now their tests will break.
If your team is doing this today, staging without production data covers the migration path in more detail.
## Faker and factory libraries — more control, more upkeep
The next thing most teams reach for is a Faker-based script. Python, JavaScript, Ruby — they all have solid Faker libraries, and the ergonomics are genuinely good for simple cases:
```python
from faker import Faker
import psycopg2

fake = Faker()
conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()

for _ in range(1000):
    cur.execute(
        "INSERT INTO users (name, email, created_at) VALUES (%s, %s, %s)",
        (fake.name(), fake.email(), fake.date_time_this_year())
    )

conn.commit()
```
The value here is realistic-looking data — names that look like names, emails that look like emails, dates with plausible distributions. For demos, for screenshot-worthy UIs, for testing features that display data to users, this matters.
The limitation is the same as generate_series(): Faker generates values, not relationships. You still have to handle the case where order_items needs a valid order_id that references an actual row in orders, which in turn needs a valid user_id.
Teams usually solve this by making the seed script progressively more elaborate — seeding one table, capturing its IDs, then using those IDs in the next table, and so on. This works until a migration changes a FK, adds a new constraint, or introduces a new table. At that point, someone has to update the script. That someone is usually whoever is on-call or whoever notices the CI failure.
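The capture-and-reuse pattern itself is simple enough to sketch. The version below uses the standard library's sqlite3 so the sketch is self-contained and runnable anywhere; against PostgreSQL you would use psycopg2 and `INSERT ... RETURNING id` instead of `lastrowid`, but the shape is the same. Table and column names are illustrative.

```python
# The "seed parents, capture ids, reuse them in children" pattern.
# Sketched with stdlib sqlite3; in PostgreSQL, use psycopg2 and
# "INSERT ... RETURNING id" instead of cursor.lastrowid.
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY,"
    " user_id INTEGER REFERENCES users(id), total REAL)"
)

# Seed the parent table first and remember its generated ids
user_ids = []
for n in range(100):
    cur = conn.execute("INSERT INTO users (email) VALUES (?)",
                       (f"user{n}@example.com",))
    user_ids.append(cur.lastrowid)

# Children can now only reference ids that are guaranteed to exist
for _ in range(500):
    conn.execute("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                 (random.choice(user_ids), round(random.uniform(5, 500), 2)))

count = conn.execute(
    "SELECT count(*) FROM orders o JOIN users u ON o.user_id = u.id"
).fetchone()[0]
print(count)  # prints 500: every order joins to a real user
```

Two tables, and the script is already stateful. Each additional level of the FK graph adds another capture step, in the right order, forever.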
It's not that this approach fails — it's that it requires ongoing maintenance that rarely gets budgeted for explicitly.
## Where the hard problem actually lives
The core issue with hand-rolled test data for PostgreSQL is satisfying FK constraints across the whole schema.
A non-trivial schema has tables that reference other tables, which reference other tables. Some of those references are circular. Some are optional. Some have been added over time and the original seed script predates them.
To insert valid test data into a real schema, you need to:
- Know which tables depend on which across the whole schema
- Insert in the right order — with handling for any circular references
- Generate FK column values that actually exist in the referenced tables
- Update all of this whenever a migration changes anything
This is solvable. Database engineers do it all the time. But it's also the kind of problem that has no elegant solution in a script — you end up with increasingly elaborate scaffolding that's specific to your schema at a moment in time.
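Here is the skeleton of that scaffolding — just the insert-order step, in Python, with a hand-maintained dependency map standing in for real schema introspection. The table names are illustrative; the maintenance burden is that this map must be updated by hand after every migration.

```python
# Sketch: computing a safe insert order from a hand-maintained FK map.
# In a real script, this dict drifts out of sync with the schema unless
# someone updates it after every migration.
from graphlib import TopologicalSorter

# table -> set of tables it references via FK (which must be seeded first)
fk_deps = {
    "users": set(),
    "products": set(),
    "orders": {"users"},
    "order_items": {"orders", "products"},
}

insert_order = list(TopologicalSorter(fk_deps).static_order())
print(insert_order)  # referenced tables always come before referencing ones

# Circular references raise graphlib.CycleError here and need special
# handling (e.g. insert one side with a NULL FK, then UPDATE it).
```

And this is only step one of the four: it says nothing about generating FK values that exist, or about what happens when a cycle appears.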
Seedfast reads your live PostgreSQL schema and handles this automatically. Every time you run it, it works from the current schema, so migrations don't break anything. You describe what dataset you want:
```bash
seedfast seed --scope "an e-commerce database with 500 products, realistic order history, and a mix of customer account ages"
```
That's the entire instruction. Seedfast handles the relationships automatically — every generated row is valid across your full schema. Free tier is available — see pricing.
This is the part of the problem where scripts hit a wall. The generation part is tractable. The relationship-satisfaction part, across a real schema that's actively being migrated, is where the maintenance cost compounds.
## A note on volume
For large datasets — hundreds of thousands or millions of rows — the realism of the data matters as much as the volume. Query plans, index behavior, and cache hit rates all depend on how the data is actually distributed, not just how many rows exist.
Hand-rolled scripts can produce volume quickly, but they typically generate uniform distributions — the same price range, the same order sizes, the same account ages. Real user behavior isn't uniform, and tests against uniform data miss the edge cases that only surface with realistic distributions.
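The difference is easy to see in miniature. This sketch contrasts a uniform order-size generator with a heavy-tailed one; the shape parameter and cutoff are arbitrary, chosen only to illustrate the contrast.

```python
# Uniform vs heavy-tailed order sizes: uniform data never produces the
# outliers that dominate real query plans. Parameters are arbitrary.
import random

random.seed(42)  # deterministic for illustration

uniform_sizes = [random.randint(1, 10) for _ in range(10_000)]
skewed_sizes = [min(int(random.paretovariate(1.5)), 50) for _ in range(10_000)]

print(max(uniform_sizes))  # never exceeds 10
print(max(skewed_sizes))   # long tail: some orders are far larger
```

At identical row counts, an index or query plan can behave very differently under these two datasets — which is why volume alone isn't enough.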
Seedfast handles large volumes with realistic distributions and valid relationships across all tables. For production-scale test environments, that combination is difficult to replicate with a script.
## Related guides
- Database Seeding: From Your First seed.sql to Fully Automated — the broader context for how test data fits into seeding workflows
- Seed File Maintenance: When Your Seeds Fall Out of Sync — what happens to hand-rolled scripts over time
- Get started with Seedfast — connect to your PostgreSQL database and run your first schema-aware seed