PostgreSQL Test Data: A Syntax Cookbook
By Mikhail Shytsko, Founder at Seedfast · · Updated
TL;DR: PostgreSQL ships three primitives that cover most test-data work — generate_series() for single-table volume, psql \copy for bulk-loading prepared CSV, and pg_dump --data-only for carving production-shaped slices. The thing none of them solve is FK insert order across a moving schema, which is where hand-rolled approaches start needing a topological sort and start needing maintenance.
generate_series()is the right tool for single-table volume and for synthesizing time-series-shaped rows; pair it withrandom()andintervalarithmetic to get realistic dates and amounts in one statementpsql \copyis the fastest path for loading prepared CSV from a developer machine — it runs client-side, so it works against managed Postgres where server-sideCOPYwould need superuser orpg_read_server_filespg_dump --data-only --tablecarves a single table out of an existing database; combined with--exclude-table-datait gives you a structural restore with selective contents- FK insert order across more than a few tables turns into a topological sort against
pg_constraint; deferrable constraints (SET CONSTRAINTS ALL DEFERRED) are the escape hatch for cyclic FKs - For the broader method comparison (Faker, masking, production copies, schema-aware generation), see the methods hub linked below — this page stays in PostgreSQL syntax land
This page is the syntax-level cookbook for getting test data into PostgreSQL: the built-in functions, the psql meta-commands, the pg_dump invocations, and the catalog queries that let you reason about insert order against a real schema. For the broader question of which method to use — Faker scripts, masking, production copies, schema-aware generation — see test data generation: methods compared, which is the methods hub. This page assumes you've already picked the method and you're staring at a psql prompt.
generate_series() produces a one-column virtual table over an integer or timestamp range. It's the cheapest way to produce volume in a single table and it composes cleanly with random(), interval arithmetic, and JOINs against existing rows.
The minimal pattern most developers write first:
INSERT INTO users (email, created_at)
SELECT
'user' || n || '@example.test',
NOW() - (random() * interval '365 days')
FROM generate_series(1, 100000) AS n;
Ten seconds for 100k rows on a laptop. No dependencies, runs anywhere psql runs.
The timestamp form is what makes it useful beyond row count. Producing one event per day over a year:
INSERT INTO events (occurred_at, kind)
SELECT
d,
CASE WHEN random() < 0.3 THEN 'login' ELSE 'page_view' END
FROM generate_series(
'2025-01-01'::timestamptz,
'2025-12-31'::timestamptz,
interval '1 day'
) AS d;
Multi-table volume in one statement, when FK targets already exist:
INSERT INTO orders (user_id, placed_at, total_cents)
SELECT
u.id,
NOW() - (random() * interval '180 days'),
(random() * 50000)::int + 100
FROM users u, generate_series(1, 5) AS o
WHERE random() < 0.6;
That produces roughly 0–5 orders per user, on a 60% probability. The users table has to exist first — and that's where the limits begin. generate_series() is for volume against rows that already resolve. The moment you need to populate users → orders → order_items → products in a single seed run, you're back to writing application code or to a topological-sort query (below).
When you already have a CSV — exported from a colleague's database, scraped from a public dataset, hand-crafted in a spreadsheet — psql \copy is the fastest path in. Two things make it the right primitive for test-data work rather than the server-side COPY command:
- It runs on the client side. The file path is local to wherever you're running
psql, not local to the Postgres server. This is what makes it work against managed Postgres (RDS, Cloud SQL, Supabase, Neon) where you don't have shell access to the server filesystem. - It needs no special privileges. Server-side
COPY FROM 'filename'requirespg_read_server_filesor superuser;\copyjust needsINSERTon the target table.
Minimal invocation from a psql session:
\copy users (email, created_at) FROM 'users.csv' WITH (FORMAT csv, HEADER true);
A few patterns that come up in practice:
-- Skip the header row (the default is FORMAT text, which has no header concept)
\copy products FROM 'products.csv' WITH (FORMAT csv, HEADER true)
-- Different delimiter and a NULL-as-literal-string convention
\copy events FROM 'events.tsv' WITH (FORMAT csv, DELIMITER E'\t', NULL '\N')
-- Only load specific columns; the rest take DEFAULT (works with serial IDs)
\copy users (email, signup_source) FROM 'users.csv' WITH (FORMAT csv, HEADER true)
-- Quote handling for free-text columns containing commas
\copy comments FROM 'comments.csv' WITH (FORMAT csv, HEADER true, QUOTE '"', ESCAPE '"')
Gotchas worth flagging:
- Encoding mismatch is the most common silent failure. If your CSV is UTF-8 with a BOM and your database is UTF-8 without, the first row's first column has an invisible
prepended. Strip the BOM before loading, or useiconvto normalize. - Errors halt the whole load by default.
\copyis transactional; one bad row rolls back the entire file. PostgreSQL 17 addedON_ERROR ignoreto skip bad rows; on older versions, either pre-clean the CSV or load into a staging table with looser types andINSERT … SELECT … WHEREyour way out. \copyruns your triggers.BEFORE INSERTandAFTER INSERTtriggers fire per row, exactly like a regularINSERT. Useful when you need a computed column populated at load time; a hidden cost when you forgot the audit-log trigger fires 100k times for a 100k-row CSV.
pg_dump is best known for full backups, but a handful of flags turn it into a precise test-data extractor. Useful for the case where production-shaped data (with PII handled separately upstream) is the source of truth and you want to carve out the parts a developer needs without dragging the whole database.
Data-only dump of a single table:
pg_dump --data-only --table=public.products \
--no-owner --no-acl \
-h prod-host -U readonly -d appdb \
> products-data.sql
Schema for everything, data for a curated subset:
pg_dump --schema-only --no-owner -d appdb > schema.sql
pg_dump --data-only \
--table=public.users \
--table=public.products \
--table=public.categories \
--no-owner -d appdb \
> seed-data.sql
Inverse: dump the schema and all data except the noisy tables:
pg_dump \
--exclude-table-data='audit_*' \
--exclude-table-data='public.event_log' \
--no-owner -d appdb \
> app-without-audit.sql
Restoring to a fresh development database is then just two commands:
createdb app_dev
psql -d app_dev -f schema.sql -f seed-data.sql
Limits of this approach:
pg_dumpdoes not strip PII. If you pipe production data through it, the dump file is in HIPAA / GDPR / SOC 2 scope until it's masked, deleted, or kept inside a vetted environment. Anonymizers like Greenmask exist to slot in front of restore.- The custom format (
-Fc) is what you want for selective restore withpg_restore --table=…; SQL format is what you want for grep-able human inspection. pg_dumpproduces a snapshot of a moment in time. Re-running it as the seed source means re-pulling production. That's the maintenance trap, and it's why most teams that start here eventually move to a generator.
The hard problem with hand-rolled PostgreSQL test data isn't producing values — it's producing them in an order the database will accept. A six-table schema with three FK chains is already past the point where humans get this right by eye.
The catalog query that gives you the dependency graph for the public schema:
SELECT
child.relname AS child_table,
parent.relname AS parent_table
FROM pg_constraint c
JOIN pg_class child ON child.oid = c.conrelid
JOIN pg_class parent ON parent.oid = c.confrelid
JOIN pg_namespace ns ON ns.oid = child.relnamespace
WHERE c.contype = 'f'
AND ns.nspname = 'public'
ORDER BY child.relname, parent.relname;
Each row reads "child references parent" — the edges of a directed graph. A valid insert order is any topological sort of that graph, with parents inserted before children.
A recursive CTE that assigns an insert_order to every public table — parents at level 0, their children at level 1, and so on:
WITH RECURSIVE deps AS (
-- Anchor: every public table starts at level 0
SELECT c.oid AS table_oid, c.relname AS table_name, 0 AS level
FROM pg_class c
JOIN pg_namespace ns ON ns.oid = c.relnamespace
WHERE c.relkind = 'r' AND ns.nspname = 'public'
UNION ALL
-- Recursive: a child sits one level deeper than its deepest parent
SELECT child.oid, child.relname, d.level + 1
FROM deps d
JOIN pg_constraint fk ON fk.confrelid = d.table_oid AND fk.contype = 'f'
JOIN pg_class child ON child.oid = fk.conrelid
WHERE d.level < 20
)
SELECT table_name, MAX(level) AS insert_order
FROM deps
GROUP BY table_name
ORDER BY insert_order, table_name;
Sort the output by insert_order ascending and you've got a valid sequence. (The WHERE d.level < 20 guards against infinite recursion on cyclic FKs, which the next paragraph covers.)
When FKs are cyclic — users.created_by → users.id, or orders ↔ shipments where each references the other — topological sort has no valid order. The PostgreSQL escape hatch is deferrable constraints: declare the FK with DEFERRABLE INITIALLY DEFERRED, then wrap the whole seed in a transaction. Constraint checks run at COMMIT, so you can insert rows in either order:
ALTER TABLE orders
ADD CONSTRAINT orders_shipment_fk
FOREIGN KEY (shipment_id) REFERENCES shipments(id)
DEFERRABLE INITIALLY DEFERRED;
BEGIN;
-- Insert in any order; the constraint is checked at COMMIT
INSERT INTO orders (id, shipment_id) VALUES (1, 1);
INSERT INTO shipments (id, order_id) VALUES (1, 1);
COMMIT;
If the constraint already exists as non-deferrable, you can toggle it in place — no drop required (PostgreSQL 9.4+):
ALTER TABLE orders
ALTER CONSTRAINT orders_shipment_fk DEFERRABLE INITIALLY DEFERRED;
(ALTER CONSTRAINT works on foreign keys for the deferrable attributes only; you can't change the referenced columns this way.)
For one-off seeding, SET session_replication_role = replica disables FK and trigger enforcement for the session entirely. It bypasses checks rather than deferring them, and it requires superuser (or, on PostgreSQL 15+, a non-superuser explicitly granted SET on the parameter). That's a pg_dump-restore pattern, not a development-loop pattern.
This is the part of the problem where hand-rolled scripts start needing real engineering. The query above gives you the order, but it doesn't generate the data. The data still has to be produced — with valid types, plausible values, and FK column values that match the IDs you just inserted upstream. That's the maintenance cliff most teams hit around 15–20 tables.
The three primitives above cover most of what PostgreSQL gives you for test data at the syntax level. Past a certain schema size, the bottleneck stops being syntax and starts being the rate at which your schema changes versus the rate at which someone updates the seed script.
Seedfast reads the live schema on every run, derives the FK graph (the same one the topological-sort query above produces), and generates valid rows in dependency order from a plain-English scope:
seedfast seed --scope "an e-commerce database with 500 products, realistic order history, and a mix of customer account ages"
The migration story is the part scripts don't survive. Add a referrals table next sprint, and a Seedfast run picks it up automatically; a hand-rolled script needs the topological-sort query re-run and the data-generation code updated for the new table. Free tier exists — see pricing.
For the broader landscape — why you'd use schema-aware generation rather than Faker, masking, or pg_dump-and-anonymize — that comparison lives in the test data generation methods hub.
The PostgreSQL-native tools are generate_series() for single-table volume (INSERT INTO users SELECT 'user'||n FROM generate_series(1, 100000) AS n), psql \copy for bulk-loading CSV from your local machine, and pg_dump --data-only --table=… for carving subsets out of an existing database. Past a few tables with FKs, all three need help with insert order — a topological sort against pg_constraint gives you the right sequence; cyclic FKs need DEFERRABLE INITIALLY DEFERRED. For the method-level decision (script vs. masking vs. schema-aware generation), see the test data generation hub.
\copy is a psql meta-command that streams data over the client connection, so the CSV file lives on the machine running psql rather than on the database server. COPY FROM 'filename' is a server-side SQL command that reads the file from the server's filesystem, which requires pg_read_server_files (or superuser) and is unavailable on most managed Postgres services (RDS, Cloud SQL, Supabase, Neon). For developer-side bulk loads against managed Postgres, \copy is the practical choice.
Two patterns. For acyclic FKs, derive the dependency graph from pg_constraint (catalog query in the topological-sort section above) and insert in level order — parents first, children second. For cyclic FKs (users.created_by → users.id, or two tables that reference each other), declare the FK as DEFERRABLE INITIALLY DEFERRED and wrap inserts in a single transaction; PostgreSQL checks the constraint at COMMIT rather than at each row. Hand-rolling this works at small scale and becomes a maintenance burden as the FK graph grows.
Yes — schema-aware generators like Seedfast connect to your PostgreSQL database, derive the FK graph from pg_constraint, and produce inserts in topological order from a plain-English scope. No seed.sql, no factories. See database seeding for the broader workflow.
- Test Data Generation: Methods Compared — the broader, non-PostgreSQL-specific decision matrix across all approaches
- Database Seeding: From Your First seed.sql to Fully Automated — the broader context for how test data fits into seeding workflows
- Database Seeder: 7 Tools and Per-ORM Quick Reference — the seeder tool itself (CLI/library choice), not the generation method
- Seed File Maintenance: When Your Seeds Fall Out of Sync — what happens to hand-rolled scripts over time
- Get started with Seedfast — connect to your PostgreSQL database and run your first schema-aware seed