
Small Data, Big Lies: 6 Bugs Your Test Suite Will Never Catch

The bugs that only show up when your database has real data in it — and how to catch them before production does.

Your test suite is green. Coverage is 94%. The PR gets approved, merged, deployed. Two hours later, the on-call gets paged: the orders page takes 47 seconds to load. The customer export endpoint is OOM-killing pods. Pagination skips page 7 entirely.

Nothing changed in the code. What changed is that production has 2 million orders, and your test database had 12.

This isn't a testing failure. It's a data volume blind spot — and almost every team has one.

The Volume Blind Spot

Most test databases contain between 5 and 50 rows per table. Just enough to verify that the happy path works. This is fine for unit tests, but it creates a dangerous assumption: if it works with 10 rows, it works with 10 million.

It doesn't. Entire categories of bugs are invisible at low volume and catastrophic at scale. Here are the six most common ones.

1. Pagination Off-by-One

The classic. Your API returns paginated results. With 10 rows and a page size of 10, there's one page. Everything works. With 10,001 rows, page 1001 is either empty or duplicates page 1000 — depending on whether your offset calculation uses > or >=.

-- Looks correct with small data
SELECT * FROM orders ORDER BY created_at LIMIT 10 OFFSET ?

-- At scale: duplicate rows when created_at isn't unique
-- Rows shift between pages during concurrent inserts

The fix is usually cursor-based pagination, but you'll never discover the bug until you have enough rows to fill multiple pages — with realistic timestamp distributions that create collisions.
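The cursor idea can be sketched in miniature. The example below simulates keyset pagination over an in-memory slice rather than a live database (the `Order` struct and its fields are illustrative, not from the original): the `(created_at, id)` tuple cursor stays stable even when every timestamp collides, which is exactly where OFFSET paging breaks.

```go
package main

import (
	"fmt"
	"sort"
)

// Order is a minimal stand-in for a database row; field names are illustrative.
type Order struct {
	CreatedAt int64 // unix timestamp; duplicates allowed
	ID        int64 // unique tie-breaker
}

// keysetPage simulates the SQL
//   WHERE (created_at, id) > (cursorTS, cursorID)
//   ORDER BY created_at, id LIMIT pageSize
// The composite (timestamp, id) cursor pins each page to a position in a
// total order, so pages stay stable even when created_at values collide.
func keysetPage(orders []Order, cursorTS, cursorID int64, pageSize int) []Order {
	sorted := append([]Order(nil), orders...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].CreatedAt != sorted[j].CreatedAt {
			return sorted[i].CreatedAt < sorted[j].CreatedAt
		}
		return sorted[i].ID < sorted[j].ID
	})
	var page []Order
	for _, o := range sorted {
		// Tuple comparison: strictly after the cursor position.
		if o.CreatedAt > cursorTS || (o.CreatedAt == cursorTS && o.ID > cursorID) {
			page = append(page, o)
			if len(page) == pageSize {
				break
			}
		}
	}
	return page
}

func main() {
	// 25 orders, all with the same timestamp — the worst case for OFFSET paging.
	var orders []Order
	for i := int64(1); i <= 25; i++ {
		orders = append(orders, Order{CreatedAt: 1000, ID: i})
	}
	seen := 0
	cursorTS, cursorID := int64(0), int64(0)
	for {
		page := keysetPage(orders, cursorTS, cursorID, 10)
		if len(page) == 0 {
			break
		}
		seen += len(page)
		last := page[len(page)-1]
		cursorTS, cursorID = last.CreatedAt, last.ID
	}
	fmt.Println(seen) // every row visited exactly once: 25
}
```

Walking the cursor forward visits each of the 25 rows exactly once, with no duplicates and no skips, regardless of timestamp collisions.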

How to catch it:

seedfast seed --scope "seed 10,000 orders with timestamps"

Then run your pagination endpoint from page 1 to the last page. Count the total items. If it doesn't match SELECT COUNT(*), you have a bug.

2. N+1 Queries

Your ORM loads a list of orders. For each order, it lazily loads the customer. With 5 orders, that's 6 queries — nobody notices. With 5,000 orders, that's 5,001 queries. The endpoint that responded in 80ms now takes 12 seconds.

# 5 rows: 6 queries, 80ms
GET /api/orders 200 OK (80ms)

# 5,000 rows: 5,001 queries, 12,400ms
GET /api/orders 200 OK (12,400ms)  # or timeout

N+1 bugs are particularly dangerous because they scale linearly with data. Every new row adds one more query. The performance degradation is proportional and predictable — but only if you have enough data to notice it.

How to catch it:

seedfast seed --scope "seed 5,000 orders with customers and line items"

Enable query logging, hit your endpoints, and count the queries. Any endpoint executing more than ~10 queries for a list view has a problem.
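The usual fix is to batch the lazy lookups into a single `IN` query. A minimal sketch, using an in-memory stand-in for the database with a query counter (the `DB`, `getCustomer`, and `getCustomersBatch` names are hypothetical, not a real ORM API), shows why query count stays flat under batching:

```go
package main

import "fmt"

// DB is an illustrative in-memory "database" that counts queries issued.
type DB struct {
	customers map[int]string
	queries   int
}

// getCustomer is the lazy-load path: one query per order.
func (d *DB) getCustomer(id int) string {
	d.queries++
	return d.customers[id]
}

// getCustomersBatch is the eager path: one query for all IDs,
// analogous to SELECT ... WHERE id IN (...).
func (d *DB) getCustomersBatch(ids []int) map[int]string {
	d.queries++
	out := make(map[int]string, len(ids))
	for _, id := range ids {
		out[id] = d.customers[id]
	}
	return out
}

func main() {
	db := &DB{customers: map[int]string{}}
	orderCustomer := make([]int, 5000)
	for i := range orderCustomer {
		db.customers[i%100] = fmt.Sprintf("customer-%d", i%100)
		orderCustomer[i] = i % 100
	}

	// N+1: one query per order.
	db.queries = 0
	for _, id := range orderCustomer {
		_ = db.getCustomer(id)
	}
	fmt.Println("lazy:", db.queries) // 5000 queries

	// Batched: a single IN query regardless of order count.
	db.queries = 0
	_ = db.getCustomersBatch(orderCustomer)
	fmt.Println("batched:", db.queries) // 1 query
}
```

Most ORMs expose this as eager loading (`preload`, `includes`, `select_related`, and similar); the counter here just makes the difference visible.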

3. Missing Indexes

PostgreSQL can scan 100 rows in under a millisecond without an index. With 1 million rows, that same scan takes seconds. The query planner switches from sequential scan to... still sequential scan, because there's no index to use.

-- Fast at 100 rows (seq scan is fine)
SELECT * FROM users WHERE email = 'john@example.com';

-- 1M rows: 800ms full table scan
-- With index: 0.1ms

The insidious part: your test suite passes at full speed. Your development database has 50 users. The missing index is invisible until production traffic arrives.

How to catch it:

seedfast seed --scope "seed 100,000 users with realistic email addresses"

Then run EXPLAIN ANALYZE on your critical queries. Any sequential scan on a table over 10K rows is a red flag.

4. Memory Blowups

Your ORM loads query results into memory. With 100 rows, that's a few kilobytes. With 100,000 rows, the endpoint allocates 500MB and gets OOM-killed.

// Loads ALL rows into memory
rows, err := db.Query("SELECT * FROM users")
if err != nil {
    return err
}
defer rows.Close()

var allUsers []User
for rows.Next() {
    var u User
    if err := rows.Scan(&u.ID, &u.Email); err != nil { // one destination per column
        return err
    }
    allUsers = append(allUsers, u) // grows unbounded with the table
}

This pattern hides in export endpoints, report generators, batch jobs, and admin dashboards. Any code path that collects results into a slice, list, or array without a limit is a time bomb that detonates at scale.
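The defusal is to process rows in fixed-size chunks instead of accumulating them. A minimal sketch, with a simulated row source standing in for a database cursor (`processInChunks` is a hypothetical helper, not a library function): peak memory is bounded by the chunk size, not the table size.

```go
package main

import "fmt"

// processInChunks consumes rows from next() in fixed-size batches,
// handing each batch to handle() instead of accumulating everything.
// Peak memory is bounded by chunkSize, not by the number of rows.
func processInChunks(next func() (int, bool), chunkSize int, handle func([]int)) int {
	buf := make([]int, 0, chunkSize)
	total := 0
	for {
		row, ok := next()
		if !ok {
			break
		}
		buf = append(buf, row)
		if len(buf) == chunkSize {
			handle(buf)
			total += len(buf)
			buf = buf[:0] // reuse the buffer; allocation stays flat
		}
	}
	if len(buf) > 0 {
		handle(buf)
		total += len(buf)
	}
	return total
}

func main() {
	// Simulate a 100,000-row result set delivered one row at a time.
	i := 0
	next := func() (int, bool) {
		if i >= 100000 {
			return 0, false
		}
		i++
		return i, true
	}
	maxBuf := 0
	n := processInChunks(next, 1000, func(chunk []int) {
		if len(chunk) > maxBuf {
			maxBuf = len(chunk)
		}
	})
	fmt.Println(n, maxBuf) // 100000 rows processed, never more than 1000 in memory
}
```

The same shape applies to export endpoints: stream each chunk to the response writer (or a file) rather than building the full payload first.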

How to catch it:

seedfast seed --scope "seed 100,000 users with profiles and activity logs"

Hit your export/report endpoints and monitor memory usage. If RSS climbs linearly with row count, you're loading the entire result set into memory.

5. Timeout Cascades

Service A calls Service B, which queries the database. With small tables, the query runs in 5ms. The total request takes 50ms. With large tables, the query takes 3 seconds. Service A's 2-second timeout fires. The retry hits Service B again. Service B is now handling two slow queries. The third request times out too. The circuit breaker opens. The dashboard goes red.

Small data:   A → B (50ms) → DB (5ms)    ok
Large data:   A → B (3.2s) → DB (2.8s)   timeout
              A → B (retry) → DB         timeout (DB now under double load)
              A → circuit breaker open   cascade failure

Timeout cascades don't happen with 50 rows. They happen with 500,000 rows, when one slow query is enough to breach a timeout boundary.

How to catch it:

seedfast seed --scope "seed 500,000 transactions with accounts and categories"

Run your integration test suite and watch for any request that approaches your timeout thresholds. A request at 80% of the timeout limit today is a timeout tomorrow when the table grows.

6. Unique Constraint Collisions

Your test fixtures use carefully crafted values: user1@test.com, user2@test.com. No collisions. In production, the email generation logic produces duplicates at scale — two users sign up with the same normalized email, or a batch import contains near-duplicates that pass validation individually but violate constraints together.

-- Works with 10 hand-crafted rows
INSERT INTO users (email) VALUES ('user1@test.com');

-- Fails at 10,000 rows with realistic data distributions
-- ERROR: duplicate key value violates unique constraint "users_email_key"

AI-generated test data produces realistic distributions — including the edge cases that hand-crafted fixtures miss. When Seedfast generates 10,000 email addresses, it follows realistic patterns that surface constraint issues before production does.

How to catch it:

seedfast seed --scope "seed 10,000 users with realistic names and emails"

If your constraints hold under 10K realistic rows, they'll hold in production. If they don't — better to find out now.
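The batch-import case can be checked before the database rejects it: normalize, then look for collisions. A minimal sketch (the normalization rule here is just trim + lowercase — real rules vary by provider, e.g. Gmail's dot-folding — and both helpers are hypothetical names):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeEmail applies a simple normalization: trim whitespace and
// lowercase. Illustrative only; real-world rules are provider-specific.
func normalizeEmail(e string) string {
	return strings.ToLower(strings.TrimSpace(e))
}

// findCollisions returns normalized emails that appear more than once —
// inputs that pass validation individually but would violate a unique
// constraint on the normalized column together.
func findCollisions(emails []string) []string {
	seen := map[string]bool{}
	var dups []string
	for _, e := range emails {
		n := normalizeEmail(e)
		if seen[n] {
			dups = append(dups, n)
		}
		seen[n] = true
	}
	return dups
}

func main() {
	batch := []string{
		"Jane.Doe@Example.com",
		"jane.doe@example.com ", // trailing space: distinct raw string, same user
		"bob@example.com",
	}
	fmt.Println(findCollisions(batch)) // [jane.doe@example.com]
}
```

Running a check like this over a seeded batch surfaces the "valid individually, invalid together" failures before the INSERT does.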

The Pattern

Every bug in this list shares the same pattern:

  • Invisible at small scale — test suite passes, code review looks fine

  • Proportional to data volume — gets worse as tables grow

  • Discovered in production — where the data is, and the users are

  • Expensive to fix after the fact — incident response, hotfixes, post-mortems

The fix is also the same: test with realistic data volumes before deploying.

Shift Left With One Command

You don't need a production database copy. You don't need to write fixture factories for 50 tables. You don't need to maintain SQL dump files that drift from your schema.

# Seed production-scale data in your dev/staging database
seedfast seed --scope "seed 100,000 users with orders, payments, and activity logs"

Seedfast analyzes your schema, resolves foreign key dependencies, and generates the right proportions automatically. You describe what you need, review the plan, and approve:

Seeding Plan:
  public.users          100,000 records
  public.orders         450,000 records
  public.payments       320,000 records
  public.order_items    1,200,000 records
  public.activity_logs  2,000,000 records

Total: 4,070,000 records across 5 tables

Approve? (Y/n)

If the scope exceeds your plan limits, Seedfast asks you to refine — right there in the terminal, no restart needed. If tables already have data, they're skipped automatically.

In CI/CD

Add a seeding step to your pipeline. Run your test suite against real data volumes on every PR:

- name: Seed test database
  run: seedfast seed --scope "seed 50,000 users with orders" --output plain
  env:
    SEEDFAST_API_KEY: ${{ secrets.SEEDFAST_API_KEY }}
    DATABASE_URL: ${{ secrets.DATABASE_URL }}

- name: Run tests
  run: # your test suite command

The --scope flag auto-approves the plan, making it fully non-interactive. Table skipping makes it idempotent — safe to re-run.

Start Small, Then Scale

You don't have to jump to a million rows. Start with enough to surface the first category of bugs:

Goal               Suggested scope                   What it catches
Pagination bugs    1,000+ rows in paginated tables   Off-by-one, cursor issues
N+1 queries        500+ rows with relationships      Lazy loading performance
Missing indexes    50,000+ rows                      Sequential scan bottlenecks
Memory issues      100,000+ rows                     Unbounded collection growth
Timeout cascades   500,000+ rows                     Cross-service timeout breaches

Once you find (and fix) the first bug, you'll want to run every PR against realistic volumes. That's the point — make it a habit, not a one-time exercise.

Ready to find the bugs hiding in your small test database?

Get Started | Documentation | Pricing

Seedfast generates realistic test data from your schema description. No fixtures, no dumps, no maintenance.