Test Data for Fintech: Realistic Accounts Without Touching Production

By Mikhail Shytsko, Founder at Seedfast

Test data for fintech is realistic, relational data for a financial-services database (accounts, transactions, ledger entries, KYC records, cards) with no real customer information in it. Three things make it hard at the same time: production is off-limits because of PII and cardholder data under PCI DSS, the balances have to reconcile, and the money columns can't lose precision. Generate the rows straight from your schema and the compliance half of the problem goes away, because there's nothing real to mask when nothing real was copied.

If you build on a financial-services Postgres schema, you already know the wall. Compliance won't let production data leave production, and a pg_dump onto a developer's laptop is an audit finding waiting to happen. The data you actually need is the awkward kind: coherent across accounts, transactions, and ledger_entries, with balances that sum correctly and amounts stored as exact numerics. This page walks through what that data has to look like, the two compliant ways to produce it, and how generating from the schema turns the whole thing into one command.

  • Fintech test data has to satisfy three constraints at once: no production PII or cardholder data (PCI DSS), ledgers that reconcile (double-entry nets to zero), and exact monetary precision (numeric, never float).
  • Two compliant paths exist. Anonymize a production copy by masking the cardholder fields, which needs production access. Or generate from the schema, where nothing is copied so there's nothing to mask.
  • The hard part of a financial schema was never fake names. It's the foreign-key chain (account → transactions → ledger_entries), double-entry consistency, and balance distributions realistic enough to actually exercise reconciliation logic.
  • Generating from the schema keeps real account and card data out of dev and test entirely, rather than putting a masked copy of it there.

In a fintech system, the production database holds the two categories regulators care about most: personally identifiable information (names, addresses, government IDs pulled in during KYC) and cardholder data. PCI DSS Requirement 3 governs the second. A stored Primary Account Number (PAN) has to be rendered unreadable through hashing, truncation, tokenization, or strong cryptography, and when it's displayed it can show at most the first six and last four digits (PCI Security Standards Council). A raw card number sitting in a developer's local database is the precise thing that standard exists to prevent.
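To make the display rule concrete, here is a minimal sketch of a masking helper that keeps at most the first six and last four digits. The function name `mask_pan` and the 13 to 19 digit length check are assumptions for illustration. Display masking is also a separate control from rendering a stored PAN unreadable, which this sketch does not attempt.

```python
def mask_pan(pan: str) -> str:
    """Mask a card number for display, keeping the first six and last four digits."""
    # Strip spaces and dashes so formatted input is accepted.
    digits = "".join(ch for ch in pan if ch.isdigit())
    if not 13 <= len(digits) <= 19:
        raise ValueError("expected a 13-19 digit card number")
    hidden = len(digits) - 10  # everything between the first six and last four
    return digits[:6] + "*" * hidden + digits[-4:]


# 4111 1111 1111 1111 is a widely published test number, not a real card.
print(mask_pan("4111 1111 1111 1111"))  # 411111******1111
```

Even a masked value like this started life as a real PAN, which is why the sections below argue for never bringing the real number into dev or test at all.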

So "just copy prod into staging" isn't the shortcut it looks like. The moment real cardholder data or KYC records land in a lower environment, that environment inherits obligations it was never built to carry, and what looked like a time-saver turns into a compliance event. The same logic that pushes regulated teams toward compliant test data applies with extra force here: once card data is in scope, the cheapest move is to keep it out of dev and test altogether.

What a fintech team actually needs, then, is data that behaves like production (accounts with balances, transactions with realistic amounts, a ledger that reconciles) without a single real record in it. That's something you generate, not something you copy and scrub clean.

The naive view of test data is "fake names and emails." For a payments backend, names are the easy part. The hard part is the relational and numerical shape of the data.

A simplified financial-services schema looks something like this:

CREATE TABLE accounts (
  id           uuid PRIMARY KEY,
  customer_id  uuid NOT NULL REFERENCES customers(id),
  currency     text NOT NULL,              -- ISO 4217, e.g. 'USD'
  balance      numeric(19,4) NOT NULL,     -- exact, never float
  status       text NOT NULL,              -- active / frozen / closed
  opened_at    timestamptz NOT NULL
);

CREATE TABLE transactions (
  id           uuid PRIMARY KEY,
  account_id   uuid NOT NULL REFERENCES accounts(id),
  type         text NOT NULL,              -- debit / credit / transfer
  amount       numeric(19,4) NOT NULL,
  status       text NOT NULL,              -- pending / posted / reversed
  created_at   timestamptz NOT NULL
);

CREATE TABLE ledger_entries (
  id             uuid PRIMARY KEY,
  transaction_id uuid NOT NULL REFERENCES transactions(id),
  account_id     uuid NOT NULL REFERENCES accounts(id),
  direction      text NOT NULL,            -- debit / credit
  amount         numeric(19,4) NOT NULL,
  posted_at      timestamptz NOT NULL
);

Four properties make this data hard to fake by hand:

  • FK chains have to resolve in order. A ledger_entry needs a transaction, which needs an account, which needs a customer. Insert them out of order and the run fails on a foreign key violation. Keeping referential integrity intact means inserting in dependency order, which is trivial for four tables and a real problem at thirty.
  • Money is exact, not approximate. Balances and amounts live in numeric, not float. Test data that rounds or drifts will pass a smoke test and then hide rounding bugs that only surface in reconciliation.
  • Double-entry has to balance. For a ledger, every posted transaction produces matched debit and credit entries that net to zero. Random amounts in ledger_entries produce a ledger that never reconciles, so the reconciliation code path never actually runs against your test data.
  • Distributions have to be plausible. Real account balances are skewed: most accounts hold modest amounts, a few hold a lot. A handful of transactions get reversed or flagged. Uniform random values miss every edge case your fraud, limits, and reporting logic was written to catch.
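The precision and distribution properties can be sketched together in a few lines of Python. This is an illustration only: the log-normal parameters and the function name `skewed_balances` are assumptions, not values taken from any real portfolio or from a specific tool.

```python
import random
from decimal import Decimal, ROUND_HALF_EVEN

FOUR_PLACES = Decimal("0.0001")  # same scale as numeric(19,4)

# Binary floats drift when summed; Decimal values do not.
print(sum([0.1] * 10) == 1.0)                        # False
print(sum([Decimal("0.1")] * 10) == Decimal("1.0"))  # True


def skewed_balances(n: int, seed: int = 42) -> list[Decimal]:
    """Return n balances from a log-normal distribution, rounded to four places."""
    rng = random.Random(seed)  # seeded so a test run is reproducible
    balances = []
    for _ in range(n):
        raw = rng.lognormvariate(7.0, 1.5)  # assumed parameters; median near e^7
        balances.append(
            Decimal(str(raw)).quantize(FOUR_PLACES, rounding=ROUND_HALF_EVEN)
        )
    return balances
```

A log-normal draw gives the "most accounts modest, a few large" shape: the mean sits well above the median, which a uniform random range would not reproduce.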

This is the gap between generic test data generation and data shaped to a financial domain. Column-level tools fill the cells. They don't make the ledger reconcile.
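A small reconciliation check shows why independently filled cells fail. The sketch below assumes ledger rows arrive as `(transaction_id, direction, amount)` tuples mirroring the `ledger_entries` columns above; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def unbalanced_transactions(entries):
    """Return IDs of transactions whose debits and credits do not net to zero.

    `entries` is an iterable of (transaction_id, direction, amount) tuples.
    """
    net = defaultdict(Decimal)  # Decimal() is zero, so each total starts at 0
    for transaction_id, direction, amount in entries:
        if direction == "debit":
            net[transaction_id] += amount
        elif direction == "credit":
            net[transaction_id] -= amount
        else:
            raise ValueError(f"unknown direction: {direction!r}")
    return sorted(tx for tx, total in net.items() if total != 0)


entries = [
    ("tx-1", "debit", Decimal("250.0000")),
    ("tx-1", "credit", Decimal("250.0000")),  # matched pair
    ("tx-2", "debit", Decimal("81.3300")),
    ("tx-2", "credit", Decimal("19.0700")),   # independent random amounts
]
print(unbalanced_transactions(entries))  # ['tx-2']
```

If every transaction in a test dataset looks like `tx-2`, the reconciliation code flags all of them and never exercises its success path.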

Once production copies are off the table, two real approaches remain. They sit on opposite sides of one question: is any real data ever touched?

| Approach | What it does | Production access | Cardholder data in the pipeline |
| --- | --- | --- | --- |
| Anonymize production (masking) | Connects to prod, masks or tokenizes the PAN and PII, outputs a de-identified copy | Required | Yes; it starts from real records |
| Generate from schema (synthetic) | Reads the schema, generates relational rows from scratch | Not required | None; nothing real is ever read |

Anonymizing production data is the masking lane. A tool like Tonic Structural connects to your production database and applies masking, tokenization, generalization, and format-preserving encryption to the regulated fields while keeping referential integrity intact, and for financial services it specifically markets masking of PCI-regulated fields. It fits best when your tests depend on the exact distributions and edge cases of your real system, because the data inherits that structure: it is your production data, transformed. The cost is real, though. You need a live connection to production, a security review to bless that connection, and a per-field masking pipeline that someone has to keep maintained. And whatever comes out the far end, the pipeline itself still reads and processes real cardholder data.

Generating from the schema is the other side. The tool reads your schema (tables, columns, constraints, foreign keys) and builds valid relational rows from that alone, with no connection to production and no real row ever read. For a team locked out of prod, this is what lets dev and test simply not hold cardholder data at all, instead of holding a masked copy of it. The enterprise anonymization platforms (Delphix, K2View) sit in the masking lane too, behind enterprise contracts and quote-based pricing. The data seeding tools comparison lays out the full spectrum.

One term worth pinning down: "synthetic data for fintech testing" sometimes means ML-training datasets, which is a separate world. On this page it means dev, test, and demo data for your application database, the rows that fill your schema so integration tests have something real-shaped to run against, not a corpus for training a model.

Here is the generate-from-schema path in practice. Seedfast is a CLI that connects to a live Postgres database, reads the schema on every run, and generates relational data from a plain-language scope. It reads metadata (table and column names, types, foreign keys), not the rows. There's no production data in the loop because none is ever read.

seedfast seed --scope "fintech app with 100 accounts, transactions, and varied balances"
  → Connected to PostgreSQL
  → Found 34 tables, 67 foreign keys
  → Generating data...
  → Done. Seeded 12,847 rows in 6.3s

Because it walks the foreign-key graph, the customers → accounts → transactions → ledger_entries chain inserts in dependency order, and cycles in the graph, the kind that make hand-ordering impossible, are handled as well. The values are domain-shaped: balances in a skewed distribution, amounts as exact numerics, and a mix of posted, pending, and reversed transactions. When next sprint's migration adds a cards or kyc_documents table, the following run just picks it up. No seed script is sitting there encoding last week's schema, waiting to break.
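Dependency ordering itself is a topological sort over the foreign-key graph. The sketch below uses Python's standard `graphlib` module on the four tables from the schema above. It illustrates the general technique, not Seedfast's internals, and the `insert_order` helper is a hypothetical name.

```python
from graphlib import CycleError, TopologicalSorter

# Each table maps to the tables it references (its foreign-key parents).
fk_parents = {
    "customers": set(),
    "accounts": {"customers"},
    "transactions": {"accounts"},
    "ledger_entries": {"transactions", "accounts"},
}


def insert_order(parents):
    """Return table names so that every parent comes before its children."""
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"cycle in FK graph: {exc.args[1]}") from exc


print(insert_order(fk_parents))
# ['customers', 'accounts', 'transactions', 'ledger_entries']
```

A plain topological sort stops at a cycle, so cyclic schemas need something extra. Common workarounds in Postgres are inserting a nullable foreign key as NULL and updating it afterwards, or declaring the constraint deferrable; which approach a given tool takes is not covered here.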

It's worth being precise about what Seedfast doesn't touch. It never connects to production, and it doesn't mask or de-identify anything, because there's nothing to de-identify when the data is built from scratch. One detail does belong in any vendor review: to generate the data, the schema definition (table and column names, types, constraints) is sent to an AI provider. The row values never leave your database. But a table or column name can itself be sensitive, so treat that path like any other sub-processor and run it past your security lead.

None of this makes you PCI compliant, and Seedfast doesn't claim to. What it does is take real cardholder data out of the test-data path by never copying it in the first place; the attestation, scoping, and controls stay yours to own. The practical change is that your dev and test databases stop being somewhere a card number could leak from, because there was never one there to leak.

If you're choosing a tool rather than an approach, the best Postgres test data generator comparison covers the wider field, and the GenRocket alternative guide covers the down-market, self-serve path if an enterprise vendor has already quoted you for this job.

Generating from the schema isn't the right call for every fintech case. When your tests lean on the exact transaction distributions, edge cases, and volumes of your real system (validating a fraud model against historical patterns, say, or analytics that have to match prod to the decimal), anonymized production data will be more faithful, because it started life as that data. Generated data is shaped right and realistic, but it isn't your actual customers' history. So the split tends to fall along the work. For development, integration tests, CI, and demos, generation gets you a compliant, working database without ever going near production. For analytics and ML on real-world patterns, masking is the honest pick. Plenty of teams run both, for different jobs.

There's a number on this. IBM put the global average cost of a data breach at $4.44 million in 2025, and $10.22 million in the United States (IBM Cost of a Data Breach Report 2025). A copy of production cardholder data sitting in a staging environment is exactly the kind of exposure behind that figure. On the same spreadsheet as a breach like that, keeping real card data out of dev and test is about the cheapest control a fintech team can buy.

Generated test data contains no real cardholder data (no real PAN, no real account holder) because it's built from your schema instead of copied from production. That keeps card data out of the environment, which is the whole point of PCI-compliant test data. It does not, on its own, make you PCI compliant: compliance is a property of your entire environment, scope, and controls, and the attestation stays with your team. Generation handles one piece of it, getting real card data out of the test-data path.

Masking starts from real production records and transforms the sensitive fields, so it needs a production connection and a pipeline that handles real cardholder data on the way through, even though the output is de-identified. Synthetic generation starts from the schema, so no row was ever real and nothing in the pipeline is. The tradeoff is fidelity versus cleanliness: masking reproduces your exact production distributions, while generation hands you a dataset with nothing real in it.

Generated rows contain no real customer data. The generator reads the schema (tables, columns, constraints, foreign keys), not the rows in production. The values it writes fit your column formats, card-shaped and account-shaped, but they map back to no real person or card. Nothing real goes in, so nothing real comes out.

Generated data can reconcile, if the generator understands the relationship between the tables. One that walks the FK graph can produce matched debit and credit ledger_entries for each transaction so the ledger nets correctly, and exact numeric amounts so balances don't drift. Column-level tools that emit one random value per cell won't: they fill ledger_entries with independent random amounts that never sum to zero.
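Generating a matched pair is simple once both rows are produced from a single amount. Here is a minimal sketch, assuming rows are plain dictionaries keyed by the `ledger_entries` column names. The `transfer_entries` helper is hypothetical, and debiting the source account is a simplification, since the correct side depends on the account type in a real chart of accounts.

```python
import uuid
from decimal import Decimal


def transfer_entries(source_account: str, dest_account: str, amount: Decimal) -> list[dict]:
    """Return the two ledger_entries rows for one transfer: a debit and a matching credit."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    amount = amount.quantize(Decimal("0.0001"))  # same scale as numeric(19,4)
    transaction_id = str(uuid.uuid4())           # shared by both rows
    return [
        {"transaction_id": transaction_id, "account_id": source_account,
         "direction": "debit", "amount": amount},
        {"transaction_id": transaction_id, "account_id": dest_account,
         "direction": "credit", "amount": amount},
    ]


rows = transfer_entries("acct-a", "acct-b", Decimal("125.50"))
net = sum(r["amount"] if r["direction"] == "debit" else -r["amount"] for r in rows)
print(net)  # 0.0000
```

Because the two rows share one amount and one transaction ID, the pair nets to zero by construction instead of by luck.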

Copying production into development is usually not possible without consequences. Once real PII or cardholder data is in scope, copying the production database into development can put a team in breach of PCI DSS rather than save it time. Teams under these rules need data with no real records left in it: either anonymized production copies (masking, with a production connection) or data generated from the schema with no card data in the pipeline.

  • Compliant test data: the broader GDPR / SOC 2 / compliance view of generating test data that no regulator can object to.
  • Test data for healthcare apps: the same problem in the other heavily regulated domain, covering temporal patterns, FK chains, and zero PHI.
  • HIPAA test data: the healthcare-regulation analog of this page's PCI angle.
  • Data seeding tools: the full spectrum of approaches, from Faker to masking to schema-aware generation, and when each fits.
  • Get started with Seedfast: point it at your Postgres schema, describe the fintech scenario you need, and seed dev or CI without a production connection. See pricing and the 30-day free trial, or watch the one-command demo.

Seedfast is not affiliated with, endorsed by, or sponsored by the products compared here. All product names, logos, and brands are the property of their respective owners and are used for identification purposes only. Comparisons reflect publicly available information as of the date shown.

Tonic, Delphix, K2View are trademarks of their respective owners.