Create Anonymized Test Data From a Production CSV — Free Browser Tool

How to make anonymized test data from a production CSV

Step 1
Export the production tables you need as fixtures — Dump the tables your tests touch (e.g. customers, orders) to CSV. If you'll need joins to work in the test DB, export the related tables together so you can hash their shared keys with one consistent salt.
Step 2
Drop the first CSV onto the anonymizer above — It parses in your browser — the production file isn't uploaded. Auto-detect pre-fills hash rules for recognised PII headers. The first row must be the header row, since rules target columns by name.
Step 3
Hash the keys that must still join — For every column that's a foreign key across tables (customer_id, account_ref), add a hash rule and set a salt. Deterministic hashing guarantees the same value maps to the same token, so the relationship survives. Note the salt — you'll reuse it on the related tables.
Step 4
Mask, redact, or drop the rest of the PII — Mask emails/phones to keep a realistic shape for UI tests. Redact free-text that might leak PII but whose presence you still want. Drop columns your test schema doesn't have. Leave purely analytic columns (plan, status, amounts) without a rule so they pass through verbatim.
Step 5
Run it on every related table with the SAME salt — Repeat the hash rule (and salt) on orders.csv, invoices.csv, etc. for the shared key. Because the hash is deterministic, customer_id 1001 becomes the identical token in every file — so loading them into the test DB preserves the joins.
Step 6
Download the fixtures and load them into staging — Each download saves as <name>.anon.csv. Load them into your test database or commit them as fixtures. The row counts and distributions match production; the PII is gone. Re-run any time you need a fresh fixture from current production shapes.

Strategy choice for test fixtures

Which strategy to use per column when the goal is realistic, safe test data.

Column type	Best strategy	Why
Foreign key (customer_id, account_ref)	hash (shared salt)	Deterministic — same value → same token, so joins survive across files
Email / phone for UI tests	mask	Keeps a plausible shape (`****1234`) so templates and inputs render
Name	hash or mask	Hash if you only need uniqueness; mask if a test reads a partial name
Free-text notes	redact	Blanks unpredictable PII while keeping the column your schema expects
Columns not in the test schema	drop	Removes them so the fixture matches staging without a second tool
Analytic columns (amount, plan, status)	no rule	Pass through verbatim so distributions stay realistic

Why deterministic hashing preserves referential integrity

The same input plus the same salt always yields the same token. That property is what keeps cross-file relationships intact.

File	Column	Rule	Token for customer 1001
customers.csv	customer_id	hash (salt: seed-2026)	9f3a1c70be442d18
orders.csv	customer_id	hash (salt: seed-2026)	9f3a1c70be442d18
invoices.csv	customer_id	hash (salt: seed-2026)	9f3a1c70be442d18
customers.csv	customer_id	hash (salt: DIFFERENT)	c1d8... (mismatch: joins break)
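
The property in the table can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the page does not say which hash function the anonymizer uses, so HMAC-SHA256 truncated to 16 hex characters is an assumption, and `token` is a hypothetical helper name. The tokens it produces will not match the ones shown in the table.

```python
import hashlib
import hmac


def token(value: str, salt: str) -> str:
    """Return a deterministic 16-char hex token for `value` under `salt`.

    Assumption: HMAC-SHA256 keyed by the salt, truncated to 16 hex chars.
    The real tool's algorithm is not documented on this page.
    """
    digest = hmac.new(salt.encode("utf-8"), value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


same_a = token("1001", "seed-2026")  # customer 1001 in customers.csv
same_b = token("1001", "seed-2026")  # the same id seen again in orders.csv
other = token("1001", "DIFFERENT")   # same id, different salt

print(same_a == same_b)  # True: same value and same salt
print(same_a == other)   # False: a different salt gives an unrelated token
```

Keying the hash with the salt, rather than hashing the bare value, is what lets a new salt produce a fresh dataset that does not correlate with earlier fixtures.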

Tier limits

Browser-side CSV limits. The CSV Anonymizer is Pro.

Limit	Free	Pro
Max file size	2 MB	100 MB
Max rows	500	100,000
Batch files	2	10

Cookbook

Fixture recipes from typical production table shapes. Tokens illustrate the deterministic 16-char hex format.

Hash the shared key in two tables so joins survive

Example

customers and orders both reference customer_id. Hash it with the same salt in both files and the foreign key still joins in your test database.

customers.csv  rules: customer_id → hash (salt: seed-2026); email → hash
  1001,jane@acme.com  →  9f3a1c70be442d18,<hashed email>

orders.csv     rule: customer_id → hash (salt: seed-2026)
  ORD-5,1001,42.00   →  ORD-5,9f3a1c70be442d18,42.00

→ orders.customer_id joins customers.customer_id on the token.
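
A way to check the same idea outside the browser is to hash the key in two small in-memory CSVs and join them. The sketch below is hedged in the same way as the tool description allows: the HMAC-SHA256 scheme, the `hash_column` helper, and the sample rows (including the `plan` column) are assumptions made for illustration.

```python
import csv
import hashlib
import hmac
import io

SALT = "seed-2026"


def token(value, salt=SALT):
    # Assumed scheme: HMAC-SHA256 truncated to 16 hex chars.
    return hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()[:16]


def hash_column(csv_text, column, salt=SALT):
    """Parse CSV text and return its rows with `column` replaced by tokens."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row[column] = token(row[column], salt)
    return rows


customers = hash_column("customer_id,plan\n1001,Pro\n1002,Free\n", "customer_id")
orders = hash_column("order_id,customer_id,amount\nORD-5,1001,42.00\n", "customer_id")

# Join orders to customers on the hashed key, as the test database would.
by_id = {c["customer_id"]: c for c in customers}
joined = [(o["order_id"], by_id[o["customer_id"]]["plan"]) for o in orders]
print(joined)  # [('ORD-5', 'Pro')]
```

If `orders` were hashed with a different salt, the dictionary lookup would raise `KeyError`, which is the scripted equivalent of a join returning no rows.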

Mask emails so UI tests render plausibly

Example

A profile screen test reads the email field. A hashed token looks wrong in the UI; a masked value keeps a realistic shape. Mask keeping first 1 and last 4.

Input:
name,email
Jane Doe,jane@acme.com

Rule: email → mask (keepStart 1, keepEnd 4)
       name → hash

Output:
name,email
5f9c...d2,j********.com

→ renders as a plausible (fake) email in the UI test.
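
For readers who want to reason about what a keepStart/keepEnd mask does to a given value, here is a minimal sketch. It assumes the mask preserves the input length and that a value no longer than keepStart + keepEnd becomes all stars (the rule described under "Mask of short values yields all stars"); the function name and the exact star count for short values are assumptions, not the tool's documented behaviour.

```python
def mask(value: str, keep_start: int, keep_end: int, star: str = "*") -> str:
    """Keep the first `keep_start` and last `keep_end` characters; star the rest.

    Assumptions: output length equals input length, and values that are
    too short to have a starred middle become all stars.
    """
    if keep_start + keep_end >= len(value):
        return star * len(value)
    middle = len(value) - keep_start - keep_end
    # Slice from an explicit index so keep_end == 0 keeps nothing at the end.
    return value[:keep_start] + star * middle + value[len(value) - keep_end:]


print(mask("jane@acme.com", 1, 4))  # j********.com
print(mask("5551231234", 0, 4))     # ******1234
```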

Drop columns your test schema doesn't have

Example

Production has audit columns your staging schema lacks. Drop them so the fixture loads cleanly.

Input:
customer_id,plan,_audit_blob,_internal_flag
1001,Pro,{...},true

Rules: customer_id → hash; _audit_blob → drop; _internal_flag → drop

Output:
customer_id,plan
9f3a1c70be442d18,Pro

→ fixture matches the staging table's columns.
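
The drop-and-pass-through behaviour can be sketched with Python's `csv` module. The rule format (a dict mapping header names to "hash" or "drop"), the `anonymize` function, and the HMAC-SHA256 token scheme are all assumptions for illustration; the audit value is simplified to `{}`.

```python
import csv
import hashlib
import hmac
import io


def anonymize(csv_text, rules, salt):
    """Apply per-column rules; columns without a rule pass through verbatim.

    `rules` maps a header name to "hash" or "drop" (a hypothetical format).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if rules.get(name) != "drop"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        new_row = {}
        for name in kept:
            value = row[name]
            if rules.get(name) == "hash":
                # Assumed scheme: HMAC-SHA256 truncated to 16 hex chars.
                value = hmac.new(
                    salt.encode(), value.encode(), hashlib.sha256
                ).hexdigest()[:16]
            new_row[name] = value
        writer.writerow(new_row)
    return out.getvalue()


source = "customer_id,plan,_audit_blob,_internal_flag\n1001,Pro,{},true\n"
rules = {"customer_id": "hash", "_audit_blob": "drop", "_internal_flag": "drop"}
result = anonymize(source, rules, "seed-2026")
print(result.splitlines()[0])  # customer_id,plan
```

Dropped columns are removed from the header and from every row in one pass, and `plan` is untouched because it has no rule.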

Keep realistic distributions, change only identifiers

Example

The point of using production data is its lopsided realism. Leave amount/status columns ruleless so the distribution survives; only identifiers change.

Input:
customer_id,orders_count,status
1001,412,active
1002,1,active
1003,0,churned

Rules: customer_id → hash (others: no rule)

Output:
customer_id,orders_count,status
9f3a1c70be442d18,412,active
2b6e...,1,active
4c11...,0,churned

→ the heavy-tail order distribution is preserved for your tests.

Sequential ids for a clean demo dataset

Example

For a demo where ids should look tidy rather than realistic, sequential renumbers the key as id-1, id-2 by row. Remember it doesn't dedupe and isn't cross-file stable — only use it when joins don't matter.

Input:
account_ref,plan
ACME-7781,Pro
GLBX-2210,Free

Rule: account_ref → sequential

Output:
account_ref,plan
id-1,Pro
id-2,Free

→ tidy demo ids; do NOT use this for keys that must join.

Errors and edge cases

Failure modes and behaviours specific to building fixtures with the anonymizer. Find the heading that matches your symptom and apply the fix it describes.

Different salts across files break your joins

Integrity break

Determinism only holds for the same value AND the same salt. If you hash customer_id in customers.csv with salt A and in orders.csv with salt B, the tokens differ and the foreign key no longer joins. Use one salt for a related set of tables, and record it so future fixtures from the same data line up too.

Sequential breaks referential integrity

Integrity break

sequential numbers rows by position, so the same id in two files becomes unrelated tokens, and even within one file two rows with the same value get different ids. Never use sequential for a key that must join — use hash. Sequential is only for tidy, join-free demo labels.
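
A short sketch makes the failure concrete. It assumes only what is described here, that sequential assigns id-1, id-2, ... by row position; the `sequential` function and the sample ids are hypothetical.

```python
def sequential(values, prefix="id-"):
    """Number values by row position: id-1, id-2, ... with no deduplication."""
    return [f"{prefix}{i}" for i, _ in enumerate(values, start=1)]


customer_ids = ["1001", "1002"]        # key column in customers.csv
order_ids = ["1002", "1001", "1001"]   # same key column in orders.csv

print(sequential(customer_ids))  # ['id-1', 'id-2']
print(sequential(order_ids))     # ['id-1', 'id-2', 'id-3']
```

Customer 1001 is `id-1` in the first file but `id-2` and `id-3` in the second, so neither the cross-file join nor the within-file grouping survives.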

Hash changes the value, so type/format-sensitive tests may need updating

Behaviour to know

A hashed id is a 16-char hex string, not the original integer/format. If your test DB column is an integer type or your code validates an id format, a hashed token won't fit. Either change the test schema to accept the token, or use sequential / mask to keep a more compatible shape — at the cost of join stability.

Adding a rule disables auto-detect

Behaviour to know

Auto-detect runs only with zero explicit rules. Once you add your first hash rule for a key, auto-detect stops covering the other PII columns — so emails/names you assumed were auto-hashed leak into the fixture. After adding any rule, add explicit rules for every PII column and verify via the applied-rules chips.

Mask of short values yields all stars

Behaviour to know

If keepStart + keepEnd is at least the value length, mask returns all stars. Short ids or initials masked with generous keep counts become uniform ***, which can collapse distinct values in a way that surprises tests checking for variety. Set keep counts below the shortest value, or use hash for guaranteed-distinct tokens.
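
Before settling on keep counts, it can help to list which values in a column would collapse. The check below applies only the rule stated above (keepStart + keepEnd at least the value length); the helper name and sample values are made up for illustration.

```python
def collapsing_values(values, keep_start, keep_end):
    """Return the distinct values that the mask would turn entirely into stars."""
    return sorted({v for v in values if keep_start + keep_end >= len(v)})


names = ["JD", "AB", "Smith", "Nguyen"]
print(collapsing_values(names, 1, 4))  # ['AB', 'JD', 'Smith']
print(collapsing_values(names, 1, 0))  # []
```

An empty result means every value keeps at least one starred character alongside its visible ones, so distinct inputs are less likely to collapse into the same output.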

Low-cardinality columns leak their distribution when hashed

Behaviour to know

Hashing status or region produces identical tokens for identical values, so the value distribution is still visible (and that's usually fine for test data). But if a low-cardinality column is itself sensitive, redact or drop it instead — hashing won't hide how many rows share each value.
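
The leak is easy to see by counting values before and after hashing. As elsewhere, the HMAC-SHA256 token scheme is an assumption standing in for the tool's undocumented hash, and the status values are sample data.

```python
from collections import Counter
import hashlib
import hmac


def token(value, salt):
    # Assumed scheme: HMAC-SHA256 truncated to 16 hex chars.
    return hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()[:16]


statuses = ["active", "active", "churned", "active"]
hashed = [token(s, "seed-2026") for s in statuses]

print(sorted(Counter(statuses).values()))  # [1, 3]
print(sorted(Counter(hashed).values()))    # [1, 3]
```

The labels change but the group sizes do not, so anyone who knows the real proportions can map tokens back to values.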

Free-tier caps make full-table fixtures impossible

Blocked

This is a Pro tool; free CSV limits are 2 MB / 500 rows. A production table usually exceeds that. On Pro you get 100 MB / 100,000 rows. For very large tables, take a representative sample with csv-row-limiter (sampling a parent table) — but be careful that sampled child rows still reference present parents, or joins will dangle.

Sampling parent/child tables independently dangles foreign keys

Integrity break

If you row-limit customers and orders separately, some orders will reference customers that didn't make the sample. The anonymizer preserves whatever keys are present but can't fix missing parents. Sample the parent first, then filter children to the sampled keys (e.g. with csv-column-filter) before anonymizing.
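
The sample-then-filter order can be sketched as follows. This is a plain Python stand-in for what csv-row-limiter and csv-column-filter would do; the function name, the fixed random seed, and the generated rows are assumptions for illustration.

```python
import random


def sample_with_children(parents, children, key, n, seed=0):
    """Sample `n` parent rows, then keep only children that reference them."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sampled = rng.sample(parents, min(n, len(parents)))
    keep = {p[key] for p in sampled}
    return sampled, [c for c in children if c[key] in keep]


customers = [{"customer_id": str(i)} for i in range(1001, 1011)]
orders = [
    {"order_id": f"ORD-{i}", "customer_id": str(1001 + i % 10)} for i in range(30)
]

sampled_customers, sampled_orders = sample_with_children(
    customers, orders, "customer_id", n=3
)
present = {c["customer_id"] for c in sampled_customers}
print(all(o["customer_id"] in present for o in sampled_orders))  # True
```

Run the anonymizer on the sampled files afterwards, with one salt for both, so the surviving keys hash to matching tokens.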

Empty production export

Handled

An empty CSV returns zero rows in and out without error. If your fixture comes out empty, the source dump probably failed or contained only a header — re-export before troubleshooting the anonymizer.

Frequently asked questions

Will my foreign keys still join after anonymizing?

Yes, if you hash the key column with the same salt in every related file. Hashing is deterministic, so customer_id 1001 becomes the identical token in customers.csv and orders.csv — the join works in your test database. The catch is consistency: a different salt (or using sequential instead of hash) breaks the relationship.

Why not just write a faker script?

You can, but synthetic data rarely reproduces production's edge cases — the empty cells, extreme value lengths, and lopsided counts that actually break code. Anonymizing real data keeps those distributions while removing PII. The anonymizer is faster to set up than a faker that has to mimic real shapes, and it preserves cross-table joins via deterministic hashing.

Can I redact free-text columns that might contain stray PII?

Yes — use the redact strategy on a notes or comments column and every value becomes [REDACTED], so unpredictable PII buried in free text never reaches the fixture while the column itself stays in the schema. If your test database doesn't have that column at all, use drop instead to remove it entirely in the same pass.

How do I keep email/phone fields looking realistic in the UI?

Use mask rather than hash for those columns. Mask keeps a few real characters and stars the rest (j****@****.com, ****1234), so a profile screen or email-template test renders something plausible. Hash would produce a hex token that looks wrong in a UI. Masked values are still partial PII, so continue to treat the fixture as sensitive.

Does it change the analytic columns I want to keep realistic?

Only columns with a rule are touched; everything else passes through verbatim. So leave amount, status, plan, and timestamps ruleless and their real distributions survive into your fixture — which is exactly what makes the test data valuable.

Can I make the same token every time I regenerate the fixture?

Yes — use the same salt on every run. Deterministic hashing means the token for a given value is reproducible as long as the salt is identical. That's useful when you want a stable fixture across CI runs. Change the salt to deliberately produce a fresh, non-correlating dataset.

What if the hashed id doesn't fit my test DB column type?

A hash is a 16-char hex string, not an integer. If your test schema types the id column as INT, either widen it to a text/char column for the fixture, or use sequential/mask to keep a more compatible shape — accepting that sequential won't preserve joins. Many teams just make the test DB's id columns text for fixtures.

How do I handle parent/child tables so joins don't dangle?

Sample and filter before anonymizing. Take your sample of the parent table first, then filter the child rows to only those referencing sampled parents (use csv-column-filter), then hash the shared key with one salt across both. If you sample the tables independently, child rows will reference parents that aren't in the fixture.

Is the production data ever uploaded?

No. PapaParse parses and transforms the file in your browser; only the anonymized .anon.csv is produced. The production export never reaches a server, so it never lands in a CI artifact or shared bucket. A single no-content usage counter is stored server-side for signed-in stats and can be opted out of.

What's the row limit for a fixture?

Free tier caps at 2 MB / 500 rows; this tool is Pro, which raises it to 100 MB / 100,000 rows. For larger tables, sample with csv-row-limiter or split with csv-row-splitter, keeping referential integrity in mind, then anonymize and load.

Can I drop columns staging doesn't have in the same step?

Yes — add a drop rule per unwanted column and it's removed from the header and every row in the same pass as your hashing/masking. That saves you from running a separate column remover. The result panel reports how many columns were dropped.

Can I script this into my seed pipeline?

Yes. GET /api/v1/tools/csv-anonymizer returns the schema; pair it with the @jadapps/runner and POST to 127.0.0.1:9789/v1/tools/csv-anonymizer/run. It runs locally, so production data never reaches JAD's servers. A typical pipeline: nightly prod dump → runner anonymizes each table with one shared salt → load into the staging database before the test suite runs.

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.
