How to make anonymized test data from a production CSV
- Step 1: Export the production tables you need as fixtures. Dump the tables your tests touch (e.g. customers, orders) to CSV. If you'll need joins to work in the test DB, export the related tables together so you can hash their shared keys with one consistent salt.
- Step 2: Drop the first CSV onto the anonymizer above. It parses in your browser; the production file isn't uploaded. Auto-detect pre-fills hash rules for recognised PII headers. The first row must be the header row, since rules target columns by name.
- Step 3: Hash the keys that must still join. For every column that's a foreign key across tables (customer_id, account_ref), add a hash rule and set a salt. Deterministic hashing guarantees the same value maps to the same token, so the relationship survives. Note the salt; you'll reuse it on the related tables.
- Step 4: Mask, redact, or drop the rest of the PII. Mask emails/phones to keep a realistic shape for UI tests. Redact free-text that might leak PII but whose presence you still want. Drop columns your test schema doesn't have. Leave purely analytic columns (plan, status, amounts) without a rule so they pass through verbatim.
- Step 5: Run it on every related table with the SAME salt. Repeat the hash rule (and salt) on orders.csv, invoices.csv, etc. for the shared key. Because the hash is deterministic, customer_id 1001 becomes the identical token in every file, so loading them into the test DB preserves the joins.
- Step 6: Download the fixtures and load them into staging. Each download saves as <name>.anon.csv. Load them into your test database or commit them as fixtures. The row counts and distributions match production; the PII is gone. Re-run any time you need a fresh fixture from current production shapes.
Strategy choice for test fixtures
Which strategy to use per column when the goal is realistic, safe test data.
| Column type | Best strategy | Why |
|---|---|---|
| Foreign key (customer_id, account_ref) | hash (shared salt) | Deterministic — same value → same token, so joins survive across files |
| Email / phone for UI tests | mask | Keeps a plausible shape (****1234) so templates and inputs render |
| Name | hash or mask | Hash if you only need uniqueness; mask if a test reads a partial name |
| Free-text notes | redact | Blanks unpredictable PII while keeping the column your schema expects |
| Columns not in the test schema | drop | Removes them so the fixture matches staging without a second tool |
| Analytic columns (amount, plan, status) | no rule | Pass through verbatim so distributions stay realistic |
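To make the table concrete, here is a minimal Python sketch of applying one strategy per column. It is not the anonymizer's own code: the function names, the SHA-256 hash truncated to 16 hex characters, the length-preserving mask, and the default keep counts are all assumptions chosen for illustration.

```python
import csv
import hashlib
import io

def hash_token(value, salt):
    # Assumption: SHA-256 over "salt:value", truncated to 16 hex chars.
    return hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()[:16]

def mask(value, keep_start=0, keep_end=4):
    # Assumption: stars replace every character that is not kept.
    n = len(value)
    if keep_start + keep_end >= n:
        return "*" * n
    return value[:keep_start] + "*" * (n - keep_start - keep_end) + value[n - keep_end:]

def anonymize(csv_text, rules, salt):
    """Apply hash / mask / redact / drop per column; no rule means pass through."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    kept = [h for h in headers if rules.get(h) != "drop"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        new_row = {}
        for col in kept:
            rule = rules.get(col)
            if rule == "hash":
                new_row[col] = hash_token(row[col], salt)
            elif rule == "mask":
                new_row[col] = mask(row[col])
            elif rule == "redact":
                new_row[col] = "[REDACTED]"
            else:
                new_row[col] = row[col]  # analytic column: verbatim
        writer.writerow(new_row)
    return out.getvalue()

sample = (
    "customer_id,email,notes,_audit_blob,plan\n"
    "1001,jane@acme.com,call after 5pm,{...},Pro\n"
)
rules = {"customer_id": "hash", "email": "mask", "notes": "redact", "_audit_blob": "drop"}
result = anonymize(sample, rules, salt="seed-2026")
print(result)
```

The rules are keyed by header name, which mirrors the requirement that the first row be the header row.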
Why deterministic hashing preserves referential integrity
The same input plus the same salt always yields the same token. That property is what keeps cross-file relationships intact.
| File | Column | Rule | Token for customer 1001 |
|---|---|---|---|
| customers.csv | customer_id | hash (salt: seed-2026) | 9f3a1c70be442d18 |
| orders.csv | customer_id | hash (salt: seed-2026) | 9f3a1c70be442d18 |
| invoices.csv | customer_id | hash (salt: seed-2026) | 9f3a1c70be442d18 |
| customers.csv | customer_id | hash (salt: DIFFERENT) | c1d8...mismatch — joins break |
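The property in the table can be sketched in a few lines of Python. The tool's actual hash function is not documented here, so this example assumes SHA-256 over a salt-prefixed value, truncated to 16 hex characters; the tokens it produces will not match the illustrative ones in the table.

```python
import hashlib

def hash_token(value: str, salt: str, length: int = 16) -> str:
    """Deterministic token: the same value and salt always give the same hex string."""
    # Assumption: SHA-256 over "salt:value"; the real tool may differ.
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:length]

in_customers = hash_token("1001", "seed-2026")
in_orders = hash_token("1001", "seed-2026")
other_salt = hash_token("1001", "DIFFERENT")

print(in_customers == in_orders)   # same salt: tokens match, joins survive
print(in_customers == other_salt)  # different salt: tokens differ, joins break
```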
Tier limits
Browser-side CSV limits. The CSV Anonymizer is Pro.
| Limit | Free | Pro |
|---|---|---|
| Max file size | 2 MB | 100 MB |
| Max rows | 500 | 100,000 |
| Batch files | 2 | 10 |
Cookbook
Fixture recipes from typical production table shapes. Tokens illustrate the deterministic 16-char hex format.
Hash the shared key in two tables so joins survive
Example: customers and orders both reference customer_id. Hash it with the same salt in both files and the foreign key still joins in your test database.
customers.csv rule: customer_id → hash (salt: seed-2026)
1001,jane@acme.com → 9f3a1c70be442d18,<hashed email>
orders.csv rule: customer_id → hash (salt: seed-2026)
ORD-5,1001,42.00 → ORD-5,9f3a1c70be442d18,42.00
→ orders.customer_id joins customers.customer_id on the token.
Mask emails so UI tests render plausibly
Example: A profile screen test reads the email field. A hashed token looks wrong in the UI; a masked value keeps a realistic shape. Mask keeping first 1 and last 4.
Input:
name,email
Jane Doe,jane@acme.com
Rule: email → mask (keepStart 1, keepEnd 4)
name → hash
Output:
name,email
5f9c...d2,j*******.com
→ renders as a plausible (fake) email in the UI test.
Drop columns your test schema doesn't have
Example: Production has audit columns your staging schema lacks. Drop them so the fixture loads cleanly.
Input:
customer_id,plan,_audit_blob,_internal_flag
1001,Pro,{...},true
Rules: customer_id → hash; _audit_blob → drop; _internal_flag → drop
Output:
customer_id,plan
9f3a1c70be442d18,Pro
→ fixture matches the staging table's columns.
Keep realistic distributions, change only identifiers
Example: The point of using production data is its lopsided realism. Leave amount/status columns ruleless so the distribution survives; only identifiers change.
Input:
customer_id,orders_count,status
1001,412,active
1002,1,active
1003,0,churned
Rules: customer_id → hash (others: no rule)
Output:
customer_id,orders_count,status
9f3a1c70be442d18,412,active
2b6e...,1,active
4c11...,0,churned
→ the heavy-tail order distribution is preserved for your tests.
Sequential ids for a clean demo dataset
Example: For a demo where ids should look tidy rather than realistic, sequential renumbers the key as id-1, id-2 by row. Remember it doesn't dedupe and isn't cross-file stable, so only use it when joins don't matter.
Input:
account_ref,plan
ACME-7781,Pro
GLBX-2210,Free
Rule: account_ref → sequential
Output:
account_ref,plan
id-1,Pro
id-2,Free
→ tidy demo ids; do NOT use this for keys that must join.
Errors and edge cases
Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.
Different salts across files break your joins
Integrity break: Determinism only holds for the same value AND the same salt. If you hash customer_id in customers.csv with salt A and in orders.csv with salt B, the tokens differ and the foreign key no longer joins. Use one salt for a related set of tables, and record it so future fixtures from the same data line up too.
Sequential breaks referential integrity
Integrity break: sequential numbers rows by position, so the same id in two files becomes unrelated tokens, and even within one file two rows with the same value get different ids. Never use sequential for a key that must join; use hash. Sequential is only for tidy, join-free demo labels.
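A small Python sketch of why position-based numbering cannot preserve a join while hashing can. Both helpers are illustrative stand-ins, not the tool's code; the hash assumes SHA-256 truncated to 16 hex characters.

```python
import hashlib

def sequential(values):
    # Numbers rows by position (id-1, id-2, ...) and ignores the value itself.
    return [f"id-{i}" for i, _ in enumerate(values, start=1)]

def hashed(values, salt):
    # Assumption: salted SHA-256 truncated to 16 hex chars.
    return [hashlib.sha256(f"{salt}:{v}".encode("utf-8")).hexdigest()[:16] for v in values]

customers = ["1001", "1002"]        # customer_id column of customers.csv
orders = ["1002", "1001", "1001"]   # customer_id column of orders.csv

seq_customers = sequential(customers)
seq_orders = sequential(orders)
# Customer 1001 is id-1 in customers, but its two order rows become id-2 and id-3.

hash_customers = hashed(customers, "seed-2026")
hash_orders = hashed(orders, "seed-2026")
# With hashing, every occurrence of 1001 carries the same token in both files.
print(seq_customers, seq_orders)
print(hash_orders[1] == hash_customers[0])
```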
Hash changes the value, so type/format-sensitive tests may need updating
Behaviour to know: A hashed id is a 16-char hex string, not the original integer/format. If your test DB column is an integer type or your code validates an id format, a hashed token won't fit. Either change the test schema to accept the token, or use sequential / mask to keep a more compatible shape, at the cost of join stability.
Adding a rule disables auto-detect
Behaviour to know: Auto-detect runs only with zero explicit rules. Once you add your first hash rule for a key, auto-detect stops covering the other PII columns, so emails/names you assumed were auto-hashed leak into the fixture. After adding any rule, add explicit rules for every PII column and verify via the applied-rules chips.
Mask of short values yields all stars
Behaviour to know: If keepStart + keepEnd is at least the value length, mask returns all stars. Short ids or initials masked with generous keep counts become uniform ***, which can collapse distinct values in a way that surprises tests checking for variety. Set keep counts below the shortest value, or use hash for guaranteed-distinct tokens.
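The all-stars rule can be sketched as follows. This is an illustrative Python version, not the tool's implementation; it assumes the mask preserves the value's length, which the text above does not state.

```python
def mask(value: str, keep_start: int, keep_end: int) -> str:
    """Keep the first/last characters and star the rest."""
    n = len(value)
    # If the keep counts cover the whole value, nothing is left to show.
    if keep_start + keep_end >= n:
        return "*" * n  # assumption: one star per original character
    return value[:keep_start] + "*" * (n - keep_start - keep_end) + value[n - keep_end:]

phone = mask("555-0100", 0, 4)   # "****0100"
jd = mask("JD", 1, 1)            # keep counts cover the whole value
kt = mask("KT", 1, 1)            # a different value, same masked result
print(phone, jd, kt)
```

Here the two distinct initials collapse to the same string, which is the surprise a variety-checking test would hit.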
Low-cardinality columns leak their distribution when hashed
Behaviour to know: Hashing status or region produces identical tokens for identical values, so the value distribution is still visible (and that's usually fine for test data). But if a low-cardinality column is itself sensitive, redact or drop it instead; hashing won't hide how many rows share each value.
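A short Python illustration of the point: counting hashed values gives the same frequency profile as counting the originals. The hash here is an assumed stand-in (salted SHA-256, 16 hex characters), not the tool's own function.

```python
import hashlib
from collections import Counter

def hash_token(value, salt):
    # Assumption: salted SHA-256 truncated to 16 hex chars.
    return hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()[:16]

statuses = ["active", "active", "churned", "active"]
tokens = [hash_token(s, "seed-2026") for s in statuses]

original_counts = sorted(Counter(statuses).values())
token_counts = sorted(Counter(tokens).values())
# The labels are hidden, but the group sizes are unchanged.
print(original_counts, token_counts)
```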
Free-tier caps make full-table fixtures impossible
Blocked: This is a Pro tool; free CSV limits are 2 MB / 500 rows. A production table usually exceeds that. On Pro you get 100 MB / 100,000 rows. For very large tables, take a representative sample of the parent table with csv-row-limiter, but make sure the child rows you keep still reference parents that are present, or joins will dangle.
Sampling parent/child tables independently dangles foreign keys
Integrity break: If you row-limit customers and orders separately, some orders will reference customers that didn't make the sample. The anonymizer preserves whatever keys are present but can't fix missing parents. Sample the parent first, then filter children to the sampled keys (e.g. with csv-column-filter) before anonymizing.
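The sample-then-filter order can be sketched in Python as below. The helper names and the in-memory row dictionaries are hypothetical; in practice the same two steps would be done with the row-limiting and filtering tools mentioned above.

```python
import random

def sample_parent(rows, n, seed=0):
    # Seeded so the same sample can be reproduced on a later run.
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))

def filter_children(child_rows, key, parent_rows):
    # Keep only child rows whose key exists in the sampled parent set.
    present = {p[key] for p in parent_rows}
    return [c for c in child_rows if c[key] in present]

customers = [{"customer_id": "1001"}, {"customer_id": "1002"}, {"customer_id": "1003"}]
orders = [
    {"order_id": "ORD-1", "customer_id": "1001"},
    {"order_id": "ORD-2", "customer_id": "1002"},
    {"order_id": "ORD-3", "customer_id": "1003"},
    {"order_id": "ORD-4", "customer_id": "1001"},
]

sampled_customers = sample_parent(customers, n=2)
kept_orders = filter_children(orders, "customer_id", sampled_customers)
# Every remaining order now references a customer that is in the sample.
print(len(sampled_customers), len(kept_orders))
```

Anonymizing happens after this step, with one shared salt for both files.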
Empty production export
Handled: An empty CSV returns zero rows in and out without error. If your fixture comes out empty, the source dump probably failed or contained only a header; re-export before troubleshooting the anonymizer.
Frequently asked questions
Will my foreign keys still join after anonymizing?
Yes, if you hash the key column with the same salt in every related file. Hashing is deterministic, so customer_id 1001 becomes the identical token in customers.csv and orders.csv — the join works in your test database. The catch is consistency: a different salt (or using sequential instead of hash) breaks the relationship.
Why not just write a faker script?
You can, but synthetic data rarely reproduces production's edge cases — the empty cells, extreme value lengths, and lopsided counts that actually break code. Anonymizing real data keeps those distributions while removing PII. The anonymizer is faster to set up than a faker that has to mimic real shapes, and it preserves cross-table joins via deterministic hashing.
Can I redact free-text columns that might contain stray PII?
Yes — use the redact strategy on a notes or comments column and every value becomes [REDACTED], so unpredictable PII buried in free text never reaches the fixture while the column itself stays in the schema. If your test database doesn't have that column at all, use drop instead to remove it entirely in the same pass.
How do I keep email/phone fields looking realistic in the UI?
Use mask rather than hash for those columns. Mask keeps a few real characters and stars the rest (j****@****.com, ****1234), so a profile screen or email-template test renders something plausible. Hash would produce a hex token that looks wrong in a UI. Just remember masked values are partial PII, so still treat the fixture as sensitive-ish.
Does it change the analytic columns I want to keep realistic?
Only columns with a rule are touched; everything else passes through verbatim. So leave amount, status, plan, and timestamps ruleless and their real distributions survive into your fixture — which is exactly what makes the test data valuable.
Can I make the same token every time I regenerate the fixture?
Yes — use the same salt on every run. Deterministic hashing means the token for a given value is reproducible as long as the salt is identical. That's useful when you want a stable fixture across CI runs. Change the salt to deliberately produce a fresh, non-correlating dataset.
What if the hashed id doesn't fit my test DB column type?
A hash is a 16-char hex string, not an integer. If your test schema types the id column as INT, either widen it to a text/char column for the fixture, or use sequential/mask to keep a more compatible shape — accepting that sequential won't preserve joins. Many teams just make the test DB's id columns text for fixtures.
How do I handle parent/child tables so joins don't dangle?
Sample and filter before anonymizing. Take your sample of the parent table first, then filter the child rows to only those referencing sampled parents (use csv-column-filter), then hash the shared key with one salt across both. If you sample the tables independently, child rows will reference parents that aren't in the fixture.
Is the production data ever uploaded?
No. PapaParse parses and transforms the file in your browser; only the anonymized .anon.csv is produced. The production export never reaches a server, so it never lands in a CI artifact or shared bucket. A single no-content usage counter is stored server-side for signed-in stats and can be opted out of.
What's the row limit for a fixture?
Free tier caps at 2 MB / 500 rows; this tool is Pro, which raises it to 100 MB / 100,000 rows. For larger tables, sample with csv-row-limiter or split with csv-row-splitter, keeping referential integrity in mind, then anonymize and load.
Can I drop columns staging doesn't have in the same step?
Yes — add a drop rule per unwanted column and it's removed from the header and every row in the same pass as your hashing/masking. That saves you from running a separate column remover. The result panel reports how many columns were dropped.
Can I script this into my seed pipeline?
Yes. GET /api/v1/tools/csv-anonymizer returns the schema; pair it with the @jadapps/runner and POST to 127.0.0.1:9789/v1/tools/csv-anonymizer/run. It runs locally, so production data never reaches JAD's servers. A typical pipeline: nightly prod dump → runner anonymizes each table with one shared salt → load into the staging database before the test suite runs.
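As a rough sketch of the POST step, the Python below builds a request to the local runner address given above. The `http://` scheme and the payload field names (`input`, `rules`, `salt`) are assumptions for illustration only; the real request shape is whatever the schema endpoint returns, so check that before relying on this.

```python
import json
import urllib.request

# Address from the answer above; the http:// scheme is an assumption.
RUNNER_URL = "http://127.0.0.1:9789/v1/tools/csv-anonymizer/run"

def build_run_request(csv_text, rules, salt):
    # ASSUMPTION: "input", "rules" and "salt" are placeholder field names,
    # not the documented schema. Replace them with the fields the schema lists.
    payload = {"input": csv_text, "rules": rules, "salt": salt}
    return urllib.request.Request(
        RUNNER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_run_request(
    "customer_id\n1001\n",
    {"customer_id": "hash"},
    "seed-2026",  # reuse this one salt for every related table
)
# Sending is left out so the sketch runs without a runner:
# with urllib.request.urlopen(req) as resp: body = resp.read()
print(req.get_method(), req.full_url)
```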
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.