How to de-identify datasets before feeding them to a model
- Step 1: Export your raw dataset as CSV or JSON. Pull the labelled dataset you'd otherwise load straight into training: a tabular CSV of features plus a target column, or JSON records (an object or an array). The tool reads CSV (comma-delimited, first row = header) and JSON (object, array, nested). It does not read .xlsx/.ods; convert those to CSV first. JSONL is not a JSON document; convert it to a JSON array first, or scramble before you serialise to JSONL.
- Step 2: Drop the file onto the tool. PapaParse (CSV) or JSON.parse (JSON) runs in your browser tab; nothing is sent to a server. The JSON path is taken when the filename ends in `.json`; otherwise the file is parsed as CSV. Rename a JSON file that lost its extension, or set format: json, so it isn't misread as a single-column CSV.
- Step 3: Confirm identifier columns use recognised header names. Replacement is keyed on the whole column name. Headers literally named email, name, phone, address, city, zip, ssn, etc. are de-identified. Compound names like customer_email, user_name, or contact_phone are NOT matched; rename them to the bare token first, or those identifiers stay in the training set.
- Step 4: Set a seed for reproducible datasets. Leave seed blank for fresh randomness. Enter a number and the tool calls faker.seed(n) first, so the same raw file + same seed regenerates a byte-identical de-identified dataset, which is what reproducible experiments, fixed splits, and dataset versioning need.
- Step 5: Scramble and verify the de-identified count. Every column / key matching the PII regex is overwritten with a faker value; the tool reports the number of replaced fields (itemsRedacted). Compare it to the number of identifier columns you expected; a low count usually means a header didn't match and an identifier is still in the data.
- Step 6: Feed the de-identified file into your pipeline. The result downloads as <name>-scrambled.<ext>. Use it as the input to your training, fine-tune, or RAG-indexing step. Keep the raw file out of the pipeline and the repo. The scramble is one-way with no reverse mapping, so the de-identified artefact is the one that should travel.
What gets de-identified vs. what the model still learns from
How identifier columns and feature/label columns are treated. Detection is name-based against a fixed regex (lib/security/security-processor.ts).
| Column role | Example headers | After scrambling | Effect on training |
|---|---|---|---|
| Direct identifier | name, email, phone, address | Faked | Removed as memorisation/leakage risk; not used as a feature anyway |
| Location identifier | city, zip, postal, street | Faked | Coarse geo becomes fake; if you needed region as a feature, derive it before scrambling |
| Government ID | ssn, tax_id | 9 random digits | faker.string.numeric(9) — high-cardinality noise, safe to keep or drop |
| Numeric feature | amount, score, tenure_days | Preserved | Distributions and correlations intact -> model signal preserved |
| Label / target | churned, class, label | Preserved | Class balance unchanged -> training objective unaffected |
| Category / timestamp | plan, created_at, channel | Preserved | Categorical and temporal features survive exactly |
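The name-based matching summarised in the table can be approximated before a run with a small pre-flight check. This is only a sketch: `ASSUMED_PII_TOKENS` and `checkHeaders` are hypothetical names, the token list is assembled from headers this guide mentions rather than from the tool's actual regex, and case handling is an assumption (anything that differs only by case is flagged for review rather than treated as matched).

```typescript
// Hypothetical token list built from headers named in this guide.
// The tool's real PII_FIELDS_REGEX may cover more or fewer names.
const ASSUMED_PII_TOKENS: string[] = [
  "email", "name", "first_name", "last_name", "phone", "address",
  "city", "zip", "postal", "street", "ssn", "tax_id",
];

interface HeaderReport {
  matched: string[];  // header is exactly one of the assumed tokens
  nearMiss: string[]; // header contains a token as a word but is not equal to it
  other: string[];    // no token found: expected to pass through unchanged
}

function checkHeaders(headers: string[]): HeaderReport {
  const report: HeaderReport = { matched: [], nearMiss: [], other: [] };
  for (const raw of headers) {
    const exact = raw.trim();
    // Normalise separators and pad with "_" so tokens are compared as whole words.
    const padded = "_" + exact.toLowerCase().replace(/[\s.\-]+/g, "_") + "_";
    if (ASSUMED_PII_TOKENS.indexOf(exact) !== -1) {
      report.matched.push(raw);
    } else if (ASSUMED_PII_TOKENS.some((t) => padded.indexOf("_" + t + "_") !== -1)) {
      report.nearMiss.push(raw);
    } else {
      report.other.push(raw);
    }
  }
  return report;
}
```

Anything listed under `nearMiss` (for example `customer_email`) is a candidate for renaming before scrambling; the tool's own replaced-field count remains the authoritative check.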
The complete control surface
Every option this tool exposes, from lib/security/security-tool-schemas.ts. There is no field-list, masking-style, or format-of-fake control.
| Control | Type / values | Default | What it actually does |
|---|---|---|---|
| seed | number (optional) | (blank) | Blank = fresh randomness. A number calls faker.seed(n) so the same dataset + seed gives identical fakes, for reproducible experiments. Not encryption, not reversible |
| format | enum: auto / csv / json | auto | Server-safe auto treats a leading [/{ as JSON, else CSV. In-browser the JSON path is the .json extension. Force with csv / json |
| Field / column list | (not a control) | — | Fixed in code (PII_FIELDS_REGEX). Cannot be edited in the UI |
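The two detection paths described for the format control can be pictured roughly as follows. This is a sketch of the behaviour as the table states it, not the tool's source: the function names are hypothetical, and skipping leading whitespace and ignoring extension case are both assumptions.

```typescript
type Format = "csv" | "json";
type FormatOption = "auto" | Format;

// Browser path as described above: only a .json extension selects JSON.
// Case-insensitive matching is an assumption.
function detectByExtension(filename: string): Format {
  return /\.json$/i.test(filename) ? "json" : "csv";
}

// Server-safe auto path as described above: a leading [ or { means JSON.
// Skipping leading whitespace first is an assumption.
function detectByContent(text: string): Format {
  return /^\s*[\[{]/.test(text) ? "json" : "csv";
}

function resolveFormat(
  option: FormatOption,
  filename: string,
  text: string,
  inBrowser: boolean,
): Format {
  if (option !== "auto") return option; // an explicit csv / json always wins
  return inBrowser ? detectByExtension(filename) : detectByContent(text);
}
```

Under these assumptions a JSON file named `records` (no extension) resolves to CSV in the browser, which is why forcing `format: json` or restoring the extension matters.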
Tier, formats, and size limits
Metadata from lib/security/security-tools-registry.ts and limits from lib/tier-limits.ts. Server-safe tool; the live version runs in the browser so raw PII is never transmitted.
| Property | Value | Note |
|---|---|---|
| Minimum tier | Pro | minTier: "pro" |
| Input formats | CSV, JSON | JSON via JSON.parse; not JSONL, not .xlsx/.ods |
| Output | Text (CSV or pretty-printed JSON) | <name>-scrambled.<ext> |
| Pro limits | 100 MB / 5 files | Security family |
| Pro-media limits | 500 MB / 50 files | Security family |
| Developer limits | 2 GB / unlimited | Security family — useful for large training sets |
Cookbook
Before/after files for ML / LLM data prep. Notice identifiers get faked while features and labels survive exactly, so the dataset still trains. Faker values are illustrative — set a seed for reproducible output.
Tabular churn dataset for a classifier
A supervised dataset with a churned label. name, email and city are de-identified; tenure_days, mrr, plan and the churned label are preserved, so the feature distributions and class balance the model learns from are untouched.
Input (churn.csv):
name,email,city,tenure_days,mrr,plan,churned
Sarah Chen,sarah.chen@acme.io,Oakland,412,49,pro,0
Tomás Reyes,treyes@globex.com,Chicago,71,149,team,1
Output (churn-scrambled.csv):
name,email,city,tenure_days,mrr,plan,churned
Dr. Elena Rosales,Reanna.Lockman@yahoo.com,East Garfield,412,49,pro,0
Marcus Hettinger,Jaylin.Bode@gmail.com,Lake Verda,71,149,team,1
Identifiers faked; tenure_days / mrr / plan / churned preserved.
JSON records for a RAG index
Records destined for an embedding index. The name and email keys are faked (a phone key would be too); the id, topic, and numeric score survive. If your records also carry a free-text body key, see the edge case about PII inside free text.
Input (records.json):
[
{ "id": "r1", "name": "Priya Nair", "email": "priya@shop.co", "topic": "billing", "score": 0.82 },
{ "id": "r2", "name": "Leo Park", "email": "leo@shop.co", "topic": "refund", "score": 0.40 }
]
Output (records-scrambled.json):
[
{ "id": "r1", "name": "Dr. Elena Rosales", "email": "Reanna.Lockman@yahoo.com", "topic": "billing", "score": 0.82 },
{ "id": "r2", "name": "Marcus Hettinger", "email": "Jaylin.Bode@gmail.com", "topic": "refund", "score": 0.40 }
]
Reproducible split with a seed
For versioned experiments you want the same de-identified dataset every run so your train/test split is stable. Set a seed; identical raw file + identical seed = identical output.
Input (users.csv):
user_id,first_name,last_name,email,label
7,Aisha,Khan,aisha.khan@corp.net,A
8,Leo,Park,leo.park@corp.net,B
seed = 7
Output (every run with seed 7, byte-identical):
user_id,first_name,last_name,email,label
7,<fake>,<fake>,<fake>,A
8,<fake>,<fake>,<fake>,B
label column (A/B) preserved -> class balance unchanged.
PII embedded in a free-text training column survives
The tool de-identifies whole cells in matched columns; it does NOT scan inside free text. An email or phone in a body / transcript column — exactly the text an LLM would memorise — is NOT removed because the column name isn't a PII token. Scrub those columns first.
Input (chats.json):
[{ "email": "dana@x.com", "body": "Reach me at 415-555-0199, my SSN is 123-45-6789" }]
Output (chats-scrambled.json):
[{ "email": "Hilbert.Klein@gmail.com", "body": "Reach me at 415-555-0199, my SSN is 123-45-6789" }]
The email key was faked; the phone + SSN inside body were NOT.
Run the body text through email-phone-scrubber first -> it emits
[REDACTED_PHONE] / [REDACTED_SSN] so the model never sees them.
Header doesn't match -> identifier stays in the training set
Common ML export headers like customer_email or user_name are not exact PII tokens, so they pass through unchanged and a real identifier remains in your dataset. Rename to the bare token before scrambling.
Input (data.csv):
user_name,customer_email,feature_1,label
Li Wei,li.wei@x.com,0.4,1
Output (data-scrambled.csv): <-- identifier NOT removed
user_name,customer_email,feature_1,label
Li Wei,li.wei@x.com,0.4,1
Nothing matched (itemsRedacted = 0). Rename to name / email:
name,email,feature_1,label
Mavis Goldner,Lonnie_Cremin@hotmail.com,0.4,1
Edge cases and what actually happens
PII inside a free-text column (body, transcript, notes)
By design: This is the most important caveat for ML data. The tool replaces whole cells in matched columns and never scans cell contents. A name, email, phone, or SSN written inside a body, transcript, comments, or notes column (precisely the free text an LLM memorises) is left intact because the column name isn't a PII token. Scrub those columns first with email-phone-scrubber, which emits fixed [REDACTED_*] tags.
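To find out which free-text columns need scrubbing, a rough scan over the records can flag string values that look like they contain identifiers. The patterns below are deliberately simple illustrations (email-like, US-style phone, US SSN shape); they are assumptions made for this sketch, not the patterns email-phone-scrubber uses, and `findFreeTextPii` is a hypothetical helper.

```typescript
// Illustrative patterns only; real scrubbing needs broader coverage.
const ILLUSTRATIVE_PATTERNS: { kind: string; re: RegExp }[] = [
  { kind: "email", re: /[^\s@]+@[^\s@]+\.[^\s@]+/ },
  { kind: "phone", re: /\b\d{3}-\d{3}-\d{4}\b/ },
  { kind: "ssn", re: /\b\d{3}-\d{2}-\d{4}\b/ },
];

// Returns, per key, the kinds of identifier-like text seen in its string values.
// Keys the scrambler will replace anyway can be passed in `keysAlreadyScrambled`.
function findFreeTextPii(
  records: Record<string, unknown>[],
  keysAlreadyScrambled: string[],
): Record<string, string[]> {
  const hits: Record<string, string[]> = {};
  for (const record of records) {
    for (const key of Object.keys(record)) {
      const value = record[key];
      if (typeof value !== "string") continue;
      if (keysAlreadyScrambled.indexOf(key) !== -1) continue;
      for (const { kind, re } of ILLUSTRATIVE_PATTERNS) {
        if (re.test(value)) {
          const kinds = hits[key] || (hits[key] = []);
          if (kinds.indexOf(kind) === -1) kinds.push(kind);
        }
      }
    }
  }
  return hits;
}
```

A non-empty result for a key such as `body` suggests that column should go through a text scrubber before the dataset is scrambled; an empty result is not proof that the text is clean.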
JSONL or NDJSON fed in directly
Error: Training pipelines love JSONL, but a .json file containing one object per line is not a single JSON document, so JSON.parse throws on the second line. Convert JSONL to a JSON array first (or scramble the source records before serialising to JSONL). CSV remains the simplest tabular path.
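The conversion in both directions is small enough to do in a pre-processing script. The sketch below uses only JSON.parse and JSON.stringify; the function names are hypothetical and it assumes each non-blank line of the JSONL input is one complete JSON value.

```typescript
// JSONL -> a single JSON array document the tool can parse.
function jsonlToJsonArray(jsonl: string): string {
  const records: unknown[] = [];
  jsonl.split(/\r?\n/).forEach((rawLine, index) => {
    const line = rawLine.trim();
    if (line === "") return; // tolerate blank lines and a trailing newline
    try {
      records.push(JSON.parse(line));
    } catch (err) {
      throw new Error(`Line ${index + 1} is not valid JSON: ${(err as Error).message}`);
    }
  });
  return JSON.stringify(records, null, 2);
}

// Scrambled JSON array -> JSONL for the training pipeline.
function jsonArrayToJsonl(json: string): string {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array of records");
  return parsed.map((record) => JSON.stringify(record)).join("\n") + "\n";
}
```

A typical flow would be `jsonlToJsonArray` on the raw export, scramble the resulting `.json` file, then `jsonArrayToJsonl` on the scrambled download.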
Compound identifier header not matched
Not matched: Headers like customer_email, user_name, or contact_phone are not exact PII tokens, so the regex (anchored to the whole name) leaves them, and their real identifiers, in the dataset. Rename to email, name, phone before scrambling, and confirm via the replaced-field count.
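Renaming can be scripted rather than done by hand. The following sketch rewrites only the first line of a CSV using a mapping you supply; `renameCsvHeaders` is a hypothetical helper, and it assumes the header row contains no quoted fields with embedded commas (use a real CSV parser if it does).

```typescript
// Rewrites the header row of a CSV according to `renames` (old name -> bare token).
// Assumption: header cells are plain, unquoted names.
function renameCsvHeaders(csv: string, renames: Record<string, string>): string {
  const end = csv.search(/\r?\n/);
  const headerLine = end === -1 ? csv : csv.slice(0, end);
  const rest = end === -1 ? "" : csv.slice(end); // keeps the original line endings

  const seen: string[] = [];
  const renamed = headerLine.split(",").map((cell) => {
    const name = cell.trim();
    const hasRename = Object.prototype.hasOwnProperty.call(renames, name);
    const next = hasRename ? renames[name] : name;
    // Two columns renamed to the same token would collide; fail loudly instead.
    if (seen.indexOf(next) !== -1) {
      throw new Error(`Duplicate header after rename: ${next}`);
    }
    seen.push(next);
    return next;
  });
  return renamed.join(",") + rest;
}
```

For the example headers above, a mapping of `{ user_name: "name", customer_email: "email" }` yields `name,email,...` as the first row and leaves every data row untouched.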
City / zip faked but you needed geo as a feature
By designcity, zip, postal and street are treated as PII and replaced with fakes, which destroys them as features. If your model genuinely needs coarse geography, derive a non-PII feature (e.g. a region bucket or a one-hot column) BEFORE scrambling, since the fake city/zip afterwards carries no real signal.
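Deriving the geographic feature first might look like the sketch below. The `region` column name, the `addRegionFeature` helper, and the city-to-region lookup are all hypothetical; the lookup is something you supply, and the bucket granularity is your call.

```typescript
type Row = Record<string, string>;

// Adds a coarse `region` column derived from the real `city` value.
// `cityToRegion` is a lookup you provide (keys lower-cased); unknown cities
// fall into `fallback` so no raw city name leaks into the feature.
function addRegionFeature(
  rows: Row[],
  cityToRegion: Record<string, string>,
  fallback = "other",
): Row[] {
  return rows.map((row) => {
    const city = (row["city"] || "").trim().toLowerCase();
    const known = Object.prototype.hasOwnProperty.call(cityToRegion, city);
    const region = known ? cityToRegion[city] : fallback;
    return { ...row, region };
  });
}
```

Run this on the raw rows, write the result back out, and then scramble: the `city` cell is replaced with a fake while the derived `region` bucket keeps the real coarse signal, provided its header is not itself one of the recognised PII names.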
SSN feature becomes 9 random digits
Expected: ssn / tax_id columns are filled with faker.string.numeric(9), which is random nine-digit noise with no checksum. As a high-cardinality identifier it was useless as a feature anyway; just be aware the values change every run unless you set a seed.
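The effect of a seed on such values can be illustrated without faker itself. The sketch below uses a small Park-Miller generator purely as a stand-in: it is not the algorithm faker uses, and `makeSeededRng` and `nineRandomDigits` are hypothetical names. It only shows why the same seed reproduces the same digit strings while a different seed does not.

```typescript
// Park-Miller "minimal standard" generator: a stand-in for seeded randomness,
// NOT faker's internal algorithm.
function makeSeededRng(seed: number): () => number {
  let state = Math.floor(Math.abs(seed)) % 2147483647;
  if (state === 0) state = 1; // zero is a fixed point of this generator
  return () => {
    state = (state * 48271) % 2147483647;
    return state / 2147483647; // strictly between 0 and 1
  };
}

// Nine independent decimal digits, with no checksum or structure.
function nineRandomDigits(rng: () => number): string {
  let out = "";
  for (let i = 0; i < 9; i++) {
    out += Math.floor(rng() * 10);
  }
  return out;
}

// Generates one fake value per row, in row order, from a single seeded stream.
function fakeColumn(rowCount: number, seed: number): string[] {
  const rng = makeSeededRng(seed);
  const values: string[] = [];
  for (let i = 0; i < rowCount; i++) values.push(nineRandomDigits(rng));
  return values;
}
```

Because the values come from one stream consumed in row order, reproducibility in this sketch depends on both the seed and the row order staying the same, which is consistent with the "same raw file + same seed" wording used in this guide.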
Same person recurs across rows / sessions
Not preserved: Replacement is per-cell, so one real user appearing in many rows (e.g. multiple events per customer) gets a different fake each row. If your modelling relies on a consistent per-user identity, scramble will break that linkage. Derive a stable surrogate key from the real data before scrambling, then scramble only the human-readable identifiers.
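One way to derive such a surrogate key is to number distinct identities in order of first appearance, as sketched below. The helper name and the `user_key` column are hypothetical, and normalising the identity by trimming and lower-casing is an assumption about what counts as "the same person" in your data.

```typescript
type Row = Record<string, string>;

// Assigns an opaque, stable key (u1, u2, ...) per distinct value of
// `identityColumn`, so repeated rows for one person share a key.
// Sequential keys carry no information about the underlying identity.
function addSurrogateKey(
  rows: Row[],
  identityColumn: string,
  keyColumn = "user_key",
): Row[] {
  const assigned: Record<string, string> = Object.create(null);
  let next = 1;
  return rows.map((row) => {
    // Rows with a blank identity all share one key under this normalisation.
    const identity = (row[identityColumn] || "").trim().toLowerCase();
    let key = assigned[identity];
    if (key === undefined) {
      key = `u${next++}`;
      assigned[identity] = key;
    }
    return { ...row, [keyColumn]: key };
  });
}
```

The mapping lives only in memory for the duration of the script. After this step the `email` cells can be scrambled independently per row while `user_key` still groups a customer's events together.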
Malformed JSON dataset
Error: JSON input goes through JSON.parse; a trailing comma, single quotes, unquoted keys, or a truncated dump throws and nothing is produced. Validate the dataset first. CSV is more forgiving: PapaParse parses ragged rows rather than throwing.
Large training set over the tier cap
Rejected: Datasets get big. Security-family limits are Pro 100 MB / 5 files, Pro-media 500 MB / 50, Developer 2 GB / unlimited; this tool needs at least Pro. A file over your cap is rejected before processing, so shard the dataset and scramble per shard, or use the Developer tier for very large corpora.
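Sharding a CSV so that each piece stays under a byte budget could be sketched as follows. `shardCsv` and `utf8Length` are hypothetical helpers; the sketch assumes no quoted field contains a line break (otherwise split with a CSV parser instead) and repeats the header row in every shard so each one is a valid standalone input.

```typescript
// UTF-8 byte length of a string, computed without platform-specific APIs.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) { bytes += 4; i++; } // surrogate pair
    else bytes += 3;
  }
  return bytes;
}

// Splits a CSV into shards of at most `maxBytes` each (header included).
// A single row larger than the budget still gets a shard of its own.
function shardCsv(csv: string, maxBytes: number): string[] {
  const lines = csv.split(/\r?\n/).filter((line) => line !== "");
  const header = lines[0];
  if (header === undefined) return [];

  const headerBytes = utf8Length(header) + 1; // +1 for the newline
  const shards: string[] = [];
  let current: string[] = [];
  let currentBytes = headerBytes;

  for (const line of lines.slice(1)) {
    const lineBytes = utf8Length(line) + 1;
    if (current.length > 0 && currentBytes + lineBytes > maxBytes) {
      shards.push([header].concat(current).join("\n") + "\n");
      current = [];
      currentBytes = headerBytes;
    }
    current.push(line);
    currentBytes += lineBytes;
  }
  if (current.length > 0 || shards.length === 0) {
    shards.push([header].concat(current).join("\n") + "\n");
  }
  return shards;
}
```

Keep the per-tier file-count limits in mind when choosing the shard size: many small shards can exceed the number of files allowed in one batch even when each file is under the size cap.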
Header-only or PII-free dataset
Supported: A dataset whose headers contain no PII tokens parses fine and returns with itemsRedacted = 0, with every cell preserved. That's the correct result for an already-anonymous feature matrix; the zero count confirms that no header matched a PII token, so nothing was replaced.
Determinism vs. reversibility
By design: A seed makes the de-identified dataset reproducible across experiment runs, but it is not a key and builds no lookup back to the originals. The operation is one-way, so keep the raw dataset secured separately if you ever need to re-derive features from real values.
Frequently asked questions
Does this stop the model from regurgitating real people?
For identifiers in matched columns, yes — the real names, emails, phones and addresses are replaced with fakes before the data reaches the model, so there's nothing real left to memorise in those fields. But it does NOT scrub PII embedded inside free-text columns (a name in a body field), so de-identify those separately before training.
Will scrambling hurt my model's accuracy?
It shouldn't, because the columns it changes — names, emails, phones, addresses — are direct identifiers you wouldn't use as features anyway. Numeric features, categories, timestamps and labels are preserved byte-for-byte, so feature distributions and class balance are unchanged and the training signal survives.
Can I run JSONL straight through it?
No. JSONL / NDJSON is one object per line, not a single JSON document, so JSON.parse throws. Convert it to a JSON array first, or scramble your records before serialising to JSONL. A flat CSV is the simplest tabular path.
What about PII inside a transcript or body column?
Not handled here — the tool replaces whole cells in name-matched columns and never scans cell contents. Run free-text columns through email-phone-scrubber first; it matches email, phone, SSN, credit-card (Luhn), IBAN (mod-97) and UK-NI patterns inside text and emits fixed [REDACTED_*] tags, which is exactly what you want before a fine-tune.
My header is `customer_email` — is it de-identified?
No. The regex expects bare tokens, so customer_email, user_name, contact_phone are not matched and the identifier stays in the dataset. Rename to email, name, phone before scrambling and verify with the replaced-field count.
Can I keep the same fake for the same person across rows?
No. Replacement is per-cell with an independent faker value each time, so there's no consistent per-user mapping. If you need a stable per-user identity for modelling, derive a surrogate key from the real data first, then scramble only the human-readable identifiers.
Is the de-identification reproducible for versioned experiments?
Yes — set a numeric seed. The tool calls faker.seed(n) so the same raw file plus the same seed regenerates a byte-identical de-identified dataset, which is what reproducible splits and dataset versioning need. Leave it blank for fresh randomness.
Does my raw dataset get uploaded?
No. The live tool runs in your browser — PapaParse and faker are loaded client-side and the file is parsed and rewritten in the tab. The raw PII never leaves your machine; only the de-identified copy is downloaded for your pipeline.
Can I customise which columns get scrambled?
No. The PII field set is fixed in code with no UI control. To force a non-matching column to scramble, rename its header to a recognised token (email, name, phone, address, city, zip, ssn, ...) before running.
What about geography I want as a feature?
city, zip, postal and street are treated as PII and faked, so they lose their real signal. If your model needs coarse geography, derive a region bucket or other non-PII feature BEFORE scrambling — don't rely on the post-scramble fake city/zip.
What plan and sizes do I need for a big corpus?
This is a Pro-tier tool. Security-family limits are Pro 100 MB / 5 files, Pro-media 500 MB / 50, Developer 2 GB / unlimited. For large training sets, shard and scramble per shard, or use the Developer tier.
What pairs well with this for an AI data pipeline?
Scrub free-text columns first with email-phone-scrubber. If you must store the raw dataset, encrypt it at rest with aes-256-encryptor (Web Crypto AES-GCM 256, PBKDF2). Pin a version of the de-identified artefact with multi-hash-fingerprinter so every experiment references a verifiable hash.