Scramble PII Out of CSV / JSON Before Using It in AI / LLM Training

How to de-identify datasets before feeding them to a model

Step 1
Export your raw dataset as CSV or JSON. Pull the labelled dataset you'd otherwise load straight into training: a tabular CSV of features plus a target column, or a JSON array of records. The tool reads CSV (comma-delimited, first row = header) and JSON (object, array, nested). It does not read .xlsx / .ods; convert those to CSV first. JSONL is not a single JSON document, so convert it to a JSON array first, or scramble before you serialise to JSONL.
Step 2
Drop the file onto the tool. PapaParse (CSV) or JSON.parse (JSON) runs in your browser tab; nothing is sent to a server. The JSON path is taken when the filename ends in `.json`; otherwise the file is parsed as CSV. If a JSON file has lost its extension, rename it or set format: json so it isn't misread as a single-column CSV.
Step 3
Confirm identifier columns use recognised header names. Replacement is keyed on the whole column name. Headers literally named email, name, phone, address, city, zip, ssn, etc. are de-identified. Compound names like customer_email, user_name, or contact_phone are NOT matched. Rename them to the bare token first, or those identifiers stay in the training set.
Step 4
Set a seed for reproducible datasets. Leave seed blank for fresh randomness. Enter a number and the tool calls faker.seed(n) first, so the same raw file plus the same seed regenerates a byte-identical de-identified dataset, which is what reproducible experiments, fixed splits, and dataset versioning need.
Step 5
Scramble and verify the de-identified count. Every column / key matching the PII regex is overwritten with a faker value; the tool reports the number of replaced fields (itemsRedacted). Compare it with what you expected from your identifier columns: a low count usually means a header didn't match and an identifier is still in the data.
Step 6
Feed the de-identified file into your pipeline. The result downloads as <name>-scrambled.<ext>. Use it as the input to your training, fine-tune, or RAG-indexing step. Keep the raw file out of the pipeline and the repo: the scramble is one-way with no reverse mapping, so the de-identified artefact is the one that should travel.
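
Alongside the replaced-field count from Step 5, you can check the result yourself by diffing the raw and scrambled rows. The sketch below is not part of the tool: `changedColumns` and the sample rows are hypothetical, and it assumes both files were already parsed into arrays of string records in the same row order.

```typescript
type DataRow = Record<string, string>;

// Returns the columns whose value differs between raw and scrambled in at
// least one row. Assumes both arrays keep the same row order.
function changedColumns(rawRows: DataRow[], scrambledRows: DataRow[]): string[] {
  const changed = new Set<string>();
  rawRows.forEach((rawRow, i) => {
    const scrambledRow = scrambledRows[i];
    if (scrambledRow === undefined) return;
    for (const column of Object.keys(rawRow)) {
      if (rawRow[column] !== scrambledRow[column]) changed.add(column);
    }
  });
  return Array.from(changed).sort();
}

const rawSample: DataRow[] = [
  { name: "Sarah Chen", email: "sarah.chen@acme.io", tenure_days: "412", churned: "0" },
];
const scrambledSample: DataRow[] = [
  { name: "Dr. Elena Rosales", email: "Reanna.Lockman@yahoo.com", tenure_days: "412", churned: "0" },
];

// Expect exactly the identifier columns in this list.
const changedHere = changedColumns(rawSample, scrambledSample);
```

A feature or label column showing up in the list, or an identifier column missing from it, is the signal to stop and inspect the headers before the file goes anywhere near training.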

What gets de-identified vs. what the model still learns from

How identifier columns and feature/label columns are treated. Detection is name-based against a fixed regex (lib/security/security-processor.ts).

Column role	Example headers	After scrambling	Effect on training
Direct identifier	`name`, `email`, `phone`, `address`	Faked	Removed as memorisation/leakage risk; not used as a feature anyway
Location identifier	`city`, `zip`, `postal`, `street`	Faked	Coarse geo becomes fake; if you needed region as a feature, derive it before scrambling
Government ID	`ssn`, `tax_id`	9 random digits	`faker.string.numeric(9)` — high-cardinality noise, safe to keep or drop
Numeric feature	`amount`, `score`, `tenure_days`	Preserved	Distributions and correlations intact -> model signal preserved
Label / target	`churned`, `class`, `label`	Preserved	Class balance unchanged -> training objective unaffected
Category / timestamp	`plan`, `created_at`, `channel`	Preserved	Categorical and temporal features survive exactly
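
To make the whole-name matching concrete, here is a minimal sketch of an anchored header test. The pattern is an assumption assembled from the header names mentioned in this guide; the real `PII_FIELDS_REGEX` is not reproduced here and may differ in tokens, flags, or case handling.

```typescript
// Hypothetical stand-in for PII_FIELDS_REGEX. The ^...$ anchors are what make
// the match apply to the whole header name rather than a substring.
const ASSUMED_PII_HEADER =
  /^(name|first_name|last_name|email|phone|address|city|zip|postal|street|ssn|tax_id)$/;

function isMatchedHeader(header: string): boolean {
  return ASSUMED_PII_HEADER.test(header);
}

const sampleHeaders = ["email", "customer_email", "city", "tenure_days", "churned"];

// Only exact tokens match; customer_email contains "email" but is not equal to it.
const matchedHeaders = sampleHeaders.filter(isMatchedHeader);
```

Running your own header row through a check like this before scrambling shows which columns would be treated as identifiers and which would pass through untouched.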

The complete control surface

Every option this tool exposes, from lib/security/security-tool-schemas.ts. There is no field-list, masking-style, or format-of-fake control.

Control	Type / values	Default	What it actually does
`seed`	number (optional)	(blank)	Blank = fresh randomness. A number calls `faker.seed(n)` so the same dataset + seed gives identical fakes — for reproducible experiments. Not encryption, not reversible
`format`	enum: `auto` / `csv` / `json`	`auto`	On the server-safe path, `auto` treats input that starts with `[` or `{` as JSON and anything else as CSV. In the browser, the JSON path is chosen by the `.json` extension. Force either with `csv` / `json`
Field / column list	(not a control)	—	Fixed in code (`PII_FIELDS_REGEX`). Cannot be edited in the UI
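
The two detection rules in the `format` row can be sketched as follows. This is an illustration of the described behaviour, not the tool's code: the function names are hypothetical, and skipping leading whitespace before looking at the first character is an assumption.

```typescript
type InputFormat = "csv" | "json";
type FormatOption = "auto" | InputFormat;

// Content rule described for the server-safe `auto`: a leading [ or { means JSON.
// Ignoring leading whitespace is an assumption of this sketch.
function detectByContent(text: string): InputFormat {
  const first = text.trimStart().charAt(0);
  return first === "[" || first === "{" ? "json" : "csv";
}

// Extension rule described for the in-browser path.
function detectByFilename(filename: string): InputFormat {
  return filename.endsWith(".json") ? "json" : "csv";
}

// An explicit csv / json setting wins; auto falls back to the filename rule.
function resolveBrowserFormat(option: FormatOption, filename: string): InputFormat {
  return option === "auto" ? detectByFilename(filename) : option;
}
```

Under these rules a JSON export saved as `export.txt` resolves to CSV in the browser unless `format` is set to `json`, which is why the guide recommends renaming the file or forcing the format.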

Tier, formats, and size limits

Metadata from lib/security/security-tools-registry.ts and limits from lib/tier-limits.ts. Server-safe tool; the live version runs in the browser so raw PII is never transmitted.

Property	Value	Note
Minimum tier	Pro	`minTier: "pro"`
Input formats	CSV, JSON	JSON via `JSON.parse`; not JSONL, not `.xlsx`/`.ods`
Output	Text (CSV or pretty-printed JSON)	`<name>-scrambled.<ext>`
Pro limits	100 MB / 5 files	Security family
Pro-media limits	500 MB / 50 files	Security family
Developer limits	2 GB / unlimited	Security family — useful for large training sets

Cookbook

Before/after files for ML / LLM data prep. Notice identifiers get faked while features and labels survive exactly, so the dataset still trains. Faker values are illustrative — set a seed for reproducible output.

Tabular churn dataset for a classifier

A supervised dataset with a churned label. name, email and city are de-identified; tenure_days, mrr, plan and the churned label are preserved, so the feature distributions and class balance the model learns from are untouched.

Input (churn.csv):
name,email,city,tenure_days,mrr,plan,churned
Sarah Chen,sarah.chen@acme.io,Oakland,412,49,pro,0
Tomás Reyes,treyes@globex.com,Chicago,71,149,team,1

Output (churn-scrambled.csv):
name,email,city,tenure_days,mrr,plan,churned
Dr. Elena Rosales,Reanna.Lockman@yahoo.com,East Garfield,412,49,pro,0
Marcus Hettinger,Jaylin.Bode@gmail.com,Lake Verda,71,149,team,1

Identifiers faked; tenure_days / mrr / plan / churned preserved.

JSON records for a RAG index

Records destined for an embedding index. The name and email keys are faked; the id, topic, and numeric score survive. These records carry no free-text field; if yours include one (a body or transcript key), see the edge case about PII inside free text.

Input (records.json):
[
  { "id": "r1", "name": "Priya Nair", "email": "priya@shop.co", "topic": "billing", "score": 0.82 },
  { "id": "r2", "name": "Leo Park", "email": "leo@shop.co", "topic": "refund", "score": 0.40 }
]

Output (records-scrambled.json):
[
  { "id": "r1", "name": "Dr. Elena Rosales", "email": "Reanna.Lockman@yahoo.com", "topic": "billing", "score": 0.82 },
  { "id": "r2", "name": "Marcus Hettinger", "email": "Jaylin.Bode@gmail.com", "topic": "refund", "score": 0.4 }
]

Reproducible split with a seed

For versioned experiments you want the same de-identified dataset every run so your train/test split is stable. Set a seed; identical raw file + identical seed = identical output.

Input (users.csv):
user_id,first_name,last_name,email,label
7,Aisha,Khan,aisha.khan@corp.net,A
8,Leo,Park,leo.park@corp.net,B

seed = 7

Output (every run with seed 7, byte-identical):
user_id,first_name,last_name,email,label
7,<fake>,<fake>,<fake>,A
8,<fake>,<fake>,<fake>,B

label column (A/B) preserved -> class balance unchanged.
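
Why a seed gives identical output can be shown with any seeded generator. The sketch below does not use faker; it substitutes a small public-domain PRNG (mulberry32) and a made-up name pool purely to illustrate the principle behind `faker.seed(n)`.

```typescript
// mulberry32: a tiny seeded PRNG used here as a stand-in for faker's generator.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Illustrative pool; the real fakes come from faker, not from this list.
const FAKE_FIRST_NAMES = ["Avery", "Blake", "Casey", "Devon", "Emery"];

function fakeFirstNames(seed: number, count: number): string[] {
  const next = mulberry32(seed);
  const picked: string[] = [];
  for (let i = 0; i < count; i++) {
    const index = Math.floor(next() * FAKE_FIRST_NAMES.length);
    picked.push(FAKE_FIRST_NAMES[index] ?? "");
  }
  return picked;
}

// Two runs with seed 7 walk the same pseudo-random sequence.
const seededRunA = fakeFirstNames(7, 3);
const seededRunB = fakeFirstNames(7, 3);
```

The same reasoning explains the limits of the guarantee: the output repeats only while the seed, the input rows, and their order all stay the same.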

PII embedded in a free-text training column survives

The tool de-identifies whole cells in matched columns; it does NOT scan inside free text. An email or phone in a body / transcript column — exactly the text an LLM would memorise — is NOT removed because the column name isn't a PII token. Scrub those columns first.

Input (chats.json):
[{ "email": "dana@x.com", "body": "Reach me at 415-555-0199, my SSN is 123-45-6789" }]

Output (chats-scrambled.json):
[{ "email": "Hilbert.Klein@gmail.com", "body": "Reach me at 415-555-0199, my SSN is 123-45-6789" }]

The email key was faked; the phone + SSN inside body were NOT.
Run the body text through email-phone-scrubber first -> it emits
[REDACTED_PHONE] / [REDACTED_SSN] so the model never sees them.

Header doesn't match -> identifier stays in the training set

Common ML export headers like customer_email or user_name are not exact PII tokens, so they pass through unchanged and a real identifier remains in your dataset. Rename to the bare token before scrambling.

Input (data.csv):
user_name,customer_email,feature_1,label
Li Wei,li.wei@x.com,0.4,1

Output (data-scrambled.csv):  <-- identifier NOT removed
user_name,customer_email,feature_1,label
Li Wei,li.wei@x.com,0.4,1

Nothing matched (itemsRedacted = 0). Rename to name / email:
name,email,feature_1,label
Mavis Goldner,Lonnie_Cremin@hotmail.com,0.4,1

Edge cases and what actually happens

PII inside a free-text column (body, transcript, notes)

By design

This is the most important caveat for ML data: the tool replaces whole cells in matched columns and never scans cell contents. A name, email, phone, or SSN written inside a body, transcript, comments, or notes column — precisely the free text an LLM memorises — is left intact because the column name isn't a PII token. Scrub those columns first with email-phone-scrubber, which emits fixed [REDACTED_*] tags.
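
A quick way to find out whether a free-text column needs scrubbing is to scan it for identifier-shaped strings. This is a rough sketch: the three patterns are illustrative assumptions, much narrower than a real scrubber, and are not the patterns email-phone-scrubber uses.

```typescript
// Illustrative patterns only: US-style phone and SSN shapes plus a loose email shape.
const EMAIL_LIKE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
const US_PHONE_LIKE = /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/;
const US_SSN_LIKE = /\b\d{3}-\d{2}-\d{4}\b/;

// Returns the kinds of identifier-shaped text found in one free-text value.
function residualPiiKinds(text: string): string[] {
  const kinds: string[] = [];
  if (EMAIL_LIKE.test(text)) kinds.push("email");
  if (US_PHONE_LIKE.test(text)) kinds.push("phone");
  if (US_SSN_LIKE.test(text)) kinds.push("ssn");
  return kinds;
}

const bodyKinds = residualPiiKinds("Reach me at 415-555-0199, my SSN is 123-45-6789");
// bodyKinds is ["phone", "ssn"]
```

An empty result is not proof that the text is clean; names and addresses in particular will not be caught by patterns like these.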

JSONL or NDJSON fed in directly

Error

Training pipelines love JSONL, but a .json file containing one object per line is not a single JSON document — JSON.parse throws on the second line. Convert JSONL to a JSON array first (or scramble the source records before serialising to JSONL). CSV remains the simplest tabular path.
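
The conversion in both directions is short. A minimal sketch, assuming one complete JSON value per non-empty line; the function names are hypothetical.

```typescript
// JSONL -> a single JSON array document that JSON.parse accepts.
function jsonlToJsonArray(jsonl: string): string {
  const records = jsonl
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as unknown);
  return JSON.stringify(records, null, 2);
}

// JSON array document -> JSONL, for use after scrambling.
function jsonArrayToJsonl(json: string): string {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) throw new Error("expected a JSON array");
  return parsed.map((record) => JSON.stringify(record)).join("\n");
}

const jsonlSample = '{"email":"a@x.com","label":1}\n{"email":"b@x.com","label":0}\n';
const arrayDocument = jsonlToJsonArray(jsonlSample);
```

Scramble `arrayDocument`, then pass the scrambled result through `jsonArrayToJsonl` to get back the one-record-per-line layout your training pipeline expects.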

Compound identifier header not matched

Not matched

Headers like customer_email, user_name, or contact_phone are not exact PII tokens, so the regex (anchored to the whole name) leaves them — and their real identifiers — in the dataset. Rename to email, name, phone before scrambling, and confirm via the replaced-field count.
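
Renaming can be done on the header row alone, leaving the data rows untouched. A minimal sketch, assuming header names contain no quoted commas; the rename map is hypothetical and should list your own dataset's headers.

```typescript
// Hypothetical mapping from compound headers to the bare tokens the tool matches.
const HEADER_RENAMES = new Map<string, string>([
  ["customer_email", "email"],
  ["user_name", "name"],
  ["contact_phone", "phone"],
]);

// Rewrites only the first line of a CSV; every other byte is passed through.
function renameCsvHeaders(csv: string, renames: Map<string, string>): string {
  const lineBreak = /\r?\n/.exec(csv);
  const headerEnd = lineBreak === null ? csv.length : lineBreak.index;
  const renamedHeader = csv
    .slice(0, headerEnd)
    .split(",")
    .map((header) => renames.get(header.trim()) ?? header)
    .join(",");
  return renamedHeader + csv.slice(headerEnd);
}

const renamedCsv = renameCsvHeaders(
  "user_name,customer_email,feature_1,label\nLi Wei,li.wei@x.com,0.4,1",
  HEADER_RENAMES,
);
// renamedCsv starts with "name,email,feature_1,label"
```

If downstream code depends on the original header names, keep the map and apply it in reverse to the scrambled file.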

City / zip faked but you needed geo as a feature

By design

city, zip, postal and street are treated as PII and replaced with fakes, which destroys them as features. If your model genuinely needs coarse geography, derive a non-PII feature (e.g. a region bucket or a one-hot column) BEFORE scrambling, since the fake city/zip afterwards carries no real signal.
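
Deriving the geography feature first might look like the sketch below. The city-to-region lookup and the `region` column name are hypothetical, and the approach assumes `region` is not one of the header names the tool matches.

```typescript
// Hypothetical lookup; build yours from whatever geography the model needs.
const REGION_BY_CITY = new Map<string, string>([
  ["Oakland", "west"],
  ["Chicago", "midwest"],
]);

type GeoRow = Record<string, string>;

// Adds a coarse region column computed from the real city, before scrambling.
function addRegion(rows: GeoRow[]): GeoRow[] {
  return rows.map((row) => ({
    ...row,
    region: REGION_BY_CITY.get(row["city"] ?? "") ?? "unknown",
  }));
}

const withRegion = addRegion([
  { city: "Oakland", churned: "0" },
  { city: "Springfield", churned: "1" },
]);
```

After scrambling, `city` holds a fake value while `region` still carries the real coarse signal. Keep the buckets broad enough that a region cannot single out an individual.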

SSN feature becomes 9 random digits

Expected

ssn / tax_id columns are filled with faker.string.numeric(9) — random nine-digit noise with no checksum. As a high-cardinality identifier it was useless as a feature anyway; just be aware the values change every run unless you set a seed.

Same person recurs across rows / sessions

Not preserved

Replacement is per-cell, so one real user appearing in many rows (e.g. multiple events per customer) gets a different fake in each row. If your modelling relies on a consistent per-user identity, scrambling breaks that linkage. Derive a stable surrogate key from the real data before scrambling, then scramble only the human-readable identifiers.
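
One way to derive such a key is a simple counter per distinct identifier. This is a sketch under assumptions: the `user_key` column name is hypothetical and is assumed not to match the tool's header regex, and the rows are assumed to fit in memory.

```typescript
type EventRow = Record<string, string>;

// Assigns u1, u2, ... in order of first appearance of each distinct identifier.
// A counter reveals nothing about the original value, unlike an unsalted hash.
function addSurrogateKey(rows: EventRow[], identifierColumn: string): EventRow[] {
  const keyByIdentifier = new Map<string, string>();
  return rows.map((row) => {
    const identifier = row[identifierColumn] ?? "";
    let key = keyByIdentifier.get(identifier);
    if (key === undefined) {
      key = `u${keyByIdentifier.size + 1}`;
      keyByIdentifier.set(identifier, key);
    }
    return { ...row, user_key: key };
  });
}

const keyedEvents = addSurrogateKey(
  [
    { email: "a@x.com", event: "login" },
    { email: "b@x.com", event: "login" },
    { email: "a@x.com", event: "refund" },
  ],
  "email",
);
const surrogateKeys = keyedEvents.map((row) => row["user_key"]).join(",");
// surrogateKeys is "u1,u2,u1"
```

Run this on the raw data, then scramble: the `email` cells become unrelated fakes while `user_key` still groups each user's rows together.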

Malformed JSON dataset

Error

JSON input goes through JSON.parse; a trailing comma, single quotes, unquoted keys, or a truncated dump throws and nothing is produced. Validate the dataset first. CSV is more forgiving — PapaParse parses ragged rows rather than throwing.
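
Validating first can be as small as a guarded `JSON.parse`. A minimal sketch; `checkJsonDataset` and its result shape are hypothetical.

```typescript
type JsonCheck =
  | { ok: true; recordCount: number }
  | { ok: false; message: string };

// Parses the text once and reports either the record count or the parser's error.
function checkJsonDataset(text: string): JsonCheck {
  try {
    const parsed: unknown = JSON.parse(text);
    const recordCount = Array.isArray(parsed) ? parsed.length : 1;
    return { ok: true, recordCount };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, message };
  }
}

const validCheck = checkJsonDataset('[{"a":1},{"a":2}]');
const trailingCommaCheck = checkJsonDataset('[{"a":1},]');
```

The error message from `JSON.parse` usually names the offending position, which helps locate a trailing comma or a truncated dump in a large file.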

Large training set over the tier cap

Rejected

Datasets get big. Security-family limits are Pro 100 MB / 5 files, Pro-media 500 MB / 50, Developer 2 GB / unlimited; this tool needs at least Pro. A file over your cap is rejected before processing — shard the dataset, scramble per shard, or use the Developer tier for very large corpora.
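
Sharding a CSV means cutting on row boundaries and repeating the header in every shard. The sketch below assumes one record per line (no quoted fields containing newlines); `shardCsv` is a hypothetical helper, and the byte budget you pass in should sit below your tier's cap.

```typescript
// UTF-8 byte length of a string, computed without platform APIs.
function utf8Length(text: string): number {
  let bytes = 0;
  for (const char of text) {
    const codePoint = char.codePointAt(0) ?? 0;
    bytes += codePoint < 0x80 ? 1 : codePoint < 0x800 ? 2 : codePoint < 0x10000 ? 3 : 4;
  }
  return bytes;
}

// Splits a CSV into shards of at most maxBytes each, repeating the header.
// A single row larger than the budget still gets a shard of its own.
function shardCsv(csv: string, maxBytes: number): string[] {
  const lines = csv.split(/\r?\n/).filter((line) => line !== "");
  const header = lines[0];
  if (header === undefined) return [];
  const headerBytes = utf8Length(header) + 1; // +1 for the newline
  const shards: string[] = [];
  let current = [header];
  let currentBytes = headerBytes;
  for (const line of lines.slice(1)) {
    const lineBytes = utf8Length(line) + 1;
    if (current.length > 1 && currentBytes + lineBytes > maxBytes) {
      shards.push(current.join("\n") + "\n");
      current = [header];
      currentBytes = headerBytes;
    }
    current.push(line);
    currentBytes += lineBytes;
  }
  if (current.length > 1) shards.push(current.join("\n") + "\n");
  return shards;
}

const csvShards = shardCsv("id,label\n1,A\n2,B\n3,A\n", 17);
// csvShards is ["id,label\n1,A\n2,B\n", "id,label\n3,A\n"]
```

If you also set a seed, note that each shard is scrambled as a separate file, so reproducibility holds per shard only while the shard boundaries stay the same.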

Header-only or PII-free dataset

Supported

A dataset whose headers contain no PII tokens parses fine and returns with itemsRedacted = 0 — every cell preserved. That's the correct result for an already-anonymous feature matrix; the zero count confirms nothing identifiable was present to remove.

Determinism vs. reversibility

By design

A seed makes the de-identified dataset reproducible across experiment runs, but it is not a key and builds no lookup back to the originals. The operation is one-way — keep the raw dataset secured separately if you ever need to re-derive features from real values.

Frequently asked questions

Does this stop the model from regurgitating real people?

For identifiers in matched columns, yes — the real names, emails, phones and addresses are replaced with fakes before the data reaches the model, so there's nothing real left to memorise in those fields. But it does NOT scrub PII embedded inside free-text columns (a name in a body field), so de-identify those separately before training.

Will scrambling hurt my model's accuracy?

It shouldn't, because the columns it changes (names, emails, phones, addresses) are direct identifiers you wouldn't use as features anyway. The exception is geography: city and zip are also faked, so derive any region feature before scrambling. Numeric features, categories, timestamps and labels keep their values, so feature distributions and class balance are unchanged and the training signal survives.

Can I run JSONL straight through it?

No. JSONL / NDJSON is one object per line, not a single JSON document, so JSON.parse throws. Convert it to a JSON array first, or scramble your records before serialising to JSONL. A flat CSV is the simplest tabular path.

What about PII inside a transcript or body column?

Not handled here — the tool replaces whole cells in name-matched columns and never scans cell contents. Run free-text columns through email-phone-scrubber first; it matches email, phone, SSN, credit-card (Luhn), IBAN (mod-97) and UK-NI patterns inside text and emits fixed [REDACTED_*] tags, which is exactly what you want before a fine-tune.

My header is `customer_email` — is it de-identified?

No. The regex expects bare tokens, so customer_email, user_name, contact_phone are not matched and the identifier stays in the dataset. Rename to email, name, phone before scrambling and verify with the replaced-field count.

Can I keep the same fake for the same person across rows?

No. Replacement is per-cell with an independent faker value each time, so there's no consistent per-user mapping. If you need a stable per-user identity for modelling, derive a surrogate key from the real data first, then scramble only the human-readable identifiers.

Is the de-identification reproducible for versioned experiments?

Yes — set a numeric seed. The tool calls faker.seed(n) so the same raw file plus the same seed regenerates a byte-identical de-identified dataset, which is what reproducible splits and dataset versioning need. Leave it blank for fresh randomness.

Does my raw dataset get uploaded?

No. The live tool runs in your browser — PapaParse and faker are loaded client-side and the file is parsed and rewritten in the tab. The raw PII never leaves your machine; only the de-identified copy is downloaded for your pipeline.

Can I customise which columns get scrambled?

No. The PII field set is fixed in code with no UI control. To force a non-matching column to scramble, rename its header to a recognised token (email, name, phone, address, city, zip, ssn, ...) before running.

What about geography I want as a feature?

city, zip, postal and street are treated as PII and faked, so they lose their real signal. If your model needs coarse geography, derive a region bucket or other non-PII feature BEFORE scrambling — don't rely on the post-scramble fake city/zip.

What plan and sizes do I need for a big corpus?

This is a Pro-tier tool. Security-family limits are Pro 100 MB / 5 files, Pro-media 500 MB / 50, Developer 2 GB / unlimited. For large training sets, shard and scramble per shard, or use the Developer tier.

What pairs well with this for an AI data pipeline?

Scrub free-text columns first with email-phone-scrubber. If you must store the raw dataset, encrypt it at rest with aes-256-encryptor (Web Crypto AES-GCM 256, PBKDF2). Pin a version of the de-identified artefact with multi-hash-fingerprinter so every experiment references a verifiable hash.

Privacy first

Every JAD Security operation runs entirely in your browser. Files, passwords, and PGP private keys never leave your device — verified by zero outbound network requests during processing.
