Redact PII from Training Data Before Fine-Tuning — Free In-Browser Scrubber

How to strip emails, phones & financial PII out of LLM training data locally

Step 1
Export your corpus to a text format — Dump your training data to CSV, JSON, JSONL-as-.txt, or Markdown — whatever your pipeline ingests. The scrubber reads raw text, so column names and JSON keys are irrelevant; only the *content* of cells and values is inspected. If your data lives in a spreadsheet, .xlsx/.xls/.ods is accepted directly (first sheet only).
Step 2
Drop the file onto the scrubber — Use the drop area above (it accepts .csv,.json,.txt,.md plus .xlsx/.xls/.ods). There is no paste box for this tool — it works on an uploaded file. The pass runs entirely in-browser; the raw corpus is never transmitted, so even pre-consent or licence-uncertain data stays local.
Step 3
Run the scrubber — Press Run Email/Phone Scrubber. There are no toggles — the six detectors always fire in a fixed order (email, IBAN, card, SSN, NI, then phone last). The IBAN and card passes apply their checksum filter, so only valid IBANs and Luhn-valid cards are replaced; everything else is left as written.
Step 4
Review the redacted corpus and count — The full scrubbed file appears in a scrollable panel, and a metrics line reports items redacted, bytes in/out, and run time. Skim it and confirm the [REDACTED_*] tokens landed in the fields you expected — a sample column you *want* to keep verbatim shouldn't be full of redaction tags.
Step 5
Download the cleaned dataset — Use Download to save the file; it's named after the source with a -scrubbed suffix (corpus.jsonl.txt → corpus.jsonl-scrubbed.txt, data.csv → data-scrubbed.csv). For a spreadsheet input the download keeps the .xlsx name but the content is the scrubbed JSON array of the first sheet — re-export to your loader's format if it expects binary Excel.
Step 6
Add a free-text pass for what regex can't catch — The scrubber only matches shaped identifiers. Names, mailing addresses, dates of birth, internal IDs, and quasi-identifiers written in prose are not removed and will reach your model. For field-level pseudonymisation that swaps names/addresses for plausible fakes, run the output through the CSV/JSON Data Scrambler afterwards.

What survives a training-data scrub — and what doesn't

The six detectors handle direct identifiers with a fixed shape. Quasi-identifiers and free-text PII pass through, which is the single biggest gotcha for ML teams treating this as a complete anonymiser.

PII type in your corpus	Detector	Removed?	Note for training
Customer email in a transcript	Email (pattern)	Yes → `[REDACTED_EMAIL]`	Catches plus-addressing and subdomains; no DNS check
Phone number in chat text	Phone (pattern, runs last)	Yes → `[REDACTED_PHONE]`	Common grouped/E.164 styles; odd local formats may slip
Real card number in a refund note	Credit card (Luhn-gated)	Yes → `[REDACTED_CARD]`	Only if it passes Luhn — fake example PANs survive
Customer name in prose	— none —	No	No name detector; reaches the model verbatim
Mailing address in a field	— none —	No	Free-text addresses are not shape-matchable here
Internal account / user ID	— none —	No	Unless it happens to be a Luhn-valid 13–19 digit run

Inputs, output, and tier limits

The scrubber is a free-tier tool; only the file-size ceiling changes by plan. The text path is gated by that ceiling, so chunk very large corpora before scrubbing.

Aspect	Behaviour
Accepted input	`.csv`, `.json`, `.md`, `.txt` read as text; `.xlsx`/`.xls`/`.ods` (first sheet flattened to a JSON array first)
Options	None — no panel, no per-category toggle, no custom mask. All six detectors always run
Output	Scrubbed text in the original format with `[REDACTED_*]` tags in place; downloaded as `<name>-scrubbed.<ext>`
Multiple files	The drop area accepts several, but only the first file is scrubbed per run — loop the rest
Where it runs	100% in your browser tab; a server-safe API path returns `{ output, redactedCount, counts }` for pipeline automation
File-size limit	Free 10 MB · Pro 100 MB · Pro + Media 500 MB · Developer 2 GB (oversize text throws an `exceeds the … limit` error)

Cookbook

Before/after fragments aimed at training corpora — JSONL prompt/completion rows, instruction pairs, and spreadsheet leads. All PII values shown are fabricated examples.

JSONL prompt/completion pair stays valid JSON

A line from a fine-tuning file where the user message contains an email and a phone. Because tags drop inside the existing quotes, each line stays parseable for your loader.

Input (one .jsonl line):
{"messages":[{"role":"user","content":"resend the invoice to mia@acme.io or call 0207 946 0958"},{"role":"assistant","content":"Done."}]}

Output (-scrubbed):
{"messages":[{"role":"user","content":"resend the invoice to [REDACTED_EMAIL] or call [REDACTED_PHONE]"},{"role":"assistant","content":"Done."}]}
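
To confirm on your own file that every line still parses after the scrub, a short check along these lines may help. It is a minimal sketch using only Python's standard library; the function name `invalid_jsonl_lines` is hypothetical and not part of the scrubber.

```python
import json


def invalid_jsonl_lines(text: str) -> list[int]:
    """Return 1-based line numbers that fail to parse as JSON (blank lines skipped)."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


# The scrubbed line from the example above.
scrubbed = (
    '{"messages":[{"role":"user","content":"resend the invoice to '
    '[REDACTED_EMAIL] or call [REDACTED_PHONE]"},'
    '{"role":"assistant","content":"Done."}]}\n'
)
print(invalid_jsonl_lines(scrubbed))  # []
```

An empty list means every non-blank line is valid JSON. Any numbers returned point at lines to inspect by hand.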

Instruction CSV — email and card removed, name kept

A typical instruction-tuning CSV. Two identifiers are tagged, but the customer name is NOT a detected category and passes straight through — the thing teams most often forget.

Input (instructions.csv):
instruction,response
"refund Jane Okafor at jane@x.io on 4111 1111 1111 1111","refunded"

Output (instructions-scrubbed.csv):
instruction,response
"refund Jane Okafor at [REDACTED_EMAIL] on [REDACTED_CARD]","refunded"

# Name 'Jane Okafor' is NOT redacted — no name detector

Synthetic example IDs survive the card pass

Prompts often embed fake 16-digit placeholders. The Luhn gate means only checksum-valid runs become [REDACTED_CARD], so your deliberately-synthetic samples aren't corrupted.

Input (prompts.txt):
Use order 1234567890123456 as the example. Real card 4242 4242 4242 4242 must be hidden.

Output (-scrubbed):
Use order 1234567890123456 as the example. Real card [REDACTED_CARD] must be hidden.

# 1234567890123456 fails Luhn → kept
# 4242424242424242 passes Luhn → redacted
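
To predict in advance which placeholders a Luhn gate would treat as card numbers, the checksum can be reproduced in a few lines. This is a sketch of the standard Luhn algorithm, not the scrubber's own code. The function name `luhn_valid` is hypothetical, and the 13–19 digit length rule mentioned elsewhere on this page is not checked here.

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum over the digits of `number`; non-digits are ignored."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("1234567890123456"))     # False
print(luhn_valid("4242 4242 4242 4242"))  # True
```

Running your synthetic IDs through a check like this before scrubbing shows which ones are likely to be tagged as cards.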

Spreadsheet of leads becomes scrubbed JSON

A leads workbook used as raw training material. The first sheet is flattened to a JSON array of row objects, scrubbed, and returned as JSON — the download keeps the .xlsx name but carries JSON text.

Input: leads.xlsx  (first sheet)
  | Email          | Mobile        |
  | kit@demo.org   | 07700 900123  |

Download: leads-scrubbed.xlsx  (CONTENT is JSON):
[
  {
    "Email": "[REDACTED_EMAIL]",
    "Mobile": "[REDACTED_PHONE]"
  }
]
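
If your loader wants CSV rather than a JSON array, the download can be converted afterwards. The sketch below uses only the standard library and assumes the array holds flat row objects like the ones shown above. The helper name `json_rows_to_csv` is hypothetical.

```python
import csv
import io
import json


def json_rows_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat row objects into CSV text."""
    rows = json.loads(json_text)
    # Collect column names in first-seen order, in case rows differ.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


download = '[{"Email": "[REDACTED_EMAIL]", "Mobile": "[REDACTED_PHONE]"}]'
print(json_rows_to_csv(download))
```

In practice you would read the text from the downloaded file (for example `leads-scrubbed.xlsx`, which holds JSON text) and write the result to a `.csv` path of your choosing.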

Coverage check before a training run

After scrubbing, grep the cleaned corpus for the tags and for stray identifiers. A non-zero tag count plus zero stray emails is your go signal; a zero count usually means the file isn't the one you thought.

# confirm tags landed (grep -c counts matching lines, not individual tags)
grep -c "\[REDACTED_" corpus.jsonl-scrubbed.txt
# -> 1428

# confirm no raw emails slipped (should be 0)
grep -Ec "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}" corpus.jsonl-scrubbed.txt
# -> 0

# names/addresses are NOT covered — handle separately
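
For a per-category breakdown rather than a single line count, the same check can be sketched in Python. The tag pattern assumes every tag has the form `[REDACTED_<UPPERCASE>]`, which matches the three tags shown on this page. The names of the other tags are not given here, so treat that as an assumption. `coverage_report` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed tag shape: [REDACTED_EMAIL], [REDACTED_PHONE], [REDACTED_CARD], ...
TAG_PATTERN = re.compile(r"\[REDACTED_([A-Z_]+)\]")
# Same email expression as the grep check above.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def coverage_report(text: str) -> tuple[Counter, int]:
    """Return (tag counts by category, number of raw email-like matches left)."""
    tags = Counter(TAG_PATTERN.findall(text))
    stray_emails = len(EMAIL_PATTERN.findall(text))
    return tags, stray_emails


sample = "mail [REDACTED_EMAIL] or [REDACTED_EMAIL], call [REDACTED_PHONE]"
tags, stray = coverage_report(sample)
print(dict(tags), stray)  # {'EMAIL': 2, 'PHONE': 1} 0
```

Unlike `grep -c`, this counts individual tags, so several tags on one line are each included.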

Edge cases and what actually happens

Customer names and addresses still reach the model

By design

There is no name, address, or date-of-birth detector — only six shape-matchable categories. Prose identifiers, mailing addresses, and quasi-identifiers pass through to your training run untouched. Treat this tool as a *direct-identifier* scrub, then run the result through the CSV/JSON Data Scrambler for field-level pseudonymisation of names and addresses.

A run reports zero items redacted

Check your input

A zero count is valid — but for a corpus you *expect* to contain PII it usually means you scrubbed the wrong file, an already-clean export, or a spreadsheet whose data lives on a sheet other than the first. Re-check the source before assuming the corpus is safe.

Fake example PANs in prompts are left intact

Preserved

The card pass requires a Luhn-valid 13–19 digit run, so synthetic example IDs that fail Luhn survive. This is what keeps your deliberately-synthetic samples usable — but it also means a real card written with a checksum-breaking typo won't be caught. Don't rely on the scrub to fix malformed real data.

Only the first file is scrubbed when you select many

First file only

The drop area accepts a multi-file selection, but a run processes only files[0]. For a directory of shards, loop them one at a time, or call the server-safe API in your pipeline. Don't assume a multi-file pick produced multi-file output — check the -scrubbed filename.
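
When looping over shards by hand, it is easy to lose track of which ones were actually scrubbed. The sketch below applies the `<name>-scrubbed.<ext>` naming rule described on this page to list shards that have no matching download yet. Both function names are hypothetical, and the shard filenames are made up for illustration.

```python
from pathlib import Path


def scrubbed_name(source: Path) -> Path:
    """Insert '-scrubbed' before the last extension, per the naming rule above."""
    return source.with_name(f"{source.stem}-scrubbed{source.suffix}")


def shards_missing_output(shards: list[Path], downloads: set[str]) -> list[Path]:
    """Return the shards whose expected scrubbed filename is not in `downloads`."""
    return [s for s in shards if scrubbed_name(s).name not in downloads]


shards = [Path("shard-000.jsonl.txt"), Path("shard-001.jsonl.txt")]
downloads = {"shard-000.jsonl-scrubbed.txt"}
missing = shards_missing_output(shards, downloads)
print([s.name for s in missing])  # ['shard-001.jsonl.txt']
```

In a real run, `downloads` could be built from the filenames in your browser's download folder.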

Spreadsheet download is JSON wearing an .xlsx name

Expected

For .xlsx/.xls/.ods inputs the first sheet is converted to a JSON array of row objects, scrubbed, and returned as that JSON — the filename keeps .xlsx but the content is JSON text, not a rebuilt workbook. If your loader expects binary Excel, save the JSON and re-import, or export to CSV before scrubbing.

Corpus is larger than your plan's text limit

Rejected — over limit

Text inputs pass through the tier size gate (Free 10 MB, Pro 100 MB, Pro + Media 500 MB, Developer 2 GB). An oversize file throws an error of the form `File "…" is N MB — exceeds the … limit for your plan.` and nothing is scrubbed. Shard the corpus or upgrade. The spreadsheet path reads the workbook directly into the tab, so huge workbooks are memory-bound instead.
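
Sharding a line-oriented corpus (JSONL or one-record-per-line text) can be done on line boundaries so that no record is cut in half. The following is a minimal sketch. `shard_lines` is a hypothetical helper, and the byte ceiling is yours to choose. The page does not say whether the tier limits count a megabyte as 10^6 or 2^20 bytes, so leaving some headroom below the limit is a reasonable precaution.

```python
from typing import Iterable, Iterator


def shard_lines(lines: Iterable[str], max_bytes: int) -> Iterator[list[str]]:
    """Yield lists of lines whose UTF-8 size stays at or under `max_bytes`.

    A single line larger than `max_bytes` is yielded alone rather than split.
    """
    shard, size = [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8"))
        if shard and size + line_bytes > max_bytes:
            yield shard
            shard, size = [], 0
        shard.append(line)
        size += line_bytes
    if shard:
        yield shard


# Tiny demonstration with a 10-byte ceiling instead of a real tier limit.
demo = ["aaaa\n", "bbbb\n", "cccc\n"]
print([len(s) for s in shard_lines(demo, max_bytes=10)])  # [2, 1]
```

For the free tier you might pass something like `max_bytes=9 * 1024 * 1024` and write each yielded list to its own `.txt` file.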

An internal user ID happens to pass Luhn

Card wins

Detectors run in a fixed order and the Luhn-gated card pass runs before phone. A 13–19 digit internal ID that coincidentally satisfies Luhn becomes [REDACTED_CARD] — over-redaction rather than a leak, but it can blank out a field you wanted to keep as a training signal. Check ID-bearing columns after a run.

Undashed SSNs in your data aren't removed

Not matched

The US SSN detector only matches the strict NNN-NN-NNNN dashed form (with SSA invalid-block rules). A nine-digit run like 123456789 is too ambiguous and is left alone. If your corpus stores SSNs without dashes, normalise them to the dashed form before scrubbing or they'll reach the model.
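
Normalising is safest when restricted to a column you already know holds SSNs, since a nine-digit run elsewhere could be anything. The sketch below rewrites a cell only when it consists of exactly nine digits. `dash_ssn` is a hypothetical helper, and choosing which column to apply it to is left to your pipeline.

```python
import re

# Exactly nine digits, split into the 3-2-4 groups of the dashed form.
UNDASHED_SSN = re.compile(r"(\d{3})(\d{2})(\d{4})")


def dash_ssn(value: str) -> str:
    """Rewrite a bare nine-digit cell as NNN-NN-NNNN; return anything else unchanged."""
    match = UNDASHED_SSN.fullmatch(value.strip())
    if match is None:
        return value
    return "-".join(match.groups())


print(dash_ssn("123456789"))    # 123-45-6789
print(dash_ssn("order 12345"))  # order 12345
```

After this pass the values have the dashed shape the detector looks for. Whether a given number is then redacted still depends on the detector's own validity rules.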

Illustrative emails in documentation get redacted too

Pattern match

Anything matching local@domain.tld is replaced, including a sample address you put in docstrings or few-shot examples on purpose. The scrubber can't tell a real contact from an intentional placeholder. In documentation-heavy corpora, review redactions so you don't lose meaningful example addresses.

You need encrypted-at-rest data, not redaction

Wrong tool

Redaction destroys the value; if instead you need the raw corpus protected in transit or at rest for a trusted partner, encrypt it with the AES-256 Encryptor (Web Crypto AES-GCM-256, PBKDF2 100k). To verify a downloaded dataset matches its published hash, use the Multi-Hash Fingerprinter.

Frequently asked questions

Does this fully anonymise my training data?

No. It removes six direct-identifier categories (email, phone, IBAN, card, US SSN, UK NI) but has no detector for names, addresses, dates of birth, or internal IDs. Those values reach your model verbatim. Treat this as the direct-identifier step, then pseudonymise the rest with the CSV/JSON Data Scrambler.

Will the redaction tags confuse my model?

The tags are stable literal strings ([REDACTED_EMAIL], [REDACTED_PHONE], etc.), so a model learns a single consistent placeholder token rather than memorising a leaked value. That's the intended outcome — a predictable mask the model can generalise around.

Is my corpus uploaded to clean it?

No. The on-page tool runs the regex passes entirely in your browser tab via local JavaScript — the corpus and every identifier in it never reach a server. That's what makes it safe to clean data you haven't yet vetted for consent or licensing. A separate opt-in API path exists for automation.

Can I scrub JSONL files?

Yes. Save them with a .txt (or .json) extension and drop them in. The scrubber treats the file as raw text rather than parsing it, and because tags drop inside existing quotes, each JSONL line stays valid JSON. The download keeps your structure intact.

Can I keep some categories and only redact others?

No. There is no options panel or per-detector toggle; all six detectors always run and the tags are fixed strings. A tag does not carry the value it replaced, so if you want some categories left intact in the final corpus, restore those values from the original file after the scrub.

Why did a fake card number in my prompts survive?

The card detector applies the Luhn checksum first. A 13–19 digit run is only replaced if it passes Luhn, so synthetic example PANs (which usually fail Luhn) are left intact — exactly what you want for deliberately-synthetic training samples. Real cards pass Luhn and are redacted.

How do I verify the scrub worked before training?

Download the -scrubbed file and grep it: count [REDACTED_ occurrences to confirm tags landed, and run the raw email/phone patterns against it to confirm zero survivors. A non-zero tag count plus zero stray identifiers is your go signal.

Does it handle international phone numbers in transcripts?

It targets common international and grouped formats — an optional +, a 1–3 digit prefix, then digit blocks split by spaces, dots, dashes, or parentheses (e.g. +44 20 7946 0958, (212) 555-0143). Separator-less or heavily localised formats may slip; scan the output for any phone style your data uses.

Can it process my whole dataset folder at once?

Not in one click. The drop area accepts a multi-file selection but a single run scrubs only the first file. For a sharded corpus, run files one at a time or call the server-safe API in a loop — each run yields one scrubbed file.

It accepts Excel — do I get Excel back for my pipeline?

It accepts .xlsx/.xls/.ods but converts the first sheet to a JSON array of row objects, scrubs that, and returns the JSON. The download keeps the .xlsx name yet holds JSON text, not a rebuilt workbook. Export your sheet to CSV before scrubbing if your loader needs binary Excel.

What's the size limit for a training corpus?

Text files are gated by the security tier: Free 10 MB, Pro 100 MB, Pro + Media 500 MB, Developer 2 GB. An oversize text file throws an exceeds the … limit for your plan error and isn't scrubbed. Shard large corpora, or upgrade. Spreadsheets are read directly and bounded by tab memory.

What about PII inside PDFs or images in my dataset?

This tool is text/CSV/JSON only. For text redaction inside a PDF use the PDF PII Redactor; to remove a signature or stamp from a document image use Signature Burner. For realistic field-level fakes across a tabular dataset, use the CSV/JSON Data Scrambler.

Privacy first

Every JAD Security operation runs entirely in your browser. Files, passwords, and PGP private keys never leave your device — verified by zero outbound network requests during processing.

