How to strip emails, phones & financial PII out of LLM training data locally
- Step 1: Export your corpus to a text format. Dump your training data to CSV, JSON, JSONL-as-.txt, or Markdown, whichever your pipeline ingests. The scrubber reads raw text, so column names and JSON keys are irrelevant; only the *content* of cells and values is inspected. If your data lives in a spreadsheet, .xlsx/.xls/.ods is accepted directly (first sheet only).
- Step 2: Drop the file onto the scrubber. Use the drop area above (it accepts .csv, .json, .txt, .md plus .xlsx/.xls/.ods). There is no paste box for this tool; it works on an uploaded file. The pass runs entirely in-browser; the raw corpus is never transmitted, so even pre-consent or licence-uncertain data stays local.
- Step 3: Run the scrubber. Press Run Email/Phone Scrubber. There are no toggles: the six detectors always fire in a fixed order (email, IBAN, card, SSN, NI, then phone last). The IBAN and card passes apply their checksum filter, so only valid IBANs and Luhn-valid cards are replaced; everything else is left as written.
- Step 4: Review the redacted corpus and count. The full scrubbed file appears in a scrollable panel, and a metrics line reports items redacted, bytes in/out, and run time. Skim it and confirm the [REDACTED_*] tokens landed in the fields you expected; a sample column you *want* to keep verbatim shouldn't be full of redaction tags.
- Step 5: Download the cleaned dataset. Use Download to save the file; it's named after the source with a -scrubbed suffix (corpus.jsonl.txt → corpus.jsonl-scrubbed.txt, data.csv → data-scrubbed.csv). For a spreadsheet input the download keeps the .xlsx name but the content is the scrubbed JSON array of the first sheet; re-export to your loader's format if it expects binary Excel.
- Step 6: Add a free-text pass for what regex can't catch. The scrubber only matches shaped identifiers. Names, mailing addresses, dates of birth, internal IDs, and quasi-identifiers written in prose are not removed and will reach your model. For field-level pseudonymisation that swaps names/addresses for plausible fakes, run the output through the CSV/JSON Data Scrambler afterwards.
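Step 3 says the IBAN pass only replaces checksum-valid values. The usual IBAN checksum is the ISO 13616 mod-97 test, and the Python sketch below shows that test so you can predict which strings in your corpus are likely to be tagged. It is an assumption about what a filter of this kind checks, not the tool's own code; `iban_checksum_ok` is a hypothetical helper name, and country-specific lengths are not validated.

```python
import re

def iban_checksum_ok(candidate: str) -> bool:
    """Return True if the string passes the ISO 13616 mod-97 check.

    Illustrative only: country-specific lengths are not validated.
    """
    iban = re.sub(r"\s+", "", candidate).upper()
    # Two letters, two check digits, then 11-30 alphanumerics (15-34 chars total).
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    # Move the country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # Letters become numbers (A=10 ... Z=35); digits stay as they are.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1

print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_checksum_ok("GB83 WEST 1234 5698 7654 32"))  # False
```

A value that fails this test would, per Step 3, be left as written, so a mistyped real IBAN can survive the scrub.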
What survives a training-data scrub — and what doesn't
The six detectors handle direct identifiers with a fixed shape. Quasi-identifiers and free-text PII pass through, which is the single biggest gotcha for ML teams treating this as a complete anonymiser.
| PII type in your corpus | Detector | Removed? | Note for training |
|---|---|---|---|
| Customer email in a transcript | Email (pattern) | Yes → [REDACTED_EMAIL] | Catches plus-addressing and subdomains; no DNS check |
| Phone number in chat text | Phone (pattern, runs last) | Yes → [REDACTED_PHONE] | Common grouped/E.164 styles; odd local formats may slip |
| Real card number in a refund note | Credit card (Luhn-gated) | Yes → [REDACTED_CARD] | Only if it passes Luhn — fake example PANs survive |
| Customer name in prose | — none — | No | No name detector; reaches the model verbatim |
| Mailing address in a field | — none — | No | Free-text addresses are not shape-matchable here |
| Internal account / user ID | — none — | No | Unless it happens to be a Luhn-valid 13–19 digit run |
Inputs, output, and tier limits
The scrubber is a free-tier tool; only the file-size ceiling changes by plan. The text path is gated by that ceiling, so chunk very large corpora before scrubbing.
| Aspect | Behaviour |
|---|---|
| Accepted input | .csv, .json, .md, .txt read as text; .xlsx/.xls/.ods (first sheet flattened to a JSON array first) |
| Options | None — no panel, no per-category toggle, no custom mask. All six detectors always run |
| Output | Scrubbed text in the original format with [REDACTED_*] tags in place; downloaded as <name>-scrubbed.<ext> |
| Multiple files | The drop area accepts several, but only the first file is scrubbed per run — loop the rest |
| Where it runs | 100% in your browser tab; a server-safe API path returns { output, redactedCount, counts } for pipeline automation |
| File-size limit | Free 10 MB · Pro 100 MB · Pro + Media 500 MB · Developer 2 GB (oversize text throws an exceeds the … limit error) |
Cookbook
Before/after fragments aimed at training corpora — JSONL prompt/completion rows, instruction pairs, and spreadsheet leads. All PII values shown are fabricated examples.
JSONL prompt/completion pair stays valid JSON
A line from a fine-tuning file where the user message contains an email and a phone. Because tags drop inside the existing quotes, each line stays parseable for your loader.
Input (one .jsonl line):
{"messages":[{"role":"user","content":"resend the invoice to mia@acme.io or call 0207 946 0958"},{"role":"assistant","content":"Done."}]}
Output (-scrubbed):
{"messages":[{"role":"user","content":"resend the invoice to [REDACTED_EMAIL] or call [REDACTED_PHONE]"},{"role":"assistant","content":"Done."}]}
Instruction CSV: email and card removed, name kept
A typical instruction-tuning CSV. Two identifiers are tagged, but the customer name is NOT a detected category and passes straight through — the thing teams most often forget.
Input (instructions.csv):
instruction,response
"refund Jane Okafor at jane@x.io on 4111 1111 1111 1111","refunded"
Output (instructions-scrubbed.csv):
instruction,response
"refund Jane Okafor at [REDACTED_EMAIL] on [REDACTED_CARD]","refunded"
# Name 'Jane Okafor' is NOT redacted: no name detector
Synthetic example IDs survive the card pass
Prompts often embed fake 16-digit placeholders. The Luhn gate means only checksum-valid runs become [REDACTED_CARD], so your deliberately-synthetic samples aren't corrupted.
Input (prompts.txt):
Use order 1234567890123456 as the example.
Real card 4242 4242 4242 4242 must be hidden.
Output (-scrubbed):
Use order 1234567890123456 as the example.
Real card [REDACTED_CARD] must be hidden.
# 1234567890123456 fails Luhn → kept
# 4242424242424242 passes Luhn → redacted
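To predict which digit runs in your own prompts will trip the card pass, you can run the Luhn checksum yourself. This is a minimal Python sketch of the standard algorithm with the 13–19 digit length bound described on this page; `luhn_ok` is a hypothetical helper, not the tool's code.

```python
import re

def luhn_ok(number: str) -> bool:
    """Luhn checksum over the digits in `number`; non-digit characters are ignored."""
    digits = [int(ch) for ch in re.sub(r"\D", "", number)]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit and double every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_ok("1234567890123456"))     # False: kept as written
print(luhn_ok("4242 4242 4242 4242"))  # True: would become [REDACTED_CARD]
```

Roughly one in ten random digit runs passes Luhn by chance, so a synthetic placeholder is only safe from the card pass if you have checked that it fails.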
Spreadsheet of leads becomes scrubbed JSON
A leads workbook used as raw training material. The first sheet is flattened to a JSON array of row objects, scrubbed, and returned as JSON — the download keeps the .xlsx name but carries JSON text.
Input: leads.xlsx (first sheet)
| Email | Mobile |
| kit@demo.org | 07700 900123 |
Download: leads-scrubbed.xlsx (CONTENT is JSON):
[
{
"Email": "[REDACTED_EMAIL]",
"Mobile": "[REDACTED_PHONE]"
}
]
Coverage check before a training run
After scrubbing, grep the cleaned corpus for the tags and for stray identifiers. A non-zero tag count plus zero stray emails is your go signal; a zero count usually means the file isn't the one you thought.
# confirm tags landed
grep -c "\[REDACTED_" corpus-scrubbed.jsonl.txt
# -> 1428
# confirm no raw emails slipped (should be 0)
grep -Ec "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}" corpus-scrubbed.jsonl.txt
# -> 0
# names/addresses are NOT covered: handle separately
Edge cases and what actually happens
Customer names and addresses still reach the model
By design: There is no name, address, or date-of-birth detector, only six shape-matchable categories. Prose identifiers, mailing addresses, and quasi-identifiers pass through to your training run untouched. Treat this tool as a *direct-identifier* scrub, then run the result through the CSV/JSON Data Scrambler for field-level pseudonymisation of names and addresses.
A run reports zero items redacted
Check your input: A zero count is valid, but for a corpus you *expect* to contain PII it usually means you scrubbed the wrong file, an already-clean export, or a spreadsheet whose data lives on a sheet other than the first. Re-check the source before assuming the corpus is safe.
Fake example PANs in prompts are left intact
Preserved: The card pass requires a Luhn-valid 13–19 digit run, so synthetic example IDs that fail Luhn survive. This is what keeps your deliberately-synthetic samples usable, but it also means a real card written with a checksum-breaking typo won't be caught. Don't rely on the scrub to fix malformed real data.
Only the first file is scrubbed when you select many
First file only: The drop area accepts a multi-file selection, but a run processes only files[0]. For a directory of shards, loop them one at a time, or call the server-safe API in your pipeline. Don't assume a multi-file pick produced multi-file output; check the -scrubbed filename.
Spreadsheet download is JSON wearing an .xlsx name
Expected: For .xlsx/.xls/.ods inputs the first sheet is converted to a JSON array of row objects, scrubbed, and returned as that JSON. The filename keeps .xlsx but the content is JSON text, not a rebuilt workbook. If your loader expects binary Excel, save the JSON and re-import, or export to CSV before scrubbing.
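If the JSON-in-.xlsx download has to feed a tabular loader, one option is to convert it to CSV locally. The Python sketch below assumes the download is a JSON array of flat row objects, as described above; `scrubbed_json_to_csv` and the file names in the usage note are hypothetical.

```python
import csv
import json
from pathlib import Path

def scrubbed_json_to_csv(download: Path, csv_path: Path) -> int:
    """Read a '-scrubbed.xlsx' download that actually holds JSON text and write CSV.

    Assumes the content is a JSON array of flat row objects.
    Returns the number of rows written.
    """
    rows = json.loads(download.read_text(encoding="utf-8"))
    # Collect column names in first-seen order, in case rows differ.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For the leads example, `scrubbed_json_to_csv(Path("leads-scrubbed.xlsx"), Path("leads-scrubbed.csv"))` would write a header row plus one data row of redaction tags.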
Corpus is larger than your plan's text limit
Rejected (over limit): Text inputs pass through the tier size gate (Free 10 MB, Pro 100 MB, Pro + Media 500 MB, Developer 2 GB). An oversize file throws File "…" is N MB — exceeds the … limit for your plan. and nothing is scrubbed. Shard the corpus or upgrade. The spreadsheet path reads the workbook directly into the tab, so huge workbooks are memory-bound instead.
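Sharding a line-oriented corpus such as JSONL is safe as long as you split on line boundaries. A minimal Python sketch follows; the 9 MiB default is an assumed safety margin under the Free tier's 10 MB ceiling (the page does not say whether MB means 10^6 or 2^20 bytes), and both `shard_by_size` and the `.partNNNN` naming scheme are hypothetical.

```python
from pathlib import Path

def shard_by_size(source: Path, out_dir: Path, max_bytes: int = 9 * 1024 * 1024) -> list[Path]:
    """Split a line-oriented file into shards of at most `max_bytes`, on line boundaries.

    A single line longer than `max_bytes` still gets its own (oversize) shard.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    shards: list[Path] = []
    buffer: list[bytes] = []
    size = 0

    def flush() -> None:
        nonlocal buffer, size
        if not buffer:
            return
        # Hypothetical naming: corpus.jsonl.txt -> corpus.jsonl.part0001.txt
        path = out_dir / f"{source.stem}.part{len(shards) + 1:04d}{source.suffix}"
        path.write_bytes(b"".join(buffer))
        shards.append(path)
        buffer, size = [], 0

    with source.open("rb") as handle:
        for line in handle:
            # Start a new shard before this line would push us over the limit.
            if buffer and size + len(line) > max_bytes:
                flush()
            buffer.append(line)
            size += len(line)
    flush()
    return shards
```

Each shard can then be scrubbed in its own run and the -scrubbed outputs concatenated in order.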
An internal user ID happens to pass Luhn
Card wins: Detectors run in a fixed order and the Luhn-gated card pass runs before phone. A 13–19 digit internal ID that coincidentally satisfies Luhn becomes [REDACTED_CARD]. That is over-redaction rather than a leak, but it can blank out a field you wanted to keep as a training signal. Check ID-bearing columns after a run.
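One way to check ID-bearing columns is to count redaction tags per column in the scrubbed CSV and look for tags where you expected none. This Python sketch is illustrative: `tags_per_column` is a hypothetical helper, and the column names and rows in `sample` are made up.

```python
import csv
import io
import re
from collections import Counter

TAG = re.compile(r"\[REDACTED_[A-Z_]+\]")

def tags_per_column(csv_text: str) -> dict[str, Counter]:
    """Count redaction tags per column of a scrubbed CSV (header row expected)."""
    counts: dict[str, Counter] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if not isinstance(value, str):
                continue  # short rows (None) or overflow cells (list) are skipped
            for tag in TAG.findall(value):
                counts.setdefault(column, Counter())[tag] += 1
    return counts

sample = (
    "user_id,note\n"
    "[REDACTED_CARD],refund sent to [REDACTED_EMAIL]\n"
    "88231,call back on [REDACTED_PHONE]\n"
)
for column, counter in tags_per_column(sample).items():
    print(column, dict(counter))
# user_id {'[REDACTED_CARD]': 1}
# note {'[REDACTED_EMAIL]': 1, '[REDACTED_PHONE]': 1}
```

A [REDACTED_CARD] count in an ID column, as in the first row here, is the over-redaction case described above.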
Undashed SSNs in your data aren't removed
Not matched: The US SSN detector only matches the strict NNN-NN-NNNN dashed form (with SSA invalid-block rules). A nine-digit run like 123456789 is too ambiguous and is left alone. If your corpus stores SSNs without dashes, normalise them to the dashed form before scrubbing or they'll reach the model.
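The normalisation step can be sketched in Python as below. The `STRICT_SSN` pattern is an assumption about what a strict dashed matcher with SSA invalid-block rules looks like (area 000, 666 and 900–999, group 00 and serial 0000 rejected); it is not the tool's actual regex, and `dash_ssn` is a hypothetical helper.

```python
import re

# Strict dashed form; rejects area 000, 666 and 900-999, group 00, serial 0000.
STRICT_SSN = re.compile(
    r"(?<!\d)(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}(?!\d)"
)

def dash_ssn(value: str) -> str:
    """Rewrite a bare nine-digit value as NNN-NN-NNNN; leave anything else alone.

    Apply this only to fields known to hold SSNs, because any
    nine-digit value would be rewritten.
    """
    stripped = value.strip()
    if re.fullmatch(r"\d{9}", stripped):
        return f"{stripped[:3]}-{stripped[3:5]}-{stripped[5:]}"
    return value

for raw in ["123456789", "123-45-6789", "000-12-3456"]:
    normalised = dash_ssn(raw)
    print(raw, "->", normalised, bool(STRICT_SSN.search(normalised)))
# 123456789 -> 123-45-6789 True
# 123-45-6789 -> 123-45-6789 True
# 000-12-3456 -> 000-12-3456 False
```

Restricting the rewrite to a known SSN column avoids turning order numbers or other nine-digit IDs into SSN-shaped strings that would then be redacted.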
Illustrative emails in documentation get redacted too
Pattern match: Anything matching local@domain.tld is replaced, including a sample address you put in docstrings or few-shot examples on purpose. The scrubber can't tell a real contact from an intentional placeholder. In documentation-heavy corpora, review redactions so you don't lose meaningful example addresses.
You need encrypted-at-rest data, not redaction
Wrong tool: Redaction destroys the value; if instead you need the raw corpus protected in transit or at rest for a trusted partner, encrypt it with the AES-256 Encryptor (Web Crypto AES-GCM-256, PBKDF2 100k). To verify a downloaded dataset matches its published hash, use the Multi-Hash Fingerprinter.
Frequently asked questions
Does this fully anonymise my training data?
No — it removes six direct-identifier categories (email, phone, IBAN, card, US SSN, UK NI) but has no detector for names, addresses, dates of birth, or internal IDs. Those quasi-identifiers reach your model verbatim. Treat this as the direct-identifier step, then pseudonymise the rest with the CSV/JSON Data Scrambler.
Will the redaction tags confuse my model?
The tags are stable literal strings ([REDACTED_EMAIL], [REDACTED_PHONE], etc.), so a model sees one consistent placeholder string per category rather than memorising a leaked value. That's the intended outcome: a predictable mask the model can generalise around.
Is my corpus uploaded to clean it?
No. The on-page tool runs the regex passes entirely in your browser tab via local JavaScript — the corpus and every identifier in it never reach a server. That's what makes it safe to clean data you haven't yet vetted for consent or licensing. A separate opt-in API path exists for automation.
Can I scrub JSONL files?
Yes. Save them with a .txt (or .json) extension and drop them in. The scrubber treats the file as raw text, so every line is processed without any JSONL-specific handling, and because tags drop inside existing quotes, each JSONL line stays valid JSON. The download keeps your structure intact.
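It is still worth confirming that every line parses before handing the file to a loader. A minimal Python sketch; `invalid_jsonl_lines` is a hypothetical helper and the second sample line is deliberately truncated to show a failure.

```python
import json

def invalid_jsonl_lines(text: str) -> list[int]:
    """Return 1-based line numbers that fail to parse as JSON (blank lines skipped)."""
    bad = []
    # Split on "\n" only, so separators that are legal inside JSON strings are left alone.
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad

scrubbed = (
    '{"messages":[{"role":"user","content":"resend the invoice to [REDACTED_EMAIL] or call [REDACTED_PHONE]"}]}\n'
    '{"messages":[{"role":"user","content":"truncated line\n'
)
print(invalid_jsonl_lines(scrubbed))  # [2]
```

An empty list means every non-blank line is still valid JSON after the scrub.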
Can I keep some categories and only redact others?
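No. There is no options panel or per-detector toggle: all six detectors always run and the tags are fixed strings. If you only want certain tags in the final corpus, find-and-replace the unwanted tag strings in a text editor after the scrub. The scrubbed file no longer contains the original values, so this changes the tag text only; it cannot bring the redacted values back.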
Why did a fake card number in my prompts survive?
The card detector applies the Luhn checksum first. A 13–19 digit run is only replaced if it passes Luhn, so synthetic example PANs (which usually fail Luhn) are left intact — exactly what you want for deliberately-synthetic training samples. Real cards pass Luhn and are redacted.
How do I verify the scrub worked before training?
Download the -scrubbed file and grep it: count [REDACTED_ occurrences to confirm tags landed, and run the raw email/phone patterns against it to confirm zero survivors. A non-zero tag count plus zero stray identifiers is your go signal.
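The same check can be scripted so it reports a per-category breakdown. This Python sketch reuses the email pattern from the grep check in the Cookbook; `coverage_report` and `ready_for_training` are hypothetical helpers, and the sample string is fabricated. Names and addresses are still outside its scope.

```python
import re
from collections import Counter

TAG = re.compile(r"\[REDACTED_([A-Z_]+)\]")
# Same email pattern as the grep check in the Cookbook.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def coverage_report(text: str) -> dict:
    """Summarise tags per category and any email-shaped strings still present."""
    return {
        "tags": dict(Counter(TAG.findall(text))),
        "stray_emails": EMAIL.findall(text),
    }

def ready_for_training(report: dict) -> bool:
    # Go signal: at least one tag landed and no stray emails remain.
    return bool(report["tags"]) and not report["stray_emails"]

sample = "mail [REDACTED_EMAIL], phone [REDACTED_PHONE], cc bob@example.com"
report = coverage_report(sample)
print(report["tags"])              # {'EMAIL': 1, 'PHONE': 1}
print(report["stray_emails"])      # ['bob@example.com']
print(ready_for_training(report))  # False
```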
Does it handle international phone numbers in transcripts?
It targets common international and grouped formats — an optional +, a 1–3 digit prefix, then digit blocks split by spaces, dots, dashes, or parentheses (e.g. +44 20 7946 0958, (212) 555-0143). Separator-less or heavily localised formats may slip; scan the output for any phone style your data uses.
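To scan the output for phone styles that slipped, a loose digit-run finder is often enough to surface candidates for manual review. The Python sketch below is deliberately broader than any phone detector and will also flag long order numbers; the `CANDIDATE` pattern and the seven-digit threshold are assumptions, not the tool's rules.

```python
import re

# A digit, then at least five digits or common separators, then a closing digit.
CANDIDATE = re.compile(r"\+?\d[\d ().-]{5,}\d")

def leftover_digit_runs(text: str, min_digits: int = 7) -> list[str]:
    """Return digit-heavy runs in scrubbed text that deserve a manual look."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if len(digits) >= min_digits:
            hits.append(match.group())
    return hits

scrubbed = "ring [REDACTED_PHONE] or 02079460958, ref 12-34"
print(leftover_digit_runs(scrubbed))  # ['02079460958']
```

Here the separator-less number survives the scrub and is flagged, while the short reference 12-34 is ignored.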
Can it process my whole dataset folder at once?
Not in one click. The drop area accepts a multi-file selection but a single run scrubs only the first file. For a sharded corpus, run files one at a time or call the server-safe API in a loop — each run yields one scrubbed file.
It accepts Excel — do I get Excel back for my pipeline?
It accepts .xlsx/.xls/.ods but converts the first sheet to a JSON array of row objects, scrubs that, and returns the JSON. The download keeps the .xlsx name yet holds JSON text, not a rebuilt workbook. Export your sheet to CSV before scrubbing if your loader needs binary Excel.
What's the size limit for a training corpus?
Text files are gated by the security tier: Free 10 MB, Pro 100 MB, Pro + Media 500 MB, Developer 2 GB. An oversize text file throws an exceeds the … limit for your plan error and isn't scrubbed. Shard large corpora, or upgrade. Spreadsheets are read directly and bounded by tab memory.
What about PII inside PDFs or images in my dataset?
This tool is text/CSV/JSON only. For text redaction inside a PDF use the PDF PII Redactor; to remove a signature or stamp from a document image use Signature Burner. For realistic field-level fakes across a tabular dataset, use the CSV/JSON Data Scrambler.
Privacy first
Every JAD Security operation runs entirely in your browser. Files, passwords, and PGP private keys never leave your device — verified by zero outbound network requests during processing.