How to auto-detect and black-box PII across a PDF
- Step 1: Open the canonical PDF PII redactor. This Security entry routes through to the real engine at /pdf-tools/pdf-pii-redactor. It is Pro-tier (minTier: pro). Free accounts can run other PDF privacy tools, but this redactor needs Pro.
- Step 2: Drop a text-layer PDF. Upload a single PDF that contains a real text layer (one file at a time; acceptsMultiple: false). Born-digital exports from Word, Google Docs, accounting software, and most government forms have a text layer. Scanned or photographed pages do not, so OCR them first.
- Step 3: Let the scanner walk every page. pdfjs reads each page's getTextContent() items; pdf-lib loads the same document. For each text item, the four PII regexes run in order (email, phone, SSN, credit card), and the first match flags the item.
- Step 4: Boxes are drawn over each matched item. When an item matches, a black rectangle is drawn at that item's x/y position spanning its full width and height (plus 2 pt). One box per text item is enough, so the whole span is covered, not just the matched substring.
- Step 5: Download the redacted PDF. The result is saved as a new PDF blob and downloaded. Page count and the rest of the layout are preserved; only black boxes are added on top of matched text.
- Step 6: Make it permanent before sharing. Critical: open the downloaded file and verify with Ctrl+A → copy. If redacted text still pastes out, the glyphs are still in the stream. Flatten/rasterise the PDF (print-to-PDF as image, or a flatten tool) so the boxes become pixels and the text is gone for good.
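Step 4 can be sketched as a small planning function: take the text items for a page, keep the ones a detector flags, and turn each into one rectangle. This is a minimal sketch, not the tool's code. The `TextItemLike` shape mirrors the fields pdfjs exposes on a text item (`str`, `transform`, `width`, `height`), while `planBoxes`, `Box`, `PADDING_PT`, and the way the 2 pt is applied (added to width and height) are assumptions made for illustration.

```typescript
// Minimal shape of a pdfjs text item; only the fields this sketch reads.
interface TextItemLike {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; e and f are the item's x/y origin
  width: number;
  height: number;
}

interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Assumption: the page only says "plus 2 pt", not how the padding is applied.
const PADDING_PT = 2;

// One box per flagged item, covering the whole run rather than the matched substring.
function planBoxes(
  items: TextItemLike[],
  isSensitive: (text: string) => boolean,
): Box[] {
  const boxes: Box[] = [];
  for (const item of items) {
    if (!isSensitive(item.str)) continue;
    boxes.push({
      x: item.transform[4] ?? 0,
      y: item.transform[5] ?? 0,
      width: item.width + PADDING_PT,
      height: item.height + PADDING_PT,
    });
  }
  return boxes;
}

const demoItems: TextItemLike[] = [
  { str: "Plan: Gold PPO", transform: [1, 0, 0, 1, 72, 700], width: 80, height: 12 },
  { str: "SSN: 532-19-4471", transform: [1, 0, 0, 1, 72, 680], width: 96, height: 12 },
];

// A dashed-SSN test stands in for the full detector here.
const demoBoxes = planBoxes(demoItems, (text) => /\b\d{3}-\d{2}-\d{4}\b/.test(text));
console.log(demoBoxes);
```

Each planned box would then be handed to a drawing call such as pdf-lib's drawRectangle with a black fill. Note that the box is built from the item's own geometry, which is why the label "SSN:" in the same item ends up covered as well.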
What the redactor detects (the four built-in patterns)
These are the exact PII_PATTERNS the PDF redactor runs against each text item, in order. There are no toggles or custom patterns — the set is fixed in code. Note these differ from the richer text-scrubber set (no IBAN, no UK NI, no Luhn check, no name detection).
| PII class | What it matches | Validation | Notes / gotchas |
|---|---|---|---|
| Email | local@domain.tld: letters, digits, ._%+- in the local part; a domain with a 2+ letter TLD | Regex shape only | Catches almost all real addresses; very long or unusual TLDs are fine. No DNS/validity check |
| Phone | Optional + country code, optional area code in parentheses, then 3–4 + 3–4 digit groups separated by space, dot, or dash | Regex shape only | Deliberately loose. Can also match other digit strings shaped like phones (invoice numbers, IDs) — see edge cases |
| US SSN | NNN-NN-NNNN — exactly 3-2-4 digits with literal dashes | Format only (no SSA invalid-block exclusion here) | Requires the dashes. 123456789 (no dashes) is NOT caught as an SSN by this pattern |
| Card number | A run of 13–16 digits, optionally separated by spaces or dashes | No Luhn check in the PDF redactor | Any 13–16 digit run matches, so long order/account numbers can be flagged. Strict 19-digit cards are not the target here |
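The table above can be approximated in code. The regexes below are reconstructed from the table's descriptions only; they are assumptions, not the tool's actual PII_PATTERNS, and the real expressions may be stricter or looser. `classify` and `ASSUMED_PATTERNS` are hypothetical names. The sketch shows the "run in order, first match wins" behaviour.

```typescript
type PiiClass = "email" | "phone" | "ssn" | "card";

// Assumed regex shapes rebuilt from the table; the tool's real patterns may differ.
const ASSUMED_PATTERNS: ReadonlyArray<readonly [PiiClass, RegExp]> = [
  ["email", /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/],
  ["phone", /(?:\+\d{1,3}[\s.-]?)?(?:\(\d{1,4}\)[\s.-]?)?\d{3,4}[\s.-]\d{3,4}/],
  ["ssn", /\b\d{3}-\d{2}-\d{4}\b/],
  ["card", /\b(?:\d[ -]?){12,15}\d\b/], // 13 to 16 digits, optional space/dash separators
];

// Returns the first class whose pattern matches, or null when nothing matches.
function classify(text: string): PiiClass | null {
  for (const [label, pattern] of ASSUMED_PATTERNS) {
    if (pattern.test(text)) return label;
  }
  return null;
}

console.log(classify("Contact: dana.cole@example.org")); // email
console.log(classify("Reach us at (202) 555-0148 today")); // phone
console.log(classify("SSN: 532-19-4471")); // ssn
console.log(classify("Ref 4002881234567890")); // card
console.log(classify("Taxpayer ID: 532194471")); // null
```

With these assumed shapes, a spaced digit run such as 4002 8812 3456 7890 is claimed by the phone pattern before the card pattern gets a turn, simply because phone runs earlier. The item is boxed either way; only the label differs, and the tool's real patterns may resolve it differently.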
Redaction behaviour — what it does vs. what it does NOT do
The single most important table on this page. "Visual" means a filled rectangle is drawn over the text; the characters underneath are not removed.
| Behaviour | Reality in this tool | Why it matters |
|---|---|---|
| Redaction method | Filled black drawRectangle over the matched text item (pdf-lib), at the item's coordinates from pdfjs | It is real ink on the page, visible in every viewer — but it is an overlay, not a deletion |
| Text removal | Not removed. The glyphs stay in the content stream | Ctrl+A → copy and text-extraction tools can still recover the "redacted" text — flatten/rasterise to fix |
| Redaction granularity | Whole text item (the run pdfjs returns), not the exact matched substring | An item like "Call 555-123-4567 now" gets one box over the whole run, so adjacent words are covered too |
| Scanned / image PDFs | No text layer → zero text items → zero matches → nothing redacted | Image-only pages pass through untouched; OCR first or use a manual region tool |
| Review / preview UI | None surfaced — the tool returns the redacted PDF directly | There is no per-match confirm step or confidence list to approve; verify the output yourself |
| Options / settings | None (needsOptions: false) — patterns and box style are fixed | You cannot add a pattern, change the box colour, or redact only some classes |
Tier and file limits (PDF family)
This redactor is gated at Pro (minTier: pro) and runs through the PDF tool family, so PDF-family file/page limits apply. One file at a time.
| Tier | Max file size | Max pages | Files per run |
|---|---|---|---|
| Free | Tool gated — Pro required to run this redactor | — | — |
| Pro | 50 MB | 500 pages | 5 (this tool: 1 at a time) |
| Developer | 2 GB | 10,000 pages | Unlimited (this tool: 1 at a time) |
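The limits in this table can be expressed as a small pre-flight check. This is an illustrative sketch only: `canRun`, `LIMITS`, and the tier keys are hypothetical names, and treating MB and GB as binary units (1024-based) is an assumption the page does not confirm.

```typescript
type Tier = "free" | "pro" | "developer";

interface Limits {
  maxBytes: number;
  maxPages: number;
}

const MB = 1024 * 1024; // assumption: binary megabytes

// Values copied from the tier table; Free has no entry because the tool is gated.
const LIMITS: Record<Exclude<Tier, "free">, Limits> = {
  pro: { maxBytes: 50 * MB, maxPages: 500 },
  developer: { maxBytes: 2 * 1024 * MB, maxPages: 10000 },
};

// Returns a reason string when the run would be blocked, or null when it may proceed.
function canRun(tier: Tier, sizeBytes: number, pages: number): string | null {
  if (tier === "free") return "Pro required";
  const limit = LIMITS[tier];
  if (sizeBytes > limit.maxBytes) return "file too large";
  if (pages > limit.maxPages) return "too many pages";
  return null;
}

console.log(canRun("free", 1 * MB, 3));
console.log(canRun("pro", 60 * MB, 10));
console.log(canRun("developer", 60 * MB, 10));
```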
Cookbook
Real before/after page snippets from the kinds of documents FOIA and compliance teams redact. PII values are fabricated. "Before" is the page text; "After" shows what a viewer displays once boxes are drawn — and what copy/paste still recovers underneath.
A benefits letter with an SSN and email
Born-digital PDF from HR software — full text layer. The SSN is in dashed NNN-NN-NNNN form and the email is standard, so both are caught. Each whole text item is boxed.
Before (page text):
Member: Dana Cole
SSN: 532-19-4471
Contact: dana.cole@example.org
Plan: Gold PPO

After (what the viewer shows):
Member: Dana Cole
SSN: ███████████
Contact: ███████████████████████
Plan: Gold PPO

Verify: Ctrl+A -> copy still pastes:
SSN: 532-19-4471
Contact: dana.cole@example.org
-> flatten/rasterise to remove the text for real.
An invoice where a long account number gets boxed
The card pattern matches any 13–16 digit run. A 16-digit purchase-order or account number on an invoice will be flagged as a card. This is over-redaction, not a card leak — but it shows why you should eyeball the output.
Before:
PO Number: 4002 8812 3456 7890
Card on file: ending 0042
Amount: $1,240.00

After:
PO Number: █████████████████████
Card on file: ending 0042
Amount: $1,240.00

The PO (16 digits) matched the card pattern and was boxed. The masked 'ending 0042' was NOT (only 4 digits).
Phone pattern also covering nearby words
Redaction is per text item, not per substring. If pdfjs returns a phone number inside a longer run, the entire run is covered — useful when context itself is sensitive, surprising when it hides wanted text.
Before (single text item from pdfjs):
'Reach the case officer at (202) 555-0148 ext 6'

After:
'███████████████████████████████████████████'

The whole item is boxed because it contained a phone match, not just the digits. Reflow/copy of that item recovers all of it.
Scanned FOIA page with no text layer
A photocopied, scanned packet has only images — no text items for pdfjs to read. The auto-redactor finds nothing and the page is returned untouched. OCR it first to add a text layer, or use a manual region tool.
Input: scanned_complaint_packet.pdf (image-only)
Scan result: 0 text items -> 0 matches -> 0 boxes
Output: identical pages, no redactions.

Fix path:
1. Run OCR via /pdf-tools/pdf-ocr to add a text layer
2. Re-run this redactor, OR
3. Burn manual rectangles with /security-tools/signature-burner
Undashed SSN slips through
The SSN pattern requires the literal dashes (NNN-NN-NNNN). A 9-digit string with no separators is not matched as an SSN, and 9 digits is below the 13-digit card threshold — so it is missed entirely. Normalise SSNs to dashed form before redacting, or add a text pass.
Before:
Taxpayer ID: 532194471
SSN: 532-19-4471

After:
Taxpayer ID: 532194471 <- NOT redacted (no dashes, < 13 digits)
SSN: ███████████ <- redacted (dashed form matched)

Mitigation: search/replace IDs into dashed form first, or pre-scrub the text with /security-tools/email-phone-scrubber.
Edge cases and what actually happens
Redacted text is still copy-pasteable
By design (visual only): This is the headline caveat. The tool draws a filled rectangle over the text; it does NOT delete glyphs from the content stream. The code comment is explicit: "the glyphs underneath are still in the file's content stream." So Ctrl+A → copy, text extraction, and accessibility readers can recover the redacted values. For genuine removal, flatten or rasterise the downloaded PDF (e.g. print-to-PDF as an image) so the boxes become pixels.
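The copy-paste check can also be done programmatically once you have the text extracted from the output file. The sketch below assumes you already hold that text as a string (how you extract it is outside this example) and a list of the values you expected to be gone. `stillRecoverable` is a hypothetical helper, not part of the tool.

```typescript
// Collapse whitespace so line wraps in the extracted text do not hide a hit.
function normaliseSpaces(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Returns the subset of values that can still be found in the extracted text.
function stillRecoverable(extractedText: string, redactedValues: string[]): string[] {
  const haystack = normaliseSpaces(extractedText);
  return redactedValues.filter((value) => {
    const needle = normaliseSpaces(value);
    return needle !== "" && haystack.includes(needle);
  });
}

const pastedFromOutput = "Member: Dana Cole SSN: 532-19-4471\nContact: dana.cole@example.org";
const expectedGone = ["532-19-4471", "dana.cole@example.org"];

// Both values are still present, so the file needs flattening before release.
console.log(stillRecoverable(pastedFromOutput, expectedGone));
```

An empty result after flattening or rasterising is the signal you are looking for. A non-empty result means the glyphs are still in the content stream.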
Scanned / image-only PDF produces no redactions
No matches: Detection relies on pdfjs reading a text layer (getTextContent()). A scanned or photographed document has only images, so there are zero text items and zero matches, and the file comes back unchanged. Add a text layer with PDF OCR first, then re-run, or burn manual rectangles with signature-burner.
Whole text item is boxed, not just the matched value
Expected: Redaction granularity is one box per matched text item. If pdfjs returns a phone or email inside a longer run ("Reach us at (202) 555-0148 today"), the entire run is covered. This over-covers neighbouring words. It is intentional ("one redaction box per item is enough") and usually safer, but check the output if you needed adjacent text to stay visible.
Long account / order number flagged as a card
Over-redaction: The card pattern matches any 13–16 digit run with optional spaces/dashes and performs no Luhn check. Purchase-order numbers, account IDs, and tracking numbers in that length range get boxed even though they aren't cards. That's a false positive in the safe direction (it hides, doesn't leak), but it can obscure wanted data, so review the result.
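For reviewers triaging which boxed numbers were plausibly cards, the Luhn checksum that this redactor does not run can be applied by hand or in a few lines. The sketch below is a standard Luhn implementation offered as a review aid; `passesLuhn` is a hypothetical helper and nothing here is part of the tool.

```typescript
// Standard Luhn checksum over a 13 to 16 digit run; spaces and dashes are ignored.
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/[ -]/g, "");
  if (!/^\d{13,16}$/.test(digits)) return false;

  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    // Walk from the rightmost digit; double every second digit.
    let digit = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      digit *= 2;
      if (digit > 9) digit -= 9;
    }
    sum += digit;
  }
  return sum % 10 === 0;
}

console.log(passesLuhn("4111 1111 1111 1111")); // true: a widely used test card number
console.log(passesLuhn("4002 8812 3456 7890")); // false: the PO number from the cookbook
```

A failed checksum suggests the number was an order or account ID rather than a card, which helps explain an over-redaction. It does not make the number safe to expose, so the decision to unmask still belongs to the reviewer.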
SSN without dashes is not detected
Missed match: The SSN regex requires the dashed form NNN-NN-NNNN. A bare 532194471 is not matched as an SSN, and at 9 digits it is below the 13-digit card threshold, so it slips through entirely. Normalise IDs to dashed form before redacting, or pre-scrub the text content with email-phone-scrubber.
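One way to find the values this gap would miss is to scan the source text for bare 9-digit runs before the PDF is produced. The sketch below is an assumption-laden aid, not a tool feature: `findBareNineDigitRuns` and `toDashedSsn` are hypothetical helpers, and not every 9-digit number is an SSN, so the hits are candidates for human review rather than automatic rewriting.

```typescript
// Finds standalone 9-digit runs (word boundaries on both sides).
function findBareNineDigitRuns(text: string): string[] {
  return text.match(/\b\d{9}\b/g) ?? [];
}

// Reformats a 9-digit string into the dashed NNN-NN-NNNN shape the redactor matches.
function toDashedSsn(nineDigits: string): string {
  if (!/^\d{9}$/.test(nineDigits)) {
    throw new Error("expected exactly nine digits");
  }
  return `${nineDigits.slice(0, 3)}-${nineDigits.slice(3, 5)}-${nineDigits.slice(5)}`;
}

const sourceText = "Taxpayer ID: 532194471 SSN: 532-19-4471";
const candidates = findBareNineDigitRuns(sourceText);

console.log(candidates); // the bare run only; the dashed SSN is not a 9-digit run
console.log(candidates.map(toDashedSsn));
```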
Names, addresses, dates of birth are not redacted
Out of scope: Only four classes are detected: email, phone, SSN, and card-number runs. There is no name, address, DOB, IBAN, or UK National Insurance detection in this PDF redactor (despite a registry FAQ mentioning "name patterns"; the code does not implement that). Redact those manually, or use signature-burner for arbitrary regions.
Text split across multiple items isn't matched
Missed match: Regexes run against each text item independently. If a PDF's text engine split an email or phone across two items (john.doe@ in one item, example.com in the next), neither fragment matches and nothing is boxed. This happens with justified text and certain export pipelines. Spot-check critical pages; flatten + re-OCR can re-flow text into single items.
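The split-item miss is easy to reproduce with plain strings. The sketch below uses an assumed email regex (not the tool's actual pattern) and two hypothetical helpers: `perItemHits` mimics per-item matching, while `splitAcrossPairs` is a review aid that reports adjacent pairs whose concatenation matches even though neither item matches alone. Joining items without a separator is itself an assumption about how the fragments were split.

```typescript
// Assumed email shape; the tool's real pattern may differ.
const ASSUMED_EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

// Indices of items that match on their own, as per-item scanning would see them.
function perItemHits(items: string[]): number[] {
  const hits: number[] = [];
  items.forEach((text, index) => {
    if (ASSUMED_EMAIL.test(text)) hits.push(index);
  });
  return hits;
}

// Adjacent pairs that only match once joined: candidates for manual redaction.
function splitAcrossPairs(items: string[]): Array<[number, number]> {
  const pairs: Array<[number, number]> = [];
  for (let i = 0; i + 1 < items.length; i++) {
    const left = items[i] ?? "";
    const right = items[i + 1] ?? "";
    const aloneMatches = ASSUMED_EMAIL.test(left) || ASSUMED_EMAIL.test(right);
    if (!aloneMatches && ASSUMED_EMAIL.test(left + right)) pairs.push([i, i + 1]);
  }
  return pairs;
}

const splitItems = ["Contact john.doe@", "example.com for details"];

console.log(perItemHits(splitItems)); // no hits: neither fragment is a full address
console.log(splitAcrossPairs(splitItems)); // the pair (0, 1) matches once joined
```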
Encrypted / password-protected PDF
Loaded with ignoreEncryption: The redactor loads with ignoreEncryption: true, so many lightly-protected PDFs open and process. Strongly encrypted files (those pdfjs/pdf-lib can't parse) will error out before scanning. Remove the password first with pdf-password-protect / an unlock tool, then redact.
Box doesn't perfectly cover rotated or skewed text
Visual mismatch possible: The rectangle is drawn axis-aligned at the item's x/y with its width/height. For rotated pages or text with unusual transforms, the box may sit slightly off the glyphs. Always visually inspect the rendered output before treating any page as redacted.
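If you have access to the text items, one rough way to decide which pages deserve a closer look is to check each item's transform for rotation or skew. pdfjs text items carry a six-number transform [a, b, c, d, e, f]; when b and c are both zero the text is axis-aligned. The helper name `isAxisAligned` and the tolerance are assumptions, and this only inspects the item transform, so page-level rotation would need a separate check.

```typescript
// True when the transform has no rotation or skew component (b and c are ~0).
function isAxisAligned(transform: number[], epsilon: number = 1e-6): boolean {
  const b = transform[1] ?? 0;
  const c = transform[2] ?? 0;
  return Math.abs(b) < epsilon && Math.abs(c) < epsilon;
}

const uprightText = [12, 0, 0, 12, 72, 700]; // scaled, not rotated
const rotatedText = [0, 12, -12, 0, 72, 700]; // rotated 90 degrees

console.log(isAxisAligned(uprightText)); // true
console.log(isAxisAligned(rotatedText)); // false
```

Items that fail this check are the ones where an axis-aligned rectangle is most likely to miss part of the glyphs, so they are good candidates for visual inspection or a manual region.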
Free tier can't run this tool
Pro required: The redactor is gated at minTier: pro. On the Free tier the run is blocked before processing. Pro allows up to 50 MB / 500 pages; Developer raises that to 2 GB / 10,000 pages. The tool processes one PDF at a time.
Frequently asked questions
Is this real redaction — is the text actually removed?
No, and this is the most important thing to know. The tool draws a black rectangle over each matched text item with pdf-lib, but the underlying characters stay in the PDF content stream. The code comment says so directly. That means Ctrl+A → copy or any text-extraction tool can still recover the "redacted" values. Treat this as a fast visual pass, then flatten or rasterise the file (print-to-PDF as an image, or a flatten step) to delete the text for real before you share it.
What PII does it detect?
Four fixed patterns: emails, phone numbers, US Social Security Numbers in dashed NNN-NN-NNNN form, and runs of 13–16 digits (treated as card numbers). These are the exact PII_PATTERNS in the redactor. There is no IBAN, UK National Insurance, name, address, or date-of-birth detection in this PDF tool — that richer set lives in the text-based email-phone-scrubber.
Does it work on scanned PDFs?
No. Detection reads the PDF text layer via pdfjs. A scanned or photographed document is just images — zero text items, zero matches, nothing redacted. Run PDF OCR to add a text layer first, then re-run this redactor, or draw your own rectangles with signature-burner.
Why did it black out a whole sentence instead of just the email?
Redaction is per text item, not per substring. pdfjs returns text in runs, and if a match lands inside a longer run the whole run gets one box ("one redaction box per item is enough"). It over-covers neighbouring words, which is usually safer. If you need surrounding text visible, you'll have to redact that region manually instead.
Why was a long order number redacted as a card?
The card pattern matches any 13–16 digit run with optional spaces/dashes and does not run a Luhn check. Purchase-order numbers, account IDs, and tracking numbers in that length range get boxed too. It's a false positive in the safe direction — it hides rather than leaks — but eyeball the output if you needed those numbers to stay readable.
My SSN wasn't redacted — why?
The SSN pattern requires the literal dashes (NNN-NN-NNNN). A bare 9-digit string like 532194471 doesn't match, and 9 digits is below the 13-digit card threshold, so it's missed entirely. Reformat SSNs into dashed form before redacting, or pre-scrub the text with email-phone-scrubber first.
Can I choose which PII classes to redact, or add my own pattern?
No. The tool has no options (needsOptions: false). All four patterns always run, the box is always black, and you can't add a custom pattern or change the colour. If you need configurable masking with [REDACTED_*] labels, use the text-based email-phone-scrubber or csv-json-data-scrambler.
Is there a review step before I download?
No per-match review or confidence list is surfaced on this path — the tool returns the redacted PDF directly. You should open the result yourself, page through it, and verify (including a copy-paste check) before treating any document as redacted.
Does the file get uploaded anywhere?
No. The whole pipeline runs in your browser — pdfjs reads the pages, pdf-lib draws the boxes, and the result is saved locally. The PDF and its contents never leave your device, which is what makes it usable for HIPAA, GDPR, and FOIA source material.
What file size and page limits apply?
The redactor is gated at Pro. Pro allows up to 50 MB and 500 pages per PDF; Developer raises that to 2 GB and 10,000 pages. It processes one file at a time (acceptsMultiple: false). Free accounts can't run this particular tool.
How do I make the redaction permanent for a FOIA release?
Run this tool to place the boxes, then flatten or rasterise the output so the glyphs are destroyed: print-to-PDF as an image, or use a flatten tool, so each page becomes pixels with no recoverable text. Re-verify with copy-paste afterward — if nothing pastes from the boxed areas, the text is gone. Only then is the document safe to release.
What if my data is in a CSV, JSON, or plain text file instead of a PDF?
Use the text-native siblings. email-phone-scrubber replaces PII in pasted text or .txt with [REDACTED_*] labels (and supports a richer set including IBAN and UK NI), and csv-json-data-scrambler handles structured rows. Those genuinely remove/replace the values rather than covering them, since text formats have no glyph-layer problem.
Privacy first
Every JAD Security operation runs entirely in your browser. Files, passwords, and PGP private keys never leave your device — verified by zero outbound network requests during processing.