How to redact email addresses from a PDF
- Step 1: Confirm the PDF has a real text layer. The redactor reads text via pdf.js getTextContent(). Select an email in your reader: if it highlights, there is a text layer. If the page is a scan (nothing selects), run PDF OCR first to add a text layer, then come back.
- Step 2: Open the redactor and drop the PDF. Load the file into the PDF PII Redactor. Processing happens in your browser; 0 bytes are uploaded. Note that there is no options panel for this tool.
- Step 3: Let it auto-run. The tool runs immediately on drop, and the email, phone, SSN, and credit-card patterns all fire together. You cannot restrict it to emails only; the other patterns simply find nothing in an email-only document, or box additional matches if present.
- Step 4: Download the redacted PDF. The result panel shows output size and page count. Click Download to save yourfile.pii-redactor.pdf. There is no on-screen list of how many addresses were boxed.
- Step 5: Verify coverage in a reader. Open the output and try to select text under each black box. Because the box covers the whole text run, check that no fragment of an address peeks out at the edges, and that wrapped addresses split across two lines are both covered.
- Step 6: Flatten to make it unrecoverable. The glyphs still exist beneath the box. To destroy them, rasterise via PDF to PNG then Image to PDF, or open the result in PDF Flatten. Then re-verify that copy-paste yields nothing.
What the email pattern matches (and misses)
The single email regex is [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}, applied to each pdf.js text item. It covers the common ASCII address shapes but not every valid address.
| Address in the PDF | Matched? | Why |
|---|---|---|
| jane.doe@example.com | Yes | Local part, @, domain, 2+ letter TLD: the canonical case |
| info+newsletter@sub.domain.co.uk | Yes | + and . are in the local-part class; multi-label domains match |
| J.Smith@NHS.NET | Yes | The character classes list both A-Z and a-z, so upper-case addresses match |
| user@localhost | No | No dot-TLD, so the \.[A-Za-z]{2,} tail fails; bare hostnames are skipped |
| "odd name"@example.com | No | The closing quote sits directly before the @ and is not in the local-part class, so the pattern finds no local part to match |
| josé@example.com | No | The accented é sits directly before the @ and is outside [A-Za-z0-9._%+-], so the pattern finds no local part. When the accent falls earlier (jürgen@example.com), the ASCII tail rgen@example.com matches and the item is boxed |
| Address split across two lines by wrapping | Per-line | pdf.js emits each line as a separate item; only a line that contains a full pattern on its own is boxed (see edge cases) |
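The outcomes in the table can be checked directly against the pattern. The following is a minimal sketch, not the tool's source: the `g` flag and the helper name `emailMatches` are assumptions made for illustration, and `jürgen@example.com` is an extra sample showing how an accent earlier in the local part leaves only an ASCII tail to match.

```javascript
// Pattern quoted in this article. The "g" flag is an assumption so that
// every match in a string is returned, not only the first.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical helper: return every substring of `text` the pattern matches.
function emailMatches(text) {
  return text.match(EMAIL) ?? [];
}

const samples = [
  "jane.doe@example.com",
  "info+newsletter@sub.domain.co.uk",
  "J.Smith@NHS.NET",
  "user@localhost",
  "jürgen@example.com",
];

// Print each sample next to whatever the pattern extracts from it.
for (const sample of samples) {
  console.log(sample, "->", emailMatches(sample));
}
```

Running the loop on your own sample addresses is a quick way to predict which ones will need manual review before you trust the redacted output.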
Redaction behaviour vs. expectations
Read this before you treat the output as safe. The behaviour is identical for all four PII patterns; only the regex differs.
| Aspect | What actually happens | Implication |
|---|---|---|
| Box scope | The whole pdf.js text item containing the match is boxed (x, y, item width, item height + 2pt), not just the address substring | Surrounding words on the same run get covered too — usually fine, occasionally over-redacts a label |
| Underlying text | The original glyphs remain in the content stream beneath the black rectangle | Copy-paste and text-extraction tools can still recover the address until you flatten / rasterise |
| Match reporting | The engine counts matches internally but the processor discards the count | The UI shows no '12 emails redacted' figure — verify visually |
| Trigger | Runs automatically on file drop; all four patterns fire | You cannot scope it to emails only; expect phone/SSN/card matches to be boxed too if present |
| Encryption | pdf.js needs to read the text; an encrypted PDF without its password fails to parse | Unlock first with PDF Unlock or Remove Password |
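The box-scope row can be pictured as a filter over text items followed by a rectangle per matching item. This is a sketch under assumptions: the items are shaped like pdf.js text items (`str`, `transform`, `width`, `height`), x and y are read from `transform[4]` and `transform[5]`, and the helper name `boxesForItems` is hypothetical. How the tool itself derives the coordinates is not stated in this article.

```javascript
// Pattern quoted in this article (no "g" flag, so test() keeps no state).
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

// Hypothetical helper: one rectangle per text item that contains a match.
// The rectangle spans the whole item, not just the matched substring.
function boxesForItems(items) {
  return items
    .filter((item) => EMAIL.test(item.str))
    .map((item) => ({
      x: item.transform[4], // assumed: item origin x in page units
      y: item.transform[5], // assumed: item origin y in page units
      width: item.width,
      height: item.height + 2, // the 2pt padding described in the table
    }));
}

// Illustrative items; the numbers are made up for the example.
const items = [
  { str: "Please contact ", transform: [12, 0, 0, 12, 72, 700], width: 80, height: 12 },
  { str: "jane.doe@example.com", transform: [12, 0, 0, 12, 152, 700], width: 118, height: 12 },
  { str: " for access.", transform: [12, 0, 0, 12, 270, 700], width: 60, height: 12 },
];

console.log(boxesForItems(items));
```

Because the rectangle is sized from the item rather than the match, an item such as "Contact: jane.doe@example.com" would be covered in full, label included.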
Cookbook
Concrete before/after cases for email-heavy documents. 'Box' means an opaque black rectangle over the text item; 'recoverable' means the glyphs still extract until you flatten.
A staff directory line
A directory row where the address sits in its own text run gets cleanly boxed. Because the run is just the address, nothing extra is covered.
    Before (page text item): jane.doe@example.com
    After (visual):          [ ████████████████ ]
    Under the box (extract): jane.doe@example.com   <- still there until flattened
Address embedded in a sentence
When the address shares a text run with surrounding words, the whole run is boxed and you lose the sentence fragment too, which is usually acceptable for a redaction. When the layout splits the sentence into separate items, as in this example, only the address item is boxed.
    Before:     Please contact jane.doe@example.com for access.
    Text items: ["Please contact "]["jane.doe@example.com"][" for access."]
    After:      Please contact [ ██████████████ ] for access.
    (only the address item is boxed; layout decides item boundaries)
Wrapped address split across two lines
A long address that wraps becomes two text items. Each line is tested independently, so a fragment that does not itself form a full user@domain.tld is left visible.
    Visible in PDF: a.very.long.name@
                    department.example.com
    pdf.js items:   ["a.very.long.name@"] ["department.example.com"]
    Neither line alone is a full pattern -> NEITHER is boxed.
    Fix: review wrapped addresses manually before sharing.
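Manual review of wrapped addresses can be narrowed down with a small script. The sketch below is a hypothetical review aid, not part of the redactor: it takes the text of consecutive lines (for example, the `str` of consecutive pdf.js items) and flags pairs where neither line matches alone but the two joined together do. The function name `findWrappedCandidates` is an assumption.

```javascript
// Pattern quoted in this article (no "g" flag, so test() keeps no state).
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;

// Hypothetical review aid: flag adjacent lines that only form an address
// once they are joined, which is the case the per-line test misses.
function findWrappedCandidates(lines) {
  const flagged = [];
  for (let i = 0; i + 1 < lines.length; i++) {
    const first = lines[i].trimEnd();
    const second = lines[i + 1].trimStart();
    const joined = first + second;
    if (!EMAIL.test(first) && !EMAIL.test(second) && EMAIL.test(joined)) {
      flagged.push({ index: i, joined });
    }
  }
  return flagged;
}

const lines = ["a.very.long.name@", "department.example.com"];
console.log(findWrappedCandidates(lines));
```

The check only catches wraps where neither half matches on its own. A wrap such as jane@example.co followed by m.uk would still box the first line and leave the second visible, so it does not replace a visual pass.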
Mixed PII on the same page
Because all four patterns fire together, an email document that also contains a phone number gets both boxed in the single pass — there is no way to box only the email.
    Page text: Email jane.doe@example.com / Tel 020 7946 0991
    After:     Email [ ██████████ ] / Tel [ ████████ ]
    (email AND phone patterns matched; both items boxed)
Making it unrecoverable
This is the destroy-the-glyphs step that the tool itself does not perform. Rasterise to images then rebuild, or flatten, then confirm extraction returns nothing.
1. pdf-pii-redactor -> visual boxes, glyphs still present
2. pdf-to-png -> each page becomes a flat image
3. image-to-pdf -> rebuild a glyph-free PDF
4. pdf-to-text -> should now return NO email addresses
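The final check in that pipeline can be automated once you have the extracted text. A minimal sketch, assuming `pageTexts` is an array with one string per page from whatever extractor you trust (for instance, the joined `str` values returned by pdf.js getTextContent()); the function name `survivingEmails` is hypothetical.

```javascript
// Pattern quoted in this article; "g" is required by matchAll().
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical check: list every address still extractable, with its
// 1-based page number. An empty result is what a flattened file should give.
function survivingEmails(pageTexts) {
  const hits = [];
  pageTexts.forEach((text, pageIndex) => {
    for (const match of text.matchAll(EMAIL)) {
      hits.push({ page: pageIndex + 1, address: match[0] });
    }
  });
  return hits;
}

// Text extracted from a boxed-but-not-flattened file still contains the address.
console.log(survivingEmails(["Email jane.doe@example.com / Tel 020 7946 0991", ""]));
```

An empty list only proves that no address matching this pattern survives. Addresses the pattern misses in the first place, such as bare hostnames, still need a manual look.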
Edge cases and what actually happens
Scanned PDF with no text layer
0 matches: The redactor reads text via pdf.js getTextContent(). A photographed or scanned page has no extractable text, so the email pattern finds nothing and no boxes are drawn. Run PDF OCR first to add a text layer, then redact.
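A quick pre-check can tell a scan apart from a page with a text layer before you rely on the output. This is a sketch, assuming `items` has the shape of the `items` array that pdf.js getTextContent() resolves to; the helper name `hasTextLayer` is hypothetical and is not something the redactor exposes.

```javascript
// Hypothetical pre-check: true if at least one item carries visible text.
// Whitespace-only items are ignored because they cannot contain an address.
function hasTextLayer(items) {
  return items.some(
    (item) => typeof item.str === "string" && item.str.trim().length > 0
  );
}

console.log(hasTextLayer([])); // a scanned page yields no text items
console.log(hasTextLayer([{ str: "jane.doe@example.com" }]));
```

If the check returns false for every page, run OCR first; a redaction pass on that file would draw no boxes at all.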
Glyphs survive under the black box
Recoverable: This is the single most important caveat. The tool draws a rectangle; it does not delete the underlying text. Anyone can copy-paste or run text extraction on the output and recover the address. To make it forensically safe, rasterise (via PDF to PNG + Image to PDF) or flatten the result.
Address wraps across two lines
Partial: pdf.js emits each visual line as a separate text item. A wrapped address (name@ on one line, domain.com on the next) is tested per line; neither half forms a complete user@domain.tld, so neither is boxed. Review wrapped addresses by eye.
Internationalised local part (accents)
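Partial: Characters like é, ü, or non-Latin scripts are outside the local-part class [A-Za-z0-9._%+-], so the pattern can only start matching after the last such character. jürgen@example.com matches as rgen@example.com, which is enough to box its text item. josé@example.com, where the é sits directly before the @, does not match at all and is left visible. Check international addresses manually.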
Over-redaction of the surrounding run
By design: Because the whole text item is boxed, words sharing the run with the address are covered too. This errs on the side of removing more, not less, which is generally desirable for redaction, but verify that a label or amount you needed to keep was not hidden.
Encrypted / password-protected PDF
Fails to parse: pdf.js must read the text to find matches. An encrypted PDF cannot be parsed without its open password, so the redaction will not run. Remove the password first with PDF Unlock or Remove Password, redact, then re-protect if needed.
Email inside an annotation or form field
Not covered: The redactor scans the page content text layer only. An address typed into a comment, sticky note, or form field is not page text and will not be boxed. Strip those first with Annotation Remover or Flatten.
Bare hostname address (no TLD)
Not matched: The pattern requires a .tld of two or more letters. Intranet addresses like user@mailhost or admin@localhost have no dotted TLD and are skipped. These are rare in shared documents but worth a manual scan if your org uses them.
No on-screen count of redactions
Expected: The processor returns only the redacted PDF; the internal match count is discarded. You will not see a '7 emails redacted' summary. Confirm coverage by trying to select text under each box in the output.
Free-tier size or page cap exceeded
Rejected: The free tier caps PDFs at 2 MB and 50 pages. A larger correspondence bundle is rejected until you upgrade (Pro: 50 MB / 500 pages) or split it first with PDF Split by Range.
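If you are preparing a batch, the caps can be checked before dropping each file. The sketch below is illustrative only: it assumes 1 MB means 1024 × 1024 bytes and that the limits are inclusive, neither of which this article states, and the names `TIERS` and `fitsTier` are hypothetical.

```javascript
// Caps quoted above. Byte values assume 1 MB = 1024 * 1024 bytes.
const TIERS = {
  free: { maxBytes: 2 * 1024 * 1024, maxPages: 50 },
  pro: { maxBytes: 50 * 1024 * 1024, maxPages: 500 },
};

// Hypothetical pre-check: does a file fit under a tier's size and page caps?
function fitsTier(sizeBytes, pageCount, tier) {
  const cap = TIERS[tier];
  if (!cap) throw new Error(`unknown tier: ${tier}`);
  return sizeBytes <= cap.maxBytes && pageCount <= cap.maxPages;
}

// A 3 MB, 40-page bundle: check it against each tier.
const threeMb = 3 * 1024 * 1024;
console.log(fitsTier(threeMb, 40, "free"), fitsTier(threeMb, 40, "pro"));
```

A file that fails the free check on size alone is a candidate for splitting by page range rather than upgrading.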
Frequently asked questions
Are the email addresses actually removed, or just covered?
Just covered. The tool draws an opaque black rectangle over the text run; the original glyphs remain in the PDF content stream beneath it. That means the address can still be recovered by copy-paste or text extraction. To make it unrecoverable, rasterise the output (PDF to PNG then Image to PDF) or run it through PDF Flatten, then confirm extraction returns nothing.
Can I redact only emails and leave phone numbers visible?
No. The redactor has no options panel — it auto-runs all four patterns (email, phone, SSN, credit card) together the moment you drop the file. In an email-only document the other patterns simply find no matches, but if a phone number or card-shaped digit string is present it will also be boxed. There is no per-category toggle.
Does it catch every email format?
It catches canonical ASCII addresses: first.last@sub.domain.co.uk, info+tag@example.org, upper-case variants. It misses bare hostnames with no dot-TLD (user@localhost), quoted local parts with spaces, and internationalised addresses whose non-ASCII character sits directly before the @ (josé@example.com). When the non-ASCII character falls earlier in the local part, only the ASCII tail matches. Review unusual addresses by eye.
Will it work on a scanned PDF?
Not on its own. Detection reads the text layer via pdf.js; a scan has no extractable text, so nothing is boxed. Run PDF OCR first to create a text layer, then redact. OCR mistakes can also cause misses, so verify visually.
What happens to an address that wraps across two lines?
pdf.js treats each line as a separate text item. A wrapped address (name@ then domain.com) is tested line by line; neither half is a complete user@domain.tld, so neither is boxed. Wrapped addresses need manual review before sharing.
How much of the line gets blacked out?
The whole pdf.js text item containing the match — not just the address characters. If the address sits in its own run, only it is covered. If it shares a run with surrounding words, those are covered too. This over-redacts slightly, which is safer for a redaction but can hide a label you wanted to keep.
Does it tell me how many emails it redacted?
No. The engine counts matches internally but the count is discarded before it reaches the UI. The result panel shows only output size and page count. Verify by trying to select text under each black box in the downloaded file.
Is anything uploaded to a server?
No. Detection and redaction both run in your browser using pdf.js and pdf-lib. The document and the addresses in it never leave your device — only an anonymous usage counter is recorded when you are signed in. This is what makes it suitable for privileged correspondence and GDPR work.
Can I redact an email inside a PDF comment or form field?
No — the scan is limited to the page content text layer. Addresses in annotations, sticky notes, or form fields are not page text. Remove those first with Annotation Remover or Flatten, then run the redactor.
My PDF is password-protected — can I redact it?
Not while it is encrypted. pdf.js needs to read the text, which it cannot do without the open password. Remove the password with PDF Unlock or Remove Password, redact, then re-apply protection with Password Protect if you still need it.
What size of file can I redact?
Free tier: up to 2 MB and 50 pages. Pro: 50 MB / 500 pages. Pro Media: 500 MB / 2000 pages. For a bundle over your limit, split it with PDF Split by Range, redact each part, and recombine.
What is a complete safe-to-share workflow?
Add a text layer if scanned (OCR) → run the PII Redactor → verify no fragments peek out and wrapped addresses are covered → flatten or rasterise so the glyphs are gone → scrub document properties with Metadata Scrubber. Then extract text one last time and confirm no address survives.
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.