How to redact personal data from a report PDF before release
- Step 1: Decide what must be removed for release. Separate the identifiers the tool can box (emails, phones, dash-SSNs, cards) from those it cannot (names, addresses, dates of birth, job titles that identify someone). Plan a manual pass for the latter.
- Step 2: Preserve anonymised codes deliberately. If your report uses codes like 'Respondent A' or 'Subject 3', those are not PII and are not matched, so they stay intact. Add such codes before redaction if you want them in the released version.
- Step 3: Confirm a text layer, OCR if needed. Detection reads text via pdf.js. A scanned exhibit needs PDF OCR first; verify the OCR output before relying on it.
- Step 4: Open the redactor and drop the report. Load the file into the PDF PII Redactor. It is browser-only, uploads nothing, and has no options panel: it runs immediately and produces the boxed output.
- Step 5: Manually redact names and addresses. Because the tool cannot box names or addresses, cover those in your PDF editor (or edit the source before export) and flatten them, in addition to running this tool for the structured fields.
- Step 6: Flatten, scrub, and verify before release. Rasterise via PDF to PNG + Image to PDF or use PDF Flatten to destroy the glyphs, run Metadata Scrubber, then PDF to Text to confirm no detected PII extracts.
Report identifiers: detected vs. manual
The engine has four patterns. Names and addresses — the identifiers most reports turn on — are NOT among them.
| Identifier in the report | Auto-detected? | Action |
|---|---|---|
| Interviewee email | Yes | Boxed by the email pattern |
| Contact / witness phone | Yes | Boxed by the numeric pattern |
| SSN (nnn-nn-nnnn) | Yes (dash form) | Boxed; other digit forms missed |
| Card / account number (13–16 digits) | Yes | Boxed by the card pattern |
| Subject's name | No | Manual — no name pattern; replace with a code or cover it |
| Home / site address | No | Manual — no address pattern |
| Date of birth / key dates | No | Manual — no date pattern |
| 'Respondent A' style code | No (and correctly so) | Left intact — not PII; keep it in the release version |
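To make the split in the table concrete, the four structured patterns can be approximated with regular expressions. This is only a sketch: the expressions below are assumptions written for illustration (the phone one covers only numbers with a leading 0), not the tool's actual patterns, and the names `PATTERNS` and `detect` are hypothetical.

```python
import re

# Illustrative approximations only; the tool's real expressions may differ.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # dash form only
    "card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),      # 13 to 16 digits
    "phone": re.compile(r"\b0\d{2,4}[ -]?\d{3,4}[ -]?\d{3,4}\b"),
}

def detect(text):
    """Return the matches for each illustrative pattern found in text."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}

line = "Interview with Dr Helen Carter (h.carter@trust.nhs.uk, 020 7946 0991)"
hits = detect(line)
print(hits["email"])   # the email is matched
print(hits["phone"])   # the phone number is matched
# Nothing in PATTERNS can match "Dr Helen Carter": names need a manual pass.
```

None of the four expressions has any notion of a proper noun, an address, or a date, which is why those rows in the table are marked manual.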
From draft to safe-to-release
A boxed report is not yet a release version. Each stage moves it closer.
| Stage | What it does | Still leaks? |
|---|---|---|
| Auto-redact (this tool) | Boxes emails/phones/SSNs/cards in the text layer | Yes — names/addresses visible; glyphs recoverable |
| Manual redaction | Covers names, addresses, dates the tool cannot detect | Glyphs still recoverable until flattened |
| Flatten / rasterise | Destroys the underlying text under every box | No, for what was covered |
| Metadata scrub | Clears author, producer, title, dates from properties | No — removes the hidden authorship trail |
Cookbook
Report de-identification cases. 'Detected' fields are boxed automatically; 'manual' fields you handle yourself. Boxed text stays 'recoverable' until flattened.
An inspection report line
Contact details are boxed; the named individual is not.
Source: Interview with Dr Helen Carter (h.carter@trust.nhs.uk, 020 7946 0991)
After auto-redaction: Interview with Dr Helen Carter ([ ████████████ ], [ ███████████ ])
^ "Dr Helen Carter" still visible -> redact manually
Preserving an anonymised code
Codes you add are not PII, so they survive untouched — exactly what you want in the released version.
Source: Respondent A (a.smith@x.com) reported the incident.
After: Respondent A ([ ██████████ ]) reported the incident.
('Respondent A' kept; only the email boxed)
FOI bundle with an exhibit invoice
A 13–16 digit card or account number on an attached invoice is boxed by the card pattern.
Exhibit line: Settled via card 4111 1111 1111 1111
After: Settled via card [ ████████████████ ]
The full release workflow
Manual for names/addresses, automated for structured fields, then flatten and scrub.
1. Replace/cover subject names + addresses (manual)
2. pdf-pii-redactor -> boxes emails/phones/SSNs/cards
3. pdf-to-png + image-to-pdf (or pdf-flatten) -> destroy glyphs
4. pdf-metadata-scrubber -> clear author/producer/dates
5. pdf-to-text -> verify no detected PII extracts
Verifying before publication
The final check for an FOI or public release.
Run pdf-to-text on the release file and search for:
'@' -> no emails
digit runs -> no phones/SSNs/cards
subject names -> confirm your manual redactions held
If anything appears, it was not flattened or not covered.
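These searches can be scripted against whatever text PDF to Text extracts from the release file. The sketch below assumes that text is already in a string; the function name `release_findings` and the threshold of seven digits for a "digit run" are assumptions chosen for illustration.

```python
import re

def release_findings(extracted, subject_names):
    """List reasons the extracted text is not yet safe to release."""
    findings = []
    if "@" in extracted:
        findings.append("possible email: '@' found")
    # Runs of 7+ digits, allowing single spaces or dashes between digits.
    # The threshold is an assumption; tune it to the IDs in your report.
    for run in re.findall(r"\d(?:[ -]?\d){6,}", extracted):
        findings.append(f"digit run: {run}")
    lowered = extracted.lower()
    for name in subject_names:
        if name.lower() in lowered:
            findings.append(f"subject name: {name}")
    return findings

leaky = "Contact h.carter@trust.nhs.uk or 020 7946 0991"
clean = "Respondent A reported the incident."
print(release_findings(leaky, ["Helen Carter"]))
print(release_findings(clean, ["Helen Carter"]))
```

An empty list is the result you want before publication; any entry means a field was not covered or the file was not flattened.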
Edge cases and what actually happens
Subject names are not detected
Not covered: The tool has no name pattern, so proper nouns identifying interviewees, complainants, or staff are left fully visible. For a report, this is the central limitation: you must redact names manually (or replace them with codes like 'Respondent A') in addition to running this tool.
Addresses and dates not detected
Not covered: There is no address or date pattern. Home addresses, site locations, and dates of birth remain visible and must be redacted by hand before release.
Boxes are visual until flattened
Recoverable: Each box hides text but leaves the glyphs in the content stream, where text extraction can recover them. A published or FOI-released PDF with recoverable personal data is a disclosure failure. Always flatten or rasterise (PDF Flatten or PDF to PNG round-trip) before release.
Anonymised codes are preserved
Preserved: Codes such as 'Respondent A' or 'Subject 3' are not PII and match no pattern, so they pass through untouched, which is the desired behaviour for a de-identified report.
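If you prepare the codes in the source document before exporting to PDF, the substitution can be sketched as a simple name-to-code mapping. Everything here is hypothetical (the helper names `assign_codes` and `pseudonymise`, and the lettering scheme); it illustrates the idea rather than any feature of the redactor.

```python
import re
from string import ascii_uppercase

def assign_codes(names, prefix="Respondent"):
    """Map each distinct name to 'Respondent A', 'Respondent B', ... in first-seen order."""
    codes = {}
    for name in names:
        if name not in codes:
            # Supports up to 26 names; a longer list needs another scheme.
            codes[name] = f"{prefix} {ascii_uppercase[len(codes)]}"
    return codes

def pseudonymise(text, codes):
    """Replace every listed name with its code, longest names first."""
    for name in sorted(codes, key=len, reverse=True):
        text = re.sub(re.escape(name), codes[name], text)
    return text

codes = assign_codes(["Dr Helen Carter", "A. Smith"])
print(pseudonymise("Dr Helen Carter interviewed A. Smith.", codes))
```

Keep the mapping itself out of the released file, since it reverses the de-identification.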
Over-redaction of surrounding prose
By design: Boxing the whole pdf.js text item can cover words next to a detected identifier. For a release version this is usually acceptable, but check you did not hide a finding or context the public is entitled to.
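A small model of this behaviour: treat each pdf.js text item as a string and cover the whole item whenever it contains a match. The item boundaries, the `EMAIL` expression, and the function name `boxed_items` are assumptions for illustration; real item splitting depends on how the PDF was produced.

```python
import re

# Illustrative email expression, not the tool's actual pattern.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def boxed_items(items):
    """For each item containing a match, return (matched text, full item that is covered)."""
    covered = []
    for item in items:
        match = EMAIL.search(item)
        if match:
            covered.append((match.group(), item))
    return covered

items = ["Contact: h.carter@trust.nhs.uk (site lead)", "Findings follow."]
for matched, item in boxed_items(items):
    print(f"matched: {matched}")
    print(f"covered: {item}")
```

Here the words around the email sit in the same item, so they disappear under the box along with it; the second item is untouched.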
Scanned exhibit with no text layer
0 matches: Detection needs extractable text. OCR scanned exhibits with PDF OCR first, then verify, since OCR errors can leave identifiers unmatched.
Un-hyphenated or spaced numbers
Partial: The SSN rule matches only nnn-nn-nnnn; compact or space-delimited numeric IDs are missed. Review numeric identifiers in the report manually.
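The gap is easy to see with a dash-only expression next to a looser one used purely as a review aid. `SSN_DASH` follows the documented nnn-nn-nnnn shape; `NINE_DIGITS` and `review_candidates` are hypothetical helpers for your own manual pass, not part of the tool.

```python
import re

SSN_DASH = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")        # the documented shape
NINE_DIGITS = re.compile(r"\b\d{3} ?\d{2} ?\d{4}\b")   # compact or space-delimited

def review_candidates(text):
    """Return 9-digit IDs written without dashes, which a dash-only rule skips."""
    return NINE_DIGITS.findall(text)

text = "IDs on file: 123-45-6789, 123456789, 123 45 6789"
print(SSN_DASH.findall(text))     # only the dash form
print(review_candidates(text))    # the two forms left for manual review
```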
Author name hidden in metadata
Not covered: The redactor scans page text only. A report's author name and authoring software live in the document properties. Run Metadata Scrubber before release.
Reviewer comments contain PII
Not covered: Tracked comments and sticky notes are not page content text and are not scanned. Remove them with Annotation Remover or Flatten before redacting.
Report over the tier cap
Rejected: The free tier rejects PDFs over 2 MB or 50 pages. Upgrade (Pro: 50 MB / 500 pages; Pro Media: 500 MB / 2000 pages) or split with PDF Split by Range.
Frequently asked questions
Will it redact the names that appear in the report's text?
No. The tool has no name-detection pattern — it boxes only emails, phones, dash-delimited SSNs, and 13–16 digit card numbers. Personal names like 'Dr Helen Carter' are left fully visible. You must redact names manually (cover them in a PDF editor and flatten, or replace them with anonymised codes in the source) in addition to running this tool.
Can I preserve anonymised references like 'Respondent A'?
Yes, and automatically. Codes such as 'Respondent A' or 'Subject 3' are not personal data and match none of the four patterns, so they pass through untouched. Add them to the document before redaction if you want them in the released version.
What personal data does it actually remove?
It boxes four structured types: email addresses, phone-shaped numbers, US SSNs in nnn-nn-nnnn form, and credit/debit card numbers of 13–16 digits. It does not detect names, addresses, dates of birth, or other free-text identifiers — those are a manual job.
What redaction colour is used, and does it fully cover the text?
A solid black rectangle (rgb(0,0,0)) is drawn over the whole pdf.js text item containing the match, sized to the item plus a 2pt height margin. It visually covers the run, but zoom in to confirm no fragment peeks out at the edges — and remember the box is visual only until you flatten.
Is a boxed report safe to publish as-is?
No. The boxes hide the detected fields but leave the underlying glyphs in the content stream, recoverable by text extraction. You must also redact the names/addresses the tool cannot detect, then flatten or rasterise the whole file (PDF Flatten or PDF to PNG + Image to PDF) so nothing is recoverable, and finally scrub metadata.
Does it have a review-and-confirm step?
No. The tool auto-runs the moment you drop the file and produces the boxed output directly — there is no preview, no per-match confirmation, and no manual-redaction mode inside the tool. Your review happens on the downloaded file: open it, check coverage, and verify extraction returns nothing after flattening.
How do I handle addresses and dates the tool misses?
Manually. Cover home/site addresses and dates of birth in your PDF editor and flatten them, or edit the source document before exporting to PDF. The redactor only handles the structured contact/account fields.
Does it remove the author's name from the file?
No — that lives in the document properties, not the page text, so the redactor does not touch it. Run Metadata Scrubber to clear the author, producer, title, and date fields before you release the report.
What about a scanned appendix or exhibit?
Scanned pages have no text layer, so nothing is detected. Run PDF OCR to add a text layer first, then redact — and verify, because OCR errors can leave an identifier unmatched and visible.
Is the draft uploaded anywhere during redaction?
No. All detection and redaction happen in your browser via pdf.js and pdf-lib. The pre-publication draft never leaves your device; only an anonymous usage counter is recorded when you are signed in.
How large a report can I process?
Free: 2 MB / 50 pages. Pro: 50 MB / 500 pages. Pro Media: 500 MB / 2000 pages. For a longer report, split it with PDF Split by Range, redact each section, then recombine for release.
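As a planning aid, the limits above can be turned into a quick check of which tier a report fits and which page ranges to request from PDF Split by Range. This is a sketch under assumptions: the page does not say whether MB means 10^6 or 2^20 bytes (2^20 is used here), and `smallest_tier` and `split_ranges` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MB = 1024 * 1024  # assumption: binary megabytes

@dataclass(frozen=True)
class Tier:
    name: str
    max_bytes: int
    max_pages: int

TIERS = [
    Tier("Free", 2 * MB, 50),
    Tier("Pro", 50 * MB, 500),
    Tier("Pro Media", 500 * MB, 2000),
]

def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier whose size and page caps both admit the file."""
    for tier in TIERS:
        if size_bytes <= tier.max_bytes and pages <= tier.max_pages:
            return tier.name
    return None

def split_ranges(pages: int, max_pages: int = 50) -> List[Tuple[int, int]]:
    """1-based inclusive page ranges, each at most max_pages long."""
    return [(start, min(start + max_pages - 1, pages))
            for start in range(1, pages + 1, max_pages)]

print(smallest_tier(1 * MB, 120))
print(split_ranges(120))
```

Splitting by page count does not guarantee each part is under the size cap, so check the file size of every part before redacting it.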
What is the complete de-identification workflow before release?
Separate detectable fields from manual ones → preserve any anonymised codes → OCR scanned exhibits → manually redact names, addresses, and dates → run the PII Redactor for emails/phones/SSNs/cards → flatten or rasterise so all covered text is destroyed → scrub properties with Metadata Scrubber → confirm with PDF to Text that no detected PII or subject name extracts.
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.