Redact Personal Data from a Report PDF Before Release

How to redact personal data from a report PDF before release

Step 1
Decide what must be removed for release — Separate the identifiers the tool can box (emails, phones, dash-SSNs, cards) from those it cannot (names, addresses, dates of birth, job titles that identify someone). Plan a manual pass for the latter.
Step 2
Preserve anonymised codes deliberately — If your report uses codes like 'Respondent A' or 'Subject 3', those are not PII and are not matched — they stay intact. Add such codes before redaction if you want them in the released version.
Step 3
Confirm a text layer, OCR if needed — Detection reads text via pdf.js. A scanned exhibit needs PDF OCR first; verify the OCR output before relying on it.
Step 4
Open the redactor and drop the report — Load the file into the PDF PII Redactor. Browser-only, nothing uploaded, no options panel — it runs immediately and produces the boxed output.
Step 5
Manually redact names and addresses — Because the tool cannot box names/addresses, cover those in your PDF editor (or edit the source before export) and flatten them, in addition to running this tool for the structured fields.
Step 6
Flatten, scrub, and verify before release — Rasterise via PDF to PNG + Image to PDF or use PDF Flatten to destroy the glyphs, run Metadata Scrubber, then PDF to Text to confirm no detected PII extracts.

Report identifiers: detected vs. manual

The engine has four patterns. Names and addresses — the identifiers most reports turn on — are NOT among them.

Identifier in the report	Auto-detected?	Action
Interviewee email	Yes	Boxed by the email pattern
Contact / witness phone	Yes	Boxed by the phone pattern
SSN (`nnn-nn-nnnn`)	Yes (dash form)	Boxed; other digit forms missed
Card / account number (13–16 digits)	Yes	Boxed by the card pattern
Subject's name	No	Manual — no name pattern; replace with a code or cover it
Home / site address	No	Manual — no address pattern
Date of birth / key dates	No	Manual — no date pattern
'Respondent A' style code	No (and correctly so)	Left intact — not PII; keep it in the release version
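
The split in the table can be illustrated with a minimal Python sketch. The four expressions below are hypothetical stand-ins written for this example; the tool's real patterns are not reproduced here and may differ, and the tool itself runs in the browser rather than in Python. The `detect` helper is likewise an illustrative name.

```python
import re

# Hypothetical expressions for illustration only; the tool's real
# patterns are not documented here and may differ.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"(?<!\d)\(?\d{2,4}\)?[ -]\d{3,4}[ -]\d{3,4}(?!\d)"),
    "ssn": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
    "card": re.compile(r"(?<!\d)(?:\d[ -]?){12,15}\d(?!\d)"),
}


def detect(text):
    """Return the names of all patterns that match somewhere in text."""
    return {name for name, rx in PATTERNS.items() if rx.search(text)}


samples = [
    "h.carter@trust.nhs.uk",     # interviewee email
    "020 7946 0991",             # contact phone
    "123-45-6789",               # dash-form SSN
    "4111 1111 1111 1111",       # card number on an exhibit
    "Dr Helen Carter",           # subject's name: nothing to match
    "12 Mill Lane, Leeds",       # address: nothing to match
    "Date of birth 04/07/1981",  # date: nothing to match
    "Respondent A",              # anonymised code: nothing to match
]

for sample in samples:
    hits = detect(sample)
    print(f"{sample!r}: {sorted(hits) if hits else 'manual / left intact'}")
```

With these assumed expressions a spaced card number also satisfies the phone shape. The overlap is harmless in this model, because either match leads to the same box.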

From draft to safe-to-release

A boxed report is not yet a release version. Each stage moves it closer.

Stage	What it does	Still leaks?
Auto-redact (this tool)	Boxes emails/phones/SSNs/cards in the text layer	Yes — names/addresses visible; glyphs recoverable
Manual redaction	Covers names, addresses, dates the tool cannot detect	Glyphs still recoverable until flattened
Flatten / rasterise	Destroys the underlying text under every box	No, for what was covered
Metadata scrub	Clears author, producer, title, dates from properties	No — removes the hidden authorship trail

Cookbook

Report de-identification cases. 'Detected' fields are boxed automatically; 'manual' fields you handle yourself. Boxed text stays 'recoverable' until the file is flattened.

An inspection report line

Contact details are boxed; the named individual is not.

Source:
  Interview with Dr Helen Carter (h.carter@trust.nhs.uk, 020 7946 0991)
After auto-redaction:
  Interview with Dr Helen Carter ([ ████████████ ], [ ███████████ ])
  ^ "Dr Helen Carter" still visible -> redact manually

Preserving an anonymised code

Codes you add are not PII, so they survive untouched — exactly what you want in the released version.

Source:  Respondent A (a.smith@x.com) reported the incident.
After:   Respondent A ([ ██████████ ]) reported the incident.
('Respondent A' kept; only the email boxed)

FOI bundle with an exhibit invoice

A 13–16 digit card or account number on an attached invoice is boxed by the card pattern.

Exhibit line: Settled via card 4111 1111 1111 1111
After:        Settled via card [ ████████████████ ]

The full release workflow

Manual for names/addresses, automated for structured fields, then flatten and scrub.

1. Replace/cover subject names + addresses (manual)
2. pdf-pii-redactor -> boxes emails/phones/SSNs/cards
3. pdf-to-png + image-to-pdf (or pdf-flatten) -> destroy glyphs
4. pdf-metadata-scrubber -> clear author/producer/dates
5. pdf-to-text -> verify no detected PII extracts

Verifying before publication

The final check for an FOI or public release.

Run pdf-to-text on the release file and search for:
  '@'         -> no emails
  digit runs  -> no phones/SSNs/cards
  subject names -> confirm your manual redactions held
If anything appears, it was not flattened or not covered.
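
If the extracted text is saved to a file or pasted into a script, the search above can be automated. The following is a minimal sketch, not part of the tool: `find_leaks` is a hypothetical helper, and the threshold of seven or more digits for a "digit run" is an assumption chosen so that page numbers and years are not flagged.

```python
import re


def find_leaks(extracted_text, subject_names):
    """Return (kind, snippet) pairs for anything that should not extract."""
    leaks = []

    # Any token containing '@' is treated as a possible email.
    for match in re.finditer(r"\S*@\S*", extracted_text):
        leaks.append(("email-like", match.group()))

    # Assumption: 7+ digits, optionally separated by single spaces or
    # dashes, is long enough to be a phone, SSN, or card number.
    for match in re.finditer(r"\d(?:[ -]?\d){6,}", extracted_text):
        leaks.append(("digit run", match.group()))

    # Case-insensitive search for each name you redacted by hand.
    lowered = extracted_text.lower()
    for name in subject_names:
        if name.lower() in lowered:
            leaks.append(("subject name", name))

    return leaks


released = "Interview with Respondent A on site."
print(find_leaks(released, ["Helen Carter"]))  # an empty list means nothing was found
```

An empty result only shows that these three searches found nothing. It does not replace reading the released file.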

Edge cases and what actually happens

Subject names are not detected

Not covered

The tool has no name pattern, so proper nouns identifying interviewees, complainants, or staff are left fully visible. For a report, this is the central limitation: you must redact names manually (or replace them with codes like 'Respondent A') in addition to running this tool.

Addresses and dates not detected

Not covered

There is no address or date pattern. Home addresses, site locations, and dates of birth remain visible and must be redacted by hand before release.

Boxes are visual until flattened

Recoverable

Each box hides text but leaves the glyphs in the content stream — recoverable by extraction. A published or FOI-released PDF with recoverable personal data is a disclosure failure. Always flatten or rasterise (PDF Flatten or PDF to PNG round-trip) before release.

Anonymised codes are preserved

Preserved

Codes such as 'Respondent A' or 'Subject 3' are not PII and match no pattern, so they pass through untouched — which is the desired behaviour for a de-identified report.

Over-redaction of surrounding prose

By design

Boxing the whole pdf.js text item can cover words next to a detected identifier. For a release version this is usually acceptable, but check you did not hide a finding or context the public is entitled to.
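
A rough model of why neighbouring words disappear, assuming a text item is a string with a position and size. `TextItem`, `Box`, `box_for`, and `collateral` are hypothetical names, the email expression is a stand-in, and splitting the 2pt height margin evenly above and below the item is an assumption made for this sketch.

```python
import re
from dataclasses import dataclass

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # hypothetical pattern
HEIGHT_MARGIN_PT = 2.0  # the 2pt height margin; its placement is assumed


@dataclass
class TextItem:
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass
class Box:
    x: float
    y: float
    width: float
    height: float


def box_for(item):
    """Return a box over the whole item if it contains a match, else None."""
    if not EMAIL.search(item.text):
        return None
    # Assumption: the margin is split evenly above and below the item.
    return Box(
        item.x,
        item.y - HEIGHT_MARGIN_PT / 2,
        item.width,
        item.height + HEIGHT_MARGIN_PT,
    )


def collateral(item):
    """Text hidden by the box that was not itself part of a match."""
    return " ".join(EMAIL.sub(" ", item.text).split())


item = TextItem("Contact: a.smith@x.com (finding 3 upheld)", 72.0, 700.0, 300.0, 12.0)
print(box_for(item))
print(collateral(item))
```

Here the note about the finding shares a text item with the email, so it is hidden along with it. That is the kind of collateral to look for when reviewing the output.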

Scanned exhibit with no text layer

0 matches

Detection needs extractable text. OCR scanned exhibits with PDF OCR first, then verify, since OCR errors can leave identifiers unmatched.

Un-hyphenated or spaced numbers

Partial

The SSN rule matches only nnn-nn-nnnn; compact or space-delimited numeric IDs are missed. Review numeric identifiers in the report manually.
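
For the manual review, a short script can list nine-digit numbers written in the forms the dash rule misses. This is a sketch under assumptions: `DASH_SSN` is a guess at the shape of a dash-only rule, and `LOOSE_NINE` and `review_candidates` are hypothetical names introduced for this example.

```python
import re

# Assumed shape of a dash-only rule: nnn-nn-nnnn and nothing else.
DASH_SSN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")

# Nine digits written compactly or with single spaces in the 3-2-4 grouping.
LOOSE_NINE = re.compile(r"(?<!\d)\d{3} ?\d{2} ?\d{4}(?!\d)")


def review_candidates(text):
    """Return nine-digit numbers that a dash-only rule would not match."""
    return [match.group() for match in LOOSE_NINE.finditer(text)]


line = "IDs on file: 123-45-6789, 123456789, 123 45 6789."
print(DASH_SSN.findall(line))   # the form a dash-only rule catches
print(review_candidates(line))  # forms to check and cover by hand
```

Run it on the extracted text of the source report, before redaction, to build a checklist for the manual pass.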

Author name hidden in metadata

Not covered

The redactor scans page text only. A report's author name and authoring software live in the document properties. Run Metadata Scrubber before release.

Reviewer comments contain PII

Not covered

Tracked comments and sticky notes are not page content text and are not scanned. Remove them with Annotation Remover or Flatten before redacting.

Report over the tier cap

Rejected

Free tier rejects PDFs over 2 MB / 50 pages. Upgrade (Pro: 50 MB / 500 pages; Pro Media: 500 MB / 2000 pages) or split with PDF Split by Range.
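
The caps can be checked before you open the tool. A small sketch using the limits quoted above; the tier keys, `smallest_tier`, and `page_ranges` are hypothetical names, and treating a file exactly at the limit as accepted is an assumption.

```python
# Limits quoted above as (max size in MB, max pages); keys are illustrative.
TIER_LIMITS = {
    "free": (2, 50),
    "pro": (50, 500),
    "pro_media": (500, 2000),
}


def smallest_tier(size_mb, pages):
    """Return the first tier whose caps the file fits, or None if none does."""
    for tier, (max_mb, max_pages) in TIER_LIMITS.items():
        # Assumption: a file exactly at the cap is accepted.
        if size_mb <= max_mb and pages <= max_pages:
            return tier
    return None


def page_ranges(total_pages, max_pages):
    """Split 1..total_pages into consecutive inclusive ranges of at most max_pages."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(smallest_tier(1.5, 120))
print(page_ranges(120, 50))
```

Splitting by page count does not guarantee that each part also fits the size cap, so check the file size of each part after splitting.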

Frequently asked questions

Will it redact the names that appear in the report's text?

No. The tool has no name-detection pattern — it boxes only emails, phones, dash-delimited SSNs, and 13–16 digit card numbers. Personal names like 'Dr Helen Carter' are left fully visible. You must redact names manually (cover them in a PDF editor and flatten, or replace them with anonymised codes in the source) in addition to running this tool.

Can I preserve anonymised references like 'Respondent A'?

Yes, and automatically. Codes such as 'Respondent A' or 'Subject 3' are not personal data and match none of the four patterns, so they pass through untouched. Add them to the document before redaction if you want them in the released version.

What personal data does it actually remove?

It boxes four structured types: email addresses, phone-shaped numbers, US SSNs in nnn-nn-nnnn form, and credit/debit card numbers of 13–16 digits. It does not detect names, addresses, dates of birth, or other free-text identifiers — those are a manual job.

What redaction colour is used, and does it fully cover the text?

A solid black rectangle (rgb(0,0,0)) is drawn over the whole pdf.js text item containing the match, sized to the item plus a 2pt height margin. It visually covers the run, but zoom in to confirm no fragment peeks out at the edges — and remember the box is visual only until you flatten.

Is a boxed report safe to publish as-is?

No. The boxes hide the detected fields but leave the underlying glyphs in the content stream, recoverable by text extraction. You must also redact the names/addresses the tool cannot detect, then flatten or rasterise the whole file (PDF Flatten or PDF to PNG + Image to PDF) so nothing is recoverable, and finally scrub metadata.

Does it have a review-and-confirm step?

No. The tool auto-runs the moment you drop the file and produces the boxed output directly — there is no preview, no per-match confirmation, and no manual-redaction mode inside the tool. Your review happens on the downloaded file: open it, check coverage, and verify extraction returns nothing after flattening.

How do I handle addresses and dates the tool misses?

Manually. Cover home/site addresses and dates of birth in your PDF editor and flatten them, or edit the source document before exporting to PDF. The redactor only handles the structured contact/account fields.

Does it remove the author's name from the file?

No — that lives in the document properties, not the page text, so the redactor does not touch it. Run Metadata Scrubber to clear the author, producer, title, and date fields before you release the report.

What about a scanned appendix or exhibit?

Scanned pages have no text layer, so nothing is detected. Run PDF OCR to add a text layer first, then redact — and verify, because OCR errors can leave an identifier unmatched and visible.

Is the draft uploaded anywhere during redaction?

No. All detection and redaction happen in your browser via pdf.js and pdf-lib. The pre-publication draft never leaves your device; only an anonymous usage counter is recorded when you are signed in.

How large a report can I process?

Free: 2 MB / 50 pages. Pro: 50 MB / 500 pages. Pro Media: 500 MB / 2000 pages. For a longer report, split it with PDF Split by Range, redact each section, then recombine for release.

What is the complete de-identification workflow before release?

Separate detectable fields from manual ones → preserve any anonymised codes → OCR scanned exhibits → manually redact names, addresses, and dates → run the PII Redactor for emails/phones/SSNs/cards → flatten or rasterise so all covered text is destroyed → scrub properties with Metadata Scrubber → confirm with PDF to Text that no detected PII or subject name extracts.

Privacy first

All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.

