How to anonymise a PDF by clearing its document metadata
- Step 1: Redact identifying content first. Black out participant names, IDs, emails, and addresses on the page with pdf-pii-redactor. Content de-identification is the part that ethics committees and GDPR reviews actually scrutinise; metadata is secondary.
- Step 2: Remove comments and markup. Run pdf-annotation-remover to strip reviewer notes, which carry annotator names the metadata scrubber cannot reach.
- Step 3: Flatten interactive fields. If the document has form fields (a survey, a consent form), flatten them with pdf-flatten so values become static content and any incremental layers are collapsed.
- Step 4: Drop the prepared PDF onto the scrubber. Add the file here. It loads locally with pdf-lib and the scrub runs automatically, with no options to set. All eight document-info fields are scrubbed: six are emptied and the two dates are reset to the Unix epoch.
- Step 5: Drop the XMP packet. Because this tool does not rewrite XMP, re-save through pdf-compress-lossless to rebuild the document and remove any residual XMP author or date. Then download.
- Step 6: Audit the final file. Run `exiftool -G1 -a -s final.pdf` and confirm there is no Author, Creator, or real date in either the PDF or XMP groups before you share the dataset.
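The audit in Step 6 can be made repeatable by checking the captured ExifTool text instead of reading it by eye. The following is a minimal sketch, not part of the tool: it assumes the usual `[Group] Tag : Value` layout that `-G1 -s` prints, and the tag lists and the `find_leaks` name are illustrative choices that you may need to extend for your files.

```python
import re

# Tags treated as identifying in this sketch (an assumption; extend as needed).
IDENTIFYING_TAGS = {"Author", "Creator", "Producer", "Title", "Subject", "Keywords"}
DATE_TAGS = {"CreateDate", "ModifyDate", "MetadataDate"}

# Matches lines such as "[PDF]           Author                          : Dr Researcher"
LINE_RE = re.compile(r"^\[(?P<group>[^\]]+)\]\s+(?P<tag>\S+)\s*:\s?(?P<value>.*)$")


def find_leaks(exiftool_output: str) -> list[tuple[str, str, str]]:
    """Return (group, tag, value) for PDF/XMP fields that still look identifying."""
    leaks = []
    for line in exiftool_output.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        group, tag, value = match["group"], match["tag"], match["value"].strip()
        if group != "PDF" and not group.startswith("XMP"):
            continue  # ignore File/System/ExifTool groups
        if tag in IDENTIFYING_TAGS and value:
            leaks.append((group, tag, value))
        elif tag in DATE_TAGS and value and not value.startswith("1970:01:01"):
            leaks.append((group, tag, value))  # a date other than the epoch reset
    return leaks
```

Pass it the text captured from the exiftool command; an empty list means none of the listed tags carried a value or a non-epoch date. It does not replace reading the full output, since tags outside these lists are not checked.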
The anonymisation layers — and which tool owns each
Metadata scrubbing is one layer. A document that is anonymous in metadata but not in content is not anonymous. This is the full chain.
| Identifying layer | Owned by this tool? | Tool |
|---|---|---|
| Document-info Author / Creator / Producer | Yes | This tool (pdf-metadata-scrubber) |
| Document-info Title / Subject / Keywords | Yes | This tool |
| Creation / modification timestamps | Yes (reset to epoch) | This tool |
| Names / IDs / emails in visible text | No | pdf-pii-redactor |
| Comments / annotation authors | No | pdf-annotation-remover |
| Form field values / incremental layers | No | pdf-flatten |
| XMP packet (dc:creator, dates) | No | Re-save via pdf-compress-lossless |
What the metadata step clears
The single pass applied to the document-information dictionary during anonymisation.
| Field | Identifying risk | After scrubbing |
|---|---|---|
| /Author | Names the researcher / preparer | Empty |
| /Creator | Authoring app or template owner | Empty |
| /Producer | Toolchain fingerprint | Empty |
| /Title | Often a participant ID or study codename | Empty |
| /Subject + /Keywords | Study tags, classification | Cleared |
| /CreationDate + /ModDate | Re-identifiable timing | Reset to 1970-01-01T00:00:00Z |
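The table can be read as a simple transformation over the document-information dictionary. The sketch below models that transformation on a plain Python mapping; it is not the tool's implementation (which runs in the browser via pdf-lib). It assumes that "Empty" and "Cleared" both mean an empty string and that keys outside the eight fields are left alone, neither of which the table states. `scrub_info` is a hypothetical name.

```python
# 1970-01-01T00:00:00Z written in PDF date-string syntax.
EPOCH_PDF_DATE = "D:19700101000000Z"

TEXT_FIELDS = ("Author", "Creator", "Producer", "Title", "Subject", "Keywords")
DATE_FIELDS = ("CreationDate", "ModDate")


def scrub_info(info: dict[str, str]) -> dict[str, str]:
    """Return a copy of a document-info mapping with the eight fields scrubbed."""
    scrubbed = dict(info)  # leave the caller's mapping untouched
    for field in TEXT_FIELDS:
        scrubbed[field] = ""  # assumption: "Empty"/"Cleared" means an empty string
    for field in DATE_FIELDS:
        scrubbed[field] = EPOCH_PDF_DATE  # reset, not deleted
    return scrubbed
```

For example, `scrub_info({"Title": "Subject-0427-interview"})["Title"]` is an empty string, and both date keys hold the epoch value afterwards.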
Cookbook
Anonymisation workflows for real data-sharing situations. The metadata-scrubber step is shown in its proper place within the chain.
Research dataset PDF — full de-identification chain
A consent form or case report shared with collaborators must be anonymous in both content and metadata. Metadata scrubbing comes after content redaction and form flattening.
1. pdf-pii-redactor -> redact participant name, DOB, ID
2. pdf-annotation-remover -> remove reviewer notes
3. pdf-flatten -> flatten consent-form fields
4. pdf-metadata-scrubber -> clear Author/Title/dates (this tool)
5. pdf-compress-lossless -> drop XMP, finalise
6. exiftool -G1 -a final.pdf -> audit
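If the chain is scripted or written into a data-handling protocol, a small pre-flight check can catch steps listed out of order. This is an illustrative sketch only: `CHAIN` copies the step names from the list above, `order_problems` is a hypothetical helper, and skipped steps are deliberately not reported because some (such as flattening) apply only to certain documents.

```python
# Recommended order, copied from the chain above.
CHAIN = [
    "pdf-pii-redactor",
    "pdf-annotation-remover",
    "pdf-flatten",
    "pdf-metadata-scrubber",
    "pdf-compress-lossless",
    "exiftool",
]


def order_problems(planned: list[str]) -> list[str]:
    """List the ways a planned run order departs from the recommended chain."""
    problems = [f"unknown step: {step}" for step in planned if step not in CHAIN]
    known = [step for step in planned if step in CHAIN]
    ranks = [CHAIN.index(step) for step in known]
    # Compare each step with the one planned immediately after it.
    for earlier, later, rank_a, rank_b in zip(known, known[1:], ranks, ranks[1:]):
        if rank_b < rank_a:
            problems.append(f"{later} should run before {earlier}")
    return problems
```

An empty result means the planned steps respect the recommended order; it says nothing about whether each step actually succeeded.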
Title field leaked a participant ID
The PDF's Title was 'Subject-0427-interview', a re-identifier on its own. The scrubber empties Title and Author and resets the dates to the epoch.
Before (Acrobat → Description):
- Title: Subject-0427-interview
- Author: Dr Researcher
- Created: 2026-03-18 14:02

After scrubbing:
- Title: (empty)
- Author: (empty)
- Created: 1970-01-01 00:00 UTC
Metadata clean, but the name is still on the page
The most dangerous false sense of security: Document Properties is blank, but the participant's name is printed in the body. Metadata scrubbing is not content redaction.
- Author (metadata): (empty after scrub) ✓
- Page 1 body: 'Interview with Jane D., 42' ✗ STILL VISIBLE

Fix: pdf-pii-redactor must run BEFORE you call the doc anonymous. Metadata scrub alone does not de-identify content.
Annotator name survives in a comment
A coder's initials are attached to a margin comment. The metadata scrub leaves it; the annotation remover clears it.
- Metadata: clean ✓
- Comment: 'coded as theme 3 - RP' ✗

Fix: run /pdf-tools/pdf-annotation-remover before the metadata scrub.
GDPR data-minimisation: nothing leaves your device
Because the scrub is browser-local, the personal data in the source PDF never transits a server during anonymisation — supporting a data-minimisation posture for the processing step.
Processing model: file -> browser (pdf-lib) -> scrubbed file

No upload of document content. Only an anonymous run counter is recorded for signed-in users (opt-out in account settings).
Edge cases and what actually happens
Visible PII still on the page after metadata scrub
Not anonymised: Clearing metadata does nothing to text or images you can see. A document with blank Document Properties but a participant name on page 1 is NOT anonymous. Redact content with pdf-pii-redactor before treating the file as de-identified. This is the most common anonymisation mistake.
Annotation author names remain
Out of scope: Reviewer or coder names attached to comments live in the annotation layer, not the metadata. Remove them with pdf-annotation-remover as part of the chain.
XMP author/date survives
XMP not rewritten: The tool clears the document-info dictionary but not the XMP packet. A residual dc:creator or real xmp:CreateDate can re-identify the source. Re-save through pdf-compress-lossless to drop the XMP, then audit with ExifTool.
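A quick way to see whether an XMP packet with a dc:creator is still present is to scan the raw file bytes. The sketch below is a heuristic under stated assumptions: it only finds packets stored as plain, unencrypted text, so an empty result does not prove the XMP is gone, and the ExifTool audit remains the real check. `xmp_creators` is a hypothetical helper name.

```python
import re

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
XMP_PACKET_RE = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=[^>]*\?>", re.DOTALL)
CREATOR_RE = re.compile(rb"<dc:creator>.*?</dc:creator>", re.DOTALL)
LI_RE = re.compile(rb"<rdf:li[^>]*>(.*?)</rdf:li>", re.DOTALL)


def xmp_creators(pdf_bytes: bytes) -> list[str]:
    """Return dc:creator values found in any plain-text XMP packet."""
    creators = []
    for packet in XMP_PACKET_RE.findall(pdf_bytes):
        for block in CREATOR_RE.findall(packet):
            for item in LI_RE.findall(block):
                creators.append(item.decode("utf-8", errors="replace").strip())
    return creators
```

Typical use is `xmp_creators(Path("final.pdf").read_bytes())` after importing `Path` from `pathlib`; any name in the returned list is a reason to repeat the lossless re-save.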
Form field values reveal identity
Not flattened: A filled consent form or survey can carry the respondent's entries in interactive fields. Flatten with pdf-flatten so the values become static page content, then redact any that are visible.
Incremental-update history retains earlier content
May persist: PDFs saved incrementally can keep prior, pre-redaction content layers. A plain metadata scrub does not remove them. Flatten with pdf-flatten or re-save through pdf-compress-lossless to rebuild the file and drop the history before sharing.
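Each incremental save appends a new section ending in its own `%%EOF` marker, so counting those markers gives a rough signal that update history may be present. This is a heuristic sketch, not a parser: linearised files normally contain two markers as well, and the byte sequence could in principle occur inside stream data. The function names are hypothetical.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers in the raw bytes of a PDF."""
    return pdf_bytes.count(b"%%EOF")


def may_have_update_history(pdf_bytes: bytes) -> bool:
    """True when more than one %%EOF marker suggests appended revisions."""
    return count_eof_markers(pdf_bytes) > 1
```

A `True` result is a prompt to rebuild the file before sharing; a `False` result is reassuring but not proof, so keep the rebuild step in the chain for sensitive data.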
Dates show 1970-01-01 rather than blank
Expected: The two date fields are reset to the Unix epoch, not deleted, so a viewer shows 01/01/1970. The real, potentially re-identifying timestamp is gone; the epoch value is the intended output.
File over the free 2 MB / 50-page limit
Blocked: Free handles 2 MB and 50 pages; Pro 50 MB / 500 pages; Pro+Media 500 MB / 2,000 pages. Large research scans may exceed Free, in which case the tool blocks before processing with an upgrade prompt.
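To check a batch of scans against these limits before opening the tool, the tiers can be encoded as a small lookup. This sketch assumes decimal megabytes (1 MB = 1,000,000 bytes) and inclusive limits, neither of which the page states, and `smallest_tier` is a hypothetical helper rather than anything the tool exposes.

```python
from typing import Optional

MB = 1_000_000  # assumption: decimal megabytes

# (tier name, maximum file size in bytes, maximum page count), smallest first.
TIERS = [
    ("Free", 2 * MB, 50),
    ("Pro", 50 * MB, 500),
    ("Pro+Media", 500 * MB, 2000),
]


def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier whose size and page limits both fit, else None."""
    for name, max_bytes, max_pages in TIERS:
        if size_bytes <= max_bytes and pages <= max_pages:
            return name
    return None
```

A 1.5 MB file with 120 pages, for instance, fits the Free size limit but not its page limit, so the function returns Pro.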
Document is digitally signed
Signature breaks: Anonymising re-saves the file, which invalidates any existing digital signature. The signature itself can name the signer, so for anonymisation this is usually fine; verify the resulting file with pdf-signature-verify if signature state matters.
Frequently asked questions
Does scrubbing metadata make a PDF fully anonymous?
No. It anonymises the document-information metadata layer (Author, Creator, Producer, Title, Subject, Keywords, and the dates). Real anonymisation also requires redacting visible content (pdf-pii-redactor), removing comments (pdf-annotation-remover), flattening form fields (pdf-flatten), and dropping the XMP packet (a lossless re-save). This tool owns one verified layer of that chain.
What's the correct order for anonymising a research PDF?
Redact content → remove comments → flatten fields → scrub metadata (this tool) → re-save losslessly to drop XMP → audit with ExifTool. Metadata is near the end because earlier steps re-save the file and could otherwise reintroduce metadata.
Does it read or redact the visible text on the page?
No. The scrubber only touches the hidden document-info fields. Any name, ID, or email printed on the page survives — redact those with pdf-pii-redactor before calling the document anonymous.
What about revision history embedded in the PDF?
PDFs that have been fully rebuilt (for example, linearised on save) don't retain history, but files saved with incremental updates can keep earlier content layers. A metadata scrub doesn't collapse them, so flatten with pdf-flatten or re-save through pdf-compress-lossless before sharing.
Should I flatten before scrubbing?
Yes, if the document has form fields or you suspect incremental layers. Flatten with pdf-flatten first to turn field values into static content and collapse the file, then scrub the metadata.
Does this support GDPR data-minimisation?
The processing step does: the scrub runs in your browser via pdf-lib, so the document's personal data never transits a server. Note the tool doesn't track consent or legal basis — those remain your responsibility — but it gives you a no-upload way to strip the metadata fingerprint.
Are the dates removed or reset?
Reset. Both /CreationDate and /ModDate are set to the Unix epoch (1970-01-01T00:00:00Z), so a viewer shows that date rather than a blank. The original, potentially re-identifying timestamp is unrecoverable from the info dictionary.
Is the file uploaded anywhere?
No. Everything runs locally in your browser. The document and its metadata never leave your device; only an anonymous run counter is recorded for signed-in users, which you can opt out of.
Can I anonymise a whole folder of documents at once?
This tool is single-file in the browser. For a batch, pair the @jadapps/runner and POST each file to 127.0.0.1:9789/v1/tools/pdf-metadata-scrubber/run (no options needed). Processing stays on your machine, which suits sensitive datasets.
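A batch loop over a folder might look like the sketch below. Only the host, port, and path come from the answer above; the `http://` scheme, sending the raw PDF bytes as the request body with an `application/pdf` content type, and receiving the scrubbed PDF as the response body are all assumptions, so check the runner's own documentation before relying on it. `build_request` and `scrub_folder` are hypothetical names.

```python
from pathlib import Path
from urllib.request import Request, urlopen

# Host, port, and path are from the text above; the http:// scheme is assumed.
RUN_URL = "http://127.0.0.1:9789/v1/tools/pdf-metadata-scrubber/run"


def build_request(pdf_path: Path) -> Request:
    """Build (but do not send) a POST carrying one PDF.

    ASSUMPTION: the runner accepts raw PDF bytes as the body.
    """
    return Request(
        RUN_URL,
        data=pdf_path.read_bytes(),
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def scrub_folder(folder: Path, out_dir: Path) -> list[Path]:
    """POST every *.pdf in folder and save each response body to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(folder.glob("*.pdf")):
        # Needs the local runner listening on 127.0.0.1:9789.
        with urlopen(build_request(pdf_path)) as response:
            target = out_dir / pdf_path.name
            # ASSUMPTION: the response body is the scrubbed PDF.
            target.write_bytes(response.read())
            written.append(target)
    return written
```

Writing to a separate output folder keeps the originals intact, so the rest of the chain (lossless re-save, ExifTool audit) can be run and compared file by file.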
Will the document look different after anonymising the metadata?
No — metadata is invisible to readers. Only the hidden fields and date stamps change. Visible content is unchanged, which is exactly why content redaction is a separate, essential step.
What's the largest PDF I can anonymise?
Free: 2 MB / 50 pages. Pro: 50 MB / 500 pages. Pro+Media: 500 MB / 2,000 pages. The metadata operation is fast; for image-heavy scans you may need a higher tier or to compress first with pdf-compress-lossy.
Does it break a digital signature?
Yes — re-saving invalidates an existing signature, which can itself name the signer. For anonymisation that is normally acceptable; if you need to confirm signature state, check the result with pdf-signature-verify.
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only usage counters (not document metadata) are saved for signed-in dashboard stats.