How to extract text from a PDF into a clean, Word-ready file
- Step 1
- Step 2. Drop the PDF: Add a single PDF. Extraction begins immediately; there is nothing to configure first.
- Step 3. Confirm the text layer in the preview: The first 5,000 characters appear in a scrollable panel. If you see your text, the extraction worked. If it's empty, the PDF is a scan and needs OCR (see the cookbook).
- Step 4. Download the full .txt: Click Download to save the complete text (the preview is truncated, the file is not). It's named after your PDF, e.g. manuscript.pdf → manuscript.txt.
- Step 5. Bring it into your writing tool: Open or paste the .txt into Word, Google Docs, or LibreOffice. It arrives as one plain-text body you can re-flow and style with your house template.
- Step 6. Clean up predictable artefacts: Use Find & Replace to collapse double spaces, re-join hard-wrapped lines, and remove repeated page headers/footers that were part of the text layer. Then apply heading styles.
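The preview cap and the download name from the steps above can be sketched in a few lines of TypeScript. This is only an illustration of the described behaviour, not the tool's actual code: `previewOf`, `txtNameFor`, and `PREVIEW_LIMIT` are hypothetical names, and the sketch assumes the preview is a simple prefix of the full text.

```typescript
// Assumed preview cap, taken from the "first 5,000 characters" in Step 3.
const PREVIEW_LIMIT = 5000;

// Hypothetical helper: the on-screen preview is a prefix of the full text.
// Note: slice() counts UTF-16 code units, which can differ from visible characters.
function previewOf(fullText: string, limit: number = PREVIEW_LIMIT): string {
  return fullText.slice(0, limit);
}

// Hypothetical helper: swap a trailing ".pdf" (any case) for ".txt".
function txtNameFor(pdfName: string): string {
  return /\.pdf$/i.test(pdfName)
    ? pdfName.replace(/\.pdf$/i, ".txt")
    : `${pdfName}.txt`;
}

const fullText = "word ".repeat(2000); // 10,000 characters
console.log(previewOf(fullText).length);   // 5000
console.log(fullText.length);              // 10000
console.log(txtNameFor("manuscript.pdf")); // manuscript.txt
```

The point of the sketch is the asymmetry: the preview is cut, the downloaded text is not.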
How the text is structured in the output
The exact shape pdf.js extraction produces (lib/pdf/pdf-text-extract.ts).
| Aspect | Behaviour | Practical note |
|---|---|---|
| Within a page | Text items joined with a single space | Visual line breaks inside a page are not preserved as newlines — words flow together separated by spaces. |
| Between pages | Joined with a blank line (\n\n) | You can find page boundaries by the double newline; useful for re-splitting later. |
| Reading order | Order pdf.js reports text items | Correct for single-column; multi-column can interleave. |
| Encoding | UTF-8 | Accents, em-dashes, curly quotes, and non-Latin scripts survive when the PDF maps them correctly. |
| Headers / footers | Included as plain text | Repeated running heads appear once per page — strip with Find & Replace if unwanted. |
| Tables | Cell text only, no grid | For structured rows/columns use PDF to Excel or PDF table to JSON. |
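As a mental model for the first two rows of the table, the joining rule can be written as a small TypeScript function. This is a sketch under stated assumptions, not the contents of lib/pdf/pdf-text-extract.ts: `PageItems` and `joinExtractedText` are hypothetical names, and each page is assumed to be an array of text items already in the order pdf.js reported them.

```typescript
// Each inner array holds one page's text items, in reported order (assumed shape).
type PageItems = string[];

// Items within a page are joined by one space; pages are joined by a blank line.
function joinExtractedText(pages: PageItems[]): string {
  return pages.map((items) => items.join(" ")).join("\n\n");
}

const sample = joinExtractedText([
  ["Executive", "Summary", "Demand grew 14% year over year"],
  ["Methodology", "We surveyed 1,200 respondents"],
]);

// JSON.stringify makes the separators visible as \n\n.
console.log(JSON.stringify(sample));
```

Printing through `JSON.stringify` shows exactly where the page boundary sits, which is the property the later cleanup recipes rely on.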
When to use a sibling tool instead
Pick the output format that matches what you'll do with the text next.
| You want… | Use | Output |
|---|---|---|
| Plain text for Word/Docs | PDF to Word (this tool) | .txt |
| Same text, plumbing/NLP framing | PDF to Text | .txt |
| Markdown with page headings | PDF to Markdown | .md |
| Structured table rows | PDF to Excel | CSV |
| Searchable text from a scan | PDF OCR | searchable PDF |
Cookbook
Extraction-and-cleanup recipes for people whose goal is the words, not the design. Output blocks approximate the .txt content.
Lift the body text out of a 12-page report
A digital-native report PDF: extraction returns the full prose, pages separated by blank lines, ready to drop into a fresh Word template.
Input: market-report.pdf (12 pages, exported from InDesign)
Workflow:
    1. Drop onto /pdf-tools/pdf-to-word -> auto-extracts
    2. Download market-report.txt
Output (abbreviated):
    Executive Summary Demand grew 14% year over year ...
    (blank line = page break)
    Methodology We surveyed 1,200 respondents ...
Remove a repeating page header/footer
Running heads land in the text layer once per page. A single Find & Replace in Word strips every copy at once.
Output (.txt) repeats this on every page:
    ACME CONFIDENTIAL — DO NOT DISTRIBUTE
In Word, Find & Replace (wildcards):
    Find: ACME CONFIDENTIAL — DO NOT DISTRIBUTE
    Replace: (empty) -> Replace All
Result: header gone from all 12 pages in one click.
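If you would rather clean the .txt in a script before it reaches Word, the same replacement might look like the sketch below. `escapeRegExp` and `stripRunningHead` are hypothetical helpers, and the sketch assumes the header text is identical on every page.

```typescript
// Escape regex metacharacters so the header is matched literally.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Remove every copy of the header, plus any spaces directly after it.
// Newlines are not touched, so blank-line page separators survive.
function stripRunningHead(text: string, head: string): string {
  const pattern = new RegExp(escapeRegExp(head) + " *", "g");
  return text.replace(pattern, "");
}

const HEAD = "ACME CONFIDENTIAL — DO NOT DISTRIBUTE";
const twoPages = `${HEAD} Page one body\n\n${HEAD} Page two body`;
console.log(JSON.stringify(stripRunningHead(twoPages, HEAD)));
```

Headers that include a changing page number would need a pattern rather than a literal string; that case is left out here.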
Collapse double spaces from justified text
Justified PDFs add micro-spacing that the extractor renders as extra spaces. One pass normalises it.
Output (.txt):
    The  committee  approved  the  motion.
In Word, Find & Replace:
    Find: two spaces
    Replace: one space -> Replace All (run twice to catch triples)
Result:
    The committee approved the motion.
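In a script, one regular expression handles doubles and triples in a single pass. A minimal sketch, with `collapseSpaces` as a hypothetical helper name:

```typescript
// Replace any run of two or more spaces with one space.
// Only the space character is matched, so "\n\n" page separators are left alone.
function collapseSpaces(text: string): string {
  return text.replace(/ {2,}/g, " ");
}

console.log(collapseSpaces("The  committee   approved  the  motion."));
// The committee approved the motion.
```

Matching only literal spaces (rather than `\s`) is deliberate: it keeps the blank lines between pages intact.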
Split the .txt back into per-page chunks
Because pages are separated by a blank line, you can reconstruct page boundaries downstream — handy when feeding sections into another doc.
manuscript.txt (pages separated by \n\n)
In a text editor or script, split on a blank line:
    pages = text.split(/\n\n+/)
    pages[0] -> page 1 text
    pages[1] -> page 2 text
    ...
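One detail worth checking when you script this: how the split treats a page with no text. The sketch below assumes, without confirmation from the tool's source, that a textless page contributes an empty string between two separators. Under that assumption, splitting on exactly `"\n\n"` keeps page numbers aligned, while `/\n\n+/` merges the gap. `splitPagesExact` and `splitPagesLoose` are hypothetical names.

```typescript
// Split on exactly one blank line: an empty page stays as an empty string.
function splitPagesExact(text: string): string[] {
  return text.split("\n\n");
}

// Split on one or more blank lines: empty pages disappear from the result.
function splitPagesLoose(text: string): string[] {
  return text.split(/\n\n+/);
}

// Page 2 is assumed to be empty (for example, an image-only page).
const threePages = ["page one", "", "page three"].join("\n\n");

console.log(splitPagesExact(threePages)); // [ 'page one', '', 'page three' ]
console.log(splitPagesLoose(threePages)); // [ 'page one', 'page three' ]
```

Use the loose form when you only want non-empty chunks, and the exact form when the array index needs to match the page number.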
Scan detected — OCR then extract
An empty extraction means no text layer. OCR creates one; then this tool returns real text.
Input: archive-page.pdf (scanned)
Preview: (empty)
Fix:
    1. /pdf-tools/pdf-ocr (English) -> searchable archive-page PDF
    2. /pdf-tools/pdf-to-word -> archive-page.txt with real text
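When batch-checking downloaded .txt files, "empty" is easy to test for in code. A minimal sketch, where `looksLikeScan` is a hypothetical helper and the only signal used is whether anything other than whitespace was extracted:

```typescript
// True when the extraction contains nothing but whitespace,
// including the blank-line separators left by textless pages.
function looksLikeScan(extracted: string): boolean {
  return extracted.trim().length === 0;
}

console.log(looksLikeScan("\n\n\n\n"));          // true: separators only
console.log(looksLikeScan("Executive Summary")); // false
```

A PDF that mixes scanned and digital pages would pass this check while still missing text, so treat it as a first filter rather than a guarantee.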
Edge cases and what actually happens
Empty extraction on a scanned PDF
No text layer: If the PDF is a scan or image, there is no selectable text and the output is empty. Run PDF OCR to add a text layer, then re-extract.
Repeated running headers/footers in the text
Expected: The extractor returns every page's text layer verbatim, so a running head appears once per page. This isn't a bug. Strip it with Find & Replace in Word, or keep it if you want the markers.
Hard-wrapped lines instead of paragraphs
Expected: Within a page, text items are joined with spaces and the PDF's visual wrapping carries through. Long paragraphs may arrive broken at the original line widths; re-flow them with a Find & Replace that joins a lowercase line-end to the next line.
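If your file does contain single newlines inside a page, the "join a lowercase line-end to the next line" rule can be sketched as one replacement. `rejoinWrappedLines` is a hypothetical helper; treating a trailing lowercase letter or comma as "the sentence continues" is a heuristic, not something the tool guarantees.

```typescript
// Turn a single newline into a space when the line ends in a lowercase
// letter or comma. The (?!\n) lookahead skips the first newline of a
// blank line, so "\n\n" page separators are preserved.
function rejoinWrappedLines(text: string): string {
  return text.replace(/([a-z,])\n(?!\n)/g, "$1 ");
}

const wrapped = "The committee\napproved the\nmotion.\n\nNext page";
console.log(JSON.stringify(rejoinWrappedLines(wrapped)));
// "The committee approved the motion.\n\nNext page"
```

Lines that end in a full stop, colon, or capital letter are left alone, which keeps most headings and list items on their own lines.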
Two-column paper reads across columns
Reading order may differ: Multi-column layouts can interleave because extraction follows pdf.js item order. Inspect the preview; if scrambled, separate the columns manually in Word or extract the single-column pages individually.
Non-Latin script (Arabic, CJK, Cyrillic)
Usually preserved: Output is UTF-8, so any script the PDF maps via ToUnicode comes across. Right-to-left text may need direction set in Word; CJK is fine as long as the source embedded proper Unicode mappings.
Encrypted PDF
Blocked until decrypted: Text can't be read from an encrypted PDF. Decrypt with Remove PDF Password first, then extract.
File exceeds the tier limit
Rejected: Free is 2 MB / 50 pages. Larger files are blocked. Upgrade to Pro (50 MB / 500 pages) or split with Extract Pages.
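To plan a split before you hit the limit, the quoted numbers can be turned into a quick check. This is a sketch, not the tool's validation logic: `fitsTier` and `pageRanges` are hypothetical helpers, 1 MB is assumed to mean 1024 * 1024 bytes, and the limits are assumed to be inclusive.

```typescript
type Tier = "free" | "pro";

interface TierLimit {
  maxBytes: number;
  maxPages: number;
}

// Limits quoted above; the byte size of "1 MB" is an assumption.
const MB = 1024 * 1024;
const TIER_LIMITS: Record<Tier, TierLimit> = {
  free: { maxBytes: 2 * MB, maxPages: 50 },
  pro: { maxBytes: 50 * MB, maxPages: 500 },
};

// A file fits only if it is within both the size and the page limit.
function fitsTier(tier: Tier, bytes: number, pages: number): boolean {
  const limit = TIER_LIMITS[tier];
  return bytes <= limit.maxBytes && pages <= limit.maxPages;
}

// 1-based, inclusive page ranges of at most `chunk` pages each.
function pageRanges(totalPages: number, chunk: number): Array<[number, number]> {
  const ranges: Array<[number, number]> = [];
  for (let start = 1; start <= totalPages; start += chunk) {
    ranges.push([start, Math.min(start + chunk - 1, totalPages)]);
  }
  return ranges;
}

console.log(fitsTier("free", 3 * MB, 40)); // false: over the 2 MB size limit
console.log(JSON.stringify(pageRanges(120, 50))); // [[1,50],[51,100],[101,120]]
```

Splitting by page count does not guarantee that each part is under the size limit, so image-heavy documents may need smaller ranges.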
Invisible / off-page text appears in the output
Expected: Some PDFs carry hidden text (transparent watermarks, OCR ghost layers, off-canvas content). It's part of the text layer, so it's extracted. Search and remove anything unexpected in Word.
Frequently asked questions
Why is text extraction better than copy-paste from a PDF viewer?
Viewers paste with hard line breaks, inconsistent spacing, and often duplicated headers, and many won't let you select across a whole document at once. This tool extracts the entire text layer in one pass, page by page, into a single clean .txt you control.
What file do I get?
A UTF-8 .txt file named after your PDF (e.g. report.txt). Pages are separated by a blank line. There is no .docx — you open or paste the text into Word/Docs/LibreOffice and style it yourself.
Does the preview show the whole document?
No — the on-screen preview is truncated to the first 5,000 characters to stay responsive. The downloaded .txt contains the complete extraction.
Are headers and footers included?
Yes, they're part of each page's text layer, so they appear (typically once per page). Remove them with a single Find & Replace in Word if you don't want them.
How are footnotes handled?
Footnote text is in the page's text layer, so it's extracted — usually appearing after the body text of the page where it sits, not auto-linked as a Word footnote. Re-attach footnotes manually if you need Word's footnote feature.
Will the reading order always be right?
For single-column documents, yes. Multi-column layouts can interleave because extraction follows pdf.js's item order. Check the preview and re-order blocks in Word if needed.
Does it keep accents and special characters?
Yes. Output is UTF-8, so accents, em-dashes, curly quotes, and non-Latin scripts survive as long as the PDF maps them correctly via its embedded ToUnicode tables.
Can I extract tables this way?
You'll get the cells' text but not a structured table. For rows and columns, use PDF to Excel (CSV, columns by position) or PDF table to JSON.
Is the file uploaded?
No. pdf.js parses the PDF in your browser. The UI confirms "0 bytes uploaded." Good for confidential drafts and unpublished manuscripts.
How do I get text from a scanned PDF?
Run PDF OCR first (it recognises the glyphs and outputs a searchable PDF), then run this tool on the OCR'd file to extract the recognised text.
What's the page limit?
50 pages free, 500 on Pro, 2,000 on Pro Media. For longer documents, split with Extract Pages and extract each part.
Is this the same as the PDF to Text tool?
Yes, functionally — both extract the text layer to .txt. This page targets the "prepare text for a Word document" workflow; PDF to Text targets search/NLP plumbing. Same engine, same output.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.