How to extract all text from a PDF document
- Step 1: Open the PDF to Plain Text tool. Go to the PDF to Plain Text tool. Everything runs locally in your browser using pdf.js; nothing is uploaded.
- Step 2: Drop your PDF onto the dropzone. Drag a .pdf file in or click to browse. There are no options to configure; extraction starts automatically as soon as the file is read.
- Step 3: Wait for the page-by-page read. The tool loads the document, then walks every page calling getTextContent(). Long documents take a moment because each page is parsed in turn; the status shows "Reading PDF…" then "Processing…".
- Step 4: Check the on-screen preview. The result panel shows the input page count, the output text size, and a preview of the first ~5,000 characters. Skim it to confirm the text came through (not blank, not garbled).
- Step 5: Download the .txt file. Click Download to save the full extracted text as <name>.txt in UTF-8. The preview is truncated, but the downloaded file contains everything.
- Step 6: Use the text in your workflow. Open it in a text editor, paste it into a translator, or feed it to a script. If the layout came through jumbled (multi-column), do a manual reflow, or convert to Markdown for page-headed structure.
What comes through, and what doesn't
The tool reads the embedded text layer with pdf.js. These are the categories of content and how each behaves.
| PDF content | Extracted? | Notes |
|---|---|---|
| Born-digital body text (typed in Word, LaTeX, InDesign, etc.) | Yes — fully | Stored as Unicode glyph runs; comes through verbatim. |
| Headers, footers, page numbers | Yes | They are text runs like any other, so they appear inline in the output — strip them downstream if you don't want them. |
| Footnotes and endnotes | Yes | Extracted as text, but their reading position depends on where the runs sit on the page. |
| Scanned / photographed pages (image-only) | No | No text layer exists — output for those pages is empty. Run OCR first. |
| Text inside images, charts, or logos | No | Rasterised pixels, not text runs. OCR is the only way to recover it. |
| Tables | Partially | Cell text is extracted but flattened into a line; for structured rows use PDF Table to JSON or PDF to Excel. |
Output format and tier limits
This table lists the tool's actual output behaviour and the caps enforced before processing. Free-tier checks run on file size and page count.
| Property | Value |
|---|---|
| Output file | .txt (named after the source, e.g. report.txt) |
| Encoding | UTF-8, no BOM |
| Within-page join | Text runs joined with a single space |
| Page separator | Blank line (double newline \n\n) between pages |
| On-screen preview | First ~5,000 characters (the download is complete) |
| Free tier | Up to 2 MB file size and 50 pages |
| Pro tier | Up to 50 MB and 500 pages |
| Processing location | Your browser (pdf.js) — 0 bytes uploaded |
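Because pages are separated by a blank line and runs within a page are joined with single spaces, the downloaded file can be cut back into pages with a plain string split. The following is a minimal Python sketch, not part of the tool: the name `split_pages` is illustrative, and it assumes that no text run contains a newline of its own.

```python
def split_pages(text: str) -> list[str]:
    """Split extracted text into one string per page."""
    # Assumption: the separator is exactly a double newline and text inside
    # a page never contains newlines. An image-only page contributes an
    # empty string, so index + 1 stays aligned with the PDF page number.
    return text.split("\n\n")


# Hypothetical sample: page 3 is a scan with no text layer.
sample = "Page one text\n\nPage two text\n\n\n\nPage four text"
pages = split_pages(sample)
for number, page in enumerate(pages, start=1):
    print(number, repr(page))
```

With a real download, the text would first be read with something like `Path("report.txt").read_text(encoding="utf-8")`.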
Cookbook
Concrete extraction scenarios with what the input looks like and what lands in the .txt file.
A clean single-column report
The ideal case — a born-digital report or article with one column of body text. Runs flow in natural reading order and the output reads like the original.
Input: quarterly-report.pdf (born-digital, 12 pages)
Action: drop on the tool → auto-extracts
Output (quarterly-report.txt):
Quarterly Report 2026 Q1 Revenue grew 14% year over year, driven by ...
[blank line marks the page break]
Page 2 body continues here ...
Pull a long PDF into a text editor for find-and-replace
You need to search or bulk-edit text that the PDF viewer makes painful. Extract once, then work in any editor.
1. Drop manual.pdf → Download manual.txt
2. Open manual.txt in VS Code / Notepad++
3. Ctrl+F across the whole document, or run a find-and-replace, with no page-by-page hunting
Headers and footers appear inline
Running headers and page numbers are text runs, so they are extracted too. This is expected; remove them downstream if they pollute your text.
Output excerpt (notice the repeated header + number):
... end of section three. ACME Corp — Internal 14 Section Four The following clauses apply ...
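One way to remove a recurring header downstream is a regular expression applied page by page. This Python sketch is only an illustration: `strip_running_header` is a hypothetical helper, and it assumes the header is a fixed string followed by a space and the page number, as in the excerpt above.

```python
import re


def strip_running_header(text: str, header: str) -> str:
    """Remove a fixed header/footer string plus its page number from each page."""
    # Literal header, one space, then the page number; surrounding
    # whitespace is swallowed so the neighbouring words stay one space apart.
    pattern = re.compile(r"\s*" + re.escape(header) + r" \d+\s*")
    cleaned = []
    for page in text.split("\n\n"):  # pages are separated by a blank line
        cleaned.append(pattern.sub(" ", page).strip())
    return "\n\n".join(cleaned)


# Hypothetical two-page sample built from the excerpt above.
raw = (
    "... end of section three. ACME Corp — Internal 14"
    "\n\n"
    "Section Four The following clauses apply ..."
)
print(strip_running_header(raw, "ACME Corp — Internal"))
```

Splitting on the page separator first keeps the pattern from matching across a page boundary. Headers whose wording changes from page to page (chapter titles, for example) would need a looser pattern.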
A scanned page comes out blank
An image-only (scanned) page has no text layer, so pdf.js returns nothing for it. The fix is OCR first.
Input: scan.pdf (photographed pages)
Output: scan.txt → mostly empty / whitespace only
Fix:
1. Run scan.pdf through /pdf-tools/pdf-ocr
2. Download the OCR'd PDF (now has a text layer)
3. Drop THAT on this tool → text extracts
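In a mixed document it can help to find out which pages came out blank before sending the whole file to OCR. A rough Python sketch follows; it assumes the tool still writes a page separator for pages with no text, which this guide implies but does not state outright, and `blank_pages` is an illustrative name.

```python
def blank_pages(text: str) -> list[int]:
    """Return 1-based page numbers whose extracted text is empty or whitespace."""
    pages = text.split("\n\n")  # blank line between pages
    return [n for n, page in enumerate(pages, start=1) if not page.strip()]


# Hypothetical three-page file where page 2 is a scan.
mixed = "Typed cover page\n\n\n\nTyped appendix"
print(blank_pages(mixed))  # prints [2]
```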
Confirm extraction worked before relying on it
Always glance at the preview. A near-empty preview on a document that clearly has text usually means the pages are scans, or the font lacks a Unicode map.
Result panel checklist:
- Input pages: matches the document? ✓
- Output size: non-trivial (KB, not bytes)? ✓
- Preview: real words, not □□□ / mojibake? ✓
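The same checks can be approximated on the downloaded file. The sketch below is an assumption-laden heuristic, not the tool's own logic: it reports the UTF-8 size and the share of characters that usually signal a missing Unicode map (the replacement character, the white square, and private-use code points). The function name and any threshold applied to the ratio are illustrative.

```python
def extraction_report(text: str) -> dict[str, float]:
    """Summarise signals that suggest an empty or garbled extraction."""
    visible = [ch for ch in text if not ch.isspace()]
    suspicious = sum(
        1
        for ch in visible
        if ch in "\ufffd\u25a1"          # replacement character, white square
        or "\ue000" <= ch <= "\uf8ff"    # BMP private-use area
    )
    total = len(visible)
    return {
        "bytes": len(text.encode("utf-8")),
        "visible_chars": total,
        "suspicious_ratio": suspicious / total if total else 0.0,
    }


print(extraction_report("Revenue grew 14%"))
print(extraction_report("\u25a1\u25a1 ok"))
```

A near-zero `visible_chars` on a document that clearly has text points to scans; a high `suspicious_ratio` points to a font without a usable Unicode map.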
Edge cases and what actually happens
Scanned / image-only PDF
Empty output. There is no embedded text layer to read, so pages of scans extract to nothing. This is the single most common reason a PDF "won't extract." Run the file through the PDF OCR tool to add an invisible text layer, then extract from the OCR'd copy.
Password-protected PDF that needs a password to open
Fails to open. Extraction uses pdf.js, which cannot open a document that requires a user/open password; it will error. Remove the open password first with PDF Remove Password (you must know the password), then extract.
Copy-restricted PDF (owner password only, but openable)
Supported. If the PDF opens without a password but has "copying" disabled, this tool still extracts the text, because pdf.js reads the content rather than honouring the copy-permission flag. If you also need to lift the restriction in the file itself, use PDF Permission Setter or PDF Unlock.
File larger than 2 MB on the free tier
Blocked. The free tier caps PDFs at 2 MB. A larger file is blocked before processing with an upgrade prompt. Pro raises the ceiling to 50 MB. You can also split the PDF with PDF Split by Range and extract each part.
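If you batch files through the tool, a quick local pre-check can tell you which ones would be blocked. This Python sketch only mirrors the caps stated in this guide; it is not the tool's code, the names are hypothetical, and whether the tool counts 1 MB as 1,048,576 or 1,000,000 bytes is an assumption.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: the tool may use 1,000,000 bytes instead


@dataclass(frozen=True)
class TierLimit:
    max_bytes: int
    max_pages: int


# Caps as stated in this guide.
LIMITS = {
    "free": TierLimit(max_bytes=2 * MB, max_pages=50),
    "pro": TierLimit(max_bytes=50 * MB, max_pages=500),
}


def fits_tier(size_bytes: int, page_count: int, tier: str = "free") -> bool:
    """Return True if a file of this size and page count is within the tier's caps."""
    limit = LIMITS[tier]
    return size_bytes <= limit.max_bytes and page_count <= limit.max_pages


print(fits_tier(3 * MB, 10))         # False: over the free size cap
print(fits_tier(3 * MB, 10, "pro"))  # True
```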
More than 50 pages on the free tier
Blocked. Free-tier extraction is capped at 50 pages (Pro: 500). A longer document is blocked with an upgrade prompt. Splitting the PDF into ≤50-page parts lets you extract each on the free tier.
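The page ranges to enter when splitting are simple arithmetic. A small Python sketch, with `page_ranges` as an illustrative name rather than anything the tool provides:

```python
def page_ranges(total_pages: int, max_pages: int = 50) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first, last) ranges of at most max_pages each."""
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    return [
        (first, min(first + max_pages - 1, total_pages))
        for first in range(1, total_pages + 1, max_pages)
    ]


print(page_ranges(120))  # prints [(1, 50), (51, 100), (101, 120)]
```

Each part must still fit the file-size cap on its own, so image-heavy documents may need smaller ranges.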
Multi-column layout
May interleave. The tool joins text runs in the order pdf.js yields them; it does not reconstruct columns. A two-column page can interleave left and right columns line by line. Single-column documents are unaffected. For column-heavy material, expect to reflow manually.
Custom-encoded or subset font with no Unicode map
Garbled. Some PDFs embed fonts without a proper ToUnicode mapping. The glyphs render correctly on screen but extract as wrong characters or boxes. There is no clean fix in-tool; OCR the page (which reads pixels, not the font map) as a workaround.
Ligatures and special typographic characters
Usually preserved. Ligatures (fi, fl), curly quotes, and em dashes come through when the font carries the right Unicode mapping, which most modern PDFs do. A minority of older PDFs map ligatures to private-use code points; check the preview if exact characters matter.
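If exact characters matter for search or diffing, the downloaded text can be checked and normalised afterwards. The Python sketch below is a hedged illustration: it expands the Unicode presentation-form ligatures (U+FB00 to U+FB06, where present in the file) into plain letters and counts private-use code points, which cannot be recovered automatically. Whether a given file contains either depends on the PDF; the helper names are hypothetical.

```python
# Unicode "Alphabetic Presentation Forms" ligatures and their plain spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace single-code-point ligatures with their component letters."""
    return text.translate(_TABLE)


def count_private_use(text: str) -> int:
    """Count characters in the BMP private-use area (U+E000 to U+F8FF)."""
    return sum(1 for ch in text if "\ue000" <= ch <= "\uf8ff")


print(expand_ligatures("\ufb01nal o\ufb03ce"))  # prints final office
print(count_private_use("a\ue001b"))            # prints 1
```

An explicit table is used here instead of `unicodedata.normalize("NFKC", ...)` because NFKC also rewrites other characters, such as the ellipsis and non-breaking spaces.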
Tables and forms
Flattened. Table cells and form-field labels are text runs, so they extract, but the row/column structure is lost (cells flatten into lines). For structured output use PDF Table to JSON, PDF to Excel, or the form-field map.
Frequently asked questions
Does this work on scanned PDFs?
No — a scanned PDF is an image with no text layer, so it extracts to nothing. Run it through the PDF OCR tool first to add an invisible, selectable text layer, then drop the OCR'd file here. You can tell a scan by trying to select text in your normal PDF viewer: if you can't, it's a scan.
Will the extracted text include headers and footers?
Yes. Every text run on a page is extracted, including running headers, footers, and page numbers — they're ordinary text to the parser. If you don't want them, strip them downstream (e.g. with a regex that drops lines matching the recurring header), or use a tool tuned for clean indexing.
Will multi-column text be in the right reading order?
Single-column documents extract in natural reading order. Multi-column layouts can interleave because the tool joins runs in the order pdf.js returns them rather than reconstructing columns. For two-column papers, expect to reflow the output by hand.
What format and encoding is the output?
A plain .txt file in UTF-8 (no BOM), named after your source file. Pages are separated by a blank line. It opens in any editor, word processor, or script with no PDF library required.
How are pages separated in the output?
With a blank line — a double newline (\n\n) between consecutive pages. That gives you a reliable marker for where one page ends and the next begins, which is handy when you want to drop running headers or split by page later.
Is there a page or file-size limit?
Yes. The free tier handles PDFs up to 2 MB and 50 pages. Pro raises that to 50 MB and 500 pages, and higher tiers go further. If you hit a limit, split the PDF into smaller pieces with PDF Split by Range and extract each.
Does my PDF get uploaded anywhere?
No. Extraction runs entirely in your browser using pdf.js. The file never leaves your device — the result panel even shows "0 bytes uploaded." That makes it safe for confidential documents.
Why is my preview only showing part of the document?
The on-screen preview is capped at the first ~5,000 characters for performance. The downloaded .txt file contains the complete extracted text — don't judge completeness by the preview alone.
Can I set options, like a page range or output encoding?
No. This tool has no options — it extracts all text from every page automatically on drop, as UTF-8. To work with a subset of pages, first extract them with PDF Extract Pages, then run text extraction on the result.
Why did some characters come out as boxes or wrong letters?
The PDF embeds a font without a proper Unicode (ToUnicode) mapping, so the glyphs display correctly but extract as the wrong code points. OCR is the practical workaround because it reads the rendered pixels instead of relying on the font's character map.
Can I keep the formatting — bold, headings, layout?
No — plain text carries no styling. If you need structure, PDF to Markdown adds per-page headings, PDF to HTML wraps paragraphs in tags, and PDF to Word gives you editable text. For tables, use PDF to Excel.
What should I do if the document is encrypted?
If it needs a password just to open, the extractor can't read it — remove the open password first with PDF Remove Password. If it only blocks copying (opens fine), extraction works as-is.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.