Extract Text from a PDF Document — Free Online Tool

How to extract all text from a PDF document

Step 1
Open the PDF to Plain Text tool — Go to the PDF to Plain Text tool. Everything runs locally in your browser using pdf.js — nothing is uploaded.
Step 2
Drop your PDF onto the dropzone — Drag a .pdf file in or click to browse. There are no options to configure — extraction starts automatically as soon as the file is read.
Step 3
Wait for the page-by-page read — The tool loads the document, then walks every page calling getTextContent(). Long documents take a moment because each page is parsed in turn; the status shows "Reading PDF…" then "Processing…".
Step 4
Check the on-screen preview — The result panel shows the input page count, the output text size, and a preview of the first ~5,000 characters. Skim it to confirm the text came through (not blank, not garbled).
Step 5
Download the .txt file — Click Download to save the full extracted text as <name>.txt in UTF-8. The preview is truncated, but the downloaded file contains everything.
Step 6
Use the text in your workflow — Open it in a text editor, paste it into a translator, or feed it to a script. If the layout came through jumbled (multi-column), do a manual reflow, or convert to Markdown for page-headed structure.

What comes through, and what doesn't

The tool reads the embedded text layer with pdf.js. These are the categories of content and how each behaves.

PDF content	Extracted?	Notes
Born-digital body text (typed in Word, LaTeX, InDesign, etc.)	Yes — fully	Stored as Unicode glyph runs; comes through verbatim.
Headers, footers, page numbers	Yes	They are text runs like any other, so they appear inline in the output — strip them downstream if you don't want them.
Footnotes and endnotes	Yes	Extracted as text, but their reading position depends on where the runs sit on the page.
Scanned / photographed pages (image-only)	No	No text layer exists — output for those pages is empty. Run OCR first.
Text inside images, charts, or logos	No	Rasterised pixels, not text runs. OCR is the only way to recover it.
Tables	Partially	Cell text is extracted but flattened into a line; for structured rows use PDF Table to JSON or PDF to Excel.

Output format and tier limits

The table below lists the tool's actual output behaviour and the caps enforced before processing starts. Free-tier checks run on both file size and page count.

Property	Value
Output file	`.txt` (named after the source, e.g. `report.txt`)
Encoding	UTF-8, no BOM
Within-page join	Text runs joined with a single space
Page separator	Blank line (double newline `\n\n`) between pages
On-screen preview	First ~5,000 characters (the download is complete)
Free tier	Up to 2 MB file size and 50 pages
Pro tier	Up to 50 MB and 500 pages
Processing location	Your browser (pdf.js) — 0 bytes uploaded
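The join rules in the table can be modelled in a few lines. This is only a sketch of the documented behaviour, not the tool's source: the function names `assemble_text` and `to_txt_bytes` and the list-of-lists input are assumptions made for illustration.

```python
def assemble_text(pages: list[list[str]]) -> str:
    """Model the documented output: runs joined by one space, pages by a blank line."""
    page_texts = [" ".join(runs) for runs in pages]
    return "\n\n".join(page_texts)


def to_txt_bytes(text: str) -> bytes:
    """Encode as UTF-8 without a byte-order mark."""
    return text.encode("utf-8")  # "utf-8" never writes a BOM; "utf-8-sig" would


# Hypothetical text runs for a two-page document.
sample = [["Quarterly", "Report", "2026", "Q1"], ["Page", "2", "body"]]
text = assemble_text(sample)
preview = text[:5000]  # the on-screen preview shows roughly the first 5,000 characters
```

Under these assumptions a page with no text runs becomes an empty string, so two page separators end up back to back in the output.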

Cookbook

Concrete extraction scenarios with what the input looks like and what lands in the .txt file.

A clean single-column report

The ideal case — a born-digital report or article with one column of body text. Runs flow in natural reading order and the output reads like the original.

Input:  quarterly-report.pdf (born-digital, 12 pages)
Action: drop on the tool → auto-extracts

Output (quarterly-report.txt):
Quarterly Report 2026 Q1
Revenue grew 14% year over year, driven by ...

[blank line marks the page break]

Page 2 body continues here ...

Pull a long PDF into a text editor for find-and-replace

You need to search or bulk-edit text that the PDF viewer makes painful. Extract once, then work in any editor.

1. Drop manual.pdf → Download manual.txt
2. Open manual.txt in VS Code / Notepad++
3. Ctrl+F across the whole document, or run a
   find-and-replace, with no page-by-page hunting

Headers and footers appear inline

Running headers and page numbers are text runs, so they are extracted too. This is expected; remove them downstream if they pollute your text.

Output excerpt (notice the repeated header + number):
... end of section three.
ACME Corp — Internal      14
Section Four
The following clauses apply ...
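One way to remove a recurring header downstream is a regular expression applied to the extracted text. The sketch below assumes the header has a fixed wording followed by a page number; the helper name `strip_running_header` and the pattern are hypothetical and would need adapting to your document.

```python
import re


def strip_running_header(page_text: str, header_pattern: str) -> str:
    """Remove every match of header_pattern and tidy the whitespace left behind."""
    without_header = re.sub(header_pattern, " ", page_text)
    # Collapse the runs of spaces left where the header used to be.
    return re.sub(r"[ \t]+", " ", without_header).strip()


# Hypothetical pattern for the excerpt above: fixed header text, then the page number.
# \u2014 is the dash character used in the header.
HEADER = r"ACME Corp \u2014 Internal\s+\d+"

page = "... end of section three. ACME Corp \u2014 Internal      14 Section Four"
cleaned = strip_running_header(page, HEADER)
```

A pattern this specific is deliberate: a loose one (for example, any line ending in digits) risks deleting real body text.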

A scanned page comes out blank

An image-only (scanned) page has no text layer, so pdf.js returns nothing for it. The fix is OCR first.

Input:  scan.pdf (photographed pages)
Output: scan.txt → mostly empty / whitespace only

Fix:
1. Run scan.pdf through /pdf-tools/pdf-ocr
2. Download the OCR'd PDF (now has a text layer)
3. Drop THAT on this tool → text extracts

Confirm extraction worked before relying on it

Always glance at the preview. A near-empty preview on a document that clearly has text usually means the pages are scans, or the font lacks a Unicode map.

Result panel checklist:
- Input pages:   matches the document?  ✓
- Output size:   non-trivial (KB, not bytes)?  ✓
- Preview:       real words, not □□□ / mojibake?  ✓
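If you process many files, the same checklist can be approximated in a script run against the downloaded .txt. This is a rough heuristic, not part of the tool: the function names and both thresholds (1 KB minimum, 5% damaged characters) are assumptions chosen for illustration.

```python
import unicodedata


def suspicious_ratio(text: str) -> float:
    """Share of non-space characters that look like extraction damage."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 1.0  # nothing but whitespace: treat as fully suspicious
    bad = sum(
        1
        for ch in visible
        if ch in "\ufffd\u25a1"  # replacement character, white square box
        or unicodedata.category(ch) == "Co"  # private-use code point
    )
    return bad / len(visible)


def looks_usable(text: str, min_bytes: int = 1024, max_ratio: float = 0.05) -> bool:
    """Rough pass/fail: enough text, and few damaged-looking characters."""
    size = len(text.encode("utf-8"))
    return size >= min_bytes and suspicious_ratio(text) <= max_ratio
```

A failing result is only a prompt to look at the text yourself; a short but legitimate document will also fall under the size threshold.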

Edge cases and what actually happens

Scanned / image-only PDF

Empty output

There is no embedded text layer to read, so pages of scans extract to nothing. This is the single most common reason a PDF "won't extract." Run the file through the PDF OCR tool to add an invisible text layer, then extract from the OCR'd copy.
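In a mixed document, where only some pages are scans, the blank pages can be located from the output itself. The sketch below assumes pages are joined with a double newline and that no page contains a blank line of its own; `blank_pages` is a hypothetical helper name.

```python
def blank_pages(text: str, separator: str = "\n\n") -> list[int]:
    """Return 1-based page numbers whose extracted text is empty or whitespace."""
    pages = text.split(separator)
    return [number for number, page in enumerate(pages, start=1) if not page.strip()]
```

For example, `blank_pages("first page\n\n\n\nthird page")` reports page 2, which is then a candidate for OCR.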

Password-protected PDF that needs a password to open

Fails to open

Extraction uses pdf.js, and the tool has no password prompt, so a document that requires a user/open password fails to load with an error. Remove the open password first with PDF Remove Password (you must know the password), then extract.

Copy-restricted PDF (owner password only, but openable)

Supported

If the PDF opens without a password but has "copying" disabled, this tool still extracts the text — pdf.js reads the content rather than honouring the copy-permission flag. If you also need to lift the restriction in the file itself, use PDF Permission Setter or PDF Unlock.

File larger than 2 MB on the free tier

Blocked

The free tier caps PDFs at 2 MB. A larger file is blocked before processing with an upgrade prompt. Pro raises the ceiling to 50 MB. You can also split the PDF with PDF Split by Range and extract each part.
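The tier caps amount to a simple pre-flight check that you can run yourself before dropping a file. This is a sketch, not the tool's code: the names `Tier`, `TIERS`, and `blocked_reason` are hypothetical, and it assumes 1 MB means 1,048,576 bytes, which the page does not specify.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: the page does not say whether 1 MB is 10^6 or 2^20 bytes


@dataclass(frozen=True)
class Tier:
    max_bytes: int
    max_pages: int


TIERS = {
    "free": Tier(max_bytes=2 * MB, max_pages=50),
    "pro": Tier(max_bytes=50 * MB, max_pages=500),
}


def blocked_reason(size_bytes: int, page_count: int, tier: str = "free") -> Optional[str]:
    """Return why a file would be blocked, or None if it fits the tier."""
    limits = TIERS[tier]
    if size_bytes > limits.max_bytes:
        return "file too large"
    if page_count > limits.max_pages:
        return "too many pages"
    return None
```

Both limits apply independently, so a 30-page file can still be blocked on size alone.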

More than 50 pages on the free tier

Blocked

Free-tier extraction is capped at 50 pages (Pro: 500). A longer document is blocked with an upgrade prompt. Splitting the PDF into ≤50-page parts lets you extract each on the free tier.
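The page ranges for such a split are straightforward to compute. A minimal sketch, with `page_ranges` as a hypothetical helper name:

```python
def page_ranges(total_pages: int, max_pages: int = 50) -> list[tuple[int, int]]:
    """Split 1..total_pages into consecutive inclusive ranges of at most max_pages."""
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(page_ranges(120))  # [(1, 50), (51, 100), (101, 120)]
```

Each resulting part still has to fit under the file-size cap, so image-heavy documents may need smaller ranges.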

Multi-column layout

May interleave

The tool joins text runs in the order pdf.js yields them; it does not reconstruct columns. A two-column page can interleave left and right columns line by line. Single-column documents are unaffected. For column-heavy material, expect to reflow manually.

Custom-encoded or subset font with no Unicode map

Garbled

Some PDFs embed fonts without a proper ToUnicode mapping. The glyphs render correctly on screen but extract as wrong characters or boxes. There is no clean fix in-tool; OCR the page (which reads pixels, not the font map) as a workaround.

Ligatures and special typographic characters

Usually preserved

Ligatures (ﬁ, ﬂ), curly quotes, and em dashes come through when the font carries the right Unicode mapping — which most modern PDFs do. A minority of older PDFs map ligatures to private-use code points; check the preview if exact characters matter.
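If you search or index the extracted text, preserved ligatures can hide matches ("ﬁle" will not match "file"). The sketch below is an optional downstream step, not something the tool does: it expands the Latin ligature block and lists any private-use code points so they can be inspected. The helper names are hypothetical.

```python
import unicodedata

# Only the Latin ligature block (U+FB00 to U+FB06) is expanded here; full NFKC
# normalisation would also rewrite other compatibility characters.
LIGATURES = {
    chr(cp): unicodedata.normalize("NFKC", chr(cp)) for cp in range(0xFB00, 0xFB07)
}


def expand_ligatures(text: str) -> str:
    """Replace ﬁ, ﬂ and the other Latin ligatures with their plain letters."""
    return text.translate(str.maketrans(LIGATURES))


def private_use_chars(text: str) -> set[str]:
    """Collect private-use code points, which usually signal a lost mapping."""
    return {ch for ch in text if unicodedata.category(ch) == "Co"}
```

Skip the expansion when exact characters matter, for instance when the text is quoted verbatim.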

Tables and forms

Flattened

Table cells and form-field labels are text runs, so they extract — but the row/column structure is lost (cells flatten into lines). For structured output use PDF Table to JSON, PDF to Excel, or the form-field map.

Frequently asked questions

Does this work on scanned PDFs?

No — a scanned PDF is an image with no text layer, so it extracts to nothing. Run it through the PDF OCR tool first to add an invisible, selectable text layer, then drop the OCR'd file here. You can tell a scan by trying to select text in your normal PDF viewer: if you can't, it's a scan.

Will the extracted text include headers and footers?

Yes. Every text run on a page is extracted, including running headers, footers, and page numbers — they're ordinary text to the parser. If you don't want them, strip them downstream (e.g. with a regex that drops lines matching the recurring header), or use a tool tuned for clean indexing.

Will multi-column text be in the right reading order?

Single-column documents extract in natural reading order. Multi-column layouts can interleave because the tool joins runs in the order pdf.js returns them rather than reconstructing columns. For two-column papers, expect to reflow the output by hand.

What format and encoding is the output?

A plain .txt file in UTF-8 (no BOM), named after your source file. Pages are separated by a blank line. It opens in any editor, word processor, or script with no PDF library required.

How are pages separated in the output?

With a blank line — a double newline (\n\n) between consecutive pages. That gives you a reliable marker for where one page ends and the next begins, which is handy when you want to drop running headers or split by page later.
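Splitting by page can be sketched as follows, assuming the separator is exactly a double newline and that no page contains a blank line of its own. The helper names `split_pages` and `pages_containing` are hypothetical.

```python
def split_pages(text: str, separator: str = "\n\n") -> list[str]:
    """Split extracted text into per-page strings; empty pages stay as ''."""
    return text.split(separator)


def pages_containing(text: str, phrase: str) -> list[int]:
    """Return 1-based page numbers whose text contains phrase (case-insensitive)."""
    needle = phrase.casefold()
    return [
        number
        for number, page in enumerate(split_pages(text), start=1)
        if needle in page.casefold()
    ]
```

Keeping empty pages in the list means the index of each entry still lines up with the page number in the original PDF.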

Is there a page or file-size limit?

Yes. The free tier handles PDFs up to 2 MB and 50 pages. Pro raises that to 50 MB and 500 pages, and higher tiers go further. If you hit a limit, split the PDF into smaller pieces with PDF Split by Range and extract each.

Does my PDF get uploaded anywhere?

No. Extraction runs entirely in your browser using pdf.js. The file never leaves your device — the result panel even shows "0 bytes uploaded." That makes it safe for confidential documents.

Why is my preview only showing part of the document?

The on-screen preview is capped at the first ~5,000 characters for performance. The downloaded .txt file contains the complete extracted text — don't judge completeness by the preview alone.

Can I set options, like a page range or output encoding?

No. This tool has no options — it extracts all text from every page automatically on drop, as UTF-8. To work with a subset of pages, first extract them with PDF Extract Pages, then run text extraction on the result.

Why did some characters come out as boxes or wrong letters?

The PDF embeds a font without a proper Unicode (ToUnicode) mapping, so the glyphs display correctly but extract as the wrong code points. OCR is the practical workaround because it reads the rendered pixels instead of relying on the font's character map.

Can I keep the formatting — bold, headings, layout?

No — plain text carries no styling. If you need structure, PDF to Markdown adds per-page headings, PDF to HTML wraps paragraphs in tags, and PDF to Word gives you editable text. For tables, use PDF to Excel.

What should I do if the document is encrypted?

If it needs a password just to open, the extractor can't read it — remove the open password first with PDF Remove Password. If it only blocks copying (opens fine), extraction works as-is.

Privacy first

All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.

