Extract PDF Text into a Clean Word-Ready File

How to extract text from a PDF into a clean Word-ready file

Step 1
Open the converter — Go to the PDF to Word tool. It extracts text locally with pdf.js — no upload.
Step 2
Drop the PDF — Add a single PDF. Extraction begins immediately; there is nothing to configure first.
Step 3
Confirm the text layer in the preview — The first 5,000 characters appear in a scrollable panel. If you see your text, the extraction worked. If it's empty, the PDF is a scan and needs OCR (see the cookbook).
Step 4
Download the full .txt — Click Download to save the complete text (the preview is truncated, the file is not). It's named after your PDF, e.g. manuscript.pdf → manuscript.txt.
Step 5
Bring it into your writing tool — Open or paste the .txt into Word, Google Docs, or LibreOffice. It arrives as one plain-text body you can re-flow and style with your house template.
Step 6
Clean up predictable artefacts — Use Find & Replace to collapse double spaces, re-join hard-wrapped lines, and remove repeated page headers/footers that were part of the text layer. Then apply heading styles.
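
The naming and preview rules in Steps 3 and 4 can be sketched in a few lines. This is only an illustration under stated assumptions: `txtNameFor` and `previewOf` are hypothetical helper names, not functions from the tool, and the only values taken from the steps above are the 5,000-character preview and the manuscript.pdf to manuscript.txt naming.

```typescript
// Preview length stated in Step 3.
const PREVIEW_LIMIT = 5000;

// Hypothetical helper: derive the download name from the PDF name
// (manuscript.pdf -> manuscript.txt).
function txtNameFor(pdfName: string): string {
  return pdfName.replace(/\.pdf$/i, "") + ".txt";
}

// Hypothetical helper: the preview is only a prefix of the extraction;
// the downloaded file holds the full text.
function previewOf(fullText: string): string {
  return fullText.slice(0, PREVIEW_LIMIT);
}
```

The point of the sketch is that a short preview never implies a short download: only the on-screen panel is cut off.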

How the text is structured in the output

The exact shape pdf.js extraction produces (lib/pdf/pdf-text-extract.ts).

Aspect	Behaviour	Practical note
Within a page	Text items joined with a single space	Visual line breaks inside a page are not preserved as newlines — words flow together separated by spaces.
Between pages	Joined with a blank line (`\n\n`)	You can find page boundaries by the double newline; useful for re-splitting later.
Reading order	Order pdf.js reports text items	Correct for single-column; multi-column can interleave.
Encoding	UTF-8	Accents, em-dashes, curly quotes, and non-Latin scripts survive when the PDF maps them correctly.
Headers / footers	Included as plain text	Repeated running heads appear once per page — strip with Find & Replace if unwanted.
Tables	Cell text only, no grid	For structured rows/columns use PDF to Excel or PDF table to JSON.
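
To make the first two rows of the table concrete, here is a minimal sketch of the joining rule. It is not the contents of lib/pdf/pdf-text-extract.ts; `joinExtractedText` is a hypothetical name, and each inner array simply stands in for the text items pdf.js reports for one page.

```typescript
// Hypothetical sketch of the joining rule described in the table.
// pages[i] is the list of text items reported for page i + 1.
function joinExtractedText(pages: string[][]): string {
  return pages
    .map((items) => items.join(" ")) // within a page: single space
    .join("\n\n"); // between pages: blank line
}

const sample = joinExtractedText([
  ["Executive", "Summary", "Demand grew 14%"],
  ["Methodology", "We surveyed 1,200 respondents"],
]);
// sample === "Executive Summary Demand grew 14%\n\nMethodology We surveyed 1,200 respondents"
```

Under this rule the only newlines in the output are the page separators, which is what makes the page-splitting recipe in the cookbook possible.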

When to use a sibling tool instead

Pick the output format that matches what you'll do with the text next.

You want…	Use	Output
Plain text for Word/Docs	PDF to Word (this tool)	`.txt`
Same text, plumbing/NLP framing	PDF to Text	`.txt`
Markdown with page headings	PDF to Markdown	`.md`
Structured table rows	PDF to Excel	CSV
Searchable text from a scan	PDF OCR	searchable PDF

Cookbook

Extraction-and-cleanup recipes for people whose goal is the words, not the design. Output blocks approximate the .txt content.

Lift the body text out of a 12-page report

A digital-native report PDF: extraction returns the full prose, pages separated by blank lines, ready to drop into a fresh Word template.

Input:  market-report.pdf  (12 pages, exported from InDesign)

Workflow:
  1. Drop onto /pdf-tools/pdf-to-word -> auto-extracts
  2. Download market-report.txt

Output (abbreviated; each page arrives as one space-joined run):
Executive Summary Demand grew 14% year over year ...

Methodology We surveyed 1,200 respondents ...

(The blank line between the two runs is the page break.)

Remove a repeating page header/footer

Running heads land in the text layer once per page. A single Find & Replace in Word with the literal header text strips every copy at once.

Output (.txt) repeats this on every page:
ACME CONFIDENTIAL — DO NOT DISTRIBUTE

In Word, Find & Replace (a literal match; wildcards are not needed):
  Find:    ACME CONFIDENTIAL — DO NOT DISTRIBUTE
  Replace: (empty)
  -> Replace All

Result: header gone from all 12 pages in one click.
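
If you would rather clean the .txt in a script before it reaches Word, the same removal can be sketched as below. `stripRunningHead` is a hypothetical helper, and it assumes the header text is identical on every page. Note that it also collapses any other runs of spaces in the text, not only the ones the header leaves behind.

```typescript
// Hypothetical helper: remove every literal copy of a running head,
// then tidy the spaces it leaves behind.
function stripRunningHead(text: string, header: string): string {
  if (header === "") return text; // nothing to remove
  return text
    .split(header)
    .join("") // drop every literal occurrence
    .replace(/ {2,}/g, " ") // collapse runs of spaces
    .replace(/^ +| +$/gm, ""); // trim spaces at line starts and ends
}
```

Splitting and joining on the literal string avoids having to escape regex metacharacters that might appear in a header.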

Collapse double spaces from justified text

Justified PDFs add micro-spacing that the extractor renders as extra spaces. Repeating one Find & Replace normalises it.

Output (.txt):
The  committee   approved    the  motion.

In Word, Find & Replace:
  Find:    two spaces
  Replace: one space
  -> Replace All (repeat until Word reports 0 replacements; each pass roughly halves a run of spaces)

Result:
The committee approved the motion.
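
Outside Word, a single regular expression does the same job in one pass, whatever the length of the run. A minimal sketch, with `collapseSpaces` as a hypothetical helper name:

```typescript
// Collapse any run of two or more spaces into one.
// Newlines are untouched, so page separators survive.
function collapseSpaces(text: string): string {
  return text.replace(/ {2,}/g, " ");
}
```

Because the pattern matches only the space character, the blank lines that mark page boundaries are left alone.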

Split the .txt back into per-page chunks

Because pages are separated by a blank line, you can reconstruct page boundaries downstream — handy when feeding sections into another doc.

manuscript.txt  (pages separated by \n\n)

In a text editor or script, split on a blank line:
  pages = text.split(/\n\n+/)
  pages[0] -> page 1 text
  pages[1] -> page 2 text
  ...
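
A slightly fuller sketch of the same split, with page numbers attached. It assumes the page text itself never contains a newline, which follows from the single-space joining described earlier; `splitPages` and `PageChunk` are hypothetical names. Unlike the `/\n\n+/` pattern, it splits on exactly one blank line, so a page with no text stays in the list as an empty entry and later page numbers do not shift.

```typescript
interface PageChunk {
  page: number; // 1-based page number
  text: string; // that page's space-joined text, possibly empty
}

// Hypothetical helper: split the .txt on exactly one blank line per boundary.
function splitPages(text: string): PageChunk[] {
  if (text === "") return [];
  return text.split("\n\n").map((pageText, i) => ({ page: i + 1, text: pageText }));
}
```

If you have already edited the file and introduced your own blank lines, the tolerant `/\n\n+/` split is the safer choice, at the cost of losing the exact page numbering.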

Scan detected — OCR then extract

An empty extraction means no text layer. OCR creates one; then this tool returns real text.

Input:  archive-page.pdf  (scanned)
Preview: (empty)

Fix:
  1. /pdf-tools/pdf-ocr (English) -> searchable archive-page PDF
  2. /pdf-tools/pdf-to-word -> archive-page.txt with real text
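
When you process many files, the "empty extraction" check can be automated. The sketch below assumes the joining rule described earlier, under which a scanned multi-page PDF yields nothing but page separators; `hasTextLayer` is a hypothetical helper name.

```typescript
// Hypothetical helper: true when the extraction contains any visible text.
// A scanned PDF extracts to an empty string or to newlines only.
function hasTextLayer(extracted: string): boolean {
  return extracted.trim().length > 0;
}
```

A `false` result is the cue to run OCR first and then extract again.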

Edge cases and what actually happens

Empty extraction on a scanned PDF

No text layer

If the PDF is a scan or image, there is no selectable text and the output is empty. Run PDF OCR to add a text layer, then re-extract.

Repeated running headers/footers in the text

Expected

The extractor returns every page's text layer verbatim, so a running head appears once per page. This isn't a bug — strip it with Find & Replace in Word, or accept it if you want the markers.

Hard-wrapped lines instead of paragraphs

Expected

Within a page, text items are joined with single spaces, so the PDF's visual wrapping should not arrive as newlines. If long paragraphs do arrive broken at the original line widths, re-flow with a Find & Replace that joins a lowercase line-end to the next line.
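
The "join a lowercase line-end to the next line" rule can be sketched as a regular expression. This is an illustration, not part of the tool: `rejoinWrappedLines` is a hypothetical helper, it only recognises unaccented a to z as lowercase, and it leaves blank lines (page breaks) in place.

```typescript
// Hypothetical helper: replace a single newline that follows a lowercase
// letter with a space, unless it is the start of a blank line.
function rejoinWrappedLines(text: string): string {
  return text.replace(/([a-z])\n(?!\n)/g, "$1 ");
}
```

Lines that end in a full stop, colon, or capital letter are left alone, so headings and paragraph ends are less likely to be merged by accident.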

Two-column paper reads across columns

Reading order may differ

Multi-column layouts can interleave because extraction follows pdf.js item order. Inspect the preview; if scrambled, separate the columns manually in Word or extract the single-column pages individually.

Non-Latin script (Arabic, CJK, Cyrillic)

Usually preserved

Output is UTF-8, so any script the PDF maps via ToUnicode comes across. Right-to-left text may need direction set in Word; CJK is fine as long as the source embedded proper Unicode mappings.

Encrypted PDF

Blocked until decrypted

Text can't be read from an encrypted PDF. Decrypt with Remove PDF Password first, then extract.

File exceeds the tier limit

Rejected

Free is 2 MB / 50 pages. Larger files are blocked. Upgrade to Pro (50 MB / 500 pages) or split with Extract Pages.

Invisible / off-page text appears in the output

Expected

Some PDFs carry hidden text (transparent watermarks, OCR ghost layers, off-canvas content). It's part of the text layer, so it's extracted. Search and remove anything unexpected in Word.

Frequently asked questions

Why is text extraction better than copy-paste from a PDF viewer?

Viewers paste with hard line breaks, inconsistent spacing, and often duplicated headers, and many won't let you select across a whole document at once. This tool extracts the entire text layer in one pass, page by page, into a single clean .txt you control.

What file do I get?

A UTF-8 .txt file named after your PDF (e.g. report.txt). Pages are separated by a blank line. There is no .docx — you open or paste the text into Word/Docs/LibreOffice and style it yourself.

Does the preview show the whole document?

No — the on-screen preview is truncated to the first 5,000 characters to stay responsive. The downloaded .txt contains the complete extraction.

Are headers and footers included?

Yes, they're part of each page's text layer, so they appear (typically once per page). Remove them with a single Find & Replace in Word if you don't want them.

How are footnotes handled?

Footnote text is in the page's text layer, so it's extracted — usually appearing after the body text of the page where it sits, not auto-linked as a Word footnote. Re-attach footnotes manually if you need Word's footnote feature.

Will the reading order always be right?

For single-column documents, yes. Multi-column layouts can interleave because extraction follows pdf.js's item order. Check the preview and re-order blocks in Word if needed.

Does it keep accents and special characters?

Yes. Output is UTF-8, so accents, em-dashes, curly quotes, and non-Latin scripts survive as long as the PDF maps them correctly via its embedded ToUnicode tables.

Can I extract tables this way?

You'll get the cells' text but not a structured table. For rows and columns, use PDF to Excel (CSV, columns by position) or PDF table to JSON.

Is the file uploaded?

No. pdf.js parses the PDF in your browser. The UI confirms "0 bytes uploaded." Good for confidential drafts and unpublished manuscripts.

How do I get text from a scanned PDF?

Run PDF OCR first (it recognises the glyphs and outputs a searchable PDF), then run this tool on the OCR'd file to extract the recognised text.

What's the page limit?

50 pages free, 500 on Pro, 2,000 on Pro Media. For longer documents, split with Extract Pages and extract each part.
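
To plan the split for a long document, the page ranges can be worked out from the limits above. A minimal sketch: the tier keys and `splitRanges` are hypothetical names, only the page limits come from this page, and file-size limits are not considered.

```typescript
type Tier = "free" | "pro" | "proMedia";

// Page limits as stated in the answer above.
const PAGE_LIMITS: Record<Tier, number> = { free: 50, pro: 500, proMedia: 2000 };

// Hypothetical helper: 1-based, inclusive page ranges that each fit the tier limit.
function splitRanges(totalPages: number, tier: Tier): Array<[number, number]> {
  const limit = PAGE_LIMITS[tier];
  const ranges: Array<[number, number]> = [];
  for (let start = 1; start <= totalPages; start += limit) {
    ranges.push([start, Math.min(start + limit - 1, totalPages)]);
  }
  return ranges;
}
```

For a 120-page document on the free tier this gives three parts: pages 1 to 50, 51 to 100, and 101 to 120.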

Is this the same as the PDF to Text tool?

Yes, functionally — both extract the text layer to .txt. This page targets the "prepare text for a Word document" workflow; PDF to Text targets search/NLP plumbing. Same engine, same output.

Privacy first

All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.
