Convert a Scanned PDF to Editable Word Text (OCR First)

How to convert a scanned PDF to editable Word text

Step 1
Confirm it's actually a scan — Drop the PDF onto PDF to Word first. If the preview is empty, it's image-only and needs OCR. (If text already appears, skip OCR — you're done.)
Step 2
Open PDF OCR and pick the language — Go to PDF OCR and select the document's language from the 10-language menu. The first run downloads ~10 MB of Tesseract language data, then caches it.
Step 3
Run OCR to get a searchable PDF — Process the scan. Tesseract renders and recognises each page and re-emits a searchable PDF that looks identical to the scan but now carries an invisible text layer. Download it.
Step 4
Extract the recognised text to .txt — Drop the OCR'd PDF onto PDF to Word. It extracts the new text layer and gives you a UTF-8 .txt.
Step 5
Open in Word and proofread — Open or paste the .txt into Word. Read it against the scan — OCR mistakes cluster around digits, punctuation, and lookalike letters (l/1, O/0, rn/m). Use Find & Replace for systematic errors.
Step 6
Rebuild structure — Apply heading styles, fix wrapped lines, and rebuild any tables (OCR'd tables extract as spaced text — use PDF to Excel on the OCR'd PDF for tabular pages).
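The "fix wrapped lines" part of Step 6 can be done before the text reaches Word. The following is a minimal sketch, not part of the tools described here: it assumes the extracted .txt uses blank lines between paragraphs and hard line breaks inside them, and the function name `unwrap_paragraphs` is made up for illustration.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        merged = ""
        for line in lines:
            if merged.endswith("-") and line[:1].islower():
                # Likely a word hyphenated across a line break: rejoin it.
                merged = merged[:-1] + line
            elif merged:
                merged += " " + line
            else:
                merged = line
        joined.append(merged)
    return "\n\n".join(joined)


sample = (
    "Thank you for your appli-\n"
    "cation dated 3 March 2026.\n"
    "We will reply soon.\n"
    "\n"
    "Kind regards,\n"
    "J. Smith"
)
print(unwrap_paragraphs(sample))
```

The heuristic also joins short lines that were meant to stay separate (addresses, signature blocks), so review those by hand after running it.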

The two-step pipeline

There is no single "scanned PDF → Word" button. OCR and extraction are two separate tool passes, and editing in Word is the manual stage that follows them.

Step	Tool	Does	Output
1. OCR	PDF OCR	Renders pages, recognises glyphs (Tesseract.js), adds invisible text layer	Searchable PDF
2. Extract	PDF to Word	Reads the new text layer (pdf.js)	UTF-8 `.txt`
3. Edit	Microsoft Word / Docs	Proofread, style, rebuild tables	Your `.docx`

OCR languages & accuracy expectations

The languages below are the ones listed in the OCR tool's selector. Accuracy depends mainly on scan quality rather than on the tool.

Factor	Detail
Languages	English, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Chinese (Simplified), Japanese
First-run download	~10 MB of Tesseract language data, then cached
Clean printed scan	High accuracy — proofread, don't retype
Faint / low-DPI / skewed scan	More errors — rescan higher quality if possible
Handwriting	Unreliable — Tesseract targets printed text; expect heavy correction
Tables in the scan	Recognised as text, not grids — use PDF to Excel for structure

Cookbook

Two-step recipes for turning paper into editable Word text. Output blocks approximate the .txt after OCR.

Digitise a clean printed letter

A crisp 300-DPI scan of a typed letter OCRs cleanly; extraction then gives near-perfect text for Word.

Input:  letter-scan.pdf  (1 page, clean 300 DPI)

Step 1  /pdf-tools/pdf-ocr (English) -> letter-scan (searchable).pdf
Step 2  /pdf-tools/pdf-to-word -> letter-scan.txt

Output (.txt):
Dear Ms Alvarez,
Thank you for your application dated 3 March 2026 ...
(proofread, then style in Word)

Non-English scan — pick the right model

Selecting the matching language dramatically improves recognition of accented and non-Latin characters.

Input:  rechnung-scan.pdf  (German invoice scan)

Step 1  /pdf-tools/pdf-ocr -> language: German (deu)
        -> downloads German model (~10 MB, first time)
        -> rechnung-scan (searchable).pdf
Step 2  /pdf-tools/pdf-to-word -> rechnung-scan.txt

Output keeps umlauts: Gesamtbetrag: 1.234,56 EUR (ä ö ü ß intact)

Catch and fix systematic OCR errors

OCR errors are predictable. Fix them in bulk with Word's Find & Replace rather than line by line.

Output (.txt) with typical OCR slips:
Invoice N0. 1OO45  amount $1,2OO.OO  due 0n 5/1

In Word, Find & Replace with "Match case" on (do digits first, in context):
  OO  ->  00     (zero misread as letter O)
  N0. ->  No.
  0n  ->  on
Result:
Invoice No. 10045  amount $1,200.00  due on 5/1
(still verify every figure against the scan)
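If the .txt is long, the same context rule can be scripted before the text goes into Word. This is a rough sketch under stated assumptions: the lookalike mapping, the "numeric shape" pattern, and the two word-level fixes are illustrative choices based on the sample line above, not behaviour of the OCR tool.

```python
import re

# Letters that OCR commonly produces in place of digits (assumed mapping).
LOOKALIKE_TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# A token "looks numeric" if it contains only digits, lookalikes and separators.
NUMERIC_SHAPE = re.compile(r"[0-9OolI.,/$%-]+")

# Whole-token fixes for known slips in words (taken from the sample above).
WORD_FIXES = {"N0.": "No.", "0n": "on"}


def fix_token(token: str) -> str:
    if token in WORD_FIXES:
        return WORD_FIXES[token]
    # Only touch tokens that already contain a real digit, so ordinary
    # words such as "Oslo" or "Il" are left alone.
    if NUMERIC_SHAPE.fullmatch(token) and any(ch.isdigit() for ch in token):
        return token.translate(LOOKALIKE_TO_DIGIT)
    return token


def fix_line(line: str) -> str:
    # Replace each run of non-space characters, keeping the original spacing.
    return re.sub(r"\S+", lambda m: fix_token(m.group(0)), line)


print(fix_line("Invoice N0. 1OO45  amount $1,2OO.OO  due 0n 5/1"))
```

Restricting the substitution to tokens that already contain a digit is what keeps it "in context". It cannot tell whether a corrected figure is the right figure, so the check against the scan still applies.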

Scanned table — OCR then send to Excel

A scanned spreadsheet won't become a Word table from text extraction. OCR it, then use PDF to Excel for structure.

Input:  inventory-scan.pdf  (scanned table)

Step 1  /pdf-tools/pdf-ocr -> searchable inventory PDF
Step 2a /pdf-tools/pdf-to-excel -> CSV of the table rows/cols
Step 2b (body text) /pdf-tools/pdf-to-word -> .txt
Combine: paste the CSV-as-table + the body text into Word
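One practical wrinkle in the combine step: amounts such as "1,200.00" contain commas, so pasting raw CSV and splitting on commas in Word can break cells apart. A small sketch, assuming the CSV from PDF to Excel uses standard double-quote quoting (not confirmed by this guide), converts it to tab-separated text that Word's Convert Text to Table can split on tabs. The function name is hypothetical.

```python
import csv
import io


def csv_to_tsv(csv_text: str) -> str:
    """Convert CSV text to tab-separated text, one table row per line."""
    reader = csv.reader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # Tabs inside a cell would be read as column breaks, so flatten them.
        cells = [cell.replace("\t", " ").strip() for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines)


sample_csv = 'Item,Qty,Price\nWidget,3,"1,200.00"\n'
print(csv_to_tsv(sample_csv))
```

Paste the result into Word, select it, and convert the text to a table with tabs as the separator.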

Multi-page archive batch

For a thick archive, mind the per-file page limit at each step and split if needed.

Input:  archive.pdf  (80 scanned pages) — over the 50-page Free limit

  1. /pdf-tools/pdf-extract-pages -> two parts (1-50, 51-80)
  2. OCR each part /pdf-tools/pdf-ocr
  3. /pdf-tools/pdf-to-word on each -> two .txt files
  4. Concatenate in Word
(Pro tier raises the limit to 500 pages, avoiding the split.)
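The split points follow directly from the page limit. A minimal sketch of the arithmetic, with a hypothetical helper name, assuming the limit is inclusive (a 50-page part is accepted on the Free tier):

```python
def split_ranges(total_pages: int, limit: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges of at most `limit` pages."""
    if total_pages < 1 or limit < 1:
        raise ValueError("total_pages and limit must be positive")
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]


print(split_ranges(80, 50))  # [(1, 50), (51, 80)]
```

Each range is what you would enter in Extract Pages for one part. Keep the parts in order when concatenating the .txt files.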

Edge cases and what actually happens

Expecting a one-click scanned-PDF-to-Word button

Two steps required

The text converter has no OCR toggle. A scan must go through PDF OCR first (producing a searchable PDF), then through PDF to Word. Running the text converter alone on a raw scan returns nothing.

Handwritten pages

Unreliable

Tesseract targets printed text. Cursive and most handwriting are recognised poorly and need heavy correction. For handwriting-heavy documents, budget time to transcribe rather than relying on OCR.

Faint, skewed, or low-DPI scan

Reduced accuracy

Recognition quality tracks scan quality. Faint photocopies, rotated pages, and sub-200-DPI scans produce more errors. Rescan at higher DPI and straighten the page if you can before OCR.
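If you know the scan's pixel width and the paper size, you can estimate its DPI before deciding whether to rescan. This is a sketch of the arithmetic only. The 200-DPI threshold comes from the paragraph above, and the helper names are made up for illustration.

```python
def effective_dpi(pixel_width: int, page_width_in: float) -> float:
    """Dots per inch implied by an image width in pixels and a paper width in inches."""
    return pixel_width / page_width_in


def needs_rescan(pixel_width: int, page_width_in: float, min_dpi: float = 200.0) -> bool:
    """True when the scan falls below the DPI threshold."""
    return effective_dpi(pixel_width, page_width_in) < min_dpi


# A US Letter page is 8.5 inches wide.
print(effective_dpi(2550, 8.5))  # 300.0
print(needs_rescan(1275, 8.5))   # True (150 DPI)
```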

Wrong OCR language selected

Garbled accents

Running a French or German scan through the English model mangles accented characters. Pick the matching language in PDF OCR; the first use of each language downloads ~10 MB of data, then caches it.

Digits and lookalike letters misread

Proofread required

OCR commonly confuses O/0, l/1/I, S/5, and rn/m. In amounts, dates, and reference numbers a single slip matters — verify every figure against the scan and fix lookalikes with targeted Find & Replace.
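To decide where to look first, it can help to list every token that mixes digits and letters, since that is where O/0 and l/1 slips show up. A small sketch, assuming whitespace-separated tokens and using a hypothetical function name. It only flags candidates and does not change anything.

```python
import re


def suspicious_tokens(text: str) -> list[str]:
    """Return whitespace-separated tokens that contain both a digit and a letter."""
    flagged = []
    for token in re.findall(r"\S+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_letter = any(ch.isalpha() for ch in token)
        if has_digit and has_letter:
            flagged.append(token)
    return flagged


print(suspicious_tokens("Invoice N0. 1OO45  amount $1,2OO.OO  due 0n 5/1"))
```

Legitimate mixed tokens (for example "A4" or "3rd") are flagged as well, and rn/m confusions are not caught at all, so treat the list as a reading aid rather than an error report.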

Scanned tables flatten to spaced text

Use PDF to Excel

OCR recognises table cells as text but not as a grid. For tabular scans, run PDF to Excel on the OCR'd PDF to get CSV, then paste a real table into Word.

Archive exceeds the page limit at a step

Rejected

Each step enforces the tier limit (Free 50 pages / 2 MB). For larger archives, split with Extract Pages, process the parts, and recombine in Word — or upgrade to Pro (500 pages).

Mixed PDF — some text pages, some scanned

Partly empty

A PDF that mixes digital and scanned pages extracts text only from the digital ones; scanned pages come back blank. OCR the whole file first so every page gains a text layer, then extract.
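To see which pages came back blank, check the extracted text page by page. The sketch below assumes you already have one string per page (how you obtain them is outside this guide), and the function name is hypothetical.

```python
def pages_needing_ocr(page_texts: list[str]) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is empty or whitespace."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]


pages = ["Intro text", "", "  \n", "Appendix"]
print(pages_needing_ocr(pages))  # [2, 3]
```

A blank result does not prove the page is a scan (it may simply be an empty page), so confirm against the PDF before re-running OCR.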

Frequently asked questions

Why can't I just convert a scanned PDF to Word directly?

Because a scan is an image — there's no text to extract. You must OCR it first to create a text layer. Run PDF OCR (it outputs a searchable PDF), then run PDF to Word on that to get a .txt for Word.

Is there an OCR option inside the PDF to Word tool?

No. The text converter has no settings at all and no OCR toggle. OCR is a separate tool and a separate pass. This page documents the real two-step workflow.

Which languages does the OCR support?

Ten: English, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Simplified Chinese, and Japanese. Select the document's language in PDF OCR before running — the first use of each downloads ~10 MB of language data, then caches it.

What accuracy can I expect?

Clean printed scans in standard fonts are recognised with high accuracy, so you proofread rather than retype. Faint, skewed, low-DPI, or handwritten pages produce noticeably more errors. Always read the result against the scan, especially numbers.

Are my scans uploaded?

No. Both OCR (Tesseract.js) and text extraction (pdf.js) run in your browser; the scan never leaves your machine. That's important for medical, HR, and archival documents.

Does OCR handle handwriting?

Not reliably. Tesseract is built for printed text. Cursive and most handwriting are recognised poorly and need heavy correction; for those, transcription is usually faster than fixing OCR.

What's a searchable PDF and why does OCR make one?

OCR re-emits the scan as a PDF that looks identical but has an invisible text layer drawn over the glyphs, so text can be selected, copied, and indexed. The text converter then reads that layer. You can also keep the searchable PDF as a useful artefact in its own right.

How do I fix OCR mistakes efficiently?

OCR errors are systematic (O↔0, l↔1, rn↔m). Fix them in bulk with Word's Find & Replace, then proofread amounts and dates individually against the scan.

Will scanned tables become Word tables?

No — they OCR to spaced text. For tabular scans, run PDF to Excel on the OCR'd PDF to get CSV, then paste a proper table into Word.

My PDF has both real text and scanned pages — what happens?

Text extraction returns the digital pages' text and leaves scanned pages blank. OCR the whole file first so every page gets a text layer, then extract.

What are the size and page limits?

Each step uses the PDF-family limits: Free 2 MB / 50 pages, Pro 50 MB / 500 pages, Pro Media 500 MB / 2,000 pages. For big archives, split with Extract Pages and recombine in Word.
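The limits above can be checked up front. A minimal sketch using the figures quoted in this answer. It assumes the limits are inclusive and that both the size and the page count must fit, and the names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_mb: float
    max_pages: int


# Limits as quoted in this guide, smallest tier first.
TIERS = [
    Tier("Free", 2, 50),
    Tier("Pro", 50, 500),
    Tier("Pro Media", 500, 2000),
]


def smallest_tier(size_mb: float, pages: int) -> Optional[str]:
    """Return the first tier that fits both limits, or None if no tier fits."""
    for tier in TIERS:
        if size_mb <= tier.max_mb and pages <= tier.max_pages:
            return tier.name
    return None


print(smallest_tier(1.5, 80))  # Pro
```

A result of None means the file has to be split regardless of tier.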

First OCR run is slow — is something wrong?

No. The first time you use a language, the OCR tool downloads ~10 MB of Tesseract data, and recognition itself is compute-heavy (each page is rendered and analysed in your browser). It speeds up after the model is cached.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.

