How to convert a scanned PDF to editable Word text
- Step 1: Confirm it's actually a scan. Drop the PDF onto PDF to Word first. If the preview is empty, the file is image-only and needs OCR. (If text already appears, skip OCR; the text is ready to use.)
- Step 2: Open PDF OCR and pick the language. Go to PDF OCR and select the document's language from the 10-language menu. The first run downloads ~10 MB of Tesseract language data, then caches it.
- Step 3: Run OCR to get a searchable PDF. Process the scan. Tesseract renders and recognises each page and re-emits a searchable PDF that looks identical to the scan but now carries an invisible text layer. Download it.
- Step 4: Extract the recognised text to .txt. Drop the OCR'd PDF onto PDF to Word. It extracts the new text layer and gives you a UTF-8 .txt.
- Step 5: Open in Word and proofread. Open the .txt in Word, or paste its contents in. Read it against the scan: OCR mistakes cluster around digits, punctuation, and lookalike letters (l/1, O/0, rn/m). Use Find & Replace for systematic errors.
- Step 6: Rebuild structure. Apply heading styles, fix wrapped lines, and rebuild any tables (OCR'd tables extract as spaced text; use PDF to Excel on the OCR'd PDF for tabular pages).
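The "fix wrapped lines" part of Step 6 can be done before the text reaches Word. The following is a minimal sketch, assuming the extracted .txt separates paragraphs with blank lines and hard-wraps the lines inside each paragraph; the function name `unwrap_lines` is hypothetical and not part of any tool described here.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines into single-line paragraphs.

    Assumption: blank lines mark paragraph boundaries. Words hyphenated
    across a line break are NOT rejoined; fix those while proofreading.
    """
    # Split on one or more blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace inside a paragraph to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    sample = (
        "Thank you for your application\n"
        "dated 3 March 2026.\n"
        "\n"
        "We will reply\n"
        "within ten days.\n"
    )
    print(unwrap_lines(sample))
```

Short lines that are meant to stay separate (addresses, salutations, list items) will also be merged by this approach, so it suits body text rather than whole letters.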
The two-step pipeline
There is no single "scanned PDF → Word" button. OCR and text extraction are two distinct tools, followed by manual editing in Word.
| Step | Tool | Does | Output |
|---|---|---|---|
| 1. OCR | PDF OCR | Renders pages, recognises glyphs (Tesseract.js), adds invisible text layer | Searchable PDF |
| 2. Extract | PDF to Word | Reads the new text layer (pdf.js) | UTF-8 .txt |
| 3. Edit | Microsoft Word / Docs | Proofread, style, rebuild tables | Your .docx |
OCR languages & accuracy expectations
The languages listed are those offered in the OCR tool's selector. Accuracy depends mainly on scan quality, not on the tool.
| Factor | Detail |
|---|---|
| Languages | English, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Chinese (Simplified), Japanese |
| First-run download | ~10 MB of Tesseract language data, then cached |
| Clean printed scan | High accuracy — proofread, don't retype |
| Faint / low-DPI / skewed scan | More errors — rescan higher quality if possible |
| Handwriting | Unreliable — Tesseract targets printed text; expect heavy correction |
| Tables in the scan | Recognised as text, not grids — use PDF to Excel for structure |
Cookbook
Two-step recipes for turning paper into editable Word text. Output blocks approximate the .txt after OCR.
Digitise a clean printed letter
A crisp 300-DPI scan of a typed letter OCRs cleanly; extraction then gives near-perfect text for Word.
Input: letter-scan.pdf (1 page, clean 300 DPI)
Step 1 /pdf-tools/pdf-ocr (English) -> letter-scan (searchable).pdf
Step 2 /pdf-tools/pdf-to-word -> letter-scan.txt
Output (.txt):
Dear Ms Alvarez,
Thank you for your application dated 3 March 2026 ...
(proofread, then style in Word)
Non-English scan — pick the right model
Selecting the matching language dramatically improves recognition of accented and non-Latin characters.
Input: rechnung-scan.pdf (German invoice scan)
Step 1 /pdf-tools/pdf-ocr -> language: German (deu)
-> downloads German model (~10 MB, first time)
-> rechnung-scan (searchable).pdf
Step 2 /pdf-tools/pdf-to-word -> rechnung-scan.txt
Output keeps umlauts: Gesamtbetrag: 1.234,56 EUR (ä ö ü ß intact)
Catch and fix systematic OCR errors
OCR errors are predictable. Fix them in bulk with Word's Find & Replace rather than line by line.
Output (.txt) with typical OCR slips:
Invoice N0. 1OO45 amount $1,2OO.OO due 0n 5/1
In Word, Find & Replace (do digits first, in context):
OO -> 00 (zero confused with letter O)
N0. -> No.
0n -> on
Result:
Invoice No. 10045 amount $1,200.00 due on 5/1
(still verify every figure against the scan)
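The same bulk fixes can be scripted on the .txt before opening it in Word. This is a heuristic sketch rather than anything the tools do themselves: the function name, the regular expression, and the two word-level rules are assumptions for illustration. It only rewrites lookalike letters inside tokens that already contain a digit, which mirrors the "digits first, in context" advice.

```python
import re

# A run of digits, separators and lookalike letters that contains at
# least one real digit and is not glued to a neighbouring letter.
_NUMERIC_TOKEN = re.compile(r"(?<![A-Za-z])[\dOolI,.]*\d[\dOolI,.]*(?![A-Za-z])")

# Letters commonly misread for digits inside numbers.
_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# Word-level slips taken from the example (illustrative, not exhaustive).
_WORD_FIXES = [(re.compile(r"\bN0\."), "No."), (re.compile(r"\b0n\b"), "on")]


def fix_ocr_slips(text: str) -> str:
    """Apply digit fixes first, then known word-level fixes."""
    text = _NUMERIC_TOKEN.sub(lambda m: m.group(0).translate(_LOOKALIKES), text)
    for pattern, replacement in _WORD_FIXES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(fix_ocr_slips("Invoice N0. 1OO45 amount $1,2OO.OO due 0n 5/1"))
    # prints: Invoice No. 10045 amount $1,200.00 due on 5/1
```

The heuristic can misfire on legitimate letter-digit codes (for example a token such as "O2"), so every figure still needs checking against the scan.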
Scanned table — OCR then send to Excel
A scanned spreadsheet won't become a Word table from text extraction. OCR it, then use PDF to Excel for structure.
Input: inventory-scan.pdf (scanned table)
Step 1 /pdf-tools/pdf-ocr -> searchable inventory PDF
Step 2a /pdf-tools/pdf-to-excel -> CSV of the table rows/cols
Step 2b (body text) /pdf-tools/pdf-to-word -> .txt
Combine: paste the CSV-as-table + the body text into Word
Multi-page archive batch
For a thick archive, mind the per-file page limit at each step and split if needed.
Input: archive.pdf (80 scanned pages), over the 50-page Free limit
1. /pdf-tools/pdf-extract-pages -> two parts (1-50, 51-80)
2. OCR each part /pdf-tools/pdf-ocr
3. /pdf-tools/pdf-to-word on each -> two .txt files
4. Concatenate in Word
(Pro tier raises the limit to 500 pages, avoiding the split.)
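The split points in this recipe follow from simple arithmetic. As a small sketch (the helper name `split_page_ranges` is hypothetical), the page ranges to request from Extract Pages for any archive size and per-file limit can be computed like this:

```python
def split_page_ranges(total_pages: int, limit: int) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first, last) page ranges of at most `limit` pages."""
    if total_pages < 0 or limit < 1:
        raise ValueError("total_pages must be >= 0 and limit must be >= 1")
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]


if __name__ == "__main__":
    print(split_page_ranges(80, 50))  # prints: [(1, 50), (51, 80)]
```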
Edge cases and what actually happens
Expecting a one-click scanned-PDF-to-Word button
Two steps required. The text converter has no OCR toggle. A scan must go through PDF OCR first (producing a searchable PDF), then through PDF to Word. Running the text converter alone on a raw scan returns nothing.
Handwritten pages
Unreliable. Tesseract targets printed text. Cursive and most handwriting recognise poorly and need heavy correction. For handwriting-heavy documents, budget time to transcribe rather than relying on OCR.
Faint, skewed, or low-DPI scan
Reduced accuracy. Recognition quality tracks scan quality. Faint photocopies, rotated pages, and sub-200-DPI scans produce more errors. If you can, rescan at a higher DPI and straighten the page before OCR.
Wrong OCR language selected
Garbled accents. Running a French or German scan through the English model mangles accented characters. Pick the matching language in PDF OCR; the first use of each language downloads ~10 MB of data, then caches it.
Digits and lookalike letters misread
Proofread required. OCR commonly confuses O/0, l/1/I, S/5, and rn/m. In amounts, dates, and reference numbers a single slip matters, so verify every figure against the scan and fix lookalikes with targeted Find & Replace.
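To make "verify every figure" less error-prone, a checklist of the figures in the extracted text can be generated first. This is only a sketch under a simple assumption: a "figure" is any whitespace-separated token containing at least one digit. The function name `figures_to_verify` is hypothetical.

```python
def figures_to_verify(text: str) -> list[tuple[int, str]]:
    """List (line number, token) for every token that contains a digit."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for token in line.split():
            token = token.strip(".,;:()")  # drop surrounding punctuation
            if any(ch.isdigit() for ch in token):
                found.append((line_no, token))
    return found


if __name__ == "__main__":
    sample = "Invoice No. 10045\namount $1,200.00 due on 5/1."
    for line_no, token in figures_to_verify(sample):
        print(f"line {line_no}: {token}")
```

A number whose digits were all misread as letters contains no digit and will not be listed, so the checklist supplements a read-through rather than replacing it.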
Scanned tables flatten to spaced text
Use PDF to Excel. OCR recognises table cells as text but not as a grid. For tabular scans, run PDF to Excel on the OCR'd PDF to get CSV, then paste a real table into Word.
Archive exceeds the page limit at a step
Rejected. Each step enforces the tier limit (Free 50 pages / 2 MB). For larger archives, split with Extract Pages, process the parts, and recombine in Word, or upgrade to Pro (500 pages).
Mixed PDF — some text pages, some scanned
Partly empty. A PDF that mixes digital and scanned pages extracts text only from the digital ones; scanned pages come back blank. OCR the whole file first so every page gains a text layer, then extract.
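If per-page text is available from whatever extractor is at hand, the blank pages that still need OCR can be spotted programmatically. The sketch below assumes a plain list of strings, one per page; how that list is obtained is outside its scope, and the function name `pages_needing_ocr` is hypothetical.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 1) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is effectively empty.

    `min_chars` is an assumed threshold: pages with fewer non-whitespace
    characters than this are treated as image-only.
    """
    return [
        page_no
        for page_no, text in enumerate(page_texts, start=1)
        if len("".join(text.split())) < min_chars
    ]


if __name__ == "__main__":
    pages = ["Introduction text", "", "  \n", "Summary"]
    print(pages_needing_ocr(pages))  # prints: [2, 3]
```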
Frequently asked questions
Why can't I just convert a scanned PDF to Word directly?
Because a scan is an image — there's no text to extract. You must OCR it first to create a text layer. Run PDF OCR (it outputs a searchable PDF), then run PDF to Word on that to get a .txt for Word.
Is there an OCR option inside the PDF to Word tool?
No. The text converter has no settings at all and no OCR toggle. OCR is a separate tool and a separate pass. This page documents the real two-step workflow.
Which languages does the OCR support?
Ten: English, French, German, Spanish, Italian, Portuguese, Dutch, Russian, Simplified Chinese, and Japanese. Select the document's language in PDF OCR before running — the first use of each downloads ~10 MB of language data, then caches it.
What accuracy can I expect?
Clean, standard-font printed scans recognise at high accuracy — you proofread rather than retype. Faint, skewed, low-DPI, or handwritten pages produce noticeably more errors. Always read the result against the scan, especially numbers.
Are my scans uploaded?
No. Both OCR (Tesseract.js) and text extraction (pdf.js) run in your browser; the scan never leaves your machine. That's important for medical, HR, and archival documents.
Does OCR handle handwriting?
Not reliably. Tesseract is built for printed text. Cursive and most handwriting recognise poorly and need heavy correction — for those, transcription is usually faster than fixing OCR.
What's a searchable PDF and why does OCR make one?
OCR re-emits the scan as a PDF that looks identical but has an invisible text layer drawn over the glyphs, so text can be selected, copied, and indexed. The text converter then reads that layer. You can also keep the searchable PDF as a useful artefact in its own right.
How do I fix OCR mistakes efficiently?
OCR errors are systematic (O↔0, l↔1, rn↔m). Fix them in bulk with Word's Find & Replace, then proofread amounts and dates individually against the scan.
Will scanned tables become Word tables?
No — they OCR to spaced text. For tabular scans, run PDF to Excel on the OCR'd PDF to get CSV, then paste a proper table into Word.
My PDF has both real text and scanned pages — what happens?
Text extraction returns the digital pages' text and leaves scanned pages blank. OCR the whole file first so every page gets a text layer, then extract.
What are the size and page limits?
Each step uses the PDF-family limits: Free 2 MB / 50 pages, Pro 50 MB / 500 pages, Pro Media 500 MB / 2,000 pages. For big archives, split with Extract Pages and recombine in Word.
First OCR run is slow — is something wrong?
No. The first time you use a language, the OCR tool downloads ~10 MB of Tesseract data, and recognition itself is compute-heavy (each page is rendered and analysed in your browser). It speeds up after the model is cached.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.