How to extract text from a scanned PDF using OCR
- Step 1: Drop the scanned PDF into the OCR tool. Load the image-only PDF. pdf.js and Tesseract.js run in your browser; the document is never uploaded.
- Step 2: Select the document's language. Pick the matching language from the dropdown (English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn)) so Tesseract loads the right model. English is the default; the first use of a language downloads ~10 MB of training data, then caches it.
- Step 3: Run OCR to build the text layer. Each page is rendered, recognised, and rebuilt with an invisible Helvetica text layer over the re-embedded page image.
- Step 4: Download the searchable PDF. Save the result. It now contains extractable text, even though it still displays as a scan.
- Step 5: Extract the text with a converter. Run the OCR'd PDF through PDF to Plain Text for .txt, PDF to Markdown for Markdown, or PDF Table to JSON for tables.
- Step 6: Proofread the extracted output. OCR is never perfect on real-world scans. Spot-check numbers, names, and any text that drove a downstream decision before relying on it.
Why raw extraction fails on a scan, and how OCR fixes it
Extraction tools read embedded text; a scan has none until OCR adds it.
| Input | PDF to Plain Text result | After OCR |
|---|---|---|
| Image-only scan (no text layer) | Empty / whitespace only | Full recognised text per page |
| Photo of a document saved as PDF | Empty | Recognised text (accuracy depends on lighting/focus) |
| Born-digital PDF (already has text) | Full text already | No OCR needed — skip it |
| Mixed: some pages text, some scanned | Only the born-digital pages | OCR fills in the scanned pages' text |
Pick the extraction tool for your output
OCR produces a searchable PDF; chain into one of these for the actual extracted format.
| You want | Run after OCR | Output type |
|---|---|---|
| Plain .txt | PDF to Plain Text | text |
| Markdown with page headers | PDF to Markdown | markdown |
| Tables as JSON objects | PDF Table to JSON | json |
| Tables as CSV for a spreadsheet | PDF to Excel | CSV text |
| RAG-ready overlapping chunks | PDF to Text Chunks | JSON chunks |
Cookbook
End-to-end recipes for getting clean text out of a scanned PDF.
Scan to plain text in two steps
OCR adds the layer; PDF to Plain Text pulls it out. This is the canonical scan-to-text flow.
Step 1  pdf-ocr (lang: eng) -> scan-searchable.pdf
Step 2  /pdf-tools/pdf-to-text ->
ACME LTD
Statement of Account
Balance carried forward: 4,210.55
Prove the extraction worked
Run PDF to Plain Text on the raw scan first (empty), then on the OCR output (text). The difference confirms OCR landed.
Raw scan -> PDF to Plain Text -> "" (nothing)
OCR'd    -> PDF to Plain Text -> Page 1 Purchase Order 8841 Qty Item Price ...
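The same before/after comparison can be automated once both outputs are saved as strings. A minimal sketch, assuming you already have the two PDF to Plain Text results in hand; the `ocr_landed` helper and its `min_chars` threshold are hypothetical and not part of any tool described here:

```python
def ocr_landed(raw_text: str, ocr_text: str, min_chars: int = 20) -> bool:
    """Return True when the raw scan extracted to nothing but the OCR'd file has text.

    min_chars is an assumed threshold to ignore stray characters; tune it to taste.
    """
    raw_is_empty = raw_text.strip() == ""
    ocr_has_text = len(ocr_text.strip()) >= min_chars
    return raw_is_empty and ocr_has_text


if __name__ == "__main__":
    print(ocr_landed("", "Page 1 Purchase Order 8841 Qty Item Price"))
```

If this returns False because the raw text was not empty, the file already had a text layer and OCR was probably unnecessary.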
Extract a scanned table
OCR recognises the table text; PDF Table to JSON groups it into rows and columns by position. Verify the column split, since OCR spacing can shift cell boundaries.
pdf-ocr -> /pdf-tools/pdf-table-to-json ->
[
{ "Item": "Widget", "Qty": "12", "Price": "3.50" },
{ "Item": "Gadget", "Qty": "4", "Price": "9.00" }
]
(check that Qty/Price didn't merge on tight columns)
Non-English text extraction
Choose the matching language. Latin-script languages extract reliably through the text layer.
Language: German (deu)
recognises: "Rechnungsbetrag: 1.299,00 EUR"
pdf-to-text output preserves the recognised words
Chunk a scanned report for an LLM
After OCR, send the searchable PDF to the chunker to get overlapping, sentence-aware chunks for retrieval.
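To picture what "overlapping, sentence-aware" means, here is a rough sketch. It is not the chunker's actual algorithm: the function name, the one-token-per-word estimate, and the one-sentence overlap are all assumptions made for illustration.

```python
import re


def chunk_sentences(text: str, target_tokens: int = 500, overlap_sentences: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly target_tokens words."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        # Assumed token estimate: one token per whitespace-separated word.
        size = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and size > target_tokens:
            chunks.append(" ".join(current))
            # Carry the trailing sentence(s) forward so neighbouring chunks overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
    for chunk in chunk_sentences(sample, target_tokens=6):
        print(chunk)
```

Because chunks never cut a sentence in half, OCR misreads stay confined to the sentence they occur in rather than being split across two retrieval units. The tool's own output for the same kind of input looks like this: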
pdf-ocr -> /pdf-tools/pdf-to-chunks (targetTokens ~500) ->
[
{ "text": "...", "pages": [1,2], "tokensEst": 498 },
{ "text": "...", "pages": [2,3], "tokensEst": 503 }
]
Edge cases and what actually happens
Extraction still returns nothing after OCR
Check input: If PDF to Plain Text is empty even after OCR, recognition likely failed. The usual causes are a non-Latin script that cannot be encoded into the Helvetica text layer, or a scan too low-quality to recognise. Confirm by trying to select text in a viewer; re-scan at higher DPI or use a desktop OCR engine for that script.
Recognised text has the wrong characters
OCR error: Misreads (rn->m, 0->O, 1->l) are inherent to OCR, and there is no confidence filter here: every word is placed as recognised. Always proofread extracted numbers and identifiers before using them.
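Proofreading can be narrowed down with a simple heuristic. The sketch below is an assumption-laden aid, not a feature of the tool: the hypothetical `suspicious_tokens` helper flags tokens that mix digits with look-alike letters (O, o, l, I), which is where 0->O and 1->l misreads tend to hide. It cannot catch rn->m style errors, since those produce ordinary-looking words.

```python
import re

# Letters that OCR commonly confuses with the digits 0 and 1.
LOOKALIKE_LETTERS = set("OolI")


def suspicious_tokens(text: str) -> list[str]:
    """Return whitespace-separated tokens that contain both a digit and a look-alike letter."""
    flagged = []
    for token in re.findall(r"\S+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_lookalike = any(ch in LOOKALIKE_LETTERS for ch in token)
        if has_digit and has_lookalike:
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    print(suspicious_tokens("Order 884l total 4,2O0.55 paid"))
```

Expect false positives on legitimate mixed identifiers (part numbers, postcodes); the point is to shorten the list of things to check by eye, not to replace the check.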
Table columns merge or shift in extraction
Layout limit: OCR positions words by bounding box; PDF Table to JSON groups by Y for rows and X for columns. Tight columns or uneven OCR spacing can merge or split cells. Verify the structure and widen columns at scan time if possible.
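The grouping idea can be illustrated with a few lines of code. This is a simplified sketch of position-based grouping in general, not the tool's implementation: the `(text, x, y)` tuple shape, the `words_to_rows` name, and the `row_tolerance` value are all assumptions.

```python
def words_to_rows(words, row_tolerance=5.0):
    """Group (text, x, y) word tuples into rows by y, then order each row by x.

    A word joins the current row when its y is within row_tolerance of the
    row's first word; otherwise it starts a new row.
    """
    rows = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if rows and abs(rows[-1]["y"] - y) <= row_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Sort each row's cells left to right and drop the coordinates.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


if __name__ == "__main__":
    words = [
        ("Widget", 10, 100), ("12", 120, 101), ("3.50", 200, 99),
        ("Gadget", 10, 120), ("4", 120, 121), ("9.00", 200, 120),
    ]
    for row in words_to_rows(words):
        print(row)
```

The sketch shows why jitter matters: the three words of the first row sit at y = 99, 100, and 101, and only the tolerance keeps them together. A skewed scan pushes that drift past any reasonable tolerance, which is when rows split or merge.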
Free-tier scan over 2 MB or 50 pages
Blocked: Scans hit the 2 MB / 50-page free cap quickly. Pro raises it to 50 MB / 500 pages. Alternatively, split the file with PDF Split by Range, OCR and extract each part, then concatenate the text.
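Working out the ranges to split on is simple arithmetic. A minimal sketch, assuming 1-based inclusive page ranges; the `split_ranges` helper is hypothetical, and it only handles the page cap, so a part can still exceed the 2 MB size cap if the scans are heavy:

```python
def split_ranges(total_pages: int, max_pages: int = 50) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges of at most max_pages each."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


if __name__ == "__main__":
    for first, last in split_ranges(120):
        print(f"{first}-{last}")
```

After OCR and extraction, joining the per-part text in range order (for example with `"\n".join(parts)`) restores the original page sequence.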
Only some pages are scanned
Partial: OCR re-renders and recognises every page, including born-digital ones, re-rasterising the text pages into images in the process. If only a few pages are scanned, extract just those with PDF Extract Pages, OCR them, and keep the original text pages as-is.
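Finding which pages are scanned comes down to checking which pages extract to nothing. A minimal sketch, assuming you have the per-page text from a prior extraction as a list of strings; the `pages_needing_ocr` name and the `min_chars` cutoff are hypothetical:

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 10) -> list[int]:
    """Return 1-based page numbers whose extracted text is empty or nearly so."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]


if __name__ == "__main__":
    pages = ["Full born-digital text here", "", "  \n", "More embedded text on page 4"]
    print(pages_needing_ocr(pages))
```

The resulting page numbers are the ones to pass to PDF Extract Pages before running OCR.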
Cyrillic / CJK extraction empty
Limited: Russian, Chinese, and Japanese can be recognised by Tesseract but cannot be drawn into the Helvetica (WinAnsi) text layer, so extraction yields little or nothing for those scripts. Use a desktop OCR tool with a Unicode-capable text layer for non-Latin documents.
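The encoding limit is easy to demonstrate. The sketch below uses Python's `cp1252` codec as a stand-in for WinAnsi; the two are closely related but not guaranteed identical, so treat this as an approximation. The `fits_winansi` helper is hypothetical.

```python
def fits_winansi(text: str) -> bool:
    """Approximate check: can every character be encoded in Windows-1252?"""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for sample in ["Rechnungsbetrag: 1.299,00 EUR", "Счёт", "請求書"]:
        print(sample, fits_winansi(sample))
```

Accented Latin text passes, while Cyrillic and CJK strings fail on their first character, which matches the behaviour described above.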
First language load delays the result
Expected: The ~10 MB training-data download happens once per language before recognition begins. Subsequent extractions in the same browser reuse the cache.
Handwritten content in the scan
Poor accuracy: Tesseract targets printed text; handwriting extracts unreliably. For handwritten forms and notes, see the handwritten OCR guide and plan on manual transcription review.
Run outside a browser
Passthrough: OCR needs a canvas, so in a non-browser context the function returns the buffer unchanged and no text is recognised. Use the live browser tool.
Frequently asked questions
Does this tool output a text file?
No — OCR outputs a searchable PDF. To get a .txt, run the OCR'd PDF through PDF to Plain Text. For Markdown use PDF to Markdown; for tables use PDF Table to JSON or PDF to Excel.
Why does my scanned PDF extract to nothing without OCR?
Because a scan is a page image with no embedded character data. The extractor reads embedded text, finds none, and returns empty. OCR recognises the pixels into actual characters in a text layer, which the extractor can then read.
Which language should I pick?
The document's primary language, from the dropdown: English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn). English is the default. Picking the right model meaningfully improves recognition, especially for accented Latin text. Each language downloads ~10 MB of data on first use, then caches.
How accurate is the extracted text?
Clean 300 DPI+ printed scans extract well; quality falls with low resolution, skew, noise, or faint print. There is no confidence threshold — every recognised word is included verbatim, so proofread anything that matters (totals, IDs, names).
Can it extract text from a phone photo of a document?
Yes, if the photo is sharp and well-lit. Save it as a PDF, OCR it, then extract. Glare, shadows, motion blur, and perspective skew all reduce accuracy — the tool has no deskew or perspective correction.
Will it extract tables correctly?
OCR recognises the table's text; the row/column structure is reconstructed by PDF Table to JSON using word positions. It works for clean grids but can merge or split cells on tight or irregular tables — verify the output.
Can I extract from a multi-language scan?
The OCR pass uses one language model at a time. For a document mixing scripts, pick the dominant language, or split it by section with PDF Split by Range and OCR each part with the appropriate language.
Is the scanned content uploaded for extraction?
No. Recognition (Tesseract.js) and extraction (pdf.js) both run in your browser. The document and its text never leave your device — only an anonymous usage count is recorded when signed in.
How big a scan can I process?
By tier: Free 2 MB / 50 pages, Pro 50 MB / 500 pages, Pro+Media 500 MB / 2,000 pages, Developer 2 GB / 10,000 pages, Enterprise unlimited. Split larger files first with PDF Split by Range.
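The tier table can be expressed as a small lookup. A sketch only: the limits are copied from the answer above, 2 GB is assumed to mean 2,048 MB, and the `smallest_tier` helper is hypothetical rather than anything the service exposes.

```python
# (max size in MB, max pages) per tier, in ascending order.
TIER_LIMITS = {
    "Free": (2, 50),
    "Pro": (50, 500),
    "Pro+Media": (500, 2000),
    "Developer": (2048, 10000),  # assumes 2 GB = 2,048 MB
}


def smallest_tier(size_mb: float, pages: int) -> str:
    """Return the first tier whose size and page limits both cover the file."""
    for name, (max_mb, max_pages) in TIER_LIMITS.items():
        if size_mb <= max_mb and pages <= max_pages:
            return name
    return "Enterprise"  # listed as unlimited


if __name__ == "__main__":
    print(smallest_tier(3, 30))
```

Note that both limits apply at once: a 40 MB file with 600 pages fits Pro by size but needs Pro+Media by page count.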
Why is OCR slower than the other PDF tools?
Most PDF tools edit structure; OCR renders every page to an image and runs a recognition engine on each one, which is CPU-bound. Add the one-time training-data download and the first run feels slow. Later runs reuse the cache.
Can I get just the text without the PDF?
Not in a single step. OCR always yields a PDF first; then a one-click pass through PDF to Plain Text gives you the raw text. Chaining the two tools is the supported workflow.
Can I script scan-to-text extraction?
Yes. Pair the @jadapps/runner, then POST the scan with { "lang": "eng" } to 127.0.0.1:9789/v1/tools/pdf-ocr/run (schema at GET /api/v1/tools/pdf-ocr), then send the result to the pdf-to-text tool. Everything runs locally via the runner — no document leaves your machine.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.