Extract Text from a Scanned PDF Using OCR

How to extract text from a scanned PDF using OCR

Step 1
Drop the scanned PDF into the OCR tool: Load the image-only PDF. pdf.js and Tesseract.js run in your browser, so the document is never uploaded.
Step 2
Select the document's language: Pick the matching language from the dropdown so Tesseract loads the right model. The options are English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn). English is the default. The first use of a language downloads ~10 MB of training data, which is then cached.
Step 3
Run OCR to build the text layer: Each page is rendered, recognised, and rebuilt with an invisible Helvetica text layer over the re-embedded page image.
Step 4
Download the searchable PDF: Save the result. It now contains extractable text, even though it still displays as a scan.
Step 5
Extract the text with a converter: Run the OCR'd PDF through PDF to Plain Text for .txt, PDF to Markdown for Markdown, or PDF Table to JSON for tables.
Step 6
Proofread the extracted output: OCR is never perfect on real-world scans. Spot-check numbers, names, and any text that drives a downstream decision before relying on it.

Why raw extraction fails on a scan, and how OCR fixes it

Extraction tools read embedded text; a scan has none until OCR adds it.

Input	PDF to Plain Text result	After OCR
Image-only scan (no text layer)	Empty / whitespace only	Full recognised text per page
Photo of a document saved as PDF	Empty	Recognised text (accuracy depends on lighting/focus)
Born-digital PDF (already has text)	Full text already	No OCR needed — skip it
Mixed: some pages text, some scanned	Only the born-digital pages	OCR fills in the scanned pages' text

Pick the extraction tool for your output

OCR produces a searchable PDF; chain into one of these for the actual extracted format.

You want	Run after OCR	Output type
Plain .txt	PDF to Plain Text	text
Markdown with page headers	PDF to Markdown	markdown
Tables as JSON objects	PDF Table to JSON	json
Tables as CSV for a spreadsheet	PDF to Excel	CSV text
RAG-ready overlapping chunks	PDF to Text Chunks	JSON chunks

Cookbook

End-to-end recipes for getting clean text out of a scanned PDF.

Scan to plain text in two steps

OCR adds the layer; PDF to Plain Text pulls it out. This is the canonical scan-to-text flow.

Step 1  pdf-ocr (lang: eng)   -> scan-searchable.pdf
Step 2  /pdf-tools/pdf-to-text ->

  ACME LTD
  Statement of Account
  Balance carried forward: 4,210.55

Prove the extraction worked

Run PDF to Plain Text on the raw scan first (empty), then on the OCR output (text). The difference confirms that OCR added a text layer.

Raw scan  -> PDF to Plain Text -> ""  (nothing)
OCR'd     -> PDF to Plain Text ->
  Page 1
  Purchase Order 8841
  Qty  Item        Price
  ...
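
The before/after comparison can also be done on the two extracted strings once you have saved them. A minimal sketch, assuming you have both outputs as text; the function names and the 20-character threshold are illustrative choices, not part of the tool:

```python
def visible_chars(text: str) -> int:
    """Count characters that are not whitespace."""
    return len("".join(text.split()))


def ocr_landed(raw_text: str, ocr_text: str, min_chars: int = 20) -> bool:
    """True when the raw scan extracted to nothing and the OCR'd copy did not."""
    return visible_chars(raw_text) == 0 and visible_chars(ocr_text) >= min_chars


raw_scan_text = " \n \n"  # an image-only scan extracts to empty or whitespace
ocr_text = "Page 1\nPurchase Order 8841\nQty  Item        Price\n"
print(ocr_landed(raw_scan_text, ocr_text))
```

If the raw scan already returns text, the file is born-digital and the check returns False, which matches the "skip OCR" row in the table above.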

Extract a scanned table

OCR recognises the table text; PDF Table to JSON groups it into rows and columns by position. Verify the column split, since OCR spacing can shift cell boundaries.

pdf-ocr -> /pdf-tools/pdf-table-to-json ->
[
  { "Item": "Widget", "Qty": "12", "Price": "3.50" },
  { "Item": "Gadget", "Qty": "4",  "Price": "9.00" }
]
(check that Qty/Price didn't merge on tight columns)
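
One way to automate that check is to look for rows whose numeric cells no longer parse as a single number, which is what a merged or split cell usually looks like. This is a sketch over the JSON shape shown above; `suspicious_rows` and the choice of which columns count as numeric are assumptions for illustration:

```python
from decimal import Decimal, InvalidOperation


def suspicious_rows(rows, numeric_columns):
    """Return (row_index, reason) pairs for rows that look merged or split."""
    if not rows:
        return []
    expected_columns = set(rows[0].keys())
    problems = []
    for index, row in enumerate(rows):
        if set(row.keys()) != expected_columns:
            problems.append((index, "column set differs from the first row"))
            continue
        for column in numeric_columns:
            try:
                # A merged cell such as "7 2.25" or an empty cell fails to parse.
                Decimal(row[column].strip())
            except (InvalidOperation, KeyError):
                problems.append((index, f"{column} is not a single number"))
    return problems


rows = [
    {"Item": "Widget", "Qty": "12", "Price": "3.50"},
    {"Item": "Gadget", "Qty": "4", "Price": "9.00"},
    {"Item": "Gizmo", "Qty": "7 2.25", "Price": ""},  # Qty and Price merged
]
for index, reason in suspicious_rows(rows, ["Qty", "Price"]):
    print(index, reason)
```

It only catches merges that break a number, so a visual check of the first few rows is still worthwhile.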

Non-English text extraction

Choose the matching language. Latin-script languages extract reliably through the text layer.

Language: German (deu)
  recognises: "Rechnungsbetrag: 1.299,00 EUR"
  pdf-to-text output preserves the recognised words
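
The extracted words keep the source document's number format, so a German amount still needs converting before you do arithmetic on it. A small sketch, assuming the German convention of "." for thousands and "," for decimals; `parse_german_amount` is a hypothetical helper, not something the tools provide:

```python
import re
from decimal import Decimal

# Either grouped digits (1.299) or plain digits (1299), then a comma and two decimals.
AMOUNT = re.compile(r"(?<!\d)(?:\d{1,3}(?:\.\d{3})+|\d+),\d{2}(?!\d)")


def parse_german_amount(text: str):
    """Return the first German-formatted amount in text as a Decimal, or None."""
    match = AMOUNT.search(text)
    if match is None:
        return None
    normalised = match.group().replace(".", "").replace(",", ".")
    return Decimal(normalised)


print(parse_german_amount("Rechnungsbetrag: 1.299,00 EUR"))
```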

Chunk a scanned report for an LLM

After OCR, send the searchable PDF to the chunker to get overlapping, sentence-aware chunks for retrieval.

pdf-ocr -> /pdf-tools/pdf-to-chunks (targetTokens ~500) ->
[
  { "text": "...", "pages": [1,2], "tokensEst": 498 },
  { "text": "...", "pages": [2,3], "tokensEst": 503 }
]
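
Because the chunks overlap, a page can appear in more than one of them. The sketch below reads the sample output above and looks up the chunks that cover a given page; it relies only on the `text`, `pages`, and `tokensEst` fields shown, and the helper names are illustrative:

```python
import json

chunks = json.loads("""
[
  { "text": "...", "pages": [1, 2], "tokensEst": 498 },
  { "text": "...", "pages": [2, 3], "tokensEst": 503 }
]
""")


def chunks_for_page(chunks, page):
    """Return every chunk whose page list includes the given page."""
    return [chunk for chunk in chunks if page in chunk["pages"]]


def total_tokens(chunks):
    """Sum the estimated token counts across all chunks."""
    return sum(chunk["tokensEst"] for chunk in chunks)


print(len(chunks_for_page(chunks, 2)), total_tokens(chunks))
```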

Edge cases and what actually happens

Extraction still returns nothing after OCR

Check input

If PDF to Plain Text is still empty after OCR, recognition most likely failed. The usual causes are a non-Latin script that cannot be encoded into the Helvetica text layer, or a scan too low in quality to recognise. Confirm by trying to select text in a PDF viewer, then re-scan at a higher DPI or use a desktop OCR engine for that script.

Recognised text has the wrong characters

OCR error

Misreads (rn->m, 0->O, 1->l) are inherent to OCR and there is no confidence filter here — every word is placed as recognised. Always proofread extracted numbers and identifiers before using them.
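
Since nothing is filtered for you, a quick pass that flags tokens mixing digits with look-alike letters can point the proofreading at the riskiest spots. This is a heuristic sketch, not a feature of the tool: the look-alike set is an assumption, it will flag legitimate codes such as "B2B", and it cannot detect letter-for-letter misreads like rn->m.

```python
import re

# Letters that OCR commonly confuses with digits (0/O, 1/l/I, 5/S, 8/B).
LOOKALIKE_LETTERS = set("OoIlSB")


def flag_suspect_tokens(text: str) -> list[str]:
    """Return alphanumeric tokens that contain both a digit and a look-alike letter."""
    suspects = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_digit = any(ch.isdigit() for ch in token)
        has_lookalike = any(ch in LOOKALIKE_LETTERS for ch in token)
        if has_digit and has_lookalike:
            suspects.append(token)
    return suspects


print(flag_suspect_tokens("Purchase Order 88O1 Qty l2 Price 3.50"))
```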

Table columns merge or shift in extraction

Layout limit

OCR positions words by bounding box; PDF Table to JSON groups by Y for rows and X for columns. Tight columns or uneven OCR spacing can merge or split cells. Verify the structure and widen columns at scan time if possible.
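
To see why tight columns merge, here is a simplified sketch of position-based grouping: words are bucketed into rows by Y, then neighbouring words are joined into one cell when the horizontal gap between them is small. It illustrates the general technique only; the `Word` type, the tolerances, and the gap rule are assumptions, not the tool's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float      # left edge of the bounding box
    y: float      # top edge of the bounding box
    width: float


def group_into_rows(words, y_tolerance=3.0):
    """Bucket words into rows whose Y is within y_tolerance of the row's first word."""
    rows = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][0].y - word.y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=lambda w: w.x) for row in rows]


def split_into_cells(row, min_gap=12.0):
    """Join neighbouring words into one cell when the gap between them is below min_gap."""
    cells = []
    previous_right = None
    for word in row:
        if previous_right is not None and word.x - previous_right < min_gap:
            cells[-1] += " " + word.text
        else:
            cells.append(word.text)
        previous_right = word.x + word.width
    return cells


words = [
    Word("Widget", 10, 100, 40), Word("12", 120, 101, 12), Word("3.50", 200, 100, 24),
    Word("Gadget", 10, 120, 40), Word("4", 120, 120, 6), Word("9.00", 132, 121, 24),
]
for row in group_into_rows(words):
    print(split_into_cells(row))
```

In the second row the price sits only 6 units after the quantity, so the two land in one cell ("4 9.00"), which is the merge described above.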

Free-tier scan over 2 MB or 50 pages

Blocked

Scans reach the 2 MB / 50-page free cap quickly. Pro raises it to 50 MB / 500 pages. Alternatively, split the file with PDF Split by Range, OCR and extract each part, then concatenate the text.
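
If you take the split route, the page ranges and the final join are simple to plan. A sketch assuming the 50-page cap is inclusive and that you have each part's extracted text as a string; the function names are illustrative, and a part can still exceed the 2 MB size cap, in which case use smaller ranges:

```python
def page_ranges(total_pages: int, max_pages: int = 50) -> list[tuple[int, int]]:
    """Split 1..total_pages into consecutive inclusive ranges of at most max_pages."""
    if total_pages < 1 or max_pages < 1:
        raise ValueError("total_pages and max_pages must be positive")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


def join_parts(part_texts: list[str]) -> str:
    """Concatenate per-part extracted text in order, one blank line between parts."""
    return "\n\n".join(text.strip() for text in part_texts)


print(page_ranges(120))  # [(1, 50), (51, 100), (101, 120)]
```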

Only some pages are scanned

Partial

OCR re-renders and recognises every page, including born-digital ones, re-rasterising the text pages into images in the process. If only a few pages are scanned, extract just those with PDF Extract Pages, OCR them, and keep the original text pages as-is.
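
To find which pages need OCR, check which pages extract to nothing. A sketch assuming you already have the text of each page as a list of strings, in page order; how you obtain that list depends on your extractor and is not shown here:

```python
def scanned_pages(page_texts: list[str]) -> list[int]:
    """Return 1-based page numbers whose extracted text is empty or whitespace only."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]


page_texts = ["Introduction ...", "", "  \n", "Appendix A ..."]
print(scanned_pages(page_texts))  # [2, 3]
```

The resulting page numbers are the ones to pull out with PDF Extract Pages before running OCR.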

Cyrillic / CJK extraction empty

Limited

Russian, Chinese, and Japanese can be recognised by Tesseract but cannot be drawn into the Helvetica (WinAnsi) text layer, so extraction yields little or nothing for those scripts. Use a desktop OCR tool with a Unicode-capable text layer for non-Latin documents.
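
You can predict this limit for a piece of text by testing whether its characters fit a WinAnsi-style encoding. The sketch uses Python's `cp1252` codec as a close stand-in for WinAnsi, which is an approximation rather than an exact match; `unencodable_chars` is a hypothetical helper:

```python
def unencodable_chars(text: str, encoding: str = "cp1252") -> list[str]:
    """Return the distinct characters in text that the encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


print(unencodable_chars("Rechnungsbetrag: 1.299,00 EUR"))  # []
print(unencodable_chars("Счёт 42"))  # the four Cyrillic letters
```

Accented Latin text passes, while Cyrillic and CJK characters are reported as unencodable, which mirrors the behaviour described above.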

First language load delays the result

Expected

The ~10 MB training-data download happens once per language before recognition begins. Subsequent extractions in the same browser reuse the cache.

Handwritten content in the scan

Poor accuracy

Tesseract targets printed text; handwriting extracts unreliably. For handwritten forms and notes, see the handwritten OCR guide and plan on manual transcription review.

Run outside a browser

Passthrough

OCR needs a canvas, so in a non-browser context the function returns the buffer unchanged and no text is recognised. Use the live browser tool.

Frequently asked questions

Does this tool output a text file?

No — OCR outputs a searchable PDF. To get a .txt, run the OCR'd PDF through PDF to Plain Text. For Markdown use PDF to Markdown; for tables use PDF Table to JSON or PDF to Excel.

Why does my scanned PDF extract to nothing without OCR?

Because a scan is a page image with no embedded character data. The extractor reads embedded text, finds none, and returns empty. OCR recognises the pixels into actual characters in a text layer, which the extractor can then read.

Which language should I pick?

The document's primary language, from the dropdown: English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn). English is the default. Picking the right model meaningfully improves recognition, especially for accented Latin text. Each language downloads ~10 MB of data on first use, then caches.

How accurate is the extracted text?

Clean 300 DPI+ printed scans extract well; quality falls with low resolution, skew, noise, or faint print. There is no confidence threshold — every recognised word is included verbatim, so proofread anything that matters (totals, IDs, names).
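
To judge whether an image clears the 300 DPI mark, divide its pixel width by the physical width of the page in inches. A minimal sketch, assuming the page fills the full width of the image; the function name is illustrative:

```python
MM_PER_INCH = 25.4


def effective_dpi(pixels_across: int, paper_width_mm: float) -> float:
    """Pixels per inch when a page of the given width spans pixels_across pixels."""
    return pixels_across / (paper_width_mm / MM_PER_INCH)


# An A4 page is 210 mm wide.
print(round(effective_dpi(2480, 210)))  # 300
print(round(effective_dpi(1240, 210)))  # 150
```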

Can it extract text from a phone photo of a document?

Yes, if the photo is sharp and well-lit. Save it as a PDF, OCR it, then extract. Glare, shadows, motion blur, and perspective skew all reduce accuracy — the tool has no deskew or perspective correction.

Will it extract tables correctly?

OCR recognises the table's text; the row/column structure is reconstructed by PDF Table to JSON using word positions. It works for clean grids but can merge or split cells on tight or irregular tables — verify the output.

Can I extract from a multi-language scan?

The OCR pass uses one language model at a time. For a document mixing scripts, pick the dominant language, or split it by section with PDF Split by Range and OCR each part with the appropriate language.

Is the scanned content uploaded for extraction?

No. Recognition (Tesseract.js) and extraction (pdf.js) both run in your browser. The document and its text never leave your device — only an anonymous usage count is recorded when signed in.

How big a scan can I process?

By tier: Free 2 MB / 50 pages, Pro 50 MB / 500 pages, Pro+Media 500 MB / 2,000 pages, Developer 2 GB / 10,000 pages, Enterprise unlimited. Split larger files first with PDF Split by Range.
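
The limits above translate into a simple lookup. This sketch picks the smallest tier that fits a file, treating both caps as inclusive and 2 GB as 2048 MB, which are assumptions; `smallest_tier` is an illustrative helper, not part of the product:

```python
# (tier name, max size in MB, max pages), in ascending order.
TIERS = [
    ("Free", 2, 50),
    ("Pro", 50, 500),
    ("Pro+Media", 500, 2000),
    ("Developer", 2048, 10000),  # 2 GB taken as 2048 MB (assumption)
]


def smallest_tier(size_mb: float, pages: int) -> str:
    """Return the first tier whose size and page caps both cover the file."""
    for name, max_mb, max_pages in TIERS:
        if size_mb <= max_mb and pages <= max_pages:
            return name
    return "Enterprise"


print(smallest_tier(8, 40))   # Pro
print(smallest_tier(8, 600))  # Pro+Media
```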

Why is OCR slower than the other PDF tools?

Most PDF tools edit structure; OCR renders every page to an image and runs a recognition engine on each one, which is CPU-bound. Add the one-time training-data download and the first run feels slow. Later runs reuse the cache.

Can I get just the text without the PDF?

Not in a single step. OCR always yields a PDF first; then a one-click pass through PDF to Plain Text gives you the raw text. Chaining the two tools is the supported workflow.

Can I script scan-to-text extraction?

Yes. Pair the @jadapps/runner, then POST the scan with { "lang": "eng" } to 127.0.0.1:9789/v1/tools/pdf-ocr/run (schema at GET /api/v1/tools/pdf-ocr), then send the result to the pdf-to-text tool. Everything runs locally via the runner — no document leaves your machine.
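
A sketch of building that request in Python follows. The host, port, path, and the `lang` option come from the answer above; the `http` scheme, the JSON body layout, and the `file_b64` field are assumptions for illustration, so check the schema endpoint for the real request shape before relying on it. The code only constructs the request and does not send it.

```python
import base64
import json
import urllib.request

# Host, port and path as given in this guide; the http scheme is assumed.
RUNNER_URL = "http://127.0.0.1:9789/v1/tools/pdf-ocr/run"


def build_ocr_request(pdf_bytes: bytes, lang: str = "eng") -> urllib.request.Request:
    """Build (but do not send) a POST request for the local pdf-ocr tool."""
    body = {
        "lang": lang,
        # Hypothetical field: how the runner expects the file is not documented here.
        "file_b64": base64.b64encode(pdf_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        RUNNER_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ocr_request(b"%PDF-1.7 sample bytes", lang="deu")
print(request.get_method(), request.full_url)
```

Once the runner is paired and the body shape is confirmed, the request can be sent with `urllib.request.urlopen(request)` and the result passed on to the pdf-to-text tool.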

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.
