OCR a Scanned PDF to Make It Searchable — Free Online

How to make a scanned PDF searchable with OCR

Step 1
Open the OCR tool and drop your scanned PDF — Load the image-only PDF into the OCR tool. Parsing and recognition run in your browser via pdf.js and Tesseract.js — nothing is sent to a server.
Step 2
Pick the OCR language — Choose the page's primary language from the dropdown (English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn)). Default is English. The first time you use a language the browser downloads ~10 MB of Tesseract training data and caches it for next time.
Step 3
Run the recognition pass — Each page is rendered to a 2× canvas with a white background, Tesseract recognises the words and their bounding boxes, and a transparent text layer is drawn over the re-embedded page image.
Step 4
Wait for every page to process — OCR is CPU-bound and runs page-by-page in a single pass — a long or dense scan takes noticeably longer than a structural PDF edit. The page image you get back is a JPEG re-render, not the original page stream.
Step 5
Download the searchable PDF — Save the rebuilt PDF. It looks like the original scan but now carries an invisible text layer over every page image.
Step 6
Verify search works — Open the file in any viewer and press Ctrl+F for a word you can see on the page. A match confirms the text layer landed correctly; if a word is misrecognised, search for a nearby distinctive word instead.

What the OCR tool does to each page

The pipeline is fixed — these are the actual processing steps in the browser, in order.

Stage	What happens	Why it matters for you
Render	pdf.js renders the page to a canvas at scale 2 with a white fill behind it	Higher resolution = better recognition; the white background means transparent or near-white scans still produce clean glyphs
Recognise	Tesseract.js runs on the canvas and returns words with pixel bounding boxes	This is where accuracy is decided — driven by scan quality and the language you selected
Re-embed image	The canvas is encoded as JPEG (quality 0.92) and placed as the page background	The output page is a re-rendered JPEG, not your original page bytes — a re-compression step, not a passthrough
Draw text layer	Each recognised word is drawn in Helvetica at opacity 0, scaled to fit its bounding box	This invisible layer is what makes Ctrl+F, selection, and indexing work without altering the visible page
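The "Draw text layer" stage comes down to coordinate math, sketched below under stated assumptions. The bounding box is assumed to be in canvas pixels with a top-left origin, the PDF page is assumed to use points with a bottom-left origin, and the render scale of 2 is the value from the table. The names `placeWord`, `PixelBox`, `Placement`, and `MeasureWidth` are hypothetical, and `measure` stands in for a real font metric call. This is one plausible way to scale a word to fit its box, not necessarily how the tool does it.

```typescript
// Hypothetical names throughout; only the render scale (2) comes from the table above.
interface PixelBox { x0: number; y0: number; x1: number; y1: number } // canvas pixels, origin top-left
interface Placement { x: number; y: number; fontSize: number }        // PDF points, origin bottom-left

// Width of `text` at `fontSize`, in points. Stand-in for a real font metric call.
type MeasureWidth = (text: string, fontSize: number) => number;

function placeWord(
  text: string,
  box: PixelBox,
  pageHeightPt: number,
  renderScale: number,
  measure: MeasureWidth,
): Placement {
  // Undo the render scale to get back from pixels to points.
  const left = box.x0 / renderScale;
  const widthPt = (box.x1 - box.x0) / renderScale;
  const heightPt = (box.y1 - box.y0) / renderScale;

  // Flip the y axis: canvas y grows downward, PDF y grows upward.
  // Using the box bottom as the baseline is an approximation (it ignores descenders).
  const baseline = pageHeightPt - box.y1 / renderScale;

  // Start from the box height, then rescale so the word spans the box width.
  // Assumes text width grows linearly with font size.
  const widthAtHeight = measure(text, heightPt);
  const fontSize = widthAtHeight > 0 ? heightPt * (widthPt / widthAtHeight) : heightPt;

  return { x: left, y: baseline, fontSize };
}
```

Because the word is drawn at opacity 0, small errors in font size or baseline do not show on the page. They only affect how closely the search highlight tracks the scanned word.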

Tier limits that apply to OCR (PDF family)

OCR is governed by the standard PDF size and page limits; there is no separate per-day OCR quota. The values below are taken from the tier table.

Tier	Max file size	Max pages	Files per batch
Free	2 MB	50 pages	1
Pro	50 MB	500 pages	5
Pro + Media	500 MB	2,000 pages	50
Developer	2 GB	10,000 pages	unlimited
Enterprise	unlimited	unlimited	unlimited
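The table above can be turned into a quick pre-flight check. The sketch below encodes the size and page limits as data and finds the lowest tier a file fits into. The function and type names are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption, since the table does not say which definition applies.

```typescript
type Tier = "free" | "pro" | "proMedia" | "developer" | "enterprise";

interface TierLimit { maxBytes: number; maxPages: number; maxBatch: number }

// Assumption: 1 MB = 1024 * 1024 bytes. The tier table does not specify this.
const MB = 1024 * 1024;

const TIER_ORDER: Tier[] = ["free", "pro", "proMedia", "developer", "enterprise"];

// Values copied from the tier table; Infinity stands for "unlimited".
const LIMITS: Record<Tier, TierLimit> = {
  free: { maxBytes: 2 * MB, maxPages: 50, maxBatch: 1 },
  pro: { maxBytes: 50 * MB, maxPages: 500, maxBatch: 5 },
  proMedia: { maxBytes: 500 * MB, maxPages: 2000, maxBatch: 50 },
  developer: { maxBytes: 2048 * MB, maxPages: 10000, maxBatch: Infinity },
  enterprise: { maxBytes: Infinity, maxPages: Infinity, maxBatch: Infinity },
};

// True when a single file is within both the size and the page limit of a tier.
function fitsTier(tier: Tier, bytes: number, pages: number): boolean {
  const limit = LIMITS[tier];
  return bytes <= limit.maxBytes && pages <= limit.maxPages;
}

// Lowest tier that accepts the file, walking the tiers from Free upward.
function smallestTierFor(bytes: number, pages: number): Tier {
  for (const tier of TIER_ORDER) {
    if (fitsTier(tier, bytes, pages)) return tier;
  }
  return "enterprise";
}
```

An 18 MB, 40-page scan passes the Free page limit but not the Free size limit, so `smallestTierFor(18 * MB, 40)` returns `"pro"`.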

Cookbook

Practical ways to turn a dead scan into a searchable document, and how to tell whether it worked.

Confirm a PDF is image-only before OCR

If text is already embedded you do not need OCR at all. The quickest signal is whether selection works in a viewer; programmatically, a plain-text extract that comes back empty means the page is an image.

Before OCR — run PDF to Plain Text on the scan:
  (output is empty or whitespace only)
  -> image-only PDF, OCR is needed

After OCR — run PDF to Plain Text again:
  Invoice #4471
  Date: 2026-03-02
  Amount due: 1,240.00
  -> text layer present, document is searchable
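The same check can be scripted once you have per-page extracted text from any extractor. This is a minimal sketch: `imageOnlyPages` is a hypothetical helper, and the input is assumed to be one string per page in page order.

```typescript
// Returns the 1-based numbers of pages whose extracted text is empty or
// whitespace only, i.e. pages that look image-only and would need OCR.
function imageOnlyPages(pageTexts: string[]): number[] {
  const result: number[] = [];
  pageTexts.forEach((text, index) => {
    if (text.trim().length === 0) result.push(index + 1);
  });
  return result;
}
```

If the result lists only some pages, the document is mixed. Since OCR processes every page of a file, extract the image-only pages first rather than running OCR over the whole document.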

Search a freshly OCR'd contract

The text layer is invisible but fully indexed. Open the downloaded PDF and search for a clause keyword.

Viewer: Ctrl+F "indemnification"
  1 of 3 matches  (page 7)
  -> highlight lands on the scanned word, even though
     the visible pixels are the original scan image

OCR a non-English scan

Select the matching language so Tesseract loads the right model. Latin-script languages (French, German, Spanish, Italian, Portuguese, Dutch) place cleanly into the Helvetica text layer.

Language dropdown: French (fra)
  First run: downloads ~10 MB fra.traineddata
  Recognises: "Conditions générales de vente"
  Searchable in viewer: Ctrl+F "générales" -> match

Make a scan ready for downstream tools

Most extraction tools read embedded text, so they return nothing on a raw scan. OCR first, then chain into a converter.

Step 1  pdf-ocr            -> searchable PDF
Step 2  pdf-to-text         -> /pdf-tools/pdf-to-text
        pdf-table-to-json   -> /pdf-tools/pdf-table-to-json
        pdf-summary-generator -> /pdf-tools/pdf-summary-generator

Keep file size sane on image-heavy scans

Because OCR re-embeds each page as a JPEG, an already-large scan stays large. Compress afterward if you need to email it.

scan.pdf (18 MB, 40 pages)
  -> pdf-ocr -> scan-searchable.pdf (~17 MB)
  -> /pdf-tools/pdf-compress-lossy (target 1 MB)
  note: lossy compression re-rasterises pages and
        DROPS the OCR text layer — compress only if
        you no longer need search

Edge cases and what actually happens

PDF already has selectable text

By design

OCR does not check first — it re-renders and re-recognises every page regardless. If your PDF is already searchable you waste time and degrade the page to a JPEG re-render. Test selection in a viewer (or run PDF to Plain Text) before OCR; only run OCR when the page is genuinely image-only.

Free-tier file over 2 MB or 50 pages

Blocked

Scans are large, so the 2 MB / 50-page free limit is the most common wall. Pro lifts it to 50 MB / 500 pages and Pro + Media to 500 MB / 2,000 pages. Splitting the scan first with PDF Split by Range can keep each chunk under the cap.
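Planning the split is simple arithmetic. The sketch below computes 1-based, inclusive page ranges that each stay within a page cap. `planRanges` and `PageRange` are hypothetical names, and the input format that PDF Split by Range accepts is not covered here. The sketch only handles the page limit, so each resulting chunk still has to be checked against the file-size limit.

```typescript
interface PageRange { from: number; to: number } // 1-based, inclusive

// Splits `totalPages` into consecutive ranges of at most `maxPagesPerChunk` pages.
function planRanges(totalPages: number, maxPagesPerChunk: number): PageRange[] {
  const isWholeNumber = (n: number): boolean => isFinite(n) && Math.floor(n) === n;
  if (!isWholeNumber(maxPagesPerChunk) || maxPagesPerChunk < 1) {
    throw new RangeError("maxPagesPerChunk must be a positive whole number");
  }
  if (!isWholeNumber(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative whole number");
  }

  const ranges: PageRange[] = [];
  for (let from = 1; from <= totalPages; from += maxPagesPerChunk) {
    // The last range is clipped to the end of the document.
    ranges.push({ from, to: Math.min(from + maxPagesPerChunk - 1, totalPages) });
  }
  return ranges;
}
```

For a 120-page scan on the Free tier, `planRanges(120, 50)` yields the ranges 1-50, 51-100, and 101-120.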

First run is slow / appears to hang

Expected

The first OCR for a given language downloads ~10 MB of Tesseract training data from a CDN before recognition starts, and OCR is CPU-bound per page. A large scan can take minutes. Subsequent runs reuse the cached model and are faster.

Output page looks slightly softer than the original

Expected

Each page is re-rendered at 2× and re-encoded as JPEG (quality 0.92), so the visible image is a re-compression of the original, not the untouched page stream. The appearance is very close but not byte-identical.

Non-Latin language recognised but not searchable

Limited

The invisible text layer is drawn in Helvetica with WinAnsi encoding, a Western European Latin character set. Tesseract may recognise Cyrillic (Russian), Chinese, and Japanese text, but those glyphs cannot be encoded into a Helvetica layer, so they may fail to place or be dropped from the searchable layer. The tool is most reliable for Latin-script documents.
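To see why Latin scripts survive and others do not, here is a rough encodability check. It assumes WinAnsi can be approximated by Windows-1252: printable ASCII, the Latin-1 range, and the extra characters Windows-1252 places in 0x80 to 0x9F. `isWinAnsiEncodable` is a hypothetical helper and not part of the tool or of any library.

```typescript
// Code points that Windows-1252 maps into the 0x80-0x9F range
// (euro sign, curly quotes, dashes, and a few accented letters).
const CP1252_EXTRAS: number[] = [
  0x20ac, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021, 0x02c6, 0x2030,
  0x0160, 0x2039, 0x0152, 0x017d, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022,
  0x2013, 0x2014, 0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x017e, 0x0178,
];

// True when every UTF-16 code unit of `word` has a slot in the approximated
// WinAnsi set. Characters outside the BMP appear as surrogates and fail.
function isWinAnsiEncodable(word: string): boolean {
  for (let i = 0; i < word.length; i++) {
    const code = word.charCodeAt(i);
    const printableAscii = code >= 0x20 && code <= 0x7e;
    const latin1 = code >= 0xa0 && code <= 0xff;
    if (!printableAscii && !latin1 && CP1252_EXTRAS.indexOf(code) === -1) {
      return false;
    }
  }
  return true;
}
```

French "générales" passes because é sits in the Latin-1 range, while Russian "Привет" and Japanese "日本語" fail on their first character.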

Handwriting on the page

Poor accuracy

Tesseract is a printed-text engine, so handwritten notes are recognised unreliably. See the handwritten OCR guide for realistic expectations and a manual-review workflow.

Skewed or low-DPI scan

Degraded

Tilted pages and scans below ~300 DPI lower recognition accuracy. The tool has no deskew or DPI control — rescan straight at 300 DPI+ for best results, since OCR works on the image it is given.

Run in a non-browser / Node context

Passthrough

OCR requires a DOM canvas. Outside a browser (e.g. a Node test run) the function returns the input buffer unchanged rather than erroring — so OCR only happens in the live browser tool.
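The passthrough behaviour can be pictured as a small guard. This is a sketch of the pattern, not the tool's actual code: `ocrOrPassthrough`, `Env`, and `OcrFn` are hypothetical names, the environment is passed in explicitly so the guard is easy to test, and the OCR step is shown as synchronous for brevity even though real recognition would be asynchronous.

```typescript
// Minimal view of the global environment: a browser exposes `document`.
interface Env { document?: unknown }

// Stand-in for the real OCR step (synchronous here only to keep the sketch short).
type OcrFn = (input: Uint8Array) => Uint8Array;

function ocrOrPassthrough(input: Uint8Array, ocr: OcrFn, env: Env): Uint8Array {
  // Without a DOM there is no canvas to render pages onto,
  // so return the original bytes unchanged instead of throwing.
  if (typeof env.document === "undefined") return input;
  return ocr(input);
}
```

One practical consequence for tests: in Node, output bytes identical to the input mean OCR was skipped, not that it succeeded without changes.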

Mixed-language document

Single language

The dropdown selects one Tesseract model per run. A page mixing, say, English and Japanese will only recognise the selected language well. Either pick the language with the most text, or split the document by language section and run the tool once per section.

Frequently asked questions

Does OCR change how the PDF looks?

Barely. The text layer is drawn at opacity 0, so it is invisible, and the page still shows the scan. One subtlety: the page image is a fresh 2× render re-encoded as JPEG (quality 0.92), so it is a re-compression of the original rather than the untouched page stream. It is close to identical, but not byte-for-byte.

What OCR accuracy should I expect on a clean scan?

Clean, straight black-on-white scans of printed text at 300 DPI+ recognise very well. Accuracy drops with low resolution, skew, background colour or noise, faint print, and especially handwriting. There is no confidence-threshold setting — every recognised word is placed into the layer as-is.

Can I get the text out as a .txt file?

Not from this tool directly — OCR always outputs a searchable PDF. Once the invisible text layer exists, run the result through PDF to Plain Text for a .txt, PDF to Markdown for Markdown, or PDF Table to JSON for tabular data.

Which languages can it recognise?

Ten, chosen in the OCR-language dropdown: English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn). English is the default. Latin-script languages place cleanly into the searchable layer; Cyrillic, Chinese, and Japanese may be recognised but cannot be encoded into the Helvetica text layer.

Why does the first run take so long?

Tesseract downloads roughly 10 MB of training data for the selected language from a CDN on first use, then caches it in the browser. After that, only the per-page recognition time remains — and OCR is CPU-bound, so dense or multi-page scans still take a while.

Is the searchable text indexed by SharePoint / Google Drive?

Yes. Full-text search in SharePoint, Google Drive, and most document-management systems reads the PDF text layer — which is exactly what this OCR pass adds. Before OCR, those systems index a scan as blank.

Will my document be uploaded?

No. pdf.js, Tesseract.js, and pdf-lib all run in your browser tab. The scan never leaves your device; only an anonymous usage counter is recorded when you are signed in. Apart from that counter, the only network call is the one-time training-data download for each language.

How many pages can I OCR at once?

It depends on tier: Free allows up to 50 pages and 2 MB, Pro up to 500 pages and 50 MB, Pro + Media up to 2,000 pages and 500 MB, and Developer up to 10,000 pages and 2 GB; Enterprise is unlimited. For very large scans, split with PDF Split by Range first.

Can I select which pages to OCR?

No — OCR processes every page of the uploaded file. To OCR only part of a document, extract those pages first with PDF Extract Pages, OCR the extract, then re-merge if needed.

My PDF already has text — should I still OCR it?

No. If Ctrl+F already finds words, the PDF has a text layer and OCR would only re-rasterise the pages and slow you down. Reserve OCR for image-only PDFs (scans, photos, image exports).

Why is the output file not smaller than the input?

OCR re-embeds each page as a JPEG and adds a text layer, so an image-heavy scan stays roughly the same size or slightly larger. To shrink it for email, run Aggressive PDF Compression afterward — but note that lossy compression re-rasterises pages and removes the searchable text layer.

Can I automate OCR in a pipeline?

Yes. Fetch the tool schema from GET /api/v1/tools/pdf-ocr, pair the @jadapps/runner once, then POST the file plus { "lang": "eng" } to 127.0.0.1:9789/v1/tools/pdf-ocr/run. The scan is processed locally by the runner on your machine — it never reaches JAD's servers. A common pipeline is: scan in -> pdf-ocr -> pdf-to-text -> index.
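As a starting point for a pipeline script, the sketch below validates the language code and builds a description of the run request. The host, port, path, and the `lang` option come from the answer above, and the language codes are the ten from the dropdown. The `http://` scheme and the names `buildOcrRunRequest` and `RunRequest` are assumptions. How the file itself is attached to the POST is not specified here, so the sketch leaves it out; check the tool schema for the actual request shape.

```typescript
// The ten language codes offered in the OCR-language dropdown.
const OCR_LANGS: string[] = [
  "eng", "fra", "deu", "spa", "ita", "por", "nld", "rus", "chi_sim", "jpn",
];

interface RunRequest {
  method: "POST";
  url: string;
  options: { lang: string };
}

// Builds a description of the runner call. It does not send anything.
function buildOcrRunRequest(lang: string): RunRequest {
  if (OCR_LANGS.indexOf(lang) === -1) {
    throw new Error(`Unsupported OCR language: ${lang}`);
  }
  return {
    method: "POST",
    // Assumption: the local runner is reached over plain HTTP.
    url: "http://127.0.0.1:9789/v1/tools/pdf-ocr/run",
    options: { lang },
  };
}
```

Rejecting an unknown language code before the call keeps a typo from turning into a failed or wrong-language OCR run further down the pipeline.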

Privacy first

All PDF processing runs locally in your browser using pdf-lib and pdf.js, with Tesseract.js handling recognition. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.
