How to make a scanned PDF searchable with OCR
- Step 1: Open the OCR tool and drop your scanned PDF. Load the image-only PDF into the OCR tool. Parsing and recognition run in your browser via pdf.js and Tesseract.js; nothing is sent to a server.
- Step 2: Pick the OCR language. Choose the page's primary language from the dropdown: English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), or Japanese (jpn). The default is English. The first time you use a language, the browser downloads ~10 MB of Tesseract training data and caches it for next time.
- Step 3: Run the recognition pass. Each page is rendered to a 2× canvas with a white background, Tesseract recognises the words and their bounding boxes, and a transparent text layer is drawn over the re-embedded page image.
- Step 4: Wait for every page to process. OCR is CPU-bound and runs page by page in a single pass, so a long or dense scan takes noticeably longer than a structural PDF edit. The page image you get back is a JPEG re-render, not the original page stream.
- Step 5: Download the searchable PDF. Save the rebuilt PDF. It looks like the original scan but now carries an invisible text layer on every page.
- Step 6: Verify search works. Open the file in any viewer and press Ctrl+F for a word you can see on the page. A match confirms the text layer landed correctly; if a word is misrecognised, search for a nearby distinctive word instead.
What the OCR tool does to each page
The pipeline is fixed — these are the actual processing steps in the browser, in order.
| Stage | What happens | Why it matters for you |
|---|---|---|
| Render | pdf.js renders the page to a canvas at scale 2 with a white fill behind it | Higher resolution = better recognition; the white background means transparent or near-white scans still produce clean glyphs |
| Recognise | Tesseract.js runs on the canvas and returns words with pixel bounding boxes | This is where accuracy is decided — driven by scan quality and the language you selected |
| Re-embed image | The canvas is encoded as JPEG (quality 0.92) and placed as the page background | The output page is a re-rendered JPEG, not your original page bytes — a re-compression step, not a passthrough |
| Draw text layer | Each recognised word is drawn in Helvetica at opacity 0, scaled to fit its bounding box | This invisible layer is what makes Ctrl+F, selection, and indexing work without altering the visible page |
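The Recognise and Draw text layer stages imply a coordinate conversion: word boxes come back in canvas pixels (top-left origin, rendered at scale 2), while PDF pages are measured in points from the bottom-left corner. The following Python sketch shows one way that conversion could look. The function name, the `(x0, y0, x1, y1)` box layout, and the helper type are illustrative assumptions, not the tool's actual code.

```python
from typing import NamedTuple

RENDER_SCALE = 2.0  # from the Render stage: pages are rendered at scale 2


class PdfRect(NamedTuple):
    x: float       # left edge, in PDF points
    y: float       # bottom edge, in PDF points (PDF origin is bottom-left)
    width: float
    height: float


def canvas_box_to_pdf(box, page_height_pt, scale=RENDER_SCALE):
    """Convert a (x0, y0, x1, y1) pixel box with a top-left origin
    into a PDF rectangle with a bottom-left origin.

    Hypothetical helper for illustration only.
    """
    x0, y0, x1, y1 = box
    width = (x1 - x0) / scale
    height = (y1 - y0) / scale
    x = x0 / scale
    # Flip the vertical axis: the box's bottom edge (y1), measured from the
    # top of the canvas, becomes a distance from the bottom of the page.
    y = page_height_pt - (y1 / scale)
    return PdfRect(x, y, width, height)
```

For a US Letter page (792 points tall), a word box at pixels `(200, 100, 400, 140)` maps to a rectangle 100 points wide and 20 points tall, positioned at `x = 100`, `y = 722`.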
Tier limits that apply to OCR (PDF family)
OCR is governed by the standard PDF size and page limits; there is no separate per-day OCR quota. The values below are taken from the tier table.
| Tier | Max file size | Max pages | Files per batch |
|---|---|---|---|
| Free | 2 MB | 50 pages | 1 |
| Pro | 50 MB | 500 pages | 5 |
| Pro + Media | 500 MB | 2,000 pages | 50 |
| Developer | 2 GB | 10,000 pages | unlimited |
| Enterprise | unlimited | unlimited | unlimited |
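If you want to check a scan against these limits before uploading, the table can be expressed as a small lookup. This is a sketch under stated assumptions: the tier keys are made-up identifiers, "MB" is treated as 2**20 bytes, and the limits are treated as inclusive. None of these details are confirmed by the table itself.

```python
MB = 1024 * 1024  # assumption: the table's "MB" means 2**20 bytes

# (max_bytes, max_pages) per tier, copied from the table above.
# None means "unlimited". The tier keys are illustrative names.
TIER_LIMITS = {
    "free": (2 * MB, 50),
    "pro": (50 * MB, 500),
    "pro_media": (500 * MB, 2_000),
    "developer": (2 * 1024 * MB, 10_000),
    "enterprise": (None, None),
}


def fits_tier(tier, size_bytes, pages):
    """Return True if a file of this size and page count is within the tier."""
    max_bytes, max_pages = TIER_LIMITS[tier]
    size_ok = max_bytes is None or size_bytes <= max_bytes  # inclusive: assumed
    pages_ok = max_pages is None or pages <= max_pages
    return size_ok and pages_ok


def smallest_tier(size_bytes, pages):
    """Return the first tier (in table order) that accepts the file."""
    for tier in TIER_LIMITS:  # dicts preserve insertion order
        if fits_tier(tier, size_bytes, pages):
            return tier
    return None  # not reached while the last tier is unlimited
```

An 18 MB, 40-page scan fails the Free size limit even though it is under the page limit, so `smallest_tier(18 * MB, 40)` returns `"pro"`.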
Cookbook
Practical ways to turn a dead scan into a searchable document, and how to tell whether it worked.
Confirm a PDF is image-only before OCR
If text is already embedded you do not need OCR at all. The quickest signal is whether selection works in a viewer; programmatically, a plain-text extract that comes back empty means the page is an image.
Before OCR, run PDF to Plain Text on the scan:
(output is empty or whitespace only)
-> image-only PDF, OCR is needed
After OCR, run PDF to Plain Text again:
Invoice #4471
Date: 2026-03-02
Amount due: 1,240.00
-> text layer present, document is searchable
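The "empty extract means image-only" rule is easy to apply in a script once you have the plain text of each page. The sketch below assumes you already have one string per page from some text extractor; how you obtain those strings is outside its scope, and both function names are illustrative.

```python
def is_image_only(page_texts):
    """Return True when every page's extracted text is empty or whitespace."""
    return all(not text.strip() for text in page_texts)


def pages_needing_ocr(page_texts):
    """Return 1-based page numbers whose text extract came back blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]
```

A document where only some pages come back blank is a mixed case: part of it already has a text layer, and only the listed pages are scans.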
Search a freshly OCR'd contract
The text layer is invisible but fully indexed. Open the downloaded PDF and search for a clause keyword.
Viewer: Ctrl+F "indemnification"
1 of 3 matches (page 7)
-> highlight lands on the scanned word, even though
the visible pixels are the original scan image
OCR a non-English scan
Select the matching language so Tesseract loads the right model. Latin-script languages (French, German, Spanish, Italian, Portuguese, Dutch) place cleanly into the Helvetica text layer.
Language dropdown: French (fra)
First run: downloads ~10 MB fra.traineddata
Recognises: "Conditions generales de vente"
Searchable in viewer: Ctrl+F "generales" -> match
Make a scan ready for downstream tools
Most extraction tools read embedded text, so they return nothing on a raw scan. OCR first, then chain into a converter.
Step 1 pdf-ocr -> searchable PDF
Step 2 pdf-to-text -> /pdf-tools/pdf-to-text
pdf-table-to-json -> /pdf-tools/pdf-table-to-json
pdf-summary-generator -> /pdf-tools/pdf-summary-generator
Keep file size sane on image-heavy scans
Because OCR re-embeds each page as a JPEG, an already-large scan stays large. Compress afterward if you need to email it.
scan.pdf (18 MB, 40 pages)
-> pdf-ocr -> scan-searchable.pdf (~17 MB)
-> /pdf-tools/pdf-compress-lossy (target 1 MB)
note: lossy compression re-rasterises pages and
DROPS the OCR text layer — compress only if
you no longer need search
Edge cases and what actually happens
PDF already has selectable text
By design: OCR does not check first. It re-renders and re-recognises every page regardless. If your PDF is already searchable, you waste time and degrade the page to a JPEG re-render. Test selection in a viewer (or run PDF to Plain Text) before OCR; only run OCR when the page is genuinely image-only.
Free-tier file over 2 MB or 50 pages
Blocked: Scans are large, so the 2 MB / 50-page free limit is the most common wall. Pro lifts it to 50 MB / 500 pages and Pro+Media to 500 MB / 2,000 pages. Splitting the scan first with PDF Split by Range keeps each chunk under the cap.
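When splitting a long scan to stay under a page cap, the page ranges can be computed up front. A minimal sketch, assuming 1-based inclusive ranges and a default cap of 50 pages (the Free limit); the function name is illustrative and the range format that PDF Split by Range expects is not specified here.

```python
def split_ranges(total_pages, max_pages=50):
    """Return 1-based inclusive (start, end) page ranges, each at most
    max_pages long, that together cover pages 1..total_pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]
```

A 120-page scan yields `(1, 50)`, `(51, 100)`, and `(101, 120)`. Note that this only addresses the page cap: a 50-page chunk of a dense scan can still exceed the 2 MB size limit, so check each chunk's file size as well.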
First run is slow / appears to hang
Expected: The first OCR for a given language downloads ~10 MB of Tesseract training data from a CDN before recognition starts, and OCR is CPU-bound per page. A large scan can take minutes. Subsequent runs reuse the cached model and are faster.
Output page looks slightly softer than the original
Expected: Each page is re-rendered at 2× and re-encoded as JPEG (quality 0.92), so the visible image is a re-compression of the original, not the untouched page stream. The appearance is very close but not byte-identical.
Non-Latin language recognised but not searchable
Limited: The invisible text layer is drawn in Helvetica (WinAnsi). Cyrillic (Russian), Chinese, and Japanese glyphs that Tesseract recognises cannot be encoded into a Helvetica layer, so those scripts may fail to place or be dropped from the searchable layer. The tool is most reliable for Latin-script documents.
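To see in advance which recognised words would survive a WinAnsi-encoded text layer, you can test each word against a matching character set. The sketch below uses Python's `cp1252` codec as a stand-in for WinAnsi. The two are very close but this is an approximation, and the function names are illustrative.

```python
def encodable_in_winansi(word):
    """Return True if every character of word exists in cp1252,
    used here as an approximation of PDF WinAnsiEncoding."""
    try:
        word.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


def split_by_encodability(words):
    """Partition words into (placeable, unplaceable) lists."""
    placeable = [w for w in words if encodable_in_winansi(w)]
    unplaceable = [w for w in words if not encodable_in_winansi(w)]
    return placeable, unplaceable
```

Accented Latin words such as "générales" pass, while Cyrillic and CJK words do not, which matches the behaviour described above.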
Handwriting on the page
Poor accuracy: Tesseract is a printed-text engine. Handwritten notes recognise unreliably; see the handwritten OCR guide for realistic expectations and a manual-review workflow.
Skewed or low-DPI scan
Degraded: Tilted pages and scans below ~300 DPI lower recognition accuracy. The tool has no deskew or DPI control, so rescan straight at 300 DPI+ for best results; OCR works on the image it is given.
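If you are unsure whether an existing scan meets the ~300 DPI guideline, you can estimate it from the scanned image's pixel width and the page's physical width. A minimal sketch, assuming you know both numbers (for example from the image properties and the page size in points); the function name is illustrative.

```python
POINTS_PER_INCH = 72.0  # PDF page dimensions are expressed in points


def effective_dpi(image_px_width, page_width_pt):
    """Estimate scan resolution: image pixels per inch of page width."""
    if page_width_pt <= 0:
        raise ValueError("page_width_pt must be positive")
    page_width_in = page_width_pt / POINTS_PER_INCH
    return image_px_width / page_width_in
```

A US Letter page is 612 points (8.5 inches) wide, so a 2550-pixel-wide scan works out to 300 DPI and a 1275-pixel-wide scan to 150 DPI.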
Run in a non-browser / Node context
Passthrough: OCR requires a DOM canvas. Outside a browser (e.g. a Node test run) the function returns the input buffer unchanged rather than erroring, so OCR only happens in the live browser tool.
Mixed-language document
Single language: The dropdown selects one Tesseract model per run. A page mixing, say, English and Japanese will only recognise the selected language well. Run the tool once per dominant language section, or pick the language with the most text.
Frequently asked questions
Does OCR change how the PDF looks?
Barely. The text layer is drawn at opacity 0 so it is invisible, and the scan is shown as the page image. One subtlety: that page image is a fresh 2× render re-encoded as JPEG (quality 0.92), so it is a re-compression of the original rather than the untouched page stream. It is close to identical, but not byte-for-byte.
What OCR accuracy should I expect on a clean scan?
Clean, straight black-on-white scans of printed text at 300 DPI+ recognise very well. Accuracy drops with low resolution, skew, background colour or noise, faint print, and especially handwriting. There is no confidence-threshold setting — every recognised word is placed into the layer as-is.
Can I get the text out as a .txt file?
Not from this tool directly — OCR always outputs a searchable PDF. Once the invisible text layer exists, run the result through PDF to Plain Text for a .txt, PDF to Markdown for Markdown, or PDF Table to JSON for tabular data.
Which languages can it recognise?
Ten, chosen in the OCR-language dropdown: English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn). English is the default. Latin-script languages place cleanly into the searchable layer; Cyrillic, Chinese, and Japanese may be recognised but cannot be encoded into the Helvetica text layer.
Why does the first run take so long?
Tesseract downloads roughly 10 MB of training data for the selected language from a CDN on first use, then caches it in the browser. After that, only the per-page recognition time remains — and OCR is CPU-bound, so dense or multi-page scans still take a while.
Is the searchable text indexed by SharePoint / Google Drive?
Yes. Full-text search in SharePoint, Google Drive, and most document-management systems reads the PDF text layer — which is exactly what this OCR pass adds. Before OCR, those systems index a scan as blank.
Will my document be uploaded?
No. pdf.js, Tesseract.js, and pdf-lib all run in your browser tab. The scan never leaves your device; only an anonymous usage counter is recorded when you are signed in. The one network call is the one-time training-data download.
How many pages can I OCR at once?
It depends on tier: Free allows up to 50 pages and 2 MB, Pro up to 500 pages and 50 MB, Pro+Media up to 2,000 pages and 500 MB, Developer up to 10,000 pages and 2 GB. For very large scans, split with PDF Split by Range first.
Can I select which pages to OCR?
No — OCR processes every page of the uploaded file. To OCR only part of a document, extract those pages first with PDF Extract Pages, OCR the extract, then re-merge if needed.
My PDF already has text — should I still OCR it?
No. If Ctrl+F already finds words, the PDF has a text layer and OCR would only re-rasterise the pages and slow you down. Reserve OCR for image-only PDFs (scans, photos, image exports).
Why is the output file not smaller than the input?
OCR re-embeds each page as a JPEG and adds a text layer, so an image-heavy scan stays roughly the same size or slightly larger. To shrink it for email, run Aggressive PDF Compression afterward — but note that lossy compression re-rasterises pages and removes the searchable text layer.
Can I automate OCR in a pipeline?
Yes. Fetch the tool schema from GET /api/v1/tools/pdf-ocr, pair the @jadapps/runner once, then POST the file plus { "lang": "eng" } to 127.0.0.1:9789/v1/tools/pdf-ocr/run. The scan is processed locally by the runner on your machine — it never reaches JAD's servers. A common pipeline is: scan in -> pdf-ocr -> pdf-to-text -> index.
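As a rough illustration of the POST step, the Python sketch below builds (but does not send) a request to the local runner. Only the host, port, path, and the `{"lang": "eng"}` option come from the answer above. The `http` scheme, the multipart encoding, and the field names `file` and `options` are assumptions made for illustration; check the schema returned by GET /api/v1/tools/pdf-ocr for the real request shape.

```python
import json
import urllib.request
import uuid

# Host, port, and path are from the text; the http scheme is assumed.
RUNNER_URL = "http://127.0.0.1:9789/v1/tools/pdf-ocr/run"


def build_ocr_request(pdf_bytes, filename="scan.pdf", lang="eng"):
    """Build a multipart POST carrying the PDF and the OCR options.

    Field names "options" and "file" are hypothetical placeholders.
    """
    boundary = uuid.uuid4().hex
    options = json.dumps({"lang": lang})

    options_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="options"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{options}\r\n"
    )
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    )
    closing = f"\r\n--{boundary}--\r\n"

    body = (
        options_part.encode("utf-8")
        + file_header.encode("utf-8")
        + pdf_bytes
        + closing.encode("utf-8")
    )
    return urllib.request.Request(
        RUNNER_URL,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Once the runner is paired and running, the request could be sent with `urllib.request.urlopen(build_ocr_request(data))` and the response body saved as the searchable PDF, again assuming the runner returns the file directly.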
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.