How to apply OCR to a PDF for structured data extraction
- Step 1: Confirm the source needs OCR. Run PDF to Plain Text on the document. Empty output means it is a scan that needs OCR; if text already comes back, skip OCR and go straight to extraction.
- Step 2: Drop the scanned document into the OCR tool. Load it; recognition runs locally in your browser with no upload.
- Step 3: Select the document language. Pick the language from the dropdown: English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), or Japanese (jpn). English is the default. First use downloads ~10 MB of Tesseract data, then caches it.
- Step 4: Run OCR as stage one. Each page is rendered, recognised, and rebuilt with an invisible text layer. This must happen before any extraction step.
- Step 5: Extract structured data from the OCR output. Feed the searchable PDF into PDF Table to JSON for tables, PDF to Excel for CSV, or PDF Form Field Extractor for AcroForm fields.
- Step 6: Validate the extracted fields. The OCR output carries no confidence score, and digits can be misread. Validate totals, dates, and IDs, ideally with checksum or range rules, before the data enters a system of record.
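The validation in Step 6 can be sketched as a sum check on extracted line items. This is a minimal sketch, not part of the tool: the function names `parse_amount` and `check_total` and the stated total of "1,720.00" are assumptions made for illustration, and the rows reuse the shape of the invoice sample in the Cookbook.

```python
from decimal import Decimal, InvalidOperation

def parse_amount(text: str) -> Decimal:
    """Parse an amount such as '1,500.00'; raise ValueError on anything non-numeric."""
    cleaned = text.replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {text!r}") from exc

def check_total(rows: list, stated_total: str, column: str = "Amount") -> bool:
    """Return True when the line items add up to the total printed on the document."""
    line_sum = sum((parse_amount(row[column]) for row in rows), Decimal("0"))
    return line_sum == parse_amount(stated_total)

# Rows shaped like PDF Table to JSON output; the totals are hypothetical.
rows = [
    {"Description": "Consulting", "Hours": "10", "Amount": "1,500.00"},
    {"Description": "Travel", "Hours": "", "Amount": "220.00"},
]
print(check_total(rows, "1,720.00"))  # True
print(check_total(rows, "1,720.80"))  # False
```

Using `Decimal` rather than `float` keeps currency comparisons exact, and a misread such as `1,5OO.00` fails loudly in `parse_amount` instead of passing through.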
Pipeline stages: where OCR fits
OCR is always stage one for scanned inputs. The right extractor depends on the data shape.
| Stage | Tool | Input | Output |
|---|---|---|---|
| 1. Recognise | pdf-ocr | Scanned PDF (image-only) | Searchable PDF |
| 2a. Tables -> JSON | PDF Table to JSON | Searchable PDF | Array of row objects |
| 2b. Tables -> CSV | PDF to Excel | Searchable PDF | CSV text |
| 2c. Form fields | PDF Form Field Extractor | PDF with AcroForm fields | Field name/type/value JSON |
| 2d. Free text | PDF to Plain Text | Searchable PDF | Plain text |
Honest scope: this tool vs. a document-AI service
Pick the right tool for your volume and accuracy needs.
| Need | This OCR tool | Better fit |
|---|---|---|
| Ad-hoc single-document extraction | Ideal — free, private, in-browser | — |
| Confidence scores per field | Not provided | Cloud document AI (e.g. Textract/Document AI) |
| Key-value / layout zoning | Not provided (word-position grouping only) | Form-specific extraction service |
| High-volume automated throughput | Low-volume via local runner | Server-side OCR/document-AI pipeline |
| Non-Latin scripts in the text layer | Limited (Helvetica/WinAnsi only) | Unicode-capable desktop/server OCR |
Cookbook
Pipeline recipes for the documents teams actually need to extract from.
Scanned invoice -> JSON line items
OCR the invoice, then group the recognised table into row objects. Validate the amount column before use.
invoice-scan.pdf -> pdf-ocr (eng) -> searchable.pdf
searchable.pdf -> /pdf-tools/pdf-table-to-json ->
[
{ "Description": "Consulting", "Hours": "10", "Amount": "1,500.00" },
{ "Description": "Travel", "Hours": "", "Amount": "220.00" }
]
Scanned report table -> CSV
For a spreadsheet workflow, route the OCR output to PDF to Excel for CSV rows.
report-scan.pdf -> pdf-ocr -> /pdf-tools/pdf-to-excel ->
"Region","Q1","Q2"
"North","412","507"
"South","388","445"
(verify numeric columns: OCR can misread 0/O, 1/l)
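One way to verify numeric columns in the CSV is to flag every cell that should contain only digits but does not. The sketch below uses only the Python standard library; the function name `suspicious_cells` and the misread value `5O7` (letter O in place of zero) are made up for illustration.

```python
import csv
import io

def suspicious_cells(csv_text: str, numeric_columns: list) -> list:
    """Return (line_number, column, value) for cells that are not plain digits."""
    flagged = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # the header is line 1
        for column in numeric_columns:
            value = (row.get(column) or "").strip()
            if not value.isdigit():
                flagged.append((line_number, column, value))
    return flagged

# Hypothetical OCR output where "507" was misread as "5O7".
csv_text = '"Region","Q1","Q2"\n"North","412","5O7"\n"South","388","445"\n'
print(suspicious_cells(csv_text, ["Q1", "Q2"]))  # [(2, 'Q2', '5O7')]
```

Empty cells are flagged as well, which is usually what you want for columns that must always hold a number.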
Extract fields from a filled scanned form
If the scan flattened the AcroForm into an image, the Form Field Extractor finds no fields — OCR the labels/values and extract as a table or text instead.
filled-form-scan.pdf -> /pdf-tools/pdf-form-extractor ->
[] (no live form fields — scan is flattened)
Instead: pdf-ocr -> pdf-table-to-json / pdf-to-text
to read the printed labels and values
Confirm OCR ran before extraction
Make the OCR step verifiable in the pipeline by checking text is present before extracting.
assert PDF-to-Plain-Text(searchable.pdf) is non-empty
  -> proceed to extraction
else
  -> re-run pdf-ocr (or fix scan quality) and retry
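The check above could look like this in a script. It is a sketch under stated assumptions: `extract_text` and `run_ocr` are hypothetical callables that you supply (for example, thin wrappers around PDF to Plain Text and pdf-ocr), and the stand-ins at the bottom exist only to make the example runnable.

```python
from typing import Callable

def ensure_text_layer(
    pdf: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], bytes],
    max_attempts: int = 2,
) -> bytes:
    """Return a PDF whose extracted text is non-empty, running OCR when needed."""
    if extract_text(pdf).strip():
        return pdf  # text already present: skip OCR
    for _ in range(max_attempts):
        candidate = run_ocr(pdf)
        if extract_text(candidate).strip():
            return candidate  # OCR produced a text layer: safe to extract
    raise RuntimeError("no text layer after OCR; check scan quality")

# Stand-in callables for demonstration only; they do no real PDF work.
fake_extract = lambda data: data.decode() if data.startswith(b"TEXT:") else ""
fake_ocr = lambda data: b"TEXT:" + data
print(ensure_text_layer(b"scan", fake_extract, fake_ocr))  # b'TEXT:scan'
```

Raising instead of returning an empty result makes a failed OCR stage visible before any extractor runs on it.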
Low-volume automated extraction via the runner
Chain OCR and extraction through the local runner so documents never leave the machine.
POST 127.0.0.1:9789/v1/tools/pdf-ocr/run
body: scan.pdf, { "lang": "eng" }
-> POST 127.0.0.1:9789/v1/tools/pdf-table-to-json/run
body: (ocr output)
-> rows JSON, processed locally
Edge cases and what actually happens
Extractor returns empty on a scan
Needs OCR: PDF Table to JSON and PDF to Plain Text read embedded text; a raw scan has none, so they return nothing. Run OCR first; that is the whole point of stage one.
OCR misreads a digit in a total
Validation required: There is no confidence score, so 0/O, 1/l/7, and 5/S misreads slip through silently. Apply validation rules (sum checks, date ranges, ID formats) downstream and never push OCR'd numbers into a system of record unchecked.
Form fields not detected after OCR
Flattened scan: A scanned form has no live AcroForm fields; it is an image of a form. PDF Form Field Extractor returns an empty list. Read the printed labels/values via OCR + PDF Table to JSON or PDF to Plain Text instead.
Columns merge in the extracted table
Layout limit: Table reconstruction groups OCR words by Y (rows) and X (columns); tight columns or jittery OCR spacing can merge or split cells. Verify the JSON/CSV and, where possible, scan at higher DPI for cleaner spacing.
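To see why tight columns merge, here is a simplified illustration of position-based grouping. It is not the tool's actual implementation: the `Word` type, the tolerances, and the coordinates are all assumptions chosen to show the failure mode.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge of the word
    y: float  # vertical position of the word

def group_rows(words: list, y_tolerance: float) -> list:
    """Cluster words into rows: a word joins the current row if its Y is close enough."""
    rows = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=lambda w: w.x) for row in rows]

def split_cells(row: list, x_gap: float) -> list:
    """Start a new cell whenever the distance between left edges exceeds x_gap."""
    cells = [row[0].text]
    for prev, word in zip(row, row[1:]):
        if word.x - prev.x > x_gap:
            cells.append(word.text)
        else:
            cells[-1] += " " + word.text
    return cells

# Hypothetical OCR words: "507" sits too close to "412" in the first row.
words = [
    Word("North", 10, 100), Word("412", 120, 101), Word("507", 150, 100.5),
    Word("South", 10, 120), Word("388", 120, 119.5), Word("445", 200, 120),
]
print([split_cells(row, 40) for row in group_rows(words, 3)])
# [['North', '412 507'], ['South', '388', '445']]
```

The first row loses a column because two values fall within the gap threshold, which is the kind of merge that cleaner, higher-resolution scans help avoid.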
High volume hits practical limits
Out of scope: This is interactive, single-document, CPU-bound OCR with per-tier file and page caps (e.g. Pro 50 MB / 500 pages). For high-throughput pipelines use a server-side OCR/document-AI service; use this tool for ad-hoc and low-volume work.
Non-Latin financial document
Limited: Cyrillic and CJK text can be recognised but cannot be written into the Helvetica (WinAnsi) text layer, so the searchable layer, and therefore extraction, comes back empty for those scripts. Use a Unicode-capable OCR engine for non-Latin documents.
Free-tier document over the cap
Blocked: Free allows 2 MB / 50 pages. Batch invoice runs hit this fast. Upgrade to Pro (50 MB / 500 pages) or split the document with PDF Split by Range before OCR.
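If you split before OCR, the page ranges to request can be computed up front. A small sketch, assuming 1-based inclusive ranges; the helper name `page_ranges` is hypothetical, and it addresses only the page cap, so each resulting part still has to fit under the file-size cap.

```python
def page_ranges(total_pages: int, max_pages: int = 50) -> list:
    """Split pages 1..total_pages into inclusive ranges of at most max_pages each."""
    if total_pages < 1 or max_pages < 1:
        raise ValueError("page counts must be positive")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]

print(page_ranges(120))  # [(1, 50), (51, 100), (101, 120)]
```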
First language load before first extraction
Expected: The ~10 MB training-data download happens once per language, ahead of recognition. Account for it on the first run of a new pipeline; later runs use the cache.
Run outside a browser
Passthrough: OCR needs a canvas. In a non-browser context the buffer is returned unchanged and no recognition occurs. Use the browser tool or the local runner.
Frequently asked questions
Should I apply OCR before or after other steps?
Before extraction, always. OCR is stage one for any scanned input — PDF Table to JSON, PDF to Excel, PDF to Plain Text, and similar tools all need the text layer OCR creates. Compression and conversion to flat images should come last, since lossy compression destroys the text layer.
What DPI gives the best extraction accuracy?
300 DPI is the practical minimum for small print; 400–600 DPI helps with dense tables and fine print. The tool itself renders at a fixed 2× and offers no DPI control, so the win comes from scanning the source document at higher resolution.
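As a rough guide to what those resolutions mean, the pixel dimensions of a scanned page are simply the paper size in inches multiplied by the DPI. The example assumes a US Letter page (8.5 × 11 in); `scan_pixels` is a hypothetical helper.

```python
def scan_pixels(width_in: float, height_in: float, dpi: int) -> tuple:
    """Pixel dimensions of a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)

for dpi in (300, 400, 600):
    print(dpi, scan_pixels(8.5, 11, dpi))
# 300 (2550, 3300)
# 400 (3400, 4400)
# 600 (5100, 6600)
```

Doubling the DPI quadruples the pixel count, so higher-resolution scans reach the per-tier file-size caps sooner.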
Does OCR output the extracted data directly?
No — OCR outputs a searchable PDF. The structured data comes from the extractor you run next: PDF Table to JSON for row objects, PDF to Excel for CSV, or PDF Form Field Extractor for live form fields.
Are there confidence scores I can threshold on?
No. Every recognised word is written into the text layer regardless of confidence — there is no per-word score exposed. Build validation downstream (sum checks, regex on IDs, date-range checks) and review flagged fields manually.
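In place of a confidence threshold, downstream rules can decide which fields go to manual review. The sketch below is one possible shape: the field names, the `INV-` plus six digits ID format, and the date window are all assumptions for illustration, not something the tool defines.

```python
import re
from datetime import date

INVOICE_ID = re.compile(r"INV-\d{6}")  # hypothetical ID format

def review_flags(fields: dict, earliest: date, latest: date) -> list:
    """Return the names of fields that fail a rule and need manual review."""
    flags = []
    if not INVOICE_ID.fullmatch(fields.get("invoice_id", "")):
        flags.append("invoice_id")
    try:
        issued = date.fromisoformat(fields.get("issued", ""))
        if not earliest <= issued <= latest:
            flags.append("issued")
    except ValueError:  # missing or unparseable date
        flags.append("issued")
    return flags

# Hypothetical extracted fields; the ID contains a lowercase "l" misread for "1".
fields = {"invoice_id": "INV-00l234", "issued": "2024-03-15"}
print(review_flags(fields, date(2024, 1, 1), date(2024, 12, 31)))  # ['invoice_id']
```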
Why does the Form Field Extractor find nothing on my scanned form?
Because a scanned form is an image — the interactive AcroForm fields were flattened away. The extractor only reports live form fields. OCR the scan, then read the printed labels and values with PDF Table to JSON or PDF to Plain Text.
Can this replace AWS Textract or Google Document AI?
For ad-hoc, low-volume, privacy-sensitive extraction, often yes — it is free and runs in your browser. For high volume, per-field confidence, key-value zoning, or trained form models, a cloud document-AI service is the right tool. This is single-document interactive OCR, not a managed pipeline.
How do I keep financial documents private during extraction?
Use the in-browser tool (nothing is uploaded) or, for automation, the local runner — which processes documents on your own machine. Either way the scan and its data never reach JAD's servers; only an anonymous usage count is recorded when signed in.
Which language should I select for an invoice?
The invoice's language, from English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn). English is default. Picking the right model improves recognition of accented text and locale-specific characters. Each language loads ~10 MB of data on first use, then caches.
How large a batch can I OCR per run?
OCR is one file per run, bounded by tier: Free 2 MB / 50 pages, Pro 50 MB / 500 pages, Pro+Media 500 MB / 2,000 pages, Developer 2 GB / 10,000 pages. For many documents, script the runner to process them sequentially.
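When scripting many documents, a pre-flight check against these caps avoids submitting files that would be rejected. A minimal sketch using the limits listed above; treating 1 MB as 1024 × 1024 bytes is an assumption, and `smallest_tier` is a hypothetical helper.

```python
from typing import Optional

MB = 1024 * 1024  # assumption: the caps are counted in binary megabytes

# (tier name, max file size in bytes, max pages), smallest tier first
TIERS = [
    ("Free", 2 * MB, 50),
    ("Pro", 50 * MB, 500),
    ("Pro+Media", 500 * MB, 2_000),
    ("Developer", 2 * 1024 * MB, 10_000),
]

def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier whose caps fit the file, or None if none does."""
    for name, max_bytes, max_pages in TIERS:
        if size_bytes <= max_bytes and pages <= max_pages:
            return name
    return None

print(smallest_tier(3 * MB, 40))  # Pro
```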
Will lossy compression help my OCR'd invoices?
Only if you no longer need the text. Aggressive PDF Compression re-rasterises every page to a JPEG and removes the searchable text layer — so compress after extraction, never before.
Can I chunk an OCR'd report for an LLM extraction step?
Yes. After OCR, PDF to Text Chunks produces overlapping, sentence-aware chunks with page ranges and token estimates — ready to feed an LLM that does the structured extraction in your pipeline.
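For intuition about what overlapping, sentence-aware chunks look like, here is a simplified stand-in. It is not how PDF to Text Chunks is implemented: the sentence splitter, the character budget, and the one-sentence overlap are assumptions, and page ranges and token estimates are left out.

```python
import re

def chunk_sentences(text: str, max_chars: int = 200, overlap: int = 1) -> list:
    """Pack whole sentences into chunks, repeating the last `overlap` sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_sentences("A one. B two. C three. D four.", max_chars=14))
# ['A one. B two.', 'B two. C three.', 'C three. D four.']
```

Because the overlap sentence is carried forward before the next one is added, a chunk can exceed `max_chars` slightly; the budget is a soft target in this sketch.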
How do I script the OCR-then-extract pipeline?
Fetch schemas from GET /api/v1/tools/pdf-ocr and GET /api/v1/tools/pdf-table-to-json, pair the @jadapps/runner once, then POST the scan to 127.0.0.1:9789/v1/tools/pdf-ocr/run with { "lang": "eng" } and feed the result to pdf-table-to-json/run. Both stages run locally on your machine.
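A scripted version of the two-stage call might look like the sketch below. The endpoint paths and the `{ "lang": "eng" }` option come from the answer above; everything else is an assumption that should be checked against the schemas returned by the GET endpoints: the `http://` scheme, the multipart body layout, the part names `options` and `file`, and the response formats. Pairing and authentication are not shown.

```python
import json
import urllib.request
import uuid

RUNNER = "http://127.0.0.1:9789/v1/tools"  # scheme is an assumption

def build_run_request(tool: str, pdf: bytes, options: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the runner. The body layout is an ASSUMPTION."""
    boundary = uuid.uuid4().hex
    parts = [
        # Hypothetical "options" part carrying the tool options as JSON.
        f'--{boundary}\r\nContent-Disposition: form-data; name="options"\r\n'
        f"Content-Type: application/json\r\n\r\n{json.dumps(options)}\r\n".encode(),
        # Hypothetical "file" part carrying the PDF bytes.
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="scan.pdf"\r\n'
        "Content-Type: application/pdf\r\n\r\n".encode() + pdf + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        f"{RUNNER}/{tool}/run",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )

def run(tool: str, pdf: bytes, options: dict) -> bytes:
    """Send the request to the local runner and return the raw response body."""
    with urllib.request.urlopen(build_run_request(tool, pdf, options)) as response:
        return response.read()

# Intended chaining, assuming pdf-ocr returns PDF bytes and the table tool returns JSON:
# searchable = run("pdf-ocr", scan_bytes, {"lang": "eng"})
# rows = json.loads(run("pdf-table-to-json", searchable, {}))
```

Keeping request construction separate from sending makes it easy to inspect the body and adjust it once the real schema is known.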
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.