OCR a PDF for Data Extraction — Free Online Tool

How to apply OCR to a PDF for structured data extraction

Step 1
Confirm the source needs OCR: run PDF to Plain Text on the document. Empty output means it is a scan that needs OCR; if text already comes back, skip OCR and go straight to extraction.
Step 2
Drop the scanned document into the OCR tool: recognition runs locally in your browser with no upload.
Step 3
Select the document language: pick it from the dropdown, which offers English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), and Japanese (jpn). English is the default. The first use of each language downloads ~10 MB of Tesseract data, which is then cached.
Step 4
Run OCR as stage one: each page is rendered, recognised, and rebuilt with an invisible text layer. This must happen before any extraction step.
Step 5
Extract structured data from the OCR output: feed the searchable PDF into PDF Table to JSON for tables or PDF to Excel for CSV. PDF Form Field Extractor only reads live AcroForm fields, which a flattened scan does not have.
Step 6
Validate the extracted fields: the OCR step exposes no confidence score and can misread digits. Validate totals, dates, and IDs, ideally with checksum or range rules, before the data enters a system of record.

Pipeline stages: where OCR fits

OCR is always stage one for scanned inputs. The right extractor depends on the data shape.

Stage	Tool	Input	Output
1. Recognise	pdf-ocr	Scanned PDF (image-only)	Searchable PDF
2a. Tables -> JSON	PDF Table to JSON	Searchable PDF	Array of row objects
2b. Tables -> CSV	PDF to Excel	Searchable PDF	CSV text
2c. Form fields	PDF Form Field Extractor	PDF with AcroForm fields	Field name/type/value JSON
2d. Free text	PDF to Plain Text	Searchable PDF	Plain text

Honest scope: this tool vs. a document-AI service

Pick the right tool for your volume and accuracy needs.

Need	This OCR tool	Better fit
Ad-hoc single-document extraction	Ideal — free, private, in-browser	—
Confidence scores per field	Not provided	Cloud document AI (e.g. Textract/Document AI)
Key-value / layout zoning	Not provided (word-position grouping only)	Form-specific extraction service
High-volume automated throughput	Low-volume via local runner	Server-side OCR/document-AI pipeline
Non-Latin scripts in the text layer	Limited (Helvetica/WinAnsi only)	Unicode-capable desktop/server OCR

Cookbook

Pipeline recipes for the documents teams actually need to extract from.

Scanned invoice -> JSON line items

OCR the invoice, then group the recognised table into row objects. Validate the amount column before use.

invoice-scan.pdf -> pdf-ocr (eng) -> searchable.pdf
searchable.pdf -> /pdf-tools/pdf-table-to-json ->
[
  { "Description": "Consulting", "Hours": "10", "Amount": "1,500.00" },
  { "Description": "Travel",     "Hours": "",   "Amount": "220.00" }
]
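
The amount check mentioned above can be sketched in Python. This is a minimal illustration, not part of the tool: the function names are hypothetical, and it assumes amounts use a comma as the thousands separator and a dot as the decimal point, as in the sample rows.

```python
from decimal import Decimal, InvalidOperation

# Rows in the shape shown in the recipe above.
rows = [
    {"Description": "Consulting", "Hours": "10", "Amount": "1,500.00"},
    {"Description": "Travel", "Hours": "", "Amount": "220.00"},
]


def parse_amount(text: str) -> Decimal:
    """Parse an amount such as '1,500.00'; raise ValueError on OCR debris.

    Assumption: ',' is a thousands separator and '.' is the decimal point.
    """
    cleaned = text.replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"not a valid amount: {text!r}") from exc
    if not value.is_finite():
        raise ValueError(f"not a finite amount: {text!r}")
    return value


def line_item_total(items) -> Decimal:
    """Sum the Amount column; any unparseable cell aborts the whole sum."""
    return sum((parse_amount(item["Amount"]) for item in items), Decimal("0"))


print(line_item_total(rows))  # 1720.00
```

Comparing this sum with the total printed on the invoice catches many single-digit misreads, because a letter such as O in place of 0 fails to parse and a wrong digit changes the sum.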

Scanned report table -> CSV

For a spreadsheet workflow, route the OCR output to PDF to Excel for CSV rows.

report-scan.pdf -> pdf-ocr -> /pdf-tools/pdf-to-excel ->
"Region","Q1","Q2"
"North","412","507"
"South","388","445"
(verify numeric columns — OCR can misread 0/O, 1/l)
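
One way to verify the numeric columns is to flag every cell that is not made of plain digits. The sketch below uses Python's standard csv module; the function name and the choice of which columns count as numeric are assumptions for illustration.

```python
import csv
import io

# CSV text in the shape shown in the recipe above.
csv_text = '"Region","Q1","Q2"\n"North","412","507"\n"South","388","445"\n'


def non_numeric_cells(text, numeric_columns):
    """Return (line_number, column, value) for cells that are not plain digits.

    Empty cells are flagged too, since str.isdigit() is False for "".
    """
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        for column in numeric_columns:
            value = (row.get(column) or "").strip()
            if not (value.isascii() and value.isdigit()):
                problems.append((line_number, column, value))
    return problems


print(non_numeric_cells(csv_text, ["Q1", "Q2"]))  # []
```

A cell such as 4l2 (lowercase L instead of 1) would come back as (2, "Q1", "4l2"), which gives a reviewer the exact line to check against the scan.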

Extract fields from a filled scanned form

If the scan flattened the AcroForm into an image, the Form Field Extractor finds no fields — OCR the labels/values and extract as a table or text instead.

filled-form-scan.pdf -> /pdf-tools/pdf-form-extractor ->
  []  (no live form fields — scan is flattened)

Instead: pdf-ocr -> pdf-table-to-json / pdf-to-text
         to read the printed labels and values

Confirm OCR ran before extraction

Make the OCR step verifiable in the pipeline by checking text is present before extracting.

if PDF-to-Plain-Text(searchable.pdf) is non-empty
  -> proceed to extraction
else
  -> re-run pdf-ocr (or fix scan quality) and retry
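
The same gate can be written as a small Python function. This is a sketch under assumptions: extract_text and run_ocr are hypothetical callables standing in for whatever invokes PDF to Plain Text and pdf-ocr in your pipeline, and the retry limit is an arbitrary choice.

```python
from typing import Callable


def ensure_text_layer(
    pdf: bytes,
    extract_text: Callable[[bytes], str],  # hypothetical wrapper around PDF to Plain Text
    run_ocr: Callable[[bytes], bytes],     # hypothetical wrapper around pdf-ocr
    max_ocr_runs: int = 2,
) -> bytes:
    """Return a PDF that yields non-empty text, running OCR when needed.

    Raises RuntimeError if the text is still empty after max_ocr_runs
    attempts, which usually points at scan quality rather than the pipeline.
    """
    runs = 0
    while not extract_text(pdf).strip():
        if runs == max_ocr_runs:
            raise RuntimeError("no text after OCR; check scan quality")
        pdf = run_ocr(pdf)
        runs += 1
    return pdf
```

Passing the two steps in as callables keeps the gate independent of how the tools are invoked, so the same check works in a test with stubs and in a real run.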

Low-volume automated extraction via the runner

Chain OCR and extraction through the local runner so documents never leave the machine.

POST 127.0.0.1:9789/v1/tools/pdf-ocr/run
  body: scan.pdf, { "lang": "eng" }
-> POST 127.0.0.1:9789/v1/tools/pdf-table-to-json/run
  body: (ocr output)
-> rows JSON, processed locally
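
A hedged Python sketch of the chaining logic follows. The host, port, and paths come from the recipe above; the http scheme is assumed, and the request encoding is deliberately left to a caller-supplied post function because the runner's body format is not specified here. All function names are hypothetical.

```python
from typing import Callable, Dict

# Host and port as given in the recipe; the "http://" scheme is an assumption.
RUNNER_BASE = "http://127.0.0.1:9789"

# post(url, file_bytes, options) -> response bytes. How the file and options
# are encoded on the wire is not covered here, so the transport is injected.
Post = Callable[[str, bytes, Dict[str, str]], bytes]


def run_url(tool: str) -> str:
    """Build the run endpoint for a tool slug such as 'pdf-ocr'."""
    return f"{RUNNER_BASE}/v1/tools/{tool}/run"


def ocr_then_extract(scan: bytes, post: Post, lang: str = "eng") -> bytes:
    """Run stage one (OCR), then feed its output to the table extractor."""
    searchable = post(run_url("pdf-ocr"), scan, {"lang": lang})
    return post(run_url("pdf-table-to-json"), searchable, {})
```

Keeping the two stages in one function makes the ordering explicit: the extractor is never called on the raw scan.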

Edge cases and what actually happens

Extractor returns empty on a scan

Needs OCR

PDF Table to JSON and PDF to Plain Text read embedded text; a raw scan has none, so they return nothing. Run OCR first — that is the whole point of stage one.

OCR misreads a digit in a total

Validation required

There is no confidence score; 0/O, 1/l/7, 5/S misreads slip through silently. Apply validation rules (sum checks, date ranges, ID formats) downstream and never push OCR'd numbers straight into a system of record unchecked.
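
The ID-format and date-range rules can be sketched in Python as below. The invoice ID pattern, the ISO date format, and the set of look-alike letters are illustrative assumptions; substitute the formats your documents actually use.

```python
import re
from datetime import date

# Letters that OCR commonly produces in place of the digits 0, 1 and 5.
LETTERS_THAT_MIMIC_DIGITS = set("OolIS")

# Hypothetical ID format for illustration: "INV-" followed by six digits.
INVOICE_ID = re.compile(r"INV-\d{6}")


def suspicious_chars(numeric_field: str) -> list:
    """List look-alike letters found in a field that should be numeric."""
    return [ch for ch in numeric_field if ch in LETTERS_THAT_MIMIC_DIGITS]


def id_is_valid(value: str) -> bool:
    """True only if the whole value matches the expected ID format."""
    return INVOICE_ID.fullmatch(value) is not None


def date_in_range(value: str, earliest: date, latest: date) -> bool:
    """True if value is an ISO date (YYYY-MM-DD) within [earliest, latest]."""
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return False
    return earliest <= parsed <= latest
```

These checks flag fields for review rather than correcting them: silently turning O into 0 would hide the misread instead of surfacing it.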

Form fields not detected after OCR

Flattened scan

A scanned form has no live AcroForm fields — it is an image of a form. PDF Form Field Extractor returns an empty list. Read the printed labels/values via OCR + PDF Table to JSON or PDF to Plain Text instead.

Columns merge in the extracted table

Layout limit

Table reconstruction groups OCR words by Y (rows) and X (columns); tight columns or jittery OCR spacing can merge or split cells. Verify the JSON/CSV and, where possible, scan at higher DPI for cleaner spacing.
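
To see why tight columns merge, here is a simplified Python sketch of position-based grouping. It is not the tool's implementation: the Word type, the tolerance value, and the fixed column start positions are assumptions chosen to illustrate the general technique.

```python
import bisect
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word
    y: float  # baseline of the word


def group_rows(words: List[Word], y_tolerance: float = 3.0) -> List[List[Word]]:
    """Group words whose Y is within y_tolerance of the row's first word."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=lambda w: w.x) for row in rows]


def split_columns(row: List[Word], column_starts: List[float]) -> List[str]:
    """Place each word in the last column whose start is at or left of its X."""
    cells = [""] * len(column_starts)
    for word in row:
        index = max(bisect.bisect_right(column_starts, word.x) - 1, 0)
        cells[index] = (cells[index] + " " + word.text).strip()
    return cells


words = [
    Word("Consulting", 40, 100.0), Word("10", 300, 101.0), Word("1,500.00", 420, 99.5),
    Word("Travel", 40, 120.0), Word("220.00", 421, 120.4),
]
for row in group_rows(words):
    print(split_columns(row, [40, 300, 420]))
# ['Consulting', '10', '1,500.00']
# ['Travel', '', '220.00']
```

If OCR jitter moved "10" two units left, to x = 298, it would fall before the second column start and the first cell would become "Consulting 10". That is the merge described above, and cleaner spacing from a higher-resolution scan reduces it.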

High volume hits practical limits

Out of scope

This is interactive, single-document, CPU-bound OCR with per-tier file and page caps (e.g. Pro 50 MB / 500 pages). For high-throughput pipelines use a server-side OCR/document-AI service; use this tool for ad-hoc and low-volume work.

Non-Latin financial document

Limited

Cyrillic and CJK text can be recognised but cannot be written into the Helvetica (WinAnsi) text layer, so the searchable layer — and therefore extraction — comes back empty for those scripts. Use a Unicode-capable OCR engine for non-Latin documents.
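
A quick pre-check can tell whether a string could be written into a WinAnsi text layer at all. The Python sketch below uses the cp1252 codec as an approximation of PDF's WinAnsiEncoding; the two are closely related but not identical, so treat the result as a hint rather than a guarantee.

```python
def fits_winansi(text: str) -> bool:
    """Approximate check: can every character be encoded as Windows-1252?

    Assumption: Python's cp1252 codec stands in for PDF WinAnsiEncoding.
    """
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


print(fits_winansi("Rechnung Nr. 42, Gebühr 15 €"))  # True
print(fits_winansi("Счёт"))                           # False
print(fits_winansi("請求書"))                          # False
```

Western European text with accents and the euro sign passes, while Cyrillic and CJK strings fail, which matches the limitation described above.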

Free-tier document over the cap

Blocked

Free allows 2 MB / 50 pages. Batch invoice runs hit this fast. Upgrade to Pro (50 MB / 500 pages) or split the document with PDF Split by Range before OCR.

First language load before first extraction

Expected

The ~10 MB training-data download happens once per language, ahead of recognition. Account for it on the first run of a new pipeline; later runs use the cache.

Run outside a browser

Passthrough

OCR needs a canvas. In a non-browser context the buffer is returned unchanged and no recognition occurs. Use the browser tool or the local runner.

Frequently asked questions

Should I apply OCR before or after other steps?

Before extraction, always. OCR is stage one for any scanned input — PDF Table to JSON, PDF to Excel, PDF to Plain Text, and similar tools all need the text layer OCR creates. Compression and conversion to flat images should come last, since lossy compression destroys the text layer.

What DPI gives the best extraction accuracy?

300 DPI is the practical minimum for small print; 400–600 DPI helps with dense tables and fine print. The tool itself renders at a fixed 2× and offers no DPI control, so the win comes from scanning the source document at higher resolution.

Does OCR output the extracted data directly?

No — OCR outputs a searchable PDF. The structured data comes from the extractor you run next: PDF Table to JSON for row objects, PDF to Excel for CSV, or PDF Form Field Extractor for live form fields.

Are there confidence scores I can threshold on?

No. Every recognised word is written into the text layer regardless of confidence — there is no per-word score exposed. Build validation downstream (sum checks, regex on IDs, date-range checks) and review flagged fields manually.

Why does the Form Field Extractor find nothing on my scanned form?

Because a scanned form is an image — the interactive AcroForm fields were flattened away. The extractor only reports live form fields. OCR the scan, then read the printed labels and values with PDF Table to JSON or PDF to Plain Text.

Can this replace AWS Textract or Google Document AI?

For ad-hoc, low-volume, privacy-sensitive extraction, often yes — it is free and runs in your browser. For high volume, per-field confidence, key-value zoning, or trained form models, a cloud document-AI service is the right tool. This is single-document interactive OCR, not a managed pipeline.

How do I keep financial documents private during extraction?

Use the in-browser tool (nothing is uploaded) or, for automation, the local runner — which processes documents on your own machine. Either way the scan and its data never reach JAD's servers; only an anonymous usage count is recorded when signed in.

Which language should I select for an invoice?

Select the language the invoice is written in: English (eng), French (fra), German (deu), Spanish (spa), Italian (ita), Portuguese (por), Dutch (nld), Russian (rus), Chinese Simplified (chi_sim), or Japanese (jpn). English is the default. Picking the right model improves recognition of accented text and locale-specific characters. Each language loads ~10 MB of data on first use, then caches.

How large a batch can I OCR per run?

OCR is one file per run, bounded by tier: Free 2 MB / 50 pages, Pro 50 MB / 500 pages, Pro+Media 500 MB / 2,000 pages, Developer 2 GB / 10,000 pages. For many documents, script the runner to process them sequentially.
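
Before queuing a file, a script can check it against these caps. The Python sketch below encodes the limits quoted above; it assumes decimal units (1 MB = 1,000,000 bytes, 2 GB = 2,000 MB) and inclusive limits, neither of which is stated here, and the function name is hypothetical.

```python
from typing import Optional

MB = 1_000_000  # assumption: decimal megabytes

# (max size in MB, max pages) per tier, smallest first, from the answer above.
TIER_CAPS = {
    "Free": (2, 50),
    "Pro": (50, 500),
    "Pro+Media": (500, 2_000),
    "Developer": (2_000, 10_000),  # 2 GB, assuming 1 GB = 1,000 MB
}


def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier whose caps fit the file, or None if none does."""
    for name, (max_mb, max_pages) in TIER_CAPS.items():
        if size_bytes <= max_mb * MB and pages <= max_pages:
            return name
    return None


print(smallest_tier(1_500_000, 12))   # Free
print(smallest_tier(1_500_000, 120))  # Pro
```

A result of None means the file exceeds every listed cap and needs splitting before OCR.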

Will lossy compression help my OCR'd invoices?

Only if you no longer need the text. Aggressive PDF Compression re-rasterises every page to a JPEG and removes the searchable text layer — so compress after extraction, never before.

Can I chunk an OCR'd report for an LLM extraction step?

Yes. After OCR, PDF to Text Chunks produces overlapping, sentence-aware chunks with page ranges and token estimates — ready to feed an LLM that does the structured extraction in your pipeline.

How do I script the OCR-then-extract pipeline?

Fetch schemas from GET /api/v1/tools/pdf-ocr and GET /api/v1/tools/pdf-table-to-json, pair the @jadapps/runner once, then POST the scan to 127.0.0.1:9789/v1/tools/pdf-ocr/run with { "lang": "eng" } and feed the result to pdf-table-to-json/run. Both stages run locally on your machine.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.


