Scrape PDF Table Data to JSON — Free Online Tool

How to scrape table data from a pdf into json

Step 1
Open the tool and drop the source PDF — Load the document into the PDF Table to JSON tool. Extraction starts immediately in the browser — no configuration.
Step 2
Check whether you actually got rows — If the preview is empty, the PDF is scanned/image-only with no text layer. Run PDF OCR first, then re-extract. A non-empty preview means the text layer was read.
Step 3
Confirm the columns and download — Verify the keys match the table's real columns, then download <name>.json — a flat array covering every page.
Step 4
Strip repeated headers, footnotes, and source notes — Public-sector tables reprint headers per page and append footnote/source lines. Filter repeated header rows (r.col !== 'col') and drop trailing note rows where the data columns are empty.
Step 5
Cast and validate the strings — Coerce numeric and date columns (Number(), Date) and remove footnote markers stuck to values ("12.3*" → 12.3). Validate with a schema so a malformed row fails loudly instead of polluting your dataset.
Step 6
Load into your store or analysis tool — Insert the cleaned array into Postgres/MongoDB, read it with pd.read_json/pl.read_json, or query it directly with DuckDB.

What a scrape gets you vs. what you finish

The extraction does structure; you do cleaning and typing. This maps the common public-sector PDF quirks to the step that resolves each.

PDF quirk	How it extracts	Your cleanup
Digitally-generated table	Clean rows, keys from the first line	Light cast + validate
Scanned / image-only PDF	Empty array (no text layer)	Run PDF OCR first, then extract
Repeated page headers	Header line becomes a data row from page 2	Filter `r.col !== 'col'`
Footnote markers on values	Verbatim, e.g. `"12.3*"`	Strip markers before casting
Merged / spanning cells	Not detected; only the text present is placed	Forward-fill the spanned value across rows
Source / note lines	Near-empty trailing rows	Filter where data columns are empty

Tier limits before a scrape runs

Government statistical PDFs can be long; check page count against your tier before extracting.

Tier	Max file size	Max pages
Free	2 MB	50
Pro	50 MB	500
Pro + Media	500 MB	2,000
Developer	2 GB	10,000

Cookbook

A real government statistical table scraped to JSON, plus the load patterns for the usual destinations.

A statistical annex with footnote markers

Note the footnote marker stuck to a value ("4.1p" for provisional) and the source line that becomes a near-empty trailing row.

PDF:
Region        2025    2024
North         4.1p    3.9
South         5.2     5.0
Source: National Statistics, table 3a

Downloaded JSON:
[
  { "Region": "North",  "2025": "4.1p", "2024": "3.9" },
  { "Region": "South",  "2025": "5.2",  "2024": "5.0" },
  { "Region": "Source: National Statistics, table 3a", "2025": "", "2024": "" }
]

Clean: drop notes, strip markers, cast

Remove the source row, peel footnote letters/symbols off the figures, then convert to numbers.

const rows = JSON.parse(json);
const clean = rows
  .filter(r => r["2025"] !== "" && r.Region !== "Region")    // drop notes + repeated headers
  .map(r => ({
    region: r.Region,
    y2025: Number(String(r["2025"]).replace(/[^0-9.]/g, "")), // "4.1p" -> 4.1
    y2024: Number(String(r["2024"]).replace(/[^0-9.]/g, "")),
  }));

Forward-fill a merged category column

Where a category cell spans several rows, only the first row carries its text; the rest are blank. Carry the value down so every row is self-describing.

let last = "";
for (const r of clean) {
  if (r.category && r.category.trim()) last = r.category;
  else r.category = last;            // fill spanned cells
}

Load into pandas / Polars / DuckDB

Once it's a JSON array, every common analysis tool reads it directly.

# pandas
import pandas as pd
df = pd.read_json("clean.json")

# Polars
import polars as pl
df = pl.read_json("clean.json")

# DuckDB (no load step)
-- SELECT * FROM read_json_auto('clean.json');

Insert into Postgres with jsonb

For a quick landing table, stage the array as jsonb and expand columns in SQL.

CREATE TABLE staging (doc jsonb);
-- \copy staging(doc) FROM 'clean.json'  (or insert via your driver)

INSERT INTO regions (region, y2025, y2024)
SELECT e->>'region',
       (e->>'y2025')::numeric,
       (e->>'y2024')::numeric
FROM staging, jsonb_array_elements(doc) AS e;

Edge cases and what actually happens

Source PDF is a scan with no text layer

Empty array

Older government and archival PDFs are often scanned images. With no text layer, PDF.js finds nothing and the array is empty. Run PDF OCR first to add a searchable layer, then re-extract — and verify figures, since OCR errors on statistical data matter.

Repeated headers across many pages

By design

Long statistical tables reprint the column header on every page; from page 2 onward those become data rows. Filter where a key equals its own header text before loading.

Footnote markers attached to values

Strip needed

Public tables flag provisional/revised figures with letters or symbols stuck to the number ("4.1p", "12.3*"). They extract verbatim. Strip non-numeric characters before casting, and keep the marker in a separate field if it's meaningful to your analysis.

Merged / spanning category cells

Blank then fill

The tool doesn't detect cell spans — a category that visually spans several rows only has text in its first row, leaving the rest blank for that column. Forward-fill the value down in your code so every row is self-describing (see the cookbook).

Source and note lines at the table foot

Filter needed

'Source:' lines, footnotes, and 'p = provisional' legends append as near-empty trailing rows. Filter rows where the data columns are empty to remove them cleanly.

Two tables on one page

Merged

The engine groups a whole page as one region, so two side-by-side or stacked tables merge into one block keyed by the first table's header. Split such pages first with PDF Split, or split the array on a boundary row afterward.

Password-protected or access-restricted PDF

May fail

A permissions-locked PDF can block the text read. Remove the password with PDF Remove Password (you'll need it) or PDF Unlock first, then scrape. Respect the source's terms of use.

Document exceeds the free page limit

Blocked

A long statistical release can top the free 50-page limit and be blocked before extraction. Upgrade to Pro (500 pages), or extract just the table pages with PDF Extract Pages first.

Columns drift or wrap in dense tables

Misaligned

Very dense tables with tight columns can mis-assign cells by horizontal position, and long labels wrap to extra rows. Inspect the preview, merge wrapped rows, and remap drifted columns; for persistently messy sources, PDF to Excel gives a CSV you can realign in a spreadsheet.

Frequently asked questions

Does this work on PDFs from government data portals?

Yes, for any digitally-generated PDF — the text layer reads cleanly and you get a flat array of rows. The exception is scanned/image-only PDFs (common from older official sources): those have no text layer and return an empty array until you run PDF OCR first.

How do I know if my PDF is scanned?

If the extraction preview is empty (or you can't select text when you open the PDF in a viewer), it's an image-only scan. Run PDF OCR to add a text layer, then re-extract. A non-empty preview confirms the text layer was readable.

Can I scrape a password-protected PDF?

Remove the protection first: use PDF Remove Password if you have the password, or PDF Unlock for owner-restricted files. A permissions lock can otherwise block the text read. Always respect the source's terms of use.

How are merged cells handled?

The tool doesn't detect cell spans. A category cell that visually spans several rows only has text in its first row; the rest are blank for that column. Forward-fill the value down in your code so each row stands alone — the cookbook shows a few lines that do exactly that.

What about footnote markers stuck to numbers?

They extract verbatim ("4.1p", "12.3*") because the tool never edits values. Strip the non-numeric characters before casting, and if the marker carries meaning (provisional, revised), capture it in a separate field during cleanup.

Is anything uploaded when I scrape?

No. Extraction runs entirely in your browser via PDF.js. Embargoed releases and access-restricted documents never leave your device; only anonymous usage counters are recorded when you're signed in.

What does the output look like, exactly?

A single flat JSON array of objects, keys taken from the first line of each page, every value a string. Rows from all pages are concatenated — there's no per-page nesting and no page/table index — so filter repeated headers and trailing notes during cleanup.

Can I load it straight into a database or DataFrame?

After a light clean (drop notes/headers, cast types), yes. pd.read_json / pl.read_json read it directly; DuckDB queries it with read_json_auto; and Postgres can stage it as jsonb and expand with jsonb_array_elements. The cookbook has snippets for each.

What if there are two tables on one page?

They merge into one block keyed by the first table's header, because the engine treats the page as a single region. Split the page first with PDF Split, or split the resulting array on a recognisable boundary row in post-processing.

Would PDF to Excel be better for messy tables?

Sometimes. PDF to Excel runs the same detection but outputs CSV, which is easier to realign by hand in a spreadsheet when columns drift. Use this JSON tool when a script or database is the consumer; use the Excel tool when you'll manually fix alignment first.

What are the size and page limits for a scrape?

Free: 2 MB and 50 pages. Pro: 50 MB / 500 pages. Pro + Media: 500 MB / 2,000 pages. Developer: 2 GB / 10,000 pages. For a long statistical release, upgrade or extract just the table pages first with PDF Extract Pages.

Is the scrape reproducible?

Yes. The same PDF produces the same array every time — the extraction is deterministic and string-typed, so there's no inference that could vary run to run. That makes it safe to pin in a reproducible data pipeline alongside a checksum of the source PDF.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.

How to scrape table data from a pdf into json

Step 1
Open the tool and drop the source PDF — Load the document into the PDF Table to JSON tool. Extraction starts immediately in the browser — no configuration.
Step 2
Check whether you actually got rows — If the preview is empty, the PDF is scanned/image-only with no text layer. Run PDF OCR first, then re-extract. A non-empty preview means the text layer was read.
Step 3
Confirm the columns and download — Verify the keys match the table's real columns, then download <name>.json — a flat array covering every page.
Step 4
Strip repeated headers, footnotes, and source notes — Public-sector tables reprint headers per page and append footnote/source lines. Filter repeated header rows (r.col !== 'col') and drop trailing note rows where the data columns are empty.
Step 5
Cast and validate the strings — Coerce numeric and date columns (Number(), Date) and remove footnote markers stuck to values ("12.3*" → 12.3). Validate with a schema so a malformed row fails loudly instead of polluting your dataset.
Step 6
Load into your store or analysis tool — Insert the cleaned array into Postgres/MongoDB, read it with pd.read_json/pl.read_json, or query it directly with DuckDB.

What a scrape gets you vs. what you finish

The extraction does structure; you do cleaning and typing. This maps the common public-sector PDF quirks to the step that resolves each.

PDF quirk	How it extracts	Your cleanup
Digitally-generated table	Clean rows, keys from the first line	Light cast + validate
Scanned / image-only PDF	Empty array (no text layer)	Run PDF OCR first, then extract
Repeated page headers	Header line becomes a data row from page 2	Filter `r.col !== 'col'`
Footnote markers on values	Verbatim, e.g. `"12.3*"`	Strip markers before casting
Merged / spanning cells	Not detected; only the text present is placed	Forward-fill the spanned value across rows
Source / note lines	Near-empty trailing rows	Filter where data columns are empty

Tier limits before a scrape runs

Government statistical PDFs can be long; check page count against your tier before extracting.

Tier	Max file size	Max pages
Free	2 MB	50
Pro	50 MB	500
Pro + Media	500 MB	2,000
Developer	2 GB	10,000

Cookbook

A real government statistical table scraped to JSON, plus the load patterns for the usual destinations.

A statistical annex with footnote markers

Note the footnote marker stuck to a value ("4.1p" for provisional) and the source line that becomes a near-empty trailing row.

PDF:
Region        2025    2024
North         4.1p    3.9
South         5.2     5.0
Source: National Statistics, table 3a

Downloaded JSON:
[
  { "Region": "North",  "2025": "4.1p", "2024": "3.9" },
  { "Region": "South",  "2025": "5.2",  "2024": "5.0" },
  { "Region": "Source: National Statistics, table 3a", "2025": "", "2024": "" }
]

Clean: drop notes, strip markers, cast

Remove the source row, peel footnote letters/symbols off the figures, then convert to numbers.

const rows = JSON.parse(json);
const clean = rows
  .filter(r => r["2025"] !== "" && r.Region !== "Region")    // drop notes + repeated headers
  .map(r => ({
    region: r.Region,
    y2025: Number(String(r["2025"]).replace(/[^0-9.]/g, "")), // "4.1p" -> 4.1
    y2024: Number(String(r["2024"]).replace(/[^0-9.]/g, "")),
  }));

Forward-fill a merged category column

Where a category cell spans several rows, only the first row carries its text; the rest are blank. Carry the value down so every row is self-describing.

let last = "";
for (const r of clean) {
  if (r.category && r.category.trim()) last = r.category;
  else r.category = last;            // fill spanned cells
}

Load into pandas / Polars / DuckDB

Once it's a JSON array, every common analysis tool reads it directly.

# pandas
import pandas as pd
df = pd.read_json("clean.json")

# Polars
import polars as pl
df = pl.read_json("clean.json")

# DuckDB (no load step)
-- SELECT * FROM read_json_auto('clean.json');

Insert into Postgres with jsonb

For a quick landing table, stage the array as jsonb and expand columns in SQL.

CREATE TABLE staging (doc jsonb);
-- \copy staging(doc) FROM 'clean.json'  (or insert via your driver)

INSERT INTO regions (region, y2025, y2024)
SELECT e->>'region',
       (e->>'y2025')::numeric,
       (e->>'y2024')::numeric
FROM staging, jsonb_array_elements(doc) AS e;

Edge cases and what actually happens

Source PDF is a scan with no text layer

Empty array

Repeated headers across many pages

By design

Long statistical tables reprint the column header on every page; from page 2 onward those become data rows. Filter where a key equals its own header text before loading.

Footnote markers attached to values

Strip needed

Merged / spanning category cells

Blank then fill

Source and note lines at the table foot

Filter needed

'Source:' lines, footnotes, and 'p = provisional' legends append as near-empty trailing rows. Filter rows where the data columns are empty to remove them cleanly.

Two tables on one page

Merged

Password-protected or access-restricted PDF

May fail

A permissions-locked PDF can block the text read. Remove the password with PDF Remove Password (you'll need it) or PDF Unlock first, then scrape. Respect the source's terms of use.

Document exceeds the free page limit

Blocked

A long statistical release can top the free 50-page limit and be blocked before extraction. Upgrade to Pro (500 pages), or extract just the table pages with PDF Extract Pages first.

Columns drift or wrap in dense tables

Misaligned

Frequently asked questions

Does this work on PDFs from government data portals?

How do I know if my PDF is scanned?

Can I scrape a password-protected PDF?

How are merged cells handled?

What about footnote markers stuck to numbers?

Is anything uploaded when I scrape?

No. Extraction runs entirely in your browser via PDF.js. Embargoed releases and access-restricted documents never leave your device; only anonymous usage counters are recorded when you're signed in.

What does the output look like, exactly?

Can I load it straight into a database or DataFrame?

What if there are two tables on one page?

Would PDF to Excel be better for messy tables?

What are the size and page limits for a scrape?

Is the scrape reproducible?

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.

Scrape Table Data from a PDF into JSON

How to scrape table data from a pdf into json

What a scrape gets you vs. what you finish

Tier limits before a scrape runs

Cookbook

A statistical annex with footnote markers

Clean: drop notes, strip markers, cast

Forward-fill a merged category column

Load into pandas / Polars / DuckDB

Insert into Postgres with jsonb

Edge cases and what actually happens

Source PDF is a scan with no text layer

Repeated headers across many pages

Footnote markers attached to values

Merged / spanning category cells

Source and note lines at the table foot

Two tables on one page

Password-protected or access-restricted PDF

Document exceeds the free page limit

Columns drift or wrap in dense tables

Frequently asked questions

Does this work on PDFs from government data portals?

How do I know if my PDF is scanned?

Can I scrape a password-protected PDF?

How are merged cells handled?

What about footnote markers stuck to numbers?

Is anything uploaded when I scrape?

What does the output look like, exactly?

Can I load it straight into a database or DataFrame?

What if there are two tables on one page?

Would PDF to Excel be better for messy tables?

What are the size and page limits for a scrape?

Is the scrape reproducible?

Privacy first

Related guides

Scrape Table Data from a PDF into JSON

How to scrape table data from a pdf into json

What a scrape gets you vs. what you finish

Tier limits before a scrape runs

Cookbook

A statistical annex with footnote markers

Clean: drop notes, strip markers, cast

Forward-fill a merged category column

Load into pandas / Polars / DuckDB

Insert into Postgres with jsonb

Edge cases and what actually happens

Source PDF is a scan with no text layer

Repeated headers across many pages

Footnote markers attached to values

Merged / spanning category cells

Source and note lines at the table foot

Two tables on one page

Password-protected or access-restricted PDF

Document exceeds the free page limit

Columns drift or wrap in dense tables

Frequently asked questions

Does this work on PDFs from government data portals?

How do I know if my PDF is scanned?

Can I scrape a password-protected PDF?

How are merged cells handled?

What about footnote markers stuck to numbers?

Is anything uploaded when I scrape?

What does the output look like, exactly?

Can I load it straight into a database or DataFrame?

What if there are two tables on one page?

Would PDF to Excel be better for messy tables?

What are the size and page limits for a scrape?

Is the scrape reproducible?

Privacy first

Related guides