How to scrape table data from a pdf into json
- Step 1Open the tool and drop the source PDF — Load the document into the PDF Table to JSON tool. Extraction starts immediately in the browser — no configuration.
- Step 2Check whether you actually got rows — If the preview is empty, the PDF is scanned/image-only with no text layer. Run PDF OCR first, then re-extract. A non-empty preview means the text layer was read.
- Step 3Confirm the columns and download — Verify the keys match the table's real columns, then download
<name>.json— a flat array covering every page. - Step 4Strip repeated headers, footnotes, and source notes — Public-sector tables reprint headers per page and append footnote/source lines. Filter repeated header rows (
r.col !== 'col') and drop trailing note rows where the data columns are empty. - Step 5Cast and validate the strings — Coerce numeric and date columns (
Number(),Date) and remove footnote markers stuck to values ("12.3*"→12.3). Validate with a schema so a malformed row fails loudly instead of polluting your dataset. - Step 6Load into your store or analysis tool — Insert the cleaned array into Postgres/MongoDB, read it with
pd.read_json/pl.read_json, or query it directly with DuckDB.
What a scrape gets you vs. what you finish
The extraction does structure; you do cleaning and typing. This maps the common public-sector PDF quirks to the step that resolves each.
| PDF quirk | How it extracts | Your cleanup |
|---|---|---|
| Digitally-generated table | Clean rows, keys from the first line | Light cast + validate |
| Scanned / image-only PDF | Empty array (no text layer) | Run PDF OCR first, then extract |
| Repeated page headers | Header line becomes a data row from page 2 | Filter r.col !== 'col' |
| Footnote markers on values | Verbatim, e.g. "12.3*" | Strip markers before casting |
| Merged / spanning cells | Not detected; only the text present is placed | Forward-fill the spanned value across rows |
| Source / note lines | Near-empty trailing rows | Filter where data columns are empty |
Tier limits before a scrape runs
Government statistical PDFs can be long; check page count against your tier before extracting.
| Tier | Max file size | Max pages |
|---|---|---|
| Free | 2 MB | 50 |
| Pro | 50 MB | 500 |
| Pro + Media | 500 MB | 2,000 |
| Developer | 2 GB | 10,000 |
Cookbook
A real government statistical table scraped to JSON, plus the load patterns for the usual destinations.
A statistical annex with footnote markers
Note the footnote marker stuck to a value ("4.1p" for provisional) and the source line that becomes a near-empty trailing row.
PDF:
Region 2025 2024
North 4.1p 3.9
South 5.2 5.0
Source: National Statistics, table 3a
Downloaded JSON:
[
{ "Region": "North", "2025": "4.1p", "2024": "3.9" },
{ "Region": "South", "2025": "5.2", "2024": "5.0" },
{ "Region": "Source: National Statistics, table 3a", "2025": "", "2024": "" }
]Clean: drop notes, strip markers, cast
Remove the source row, peel footnote letters/symbols off the figures, then convert to numbers.
const rows = JSON.parse(json);
const clean = rows
.filter(r => r["2025"] !== "" && r.Region !== "Region") // drop notes + repeated headers
.map(r => ({
region: r.Region,
y2025: Number(String(r["2025"]).replace(/[^0-9.]/g, "")), // "4.1p" -> 4.1
y2024: Number(String(r["2024"]).replace(/[^0-9.]/g, "")),
}));Forward-fill a merged category column
Where a category cell spans several rows, only the first row carries its text; the rest are blank. Carry the value down so every row is self-describing.
let last = "";
for (const r of clean) {
if (r.category && r.category.trim()) last = r.category;
else r.category = last; // fill spanned cells
}Load into pandas / Polars / DuckDB
Once it's a JSON array, every common analysis tool reads it directly.
# pandas
import pandas as pd
df = pd.read_json("clean.json")
# Polars
import polars as pl
df = pl.read_json("clean.json")
# DuckDB (no load step)
-- SELECT * FROM read_json_auto('clean.json');Insert into Postgres with jsonb
For a quick landing table, stage the array as jsonb and expand columns in SQL.
CREATE TABLE staging (doc jsonb);
-- \copy staging(doc) FROM 'clean.json' (or insert via your driver)
INSERT INTO regions (region, y2025, y2024)
SELECT e->>'region',
(e->>'y2025')::numeric,
(e->>'y2024')::numeric
FROM staging, jsonb_array_elements(doc) AS e;Edge cases and what actually happens
Source PDF is a scan with no text layer
Empty arrayOlder government and archival PDFs are often scanned images. With no text layer, PDF.js finds nothing and the array is empty. Run PDF OCR first to add a searchable layer, then re-extract — and verify figures, since OCR errors on statistical data matter.
Repeated headers across many pages
By designLong statistical tables reprint the column header on every page; from page 2 onward those become data rows. Filter where a key equals its own header text before loading.
Footnote markers attached to values
Strip neededPublic tables flag provisional/revised figures with letters or symbols stuck to the number ("4.1p", "12.3*"). They extract verbatim. Strip non-numeric characters before casting, and keep the marker in a separate field if it's meaningful to your analysis.
Merged / spanning category cells
Blank then fillThe tool doesn't detect cell spans — a category that visually spans several rows only has text in its first row, leaving the rest blank for that column. Forward-fill the value down in your code so every row is self-describing (see the cookbook).
Source and note lines at the table foot
Filter needed'Source:' lines, footnotes, and 'p = provisional' legends append as near-empty trailing rows. Filter rows where the data columns are empty to remove them cleanly.
Two tables on one page
MergedThe engine groups a whole page as one region, so two side-by-side or stacked tables merge into one block keyed by the first table's header. Split such pages first with PDF Split, or split the array on a boundary row afterward.
Password-protected or access-restricted PDF
May failA permissions-locked PDF can block the text read. Remove the password with PDF Remove Password (you'll need it) or PDF Unlock first, then scrape. Respect the source's terms of use.
Document exceeds the free page limit
BlockedA long statistical release can top the free 50-page limit and be blocked before extraction. Upgrade to Pro (500 pages), or extract just the table pages with PDF Extract Pages first.
Columns drift or wrap in dense tables
MisalignedVery dense tables with tight columns can mis-assign cells by horizontal position, and long labels wrap to extra rows. Inspect the preview, merge wrapped rows, and remap drifted columns; for persistently messy sources, PDF to Excel gives a CSV you can realign in a spreadsheet.
Frequently asked questions
Does this work on PDFs from government data portals?
Yes, for any digitally-generated PDF — the text layer reads cleanly and you get a flat array of rows. The exception is scanned/image-only PDFs (common from older official sources): those have no text layer and return an empty array until you run PDF OCR first.
How do I know if my PDF is scanned?
If the extraction preview is empty (or you can't select text when you open the PDF in a viewer), it's an image-only scan. Run PDF OCR to add a text layer, then re-extract. A non-empty preview confirms the text layer was readable.
Can I scrape a password-protected PDF?
Remove the protection first: use PDF Remove Password if you have the password, or PDF Unlock for owner-restricted files. A permissions lock can otherwise block the text read. Always respect the source's terms of use.
How are merged cells handled?
The tool doesn't detect cell spans. A category cell that visually spans several rows only has text in its first row; the rest are blank for that column. Forward-fill the value down in your code so each row stands alone — the cookbook shows a few lines that do exactly that.
What about footnote markers stuck to numbers?
They extract verbatim ("4.1p", "12.3*") because the tool never edits values. Strip the non-numeric characters before casting, and if the marker carries meaning (provisional, revised), capture it in a separate field during cleanup.
Is anything uploaded when I scrape?
No. Extraction runs entirely in your browser via PDF.js. Embargoed releases and access-restricted documents never leave your device; only anonymous usage counters are recorded when you're signed in.
What does the output look like, exactly?
A single flat JSON array of objects, keys taken from the first line of each page, every value a string. Rows from all pages are concatenated — there's no per-page nesting and no page/table index — so filter repeated headers and trailing notes during cleanup.
Can I load it straight into a database or DataFrame?
After a light clean (drop notes/headers, cast types), yes. pd.read_json / pl.read_json read it directly; DuckDB queries it with read_json_auto; and Postgres can stage it as jsonb and expand with jsonb_array_elements. The cookbook has snippets for each.
What if there are two tables on one page?
They merge into one block keyed by the first table's header, because the engine treats the page as a single region. Split the page first with PDF Split, or split the resulting array on a recognisable boundary row in post-processing.
Would PDF to Excel be better for messy tables?
Sometimes. PDF to Excel runs the same detection but outputs CSV, which is easier to realign by hand in a spreadsheet when columns drift. Use this JSON tool when a script or database is the consumer; use the Excel tool when you'll manually fix alignment first.
What are the size and page limits for a scrape?
Free: 2 MB and 50 pages. Pro: 50 MB / 500 pages. Pro + Media: 500 MB / 2,000 pages. Developer: 2 GB / 10,000 pages. For a long statistical release, upgrade or extract just the table pages first with PDF Extract Pages.
Is the scrape reproducible?
Yes. The same PDF produces the same array every time — the extraction is deterministic and string-typed, so there's no inference that could vary run to run. That makes it safe to pin in a reproducible data pipeline alongside a checksum of the source PDF.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.