How to extract a PDF table to JSON for API or application use
- Step 1: Open the tool and drop your PDF. Load your document into the PDF Table to JSON tool. There are no options to set: extraction begins the moment the file lands, all locally in the browser.
- Step 2: Let it read the text layer and build rows. PDF.js pulls the text items; the engine groups them by vertical position into rows and sorts each row's cells by horizontal position. Pages with only one line of text are skipped (a single line can't form a header + data table).
- Step 3: Check the key names in the preview. The preview renders the first 20 objects with their keys. Confirm the keys are the real column headers and not a stray title line. If the first detected line isn't your header row, see the edge cases below.
- Step 4: Download the JSON array. Save the full array as <name>.json. It is a top-level [ ... ] of objects, ready to read with JSON.parse or pipe to jq.
- Step 5: Cast strings to real types in your code. Every value is a string. Before you POST or insert, coerce: Number(row.price), row.active === 'true', new Date(row.date). Validate with a schema (Zod / Pydantic) so a stray header row or blank cell fails loudly instead of corrupting the API call.
- Step 6: POST or insert the array. Send the array to your endpoint or loop it into a batch insert. Because rows from every page are in one array, you usually don't need to merge anything, but do filter out repeated header rows if your PDF prints headers on each page.
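The row-building described in Step 2 can be pictured with a short sketch. This is not the tool's actual code: the item shape ({ str, x, y }), the yTolerance parameter, and the name groupIntoRows are assumptions made for illustration. In PDF.js, the x and y of a text item would typically be read from its transform array.

```javascript
// Hypothetical sketch: group positioned text items into rows, then order cells left to right.
// yTolerance (in PDF units) is an assumed parameter, not a documented setting of the tool.
function groupIntoRows(items, yTolerance = 2) {
  // PDF y coordinates grow upward, so sort descending to read top to bottom.
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(last.y - item.y) <= yTolerance) {
      last.cells.push(item); // same visual line
    } else {
      rows.push({ y: item.y, cells: [item] }); // start a new line
    }
  }
  // Within each row, sort cells by horizontal position and keep only the text.
  return rows.map(row =>
    row.cells.sort((a, b) => a.x - b.x).map(cell => cell.str)
  );
}

const items = [
  { str: "Price", x: 200, y: 700 },
  { str: "SKU", x: 50, y: 700.5 },
  { str: "12.50", x: 200, y: 680 },
  { str: "A-100", x: 50, y: 680 },
];
const lines = groupIntoRows(items);
console.log(lines);
```

A tolerance is used here because text items on one printed line rarely share an identical y value; the real engine may group differently.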
What the tool produces vs. what it doesn't
Grounded in the actual extraction code (PDF.js text grouping → first-line-as-keys → flat array of string-valued objects). Plan your integration around the Behaviour column; handle the "Your integration's job" column in your own code.
| Capability | Behaviour | Your integration's job |
|---|---|---|
| Row objects | One object per detected line after the header line, across all pages, in a single flat array | Iterate directly; no un-nesting needed |
| Keys | Taken from the first detected line of each page; an empty cell becomes Column 1, Column 2, … by position | Map/rename keys to your API's field names if they differ |
| Value types | Always strings — no numbers, booleans, dates, or null | Cast and validate (Number(), Date, Zod/Pydantic) before sending |
| Multiple tables / pages | Concatenated into one array; no per-page or per-table nesting and no page/table index field | Filter repeated header rows; split by a marker column if you need per-table grouping |
| Empty trailing cells | Missing cells in a short row are filled with "" up to the header width | Treat "" as missing/optional in your schema |
| Scanned / image-only PDF | No text layer → no rows extracted (empty array) | Run PDF OCR first to add a text layer |
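The key and padding behaviour in the table above can be sketched roughly as follows. The function name linesToObjects and its input shape (an array of string arrays, one per detected line) are hypothetical, and the handling of cells beyond the header width is an assumption, since the table does not describe it.

```javascript
// Hypothetical sketch of first-line-as-keys: the first line supplies keys,
// every later line becomes one object whose values are all strings.
function linesToObjects(lines) {
  if (lines.length < 2) return []; // a single line can't form header + data
  const [header, ...data] = lines;
  // Empty header cells are named by position: "Column 1", "Column 2", ...
  const keys = header.map((cell, i) =>
    cell.trim() === "" ? `Column ${i + 1}` : cell.trim()
  );
  return data.map(cells => {
    const obj = {};
    keys.forEach((key, i) => {
      obj[key] = cells[i] ?? ""; // short rows are padded with "" up to the header width
    });
    return obj; // assumption: cells beyond the header width are ignored
  });
}

const objects = linesToObjects([
  ["SKU", "", "Price"],
  ["A-100", "Widget"],
]);
console.log(objects);
```

Note that the missing Price cell comes out as "", which is why the schema on the consuming side should treat "" as missing or optional.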
Tier limits that apply before extraction
The runner checks file size and page count before processing. Numbers are from the PDF family tier limits; extraction itself happens in-browser regardless of tier.
| Tier | Max file size | Max pages | Files per job |
|---|---|---|---|
| Free | 2 MB | 50 | 1 |
| Pro | 50 MB | 500 | 5 |
| Pro + Media | 500 MB | 2,000 | 50 |
| Developer | 2 GB | 10,000 | unlimited |
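If a script needs to know in advance whether a file will be blocked, the table can be mirrored in a small pre-flight check. This is only a sketch: the names TIER_LIMITS and fitsTier are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption, since the source does not say which definition the runner uses.

```javascript
// Limits copied from the tier table; names are illustrative, not the runner's API.
const MB = 1024 * 1024; // assumption: binary megabytes
const TIER_LIMITS = {
  free: { maxBytes: 2 * MB, maxPages: 50 },
  pro: { maxBytes: 50 * MB, maxPages: 500 },
  proMedia: { maxBytes: 500 * MB, maxPages: 2000 },
  developer: { maxBytes: 2048 * MB, maxPages: 10000 },
};

// Returns true when both the file size and the page count are within the tier's limits.
function fitsTier(tier, fileBytes, pageCount) {
  const limits = TIER_LIMITS[tier];
  if (!limits) throw new Error(`Unknown tier: ${tier}`);
  return fileBytes <= limits.maxBytes && pageCount <= limits.maxPages;
}

console.log(fitsTier("free", 3 * MB, 10)); // a 3 MB file exceeds the free size limit
console.log(fitsTier("pro", 3 * MB, 10));
```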
Casting the string output to a typed payload
Because every value is a string, your code owns type conversion. A small, explicit cast layer keeps the API contract honest.
| Target type | Example string in | Cast you write |
|---|---|---|
| Integer / float | "1299.00" | Number(row.price) — guard NaN |
| Boolean | "true" / "Yes" | /^(true\|yes\|y)$/i.test(row.active) |
| Date (ISO) | "2026-06-01" | new Date(row.date).toISOString() |
| Null / optional | "" | row.note \|\| null |
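The casts in the table can be collected into a few small helpers. The sketch below is one possible arrangement; the helper names and the sample row are hypothetical. It rejects blank cells for numbers explicitly, because Number("") evaluates to 0 rather than NaN.

```javascript
// Illustrative cast helpers; the names are hypothetical, not part of the tool.
function toNumber(value) {
  const n = Number(value);
  // Number("") is 0, so reject blank cells explicitly as well as NaN.
  if (value.trim() === "" || Number.isNaN(n)) {
    throw new TypeError(`Not a number: "${value}"`);
  }
  return n;
}

function toBoolean(value) {
  return /^(true|yes|y)$/i.test(value.trim());
}

function toIsoDate(value) {
  const d = new Date(value);
  // An unparseable string yields an invalid Date; fail before toISOString() does.
  if (Number.isNaN(d.getTime())) {
    throw new TypeError(`Not a date: "${value}"`);
  }
  return d.toISOString();
}

function toNullable(value) {
  return value === "" ? null : value;
}

const row = { price: "1299.00", active: "Yes", date: "2026-06-01", note: "" };
const typed = {
  price: toNumber(row.price),
  active: toBoolean(row.active),
  date: toIsoDate(row.date),
  note: toNullable(row.note),
};
console.log(typed);
```

Throwing on bad input, rather than returning NaN or an invalid date, keeps a stray header row from reaching the API unnoticed.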
Cookbook
A real API-integration extraction, start to finish, plus the post-processing patterns that turn the string array into a clean request body.
A clean two-page report becomes one flat array
The most common happy path: a PDF with a clear header row. Note the keys come from the first line, every value is a string, and page 2's rows are appended to the same array — there is no nesting per page.
PDF (page 1, then page 2 with the same columns):
SKU Name Price InStock
A-100 Widget 12.50 42
A-101 Gadget 9.99 0
--- page 2 ---
SKU Name Price InStock
A-102 Sprocket 4.25 inf
Downloaded JSON (one flat array):
[
{ "SKU": "A-100", "Name": "Widget", "Price": "12.50", "InStock": "42" },
{ "SKU": "A-101", "Name": "Gadget", "Price": "9.99", "InStock": "0" },
{ "SKU": "SKU", "Name": "Name", "Price": "Price", "InStock": "InStock" }, <- page-2 header row
{ "SKU": "A-102", "Name": "Sprocket", "Price": "4.25", "InStock": "inf" }
]
Drop the repeated header rows before POSTing
From page 2 onward, each page's first line is emitted as data, so repeated print headers leak in as bogus rows (see the third row above). Filter out any row where a key equals its own header text.
// json holds the text of the downloaded <name>.json file
const rows = JSON.parse(json);
const clean = rows.filter(r => r.SKU !== "SKU"); // drop repeated headers
await fetch("https://api.example.com/products", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(clean),
});
Cast strings to a typed payload with Zod
The tool never guesses types. Put a schema at the boundary so a missing cell or a stray row fails validation instead of silently posting "Price": "Price".
import { z } from "zod";
// One schema per row; z.coerce turns the string cells into numbers.
const Product = z.object({
  SKU: z.string().regex(/^[A-Z]-\d+$/),
  Name: z.string().min(1),
  Price: z.coerce.number().positive(),
  InStock: z.coerce.number().int().nonnegative(),
});
// parse throws a ZodError if any row fails, e.g. a non-numeric InStock such as "inf"
const payload = z.array(Product).parse(clean);
// on success, payload[0].Price is a real number
Reshape keys to match your API contract
PDF headers rarely match your endpoint's field names. Map once, explicitly, so the rest of the pipeline is stable even if the PDF's wording changes.
const body = clean.map(r => ({
  sku: r.SKU,
  title: r.Name,
  priceCents: Math.round(Number(r.Price) * 100),
  stock: Number(r.InStock),
}));
// body is the exact JSON your /products endpoint expects
Quick command-line check with jq
Before wiring anything, sanity-check the array shape and unique key set from the terminal.
# how many rows?
jq 'length' report.json
# what keys does the first object have?
jq '.[0] | keys' report.json
# drop repeated header rows and count again
jq '[.[] | select(.SKU != "SKU")] | length' report.json
Edge cases and what actually happens
First line of the page is a title, not the header row
Wrong keys: The engine uses the first detected line of each page as the keys. If a report prints a title or date line above the actual column headers, that title line becomes your keys and the real header row becomes the first data object. Crop the title region out first, or post-process by promoting the first data object's values to keys in your script.
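The post-processing route can be sketched as below. It assumes the title line happened to produce as many cells as the table has columns, so that no data was lost during extraction; when that does not hold, cropping the title is the safer fix. The function name promoteFirstRowToKeys and the sample keys are hypothetical.

```javascript
// Hypothetical sketch: re-key every row using the values of the first data object,
// which holds the real column headers when a title line was taken as the keys.
function promoteFirstRowToKeys(rows) {
  if (rows.length === 0) return [];
  const [headerRow, ...data] = rows;
  const oldKeys = Object.keys(headerRow);
  // Fall back to a positional name when a header cell is blank.
  const newKeys = oldKeys.map((key, i) => headerRow[key].trim() || `Column ${i + 1}`);
  return data.map(row =>
    Object.fromEntries(oldKeys.map((key, i) => [newKeys[i], row[key] ?? ""]))
  );
}

// Keys came from a title line ("Stock Report Q2"); the real headers are in the first object.
const extracted = [
  { Stock: "SKU", Report: "Name", Q2: "Price" },
  { Stock: "A-100", Report: "Widget", Q2: "12.50" },
];
const rekeyed = promoteFirstRowToKeys(extracted);
console.log(rekeyed);
```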
Repeated header rows on multi-page tables
By design: Only the very first detected line of the document anchors the keys; from page 2 onward the printed header line is emitted as a data row. This is expected given the page-by-page grouping. Filter out rows where a key equals its own header text (keep only rows where r.SKU !== 'SKU') before sending to your API.
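When the column names are not known in advance, the same filter can be written without naming a column. The sketch below drops any row in which every value equals its own key; the names isHeaderEcho and dropHeaderEchoes are hypothetical. It would not catch a repeated header whose cell was blank and received a positional name such as Column 2.

```javascript
// A row is a repeated header when every value is just its own key.
const isHeaderEcho = row =>
  Object.entries(row).every(([key, value]) => value.trim() === key);

const dropHeaderEchoes = rows => rows.filter(row => !isHeaderEcho(row));

const sample = [
  { SKU: "A-100", Name: "Widget", Price: "12.50", InStock: "42" },
  { SKU: "SKU", Name: "Name", Price: "Price", InStock: "InStock" }, // page-2 header
  { SKU: "A-102", Name: "Sprocket", Price: "4.25", InStock: "7" },
];
const withoutHeaders = dropHeaderEchoes(sample);
console.log(withoutHeaders.length);
```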
Scanned or image-only PDF
Empty array: There's no text layer to read, so PDF.js returns no text items and the output array is empty. Run PDF OCR first to add a searchable text layer, then extract. A blank result almost always means the page is an image, not a bug.
Free-tier file over 2 MB or PDF over 50 pages
Blocked: The runner checks size and page count before extracting. On the free tier a file larger than 2 MB or a PDF with more than 50 pages is blocked with an upgrade prompt. Pro raises this to 50 MB / 500 pages. Split a large PDF with PDF Split or extract just the pages you need with PDF Extract Pages first.
Values you expected as numbers arrive as strings
Expected: There is no type inference: "12.50" stays a string, and (123) stays "(123)". This is intentional so the tool never mangles an ID, a code, or a leading zero. Cast in your own code with Number(), a regex, or a schema's z.coerce.
A cell wraps onto two visual lines
Split row: Rows are defined by vertical position, so a long description that wraps to a second line is read as a separate row with mostly empty cells. Merge wrapped continuation rows in post-processing (a row whose key column is empty usually belongs to the row above).
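A minimal merge pass following that heuristic might look like this. The function name mergeContinuationRows and the sample data are hypothetical, and the sketch assumes the wrapped text ends up under the column it continues; check a few rows of your own output before relying on it.

```javascript
// Hypothetical sketch: fold a row whose key column is empty into the row above it.
function mergeContinuationRows(rows, keyColumn) {
  const merged = [];
  for (const row of rows) {
    const previous = merged[merged.length - 1];
    if (previous && row[keyColumn] === "") {
      // Append each non-empty cell to the same column of the previous row.
      for (const [key, value] of Object.entries(row)) {
        if (value !== "") previous[key] = `${previous[key] ?? ""} ${value}`.trim();
      }
    } else {
      merged.push({ ...row }); // copy so the input array is left untouched
    }
  }
  return merged;
}

const wrapped = [
  { SKU: "A-100", Description: "Heavy-duty widget with", Price: "12.50" },
  { SKU: "", Description: "reinforced housing", Price: "" },
  { SKU: "A-101", Description: "Gadget", Price: "9.99" },
];
const unwrapped = mergeContinuationRows(wrapped, "SKU");
console.log(unwrapped);
```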
Columns don't line up between pages
Misaligned keys: If a later page has a different column order or count, its rows are still appended to the same flat array but keyed by that page's own first line. Mixed schemas in one array can break a strict validator. Split such documents and extract per section, or normalise keys after extraction.
Encrypted PDF with a permissions password
May fail: Page counting tolerates light encryption, but the text-layer read can fail on a password-protected file. Remove the password with PDF Remove Password (you'll need the password) or PDF Unlock first, then extract.
Two tables stacked on one page
Merged: The engine treats a whole page as one table region and groups every line on the page. Two distinct tables on the same page merge into one block of rows with one set of keys (the first table's header). Separate the tables onto different pages, or split the array on a recognisable boundary row afterward.
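Splitting afterward could be done with a helper like the one below. The name splitOnBoundary, the predicate, and the sample rows are hypothetical; here the boundary is the second table's own header line, which appears as a data row keyed by the first table's headers.

```javascript
// Hypothetical sketch: cut one flat array into several at each boundary row.
// The boundary row itself is dropped from the output.
function splitOnBoundary(rows, isBoundary) {
  const tables = [[]];
  for (const row of rows) {
    if (isBoundary(row)) {
      tables.push([]); // start collecting the next table
    } else {
      tables[tables.length - 1].push(row);
    }
  }
  return tables.filter(table => table.length > 0);
}

const stacked = [
  { SKU: "A-100", Name: "Widget" },
  { SKU: "A-101", Name: "Gadget" },
  { SKU: "Region", Name: "Manager" }, // second table's header, emitted as data
  { SKU: "North", Name: "Ada" },
];
const tables = splitOnBoundary(stacked, row => row.SKU === "Region");
console.log(tables.length);
```

The rows of the second table still carry the first table's keys, so a re-keying step is usually needed after the split.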
Frequently asked questions
What exactly does the JSON look like?
A single flat array of objects, e.g. [{ "SKU": "A-100", "Price": "12.50" }, …]. Keys come from the first detected line of each page; every value is a string. There is no wrapper object, no per-page nesting, and no metadata — just the array, ready for JSON.parse or a POST body.
Are the values typed (numbers, booleans, dates)?
No. Every value is a string, including things that look numeric. This is deliberate so the tool never corrupts an ID, a leading zero, or a currency-formatted figure. Cast in your own code — Number(row.price), new Date(row.date) — and ideally validate with Zod or Pydantic at the API boundary.
How does it handle a PDF with multiple tables or multiple pages?
All rows from all pages are concatenated into one flat array. There is no separate array per table and no page/table index field. The practical consequence: on multi-page tables the repeated header line becomes a data row from page 2 onward, so filter those out (r.SKU !== 'SKU') before using the data.
Can I post the output straight to an API?
Almost — the array shape is already what most endpoints want. Two near-universal pre-steps: drop any repeated header rows, and cast string values to the types your API expects. After that, JSON.stringify(rows) is a valid request body.
Where do the JSON keys come from?
From the first detected line of each page. If a cell in that line is empty, it's named by position (Column 1, Column 2, …). If your PDF prints a title above the real headers, that title becomes the keys instead — crop it out or remap in code.
Is the file uploaded anywhere?
No. Extraction runs entirely in your browser using PDF.js. The document, its contents, and any API data inside it never leave your device; only anonymous usage counters are recorded when you're signed in.
What are the size and page limits?
Free handles up to 2 MB and 50 pages per file; Pro raises that to 50 MB and 500 pages, Pro + Media to 500 MB / 2,000 pages, and Developer to 2 GB / 10,000 pages. Oversized files are blocked before extraction with an upgrade prompt.
It returned an empty array — why?
The PDF has no text layer (it's scanned or image-only), or every page had only a single line of text. Run PDF OCR first to add a text layer for scanned documents, then extract again.
Can I get CSV instead of JSON for a spreadsheet import?
Yes — use PDF to Excel, which runs the same row/column detection but emits CSV. Use this JSON tool when an API or a script is the consumer; use the Excel tool when a spreadsheet is.
How do I extract PDF form fields instead of a printed table?
If the data lives in fillable AcroForm fields rather than a printed grid, use the PDF Form Field Extractor, which reads field names, types, and values directly — more reliable than reading the rendered table for forms.
Can I automate this without clicking?
The tool itself is a browser interface that auto-runs on drop. On paid tiers, runs can be routed to the local JAD runner when it's available, keeping processing on your machine. For a fully scripted pipeline, build the same PDF.js extraction logic into your own job and apply the cast/validation steps shown in the cookbook.
Why is the preview only showing 20 rows?
The on-page preview caps at the first 20 objects so a huge table doesn't lock up the browser; it also tells you the total count. The downloaded .json contains every row — the cap is display-only.
Privacy first
All PDF processing runs locally in your browser using pdf-lib and PDF.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.