Extract a PDF Table to JSON for API Use — Free Online

How to extract a PDF table to JSON for API or application use

Step 1
Open the tool and drop your PDF — Load your document into the PDF Table to JSON tool. There are no options to set — it begins extracting the moment the file lands, all locally in the browser.
Step 2
Let it read the text layer and build rows — PDF.js pulls the text items, the engine groups them by vertical position into rows and sorts each row's cells by horizontal position. Pages with only one line of text are skipped (a single line can't form a header + data table).
Step 3
Check the key names in the preview — The preview renders the first 20 objects with their keys. Confirm the keys are the real column headers and not a stray title line — if the first detected line isn't your header row, see the edge cases below.
Step 4
Download the JSON array — Save the full array as <name>.json. It is a top-level [ ... ] of objects — ready to read with JSON.parse or pipe to jq.
Step 5
Cast strings to real types in your code — Every value is a string. Before you POST or insert, coerce: Number(row.price), row.active === 'true', new Date(row.date). Validate with a schema (Zod / Pydantic) so a stray header row or blank cell fails loudly instead of corrupting the API call.
Step 6
POST or insert the array — Send the array to your endpoint or loop it into a batch insert. Because rows from every page are in one array, you usually don't need to merge anything — but do filter out repeated header rows if your PDF prints headers on each page.
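
The row-building described in Step 2 can be sketched as follows. This is a minimal illustration, not the tool's actual code: the item shape ({ str, x, y }), the function name groupIntoRows, and the 2-unit vertical tolerance are all assumptions made for the example.

```javascript
// Hypothetical sketch: group positioned text items into rows of cells.
// Each item is assumed to look like { str, x, y }, with y growing upward
// as in PDF coordinates.
function groupIntoRows(items, tolerance = 2) {
  // Sort top-to-bottom (larger y first in PDF space).
  const sorted = [...items].sort((a, b) => b.y - a.y);
  const rows = [];
  for (const item of sorted) {
    const last = rows[rows.length - 1];
    // Same row if the vertical gap to the row's anchor is within tolerance.
    if (last && Math.abs(last.y - item.y) <= tolerance) {
      last.items.push(item);
    } else {
      rows.push({ y: item.y, items: [item] });
    }
  }
  // Within each row, order cells left-to-right and keep only the text.
  return rows.map(row =>
    row.items.sort((a, b) => a.x - b.x).map(item => item.str)
  );
}

const sample = [
  { str: "Widget", x: 80, y: 700 },
  { str: "SKU", x: 10, y: 720 },
  { str: "A-100", x: 10, y: 700.5 },
  { str: "Name", x: 80, y: 720 },
];
console.log(groupIntoRows(sample));
```

A tolerance is needed because text on the same visual line rarely shares an identical vertical coordinate; the right value depends on the font size of your document.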

What the tool produces vs. what it doesn't

Grounded in the actual extraction code (PDF.js text grouping → first-line-as-keys → flat array of string-valued objects). Plan your integration around the Behaviour column; handle the "Your integration's job" column in your own code.

Capability	Behaviour	Your integration's job
Row objects	One object per detected line after the header line, across all pages, in a single flat array	Iterate directly; no un-nesting needed
Keys	Taken from the first detected line of the document; an empty header cell is named `Column 1`, `Column 2`, … by position	Map/rename keys to your API's field names if they differ
Value types	Always strings — no numbers, booleans, dates, or null	Cast and validate (`Number()`, `Date`, Zod/Pydantic) before sending
Multiple tables / pages	Concatenated into one array; no per-page or per-table nesting and no page/table index field	Filter repeated header rows; split by a marker column if you need per-table grouping
Empty trailing cells	Missing cells in a short row are filled with `""` up to the header width	Treat `""` as missing/optional in your schema
Scanned / image-only PDF	No text layer → no rows extracted (empty array)	Run PDF OCR first to add a text layer
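
The behaviours in the table can be pictured with a small sketch. It assumes the rows have already been detected and arrive as arrays of cell strings; linesToObjects is a hypothetical name, and numbering the `Column N` fallback by column position is an assumption about how the tool counts.

```javascript
// Hypothetical sketch of "first line as keys, flat array of string objects".
// `lines` is assumed to be an array of rows, each an array of cell strings.
function linesToObjects(lines) {
  if (lines.length < 2) return []; // a header alone yields no data rows
  const [header, ...data] = lines;
  // Empty header cells fall back to a positional name (assumed numbering).
  const keys = header.map((cell, i) => cell.trim() || `Column ${i + 1}`);
  return data.map(cells => {
    const obj = {};
    keys.forEach((key, i) => {
      // Short rows are padded with "" up to the header width.
      obj[key] = cells[i] ?? "";
    });
    return obj;
  });
}

const objects = linesToObjects([
  ["SKU", "", "Price"],
  ["A-100", "Widget", "12.50"],
  ["A-101", "Gadget"],
]);
console.log(objects);
```

Every value stays a string, and the short second row gets an empty Price rather than a missing key, which is why your schema should treat "" as missing.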

Tier limits that apply before extraction

The runner checks file size and page count before processing. Numbers are from the PDF family tier limits; extraction itself happens in-browser regardless of tier.

Tier	Max file size	Max pages	Files per job
Free	2 MB	50	1
Pro	50 MB	500	5
Pro + Media	500 MB	2,000	50
Developer	2 GB	10,000	unlimited
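
If you want to know whether a file will be blocked before you drop it, a local pre-check along these lines may help. It is a sketch only: the numbers are copied from the table above, while the tier keys, the function name checkTierLimits, and the choice of 1 MB = 1024 × 1024 bytes are assumptions, not the runner's actual logic.

```javascript
// Limits copied from the tier table; the byte size of "MB" is an assumption.
const MB = 1024 * 1024;
const TIER_LIMITS = {
  free:      { maxBytes: 2 * MB,    maxPages: 50 },
  pro:       { maxBytes: 50 * MB,   maxPages: 500 },
  proMedia:  { maxBytes: 500 * MB,  maxPages: 2000 },
  developer: { maxBytes: 2048 * MB, maxPages: 10000 },
};

// Returns null when the file fits, or a short reason string when it does not.
function checkTierLimits(tier, sizeBytes, pageCount) {
  const limits = TIER_LIMITS[tier];
  if (!limits) throw new Error(`Unknown tier: ${tier}`);
  if (sizeBytes > limits.maxBytes) return "file too large";
  if (pageCount > limits.maxPages) return "too many pages";
  return null;
}

console.log(checkTierLimits("free", 3 * MB, 10));
```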

Casting the string output to a typed payload

Because every value is a string, your code owns type conversion. A small, explicit cast layer keeps the API contract honest.

Target type	Example string in	Cast you write
Integer / float	`"1299.00"`	`Number(row.price)` — guard `NaN`
Boolean	`"true"` / `"Yes"`	`/^(true|yes|y)$/i.test(row.active)`
Date (ISO)	`"2026-06-01"`	`new Date(row.date).toISOString()`
Null / optional	`""`	`row.note || null`
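
One way to package the casts in the table as a small layer is sketched below. The helper names (toNumber, toBoolean, toIsoDate, toNullable) are hypothetical, and throwing on bad input is a design choice for the example, not something the tool does.

```javascript
// Hypothetical cast helpers for the four target types in the table above.
function toNumber(value) {
  const n = Number(value);
  // Number("") is 0, so treat blank strings as invalid too.
  if (value.trim() === "" || Number.isNaN(n)) {
    throw new TypeError(`Not a number: "${value}"`);
  }
  return n;
}

function toBoolean(value) {
  return /^(true|yes|y)$/i.test(value.trim());
}

function toIsoDate(value) {
  const d = new Date(value);
  if (Number.isNaN(d.getTime())) throw new TypeError(`Not a date: "${value}"`);
  return d.toISOString();
}

function toNullable(value) {
  return value === "" ? null : value;
}

const typed = {
  price: toNumber("1299.00"),
  active: toBoolean("Yes"),
  date: toIsoDate("2026-06-01"),
  note: toNullable(""),
};
console.log(typed);
```

The explicit blank check in toNumber matters because Number("") returns 0 rather than NaN, so a NaN guard alone would let an empty cell through as a zero.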

Cookbook

A real API-integration extraction, start to finish, plus the post-processing patterns that turn the string array into a clean request body.

A clean two-page report becomes one flat array

The most common happy path: a PDF with a clear header row. Note the keys come from the first line, every value is a string, and page 2's rows are appended to the same array — there is no nesting per page.

PDF (page 1, then page 2 with the same columns):
SKU        Name             Price   InStock
A-100      Widget           12.50   42
A-101      Gadget           9.99    0
--- page 2 ---
SKU        Name             Price   InStock
A-102      Sprocket         4.25    17

Downloaded JSON (one flat array):
[
  { "SKU": "A-100", "Name": "Widget",   "Price": "12.50", "InStock": "42"  },
  { "SKU": "A-101", "Name": "Gadget",   "Price": "9.99",  "InStock": "0"   },
  { "SKU": "SKU",   "Name": "Name",     "Price": "Price", "InStock": "InStock" },  <- page-2 header row
  { "SKU": "A-102", "Name": "Sprocket", "Price": "4.25",  "InStock": "17"  }
]

Drop the repeated header rows before POSTing

Only the first line of the document is used as keys. On every later page the first line is emitted as data, so repeated print-headers leak in as bogus rows (see the third row above). Filter out any row in which a key's value equals the key itself.

const rows = JSON.parse(json);
const clean = rows.filter(r => r.SKU !== "SKU");   // drop repeated headers

await fetch("https://api.example.com/products", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(clean),
});

Cast strings to a typed payload with Zod

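The tool never guesses types. Put a schema at the boundary so a missing cell or a stray row fails validation instead of silently posting "Price": "Price". Note that z.coerce.number() turns "" into 0, so in the schema below a blank Price fails .positive() but a blank InStock passes .nonnegative(); add an explicit non-empty check for any column where a blank must be rejected.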

import { z } from "zod";

const Product = z.object({
  SKU: z.string().regex(/^[A-Z]-\d+$/),
  Name: z.string().min(1),
  Price: z.coerce.number().positive(),
  InStock: z.coerce.number().int().nonnegative(),
});

const payload = z.array(Product).parse(clean);
// payload[0].Price is now a real number

Reshape keys to match your API contract

PDF headers rarely match your endpoint's field names. Map once, explicitly, so the rest of the pipeline is stable even if the PDF's wording changes.

const body = clean.map(r => ({
  sku:       r.SKU,
  title:     r.Name,
  priceCents: Math.round(Number(r.Price) * 100),
  stock:     Number(r.InStock),
}));
// body is the exact JSON your /products endpoint expects

Quick command-line check with jq

Before wiring anything, sanity-check the row count and the key names from the terminal.

# how many rows?
jq 'length' report.json

# what keys does the first object have?
jq '.[0] | keys' report.json

# drop repeated header rows and count again
jq '[.[] | select(.SKU != "SKU")] | length' report.json

Edge cases and what actually happens

First line of the page is a title, not the header row

Wrong keys

The engine uses the first detected line of the document as the keys. If a report prints a title or date line above the actual column headers, that title line becomes your keys and the real header row becomes the first data object. Crop the title region out first, or post-process by promoting the first object's values to keys in your script.
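
The post-processing route, promoting the real header row (which arrives as the first data object) to keys, might look like this. It is a sketch under assumptions: promoteFirstRowToKeys is a hypothetical helper, and it relies on object property order matching column order, which holds for JSON.parse output as long as the keys are not integer-like strings.

```javascript
// Hypothetical fix-up for "a title line became the keys".
// The first object's values are the real headers, so promote them to keys
// and re-key every remaining row by column position.
function promoteFirstRowToKeys(rows) {
  if (rows.length === 0) return [];
  const [headerRow, ...data] = rows;
  const newKeys = Object.values(headerRow).map(
    (value, i) => value.trim() || `Column ${i + 1}`
  );
  return data.map(row => {
    const values = Object.values(row);
    return Object.fromEntries(newKeys.map((key, i) => [key, values[i] ?? ""]));
  });
}

const mistitled = [
  { "Q3 Stock Report": "SKU", "Column 2": "Name" },
  { "Q3 Stock Report": "A-100", "Column 2": "Widget" },
];
console.log(promoteFirstRowToKeys(mistitled));
```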

Repeated header rows on multi-page tables

By design

Only the very first detected line of the document anchors the keys; from page 2 onward the printed header line is emitted as a data row. This is expected, because lines are grouped page by page and then appended to one array. Filter rows where a key equals its own header text (r.SKU !== 'SKU') before sending to your API.
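
If you would rather not hard-code one column name, a more general filter can test every column at once. This is a sketch, with isRepeatedHeader and dropRepeatedHeaders as hypothetical names.

```javascript
// A row is treated as a repeated header when every value equals its own key.
function isRepeatedHeader(row) {
  const entries = Object.entries(row);
  return (
    entries.length > 0 &&
    entries.every(([key, value]) => value.trim() === key)
  );
}

function dropRepeatedHeaders(rows) {
  return rows.filter(row => !isRepeatedHeader(row));
}

const extracted = [
  { SKU: "A-100", Name: "Widget" },
  { SKU: "SKU", Name: "Name" },
  { SKU: "A-102", Name: "Sprocket" },
];
console.log(dropRepeatedHeaders(extracted));
```

Requiring every column to match is stricter than checking a single key, so a genuine data row that happens to contain the word "SKU" in one cell is kept. A header with blank cells (keys named Column N) would not match this test and needs its own rule.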

Scanned or image-only PDF

Empty array

There's no text layer to read, so PDF.js returns no text items and the output array is empty. Run PDF OCR first to add a searchable text layer, then extract. A blank result almost always means the page is an image, not a bug.

Free-tier file over 2 MB or PDF over 50 pages

Blocked

The runner checks size and page count before extracting. On the free tier a file larger than 2 MB or a PDF with more than 50 pages is blocked with an upgrade prompt. Pro raises this to 50 MB / 500 pages. Split a large PDF with PDF Split or extract just the pages you need with PDF Extract Pages first.

Values you expected as numbers arrive as strings

Expected

There is no type inference — "12.50" stays a string, (123) stays "(123)". This is intentional so the tool never mangles an ID, a code, or a leading zero. Cast in your own code with Number(), a regex, or a schema's z.coerce.
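
For currency-style strings, a regex-based cast could be sketched like this. The reading of "(123)" as an accounting-style negative, the comma as a thousands separator, and the name parseAmount are all assumptions; adjust them to your document's conventions.

```javascript
// Hypothetical parser for currency-style strings.
// Assumptions: "(123)" means -123 and "," is a thousands separator.
function parseAmount(text) {
  const trimmed = text.trim();
  const negative = /^\(.*\)$/.test(trimmed);
  // Strip parentheses, commas, dollar signs, and whitespace.
  const digits = trimmed.replace(/[(),$\s]/g, "");
  if (digits === "" || Number.isNaN(Number(digits))) return null;
  const value = Number(digits);
  return negative ? -value : value;
}

console.log(parseAmount("(123)"), parseAmount("1,299.00"), parseAmount("n/a"));
```

Returning null for anything unparseable keeps IDs and codes from being silently turned into numbers; apply the function only to columns you know hold amounts.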

A cell wraps onto two visual lines

Split row

Rows are defined by vertical position, so a long description that wraps to a second line is read as a separate row with mostly empty cells. Merge wrapped continuation rows in post-processing (a row whose key column is empty usually belongs to the row above).
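
The merge heuristic described above can be sketched as follows. mergeWrappedRows is a hypothetical helper, and it assumes the wrapped text lands in the same column as the cell it continues.

```javascript
// Hypothetical merge of wrapped continuation rows.
// Assumption: a row whose key column is empty continues the row above it.
function mergeWrappedRows(rows, keyColumn) {
  const merged = [];
  for (const row of rows) {
    const previous = merged[merged.length - 1];
    if (previous && row[keyColumn] === "") {
      // Append each non-empty cell to the matching cell of the previous row.
      for (const [key, value] of Object.entries(row)) {
        if (value !== "") {
          previous[key] = previous[key] ? `${previous[key]} ${value}` : value;
        }
      }
    } else {
      merged.push({ ...row }); // copy so the input array is not mutated
    }
  }
  return merged;
}

const wrapped = [
  { SKU: "A-100", Name: "Heavy-duty widget", Price: "12.50" },
  { SKU: "", Name: "with mounting kit", Price: "" },
  { SKU: "A-101", Name: "Gadget", Price: "9.99" },
];
console.log(mergeWrappedRows(wrapped, "SKU"));
```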

Columns don't line up between pages

Misaligned keys

If a later page has a different column order or count, its rows are still appended to the same flat array, so the array ends up mixing two layouts and values can land under keys that do not describe them. Mixed layouts in one array can break a strict validator. Split such documents and extract each section separately, or normalise the rows after extraction.

Encrypted PDF with a permissions password

May fail

Page counting tolerates light encryption, but the text-layer read can fail on a password-protected file. Remove the password with PDF Remove Password (you'll need the password) or PDF Unlock first, then extract.

Two tables stacked on one page

Merged

The engine treats a whole page as one table region — it groups every line on the page. Two distinct tables on the same page merge into one block of rows with one set of keys (the first table's header). Separate the tables onto different pages, or split the array on a recognisable boundary row afterward.
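
Splitting the array on a recognisable boundary row could be done along these lines. splitOnBoundary is a hypothetical helper, and what counts as a boundary (here, a row whose SKU cell reads "Region") is something only you can define for your document.

```javascript
// Hypothetical splitter: start a new group whenever isBoundary(row) is true.
// The boundary row itself is dropped from the output.
function splitOnBoundary(rows, isBoundary) {
  const groups = [[]];
  for (const row of rows) {
    if (isBoundary(row)) {
      groups.push([]);
    } else {
      groups[groups.length - 1].push(row);
    }
  }
  // Discard empty groups, e.g. when the first row is itself a boundary.
  return groups.filter(group => group.length > 0);
}

const stacked = [
  { SKU: "A-100", Name: "Widget" },
  { SKU: "Region", Name: "Manager" }, // second table's header line
  { SKU: "North", Name: "Lee" },
];
console.log(splitOnBoundary(stacked, row => row.SKU === "Region"));
```

The second group is still keyed by the first table's header, so you would typically re-key it afterwards using the values of the boundary row.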

Frequently asked questions

What exactly does the JSON look like?

A single flat array of objects, e.g. [{ "SKU": "A-100", "Price": "12.50" }, …]. Keys come from the first detected line of the document; every value is a string. There is no wrapper object, no per-page nesting, and no metadata. The file contains only the array, ready for JSON.parse or a POST body.

Are the values typed (numbers, booleans, dates)?

No. Every value is a string, including things that look numeric. This is deliberate so the tool never corrupts an ID, a leading zero, or a currency-formatted figure. Cast in your own code — Number(row.price), new Date(row.date) — and ideally validate with Zod or Pydantic at the API boundary.

How does it handle a PDF with multiple tables or multiple pages?

All rows from all pages are concatenated into one flat array. There is no separate array per table and no page/table index field. The practical consequence: on multi-page tables the repeated header line becomes a data row from page 2 onward, so filter those out (r.SKU !== 'SKU') before using the data.

Can I post the output straight to an API?

Almost — the array shape is already what most endpoints want. Two near-universal pre-steps: drop any repeated header rows, and cast string values to the types your API expects. After that, JSON.stringify(rows) is a valid request body.

Where do the JSON keys come from?

From the first detected line of the document. If a cell in that line is empty, it's named by position (Column 1, Column 2, …). If your PDF prints a title above the real headers, that title becomes the keys instead; crop it out or remap in code.

Is the file uploaded anywhere?

No. Extraction runs entirely in your browser using PDF.js. The document, its contents, and any API data inside it never leave your device; only anonymous usage counters are recorded when you're signed in.

What are the size and page limits?

Free handles up to 2 MB and 50 pages per file; Pro raises that to 50 MB and 500 pages, Pro + Media to 500 MB / 2,000 pages, and Developer to 2 GB / 10,000 pages. Oversized files are blocked before extraction with an upgrade prompt.

It returned an empty array — why?

The PDF has no text layer (it's scanned or image-only), or every page had only a single line of text. Run PDF OCR first to add a text layer for scanned documents, then extract again.

Can I get CSV instead of JSON for a spreadsheet import?

Yes — use PDF to Excel, which runs the same row/column detection but emits CSV. Use this JSON tool when an API or a script is the consumer; use the Excel tool when a spreadsheet is.

How do I extract PDF form fields instead of a printed table?

If the data lives in fillable AcroForm fields rather than a printed grid, use the PDF Form Field Extractor, which reads field names, types, and values directly — more reliable than reading the rendered table for forms.

Can I automate this without clicking?

The tool itself is a browser interface that auto-runs on drop. On paid tiers, runs can be routed to the local JAD runner when it's available, keeping processing on your machine. For a fully scripted pipeline, port the same PDF.js extraction logic into your own job and apply the cast/validation steps shown in the cookbook.
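
As a starting point for a scripted pipeline, the text-layer read can be sketched with PDF.js directly. The function below is an illustration, not the tool's code: collectTextItems is a hypothetical name, and `pdf` is assumed to be a document you have already loaded, for example with `await pdfjsLib.getDocument({ data }).promise` from the pdfjs-dist package.

```javascript
// Sketch: collect positioned text items from an already-loaded PDF.js document.
// `pdf` is assumed to be the object resolved by pdfjsLib.getDocument(...).promise.
async function collectTextItems(pdf) {
  const pages = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    const items = content.items
      // Marked-content entries have no `str`; whitespace-only items add nothing.
      .filter(item => typeof item.str === "string" && item.str.trim() !== "")
      .map(item => ({
        str: item.str,
        x: item.transform[4], // horizontal offset of the text run
        y: item.transform[5], // vertical offset (PDF origin is bottom-left)
      }));
    pages.push({ pageNumber, items });
  }
  return pages;
}
```

A scanned page shows up here as an empty items array, which is the scripted equivalent of the empty result described in the edge cases. Grouping the items into rows and building objects from them is left to your own code.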

Why is the preview only showing 20 rows?

The on-page preview caps at the first 20 objects so a huge table doesn't lock up the browser; it also tells you the total count. The downloaded .json contains every row — the cap is display-only.

Privacy first

All PDF processing runs locally in your browser using pdf-lib and PDF.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.

