Extract a PDF Financial Table to JSON — Free Online

How to extract financial tables from a PDF into JSON

Step 1
Open the tool and drop the report PDF — Load the financial statement into the PDF Table to JSON tool. Extraction starts immediately in your browser — no settings.
Step 2
Confirm the period columns in the preview — Check the first 20 objects: keys should be the line-item label column plus each reporting period (e.g. 2025, 2024). If a heading line above the table became the keys, see the edge cases.
Step 3
Download the JSON array — Save <name>.json — a flat array of line-item objects spanning every page of the statement.
Step 4
Drop spacer and section rows — Statements include blank spacer lines and section headings (Operating expenses). These come through as rows with empty figure cells — filter them where the figure columns are blank.
Step 5
Parse figures: separators and parentheses — Convert each figure string to a number with explicit sign handling: strip commas, and turn (1,234) into -1234. Keep this in one function so the rule is auditable and consistent across the whole statement.
Step 6
Load into your model or API — Map line items to your chart-of-accounts keys and push the parsed numbers into your model, dashboard data source, or finance API.
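Step 6 can be sketched as a lookup from statement labels to account keys. The `ACCOUNT_MAP` entries and codes such as `rev.total` are hypothetical placeholders, not part of the tool; substitute your own chart of accounts. Labels with no mapping are collected rather than silently dropped.

```javascript
// Hypothetical mapping from statement labels to chart-of-accounts keys.
const ACCOUNT_MAP = new Map([
  ["Revenue", "rev.total"],
  ["Cost of sales", "cogs.total"],
  ["Gross profit", "gp.total"],
]);

// Takes parsed rows ({ item, "2025": number, ... }) and returns the mapped
// entries plus the labels that had no mapping.
function mapToAccounts(parsedRows, accountMap) {
  const mapped = [];
  const unmapped = [];
  for (const { item, ...figures } of parsedRows) {
    const account = accountMap.get(item);
    if (account === undefined) unmapped.push(item);
    else mapped.push({ account, ...figures });
  }
  return { mapped, unmapped };
}

const result = mapToAccounts(
  [
    { item: "Revenue", "2025": 48200, "2024": 41050 },
    { item: "Admin", "2025": -12000, "2024": -11200 },
  ],
  ACCOUNT_MAP
);
console.log(result.unmapped); // labels that still need a mapping
```

Reviewing the `unmapped` list before loading is a cheap way to catch a renamed or merged label.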

Financial formatting: how each convention extracts

The tool reads text as-is. The table below shows how common accounting formats appear in the output and the parse to write for each.

Format in the PDF	String you get	Parse to
Plain figure	`"1,234"`	`Number(s.replace(/,/g,''))` → `1234`
Negative in parentheses	`"(1,234)"`	`-1234` (detect `()`, then strip + negate)
Negative with minus	`"-1,234"`	`-1234`
Zero shown as dash	`"-"` or `"—"`	`0` (treat dash as zero)
Scale suffix (k, m)	`"1.2m"` / `"1,200k"`	Scale explicitly; `Number()` returns `NaN` on these
Blank spacer / section row	empty figure cells	Filter out before loading
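For the scale-suffix row, a minimal sketch of explicit scaling follows. The function name `parseScaled` and the `SCALE` table are illustrative, and the sketch assumes only `k` and `m` suffixes and whole-unit results; extend it if your statements use other suffixes.

```javascript
// Multipliers for the suffixes this sketch recognises (an assumption).
const SCALE = { k: 1e3, m: 1e6 };

function parseScaled(s) {
  const t = s.trim();
  const neg = /^\(.*\)$/.test(t);                      // parentheses = negative
  const m = /^(-?[\d,]*\.?\d+)\s*([km])$/i.exec(t.replace(/[()]/g, ""));
  if (!m) throw new Error(`Not a scaled figure: ${s}`);
  const base = Number(m[1].replace(/,/g, ""));         // strip thousands commas
  // Round to whole units to absorb floating-point noise from the multiply.
  const value = Math.round(base * SCALE[m[2].toLowerCase()]);
  return neg ? -value : value;
}

console.log(parseScaled("1.2m"));   // 1200000
console.log(parseScaled("1,200k")); // 1200000
```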

Tier limits for report PDFs

Single statements fit the free tier easily; full annual reports with notes can run long — check page count.

Tier	Max file size	Max pages
Free	2 MB	50
Pro	50 MB	500
Pro + Media	500 MB	2,000
Developer	2 GB	10,000
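If you want to check a report against these limits before opening the tool, the table can be encoded as data. This is a sketch only: `smallestTier` is a hypothetical helper, and treating 1 MB as 1024 × 1024 bytes is an assumption about how the limits are measured.

```javascript
const MB = 1024 * 1024; // assumption: binary megabytes

// Limits copied from the table above, ordered from smallest to largest tier.
const TIERS = [
  { name: "Free", maxBytes: 2 * MB, maxPages: 50 },
  { name: "Pro", maxBytes: 50 * MB, maxPages: 500 },
  { name: "Pro + Media", maxBytes: 500 * MB, maxPages: 2000 },
  { name: "Developer", maxBytes: 2048 * MB, maxPages: 10000 },
];

// Returns the name of the first tier that fits both limits, or null if none does.
function smallestTier(sizeBytes, pages) {
  const tier = TIERS.find(t => sizeBytes <= t.maxBytes && pages <= t.maxPages);
  return tier ? tier.name : null;
}

console.log(smallestTier(1.5 * MB, 12));  // "Free"
console.log(smallestTier(1.5 * MB, 180)); // "Pro"
```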

Cookbook

A sample P&L extraction and the auditable parse that turns the array of strings into model-ready numbers.

A P&L with a parenthesised negative

Watch the loss line: it stays "(1,200)" as a string. The section heading 'Operating expenses' becomes a near-empty row.

PDF:
Line item            2025      2024
Revenue              48,200    41,050
Cost of sales        (29,400)  (25,100)
Gross profit         18,800    15,950
Operating expenses
  Admin              (12,000)  (11,200)
Operating loss       (1,200)   (250)

Downloaded JSON:
[
  { "Line item": "Revenue",            "2025": "48,200",   "2024": "41,050" },
  { "Line item": "Cost of sales",      "2025": "(29,400)", "2024": "(25,100)" },
  { "Line item": "Gross profit",       "2025": "18,800",   "2024": "15,950" },
  { "Line item": "Operating expenses", "2025": "",         "2024": "" },
  { "Line item": "Admin",              "2025": "(12,000)", "2024": "(11,200)" },
  { "Line item": "Operating loss",     "2025": "(1,200)",  "2024": "(250)" }
]

One auditable number-parser

Keep all sign and separator logic in a single function so a reviewer can sign off on it once.

function parseFig(s) {
  if (s == null) return null;                                  // key missing from the row
  const t = s.trim();
  if (t === "" || /^[-\u2013\u2014]$/.test(t)) return 0;       // blank, hyphen, en dash, or em dash = 0
  const neg = /^\(.*\)$/.test(t);                              // parentheses = negative
  const n = Number(t.replace(/[(),\s]/g, ""));                 // strip parentheses, commas, spaces
  if (Number.isNaN(n)) throw new Error(`Unparseable figure: ${s}`);
  return neg ? -n : n;
}
parseFig("(1,200)"); // -1200
parseFig("48,200");  //  48200

Build a typed line-item array

Drop spacer/section rows, then parse each period column with the one function above.

const rows = JSON.parse(json);   // json: contents of the downloaded .json file, as a string
const periods = ["2025", "2024"];
const model = rows
  .filter(r => periods.some(p => r[p] && r[p].trim() !== ""))   // has at least one figure
  .map(r => ({
    item: r["Line item"].trim(),
    ...Object.fromEntries(periods.map(p => [p, parseFig(r[p])])),   // one parsed number per period
  }));

Cross-foot to validate the extraction

Because the data is now numeric, you can verify the extraction by checking that totals add up — catching a dropped or misread row before it reaches the model.

const by = Object.fromEntries(model.map(r => [r.item, r]));
const gp = by["Revenue"]["2025"] + by["Cost of sales"]["2025"];
if (gp !== by["Gross profit"]["2025"]) {
  console.warn(`Gross profit mismatch: computed ${gp}, stated ${by["Gross profit"]["2025"]}`);
}
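The same check can be run across every period and for any total. The sketch below is one way to do that; `crossFoot` is an illustrative name, and the sample values are the parsed figures from the P&L above.

```javascript
// Parsed rows as produced by the typed line-item step (sample values).
const parsed = [
  { item: "Revenue", "2025": 48200, "2024": 41050 },
  { item: "Cost of sales", "2025": -29400, "2024": -25100 },
  { item: "Gross profit", "2025": 18800, "2024": 15950 },
];

// Checks that the named components sum to the stated total in every period.
// Returns a list of mismatches; an empty list means the check passed.
function crossFoot(rows, periods, components, total) {
  const byItem = new Map(rows.map(r => [r.item, r]));
  const get = name => {
    const row = byItem.get(name);
    if (!row) throw new Error(`Missing row: ${name}`); // a dropped row is itself a finding
    return row;
  };
  const problems = [];
  for (const p of periods) {
    const computed = components.reduce((sum, name) => sum + get(name)[p], 0);
    const stated = get(total)[p];
    if (computed !== stated) problems.push({ period: p, computed, stated });
  }
  return problems;
}

const problems = crossFoot(parsed, ["2025", "2024"], ["Revenue", "Cost of sales"], "Gross profit");
console.log(problems.length === 0 ? "cross-foot OK" : problems);
```

Exact comparison is appropriate for whole-unit figures; if your statement carries decimals, compare within a small tolerance instead.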

Drop repeated headers from a multi-page statement

A multi-page statement that reprints the column header on each page leaks header rows into the array; remove rows whose label cell equals the header text.

const clean = rows.filter(r => r["Line item"] !== "Line item");

Edge cases and what actually happens

Negative shown in parentheses

Expected

Accounting style writes negatives as (1,234). The tool keeps it as the string "(1,234)" — it does not convert to -1234. This is intentional, because mis-signing a figure corrupts a model silently. Detect the parentheses and negate in your parse step (see the cookbook).

Thousands separators in every figure

Cast carefully

Figures keep their commas ("48,200"). Number() returns NaN on those, so strip separators before converting. For EU-format statements (48.200,00), normalise the dot/comma convention first.
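A minimal sketch of the EU-format normalisation, assuming you already know the statement uses a dot for thousands and a comma for decimals. The function name `normaliseEuFigure` is illustrative. The convention cannot be detected reliably from a single figure ("1.234" is ambiguous), so apply this per statement, not per cell.

```javascript
// Converts "48.200,00" (dot = thousands, comma = decimal) to "48200.00".
// Parentheses and signs are left in place for the number parser to handle.
function normaliseEuFigure(s) {
  return s.replace(/\./g, "").replace(",", ".");
}

console.log(Number(normaliseEuFigure("48.200,00")));    // 48200
console.log(Number(normaliseEuFigure("1.234.567,89"))); // 1234567.89
```

Run the normalisation first, then pass the result to the same number parser used for the rest of the statement.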

Zero printed as a dash

Treat as zero

Many statements print - or an em-dash for a zero or nil value. That extracts as a dash string. Map it to 0 in your parser rather than letting it become NaN.

Section headings and spacer rows

Filter needed

Headings like 'Operating expenses' and blank spacer lines come through as rows with empty figure cells. Filter rows where no period column has a value before loading the model.

Indented sub-items lose their hierarchy

Flattened

Indentation that signals a sub-line under a heading is purely visual; the tool produces a flat array with no parent/child link. Reconstruct hierarchy from the label text or row order in your own code if you need the tree.
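One way to rebuild a single level of hierarchy from row order is to carry the last heading down onto the rows beneath it. This sketch assumes a heading is any row with a label but no figures; `attachSections` and the `section` key are illustrative names, not part of the tool's output.

```javascript
// Raw extracted rows (strings), following the cookbook sample.
const extracted = [
  { "Line item": "Revenue", "2025": "48,200", "2024": "41,050" },
  { "Line item": "Operating expenses", "2025": "", "2024": "" },
  { "Line item": "Admin", "2025": "(12,000)", "2024": "(11,200)" },
];

// Tags each figure row with the most recent heading row above it.
function attachSections(rows, periods) {
  let section = null;
  const out = [];
  for (const r of rows) {
    const hasFigure = periods.some(p => (r[p] ?? "").trim() !== "");
    const label = (r["Line item"] ?? "").trim();
    if (!hasFigure) {
      if (label !== "") section = label; // heading row: remember it
      continue;                          // headings and spacers are not emitted
    }
    out.push({ ...r, section });
  }
  return out;
}

const tagged = attachSections(extracted, ["2025", "2024"]);
console.log(tagged.map(r => [r["Line item"], r.section]));
```

A heading stays in force until the next heading appears, so a total printed after the section (such as Operating loss) is tagged with it too. Reset the section on known total labels if that matters for your model.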

Notes-reference column merges with the label

Misaligned

A small 'Note' number column sitting tight against the line-item label can merge into one cell. Inspect the preview; if you see "Revenue 4", split the trailing note number out in post-processing.
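A sketch of that post-processing split, assuming note references are one or two digits separated from the label by whitespace. `splitNoteRef` is an illustrative name. Labels that legitimately end in a number would be split as well, so review the output instead of applying it blindly.

```javascript
// Splits a trailing note reference ("Revenue 4") off the label.
function splitNoteRef(label) {
  const t = label.trim();
  const m = /^(.*\S)\s+(\d{1,2})$/.exec(t);
  return m ? { label: m[1], note: m[2] } : { label: t, note: null };
}

console.log(splitNoteRef("Revenue 4"));     // { label: "Revenue", note: "4" }
console.log(splitNoteRef("Cost of sales")); // { label: "Cost of sales", note: null }
```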

Scanned filing with no text layer

Empty array

Image-only filings (older scanned reports) yield no text and an empty array. Run PDF OCR first, then extract — and verify every figure, as OCR errors on financial digits are high-stakes.

Annual report exceeds the page limit

Blocked

A full annual report can exceed the free 50-page limit. Upgrade to Pro (500 pages) or extract just the statement pages with PDF Extract Pages before running the tool.

XBRL-tagged report

Use a parser

If the filing carries inline XBRL, the tagged numeric facts are more reliable than the rendered table. This tool reads the visual table only; for tagged data, use a dedicated XBRL parser that reads the iXBRL facts directly.

Frequently asked questions

Will negatives in parentheses, like (1,234), convert correctly?

They extract as strings exactly as printed — "(1,234)" — and are not auto-converted to -1234. This is deliberate: silently guessing a sign could corrupt a model. Detect the parentheses and negate in your parse step. The cookbook shows a one-function approach you can review and audit once.

How are thousands separators handled?

Kept as-is in the string ("48,200"). Strip them before Number(). For European formatting (48.200,00), normalise the dot and comma convention first, then parse.

Can I extract several statements from one report at once?

Yes — every page's rows go into one flat array. There's no separate array per statement and no page index, so you'll need to split the P&L, balance sheet, and cash flow apart afterward (by a recognisable boundary row or by extracting the relevant pages first with PDF Extract Pages).
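Splitting by a boundary row can be sketched as follows. Which labels mark the start of each statement depends on the report, so the `boundaries` map here is an assumption, and `splitStatements` is an illustrative name.

```javascript
// Splits one flat array into named chunks at boundary rows.
// `boundaries` maps a label that starts a statement to a name for its chunk.
// Rows that appear before the first boundary are ignored.
function splitStatements(rows, boundaries) {
  const chunks = {};
  let current = null;
  for (const r of rows) {
    const label = (r["Line item"] ?? "").trim();
    if (boundaries.has(label)) {
      current = boundaries.get(label);
      if (!chunks[current]) chunks[current] = [];
    }
    if (current !== null) chunks[current].push(r);
  }
  return chunks;
}

const chunks = splitStatements(
  [
    { "Line item": "Revenue", "2025": "48,200" },
    { "Line item": "Gross profit", "2025": "18,800" },
    { "Line item": "Non-current assets", "2025": "" },
    { "Line item": "Property", "2025": "9,100" },
  ],
  new Map([["Revenue", "pnl"], ["Non-current assets", "balanceSheet"]])
);
console.log(Object.keys(chunks)); // ["pnl", "balanceSheet"]
```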

Does the tool keep multi-year columns aligned?

Yes. Each reporting period in the table becomes its own key (2025, 2024), so each line item carries its periods together. Parse each period column with the same number-parsing function for consistency.

How do I remove section headings and blank rows?

They come through as rows with empty figure cells. Filter to rows where at least one period column has a value: rows.filter(r => periods.some(p => r[p]?.trim())). If you want to keep the heading as a category, carry the last seen heading down onto the rows beneath it.

Is my financial data uploaded anywhere?

No. Extraction runs entirely in your browser using PDF.js. Pre-release results, management accounts, and any other sensitive figures never leave your device; only anonymous usage counters are recorded when you're signed in.

Can I verify the extraction was accurate?

Yes — once parsed to numbers, cross-foot the statement: check that revenue plus cost of sales equals gross profit, etc. A mismatch flags a dropped or misread row before it reaches your model. The cookbook includes a short cross-foot check.

What about figures shown as a dash for nil?

A - or em-dash for a zero value extracts as that dash string. Map it to 0 in your parser so it doesn't become NaN. The example parser in the cookbook handles this case.

Is this suitable for XBRL-tagged annual reports?

For the most reliable numbers from an XBRL/iXBRL filing, use a dedicated XBRL parser that reads the tagged facts. This tool extracts from the rendered visual table, which is fine for un-tagged reports or a quick pull, but the tagged data is the authoritative source when it exists.

I need a spreadsheet, not JSON — what should I use?

Use PDF to Excel for CSV output you can open in Excel or Google Sheets. It runs the same row/column detection. Pick JSON when feeding a model or API programmatically; CSV when an analyst will work the numbers by hand.

What are the limits for large reports?

Free: 2 MB and 50 pages. Pro: 50 MB / 500 pages. Pro + Media: 500 MB / 2,000 pages. Developer: 2 GB / 10,000 pages. For a long annual report, either upgrade or extract just the statement pages first.

Why did a sub-line lose its indentation/grouping?

Indentation is visual only; the output is a flat array with no parent/child structure. Rebuild the hierarchy from the label text or row order in your own code if your model needs the tree of headings and sub-lines.

Privacy first

All PDF processing runs locally in your browser using pdf-lib and PDF.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.
