How to extract financial tables from a PDF into JSON
- Step 1: Open the tool and drop the report PDF. Load the financial statement into the PDF Table to JSON tool. Extraction starts immediately in your browser, with no settings to configure.
- Step 2: Confirm the period columns in the preview. Check the first 20 objects: keys should be the line-item label column plus each reporting period (e.g. 2025, 2024). If a heading line above the table became the keys, see the edge cases.
- Step 3: Download the JSON array. Save <name>.json, a flat array of line-item objects spanning every page of the statement.
- Step 4: Drop spacer and section rows. Statements include blank spacer lines and section headings (Operating expenses). These come through as rows with empty figure cells; filter out rows whose figure columns are all blank.
- Step 5: Parse figures: separators and parentheses. Convert each figure string to a number with explicit sign handling: strip commas, and turn (1,234) into -1234. Keep this in one function so the rule is auditable and consistent across the whole statement.
- Step 6: Load into your model or API. Map line items to your chart-of-accounts keys and push the parsed numbers into your model, dashboard data source, or finance API.
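The mapping in Step 6 can be sketched as a plain lookup from statement labels to account keys. The account codes and the names `ACCOUNT_MAP` and `mapToAccounts` are hypothetical, so substitute your own chart of accounts. Unmapped labels are collected rather than dropped so someone can review them before loading.

```javascript
// Hypothetical mapping from statement labels to chart-of-accounts keys.
const ACCOUNT_MAP = new Map([
  ["Revenue", "4000-revenue"],
  ["Cost of sales", "5000-cogs"],
  ["Admin", "6100-admin"],
]);

// Takes parsed rows ({ item, "2025": number, ... }) and returns the rows
// keyed by account, plus the labels that had no mapping.
function mapToAccounts(model, accountMap) {
  const mapped = [];
  const unmapped = [];
  for (const row of model) {
    const { item, ...figures } = row;
    const account = accountMap.get(item);
    if (account === undefined) {
      unmapped.push(item);
    } else {
      mapped.push({ account, ...figures });
    }
  }
  return { mapped, unmapped };
}

const result = mapToAccounts(
  [
    { item: "Revenue", "2025": 48200, "2024": 41050 },
    { item: "Gross profit", "2025": 18800, "2024": 15950 },
  ],
  ACCOUNT_MAP
);
console.log(result.unmapped); // labels to review before loading
```

Subtotals such as Gross profit will usually land in `unmapped`, which is often what you want if your model derives totals itself.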
Financial formatting: how each convention extracts
The tool reads text as-is. This is what common accounting formats look like in the output, and the parse you write for each.
| Format in the PDF | String you get | Parse to |
|---|---|---|
| Plain figure | "1,234" | Number(s.replace(/,/g,'')) → 1234 |
| Negative in parentheses | "(1,234)" | -1234 (detect (), then strip + negate) |
| Negative with minus | "-1,234" | -1234 |
| Zero shown as dash | "-" or "—" | 0 (treat dash as zero) |
| Scale suffix (thousands or millions) | "1.2m" / "1,200k" | Scale explicitly; don't call Number() blindly |
| Blank spacer / section row | empty figure cells | Filter out before loading |
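For the scale-suffix row in the table above, one possible approach is a small dedicated parser. This is a sketch only: the suffix set (`k`, `m`, `bn`) and the name `parseScaled` are assumptions, and some statements use other conventions (for example a column header such as "in thousands"), so check the report before relying on it.

```javascript
// Multipliers for the suffixes this sketch recognises (an assumption;
// extend or change to match the statement's own convention).
const SCALE = { k: 1e3, m: 1e6, bn: 1e9 };

function parseScaled(s) {
  let t = s.trim().toLowerCase();
  // Accounting-style negatives: "(3.5k)" means -3500.
  const parenNeg = /^\(.*\)$/.test(t);
  if (parenNeg) t = t.slice(1, -1).trim();
  const m = /^(-?)(\d[\d,]*(?:\.\d+)?)\s*(k|m|bn)?$/.exec(t);
  if (!m) throw new Error(`Unparseable scaled figure: ${s}`);
  const [, minus, digits, suffix] = m;
  const value = Number(digits.replace(/,/g, "")) * (suffix ? SCALE[suffix] : 1);
  return parenNeg || minus ? -value : value;
}

console.log(parseScaled("1,200k"));
console.log(parseScaled("(3.5k)"));
```

Multiplying a decimal such as 1.2 by a scale factor goes through floating point, so round the result to the precision your model expects.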
Tier limits for report PDFs
Single statements fit the free tier easily; full annual reports with notes can run long — check page count.
| Tier | Max file size | Max pages |
|---|---|---|
| Free | 2 MB | 50 |
| Pro | 50 MB | 500 |
| Pro + Media | 500 MB | 2,000 |
| Developer | 2 GB | 10,000 |
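If you are queuing many reports, the table above can be encoded as data to pre-check which tier a file needs. This is a rough sketch: it assumes 1 MB means 1024 × 1024 bytes, which the page does not specify, and the names `TIERS` and `smallestTier` are illustrative rather than part of the tool.

```javascript
const MB = 1024 * 1024; // assumption: binary megabytes

// Limits copied from the tier table; order matters (smallest first).
const TIERS = [
  { name: "Free", maxBytes: 2 * MB, maxPages: 50 },
  { name: "Pro", maxBytes: 50 * MB, maxPages: 500 },
  { name: "Pro + Media", maxBytes: 500 * MB, maxPages: 2000 },
  { name: "Developer", maxBytes: 2048 * MB, maxPages: 10000 },
];

// Returns the name of the first tier that fits both limits, or null if none does.
function smallestTier(bytes, pages) {
  const tier = TIERS.find(t => bytes <= t.maxBytes && pages <= t.maxPages);
  return tier ? tier.name : null;
}

console.log(smallestTier(1 * MB, 120)); // small file, but too many pages for Free
```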
Cookbook
A real P&L extraction and the exact, auditable parse that turns the string array into model-ready numbers.
A P&L with a parenthesised negative
Watch the loss line: it stays "(1,200)" as a string. The section heading 'Operating expenses' becomes a near-empty row.
PDF:
Line item 2025 2024
Revenue 48,200 41,050
Cost of sales (29,400) (25,100)
Gross profit 18,800 15,950
Operating expenses
Admin (12,000) (11,200)
Operating loss (1,200) (250)
Downloaded JSON:
[
{ "Line item": "Revenue", "2025": "48,200", "2024": "41,050" },
{ "Line item": "Cost of sales", "2025": "(29,400)", "2024": "(25,100)" },
{ "Line item": "Gross profit", "2025": "18,800", "2024": "15,950" },
{ "Line item": "Operating expenses", "2025": "", "2024": "" },
{ "Line item": "Admin", "2025": "(12,000)", "2024": "(11,200)" },
{ "Line item": "Operating loss", "2025": "(1,200)", "2024": "(250)" }
]
One auditable number-parser
Keep all sign and separator logic in a single function so a reviewer can sign off on it once.
function parseFig(s) {
  if (s == null) return null;
  const t = s.trim();
  if (t === "" || t === "-" || t === "\u2014") return 0; // blank or dash = 0
  const neg = /^\(.*\)$/.test(t); // parentheses = negative
  const n = Number(t.replace(/[(),\s]/g, ""));
  if (Number.isNaN(n)) throw new Error(`Unparseable figure: ${s}`);
  return neg ? -n : n;
}
parseFig("(1,200)"); // -1200
parseFig("48,200"); // 48200
Build a typed line-item array
Drop spacer/section rows, then parse each period column with the one function above.
const rows = JSON.parse(json);
const periods = ["2025", "2024"];
const model = rows
  .filter(r => periods.some(p => r[p] && r[p].trim() !== "")) // has at least one figure
  .map(r => ({
    item: r["Line item"].trim(),
    "2025": parseFig(r["2025"]),
    "2024": parseFig(r["2024"]),
  }));
Cross-foot to validate the extraction
Because the data is now numeric, you can verify the extraction by checking that totals add up — catching a dropped or misread row before it reaches the model.
const by = Object.fromEntries(model.map(r => [r.item, r]));
const gp = by["Revenue"]["2025"] + by["Cost of sales"]["2025"];
if (gp !== by["Gross profit"]["2025"]) {
  console.warn(`Gross profit mismatch: computed ${gp}, stated ${by["Gross profit"]["2025"]}`);
}
Drop repeated headers from a multi-page statement
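A notes section that reprints the period header on each page leaks header rows into the array; remove them where the label cell equals the label column's own name. Do this before parsing: a repeated header row's figure cells would hold the period names themselves (such as "2025"), which would otherwise parse silently as numbers.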
const clean = rows.filter(r => r["Line item"] !== "Line item");
Edge cases and what actually happens
Negative shown in parentheses
Expected: Accounting style writes negatives as (1,234). The tool keeps it as the string "(1,234)"; it does not convert to -1234. This is intentional, because mis-signing a figure corrupts a model silently. Detect the parentheses and negate in your parse step (see the cookbook).
Thousands separators in every figure
Cast carefully: Figures keep their commas ("48,200"). Number() returns NaN on those, so strip separators before converting. For EU-format statements (48.200,00), normalise the dot/comma convention first.
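A minimal sketch of that normalisation, assuming the whole statement uses "." for thousands and "," for decimals. The name `normaliseEu` is illustrative. Applying it to a US-format figure would corrupt the value, so decide the convention once per document rather than guessing per cell.

```javascript
// Converts an EU-format figure string to a US-style decimal string.
// Assumes "." is only ever a thousands separator and "," the decimal mark.
function normaliseEu(s) {
  return s
    .replace(/\./g, "") // drop thousands separators
    .replace(",", "."); // decimal comma becomes decimal point
}

console.log(normaliseEu("48.200,00")); // "48200.00"
console.log(Number(normaliseEu("48.200,00"))); // 48200
```

Parentheses and dashes pass through unchanged, so the output can still go through the same sign-handling parser you use for every other figure.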
Zero printed as a dash
Treat as zero: Many statements print - or an em-dash for a zero or nil value. That extracts as a dash string. Map it to 0 in your parser rather than letting it become NaN.
Section headings and spacer rows
Filter needed: Headings like 'Operating expenses' and blank spacer lines come through as rows with empty figure cells. Filter rows where no period column has a value before loading the model.
Indented sub-items lose their hierarchy
Flattened: Indentation that signals a sub-line under a heading is purely visual; the tool produces a flat array with no parent/child link. Reconstruct hierarchy from the label text or row order in your own code if you need the tree.
Notes-reference column merges with the label
Misaligned: A small 'Note' number column sitting tight against the line-item label can merge into one cell. Inspect the preview; if you see "Revenue 4", split the trailing note number out in post-processing.
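One way to do that split is a regular expression on the label. This sketch assumes note references are one or two digits at the very end of the label, separated by whitespace; the name `splitNoteRef` is illustrative. A label that legitimately ends in a short number would be split incorrectly, so review the results for your statement.

```javascript
// Splits "Revenue 4" into { label: "Revenue", note: "4" }.
// Labels without a trailing 1-2 digit number are returned with note: null.
function splitNoteRef(label) {
  const trimmed = label.trim();
  const m = /^(.*\S)\s+(\d{1,2})$/.exec(trimmed);
  return m ? { label: m[1], note: m[2] } : { label: trimmed, note: null };
}

console.log(splitNoteRef("Revenue 4")); // { label: "Revenue", note: "4" }
console.log(splitNoteRef("Cost of sales")); // { label: "Cost of sales", note: null }
```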
Scanned filing with no text layer
Empty array: Image-only filings (older scanned reports) yield no text and an empty array. Run PDF OCR first, then extract, and verify every figure, because OCR errors on financial digits are high-stakes.
Annual report exceeds the page limit
Blocked: A full annual report can exceed the free 50-page limit. Upgrade to Pro (500 pages) or extract just the statement pages with PDF Extract Pages before running the tool.
XBRL-tagged report
Use a parser: If the filing carries inline XBRL, the tagged numeric facts are more reliable than the rendered table. This tool reads the visual table only; for tagged data, use a dedicated XBRL parser that reads the iXBRL facts directly.
Frequently asked questions
Will negatives in parentheses, like (1,234), convert correctly?
They extract as strings exactly as printed — "(1,234)" — and are not auto-converted to -1234. This is deliberate: silently guessing a sign could corrupt a model. Detect the parentheses and negate in your parse step. The cookbook shows a one-function approach you can review and audit once.
How are thousands separators handled?
Kept as-is in the string ("48,200"). Strip them before Number(). For European formatting (48.200,00), normalise the dot and comma convention first, then parse.
Can I extract several statements from one report at once?
Yes — every page's rows go into one flat array. There's no separate array per statement and no page index, so you'll need to split the P&L, balance sheet, and cash flow apart afterward (by a recognisable boundary row or by extracting the relevant pages first with PDF Extract Pages).
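Splitting by boundary row could look like the sketch below. It assumes each statement's title came through as its own row in the label column, which depends on the report layout; the titles in the example and the name `splitStatements` are hypothetical.

```javascript
// Splits a flat row array into named sections. A new section starts at any
// row whose label exactly matches one of the boundary titles.
// Rows before the first boundary are ignored.
function splitStatements(rows, boundaries, labelKey = "Line item") {
  const sections = {};
  let current = null;
  for (const row of rows) {
    const label = (row[labelKey] ?? "").trim();
    if (boundaries.includes(label)) {
      current = label;
      // Keep earlier rows if the same title repeats on a later page.
      if (!sections[current]) sections[current] = [];
    } else if (current !== null) {
      sections[current].push(row);
    }
  }
  return sections;
}

const sections = splitStatements(
  [
    { "Line item": "Income statement", "2025": "", "2024": "" },
    { "Line item": "Revenue", "2025": "48,200", "2024": "41,050" },
    { "Line item": "Balance sheet", "2025": "", "2024": "" },
    { "Line item": "Cash", "2025": "5,000", "2024": "4,200" },
  ],
  ["Income statement", "Balance sheet"]
);
console.log(Object.keys(sections)); // ["Income statement", "Balance sheet"]
```

If the titles do not appear as rows in your output, extracting the relevant pages first is the more dependable route.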
Does the tool keep multi-year columns aligned?
Yes. Each reporting period in the table becomes its own key (2025, 2024), so each line item carries its periods together. Parse each period column with the same number-parsing function for consistency.
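Rather than hard-coding the period list, you could derive it from the first row's keys. A small sketch, assuming the label column is named "Line item" as in the cookbook sample; the name `detectPeriods` is illustrative.

```javascript
// Returns every key of the first row except the label column.
// Note: JavaScript lists integer-like keys ("2024", "2025") in ascending
// numeric order, regardless of their order in the PDF or the JSON text.
function detectPeriods(rows, labelKey = "Line item") {
  if (rows.length === 0) return [];
  return Object.keys(rows[0]).filter(k => k !== labelKey);
}

const sample = JSON.parse(
  '[{ "Line item": "Revenue", "2025": "48,200", "2024": "41,050" }]'
);
console.log(detectPeriods(sample)); // ["2024", "2025"]
```

Because of that key ordering, do not rely on the array order to tell you which period is the current year; sort explicitly if the order matters.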
How do I remove section headings and blank rows?
They come through as rows with empty figure cells. Filter to rows where at least one period column has a value: rows.filter(r => periods.some(p => r[p]?.trim())). If you want to keep the heading as a category, carry the last seen heading down onto the rows beneath it.
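Carrying the heading down can be sketched as a single pass over the rows. It assumes a heading is any row with a label but no figures, and the name `attachCategories` is illustrative. A heading stays in force until the next one, so a total printed after a section (such as Operating loss in the cookbook sample) inherits that section's category unless you handle totals separately.

```javascript
// Drops spacer and heading rows, tagging each remaining row with the
// most recent heading seen above it (or null before the first heading).
function attachCategories(rows, periods, labelKey = "Line item") {
  let category = null;
  const out = [];
  for (const row of rows) {
    const hasFigure = periods.some(p => (row[p] ?? "").trim() !== "");
    const label = (row[labelKey] ?? "").trim();
    if (!hasFigure) {
      if (label !== "") category = label; // heading row: remember it
      continue; // headings and blank spacers are not emitted
    }
    out.push({ ...row, category });
  }
  return out;
}

const tagged = attachCategories(
  [
    { "Line item": "Revenue", "2025": "48,200", "2024": "41,050" },
    { "Line item": "Operating expenses", "2025": "", "2024": "" },
    { "Line item": "Admin", "2025": "(12,000)", "2024": "(11,200)" },
  ],
  ["2025", "2024"]
);
console.log(tagged.map(r => [r["Line item"], r.category]));
```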
Is my financial data uploaded anywhere?
No. Extraction runs entirely in your browser using PDF.js. Pre-release results, management accounts, and any other sensitive figures never leave your device; only anonymous usage counters are recorded when you're signed in.
Can I verify the extraction was accurate?
Yes — once parsed to numbers, cross-foot the statement: check that revenue plus cost of sales equals gross profit, etc. A mismatch flags a dropped or misread row before it reaches your model. The cookbook includes a short cross-foot check.
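The same check can be run over every period at once. This is a sketch under assumptions: the rule shape (`total` plus `parts`) and the name `crossFoot` are hypothetical, and the tolerance exists only because statements with decimal figures can pick up floating-point error when summed.

```javascript
// Checks that the named parts sum to the named total in every period.
// Returns a list of mismatches; an empty list means the rule holds.
function crossFoot(byItem, periods, rule, tolerance = 0.005) {
  const issues = [];
  for (const p of periods) {
    const computed = rule.parts.reduce((sum, name) => sum + byItem[name][p], 0);
    const stated = byItem[rule.total][p];
    if (Math.abs(computed - stated) > tolerance) {
      issues.push({ total: rule.total, period: p, computed, stated });
    }
  }
  return issues;
}

// Parsed figures from the cookbook sample (costs already negative).
const byItem = {
  "Revenue": { "2025": 48200, "2024": 41050 },
  "Cost of sales": { "2025": -29400, "2024": -25100 },
  "Gross profit": { "2025": 18800, "2024": 15950 },
};
const grossProfitRule = { total: "Gross profit", parts: ["Revenue", "Cost of sales"] };
console.log(crossFoot(byItem, ["2025", "2024"], grossProfitRule)); // []
```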
What about figures shown as a dash for nil?
A - or em-dash for a zero value extracts as that dash string. Map it to 0 in your parser so it doesn't become NaN. The example parser in the cookbook handles this case.
Is this suitable for XBRL-tagged annual reports?
For the most reliable numbers from an XBRL/iXBRL filing, use a dedicated XBRL parser that reads the tagged facts. This tool extracts from the rendered visual table, which is fine for un-tagged reports or a quick pull, but the tagged data is the authoritative source when it exists.
I need a spreadsheet, not JSON — what should I use?
Use PDF to Excel for CSV output you can open in Excel or Google Sheets. It runs the same row/column detection. Pick JSON when feeding a model or API programmatically; CSV when an analyst will work the numbers by hand.
What are the limits for large reports?
Free: 2 MB and 50 pages. Pro: 50 MB / 500 pages. Pro + Media: 500 MB / 2,000 pages. Developer: 2 GB / 10,000 pages. For a long annual report, either upgrade or extract just the statement pages first.
Why did a sub-line lose its indentation/grouping?
Indentation is visual only; the output is a flat array with no parent/child structure. Rebuild the hierarchy from the label text or row order in your own code if your model needs the tree of headings and sub-lines.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.