How to extract financial report tables from a pdf into a spreadsheet
- Step 1Confirm the report has a text layer — Select a figure inside a statement table with your cursor. If text highlights, the report is born-digital and extracts well. If it's a scanned filing, run OCR first to add a text layer.
- Step 2Drop the report PDF onto the converter — Add the file above. There are no options — extraction runs as soon as the file is read, and parsing happens in your browser so the financials never leave your device.
- Step 3Review the extracted statements — The preview shows the first ~5,000 characters of CSV. Confirm the statement rows and the period columns came through, and note where headings and footnotes landed.
- Step 4Download the CSV — Click Download — the file saves as
.txtwith CSV contents. Rename to.csvif your importer needs it. - Step 5Import into your model — Use Excel's Data → From Text/CSV so figures aren't coerced. Map the period columns and line-item rows into the relevant statement tabs of your model.
- Step 6Convert negatives and remove non-data rows — Convert parenthetical negatives
(123)to numeric-123, strip currency symbols and thousands separators, and delete heading/footnote rows so the figures calculate in your model.
Financial-statement specifics
How common financial-report conventions come through the extraction.
| Convention | How it extracts |
|---|---|
Parenthetical negatives (1,234) | Kept as text — not converted to -1234. Reformat in your model |
Thousands separators 1,234,567 | Kept as text with commas — strip them so the figure is numeric |
| Currency / unit symbols (£, $, €, 'm) | Kept with the value as text — reformat the column after import |
| Footnote markers (¹, *, (a)) | Extract attached to the cell or as separate short rows — filter out |
| Section headings (e.g. 'Operating expenses') | Extract as their own one-cell rows above the figures |
| Multi-period columns (FY24, FY25) | Preserved as adjacent columns when aligned in the source |
What the output is (and isn't)
Set expectations so the extraction fits a modelling workflow.
| Aspect | Reality |
|---|---|
| File format | CSV-formatted text, downloaded as .txt (rename to .csv) |
| Sheets | None — all pages in one CSV stream, blank line between pages |
| Number typing | Everything is text; you convert to numbers in your model |
| Statement labelling | None — rows returned by position, no "this is the balance sheet" tagging |
| Formulas | Not recoverable — PDFs store results, not formulas |
File-size and page limits by tier
Annual reports can be large; check the tier limit before uploading.
| Tier | Max file size | Max pages |
|---|---|---|
| Free | 2 MB | 50 pages |
| Pro | 50 MB | 500 pages |
| Pro + Media | 500 MB | 2,000 pages |
| Developer | 2 GB | 10,000 pages |
| Enterprise | Unlimited | Unlimited |
Cookbook
Real financial-statement extractions. Output is shown verbatim — note how negatives, separators, and headings come through as text.
A two-period P&L
When period columns are aligned in the source, they reconstruct as adjacent CSV columns. Negatives stay parenthetical text.
CSV output: "","FY25","FY24" "Revenue","48,200","41,900" "Cost of sales","(29,100)","(25,400)" "Gross profit","19,100","16,500" "Operating expenses","(12,300)","(11,200)" "Operating profit","6,800","5,300"
Parenthetical negatives need converting
Accounting-style negatives extract as text. Convert them in Excel so the figures calculate.
Extracted cell: (29,100)
Excel formula to convert a column of these:
=-1*VALUE(SUBSTITUTE(SUBSTITUTE(A2,"(",""),")",""))
→ -29100 (after also removing the comma)Section headings extract as their own rows
A statement subheading sits on its own line, so it becomes a one-cell row above the figures it groups. Keep it as a label or delete it.
CSV output: "Non-current assets" "Property, plant & equipment","34,500" "Intangible assets","8,200" "Current assets" "Inventory","6,100" "Trade receivables","9,400"
Footnote markers attach to cells
Footnote references extract as part of the cell or as a trailing row. Strip them so the figure is clean.
CSV output: "Deferred tax","1,240 (note 14)" "Pension obligation","3,880 *" → remove " (note 14)" and " *" before using the figures.
A scanned filing yields nothing
Older filings scanned to PDF have no text layer. OCR first, then extract.
Input: 2019_annual_report_scan.pdf
Output: (empty)
Fix:
1. /pdf-tools/pdf-ocr
2. re-run this tool → statement rows extract
(verify OCR'd digits carefully on financials)Edge cases and what actually happens
Parenthetical negatives stay as text
Manual fixupThe tool extracts (1,234) exactly as printed and does not convert it to -1234. Excel treats it as text. Convert the column with SUBSTITUTE/VALUE or Find & Replace before the figures feed any calculation in your model.
Thousands separators block numeric use
Manual fixupFigures like 1,234,567 keep their commas as text, so Excel won't sum them until you remove the separators. Strip commas (and currency symbols) and format the column as Number after import.
Footnotes and headings become extra rows
Noisy outputSection headings extract as one-cell rows and footnote markers attach to cells or trail as rows — the tool returns every line of text by position with no statement structure. Filter these out before the data is model-ready.
Scanned or image-based filing
No text foundImage-only filings have no selectable text, so extraction returns nothing. OCR adds a text layer — but verify OCR'd digits carefully on financial figures, where a misread 8/3 materially changes the number.
Misaligned columns where a period is blank
Manual fixupColumns are rebuilt per row from text positions, not a shared grid. A line with a blank period cell has fewer fragments, so later figures shift left into the wrong period column. Re-align those rows after import.
Statement continues across a page break
ExpectedEach page is processed independently and appended with a blank line — there's no stitching of a statement that runs across pages. Concatenate the blocks in your model and remove any repeated header rows.
Large annual report exceeds the free limit
Blocked on free tierFree tier caps at 2 MB / 50 pages, which a full annual report often exceeds. Either upgrade (Pro is 50 MB / 500 pages), or extract only the statement pages first and run the extraction on that subset.
Figures coerced by Excel on open
Excel coercionExcel reformats on a double-click open — long figures can show in scientific notation. Import via Data → From Text/CSV and control column types to keep figures exact for the model.
Frequently asked questions
Does this give me a real Excel workbook?
No — it outputs CSV-formatted text that downloads as a .txt file. CSV imports straight into Excel or Google Sheets, so you get the figures in a spreadsheet; it just isn't a native .xlsx with separate statement tabs or formatting. Build your model structure on the imported data.
Are negative numbers in parentheses recognised?
They're extracted exactly as printed — (1,234) comes through as text, not as -1234. Convert the column in Excel (SUBSTITUTE to strip the brackets and comma, then negate) so the figures calculate. This is the most common post-import step on financial statements.
What happens to currency symbols and thousands separators?
They're kept as part of the cell text (£1,234,567), which means Excel treats the value as text. Strip the symbol and commas with Find & Replace, then format the column as Number or Currency so it sums.
Can I extract footnotes with the tables?
Footnote markers and footnote text extract as part of cells or as separate rows, because the tool returns every line of text on the page by position. Filter or delete them after import so they don't pollute the figures.
Will it label which table is the P&L vs the balance sheet?
No — there's no statement-recognition logic. It returns the tables' text by position, page by page. Section headings extract as rows, which helps you orient, but you assign each block to the right statement tab during import.
Does it work on scanned filings?
Not directly — scans have no text layer and yield nothing. Run OCR first. On financials, double-check OCR'd digits: a single misread character changes the number, so verification matters more here than on prose.
Are formulas preserved?
No — PDFs store the printed results, never the formulas behind them. You get the reported figures; rebuild any calculations in your model.
Are my financials uploaded anywhere?
No. Extraction runs entirely in your browser via pdf.js — sensitive or pre-release figures never reach a server. Only anonymous usage counters are recorded when you're signed in, which matters when you're handling material non-public information.
Why are some period columns misaligned?
Columns are rebuilt from text positions per row, not from a fixed grid. A row with a blank period cell has fewer fragments, so the remaining figures shift into the wrong column. Re-align those specific rows after import.
My annual report is too big for the free tier — what now?
Free tier caps at 2 MB / 50 pages. Either upgrade (Pro is 50 MB / 500 pages, higher tiers above), or extract just the statement pages first and run the extraction on those few pages.
Can I get JSON for a data pipeline instead of a spreadsheet?
Yes — extract a financial table to JSON returns structured objects, which is easier to feed into a database or analysis script than CSV. Use CSV for modelling in Excel, JSON for code.
How do I keep the extraction clean for repeated reports?
On a recurring report (e.g. monthly management accounts) the layout is stable, so the heading/footnote rows land in the same places each time. Save your post-import cleanup steps (column conversions, row deletions) as an Excel macro or a Sheets script and reapply them each period.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.