How to extract tables from a PDF into a spreadsheet
- Step 1: Confirm the PDF has selectable text. Open the PDF and try to select a value inside a table with your cursor. If the text highlights, the PDF has a text layer and will extract. If you can only draw a box (the page is an image), run OCR first to add a text layer, then come back.
- Step 2: Drop the PDF onto the converter. Add the file to the tool above. There are no options to set; extraction starts automatically as soon as the file is read. Parsing happens entirely in your browser.
- Step 3: Review the preview. The result panel shows the first ~5,000 characters of the CSV in a scrollable box. Scan it to confirm the columns line up and the rows you expected are present before downloading.
- Step 4: Download the CSV file. Click Download. The file saves with a .txt extension and CSV-formatted contents (every cell quoted, comma-separated). Rename it to .csv if your importer keys off the extension.
- Step 5: Open or import into your spreadsheet. In Excel, use Data → From Text/CSV (don't just double-click, so Excel doesn't coerce IDs or dates). In Google Sheets, use File → Import → Upload and choose comma as the separator.
- Step 6: Clean up ragged rows. Page headers, footers, and titles extract as their own short rows because the tool treats every line of text on the page as a row. Delete those non-data rows and re-align any column that drifted, then save.
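If you would rather script steps 4 and 5 than do them by hand, the sketch below shows one way. It assumes the download is UTF-8 text, which the page does not state, and the function names `copy_as_csv` and `load_rows` are hypothetical helpers invented for this example.

```python
import csv
import shutil
from pathlib import Path


def copy_as_csv(downloaded: Path) -> Path:
    """Copy the downloaded .txt file next to itself with a .csv extension."""
    target = downloaded.with_suffix(".csv")
    shutil.copyfile(downloaded, target)
    return target


def load_rows(path: Path) -> list[list[str]]:
    """Parse the CSV contents, skipping the blank lines that separate pages."""
    # Assumption: the file is UTF-8. Adjust the encoding if yours differs.
    with path.open(newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle) if row]
```

Every value comes back as a string, so IDs and dates stay exactly as printed until you choose to convert them.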
What the tool produces vs. what people assume
The name says Excel; the output is CSV text. Knowing the real shape avoids surprises on download.
| Expectation | Actual behaviour |
|---|---|
| Native .xlsx workbook | No — output is CSV-formatted text. Every cell is double-quoted, values comma-separated, rows newline-separated |
| Multiple sheets (one per table) | No — all pages go into one CSV stream; a blank line separates each page's block of rows |
| Download named .xlsx or .csv | Downloads as <filename>.txt with CSV contents. Rename to .csv if your importer needs the extension |
| Cell formatting / colours / fonts | Not preserved — CSV carries values only, no styling |
| Formulas restored | No — PDFs store calculated values, not formulas. Only the printed numbers are extracted |
| Configurable column count / page range | No options exist — the tool extracts all pages with no settings |
How a page becomes rows and columns
The exact extraction pipeline, so you can predict the output for a given layout.
| Step | What happens |
|---|---|
| 1. Read text fragments | pdf.js returns every text run on the page with its position (an x/y transform) |
| 2. Group into rows | Fragments are bucketed by their Y-coordinate (rounded to the nearest point). Each Y-bucket becomes one CSV row |
| 3. Order columns | Within a row, fragments are sorted by X-coordinate (left to right) — that order becomes the column order |
| 4. Order rows | Rows are emitted top-to-bottom (highest Y first), matching reading order |
| 5. Emit CSV | Each cell is wrapped in quotes (internal quotes doubled) and cells are joined by commas. Only pages with 2+ rows are kept, and kept pages are separated by a blank line |
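The five steps in the table can be sketched in a few lines. This is an illustration of the position-based technique, not the tool's actual source. The `Fragment` tuple is a hypothetical stand-in for the text items pdf.js returns, and all function names are invented for the example.

```python
from collections import defaultdict

# Hypothetical fragment shape for this sketch: (text, x, y) in PDF points,
# with y growing upward from the bottom of the page.
Fragment = tuple[str, float, float]


def quote(cell: str) -> str:
    """Wrap a cell in double quotes, doubling any quotes inside it."""
    return '"' + cell.replace('"', '""') + '"'


def page_rows(fragments: list[Fragment]) -> list[str]:
    """Steps 2 to 4: bucket by rounded Y, sort each row by X, emit top to bottom."""
    buckets: dict[int, list[tuple[float, str]]] = defaultdict(list)
    for text, x, y in fragments:
        buckets[round(y)].append((x, text))
    lines = []
    for y in sorted(buckets, reverse=True):  # highest Y first
        cells = [text for _, text in sorted(buckets[y], key=lambda c: c[0])]
        lines.append(",".join(quote(c) for c in cells))
    return lines


def pages_to_csv(pages: list[list[Fragment]]) -> str:
    """Step 5: keep pages with 2+ rows and separate them with a blank line."""
    blocks = [page_rows(p) for p in pages]
    return "\n\n".join("\n".join(b) for b in blocks if len(b) >= 2)
```

Because rows are keyed only by the rounded Y value, nothing in this sketch knows about a table grid. That is why titles, footers, and wrapped lines all come out as rows of their own.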
File-size and page limits by tier
On the free tier, the dropzone blocks oversize files before any processing starts. Larger PDFs need a paid tier.
| Tier | Max file size | Max pages |
|---|---|---|
| Free | 2 MB | 50 pages |
| Pro | 50 MB | 500 pages |
| Pro + Media | 500 MB | 2,000 pages |
| Developer | 2 GB | 10,000 pages |
| Enterprise | Unlimited | Unlimited |
Cookbook
Real extractions showing exactly what comes out for a given table layout. Output is shown verbatim — note the quoting and the blank line between pages.
A clean single-page table
A born-digital table with evenly aligned columns is the best case. Each printed row becomes one CSV row; each value lands in its column.
PDF page (Order summary):
    SKU    Item     Qty  Price
    A-100  Widget   2    9.99
    A-205  Bracket  10   1.50

CSV output:
    "SKU","Item","Qty","Price"
    "A-100","Widget","2","9.99"
    "A-205","Bracket","10","1.50"
A two-page table
Each page is processed independently and its rows appended, separated by a blank line. The header repeats if it repeats in the PDF. There is no single merged sheet — you reconcile the two blocks after import.
CSV output:
    "Date","Description","Amount"
    "01/03","Opening balance","1200.00"
    "02/03","Invoice 4471","-340.00"

    "Date","Description","Amount"
    "15/03","Invoice 4480","-90.00"
    "28/03","Refund","45.00"

(blank line between page 1 and page 2 blocks)
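Reconciling the page blocks can be done in a spreadsheet, or with a short script like the sketch below. It relies only on the blank-line separator described above. The helper names `split_pages` and `merge_pages` are hypothetical, and the merge assumes a repeated header is an exact copy of the first page's header row.

```python
import csv
import io


def split_pages(csv_text: str) -> list[list[list[str]]]:
    """Return one list of rows per page, using blank lines as page separators."""
    pages, current = [], []
    for row in csv.reader(io.StringIO(csv_text, newline="")):
        if row:
            current.append(row)
        elif current:  # a blank line closes the current page block
            pages.append(current)
            current = []
    if current:
        pages.append(current)
    return pages


def merge_pages(pages: list[list[list[str]]]) -> list[list[str]]:
    """Concatenate page blocks, keeping the header once."""
    if not pages:
        return []
    header = pages[0][0]
    merged = [header]
    for block in pages:
        # Drop the first row only when it repeats the header exactly.
        merged.extend(block[1:] if block[0] == header else block)
    return merged
```

If a later page starts with data instead of a repeated header, its first row is kept as data.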
Page title and footer extracted as stray rows
The tool treats every line of text on the page as a row — including the report title and the page-number footer. These appear as short rows you delete after import.
CSV output:
    "Q3 Sales Report — Confidential"
    "Region","Revenue","Units"
    "North","42000","310"
    "South","38500","288"
    "Page 1 of 4"

→ delete the title row and the footer row in your sheet.
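For larger files, the same cleanup can be approximated in code. The sketch below keeps only rows whose cell count matches the most common cell count. This heuristic is an assumption of the example, not something the tool does, and `drop_stray_rows` is a hypothetical name.

```python
from collections import Counter


def drop_stray_rows(rows: list[list[str]]) -> list[list[str]]:
    """Keep only rows whose cell count matches the most common cell count."""
    if not rows:
        return []
    width, _ = Counter(len(row) for row in rows).most_common(1)[0]
    return [row for row in rows if len(row) == width]
```

The filter also removes genuine data rows that came out short because of an empty or merged cell, so review what it discards before relying on the result.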
Quotes inside a cell are escaped
A value containing a double quote is preserved by doubling the quote, per CSV convention, so spreadsheets parse it correctly.
PDF cell value:
    6" pipe fitting

CSV output:
    "P-77","6"" pipe fitting","3.20"

Excel / Sheets display:
    6" pipe fitting (correct)
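The doubling convention is the same one Python's standard `csv` module uses, so a round trip is easy to check. The sketch below uses that module as a stand-in for the tool's own writer. `encode_row` and `decode_row` are hypothetical helper names.

```python
import csv
import io


def encode_row(cells: list[str]) -> str:
    """Quote every cell and double any embedded quotes, as the tool's output does."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(cells)
    return buffer.getvalue().rstrip("\n")


def decode_row(line: str) -> list[str]:
    """Parse one CSV line back into its cell values."""
    return next(csv.reader([line]))
```

Encoding `["P-77", '6" pipe fitting', "3.20"]` gives the CSV line shown above, and decoding that line returns the original three values.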
A scanned table yields nothing
If the PDF is an image with no text layer, pdf.js finds no text fragments, so no rows are produced. The fix is OCR, then re-run extraction.
Input: scan_of_invoice.pdf (photo of a printed table)
Output: (empty, no selectable text on the page)
Fix:
    1. /pdf-tools/pdf-ocr → adds text layer
    2. re-run this tool on the OCR'd PDF → rows appear
Edge cases and what actually happens
Scanned / image-only PDF (no text layer)
No text found: pdf.js extracts text fragments, not pixels. A scanned page is an image, so there is nothing to group into rows and the output is empty. Run OCR to add a text layer first, then re-run this tool. For data extraction specifically, OCR PDF for data extraction is the matching workflow.
Columns drift on rows with merged or blank cells
Manual fixup: Columns are reconstructed purely from each fragment's X-position within its own row; there is no fixed column grid shared across rows. When a cell is empty or two cells are merged, that row has fewer fragments, so later values shift left and land in the wrong column. Re-align those rows in your spreadsheet after import.
A wrapped cell splits into two rows
By design: Rows are grouped by Y-position. A cell whose text wraps onto a second visual line sits at a different Y, so it becomes a separate CSV row beneath the first. Merge the two rows manually, or widen the column at the source before exporting the PDF.
Text not laid out as a grid still produces rows
Noisy output: The tool keeps any page with two or more text rows; it doesn't verify that the page actually contains a table. A page of prose produces one CSV row per line of text. If you only want narrative text, use extract text from PDF instead.
Multi-page table looks like separate tables
Expected: Each page is processed on its own and appended with a blank-line separator; there is no logic that stitches a table continuing across a page break into one block. Concatenate the blocks in your spreadsheet (and delete any repeated header rows) after import.
Numbers and IDs may be coerced by Excel
Excel coercion: The CSV carries every value as text exactly as printed, but Excel applies its own formatting on open, so leading zeros drop from codes like 00734 and long numbers can show in scientific notation. Import via Data → From Text/CSV and set those columns to Text to preserve them.
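The same hazard applies when you load the file in code: converting every column to a number loses leading zeros. The sketch below converts only the columns you name and leaves the rest as text. It assumes a single clean block with one header row and no ragged rows, and `read_typed` is a hypothetical helper.

```python
import csv
import io
from decimal import Decimal


def read_typed(csv_text: str, numeric_columns: set[str]) -> list[dict]:
    """Parse rows into dicts, converting only the named columns to Decimal."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [
        {
            key: Decimal(value) if key in numeric_columns else value
            for key, value in row.items()
        }
        for row in reader
    ]
```

`Decimal` is used here instead of `float` so that an amount such as -340.00 keeps its printed precision.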
Downloaded file has a .txt extension
By design: The output is CSV-formatted but downloads as <filename>.txt. Excel and Sheets can import it as-is; rename it to .csv if a downstream importer keys off the file extension.
PDF exceeds the free 2 MB / 50-page limit
Blocked on free tier: The free tier accepts up to 2 MB and 50 pages; the dropzone blocks larger files before any processing. Pro raises this to 50 MB / 500 pages, with higher ceilings above. Alternatively, split the PDF first with a page-extraction tool and process the part you need.
Frequently asked questions
Does this produce a real Excel (.xlsx) file?
No. Despite the name, the output is CSV-formatted text — every cell double-quoted, values comma-separated, a blank line between pages. It downloads as a .txt file. CSV opens natively in Excel, Google Sheets, and Numbers, so you get a spreadsheet immediately; it just isn't a native .xlsx workbook with sheets and formatting.
How does it know where the columns are?
It uses each text fragment's position. Fragments on the same vertical line (Y-coordinate) become one row; within that row they're ordered left to right by horizontal position (X-coordinate) to form columns. There's no machine-learning table model — it's a deterministic position-based reconstruction, which is why clean, grid-aligned tables extract best.
Will it work on a scanned PDF?
Not directly. A scan is an image with no selectable text, so there's nothing to extract and you'll get an empty result. Run OCR to add a text layer first, then re-run this tool. OCR PDF for data extraction is purpose-built for that pipeline.
Are formulas recovered into Excel?
No — PDFs store only the calculated results, never the underlying formulas. You get the printed numbers. If you need live formulas, re-create them in your spreadsheet after importing the values.
Can I pick which page or which table to extract?
No — there are no options. The tool processes every page and outputs all of them. To narrow the input, extract the pages you want first with a tool like extract a single page from a PDF, then run the extraction on that subset.
Why are there extra short rows in my output?
The tool treats every line of text on a page as a row, including titles, headers, and page-number footers. Those appear as their own short rows. Delete the non-data rows in your spreadsheet after import — it's a quick cleanup pass.
My columns are misaligned on some rows — why?
Columns are rebuilt from positions within each individual row, not from a shared grid. A row with an empty cell or a merged cell has fewer fragments, so the remaining values shift. Re-align those specific rows after import, or fix the layout at the source before exporting the PDF.
Is my document uploaded anywhere?
No. Parsing and extraction run entirely in your browser using pdf.js. The PDF never leaves your device; only anonymous usage counters are recorded when you're signed in.
What's the difference between this and PDF table to JSON?
Same position-based detection, different output. This tool emits CSV rows; extract a PDF table to JSON emits an array of objects using the first row as keys — better for feeding an API or a data pipeline. Choose CSV for spreadsheets, JSON for code.
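The "first row as keys" idea is easy to see with this tool's CSV output as the starting point. The sketch below is an illustration in Python, not the JSON tool's implementation, and `rows_to_json` is a hypothetical name.

```python
import csv
import io
import json


def rows_to_json(csv_text: str) -> str:
    """Turn CSV rows into a JSON array of objects keyed by the first row."""
    rows = [row for row in csv.reader(io.StringIO(csv_text, newline="")) if row]
    if not rows:
        return "[]"
    header, body = rows[0], rows[1:]
    # zip stops at the shorter sequence, so a short row simply lacks the trailing keys.
    objects = [dict(zip(header, row)) for row in body]
    return json.dumps(objects, indent=2)
```

All values stay strings in this sketch. A multi-page input would also turn any repeated header row into an object, so remove repeated headers first if your PDF has them.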
How big a PDF can I process?
The free tier allows up to 2 MB and 50 pages. Pro raises that to 50 MB / 500 pages, Pro + Media to 500 MB / 2,000 pages, and Developer to 2 GB / 10,000 pages. The dropzone blocks oversize files before processing rather than failing midway.
How do I avoid Excel mangling my numbers and codes?
Don't double-click the file — import it via Data → From Text/CSV and mark code/ID columns as Text. That stops Excel dropping leading zeros from values like 00734 and showing long numbers in scientific notation. In Google Sheets, untick "Convert text to numbers, dates, and formulas" during import.
Can I get plain text or Word instead of a table?
Yes. If the document is prose rather than a grid, extract text from the PDF or convert the PDF to editable Word. Use this table tool only when the content is genuinely tabular.
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.