How to generate a structured overview of a long PDF document
- Step 1: Open the PDF Summary Generator. Go to the PDF Summary Generator. Everything runs locally in your browser, and no account is required.
- Step 2: Drop in one long PDF. Drag a single PDF onto the dropzone (the tool takes one file at a time). It reads the page count immediately and auto-runs the overview; there is no separate Generate button and no options panel to fill in.
- Step 3: Confirm the document has a text layer. The report relies on extractable text. If pages show (No text content), the PDF is scanned or image-only: run PDF OCR first to add a searchable text layer, then re-run the summary.
- Step 4: Read the header statistics. The top of the report gives Pages, Word Count (locale-formatted), and Estimated Reading Time in minutes. Reading time is words ÷ 250, rounded up, which gives a quick proxy for how heavy the document is.
- Step 5: Scan the Page-by-Page Overview. Under ## Page-by-Page Overview, each ### Page N shows the first ~200 characters of that page. Use it to spot where the executive summary, methodology, appendices, or signature pages start before you open the file.
- Step 6: Download the Markdown report. Click Download to save it as <your-file>.md. The on-screen preview is capped at 5,000 characters, but the downloaded file contains every page's preview in full.
What the generated report contains
Exact structure emitted by generateSummary() in lib/pdf/pdf-text-extract.ts. The report is plain Markdown; nothing is paraphrased.
| Section | Markdown produced | How it's computed |
|---|---|---|
| Title | # PDF Summary | Fixed heading on every report |
| Pages | **Pages:** N | Count of pages pdf.js could open (pages.length) |
| Word Count | **Word Count:** 12,345 | fullText.split(/\s+/).length, locale-formatted with thousands separators |
| Reading time | **Estimated Reading Time:** 50 min | Math.ceil(wordCount / 250), a flat 250-words-per-minute assumption (12,345 words gives 50 min) |
| Per-page overview | ### Page N then first ~200 chars + ... | First 200 characters of each page's text, with runs of whitespace collapsed to single spaces and trimmed |
| Empty page | (No text content) | Printed when a page yields no extractable text (image-only / scanned page) |
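The rules in the table can be sketched in a few lines of TypeScript. This is a minimal illustration, not the actual generateSummary() source: the function name buildSummarySketch, the input shape (one already-extracted string per page), joining pages with a newline to form fullText, the en-US locale, the blank lines between sections, and appending "..." to every non-empty preview are all assumptions.

```typescript
// Hypothetical re-creation of the report rules listed in the table.
// Input: one already-extracted text string per page (assumption).
function buildSummarySketch(pages: string[]): string {
  // Assumption: the full text is the page texts joined with newlines.
  const fullText = pages.join("\n");
  // Whitespace split, as in the table. An empty string still yields one
  // element, so this count never drops to 0.
  const wordCount = fullText.split(/\s+/).length;
  const readingMinutes = Math.ceil(wordCount / 250);

  const lines: string[] = [
    "# PDF Summary",
    "",
    `**Pages:** ${pages.length}`,
    `**Word Count:** ${wordCount.toLocaleString("en-US")}`,
    `**Estimated Reading Time:** ${readingMinutes} min`,
    "",
    "## Page-by-Page Overview",
    "",
  ];

  pages.forEach((text, index) => {
    // Collapse runs of whitespace to single spaces, then trim.
    const collapsed = text.replace(/\s+/g, " ").trim();
    lines.push(`### Page ${index + 1}`);
    lines.push(
      collapsed.length > 0 ? collapsed.slice(0, 200) + "..." : "(No text content)"
    );
    lines.push("");
  });

  return lines.join("\n");
}

// Usage: two pages, the second with no extractable text.
console.log(buildSummarySketch(["Annual   Report\n2025", ""]));
```

Because the count is a plain whitespace split, leading or trailing whitespace in the extracted text adds an empty element to the result, which is one reason the figure is best read as a relative gauge.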
Tier limits for the summary tool
From lib/tier-limits.ts (PDF family). Page count is checked when the file is added; an over-limit PDF is blocked before the overview runs.
| Tier | Max file size | Max pages | Files per run |
|---|---|---|---|
| Free | 2 MB | 50 pages | 1 |
| Pro | 50 MB | 500 pages | 5 (this tool runs one PDF at a time) |
| Pro + Media | 500 MB | 2,000 pages | 1 at a time |
| Developer | 2 GB | 10,000 pages | 1 at a time |
| Enterprise | Unlimited | Unlimited | 1 at a time |
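As a rough illustration of how a limit check like this could work, the sketch below encodes the table and returns a blocking message or null. It is not the contents of lib/tier-limits.ts: the names Tier, TIER_LIMITS, and checkTierLimits are hypothetical, binary megabytes are assumed, the size check is assumed to run before the page check, and the wording of the size message is invented for the example.

```typescript
// Hypothetical encoding of the tier table; not the real lib/tier-limits.ts.
type Tier = "free" | "pro" | "proMedia" | "developer" | "enterprise";

interface TierLimit {
  label: string;
  maxBytes: number; // Infinity means unlimited
  maxPages: number; // Infinity means unlimited
}

const MB = 1024 * 1024; // assumption: binary megabytes
const GB = 1024 * MB;

const TIER_LIMITS: Record<Tier, TierLimit> = {
  free: { label: "Free", maxBytes: 2 * MB, maxPages: 50 },
  pro: { label: "Pro", maxBytes: 50 * MB, maxPages: 500 },
  proMedia: { label: "Pro + Media", maxBytes: 500 * MB, maxPages: 2000 },
  developer: { label: "Developer", maxBytes: 2 * GB, maxPages: 10000 },
  enterprise: { label: "Enterprise", maxBytes: Infinity, maxPages: Infinity },
};

// Returns a blocking message, or null when the file is within limits.
function checkTierLimits(
  tier: Tier,
  sizeBytes: number,
  pageCount: number
): string | null {
  const limit = TIER_LIMITS[tier];
  if (sizeBytes > limit.maxBytes) {
    // Wording of this message is an assumption.
    return `This PDF is too large. ${limit.label} handles up to ${limit.maxBytes / MB} MB.`;
  }
  if (pageCount > limit.maxPages) {
    return `This PDF has ${pageCount} pages. ${limit.label} handles up to ${limit.maxPages} pages.`;
  }
  return null;
}

// Usage: a 180-page, 1 MB report is blocked on Free but passes on Pro.
console.log(checkTierLimits("free", 1 * MB, 180));
console.log(checkTierLimits("pro", 1 * MB, 180));
```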
Cookbook
Real overviews from long documents. The report is deterministic — these are exactly what the tool emits, abbreviated for space.
Triaging a 180-page annual report
You have five vendor annual reports to skim. The overview tells you the size of each and where the financials start before you open one.
    # PDF Summary
    **Pages:** 180
    **Word Count:** 64,210
    **Estimated Reading Time:** 257 min
    ## Page-by-Page Overview
    ### Page 1
    Annual Report 2025 Acme Holdings plc Building resilient growth across...
    ### Page 12
    Chief Executive's Review The year under review was defined by margin...
    ### Page 58
    Consolidated Statement of Financial Position As at 31 December 2025...
Word count and reading time as a density check
Two PDFs with the same page count can differ wildly in density. Word count plus the 250-wpm reading-time estimate tells you which is the heavy read.
Report A (slides, sparse):
    **Pages:** 40
    **Word Count:** 3,100
    **Estimated Reading Time:** 13 min
Report B (dense prose):
    **Pages:** 40
    **Word Count:** 21,800
    **Estimated Reading Time:** 88 min
Same 40 pages, but Report B is ~7x the read.
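The comparison can be reproduced from the two header statistics alone. The sketch below is illustrative only: the HeaderStats type and the densityOf helper are hypothetical names, and words per page is an extra figure derived here, not something the tool prints.

```typescript
// Hypothetical helper: compare documents using only the header statistics.
interface HeaderStats {
  pages: number;
  wordCount: number;
}

function densityOf(stats: HeaderStats): { wordsPerPage: number; readingMinutes: number } {
  return {
    // Derived figure, not part of the report itself.
    wordsPerPage: Math.round(stats.wordCount / stats.pages),
    // Same flat 250-words-per-minute rule the report uses.
    readingMinutes: Math.ceil(stats.wordCount / 250),
  };
}

const reportA = densityOf({ pages: 40, wordCount: 3100 });
const reportB = densityOf({ pages: 40, wordCount: 21800 });

console.log(reportA.readingMinutes); // 13
console.log(reportB.readingMinutes); // 88
// Ratio of the two word counts, to one decimal place.
console.log((21800 / 3100).toFixed(1)); // "7.0"
```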
A scanned document returns no text
An image-only PDF (a photographed report) has no text layer, so every page preview is empty. The fix is to OCR it first.
    # PDF Summary
    **Pages:** 24
    **Word Count:** 24
    **Estimated Reading Time:** 1 min
    ## Page-by-Page Overview
    ### Page 1
    (No text content)
    ### Page 2
    (No text content)
→ Run PDF OCR first, then re-summarise.
Finding the appendix boundary fast
You only need the appendices of a long policy document. The per-page previews let you jump straight to the right page without scrolling the whole file.
    ### Page 1
    Data Protection Policy v6 Effective 1 January 2026 This policy sets...
    ### Page 41
    Appendix A: Records Retention Schedule Category Retention period...
    ### Page 47
    Appendix B: Subject Access Request Workflow On receipt of a request...
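For a long document, the same lookup can be scripted against the downloaded .md file. The following is a sketch under stated assumptions: parsePreviews and findPages are hypothetical helpers, not part of the tool, and they assume each ### Page N heading sits on its own line with the preview on the line or lines that follow.

```typescript
// Hypothetical helpers for scanning a downloaded report; not part of the tool.
interface PagePreview {
  page: number;
  preview: string;
}

function parsePreviews(report: string): PagePreview[] {
  const previews: PagePreview[] = [];
  let current: PagePreview | null = null;
  report.split(/\r?\n/).forEach((line) => {
    const heading = /^### Page (\d+)\s*$/.exec(line);
    if (heading) {
      // Start a new page entry at each "### Page N" heading.
      current = { page: parseInt(heading[1], 10), preview: "" };
      previews.push(current);
    } else if (current && line.trim() !== "") {
      // Any non-empty line after a heading belongs to that page's preview.
      current.preview = (current.preview + " " + line.trim()).trim();
    }
  });
  return previews;
}

// Pass a regex without the "g" flag, because test() is stateful on global regexes.
function findPages(report: string, pattern: RegExp): number[] {
  return parsePreviews(report)
    .filter((p) => pattern.test(p.preview))
    .map((p) => p.page);
}

// Usage with a shortened report in the shape shown above.
const policyReport = [
  "## Page-by-Page Overview",
  "### Page 1",
  "Data Protection Policy v6 Effective 1 January 2026 This policy sets...",
  "### Page 41",
  "Appendix A: Records Retention Schedule Category Retention period...",
  "### Page 47",
  "Appendix B: Subject Access Request Workflow On receipt of a request...",
].join("\n");

console.log(findPages(policyReport, /^Appendix /)); // [ 41, 47 ]
```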
From overview to a real narrative summary
This tool reports statistics and previews — it does not write prose. For an LLM-style abstract, extract the full text and paste it into your own local LLM.
1. PDF Summary Generator → confirm size/reading time, find key pages
2. PDF to Text (/pdf-tools/pdf-to-text) → full plain text
3. Paste the text into your own local LLM with a prompt like: "Summarise this report in 5 bullet points."
The JAD tool does not call any LLM itself.
Edge cases and what actually happens
Free tier: PDF has more than 50 pages
Blocked (free limit). When you add the file the tool reads its page count; on the free tier a PDF over 50 pages is blocked with a message like 'This PDF has N pages. Free handles up to 50 pages.' Pro raises the cap to 500 pages, Pro + Media to 2,000, Developer to 10,000. Splitting the file with PDF Split into ≤50-page parts lets you summarise each on free.
Free tier: file is larger than 2 MB
Blocked (free limit). Files above 2 MB are blocked on free before processing. Image-heavy reports hit this quickly even when they're short. Pro allows up to 50 MB. The page-by-page text overview itself doesn't depend on images, so a losslessly compressed copy under 2 MB will still summarise identically.
Scanned / image-only PDF
No text content. If the document is a scan with no text layer, pdf.js extracts nothing and every page prints (No text content), with a word count near zero. This isn't a failure of the tool; there is genuinely no text to read. Run PDF OCR to add a searchable layer first, then re-run the summary.
Expecting an AI-written abstract
By design. The report is deterministic statistics plus the literal opening of each page: it never paraphrases, ranks importance, or extracts 'key findings'. That's intentional, because it can't hallucinate. For a narrative summary, extract the text with PDF to Text or PDF to Markdown and feed it to your own LLM.
Page begins with a header, page number, or figure caption
Expected. The preview is simply the first ~200 characters in pdf.js reading order, which is often a running header, page number, or figure label rather than the body text. It's a locator, not a précis. Use it to find the right page, then open the document there.
Encrypted / password-protected PDF
May fail to open. pdf.js can't read text from an encrypted file without the password, so extraction can fail or return nothing. Remove the password first with PDF Unlock or Remove Password (you must know the password), then summarise the decrypted copy.
Multi-column layout flattened into one stream
Expected. pdf.js returns text items in the order the PDF stores them, joined with spaces. In a two-column report the 200-character preview may interleave both columns. The page count, word count, and reading time stay accurate; only the per-page preview readability is affected.
Word count differs from the application that authored the PDF
Expected. Word count is fullText.split(/\s+/).length over the extracted text. It includes headers, footers, and page numbers and splits purely on whitespace, so it won't match Word's or InDesign's count exactly. Treat it as a relative density gauge, not an authoritative figure.
On-screen preview looks cut off
Preview only. The in-browser preview pane shows only the first 5,000 characters and appends '... (truncated preview)'. The downloaded .md file is complete and contains every page's overview; download it to see the whole thing.
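The truncation rule is small enough to sketch directly. The helper name previewOf is hypothetical, and two details are assumptions: that the note is appended immediately after the 5,000th character, and that characters are counted as JavaScript string (UTF-16) units.

```typescript
// Hypothetical sketch of the on-screen preview rule; the download is never truncated.
const PREVIEW_LIMIT = 5000;
const TRUNCATION_NOTE = "... (truncated preview)";

function previewOf(report: string): string {
  // Reports at or under the limit are shown unchanged.
  if (report.length <= PREVIEW_LIMIT) {
    return report;
  }
  return report.slice(0, PREVIEW_LIMIT) + TRUNCATION_NOTE;
}

// Usage: a 6,000-character report is cut for display only.
const longReport = new Array(6001).join("x");
console.log(previewOf(longReport).length); // 5023
```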
Frequently asked questions
Does this tool use AI to summarise the PDF?
No. The summary is generated from the document's structure and statistics, not an AI model. It reports page count, word count, an estimated reading time, and the opening ~200 characters of each page. Nothing is paraphrased, so nothing can be hallucinated. For an LLM-written abstract, extract the text with PDF to Text and paste it into your own local LLM.
What is the maximum PDF length I can summarise?
It depends on tier. Free handles up to 50 pages and 2 MB; Pro up to 500 pages and 50 MB; Pro + Media up to 2,000 pages and 500 MB; Developer up to 10,000 pages and 2 GB. The page count is checked when you add the file, so an over-limit PDF is blocked before the overview runs.
How is the reading time calculated?
It's the word count divided by 250 and rounded up — a flat 250-words-per-minute assumption (Math.ceil(wordCount / 250)). It's a quick density proxy, not a personalised estimate; a slow technical read will take longer and a skim will take less.
Can I choose how long the summary is, or pick a format?
No. There are no options, sliders, length presets, or output-format choices. You drop one PDF and it auto-runs, producing a fixed-shape Markdown report (header stats plus a per-page overview). The page previews are always the first ~200 characters of each page.
Why do my pages say '(No text content)'?
Because the PDF has no extractable text layer — it's a scan or image-only export. pdf.js can only read embedded text, not pixels. Run PDF OCR to add a searchable text layer first, then re-run the summary and the previews will populate.
Is my document uploaded anywhere?
No. Text extraction and the whole report are built in your browser with pdf.js — the result panel shows 'Local browser processing · 0 bytes uploaded'. The only thing recorded when you're signed in is an anonymous run counter, never document content.
What format is the output, and how do I save it?
It's Markdown. The Download button saves it as <your-file>.md with a text/markdown MIME type. The on-screen preview is capped at 5,000 characters, but the downloaded file is complete.
Can I summarise several PDFs at once?
Not in one run — this tool takes a single PDF at a time. Summarise each separately and concatenate the .md reports. For a fully automated batch pipeline, see the automation question below.
Will the word count match Microsoft Word's count?
Usually not exactly. The count splits the extracted text on whitespace and includes running headers, footers, and page numbers, whereas Word counts the editable body differently. Use it as a relative measure of density between documents rather than an authoritative figure.
The page previews look jumbled on a two-column report — why?
pdf.js returns text items in the PDF's stored reading order; on a multi-column layout that can interleave columns within the 200-character preview. Page count, word count, and reading time remain accurate — only the per-page snippet readability is affected. For cleaner extraction try PDF to Markdown.
Can it work on an encrypted PDF?
Not directly — pdf.js can't extract text from an encrypted file without the password. Decrypt it first with PDF Unlock or Remove Password (you need the password), then summarise the unlocked copy.
How is this different from PDF to Markdown or PDF to Text?
PDF to Text and PDF to Markdown give you the full content. This tool gives you a one-glance overview — counts, reading time, and the first ~200 characters of each page — so you can triage and locate sections without reading everything. Use this to decide whether to extract, then extract.
Can I run the summary as an automated pipeline?
On a paid tier, yes — fetch the schema from GET /api/v1/tools/pdf-summary-generator, pair the @jadapps/runner once, then POST each file to 127.0.0.1:9789/v1/tools/pdf-summary-generator/run. The runner extracts and builds the report on your own machine, so documents never reach JAD's servers.
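A batch job built on those two endpoints might start from a plan like the one below. This is a sketch only: schemaUrl, planRuns, and PlannedRun are hypothetical names, plain HTTP on the loopback address is assumed, and the host serving the schema endpoint is not given in the text, so it is passed in as a parameter. The request body, headers, and pairing details are deliberately left out because they are not described here; take them from the schema the first endpoint returns.

```typescript
// Hypothetical planning helpers; they build URLs only and send nothing.
const TOOL_ID = "pdf-summary-generator";

// Path from the answer above. The API host is not stated there,
// so the caller supplies it.
function schemaUrl(apiBase: string): string {
  return apiBase.replace(/\/+$/, "") + "/api/v1/tools/" + TOOL_ID;
}

// Assumption: the local runner speaks plain HTTP on the loopback address.
const RUN_URL = "http://127.0.0.1:9789/v1/tools/" + TOOL_ID + "/run";

interface PlannedRun {
  file: string;
  method: "POST";
  url: string;
}

// One POST per file, since the tool handles a single PDF per run.
function planRuns(files: string[]): PlannedRun[] {
  return files.map((file): PlannedRun => ({ file: file, method: "POST", url: RUN_URL }));
}

// Usage: plan a two-file batch, then send each request with your HTTP client,
// using the body format described by the schema.
console.log(planRuns(["q1-report.pdf", "q2-report.pdf"]));
```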
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.