How to convert a pdf report to markdown for llm and rag pipelines
- Step 1Confirm the report is born-digital — Select a paragraph in the PDF. If it highlights, the text layer is present. If it's a scan, run PDF OCR first — feeding empty pages to an LLM wastes tokens and yields nothing.
- Step 2Drop the report onto the converter — It extracts in your browser with pdf.js and converts automatically. There are no settings — you get
## Page Nheadings and sentence-per-line text. - Step 3Download the .md file — Save it (UTF-8,
text/markdown). This is your raw, page-anchored context source. - Step 4Clean out the noise — Strip repeating running headers, footers, and page numbers that add tokens without meaning. Decide whether to keep or remove the
## Page Nlines — keep them if you want page citations, drop them if they fragment your chunks. - Step 5Chunk for retrieval — Split by
## Page N, by a token-aware splitter (e.g. LangChain's MarkdownTextSplitter), or use PDF to Chunks directly for sentence-aware overlapping chunks with page ranges in JSON. - Step 6Embed and load into your pipeline — Embed the chunks and load them into your vector store, attaching the page number as metadata so retrieved passages can cite their source page.
What helps and hurts LLM context
How the converter's output behaves from a RAG/LLM perspective.
| Aspect | Behaviour | Implication for LLM/RAG |
|---|---|---|
| Body text | Extracted, sentence-per-line | Clean, low-noise context; good for embeddings and prompts. |
## Page N headings | One per page | Use as page-citation anchors in chunk metadata; or as split points. |
| Report's own headings | Plain text, not #/## | You can't split on semantic headings reliably — split on page markers or a token splitter. |
| Tables | Flattened to text | Numeric tables become unreliable context. Extract data with PDF Table to JSON and feed it as structured JSON. |
| Running headers / footers | Repeat each page | Pure noise — strip before embedding to save tokens and avoid near-duplicate chunks. |
| Multi-column layout | May interleave | Can produce jumbled sentences that confuse the model; prefer single-column sources. |
| Images / charts | Dropped | Any insight that lives only in a chart is lost; describe it manually or use a vision model on the image. |
| Scanned pages | Empty | Wasted tokens and empty chunks. OCR first. |
Output format and tier limits
Fixed output. Chunking parameters live in the separate PDF to Chunks tool.
| Property | Value |
|---|---|
| Input | One .pdf at a time |
| Output | One .md file, UTF-8, text/markdown |
| Headings emitted | ## Page N only |
| Chunking / overlap options | None here — see PDF to Chunks |
| Free tier | 2 MB / 50 pages |
| Pro tier | 50 MB / 500 pages |
| Privacy | In-browser; 0 bytes uploaded |
Cookbook
Patterns for turning a report into model-ready context. Sample content is illustrative.
Convert, then split on page markers
The page headings make a simple, deterministic chunk boundary that preserves provenance.
Output (report.md):
## Page 1
Q1 revenue rose 14% year over year.
## Page 2
Churn fell to 2.1% after the onboarding redesign.
Chunk on '## Page N':
chunk 1 → {page: 1, text: 'Q1 revenue rose 14% ...'}
chunk 2 → {page: 2, text: 'Churn fell to 2.1% ...'}Strip running headers and footers
Repeated headers/footers add tokens and create near-duplicate chunks. Remove them before embedding.
Noisy (every page): ## Page 2 Acme Confidential 2026 ← header Churn fell to 2.1% ... Page 2 of 40 ← footer Cleaned: ## Page 2 Churn fell to 2.1% ...
Use PDF to Chunks for sentence-aware overlap
Rather than chunk the Markdown yourself, hand the same report to the chunker for overlapping, page-tagged JSON.
PDF to Chunks (/pdf-tools/pdf-to-chunks):
targetTokens=500, overlap=50
Output JSON:
[ { "id": 0, "page": 1, "pageRange": [1,1],
"text": "...", "tokens": 480 }, ... ]
Load straight into your vector store with page metadata.Feed Markdown as prompt context
For a short report, paste the cleaned Markdown directly into a prompt — page headings help the model cite sections.
System: Answer only from the report below. Cite the page. Context: ## Page 1 Q1 revenue rose 14% ... ## Page 2 Churn fell to 2.1% ... User: What happened to churn, and on what page?
Extract a financial table as JSON, not flattened text
Don't trust flattened tables in LLM context. Pull the numbers as structured data and provide both.
Markdown (flattened, unreliable):
Region Q1 Q2 NA 120 140 EU 90 110
Better: PDF Table to JSON (/pdf-tools/pdf-table-to-json):
[ {"Region":"NA","Q1":"120","Q2":"140"},
{"Region":"EU","Q1":"90","Q2":"110"} ]
→ give the model the JSON for accurate figures.Edge cases and what actually happens
Tables become unreliable LLM context
FlattenedNumeric tables flatten into space-joined text, so the model can misread which value belongs to which cell. For any figures that matter, extract them with PDF Table to JSON and feed the model structured JSON instead of the flattened Markdown.
Running headers/footers create duplicate chunks
NoiseA repeated header/footer extracts on every page and can dominate near-identical chunks, wasting tokens and skewing retrieval. Strip them with a find-and-replace pass before embedding.
You can't split on the report's own headings
By designSection titles arrive as plain text, not #/##, so heading-based splitting won't find them. Split on ## Page N, use a token-aware splitter, or use PDF to Chunks for sentence-aware chunks.
Multi-column report interleaves sentences
May interleaveTwo-column layouts can weave columns together, producing sentences that read as nonsense to a model. Prefer single-column source reports; otherwise re-order before embedding or convert from a single-column export.
Scanned / image-only report
Empty outputNo text layer means empty chunks and wasted tokens. Run PDF OCR first to add text, then convert and chunk.
Charts and figures carry insight that's lost
ExpectedImages aren't extracted, so a finding shown only in a chart never reaches the model. Describe key figures manually in the context, or run a vision model on the exported chart image from PDF to PNG.
Report over 50 pages on free tier
blockedFree caps at 50 pages; longer reports are blocked on drop. Pro allows 500. Slice with PDF Extract Pages and convert sections, or upgrade for the full document.
Report over 2 MB on free tier
blockedFree caps input at 2 MB. Heavier reports are blocked; Pro raises it to 50 MB. Compress with PDF Lossy Compress or convert on Pro.
Sentence splitter mis-breaks on numbers and abbreviations
CosmeticSplitting on ./!/? can break a line mid-sentence at a decimal or 'Inc.'. It doesn't change the words and rarely affects embeddings, but a token-aware re-chunk smooths it out if you care about clean sentence boundaries.
Frequently asked questions
Is Markdown actually better than plain text for an LLM?
Marginally, mainly because of the ## Page N anchors this tool adds — they give passages page provenance you can cite. Beyond that, the body is plain text either way. The bigger wins come from cleaning out headers/footers and chunking well, not from the format itself.
How should I chunk the Markdown for RAG?
Split on ## Page N for page-level chunks, or use a token-aware splitter like LangChain's MarkdownTextSplitter to respect context limits. For sentence-aware overlapping chunks with page ranges as JSON, use PDF to Chunks directly instead of post-processing the Markdown.
Can I split on the report's section headings?
No — section titles come through as plain text, not Markdown headings, so heading-based splitters won't find them. The only real heading is ## Page N. Split on that or on tokens.
Should I clean the Markdown before feeding it to a model?
Yes. Strip repeating running headers, footers, and page numbers — they add tokens, create near-duplicate chunks, and pollute retrieval. Decide whether to keep the ## Page N lines based on whether you want page citations.
Are tables safe to use as LLM context?
Not as-is — they flatten to space-joined text and the model can mis-associate values with cells. For any numbers that matter, extract them with PDF Table to JSON and provide the structured JSON alongside (or instead of) the flattened text.
What's the difference between this and PDF to Chunks?
This produces one Markdown file with ## Page N headings; you do your own cleaning and chunking. PDF to Chunks does sentence-aware, overlapping chunking with a configurable token target and overlap, emitting JSON with per-chunk page ranges and token counts — ready to embed.
Does this tool summarise the report with AI?
No — it only extracts text. For a structural summary there's PDF Summary Generator (built from document structure/stats, not an LLM). For an AI summary, feed this Markdown to your own model.
Is my report sent to any server during conversion?
No. Conversion runs entirely in your browser via pdf.js — important for confidential or proprietary reports. The data only reaches a model when you choose to send it to your own pipeline.
How big a report can I convert?
Free: 2 MB and 50 pages. Pro: 50 MB and 500 pages. Over the cap, the file is blocked on drop; slice it with PDF Extract Pages or upgrade.
My scanned report produced empty chunks — why?
A scan has no embedded text, so there's nothing to extract and your chunks come out empty. Run PDF OCR first to add a text layer, then convert and chunk.
Can I automate this in an ingestion pipeline?
Yes, on Pro. pdf-to-markdown is a runner-builtin: pair the @jadapps/runner once and call it locally from your pipeline (a common chain is convert → strip noise → PDF to Chunks → embed). The report stays on your machine — nothing reaches JAD's servers.
How do I preserve page citations through to the answer?
Keep the ## Page N markers, attach the page number to each chunk's metadata at ingestion time, and have your prompt instruct the model to cite the page of any retrieved passage. The page anchors are the whole reason this tool is better than plain text for cited RAG.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.