Convert a PDF Report to Markdown for LLMs & RAG

How to convert a pdf report to markdown for llm and rag pipelines

Step 1
Confirm the report is born-digital — Select a paragraph in the PDF. If it highlights, the text layer is present. If it's a scan, run PDF OCR first — feeding empty pages to an LLM wastes tokens and yields nothing.
Step 2
Drop the report onto the converter — It extracts in your browser with pdf.js and converts automatically. There are no settings — you get ## Page N headings and sentence-per-line text.
Step 3
Download the .md file — Save it (UTF-8, text/markdown). This is your raw, page-anchored context source.
Step 4
Clean out the noise — Strip repeating running headers, footers, and page numbers that add tokens without meaning. Decide whether to keep or remove the ## Page N lines — keep them if you want page citations, drop them if they fragment your chunks.
Step 5
Chunk for retrieval — Split by ## Page N, by a token-aware splitter (e.g. LangChain's MarkdownTextSplitter), or use PDF to Chunks directly for sentence-aware overlapping chunks with page ranges in JSON.
Step 6
Embed and load into your pipeline — Embed the chunks and load them into your vector store, attaching the page number as metadata so retrieved passages can cite their source page.

What helps and hurts LLM context

How the converter's output behaves from a RAG/LLM perspective.

Aspect	Behaviour	Implication for LLM/RAG
Body text	Extracted, sentence-per-line	Clean, low-noise context; good for embeddings and prompts.
`## Page N` headings	One per page	Use as page-citation anchors in chunk metadata; or as split points.
Report's own headings	Plain text, not `#`/`##`	You can't split on semantic headings reliably — split on page markers or a token splitter.
Tables	Flattened to text	Numeric tables become unreliable context. Extract data with PDF Table to JSON and feed it as structured JSON.
Running headers / footers	Repeat each page	Pure noise — strip before embedding to save tokens and avoid near-duplicate chunks.
Multi-column layout	May interleave	Can produce jumbled sentences that confuse the model; prefer single-column sources.
Images / charts	Dropped	Any insight that lives only in a chart is lost; describe it manually or use a vision model on the image.
Scanned pages	Empty	Wasted tokens and empty chunks. OCR first.

Output format and tier limits

Fixed output. Chunking parameters live in the separate PDF to Chunks tool.

Property	Value
Input	One `.pdf` at a time
Output	One `.md` file, UTF-8, `text/markdown`
Headings emitted	`## Page N` only
Chunking / overlap options	None here — see PDF to Chunks
Free tier	2 MB / 50 pages
Pro tier	50 MB / 500 pages
Privacy	In-browser; 0 bytes uploaded

Cookbook

Patterns for turning a report into model-ready context. Sample content is illustrative.

Convert, then split on page markers

The page headings make a simple, deterministic chunk boundary that preserves provenance.

Output (report.md):
## Page 1
Q1 revenue rose 14% year over year.
## Page 2
Churn fell to 2.1% after the onboarding redesign.

Chunk on '## Page N':
  chunk 1 → {page: 1, text: 'Q1 revenue rose 14% ...'}
  chunk 2 → {page: 2, text: 'Churn fell to 2.1% ...'}

Strip running headers and footers

Repeated headers/footers add tokens and create near-duplicate chunks. Remove them before embedding.

Noisy (every page):
## Page 2
Acme Confidential 2026   ← header
Churn fell to 2.1% ...
Page 2 of 40             ← footer

Cleaned:
## Page 2
Churn fell to 2.1% ...

Use PDF to Chunks for sentence-aware overlap

Rather than chunk the Markdown yourself, hand the same report to the chunker for overlapping, page-tagged JSON.

PDF to Chunks (/pdf-tools/pdf-to-chunks):
  targetTokens=500, overlap=50

Output JSON:
[ { "id": 0, "page": 1, "pageRange": [1,1],
    "text": "...", "tokens": 480 }, ... ]

Load straight into your vector store with page metadata.

Feed Markdown as prompt context

For a short report, paste the cleaned Markdown directly into a prompt — page headings help the model cite sections.

System: Answer only from the report below. Cite the page.

Context:
## Page 1
Q1 revenue rose 14% ...
## Page 2
Churn fell to 2.1% ...

User: What happened to churn, and on what page?

Extract a financial table as JSON, not flattened text

Don't trust flattened tables in LLM context. Pull the numbers as structured data and provide both.

Markdown (flattened, unreliable):
Region Q1 Q2 NA 120 140 EU 90 110

Better: PDF Table to JSON (/pdf-tools/pdf-table-to-json):
[ {"Region":"NA","Q1":"120","Q2":"140"},
  {"Region":"EU","Q1":"90","Q2":"110"} ]
→ give the model the JSON for accurate figures.

Edge cases and what actually happens

Tables become unreliable LLM context

Flattened

Numeric tables flatten into space-joined text, so the model can misread which value belongs to which cell. For any figures that matter, extract them with PDF Table to JSON and feed the model structured JSON instead of the flattened Markdown.

Running headers/footers create duplicate chunks

Noise

A repeated header/footer extracts on every page and can dominate near-identical chunks, wasting tokens and skewing retrieval. Strip them with a find-and-replace pass before embedding.

You can't split on the report's own headings

By design

Section titles arrive as plain text, not #/##, so heading-based splitting won't find them. Split on ## Page N, use a token-aware splitter, or use PDF to Chunks for sentence-aware chunks.

Multi-column report interleaves sentences

May interleave

Two-column layouts can weave columns together, producing sentences that read as nonsense to a model. Prefer single-column source reports; otherwise re-order before embedding or convert from a single-column export.

Scanned / image-only report

Empty output

No text layer means empty chunks and wasted tokens. Run PDF OCR first to add text, then convert and chunk.

Charts and figures carry insight that's lost

Expected

Images aren't extracted, so a finding shown only in a chart never reaches the model. Describe key figures manually in the context, or run a vision model on the exported chart image from PDF to PNG.

Report over 50 pages on free tier

blocked

Free caps at 50 pages; longer reports are blocked on drop. Pro allows 500. Slice with PDF Extract Pages and convert sections, or upgrade for the full document.

Report over 2 MB on free tier

blocked

Free caps input at 2 MB. Heavier reports are blocked; Pro raises it to 50 MB. Compress with PDF Lossy Compress or convert on Pro.

Sentence splitter mis-breaks on numbers and abbreviations

Cosmetic

Splitting on ./!/? can break a line mid-sentence at a decimal or 'Inc.'. It doesn't change the words and rarely affects embeddings, but a token-aware re-chunk smooths it out if you care about clean sentence boundaries.

Frequently asked questions

Is Markdown actually better than plain text for an LLM?

Marginally, mainly because of the ## Page N anchors this tool adds — they give passages page provenance you can cite. Beyond that, the body is plain text either way. The bigger wins come from cleaning out headers/footers and chunking well, not from the format itself.

How should I chunk the Markdown for RAG?

Split on ## Page N for page-level chunks, or use a token-aware splitter like LangChain's MarkdownTextSplitter to respect context limits. For sentence-aware overlapping chunks with page ranges as JSON, use PDF to Chunks directly instead of post-processing the Markdown.

Can I split on the report's section headings?

No — section titles come through as plain text, not Markdown headings, so heading-based splitters won't find them. The only real heading is ## Page N. Split on that or on tokens.

Should I clean the Markdown before feeding it to a model?

Yes. Strip repeating running headers, footers, and page numbers — they add tokens, create near-duplicate chunks, and pollute retrieval. Decide whether to keep the ## Page N lines based on whether you want page citations.

Are tables safe to use as LLM context?

Not as-is — they flatten to space-joined text and the model can mis-associate values with cells. For any numbers that matter, extract them with PDF Table to JSON and provide the structured JSON alongside (or instead of) the flattened text.

What's the difference between this and PDF to Chunks?

This produces one Markdown file with ## Page N headings; you do your own cleaning and chunking. PDF to Chunks does sentence-aware, overlapping chunking with a configurable token target and overlap, emitting JSON with per-chunk page ranges and token counts — ready to embed.

Does this tool summarise the report with AI?

No — it only extracts text. For a structural summary there's PDF Summary Generator (built from document structure/stats, not an LLM). For an AI summary, feed this Markdown to your own model.

Is my report sent to any server during conversion?

No. Conversion runs entirely in your browser via pdf.js — important for confidential or proprietary reports. The data only reaches a model when you choose to send it to your own pipeline.

How big a report can I convert?

Free: 2 MB and 50 pages. Pro: 50 MB and 500 pages. Over the cap, the file is blocked on drop; slice it with PDF Extract Pages or upgrade.

My scanned report produced empty chunks — why?

A scan has no embedded text, so there's nothing to extract and your chunks come out empty. Run PDF OCR first to add a text layer, then convert and chunk.

Can I automate this in an ingestion pipeline?

Yes, on Pro. pdf-to-markdown is a runner-builtin: pair the @jadapps/runner once and call it locally from your pipeline (a common chain is convert → strip noise → PDF to Chunks → embed). The report stays on your machine — nothing reaches JAD's servers.

How do I preserve page citations through to the answer?

Keep the ## Page N markers, attach the page number to each chunk's metadata at ingestion time, and have your prompt instruct the model to cite the page of any retrieved passage. The page anchors are the whole reason this tool is better than plain text for cited RAG.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.

How to convert a pdf report to markdown for llm and rag pipelines

Step 1
Confirm the report is born-digital — Select a paragraph in the PDF. If it highlights, the text layer is present. If it's a scan, run PDF OCR first — feeding empty pages to an LLM wastes tokens and yields nothing.
Step 2
Drop the report onto the converter — It extracts in your browser with pdf.js and converts automatically. There are no settings — you get ## Page N headings and sentence-per-line text.
Step 3
Download the .md file — Save it (UTF-8, text/markdown). This is your raw, page-anchored context source.
Step 4
Clean out the noise — Strip repeating running headers, footers, and page numbers that add tokens without meaning. Decide whether to keep or remove the ## Page N lines — keep them if you want page citations, drop them if they fragment your chunks.
Step 5
Chunk for retrieval — Split by ## Page N, by a token-aware splitter (e.g. LangChain's MarkdownTextSplitter), or use PDF to Chunks directly for sentence-aware overlapping chunks with page ranges in JSON.
Step 6
Embed and load into your pipeline — Embed the chunks and load them into your vector store, attaching the page number as metadata so retrieved passages can cite their source page.

What helps and hurts LLM context

How the converter's output behaves from a RAG/LLM perspective.

Aspect	Behaviour	Implication for LLM/RAG
Body text	Extracted, sentence-per-line	Clean, low-noise context; good for embeddings and prompts.
`## Page N` headings	One per page	Use as page-citation anchors in chunk metadata; or as split points.
Report's own headings	Plain text, not `#`/`##`	You can't split on semantic headings reliably — split on page markers or a token splitter.
Tables	Flattened to text	Numeric tables become unreliable context. Extract data with PDF Table to JSON and feed it as structured JSON.
Running headers / footers	Repeat each page	Pure noise — strip before embedding to save tokens and avoid near-duplicate chunks.
Multi-column layout	May interleave	Can produce jumbled sentences that confuse the model; prefer single-column sources.
Images / charts	Dropped	Any insight that lives only in a chart is lost; describe it manually or use a vision model on the image.
Scanned pages	Empty	Wasted tokens and empty chunks. OCR first.

Output format and tier limits

Fixed output. Chunking parameters live in the separate PDF to Chunks tool.

Property	Value
Input	One `.pdf` at a time
Output	One `.md` file, UTF-8, `text/markdown`
Headings emitted	`## Page N` only
Chunking / overlap options	None here — see PDF to Chunks
Free tier	2 MB / 50 pages
Pro tier	50 MB / 500 pages
Privacy	In-browser; 0 bytes uploaded

Cookbook

Patterns for turning a report into model-ready context. Sample content is illustrative.

Convert, then split on page markers

The page headings make a simple, deterministic chunk boundary that preserves provenance.

Output (report.md):
## Page 1
Q1 revenue rose 14% year over year.
## Page 2
Churn fell to 2.1% after the onboarding redesign.

Chunk on '## Page N':
  chunk 1 → {page: 1, text: 'Q1 revenue rose 14% ...'}
  chunk 2 → {page: 2, text: 'Churn fell to 2.1% ...'}

Strip running headers and footers

Repeated headers/footers add tokens and create near-duplicate chunks. Remove them before embedding.

Noisy (every page):
## Page 2
Acme Confidential 2026   ← header
Churn fell to 2.1% ...
Page 2 of 40             ← footer

Cleaned:
## Page 2
Churn fell to 2.1% ...

Use PDF to Chunks for sentence-aware overlap

Rather than chunk the Markdown yourself, hand the same report to the chunker for overlapping, page-tagged JSON.

PDF to Chunks (/pdf-tools/pdf-to-chunks):
  targetTokens=500, overlap=50

Output JSON:
[ { "id": 0, "page": 1, "pageRange": [1,1],
    "text": "...", "tokens": 480 }, ... ]

Load straight into your vector store with page metadata.

Feed Markdown as prompt context

For a short report, paste the cleaned Markdown directly into a prompt — page headings help the model cite sections.

System: Answer only from the report below. Cite the page.

Context:
## Page 1
Q1 revenue rose 14% ...
## Page 2
Churn fell to 2.1% ...

User: What happened to churn, and on what page?

Extract a financial table as JSON, not flattened text

Don't trust flattened tables in LLM context. Pull the numbers as structured data and provide both.

Markdown (flattened, unreliable):
Region Q1 Q2 NA 120 140 EU 90 110

Better: PDF Table to JSON (/pdf-tools/pdf-table-to-json):
[ {"Region":"NA","Q1":"120","Q2":"140"},
  {"Region":"EU","Q1":"90","Q2":"110"} ]
→ give the model the JSON for accurate figures.

Edge cases and what actually happens

Tables become unreliable LLM context

Flattened

Running headers/footers create duplicate chunks

Noise

A repeated header/footer extracts on every page and can dominate near-identical chunks, wasting tokens and skewing retrieval. Strip them with a find-and-replace pass before embedding.

You can't split on the report's own headings

By design

Section titles arrive as plain text, not #/##, so heading-based splitting won't find them. Split on ## Page N, use a token-aware splitter, or use PDF to Chunks for sentence-aware chunks.

Multi-column report interleaves sentences

May interleave

Scanned / image-only report

Empty output

No text layer means empty chunks and wasted tokens. Run PDF OCR first to add text, then convert and chunk.

Charts and figures carry insight that's lost

Expected

Images aren't extracted, so a finding shown only in a chart never reaches the model. Describe key figures manually in the context, or run a vision model on the exported chart image from PDF to PNG.

Report over 50 pages on free tier

blocked

Free caps at 50 pages; longer reports are blocked on drop. Pro allows 500. Slice with PDF Extract Pages and convert sections, or upgrade for the full document.

Report over 2 MB on free tier

blocked

Free caps input at 2 MB. Heavier reports are blocked; Pro raises it to 50 MB. Compress with PDF Lossy Compress or convert on Pro.

Sentence splitter mis-breaks on numbers and abbreviations

Cosmetic

Frequently asked questions

Is Markdown actually better than plain text for an LLM?

How should I chunk the Markdown for RAG?

Can I split on the report's section headings?

No — section titles come through as plain text, not Markdown headings, so heading-based splitters won't find them. The only real heading is ## Page N. Split on that or on tokens.

Should I clean the Markdown before feeding it to a model?

Are tables safe to use as LLM context?

What's the difference between this and PDF to Chunks?

Does this tool summarise the report with AI?

No — it only extracts text. For a structural summary there's PDF Summary Generator (built from document structure/stats, not an LLM). For an AI summary, feed this Markdown to your own model.

Is my report sent to any server during conversion?

No. Conversion runs entirely in your browser via pdf.js — important for confidential or proprietary reports. The data only reaches a model when you choose to send it to your own pipeline.

How big a report can I convert?

Free: 2 MB and 50 pages. Pro: 50 MB and 500 pages. Over the cap, the file is blocked on drop; slice it with PDF Extract Pages or upgrade.

My scanned report produced empty chunks — why?

A scan has no embedded text, so there's nothing to extract and your chunks come out empty. Run PDF OCR first to add a text layer, then convert and chunk.

Can I automate this in an ingestion pipeline?

How do I preserve page citations through to the answer?

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.

Convert a PDF Report to Markdown for LLM and RAG Pipelines

How to convert a pdf report to markdown for llm and rag pipelines

What helps and hurts LLM context

Output format and tier limits

Cookbook

Convert, then split on page markers

Strip running headers and footers

Use PDF to Chunks for sentence-aware overlap

Feed Markdown as prompt context

Extract a financial table as JSON, not flattened text

Edge cases and what actually happens

Tables become unreliable LLM context

Running headers/footers create duplicate chunks

You can't split on the report's own headings

Multi-column report interleaves sentences

Scanned / image-only report

Charts and figures carry insight that's lost

Report over 50 pages on free tier

Report over 2 MB on free tier

Sentence splitter mis-breaks on numbers and abbreviations

Frequently asked questions

Is Markdown actually better than plain text for an LLM?

How should I chunk the Markdown for RAG?

Can I split on the report's section headings?

Should I clean the Markdown before feeding it to a model?

Are tables safe to use as LLM context?

What's the difference between this and PDF to Chunks?

Does this tool summarise the report with AI?

Is my report sent to any server during conversion?

How big a report can I convert?

My scanned report produced empty chunks — why?

Can I automate this in an ingestion pipeline?

How do I preserve page citations through to the answer?

Privacy first

Related guides

Convert a PDF Report to Markdown for LLM and RAG Pipelines

How to convert a pdf report to markdown for llm and rag pipelines

What helps and hurts LLM context

Output format and tier limits

Cookbook

Convert, then split on page markers

Strip running headers and footers

Use PDF to Chunks for sentence-aware overlap

Feed Markdown as prompt context

Extract a financial table as JSON, not flattened text

Edge cases and what actually happens

Tables become unreliable LLM context

Running headers/footers create duplicate chunks

You can't split on the report's own headings

Multi-column report interleaves sentences

Scanned / image-only report

Charts and figures carry insight that's lost

Report over 50 pages on free tier

Report over 2 MB on free tier

Sentence splitter mis-breaks on numbers and abbreviations

Frequently asked questions

Is Markdown actually better than plain text for an LLM?

How should I chunk the Markdown for RAG?

Can I split on the report's section headings?

Should I clean the Markdown before feeding it to a model?

Are tables safe to use as LLM context?

What's the difference between this and PDF to Chunks?

Does this tool summarise the report with AI?

Is my report sent to any server during conversion?

How big a report can I convert?

My scanned report produced empty chunks — why?

Can I automate this in an ingestion pipeline?

How do I preserve page citations through to the answer?

Privacy first

Related guides