How to convert a PDF document to Markdown text
- Step 1: Confirm the PDF has a real text layer. Open the PDF and try to select a sentence with your cursor. If text highlights, it is born-digital and will convert. If nothing selects, it is a scan or photo: run PDF OCR first to add a text layer, then come back here.
- Step 2: Drop the PDF onto the converter. Use the dropzone above. The tool reads the file in your browser with pdf.js. There is no Settings panel and nothing to configure; conversion starts automatically the moment a valid PDF is added.
- Step 3: Watch it auto-convert. The tool extracts text page by page, emits a ## Page N heading for each page, and splits each page into one-sentence-per-line Markdown. For a clean, born-digital document this takes a second or two.
- Step 4: Review the preview. The result panel shows the first ~5,000 characters of the generated Markdown plus an output-size stat. Skim it to confirm the text came through readable and the page headings line up with the source.
- Step 5: Download the .md file. Click Download. The file saves as yourfilename.md with the text/markdown type and UTF-8 encoding. The full output is saved, not just the previewed portion.
- Step 6: Polish in your Markdown editor. Because original headings, lists, and bold are not reconstructed, expect to promote the section titles to #/##, re-create bullet lists, and wrap any code in fenced blocks. The conversion gives you clean text to work from, not a finished document.
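The output shape described in the steps above can be approximated in a few lines. This is a minimal sketch, not the tool's actual code: the function names `split_sentences` and `pages_to_markdown` are hypothetical, and the splitting rule (break after `.`, `!`, or `?` when followed by whitespace) is an assumption, since the page does not document the exact rule or any blank-line handling.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list:
    """Split text after '.', '!' or '?' and drop empty pieces."""
    collapsed = " ".join(text.split())  # normalise runs of whitespace
    if not collapsed:
        return []
    return [s for s in _SENTENCE_END.split(collapsed) if s]


def pages_to_markdown(pages: list) -> str:
    """Emit a '## Page N' heading per page, then one sentence per line."""
    lines = []
    for number, text in enumerate(pages, start=1):
        lines.append(f"## Page {number}")
        lines.extend(split_sentences(text))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        "Project kickoff happened on Monday. The team agreed on a two-week sprint cadence.",
        "Design review is scheduled for Friday.",
    ]
    print(pages_to_markdown(sample), end="")
```

A page with no extractable text produces only its heading, which matches the behaviour described for scanned pages.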
What the converter preserves — and what it doesn't
The output is generated text, not a faithful re-rendering of the PDF's layout. Knowing the difference saves you from chasing formatting that was never extracted.
| PDF element | In the Markdown? | What actually happens |
|---|---|---|
| Body text (born-digital) | Yes | Read via page.getTextContent(), joined in pdf.js order, split into one-sentence-per-line. |
| Page boundaries | Yes | Each page is preceded by a ## Page N Markdown heading — the only Markdown syntax the tool emits. |
| Original headings (titles, H1/H2) | As plain text only | A title in the PDF becomes a normal text line. It is not turned into a #/## heading — the tool can't tell a heading from body text. |
| Bold / italic | No | Font weight and style are layout attributes, not text. They are dropped; you get unstyled text. |
| Bullet / numbered lists | No | List markers may survive as literal characters in the text, but no Markdown -/1. list structure is created. |
| Tables | No (flattened) | Cells are read as positioned text and collapse into space-joined lines. For real tables use PDF Table to JSON or PDF to Excel. |
| Images / figures / logos | No | Pictures are ignored entirely. There is no image extraction in this tool. |
| Hyperlinks | No | Link annotations are not read; only visible link text comes through, with no [text](url) syntax. |
| Scanned page (no text layer) | Empty | An image-only page yields a ## Page N heading and little or no text. OCR first. |
Output format and tier limits
Everything is fixed — there are no encoding, page-range, or style options.
| Property | Value |
|---|---|
| Input accepted | A single .pdf file (one at a time) |
| Output | One .md file, text/markdown, UTF-8 |
| Filename | Source name with the extension swapped to .md |
| Markdown syntax emitted | ## Page N headings only; everything else is plain text |
| Options | None — auto-converts on drop |
| Free tier | 2 MB and 50 pages per file |
| Pro tier | 50 MB and 500 pages per file |
| Privacy | Processed locally in your browser; 0 bytes uploaded |
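If you batch files before dropping them on the tool, a pre-check against the limits in the table above can save a blocked attempt. The sketch below is an illustration only: `check_limits` and `output_name` are hypothetical helpers, and treating 1 MB as 1024 × 1024 bytes is an assumption, because the page does not say which definition the tool uses.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

MB = 1024 * 1024  # assumption: the tool may count 1 MB as 10**6 bytes instead


@dataclass(frozen=True)
class TierLimit:
    max_bytes: int
    max_pages: int


# Values taken from the table above.
LIMITS = {
    "free": TierLimit(max_bytes=2 * MB, max_pages=50),
    "pro": TierLimit(max_bytes=50 * MB, max_pages=500),
}


def check_limits(size_bytes: int, page_count: int, tier: str = "free") -> list:
    """Return a list of reasons the file would be blocked (empty if it fits)."""
    limit = LIMITS[tier]
    problems = []
    if size_bytes > limit.max_bytes:
        problems.append(f"file is larger than {limit.max_bytes // MB} MB")
    if page_count > limit.max_pages:
        problems.append(f"file has more than {limit.max_pages} pages")
    return problems


def output_name(source_name: str) -> str:
    """Swap the source extension for .md, as the Filename row describes."""
    return PurePosixPath(source_name).with_suffix(".md").name
```

For example, `check_limits(3 * MB, 10)` reports the size problem on the free tier, while the same file passes with `tier="pro"`.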
Cookbook
Real before/after snippets showing what the generated Markdown actually looks like. Sample content is illustrative.
A clean single-column document
The ideal case: a born-digital report with one column of body text. The text reads in natural order and each page is clearly marked.
Input: notes.pdf (born-digital, 3 pages)
Action: drop on the tool → auto-converts
Output (notes.md):
    ## Page 1
    Project kickoff happened on Monday.
    The team agreed on a two-week sprint cadence.
    ## Page 2
    Design review is scheduled for Friday.
Original headings come out as plain text
A PDF title and section heading are visually large in the source, but the tool sees them as ordinary glyph runs. They become text lines, not Markdown headings.
Source PDF shows (visually):
    ANNUAL REPORT 2026    ← big title
    1. Overview           ← section heading
    Revenue grew 14% ...
Markdown output:
    ## Page 1
    ANNUAL REPORT 2026
    1. Overview
    Revenue grew 14% ...
→ promote 'ANNUAL REPORT 2026' to '# ' and 'Overview' to '## ' yourself afterward.
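Promoting headings by hand gets tedious on long files, so a small post-processing script can do a first pass. The sketch below is a heuristic, not something the converter does: `promote_headings` is a hypothetical helper that treats short all-caps lines as titles and short "1. Something" lines as section headings. Both rules are assumptions and will misfire on some documents, so review the result.

```python
import re

# Assumption: a section heading looks like "1. Overview" or "2.3. Scope".
_NUMBERED = re.compile(r"^\d+(?:\.\d+)*\.\s+(\S.*)$")
_MAX_HEADING_LENGTH = 60  # arbitrary cut-off to avoid promoting full sentences


def promote_headings(markdown: str) -> str:
    """Heuristically turn title-like lines into '#' / '##' headings."""
    out = []
    for line in markdown.splitlines():
        stripped = line.strip()
        numbered = _NUMBERED.match(stripped)
        short = len(stripped) <= _MAX_HEADING_LENGTH
        if not stripped or stripped.startswith("#"):
            out.append(line)  # keep blanks and existing '## Page N' markers
        elif numbered and short and not stripped.endswith((".", "!", "?")):
            out.append(f"## {numbered.group(1)}")
        elif stripped.isupper() and short:
            out.append(f"# {stripped}")
        else:
            out.append(line)
    return "\n".join(out)
```

Running it on the output shown above turns 'ANNUAL REPORT 2026' into a `#` heading and '1. Overview' into `## Overview`, leaving the page marker and body text untouched.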
A table flattens into text lines
Tabular content does not become a Markdown table. Cells read as positioned text and merge by reading order, so columns lose their alignment.
Source table:
    Name   Role       Start
    Ada    Engineer   2024
    Bola   Designer   2025
Markdown output (flattened):
    ## Page 1
    Name Role Start
    Ada Engineer 2024
    Bola Designer 2025
→ for structured rows use PDF Table to JSON or PDF to Excel.
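For a small table like the one above, you can sometimes rebuild the Markdown table from the flattened lines yourself. This is a rough sketch under a strong assumption: every cell is a single token with no spaces, so splitting on whitespace recovers the columns. The helper name `rows_to_markdown_table` is hypothetical, and for anything more complex the dedicated table tools remain the better route.

```python
def rows_to_markdown_table(lines: list) -> str:
    """Build a Markdown table from space-joined rows; first row is the header.

    Assumes each cell is one whitespace-free token. Raises ValueError when
    rows disagree on cell count, which usually means a cell contained spaces.
    """
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("rows have different cell counts; cells may contain spaces")
    header, *body = rows
    table = ["| " + " | ".join(header) + " |", "|" + "---|" * width]
    table += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(table)


if __name__ == "__main__":
    flattened = ["Name Role Start", "Ada Engineer 2024", "Bola Designer 2025"]
    print(rows_to_markdown_table(flattened))
```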
Multi-page document with page anchors
Page headings make it easy to jump around a long file and trace text back to its source page when you edit.
Input: handbook.pdf (40 pages, born-digital)
Output (handbook.md) structure:
    ## Page 1
    ...
    ## Page 2
    ...
    ## Page 40
    ...
Search '## Page 23' in your editor to land on page 23's text.
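The same page anchors make the file easy to slice in a script. As a minimal sketch, the hypothetical `split_pages` helper below maps each page number to its text, assuming the headings appear exactly as `## Page N` at the start of a line.

```python
import re

_PAGE_HEADING = re.compile(r"^## Page (\d+)\s*$")


def split_pages(markdown: str) -> dict:
    """Map page number -> body text, using '## Page N' lines as boundaries."""
    pages = {}
    current = None
    for line in markdown.splitlines():
        match = _PAGE_HEADING.match(line)
        if match:
            current = int(match.group(1))
            pages[current] = []
        elif current is not None:
            pages[current].append(line)
    return {number: "\n".join(body).strip() for number, body in pages.items()}
```

With the handbook example, `split_pages(text)[23]` would return page 23's text, provided that page exists in the file.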
A scanned PDF converts to almost nothing
If the pages are images of text (a scan or phone photo), there is no text layer to read, so the Markdown is just empty page headings. OCR first.
Input: scanned-invoice.pdf (image-only)
Output:
    ## Page 1
    ## Page 2
    (no body text; pages are pictures)
Fix: run PDF OCR (/pdf-tools/pdf-ocr) to add a text layer, then convert the OCR'd PDF here.
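In a mixed document, only some pages may be scans. One way to spot them after conversion is to list the page headings that have no text beneath them. The sketch below assumes the `## Page N` format described on this page; `pages_without_text` is a hypothetical helper name.

```python
import re

_PAGE_HEADING = re.compile(r"^## Page (\d+)\s*$")


def pages_without_text(markdown: str) -> list:
    """Return the page numbers whose '## Page N' heading has no body text."""
    empty = []
    current = None
    has_text = False
    for line in markdown.splitlines():
        match = _PAGE_HEADING.match(line)
        if match:
            if current is not None and not has_text:
                empty.append(current)  # previous page ended with no text
            current = int(match.group(1))
            has_text = False
        elif line.strip():
            has_text = True
    if current is not None and not has_text:
        empty.append(current)  # handle the final page
    return empty
```

Any page numbers it returns are candidates for OCR before converting again.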
Edge cases and what actually happens
Scanned / image-only PDF
Empty output: There is no embedded text on the page, so getTextContent() returns nothing and you get a ## Page N heading with no body. Run PDF OCR first to add a real text layer, then convert.
PDF headings are not turned into Markdown headings
By design: Only ## Page N is emitted. A visually large title or numbered section heading in the source comes through as an ordinary text line because the tool has no way to distinguish a heading from body text by font size alone. Promote them to #/## yourself after conversion.
Tables are not converted to Markdown tables
Flattened: Table cells are positioned text; they collapse into space-joined lines and lose column structure. This is expected. For structured output use PDF Table to JSON or PDF to Excel.
File larger than 2 MB on the free tier
Blocked: The free tier caps input at 2 MB. A larger file is blocked before conversion with an upgrade prompt. Pro raises the cap to 50 MB. To keep it free, split the PDF first with PDF Split by Range and convert each part.
More than 50 pages on the free tier
Blocked: Page count is checked on drop. Over 50 pages is blocked on free (Pro allows up to 500). Extract a slice with PDF Extract Pages and convert that, or upgrade.
Password-protected (open-password) PDF
Fails to open: If the PDF requires a password just to open, pdf.js cannot read its pages and conversion fails. Remove the password first with PDF Remove Password (you must know it), then convert.
Multi-column layout
May interleave: pdf.js returns text in its own order, which for two-column pages can interleave the columns mid-line. The text is all there but the reading order may be jumbled. Single-column documents convert cleanly; expect to re-order paragraphs on complex layouts.
Subset font with no Unicode mapping
Garbled: Some PDFs embed subsetted fonts without a ToUnicode map, so the stored codes don't map to real characters. The text comes out as gibberish. This is a property of the source file, not the converter. Re-export the PDF with text-extraction enabled if you control the source.
Images and figures are dropped
Expected: This tool extracts text only. Embedded pictures, charts, and logos are ignored and never appear in the Markdown. If you need the figures, export them separately with PDF to PNG.
Sentence splitter mishandles abbreviations
Cosmetic: Lines break on ., !, and ?, so an abbreviation like 'Inc.' or a decimal can occasionally start a new line mid-sentence. It's purely cosmetic: the words are all present and correct. Rejoin lines in your editor if you prefer paragraphs.
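If stray breaks after abbreviations bother you, a short script can stitch them back together. The sketch below is one possible repair, not part of the tool: `rejoin_abbreviations` is a hypothetical helper, and the `ABBREVIATIONS` tuple is an illustrative list you would extend for your own documents. It does not attempt to repair split decimals.

```python
# Illustrative list only; extend it with the abbreviations your documents use.
ABBREVIATIONS = ("Inc.", "Ltd.", "Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "vs.")


def rejoin_abbreviations(lines: list) -> list:
    """Merge a line into the previous one when that line ends in an abbreviation."""
    merged = []
    for line in lines:
        previous_is_abbreviation = bool(merged) and merged[-1].endswith(ABBREVIATIONS)
        if previous_is_abbreviation and not line.startswith("## Page "):
            merged[-1] = f"{merged[-1]} {line}"
        else:
            merged.append(line)
    return merged


if __name__ == "__main__":
    broken = ["## Page 1", "Acme Inc.", "shipped the update.", "It works."]
    print("\n".join(rejoin_abbreviations(broken)))
```

The trade-off is that a sentence that really does end with 'Inc.' will be merged with the one after it, so treat this as a convenience pass rather than a guaranteed fix.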
Frequently asked questions
Are the headings from my PDF preserved as Markdown headings?
No. The only Markdown headings in the output are the ## Page N markers the tool adds. A title or section heading from your PDF comes through as ordinary text because the tool can't reliably tell a heading from body text by appearance. Promote them to #/## yourself after conversion.
Will tables in the PDF become Markdown tables?
No. Table cells are positioned text and flatten into space-joined lines with no | column structure. For tabular data, use PDF Table to JSON for structured records or PDF to Excel for a spreadsheet, then format as Markdown if you still need it.
Does bold and italic text survive?
No. Font weight and style are layout attributes, not part of the text stream, so they are dropped. You get unstyled text and add **bold** or *italic* yourself where needed.
Does this work on scanned PDFs?
Not directly. A scan is an image with no text layer, so you'd get page headings and little else. Run PDF OCR first to add a searchable text layer, then convert the OCR'd PDF here.
Can I import the Markdown into Notion or Obsidian?
Yes. The output is plain, standard Markdown with no front matter or extended syntax, so it imports cleanly. For a Notion-specific walkthrough see the PDF to Markdown for Notion guide; Obsidian just needs the .md file dropped into a vault.
Are there any options — encoding, page range, style?
No. The tool converts the whole document automatically on drop, as UTF-8, with ## Page N headings. To work with a subset of pages, extract them first with PDF Extract Pages and convert the result.
Why is each sentence on its own line?
Each page's text is split on sentence-ending punctuation (., !, ?) and one sentence is written per line. This keeps Git diffs small and readable. If you prefer flowing paragraphs, join the lines in your editor — the content is identical either way.
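Joining the lines can also be scripted. As a minimal sketch, the hypothetical `join_sentences` helper below merges each page's sentence lines into one paragraph and keeps the `## Page N` headings as separate blocks. It assumes the output format described on this page and treats any blank line as a paragraph break.

```python
def join_sentences(markdown: str) -> str:
    """Merge one-sentence-per-line text into paragraphs, keeping page headings."""
    blocks = []     # each block is a heading or a joined paragraph
    paragraph = []  # sentences collected for the current paragraph

    def flush() -> None:
        if paragraph:
            blocks.append(" ".join(paragraph))
            paragraph.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("## Page "):
            flush()
            blocks.append(stripped)
        elif stripped:
            paragraph.append(stripped)
        else:
            flush()  # a blank line ends the current paragraph
    flush()
    return "\n\n".join(blocks) + "\n"
```

Keep the original file under version control if you rely on small diffs, and apply this only to the copy you intend to read or publish.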
Is my PDF uploaded anywhere?
No. Conversion runs entirely in your browser via pdf.js. The file's bytes never leave your machine — the result panel even states '0 bytes uploaded'. Signed-in users have a single usage counter recorded, never the document content.
What's the largest PDF I can convert?
Free tier: 2 MB and up to 50 pages. Pro: 50 MB and 500 pages. Larger plans go higher still. Files over the page or size cap are blocked on drop with an upgrade prompt; split or extract pages to stay within free limits.
Will hyperlinks come through as Markdown links?
No. The tool reads visible text, not link annotations, so a clickable link appears as its display text with no [text](url) syntax. Re-add links manually where they matter.
How is this different from PDF to Text?
PDF to Text gives you a plain .txt with no structure. This tool produces a .md file that additionally inserts a ## Page N heading before each page and splits text by sentence — handier when you're heading into a Markdown editor or docs pipeline.
Can I automate this without using the web UI?
Yes, on Pro. pdf-to-markdown is a runner-builtin tool: pair the @jadapps/runner once and POST the PDF to your local runner endpoint to get the Markdown back. Processing still happens locally on your machine — the document never reaches JAD's servers.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.