How to convert a PDF article or paper to Markdown
- Step 1: Verify the article PDF is born-digital. Try to select a paragraph in the PDF. If text highlights, the text layer is present and extractable. Older scanned journal PDFs won't select; run PDF OCR first to add a text layer.
- Step 2: Drop the PDF onto the converter. The tool reads it in your browser with pdf.js and converts immediately. There is no options panel; extraction is automatic and covers every page.
- Step 3: Review the previewed Markdown. Skim the first ~5,000 characters in the result panel. Check the abstract and first section came through readable and that paragraphs aren't badly interleaved (a risk on two-column journal layouts).
- Step 4: Download the .md file. Save it as article.md (UTF-8, text/markdown). The full text is saved, not just the preview.
- Step 5: Add YAML front matter and promote headings. At the top of the file add the front matter your SSG expects (title, date, author, description, tags). Then promote the article title to # and section headings to ##/###. They arrived as plain text, so this step is manual.
- Step 6: Re-link citations and add the original PDF. Inline markers like [1] and the reference list are plain text; turn them into anchor links if your template supports footnotes. Best practice: also publish a link to the original PDF for readers who want the typeset version.
How article elements convert
What an academic or editorial PDF produces in the Markdown, and what you'll re-create by hand.
| Article element | In the Markdown? | Notes |
|---|---|---|
| Abstract & body paragraphs | Yes | Extracted as text and split into one sentence per line under the page heading. |
| Article title & section headings | As plain text | Not promoted to #/##. You set heading levels manually. |
| Bold / italic emphasis | No | Emphasis is a font attribute and is dropped; re-add **/* where it matters. |
| Reference / bibliography list | As plain text | Comes through as lines of text, not a numbered Markdown list. No automatic citation linking. |
| Inline citation markers ([1], (Smith 2024)) | As literal text | Survive as the characters you see; not converted to footnote links. |
| Tables & data | No (flattened) | Cells collapse into space-joined text. Use PDF Table to JSON for the data. |
| Figures, charts, equations-as-images | No | Images are ignored. Export them with PDF to PNG and embed manually. |
| DOIs / URLs in text | As plain text | Visible URL text comes through; clickable link annotations do not become [text](url). |
Output format and tier limits
No encoding, page-range, or template options — the conversion is fixed.
| Property | Value |
|---|---|
| Input | One .pdf at a time |
| Output | One .md file, UTF-8, text/markdown |
| Headings emitted | ## Page N only |
| Front matter | Not added — you write it |
| Free tier | 2 MB / 50 pages |
| Pro tier | 50 MB / 500 pages |
| Privacy | In-browser; 0 bytes uploaded |
Cookbook
Patterns for turning extracted article text into a publishable Markdown post. Sample content is illustrative.
Single-column paper → readable Markdown
A modern, single-column preprint extracts in clean reading order — the easiest case for republishing.
Input: preprint.pdf (single-column, 8 pages)
Output (preprint.md):

    ## Page 1
    A Survey of Edge Caching Strategies
    We review caching policies for edge networks.
    Results show a 22% latency reduction.
    ## Page 2
    ...
Add front matter for a Hugo/Astro post
The tool never writes front matter. Prepend it yourself before the first ## Page N so your SSG picks up the metadata.

    ---
    title: "A Survey of Edge Caching Strategies"
    date: 2026-06-06
    author: "J. Rivera"
    description: "Caching policies for edge networks."
    ---
    ## Page 1
    A Survey of Edge Caching Strategies
    We review caching policies for edge networks ...
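If you republish many articles, prepending the front matter can be scripted. The following is a minimal sketch, not part of the tool: `add_front_matter` is a hypothetical helper, and it assumes every metadata value is a plain string. It quotes values with `json.dumps`, relying on the fact that a JSON string is also a valid YAML double-quoted scalar.

```python
import json


def add_front_matter(markdown: str, meta: dict) -> str:
    """Prepend a YAML front matter block to converted Markdown.

    Hypothetical helper: assumes every value in `meta` is a string.
    """
    lines = ["---"]
    for key, value in meta.items():
        # A JSON string literal is also a valid YAML double-quoted scalar,
        # so json.dumps handles quoting and escaping for us.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    # Blank line between the front matter and the first "## Page N" heading.
    return "\n".join(lines) + "\n\n" + markdown.lstrip("\n")


if __name__ == "__main__":
    meta = {"title": "A Survey of Edge Caching Strategies", "date": "2026-06-06"}
    print(add_front_matter("## Page 1\nA Survey of Edge Caching Strategies", meta))
```

Note that this sketch writes the date as a quoted string; check whether your generator wants an unquoted date instead. List-valued fields such as tags would need extra handling.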
Promote the title and headings by hand
Headings arrive as plain text. A quick editor pass turns them into real Markdown structure.
Before (as extracted):

    ## Page 1
    A Survey of Edge Caching Strategies
    1. Introduction
    Edge caching reduces latency by ...

After (your edit):

    # A Survey of Edge Caching Strategies
    ## 1. Introduction
    Edge caching reduces latency by ...
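For papers with numbered sections, part of this pass can be scripted. The sketch below is an illustration under stated assumptions, not something the tool does: `promote_headings` is a hypothetical helper, you supply the title text yourself, and section headings are assumed to look like "1. Introduction" or "2.1 Methods". It also drops the ## Page N markers, as in the "After" sample. Review the result, because a body line that happens to start with a number and a capitalised word would be promoted too.

```python
import re

# Page markers emitted by the converter, e.g. "## Page 3".
PAGE_MARKER = re.compile(r"^## Page \d+\s*$")
# Assumed heading shape: "1. Introduction", "2.1 Methods" (no sentence punctuation).
SECTION = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+[A-Z][^.!?]{0,78}$")


def promote_headings(markdown: str, title: str) -> str:
    """Turn the title and numbered section lines into Markdown headings."""
    out = []
    title_done = False
    for line in markdown.splitlines():
        text = line.strip()
        if PAGE_MARKER.match(text):
            continue  # drop "## Page N" markers
        section = SECTION.match(text)
        if not title_done and text == title:
            out.append(f"# {text}")
            title_done = True
        elif section:
            # "1" -> ##, "2.1" -> ###, and so on.
            depth = section.group(1).count(".") + 2
            out.append(f"{'#' * depth} {text}")
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = (
        "## Page 1\n"
        "A Survey of Edge Caching Strategies\n"
        "1. Introduction\n"
        "Edge caching reduces latency by ..."
    )
    print(promote_headings(sample, "A Survey of Edge Caching Strategies"))
```

Unnumbered headings such as a bare "Methods" line are left alone by this sketch and still need a manual edit.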
Two-column journal layout interleaves
Classic two-column journal PDFs can mix the left and right columns line by line. The words are all there; the order needs fixing.
Output (two-column page, interleaved):

    ## Page 3
    The model was trained on
    Each run used a fixed
    10,000 samples over
    seed for reproducibility.

→ separate and re-order the two columns in your editor, or re-export the PDF as single-column if you can.
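When the interleaving is strictly line by line, as in the sample above, the columns can be pulled apart by taking alternate lines. This is a rough sketch under that assumption only; `deinterleave` is a hypothetical helper, and real pages often break the alternation (headings, figures, columns of unequal length), so treat the result as a starting point for manual re-ordering.

```python
def deinterleave(lines: list) -> tuple:
    """Split strictly alternating left/right column lines into two texts.

    Assumes line 0 is from the left column, line 1 from the right, and so on.
    """
    left = " ".join(lines[0::2])   # even-indexed lines: left column
    right = " ".join(lines[1::2])  # odd-indexed lines: right column
    return left, right


if __name__ == "__main__":
    page = [
        "The model was trained on",
        "Each run used a fixed",
        "10,000 samples over",
        "seed for reproducibility.",
    ]
    left, right = deinterleave(page)
    print(left)
    print(right)
```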
Scanned archive PDF needs OCR first
An older journal scan has no text layer, so direct conversion yields empty pages. OCR adds the text, then this tool works.
scanned-1998-paper.pdf →

    ## Page 1
    (empty)
    ...

Fix:
1. PDF OCR (/pdf-tools/pdf-ocr) → adds text layer
2. Convert the OCR'd PDF here → real Markdown text
Edge cases and what actually happens
Two-column journal layout
May interleave: pdf.js returns text in its own order; on two-column pages that can weave the columns together line by line. The text is complete but mis-ordered. Single-column articles are clean; for two-column ones, re-order paragraphs in your editor.
Section headings stay as body text
By design: Only ## Page N is emitted. 'Introduction', 'Methods', 'Conclusion' and the article title all come through as plain text lines. Promote them to ##/### yourself; the tool can't infer heading levels from font size.
Reference list isn't a Markdown list
Expected: The bibliography extracts as ordinary text lines, not a numbered Markdown list, and citations aren't linked. Convert to footnotes or a 1. list manually if your template needs it.
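If your renderer supports the footnote extension (`[^1]` syntax, which is not part of core CommonMark), the conversion can be partly scripted. The sketch below is an assumption-laden illustration: `to_footnotes` is a hypothetical helper, it assumes bracketed numeric markers like [1], and it assumes each reference entry starts with its marker. Lines without a marker are treated as continuations of the previous entry. Ranges such as [1-3] and author-year citations are not handled.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")       # inline marker, e.g. [1]
ENTRY = re.compile(r"^\[(\d+)\]\s*(.*)$")  # reference line, e.g. "[1] A. Author."


def to_footnotes(body: str, references: str) -> str:
    """Rewrite [n] markers as [^n] and build matching footnote definitions."""
    linked = CITATION.sub(r"[^\1]", body)
    entries = []  # list of (number, [text parts])
    for raw in references.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            number, text = match.groups()
            entries.append((number, [text] if text else []))
        elif entries:
            # No marker: continuation of the previous reference entry.
            entries[-1][1].append(line)
    defs = [f"[^{number}]: {' '.join(parts)}" for number, parts in entries]
    return linked.rstrip("\n") + "\n\n" + "\n".join(defs) + "\n"


if __name__ == "__main__":
    # Placeholder reference text, for illustration only.
    refs = "[1] A. Author.\nExample title.\n[2] B. Writer.\nAnother example."
    print(to_footnotes("Caching helps [1].", refs))
```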
Equations rendered as images vanish
Expected: Many papers typeset equations as images, which this tool ignores. Inline equation text that is real glyphs may come through garbled. Export equation images with PDF to PNG and embed them, or re-typeset in MathJax/KaTeX.
Scanned / image-only article
Empty output: No text layer means near-empty Markdown. Run PDF OCR first to add text, then convert the result.
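A quick way to spot this case in a downloaded file is to look for ## Page N headings with nothing under them. The sketch below is a hypothetical check, assuming an empty page is written as a heading followed by no text (or only blank lines); the exact output for an empty page is not documented here, so adjust the test if yours differs.

```python
import re

PAGE = re.compile(r"^## Page (\d+)\s*$")


def empty_pages(markdown: str) -> list:
    """Return the page numbers whose '## Page N' section contains no text."""
    empties = []
    current = None      # page number of the section being scanned
    has_text = False    # whether that section has any non-blank line
    for line in markdown.splitlines():
        match = PAGE.match(line)
        if match:
            if current is not None and not has_text:
                empties.append(current)
            current = int(match.group(1))
            has_text = False
        elif line.strip():
            has_text = True
    # Close out the final page section.
    if current is not None and not has_text:
        empties.append(current)
    return empties


if __name__ == "__main__":
    sample = "## Page 1\n\n## Page 2\nSome text.\n## Page 3\n"
    print(empty_pages(sample))
```

If most or all pages come back empty, the PDF is probably a scan and needs OCR before conversion.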
Article over 50 pages on free tier
Blocked: Free caps at 50 pages. A long monograph is blocked on drop; Pro allows up to 500. Most single articles are well under 50 pages, but a long thesis needs to be split with PDF Extract Pages or converted on Pro.
Article PDF over 2 MB on free tier
Blocked: Image-heavy papers can exceed the 2 MB free cap and are blocked. Pro raises it to 50 MB. Compress first with PDF Lossy Compress or convert on Pro.
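The two caps combine as a simple check: a file fits a tier only if it is within both the size and the page limit. The sketch below encodes the limits stated on this page; the names are hypothetical, and it assumes that "MB" means 1024 × 1024 bytes and that a file exactly at a limit is accepted, neither of which is stated here. You supply the page count yourself, for example from your PDF viewer.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class Tier:
    name: str
    max_bytes: int
    max_pages: int


# Limits as listed in the tier table on this page.
TIERS = (Tier("free", 2 * MB, 50), Tier("pro", 50 * MB, 500))


def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier that accepts the file, or None if none does."""
    for tier in TIERS:
        if size_bytes <= tier.max_bytes and pages <= tier.max_pages:
            return tier.name
    return None


if __name__ == "__main__":
    print(smallest_tier(1 * MB, 8))     # small paper
    print(smallest_tier(3 * MB, 8))     # image-heavy paper
    print(smallest_tier(1 * MB, 120))   # long thesis
```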
Ligatures in body text (fi, fl)
Usually preserved: Professional typesetting uses ligatures like 'fi'. With a correct ToUnicode map these extract as the right characters; occasionally a font maps them oddly and you'll see a stray glyph. A quick find-and-replace in your editor fixes it.
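One common variant of this problem is a ligature arriving as a single Unicode presentation-form character (for example U+FB01 for "fi") instead of two letters. The sketch below expands the five Latin f-ligatures in that block; `expand_ligatures` is a hypothetical helper. It does not help when the font maps a ligature to some unrelated glyph, which still needs a manual find-and-replace.

```python
# Unicode "Alphabetic Presentation Forms" ligatures and their plain spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace single-character f-ligatures with their ordinary letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("e\ufb03cient \ufb01le work\ufb02ow"))
```

A targeted table is used here rather than full Unicode compatibility normalization, which would also rewrite characters such as superscripts that you may want to keep.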
DOIs and URLs aren't clickable links
Expected: URLs appear as plain text; clickable link annotations aren't read, so nothing becomes [text](url). Wrap the important DOIs and links in Markdown link syntax during your editorial pass.
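The wrapping can be scripted for bare URLs and DOIs. The following is a sketch, not a complete URL parser: `link_urls_and_dois` is a hypothetical helper, the patterns are deliberately simple, and bare DOIs are pointed at the https://doi.org/ resolver. Run it once on the raw converter output, since it would double-wrap URLs that are already inside Markdown links.

```python
import re

# One pass, URL alternative first, so a DOI inside a URL is not matched twice.
LINKABLE = re.compile(
    r"(?P<url>https?://[^\s<>()\[\]]+)"
    r"|(?P<doi>\b10\.\d{4,9}/[^\s<>()\[\]]+)"
)


def _to_link(match):
    raw = match.group(0)
    target = raw.rstrip(".,;:")   # keep sentence punctuation outside the link
    tail = raw[len(target):]
    if match.group("url"):
        return f"[{target}]({target}){tail}"
    return f"[{target}](https://doi.org/{target}){tail}"


def link_urls_and_dois(text: str) -> str:
    """Wrap bare URLs and DOIs in Markdown link syntax."""
    return LINKABLE.sub(_to_link, text)


if __name__ == "__main__":
    # example.org and 10.1000/xyz123 are placeholders, not real citations.
    print(link_urls_and_dois("See https://example.org/paper. DOI 10.1000/xyz123, as cited."))
```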
Frequently asked questions
Will the article's headings become Markdown headings?
Only the ## Page N markers are real headings. The article title and section headings arrive as plain text because the tool can't tell them apart from body text by appearance. Promote them to #/##/### yourself when you prepare the post.
Do references and in-text citations convert correctly?
They come through as plain text. The bibliography is lines of text (not a numbered list) and markers like [1] survive as literal characters with no automatic linking. If you want footnote links, add them during editing.
What about figures and equations?
Images — including figures, charts, and equations typeset as pictures — are not extracted at all. Export them separately with PDF to PNG and embed them, or re-typeset equations with KaTeX/MathJax in your Markdown.
Does it add YAML front matter for my static site?
No. You add the front matter yourself at the top of the file before the first ## Page 1. Include whatever your generator expects — title, date, author, description, tags.
My two-column paper came out scrambled — why?
pdf.js returns text in its internal order, which on two-column layouts can interleave the columns. The full text is present but mixed. Re-order paragraphs in your editor, or if you control the source, re-export it as a single column before converting.
Can I use the output in Docusaurus or MkDocs?
Yes — both consume standard Markdown. Paste the converted text into a .md/.mdx file, promote the headings, and add the page to your nav. For a docs-site-specific walkthrough see the PDF Documentation to Markdown guide.
How do I handle an older scanned journal PDF?
Run PDF OCR first to add a text layer (English plus several other languages are supported), then convert the OCR'd PDF here. Direct conversion of a scan produces empty pages.
Should I keep a link to the original PDF on the page?
Yes — it's good practice for academic content. Readers often want the typeset version with correct figures and equations. Add a 'Download the original PDF' link near the top of your republished article.
Is bold or italic emphasis preserved?
No. Emphasis is a font property, not text, so it's dropped. Re-apply **bold** and *italic* where the article used them.
Is the article uploaded to a server?
No. Everything runs in your browser via pdf.js — useful for embargoed preprints and unpublished manuscripts. The bytes never leave your machine.
What size article can I convert for free?
Up to 2 MB and 50 pages on the free tier, which covers most single papers. Pro raises it to 50 MB and 500 pages. Over the cap, split with PDF Extract Pages or upgrade.
How does this differ from PDF to HTML?
PDF to HTML gives you HTML output; this gives you Markdown, which is friendlier for static-site content folders and Git-based editorial workflows. Both extract the same underlying text via pdf.js.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.