Convert a PDF Article to Markdown for the Web

How to convert a PDF article or paper to Markdown

Step 1
Verify the article PDF is born-digital: Try to select a paragraph in the PDF. If the text highlights, a text layer is present and extractable. Older scanned journal PDFs won't select; run PDF OCR first to add a text layer.
Step 2
Drop the PDF onto the converter: The tool reads it in your browser with pdf.js and converts immediately. There is no options panel; extraction is automatic and covers every page.
Step 3
Review the previewed Markdown: Skim the first ~5,000 characters in the result panel. Check that the abstract and first section came through readable and that paragraphs aren't badly interleaved (a risk on two-column journal layouts).
Step 4
Download the .md file: Save it as article.md (UTF-8, text/markdown). The full text is saved, not just the preview.
Step 5
Add YAML front matter and promote headings: At the top of the file, add the front matter your SSG expects (title, date, author, description, tags). Then promote the article title to # and section headings to ##/###. They arrived as plain text, so this step is manual.
Step 6
Re-link citations and add the original PDF: Inline markers like [1] and the reference list are plain text; turn them into anchor links if your template supports footnotes. Best practice: also publish a link to the original PDF for readers who want the typeset version.

How article elements convert

What an academic or editorial PDF produces in the Markdown, and what you'll re-create by hand.

Article element	In the Markdown?	Notes
Abstract & body paragraphs	Yes	Extracted as text and split into one sentence per line under the page heading.
Article title & section headings	As plain text	Not promoted to `#`/`##`. You set heading levels manually.
Bold / italic emphasis	No	Emphasis is a font attribute and is dropped; re-add `*`/`**` where it matters.
Reference / bibliography list	As plain text	Comes through as lines of text, not a numbered Markdown list. No automatic citation linking.
Inline citation markers ([1], (Smith 2024))	As literal text	Survive as the characters you see; not converted to footnote links.
Tables & data	No (flattened)	Cells collapse into space-joined text. Use PDF Table to JSON for the data.
Figures, charts, equations-as-images	No	Images are ignored. Export them with PDF to PNG and embed manually.
DOIs / URLs in text	As plain text	Visible URL text comes through; clickable link annotations do not become `[text](url)`.

Output format and tier limits

No encoding, page-range, or template options — the conversion is fixed.

Property	Value
Input	One `.pdf` at a time
Output	One `.md` file, UTF-8, `text/markdown`
Headings emitted	`## Page N` only
Front matter	Not added — you write it
Free tier	2 MB / 50 pages
Pro tier	50 MB / 500 pages
Privacy	In-browser; 0 bytes uploaded
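
If you batch articles, a quick pre-check against the limits in the table above can save a blocked drop. The sketch below is only an illustration: the names `Tier`, `TIERS`, and `fits_tier` are made up here, and it assumes that "MB" means 1024 × 1024 bytes and that the caps are inclusive, neither of which the table states.

```python
from dataclasses import dataclass

# Assumption: the table does not say whether "MB" is binary or decimal.
MB = 1024 * 1024


@dataclass(frozen=True)
class Tier:
    max_bytes: int
    max_pages: int


# Limits copied from the table above.
TIERS = {
    "free": Tier(max_bytes=2 * MB, max_pages=50),
    "pro": Tier(max_bytes=50 * MB, max_pages=500),
}


def fits_tier(size_bytes: int, pages: int, tier: str = "free") -> bool:
    """Return True when both the file size and the page count are within the tier's caps."""
    limit = TIERS[tier]
    # Assumption: a file exactly at the cap is still accepted.
    return size_bytes <= limit.max_bytes and pages <= limit.max_pages
```

You would supply the byte size from the file system and the page count from whatever PDF viewer or library you already use.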

Cookbook

Patterns for turning extracted article text into a publishable Markdown post. Sample content is illustrative.

Single-column paper → readable Markdown

A modern, single-column preprint extracts in clean reading order — the easiest case for republishing.

Input:  preprint.pdf (single-column, 8 pages)

Output (preprint.md):
## Page 1

A Survey of Edge Caching Strategies
We review caching policies for edge networks.
Results show a 22% latency reduction.

## Page 2
...

Add front matter for a Hugo/Astro post

The tool never writes front matter. Prepend it yourself before the first ## Page N so your SSG picks up the metadata.

---
title: "A Survey of Edge Caching Strategies"
date: 2026-06-06
author: "J. Rivera"
description: "Caching policies for edge networks."
---

## Page 1

A Survey of Edge Caching Strategies
We review caching policies for edge networks ...
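
If you convert many articles, the prepend step can be scripted. This is a minimal sketch, not part of the tool: `add_front_matter` is a hypothetical helper, and it writes every value as a double-quoted string (JSON string syntax, which YAML also accepts). If your generator needs an unquoted date like the one in the example above, adjust that line by hand.

```python
import json


def add_front_matter(markdown: str, meta: dict[str, str]) -> str:
    """Prepend a YAML front matter block unless the text already starts with one."""
    if markdown.startswith("---\n"):
        return markdown  # front matter already present; leave the file alone
    # json.dumps produces a double-quoted, escaped string that is also valid YAML.
    lines = [f"{key}: {json.dumps(value, ensure_ascii=False)}" for key, value in meta.items()]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + markdown
```

Calling it twice on the same text is harmless, because the second call sees the existing `---` line and returns the input unchanged.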

Promote the title and headings by hand

Headings arrive as plain text. A quick editor pass turns them into real Markdown structure.

Before (as extracted):
## Page 1

A Survey of Edge Caching Strategies
1. Introduction
Edge caching reduces latency by ...

After (your edit):
# A Survey of Edge Caching Strategies

## 1. Introduction
Edge caching reduces latency by ...
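
For papers with numbered sections, part of this pass can be automated. The sketch below is a heuristic under stated assumptions, not what the tool does: it treats the first non-empty line as the article title, treats short lines such as "1. Introduction" or "2.1 Data Sources" as section headings, and drops the `## Page N` markers as in the before/after pair above. `promote_headings` and both patterns are hypothetical, and unnumbered headings still need a manual edit.

```python
import re

PAGE_HEADING = re.compile(r"^## Page \d+$")
# A section number ("1", "2.1"), an optional dot, then a capitalised title with no full stop.
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+[A-Z][^.]*$")


def promote_headings(markdown: str) -> str:
    out = []
    title_done = False
    for line in markdown.splitlines():
        if PAGE_HEADING.match(line):
            continue  # drop the page marker
        numbered = NUMBERED.match(line)
        if line and not title_done:
            out.append("# " + line)  # assumption: first text line is the title
            title_done = True
        elif numbered and len(line) <= 80:
            depth = numbered.group(1).count(".") + 1  # "1" -> 1, "2.1" -> 2
            out.append("#" * min(depth + 1, 6) + " " + line)
        else:
            out.append(line)
    return "\n".join(out).strip("\n") + "\n"
```

Review the result before publishing: a body sentence that happens to start with a number and contains no full stop would be promoted by mistake.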

Two-column journal layout interleaves

Classic two-column journal PDFs can mix the left and right columns line by line. The words are all there; the order needs fixing.

Output (two-column page, interleaved):
## Page 3

The model was trained on Each run used a fixed
10,000 samples over seed for reproducibility.

→ separate and re-order the two columns in your editor,
  or re-export the PDF as single-column if you can.

Scanned archive PDF needs OCR first

An older journal scan has no text layer, so direct conversion yields empty pages. OCR adds the text, then this tool works.

scanned-1998-paper.pdf  →  ## Page 1 (empty) ...

Fix:
  1. PDF OCR (/pdf-tools/pdf-ocr) → adds text layer
  2. Convert the OCR'd PDF here → real Markdown text
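
To spot a scan before you start editing, you can check the downloaded file for pages with no text. This is a rough sketch: `empty_pages` is a hypothetical helper, and it assumes an empty page appears as a `## Page N` heading with nothing but whitespace beneath it.

```python
import re

PAGE_HEADING = re.compile(r"^## Page (\d+)[ \t]*$", re.MULTILINE)


def empty_pages(markdown: str) -> list[int]:
    """Return the page numbers whose section has no text under its ## Page N heading."""
    matches = list(PAGE_HEADING.finditer(markdown))
    empty = []
    for i, match in enumerate(matches):
        # A page's body runs from its heading to the next heading (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        if not markdown[match.end():end].strip():
            empty.append(int(match.group(1)))
    return empty
```

If every page comes back empty, the PDF is almost certainly image-only and needs OCR first.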

Edge cases and what actually happens

Two-column journal layout

May interleave

pdf.js returns text in its own order; on two-column pages that can weave the columns together line by line. The text is complete but mis-ordered. Single-column articles are clean — for two-column ones, re-order paragraphs in your editor.

Section headings stay as body text

By design

Only ## Page N is emitted. 'Introduction', 'Methods', 'Conclusion' and the article title all come through as plain text lines. Promote them to ##/### yourself — the tool can't infer heading levels from font size.

Reference list isn't a Markdown list

Expected

The bibliography extracts as ordinary text lines, not a numbered Markdown list, and citations aren't linked. Convert to footnotes or a 1. list manually if your template needs it.
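
If your template supports footnotes, the numeric case can be scripted. The sketch below makes several assumptions: citations look like `[1]`, the bibliography follows a line reading "References" or "Bibliography", each entry starts its own line with `[n]`, and your renderer understands the `[^1]` footnote extension (it is not part of core CommonMark). `to_footnotes` is a hypothetical helper; author-year citations such as (Smith 2024) are left untouched.

```python
import re

REFS_START = re.compile(r"^(References|Bibliography)[ \t]*$", re.MULTILINE | re.IGNORECASE)
INLINE = re.compile(r"\[(\d+)\]")
ENTRY = re.compile(r"^\[(\d+)\]\s+", re.MULTILINE)


def to_footnotes(markdown: str) -> str:
    """Turn [n] markers into [^n] and bibliography entries into [^n]: definitions."""
    split = REFS_START.search(markdown)
    cut = split.end() if split else len(markdown)
    body, refs = markdown[:cut], markdown[cut:]
    body = INLINE.sub(r"[^\1]", body)      # in-text markers
    refs = ENTRY.sub(r"[^\1]: ", refs)     # reference-list entries
    return body + refs
```

With no "References" line in the file, only the in-text markers are rewritten and the rest is returned as is.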

Equations rendered as images vanish

Expected

Many papers typeset equations as images, which this tool ignores. Inline equation text that is real glyphs may come through garbled. Export equation images with PDF to PNG and embed them, or re-typeset in MathJax/KaTeX.

Scanned / image-only article

Empty output

No text layer means near-empty Markdown. Run PDF OCR first to add text, then convert the result.

Article over 50 pages on free tier

Blocked

The free tier caps at 50 pages, and a longer document is blocked on drop; Pro allows up to 500. Most single articles are well under 50 pages. For a long thesis or monograph, split it with PDF Extract Pages or upgrade.

Article PDF over 2 MB on free tier

Blocked

Image-heavy papers can exceed the 2 MB free cap and are blocked. Pro raises it to 50 MB. Compress first with PDF Lossy Compress or convert on Pro.

Ligatures in body text (fi, fl)

Usually preserved

Professional typesetting uses ligatures like 'ﬁ'. With a correct ToUnicode map these extract as the right characters; occasionally a font maps them oddly and you'll see a stray glyph. A quick find-and-replace in your editor fixes it.
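
When the stray glyph is one of the Unicode Latin ligature characters, the find-and-replace can be done in one pass. A minimal sketch, with `expand_ligatures` as a hypothetical helper; it covers only the five f-ligatures listed, so any other odd glyph a font produces needs its own entry.

```python
# Unicode "Latin small ligature" code points and their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace single-character f-ligatures with their ordinary letters."""
    return text.translate(TABLE)
```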

DOIs and URLs aren't clickable links

Expected

URLs appear as plain text; clickable link annotations aren't read, so nothing becomes [text](url). Wrap the important DOIs and links in Markdown link syntax during your editorial pass.
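
Wrapping every visible URL and DOI can also be scripted. The sketch below is an approximation, not a full URL or DOI parser: `linkify` is a hypothetical helper, it assumes the text contains no Markdown links yet (true of the raw output described above), it stops at brackets and parentheses, and it treats a trailing `.`, `,`, `;`, or `:` as sentence punctuation rather than part of the link. Bare DOIs are pointed at the `https://doi.org/` resolver.

```python
import re

# Either a visible http(s) URL or a bare DOI such as 10.1000/xyz123 (placeholder).
TOKEN = re.compile(r"https?://[^\s<>\[\]()]+|\b10\.\d{4,9}/[^\s<>\[\]()]+")
TRAILING = ".,;:"


def linkify(text: str) -> str:
    """Wrap bare URLs and DOIs in Markdown link syntax."""

    def repl(match: re.Match) -> str:
        raw = match.group(0)
        core = raw.rstrip(TRAILING)   # keep sentence punctuation outside the link
        tail = raw[len(core):]
        href = core if core.startswith("http") else "https://doi.org/" + core
        return f"[{core}]({href}){tail}"

    return TOKEN.sub(repl, text)
```

Run it once only: a second pass over already linked text would wrap the links again.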

Frequently asked questions

Will the article's headings become Markdown headings?

Only the ## Page N markers are real headings. The article title and section headings arrive as plain text because the tool can't tell them apart from body text by appearance. Promote them to #/##/### yourself when you prepare the post.

Do references and in-text citations convert correctly?

They come through as plain text. The bibliography is lines of text (not a numbered list) and markers like [1] survive as literal characters with no automatic linking. If you want footnote links, add them during editing.

What about figures and equations?

Images — including figures, charts, and equations typeset as pictures — are not extracted at all. Export them separately with PDF to PNG and embed them, or re-typeset equations with KaTeX/MathJax in your Markdown.

Does it add YAML front matter for my static site?

No. You add the front matter yourself at the top of the file, before the ## Page 1 heading. Include whatever your generator expects: title, date, author, description, tags.

My two-column paper came out scrambled — why?

pdf.js returns text in its internal order, which on two-column layouts can interleave the columns. The full text is present but mixed. Re-order paragraphs in your editor, or if you control the source, re-export it as a single column before converting.

Can I use the output in Docusaurus or MkDocs?

Yes — both consume standard Markdown. Paste the converted text into a .md/.mdx file, promote the headings, and add the page to your nav. For a docs-site-specific walkthrough see the PDF Documentation to Markdown guide.

How do I handle an older scanned journal PDF?

Run PDF OCR first to add a text layer (English plus several other languages are supported), then convert the OCR'd PDF here. Direct conversion of a scan produces empty pages.

Should I keep a link to the original PDF on the page?

Yes — it's good practice for academic content. Readers often want the typeset version with correct figures and equations. Add a 'Download the original PDF' link near the top of your republished article.

Is bold or italic emphasis preserved?

No. Emphasis is a font property, not text, so it's dropped. Re-apply **bold** and *italic* where the article used them.

Is the article uploaded to a server?

No. Everything runs in your browser via pdf.js — useful for embargoed preprints and unpublished manuscripts. The bytes never leave your machine.

What size article can I convert for free?

Up to 2 MB and 50 pages on the free tier, which covers most single papers. Pro raises it to 50 MB and 500 pages. Over the cap, split with PDF Extract Pages or upgrade.

How does this differ from PDF to HTML?

PDF to HTML gives you HTML output; this gives you Markdown, which is friendlier for static-site content folders and Git-based editorial workflows. Both extract the same underlying text via pdf.js.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.
