How to convert a PDF article to HTML for web publishing
- Step 1: Drop the article PDF onto the converter. Add the article to the PDF to HTML converter. Conversion starts automatically; there's nothing to configure.
- Step 2: pdf.js extracts the text. Every page's text is pulled and wrapped in a <div class="page"> section. A typeset PDF from a journal or layout program carries the text layer this needs.
- Step 3: Review the preview for column order. Academic PDFs are often two-column. Check the first 5,000 characters in the preview for interleaved lines so you know how much reading-order cleanup to budget for.
- Step 4: Download the HTML. Save the .html file. It contains the whole article, page by page, in a single document.
- Step 5: Add the article metadata and structure. Replace the placeholder <title>Converted PDF</title> with the real title, add author/date and Open Graph meta tags, and promote the abstract and section headings to <h2>/<h3>.
- Step 6: Re-insert figures, then publish with a 'Download PDF' link. Export figures with a sibling tool and add <img> tags, then publish, keeping the original PDF available for readers who want the typeset version of record.
Academic article elements — what survives conversion
The converter extracts text only. Everything structural or visual needs manual restoration.
| Article element | In the HTML? | How to restore it |
|---|---|---|
| Title / author / abstract text | Yes — as <p> under a page label | Promote the title to <h1>, mark up the abstract |
| Section headings (Methods, Results…) | No — flattened to <p> | Promote to <h2>/<h3> by hand |
| Figures, charts, equation images | No — not extracted | Export with PDF to PNG, add <img> |
| Data tables | No <table> — text flattens | Use PDF Table to JSON |
| References / bibliography | Yes — as <p> text | Re-format as a list; add DOI links manually |
| Footnotes / margin notes | Text only, position lost | Re-anchor as inline notes in your editor |
Metadata to add after conversion
The converter writes only a placeholder title. These are the tags an article page needs for SEO and sharing.
| Tag | Why it matters | Example |
|---|---|---|
| <title> | Replaces the placeholder; shown in search results & tabs | <title>Effects of X on Y — J. Smith (2026)</title> |
| <meta name="description"> | Search snippet; often the abstract's first lines | content="We measured ..." |
| og:title / og:description | Social-share card text | <meta property="og:title" content="..."> |
| <link rel="canonical"> | Points to the authoritative URL (avoids duplicate-content issues vs the PDF) | href="https://site/articles/x-on-y" |
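The tags in the table can be generated and swapped in for the placeholder in one pass. The following is a minimal sketch, not part of the converter: the function names `build_head_tags` and `replace_placeholder_title` are hypothetical, and it assumes the downloaded file contains the literal string `<title>Converted PDF</title>` exactly once.

```python
from html import escape


def build_head_tags(title: str, description: str, canonical_url: str) -> str:
    """Return title, description, Open Graph, and canonical tags as one string."""
    # Escape values so quotes or ampersands in a title cannot break the markup.
    t = escape(title, quote=True)
    d = escape(description, quote=True)
    u = escape(canonical_url, quote=True)
    return "\n".join([
        f"<title>{t}</title>",
        f'<meta name="description" content="{d}">',
        f'<meta property="og:title" content="{t}">',
        f'<meta property="og:description" content="{d}">',
        f'<link rel="canonical" href="{u}">',
    ])


def replace_placeholder_title(html_text: str, head_tags: str) -> str:
    """Swap the converter's placeholder title for the full set of head tags."""
    placeholder = "<title>Converted PDF</title>"  # placeholder named in this guide
    if placeholder not in html_text:
        raise ValueError("placeholder title not found")
    return html_text.replace(placeholder, head_tags, 1)
```

Read the downloaded .html file into a string, pass it through `replace_placeholder_title`, and write the result back. An og:image tag can be added to the list in the same way once the figure images are hosted.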
Input limits by tier
Checked on the PDF before conversion.
| Tier | Max file size | Max pages |
|---|---|---|
| Free | 2 MB | 50 pages |
| Pro | 50 MB | 500 pages |
| Pro + Media | 500 MB | 2,000 pages |
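To see which tier a given article needs before dropping it on the converter, the table can be expressed as a small lookup. This is an illustrative sketch only: `smallest_tier` is a hypothetical helper, and it assumes 1 MB means 1,048,576 bytes and that a file exactly at a limit is accepted, neither of which is confirmed here.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: the tool may count 1 MB as 1,000,000 bytes instead


@dataclass(frozen=True)
class TierLimit:
    max_bytes: int
    max_pages: int


# Values copied from the tier table; ordered from smallest to largest tier.
LIMITS = {
    "Free": TierLimit(2 * MB, 50),
    "Pro": TierLimit(50 * MB, 500),
    "Pro + Media": TierLimit(500 * MB, 2000),
}


def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier whose size and page limits both fit, else None."""
    for name, limit in LIMITS.items():
        # Assumption: limits are inclusive.
        if size_bytes <= limit.max_bytes and pages <= limit.max_pages:
            return name
    return None
```

Both limits must hold at once, so a short but image-heavy paper can need a higher tier purely because of its file size.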
Cookbook
Workflows for turning article PDFs into publishable web pages, with the markup you start from.
Journal paper to an article page
Convert, then add the metadata and promote the abstract and section headings the converter left as plain text.
Input: smith-2026.pdf (8 pages, two-column)
After conversion + manual edits:
<title>Effects of X on Y — Smith et al. (2026)</title>
<h1>Effects of X on Y</h1>
<p class="authors">J. Smith, A. Lee</p>
<h2>Abstract</h2>
<p>... abstract text extracted from page 1 ...</p>
<h2>1. Introduction</h2>
<p>... promoted from flattened <p> ...</p>
Adding a figure the converter skipped
Figures are never extracted. Render the figure's page region to PNG, host it, and place the <img> with a caption.
Step 1: PDF to PNG on page 4 -> figure-2.png
Step 2: host at /articles/x-on-y/figure-2.png
Step 3: insert into the article HTML:
<figure>
<img src="figure-2.png" alt="Mean response vs dose">
<figcaption>Figure 2. Dose-response curve.</figcaption>
</figure>
Fixing two-column reading order
A two-column PDF can interleave columns in the extracted text. There is no de-column option; reorder in the editor or crop one column at a time first.
Interleaved output:
<p>... col1-line1 col2-line1 col1-line2 col2-line2 ...</p>
Option A: manually split text into correct order.
Option B: crop the PDF to the left column with PDF Crop,
convert, repeat for the right column, then
concatenate the two clean HTML bodies.
Avoiding duplicate content with a canonical tag
Publishing both the PDF and the HTML can split ranking signals. Point a canonical tag at whichever URL is authoritative.
In the HTML <head>:
<link rel="canonical"
href="https://site.com/articles/x-on-y">
Keep a download link in the body:
<a href="/pdf/smith-2026.pdf">Download the PDF (version of record)</a>
Newsletter PDF to a blog post
Less structured than a journal paper — usually just promote a few headings and re-add the hero image.
Input: newsletter-may.pdf (4 pages)
Convert -> newsletter-may.html
Cleanup:
- rename <title> to the issue title
- promote 'In this issue' to <h2>
- re-add masthead image via <img>
- paste body <p> blocks into the blog editor
Edge cases and what actually happens
Abstract and section headings flatten to body text
By design: The converter does not detect article structure, so 'Abstract', 'Methods', and 'Results' all arrive as <p> text under a <h2>Page N</h2> label. Promote them by hand, or use PDF to Markdown for a per-page heading marker to post-process.
Figures and equation images dropped
By design: No images are extracted and no assets folder is created, so figures, charts, and image-rendered equations simply disappear. Export them with PDF to PNG and re-insert as <img>/<figure> during cleanup.
Two-column layout interleaves columns
May interleave: Most academic PDFs are two-column, and pdf.js returns text in internal item order, which can zig-zag across columns. The text is complete but the reading order is scrambled. Reorder manually, or crop one column at a time with PDF Crop before converting.
Scanned / photocopied paper has no text
Empty output: Older papers distributed as scans have no embedded text layer, so the page sections come out empty. Run PDF OCR first to create a text layer, then convert.
Math fonts produce garbled symbols
Garbled: Equation typesetting often uses subset fonts without Unicode maps, so extracted math comes out as gibberish. Re-render those equations as images via PDF to PNG and place them as <img>, or OCR the page if it's text-heavy.
Embargoed / pre-print manuscript
Preserved: Conversion is fully browser-local via pdf.js; the manuscript never reaches a server, so converting an unpublished or embargoed draft is safe. Only an anonymous usage counter is recorded when signed in.
Password-protected publisher PDF
Fails to open: An encrypted PDF can't be opened by pdf.js and conversion fails. If you legitimately have the password, remove it with PDF Unlock first, then convert.
Article PDF over the free-tier limits
Blocked: The free tier allows 2 MB and 50 pages. A long, figure-heavy paper can exceed the size cap. Upgrade to Pro (50 MB / 500 pages), or pull out the relevant pages with PDF Extract Pages and convert only those.
Reference list runs together as one block
Review: Because each page becomes one <p>, a bibliography arrives as a single text run rather than a list of entries. Split it into an <ol> or per-entry <p> in your editor, and add DOI/URL links where appropriate.
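If the bibliography uses bracketed numeric markers such as [1], [2], the split can be roughed out with a short script before hand-editing. This is a sketch under that assumption only; `references_to_ol` is a hypothetical helper, it expects plain text copied out of the <p> rather than markup, and author-year reference styles will need a different splitting rule.

```python
import re
from html import escape

# Assumption: every entry starts with a bracketed number like "[12] ".
ENTRY_START = re.compile(r"\[\d+\]\s*")


def references_to_ol(text: str) -> str:
    """Split a run-together reference block into an ordered list."""
    # Splitting on the markers leaves an empty first piece, which is filtered out.
    entries = [piece.strip() for piece in ENTRY_START.split(text) if piece.strip()]
    items = "\n".join(f"  <li>{escape(entry, quote=False)}</li>" for entry in entries)
    return f"<ol>\n{items}\n</ol>"
```

The <ol> supplies its own numbering, so the bracketed markers are dropped. Check the result against the PDF, since a bracketed number inside a title would also trigger a split.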
Placeholder title left as 'Converted PDF'
Review: Every output ships <title>Converted PDF</title>. For an article this is the single most important SEO field, so replace it with the real title (and add a canonical tag) before publishing.
Frequently asked questions
Will the abstract, headings and references keep their structure?
No. The converter extracts text without detecting article structure — the abstract, section headings, and reference list all arrive as <p> text under a per-page <h2> label. You promote the title to <h1>, the sections to <h2>/<h3>, and re-format the references in your editor. For a per-page heading marker to start from, see PDF to Markdown.
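Once a page's text has been split so that each heading sits in its own <p>, the promotion step can be partly automated. The following is a rough sketch rather than a feature of the tool: `promote_headings` and the `SECTION_NAMES` list are hypothetical, and the script only matches paragraphs whose entire text is a known section name, optionally preceded by a number such as "1." or "2.1".

```python
import re

# Assumption: a hand-picked list of section names; extend it for your journal.
SECTION_NAMES = {"abstract", "introduction", "methods", "results",
                 "discussion", "conclusion", "references"}

P_RE = re.compile(r"<p>(.*?)</p>", re.DOTALL)
NUMBER_PREFIX = re.compile(r"^\d+(\.\d+)*\.?\s+")  # matches "1. " or "2.1 "


def promote_headings(html_text: str) -> str:
    """Turn <p> elements that contain only a section name into <h2>."""
    def swap(match: "re.Match[str]") -> str:
        text = match.group(1).strip()
        name = NUMBER_PREFIX.sub("", text).lower()
        if name in SECTION_NAMES:
            return f"<h2>{text}</h2>"
        return match.group(0)  # leave ordinary paragraphs untouched

    return P_RE.sub(swap, html_text)
```

Everything is promoted to <h2>; subsections still need demoting to <h3> by hand, and the title still needs its <h1>.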
Are the article's figures included?
No. Figures, charts, and image-based equations are not extracted, and no assets folder is created. Render the relevant pages with PDF to PNG, host the images, and add <img>/<figure> tags during cleanup.
How do I add Open Graph tags for social sharing?
Manually, after conversion. Add og:title, og:description, and og:image tags to the HTML <head>. The converter only writes a placeholder <title>, so you'll be editing the head regardless — add the OG and canonical tags at the same time.
Should I keep the PDF available too?
Yes. Publish the HTML for discovery and mobile reading, and offer a 'Download PDF' link for the typeset version of record. To avoid duplicate-content dilution between the two, add a <link rel="canonical"> pointing at whichever URL should rank.
Why is the two-column text scrambled?
pdf.js returns text in the PDF's internal item order, which for a two-column layout often interleaves the columns line-by-line. There's no automatic de-column option. Reorder the text manually, or crop the PDF to one column at a time with PDF Crop, convert each, then join the bodies.
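For a multi-page paper, joining the two cropped conversions page by page keeps each page's left column ahead of its right column. The sketch below assumes each page in the output is a `<div class="page">...</div>` with no nested <div> inside it, which is inferred from the description of the output in this guide and not verified; `interleave_columns` is a hypothetical helper.

```python
import re
from typing import List

# Assumption: one <div class="page"> per page, with no nested <div> elements.
PAGE_RE = re.compile(r'<div class="page">(.*?)</div>', re.DOTALL)


def pages_of(html_text: str) -> List[str]:
    """Return the inner HTML of every page section, in document order."""
    return [inner.strip() for inner in PAGE_RE.findall(html_text)]


def interleave_columns(left_html: str, right_html: str) -> str:
    """Merge left-column and right-column conversions page by page."""
    left, right = pages_of(left_html), pages_of(right_html)
    if len(left) != len(right):
        raise ValueError("the two conversions have different page counts")
    merged = [
        f'<div class="page">\n{l}\n{r}\n</div>'
        for l, r in zip(left, right)
    ]
    return "\n".join(merged)
```

The result is body content only, so paste it into the <body> of one of the converted files. Any per-page label will appear twice on each merged page (once from each conversion), so delete the duplicate during cleanup.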
Can I convert a scanned old paper?
Not directly — a scan has no text layer, so you'll get empty page sections. Run PDF OCR first to add a searchable text layer, then convert the OCR'd PDF. OCR also helps when a born-digital paper uses broken subset fonts that extract as gibberish.
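A quick check of the downloaded HTML can confirm which pages came out empty before you decide to OCR. This is a sketch under assumptions: it takes each page to be a `<div class="page">` containing an <h2> label plus text, with no nested <div>, and `empty_pages` is a hypothetical helper.

```python
import re
from typing import List

# Assumptions: one <div class="page"> per page, no nested <div>, <h2> used only for the label.
PAGE_RE = re.compile(r'<div class="page">(.*?)</div>', re.DOTALL)
LABEL_RE = re.compile(r"<h2>.*?</h2>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def empty_pages(html_text: str) -> List[int]:
    """Return 1-based numbers of page sections that contain no text besides the label."""
    empty = []
    for number, body in enumerate(PAGE_RE.findall(html_text), start=1):
        # Drop the page label first, then strip any remaining tags.
        text = TAG_RE.sub("", LABEL_RE.sub("", body)).strip()
        if not text:
            empty.append(number)
    return empty
```

If every page is listed, the PDF is almost certainly a scan; if only a few are, those pages may be full-page figures.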
Is my manuscript uploaded anywhere?
No. Conversion runs entirely in your browser via pdf.js; the file never leaves your device. Only an anonymous usage counter is recorded when you're signed in. That makes it safe for embargoed or pre-print drafts.
Are there options for layout or output format?
No. The tool auto-converts on drop with a fixed output structure — no page range, no layout mode, no format toggles. To convert only specific pages, extract them first with PDF Extract Pages.
What's the size limit for an article PDF?
Free tier allows 2 MB and up to 50 pages; Pro raises it to 50 MB / 500 pages. Figure-heavy papers can be large — if you hit the cap, upgrade or convert just the pages you need.
Why does each page show as one big paragraph?
Extraction joins text items with spaces, leaving no blank-line breaks within a page, so the paragraph splitter keeps the whole page in a single <p>. Split it into real paragraphs (and a reference list) in your editor.
Can I get Markdown for a Markdown-based blog?
Yes — PDF to Markdown outputs .md with a ## Page N heading per page and sentence-aware breaks, which pastes more cleanly into platforms like Ghost, Hugo, or Jekyll than raw HTML.
How accurate is the extracted text?
For born-digital, single-column PDFs with standard fonts it's faithful, including accented characters via the UTF-8 charset. Accuracy degrades with two-column layouts (reading order) and subset fonts that lack Unicode maps (garbled characters) — both are flagged in the edge cases above, and OCR is the fallback for the latter.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.