Convert a PDF Article to HTML for Web Publishing

How to convert a PDF article to HTML for web publishing

Step 1
Drop the article PDF onto the converter — Add the article to the PDF to HTML converter. Conversion starts automatically — there's nothing to configure.
Step 2
pdf.js extracts the text — Every page's text is pulled and wrapped in a <div class="page"> section. A typeset PDF from a journal or layout program carries the text layer this needs.
Step 3
Review the preview for column order — Academic PDFs are often two-column. Check the first 5,000 characters in the preview for interleaved lines so you know how much reading-order cleanup to budget for.
Step 4
Download the HTML — Save the .html file. It contains the whole article, page by page, in a single document.
Step 5
Add the article metadata and structure — Replace the placeholder <title>Converted PDF</title> with the real title, add author/date and Open Graph meta tags, and promote the abstract and section headings to <h2>/<h3>.
Step 6
Re-insert figures, then publish with a 'Download PDF' link — Export figures with a sibling tool and add <img> tags, then publish — keeping the original PDF available for readers who want the typeset version of record.

Academic article elements — what survives conversion

The converter extracts text only. Everything structural or visual needs manual restoration.

Article element	In the HTML?	How to restore it
Title / author / abstract text	Yes — as `<p>` under a page label	Promote the title to `<h1>`, mark up the abstract
Section headings (Methods, Results…)	No — flattened to `<p>`	Promote to `<h2>`/`<h3>` by hand
Figures, charts, equation images	No — not extracted	Export with PDF to PNG, add `<img>`
Data tables	No `<table>` — text flattens	Use PDF Table to JSON
References / bibliography	Yes — as `<p>` text	Re-format as a list; add DOI links manually
Footnotes / margin notes	Text only, position lost	Re-anchor as inline notes in your editor

Metadata to add after conversion

The converter writes only a placeholder title. These are the tags an article page needs for SEO and sharing.

Tag	Why it matters	Example
`<title>`	Replaces the placeholder; shown in search results & tabs	`<title>Effects of X on Y — J. Smith (2026)</title>`
`<meta name="description">`	Search snippet; often the abstract's first lines	`content="We measured ..."`
`og:title` / `og:description`	Social-share card text	`<meta property="og:title" content="...">`
`<link rel="canonical">`	Points to the authoritative URL (avoids duplicate-content issues vs the PDF)	`href="https://site.com/articles/x-on-y"`
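The head edits in the table above can be scripted when you convert articles regularly. The following is a minimal sketch, assuming the converter's output contains the literal placeholder `<title>Converted PDF</title>` exactly once. The function name `add_article_head` and its parameters are hypothetical, not part of the converter.

```python
from html import escape

# Placeholder written by the converter, as described in this guide.
PLACEHOLDER = "<title>Converted PDF</title>"


def add_article_head(doc: str, title: str, description: str, canonical_url: str) -> str:
    """Swap the placeholder title for real title, description, OG and canonical tags."""
    if PLACEHOLDER not in doc:
        raise ValueError("placeholder <title> not found; was the head already edited?")
    tags = "\n".join([
        f"<title>{escape(title)}</title>",
        f'<meta name="description" content="{escape(description, quote=True)}">',
        f'<meta property="og:title" content="{escape(title, quote=True)}">',
        f'<meta property="og:description" content="{escape(description, quote=True)}">',
        f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">',
    ])
    # Replace only the first occurrence so body text is never touched.
    return doc.replace(PLACEHOLDER, tags, 1)
```

Escaping the values keeps an ampersand or quote in a title or abstract from breaking the attribute. An `og:image` tag can be added the same way once the image is hosted.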

Input limits by tier

Both limits are checked against the PDF before conversion starts.

Tier	Max file size	Max pages
Free	2 MB	50 pages
Pro	50 MB	500 pages
Pro + Media	500 MB	2,000 pages
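To see which tier a given PDF needs before you drop it, the table can be expressed as a small lookup. This is a sketch under one stated assumption: that "MB" means 1024 × 1024 bytes, which the page does not specify. The names `TIERS` and `smallest_tier` are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: the limits count binary megabytes


@dataclass(frozen=True)
class Tier:
    max_bytes: int
    max_pages: int


# Values copied from the table above, ordered from smallest to largest tier.
TIERS = {
    "Free": Tier(2 * MB, 50),
    "Pro": Tier(50 * MB, 500),
    "Pro + Media": Tier(500 * MB, 2000),
}


def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier whose size and page limits both fit, or None."""
    for name, tier in TIERS.items():
        if size_bytes <= tier.max_bytes and pages <= tier.max_pages:
            return name
    return None
```

A PDF must satisfy both limits, so a small file with many pages can still need a higher tier.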

Cookbook

Workflows for turning article PDFs into publishable web pages, with the markup you start from.

Journal paper to an article page

Convert, then add the metadata and promote the abstract and section headings the converter left as plain text.

Input:  smith-2026.pdf  (8 pages, two-column)

After conversion + manual edits:
  <title>Effects of X on Y &mdash; Smith et al. (2026)</title>
  <h1>Effects of X on Y</h1>
  <p class="authors">J. Smith, A. Lee</p>
  <h2>Abstract</h2>
  <p>... abstract text extracted from page 1 ...</p>
  <h2>1. Introduction</h2>
  <p>... section text split out of the flattened page paragraph ...</p>
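Because headings arrive inside the flattened page text, one way to speed up the manual promotion is to split the text at heading strings you already know. The sketch below assumes you supply the exact heading strings yourself and that they do not also occur in running text. `split_sections` is a hypothetical helper, not a converter feature.

```python
import re
from html import escape
from typing import List


def split_sections(text: str, headings: List[str]) -> str:
    """Split a flattened text run at known headings and emit <h2>/<p> markup."""
    if not headings:
        return f"<p>{escape(text.strip())}</p>"
    # Longest first, so "Results and Discussion" wins over "Results".
    ordered = sorted(headings, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(h) for h in ordered) + ")"
    out = []
    for part in re.split(pattern, text):
        chunk = part.strip()
        if not chunk:
            continue
        tag = "h2" if chunk in headings else "p"
        out.append(f"<{tag}>{escape(chunk)}</{tag}>")
    return "\n".join(out)
```

Numbered headings such as "1. Introduction" are safer inputs than bare words like "Results", which often appear mid-sentence. Review the output before publishing.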

Adding a figure the converter skipped

Figures are never extracted. Render the figure's page region to PNG, host it, and place the <img> with a caption.

Step 1: PDF to PNG on page 4  ->  figure-2.png
Step 2: host at /articles/x-on-y/figure-2.png
Step 3: insert into the article HTML:
  <figure>
    <img src="/articles/x-on-y/figure-2.png" alt="Mean response vs dose">
    <figcaption>Figure 2. Dose-response curve.</figcaption>
  </figure>
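If an article has many figures, the markup in Step 3 can be generated instead of typed. This is a small sketch: `figure_html` is a hypothetical helper, and the path in the usage note is the example hosting location from Step 2.

```python
from html import escape


def figure_html(src: str, alt: str, caption: str) -> str:
    """Build a <figure> block with escaped attribute and caption text."""
    return (
        "<figure>\n"
        f'  <img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">\n'
        f"  <figcaption>{escape(caption)}</figcaption>\n"
        "</figure>"
    )
```

For example, `figure_html("/articles/x-on-y/figure-2.png", "Mean response vs dose", "Figure 2. Dose-response curve.")` produces the block shown in Step 3 with a root-relative image path.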

Fixing two-column reading order

A two-column PDF can interleave columns in the extracted text. There is no de-column option; reorder in the editor or crop one column at a time first.

Interleaved output:
  <p>... col1-line1 col2-line1 col1-line2 col2-line2 ...</p>

Option A: manually reorder the text in your editor.
Option B: crop the PDF to the left column with PDF Crop,
          convert, repeat for the right column, then
          merge the two HTML bodies page by page
          (left column, then right column, for each page).
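For a multi-page paper, joining the two outputs end to end would put every left column ahead of every right column, so Option B usually needs a page-by-page merge. The sketch below assumes each page is a `<div class="page">` section with no nested `<div>` inside it, as described in Step 2. `interleave_columns` is a hypothetical helper.

```python
import re
from itertools import zip_longest

# Assumes page sections contain no nested <div>, so the first </div> closes the page.
PAGE_RE = re.compile(r'<div class="page">.*?</div>', re.DOTALL)


def interleave_columns(left_html: str, right_html: str) -> str:
    """Merge two per-column conversions: left 1, right 1, left 2, right 2, ..."""
    left_pages = PAGE_RE.findall(left_html)
    right_pages = PAGE_RE.findall(right_html)
    merged = []
    for left, right in zip_longest(left_pages, right_pages):
        if left is not None:
            merged.append(left)
        if right is not None:
            merged.append(right)
    return "\n".join(merged)
```

The result is a run of page sections to paste into a single `<body>`. Pages that are single-column, such as a title page with a full-width abstract, still need a manual check.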

Avoiding duplicate content with a canonical tag

Publishing both the PDF and the HTML can split ranking signals. Point a canonical tag at whichever URL is authoritative.

In the HTML <head>:
  <link rel="canonical"
        href="https://site.com/articles/x-on-y">

Keep a download link in the body:
  <a href="/pdf/smith-2026.pdf">Download the PDF (version of record)</a>

Newsletter PDF to a blog post

Less structured than a journal paper — usually just promote a few headings and re-add the hero image.

Input: newsletter-may.pdf (4 pages)
Convert  ->  newsletter-may.html

Cleanup:
  - replace the placeholder <title> text with the issue title
  - promote 'In this issue' to <h2>
  - re-add masthead image via <img>
  - paste body <p> blocks into the blog editor

Edge cases and what actually happens

Abstract and section headings flatten to body text

By design

The converter does not detect article structure: 'Abstract', 'Methods', and 'Results' all arrive as <p> text under an <h2>Page N</h2> label. Promote them by hand, or use PDF to Markdown to get a per-page heading marker you can post-process.

Figures and equation images dropped

By design

No images are extracted and no assets folder is created — figures, charts, and image-rendered equations simply disappear. Export them with PDF to PNG and re-insert as <img>/<figure> during cleanup.

Two-column layout interleaves columns

May interleave

Many academic PDFs are two-column, and pdf.js returns text in the PDF's internal item order, which can zig-zag across columns. The text is complete but the reading order is scrambled. Reorder manually, or crop one column at a time with PDF Crop before converting.

Scanned / photocopied paper has no text

Empty output

Older papers distributed as scans have no embedded text layer, so the page sections come out empty. Run PDF OCR first to create a text layer, then convert.
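A quick way to spot a scan is to check the downloaded HTML for page sections that hold nothing but the page label. This sketch assumes the output structure described in this guide (a `<div class="page">` per page with an `<h2>Page N</h2>` label and no nested `<div>`). `empty_pages` is a hypothetical helper.

```python
import re
from typing import List

PAGE_RE = re.compile(r'<div class="page">(.*?)</div>', re.DOTALL)
LABEL_RE = re.compile(r"<h2>.*?</h2>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def empty_pages(doc: str) -> List[int]:
    """Return 1-based positions of page sections with no text besides the label."""
    empty = []
    for position, body in enumerate(PAGE_RE.findall(doc), start=1):
        # Drop the page label, then any remaining tags, and see what text is left.
        text = TAG_RE.sub("", LABEL_RE.sub("", body)).strip()
        if not text:
            empty.append(position)
    return empty
```

If every page is listed, the PDF almost certainly has no text layer and needs PDF OCR first. A few empty pages in an otherwise full document may just be full-page figures.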

Math fonts produce garbled symbols

Garbled

Equation typesetting often uses subset fonts without Unicode maps, so extracted math comes out as gibberish. Re-render those equations as images via PDF to PNG and place them as <img>, or OCR the page if it's text-heavy.
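One rough way to find pages worth re-rendering is to measure how much of the extracted text falls outside normal Unicode. This is a heuristic sketch only: it counts private-use, unassigned, and control characters plus the U+FFFD replacement character, and it will miss gibberish that happens to map to ordinary letters. `suspicious_ratio` is a hypothetical name.

```python
import unicodedata

# Co = private use, Cn = unassigned, Cc = control characters.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc"}


def suspicious_ratio(text: str) -> float:
    """Fraction of non-space characters that look like unmapped glyphs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)
```

A ratio well above zero on a page with equations suggests the math did not extract cleanly. Any threshold you pick is a judgment call and should be checked against a few known-good pages.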

Embargoed / pre-print manuscript

Preserved

Conversion runs entirely in the browser via pdf.js, so the manuscript never reaches a server and converting an unpublished or embargoed draft is safe. Only an anonymous usage counter is recorded when you are signed in.

Password-protected publisher PDF

Fails to open

An encrypted PDF can't be opened by pdf.js and conversion fails. If you legitimately have the password, remove it with PDF Unlock first, then convert.

Article PDF over the free-tier limits

Blocked

The Free tier allows 2 MB and 50 pages, and a long, figure-heavy paper can exceed the size cap. Upgrade to Pro (50 MB / 500 pages), or extract only the pages you need with PDF Extract Pages and convert those.

Reference list runs together as one block

Review

Because each page becomes one <p>, a bibliography arrives as a single text run rather than a list of entries. Split it into an <ol> or per-entry <p> in your editor, and add DOI/URL links where appropriate.
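When the bibliography uses bracketed numbers, the split can be partly automated. The sketch below assumes entries are introduced by markers like `[1]`, `[2]` and that no entry contains such a marker itself. Author-year styles need a different rule. `references_to_ol` is a hypothetical helper.

```python
import re
from html import escape


def references_to_ol(run: str) -> str:
    """Split a reference run on [n] markers and wrap the entries in an <ol>."""
    entries = [e.strip() for e in re.split(r"\[\d+\]", run) if e.strip()]
    items = "\n".join(f"  <li>{escape(e)}</li>" for e in entries)
    return f"<ol>\n{items}\n</ol>"
```

The `<ol>` supplies its own numbering, so the original markers are dropped. DOI and URL links still have to be added per entry afterwards.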

Placeholder title left as 'Converted PDF'

Review

Every output ships with <title>Converted PDF</title>. For an article the title is one of the most important SEO fields, so replace it with the real title (and add a canonical tag) before publishing.

Frequently asked questions

Will the abstract, headings and references keep their structure?

No. The converter extracts text without detecting article structure — the abstract, section headings, and reference list all arrive as <p> text under a per-page <h2> label. You promote the title to <h1>, the sections to <h2>/<h3>, and re-format the references in your editor. For a per-page heading marker to start from, see PDF to Markdown.

Are the article's figures included?

No. Figures, charts, and image-based equations are not extracted, and no assets folder is created. Render the relevant pages with PDF to PNG, host the images, and add <img>/<figure> tags during cleanup.

How do I add Open Graph tags for social sharing?

Manually, after conversion. Add og:title, og:description, and og:image tags to the HTML <head>. The converter only writes a placeholder <title>, so you'll be editing the head regardless — add the OG and canonical tags at the same time.

Should I keep the PDF available too?

Yes. Publish the HTML for discovery and mobile reading, and offer a 'Download PDF' link for the typeset version of record. To avoid duplicate-content dilution between the two, add a <link rel="canonical"> pointing at whichever URL should rank.

Why is the two-column text scrambled?

pdf.js returns text in the PDF's internal item order, which for a two-column layout often interleaves the columns line by line. There's no automatic de-column option. Reorder the text manually, or crop the PDF to one column at a time with PDF Crop, convert each, then merge the two outputs page by page.

Can I convert a scanned old paper?

Not directly — a scan has no text layer, so you'll get empty page sections. Run PDF OCR first to add a searchable text layer, then convert the OCR'd PDF. OCR also helps when a born-digital paper uses broken subset fonts that extract as gibberish.

Is my manuscript uploaded anywhere?

No. Conversion runs entirely in your browser via pdf.js; the file never leaves your device. Only an anonymous usage counter is recorded when you're signed in. That makes it safe for embargoed or pre-print drafts.

Are there options for layout or output format?

No. The tool auto-converts on drop with a fixed output structure — no page range, no layout mode, no format toggles. To convert only specific pages, extract them first with PDF Extract Pages.

What's the size limit for an article PDF?

Free tier allows 2 MB and up to 50 pages; Pro raises it to 50 MB / 500 pages. Figure-heavy papers can be large — if you hit the cap, upgrade or convert just the pages you need.

Why does each page show as one big paragraph?

Extraction joins text items with spaces, leaving no blank-line breaks within a page, so the paragraph splitter keeps the whole page in a single <p>. Split it into real paragraphs (and a reference list) in your editor.

Can I get Markdown for a Markdown-based blog?

Yes — PDF to Markdown outputs .md with a ## Page N heading per page and sentence-aware breaks, which pastes more cleanly into platforms like Ghost, Hugo, or Jekyll than raw HTML.

How accurate is the extracted text?

For born-digital, single-column PDFs with standard fonts it's faithful, including accented characters via the UTF-8 charset. Accuracy degrades with two-column layouts (reading order) and subset fonts that lack Unicode maps (garbled characters) — both are flagged in the edge cases above, and OCR is the fallback for the latter.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.