Extract PDF Content into HTML Markup — Free Online

How to extract PDF content as HTML markup

Step 1
Drop the source PDF onto the converter — Add a single PDF to the PDF to HTML converter. It begins extracting immediately — there are no settings to configure first.
Step 2
Extraction runs page by page — pdf.js walks every page and pulls its text items, joining them with spaces. Born-digital PDFs (Word, InDesign, browser exports) have the text layer this needs.
Step 3
Check the preview for extraction quality — The first 5,000 characters of HTML appear in the result panel. Scan it for garbled characters or scrambled column order before you commit to importing — those are the two things worth catching early.
Step 4
Download the HTML — Save the .html file. It contains every page as a <div class="page"> section, ready to slice up.
Step 5
Strip the shell and keep the body — For a CMS, discard the DOCTYPE/head/inline style and keep only the <div class="page"> blocks (or just their <p> contents). Add your own CSS classes and heading tags as you paste.
Step 6
Import into your CMS and add structure — Paste into WordPress, Contentful, Sanity, or Notion. Promote real headings, rebuild any tables, and re-insert images — none of those are reconstructed automatically.

Extraction matrix — what comes across and what doesn't

Plan your post-import cleanup around this. The converter is a text extractor, not a layout engine.

Source element	Extracted to HTML?	Recommended follow-up
Body paragraphs	Yes — as `<p>` (usually one per page)	Split into real paragraphs in your editor
Headings / section titles	No — they become plain `<p>` text	Promote to `<h2>`/`<h3>` manually, or see PDF to Markdown
Tables	No `<table>` — cells flatten into text	Recover structure with PDF Table to JSON
Bulleted / numbered lists	Text only — no `<ul>`/`<ol>`	Rebuild list markup in the CMS editor
Images / diagrams	No — not extracted at all	Export with PDF to PNG, add `<img>` by hand
Bold / italic / colour	No — styling is not preserved	Re-apply inline emphasis in the editor

Output anatomy

Every conversion produces this fixed structure; there are no options to change it.

Element	Value
Doctype + root	`<!DOCTYPE html>` · `<html lang="en">`
Title (placeholder)	`<title>Converted PDF</title>` — rename before use
Styling	Inline `<style>`: sans-serif, `max-width:800px`, `.page` bottom border, `h2` `#333`
Per page	`<div class="page"><h2>Page N</h2> <p>text</p> </div>`
Escaping	`<` → `&lt;`, `>` → `&gt;` (other characters such as `&` are left as-is)
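
As a rough illustration of the structure above, the following Python sketch reads a downloaded file back into per-page paragraph lists using only the standard library. The names `PageCollector` and `read_pages` are hypothetical, and the code assumes the output matches the anatomy table exactly (one `<div class="page">` per page, text inside `<p>`).

```python
from html.parser import HTMLParser


class PageCollector(HTMLParser):
    """Collects the <p> text of each <div class="page"> section."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pages = []      # one list of paragraph strings per page
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "page") in attrs:
            self.pages.append([])
        elif tag == "p" and self.pages:
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.pages[-1].append("".join(self._buf).strip())
            self._in_p = False

    def handle_data(self, data):
        # Text outside <p> (for example the "Page N" label) is ignored.
        if self._in_p:
            self._buf.append(data)


def read_pages(html_text):
    """Return [(page_number, [paragraph, ...]), ...], numbering pages from 1."""
    parser = PageCollector()
    parser.feed(html_text)
    parser.close()
    return list(enumerate(parser.pages, start=1))
```

Because the parser decodes character references, the escaped angle brackets come back as literal `<` and `>` in the returned strings. Page numbers here are derived from position in the file, not read from the `<h2>` label.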

Tier limits on the input PDF

Checked before conversion runs.

Tier	Max file size	Max pages
Free	2 MB	50 pages
Pro	50 MB	500 pages
Pro + Media	500 MB	2,000 pages
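
If you batch-prepare files, a small pre-check against the table can save a failed drop. This is only a sketch: the function name `smallest_tier` is hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption, since the table does not say which convention the converter uses.

```python
# Limits copied from the tier table: (max megabytes, max pages).
TIER_LIMITS = {
    "Free": (2, 50),
    "Pro": (50, 500),
    "Pro + Media": (500, 2000),
}

# Assumption: decimal megabytes. Swap in 1024 * 1024 if the tool counts binary MB.
BYTES_PER_MB = 1_000_000


def smallest_tier(size_bytes, page_count):
    """Return the first tier whose caps fit the file, or None if none do."""
    for tier, (max_mb, max_pages) in TIER_LIMITS.items():
        if size_bytes <= max_mb * BYTES_PER_MB and page_count <= max_pages:
            return tier
    return None
```

The file size can come from `os.path.getsize`; the page count has to come from whatever PDF library you already use.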

Cookbook

Practical extraction-to-publish workflows, showing the exact markup you start from.

Whitepaper text into a WordPress post

Extract the text, keep only the body, and paste into the WordPress block editor — then re-add the headings the converter flattened.

1. Drop whitepaper.pdf  →  whitepaper.html
2. From the download, keep only:
   <div class="page"><h2>Page 1</h2><p>... full page text ...</p></div>
3. In WordPress: paste the <p> text into a Paragraph block.
4. Re-create headings: select the 'Executive Summary' line,
   then convert that block to Heading (H2).

Recovering a table the converter flattened

Tables come across as a run of text inside the page paragraph. Use the sibling tool to get structured rows, then build the <table> separately.

PDF to HTML output (table is just text):
  <p>Plan Price Seats Basic $9 3 Pro $29 10 ...</p>

Better: PDF Table to JSON  →
  [{"Plan":"Basic","Price":"$9","Seats":"3"},
   {"Plan":"Pro","Price":"$29","Seats":"10"}]

Then render that JSON as a real <table> in your template.
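
One possible way to do that last rendering step, sketched in Python with the standard library only. The function name `rows_to_table` is hypothetical, and the code assumes the JSON is a flat list of objects that share the same keys, as in the sample above.

```python
import json
from html import escape


def rows_to_table(rows):
    """Render a list of flat dicts as an HTML <table> string.

    Column order follows the keys of the first row; every value is escaped.
    """
    if not rows:
        return "<table></table>"
    columns = list(rows[0].keys())
    head = "".join(f"<th>{escape(str(c))}</th>" for c in columns)
    body = []
    for row in rows:
        cells = "".join(f"<td>{escape(str(row.get(c, '')))}</td>" for c in columns)
        body.append(f"<tr>{cells}</tr>")
    return (
        "<table>\n"
        f"<thead><tr>{head}</tr></thead>\n"
        "<tbody>\n" + "\n".join(body) + "\n</tbody>\n"
        "</table>"
    )


rows = json.loads(
    '[{"Plan":"Basic","Price":"$9","Seats":"3"},'
    ' {"Plan":"Pro","Price":"$29","Seats":"10"}]'
)
print(rows_to_table(rows))
```

Unlike the converter's own output, this escapes `&` as well as `<` and `>`, so the table is safe to drop into a strict template.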

Cleaning the markup for a headless CMS

Headless CMSes (Contentful, Sanity) want clean fragments, not a full document. Strip the shell programmatically.

Keep only the page divs (pseudo-cleanup):
  - delete everything up to and including <body>
  - delete </body></html>
  - optionally remove the <h2>Page N</h2> labels

Result: a sequence of <div class="page"><p>...</p></div>
fragments you can map to CMS content blocks.
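
The cleanup above could be scripted roughly as follows. This is a sketch, not the tool's own code: `page_fragments` is a hypothetical name, and the regular expressions assume the fixed output structure (no nested `<div>` inside a page, labels of the exact form `<h2>Page N</h2>`).

```python
import re

# Non-greedy match is safe only because page text has "<" escaped,
# so a literal "</div>" cannot appear inside a page.
PAGE_DIV = re.compile(r'<div class="page">.*?</div>', re.DOTALL)
PAGE_LABEL = re.compile(r"<h2>Page \d+</h2>\s*")


def page_fragments(document, drop_labels=True):
    """Return the <div class="page"> blocks, without the document shell."""
    fragments = PAGE_DIV.findall(document)
    if drop_labels:
        fragments = [PAGE_LABEL.sub("", f) for f in fragments]
    return fragments
```

Picking out the page blocks, instead of deleting the shell around them, means the DOCTYPE, head, and inline style never need to be matched at all.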

Catching a garbled-font extraction early

Some PDFs embed subset fonts with no Unicode map; the preview surfaces the problem before you import bad text.

Preview shows:
  <p>\ue001\ue0a2 ... unreadable glyphs ...</p>

This PDF's text layer is unrecoverable as text.
Fix: run PDF OCR on the PDF first, then re-convert
(OCR reads the rendered glyphs, not the broken map).
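
If you process many files, the same check can be approximated in code. The sketch below counts Private Use Area code points (U+E000 to U+F8FF, the range the sample glyphs above fall in) plus the replacement character U+FFFD. The function names and the 0.3 threshold are assumptions for illustration, and the heuristic will miss broken fonts that map to ordinary but wrong letters.

```python
def private_use_ratio(text):
    """Share of non-space characters that are Private Use Area code points
    (U+E000 to U+F8FF) or the replacement character U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = sum(
        1 for c in chars if "\ue000" <= c <= "\uf8ff" or c == "\ufffd"
    )
    return suspicious / len(chars)


def looks_garbled(text, threshold=0.3):
    """Heuristic flag; the default threshold is an arbitrary starting point."""
    return private_use_ratio(text) >= threshold
```

Run it on the extracted paragraph text, not on the raw markup, so tag characters do not dilute the ratio.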

Two-column report — fixing reading order

pdf.js returns items in internal order, which can interleave columns. Spot it in the preview and reorder after import.

Output paragraph (interleaved across two columns):
  <p>Left col line 1 Right col line 1 Left col line 2 ...</p>

There is no de-column option. Manually split the text
into the correct reading order in your editor, or
convert one column at a time by cropping the PDF first
with PDF Crop.

Edge cases and what actually happens

Tables are not rebuilt as HTML tables

By design

Tabular content flattens into the page's paragraph text — there is no <table> detection. To recover rows and columns, use PDF Table to JSON and render the structured data as a table in your own template.

Lists lose their bullet/number markup

By design

Bulleted and numbered lists extract as plain text inside <p>, not as <ul>/<ol>. Rebuild the list markup in your CMS editor after import.

Document headings become body text

By design

The converter does not infer heading levels — it only labels each page with <h2>Page N</h2>. Your section titles arrive as <p> text. Promote them by hand, or use PDF to Markdown for a per-page ## marker to post-process.

No images come across

By design

Image extraction is out of scope — no assets folder is written and figures simply vanish from the markup. Export them with PDF to PNG and add <img> tags during cleanup.

Scanned PDF yields empty paragraphs

Empty output

An image-only PDF has no text layer, so the page sections come out with no <p> content. Add a text layer with PDF OCR first, then re-convert.
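
A quick way to spot this in a downloaded file, sketched in Python. `empty_pages` is a hypothetical helper; it assumes the fixed `<div class="page">` structure and treats a page as empty whether the `<p>` is blank or missing entirely, since the exact shape of an empty section is not specified here.

```python
import re

PAGE_DIV = re.compile(r'<div class="page">(.*?)</div>', re.DOTALL)
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def empty_pages(document):
    """Return 1-based positions of page sections whose <p> text is missing or blank."""
    empty = []
    for number, match in enumerate(PAGE_DIV.finditer(document), start=1):
        paragraphs = PARAGRAPH.findall(match.group(1))
        if not any(p.strip() for p in paragraphs):
            empty.append(number)
    return empty
```

If every page is listed, the whole PDF is probably scanned; if only a few are, those pages may be full-page figures.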

Custom-encoded font produces unreadable text

Garbled

If a font is subset-embedded without a ToUnicode map, the extracted characters are wrong even though the PDF displays fine. The HTML text is unusable. OCR the PDF with PDF OCR to read the rendered glyphs instead.

Multi-column layout interleaves text

May interleave

Two-column documents can come out with columns zig-zagged together because pdf.js follows the PDF's internal item order. All the text is present but the reading order needs manual fixing after import.

Encrypted PDF can't be read

Fails to open

pdf.js cannot open a password-protected PDF, so extraction fails before it starts. Remove the password with PDF Unlock (you must know it), then convert.

Free-tier size or page cap exceeded

Blocked

Free tier allows 2 MB and 50 pages. A long content asset will exceed that. Upgrade to Pro, or extract the relevant section with PDF Extract Pages before converting.

Ampersand in source text left unescaped

Review

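Only < and > are escaped. A literal & (e.g. 'R&D') passes through as-is. Browsers usually tolerate this, but XML-based parsers reject it, and an & followed by letters that spell an entity name (e.g. '&copy') can be decoded into the wrong character. Run the markup through a formatter/linter, or find-and-replace stray & with &amp;, before importing to a strict CMS.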
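
The find-replace can be sketched as a regular expression that skips anything already shaped like a character reference. `escape_bare_ampersands` is a hypothetical name. One limitation to be aware of: source text that literally contained something like "&lt;" or "AT&T;" is indistinguishable from a real reference and is left alone.

```python
import re

# An "&" that does not already start a character reference such as
# &lt; (named), &#38; (decimal) or &#x26; (hexadecimal).
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(markup):
    """Replace stray '&' with '&amp;' while leaving existing references alone."""
    return BARE_AMPERSAND.sub("&amp;", markup)
```

The function is idempotent, so running it twice over the same markup does not double-escape anything.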

Frequently asked questions

Will the heading hierarchy come out correctly?

No — the converter does not map font sizes to heading levels. The only heading it emits is a <h2>Page N</h2> label per page; your document's titles arrive as plain <p> text. You promote them to <h2>/<h3> during cleanup. If you want at least a per-page marker, PDF to Markdown writes a ## Page N heading you can build on.

Do PDF tables convert to HTML tables?

No. Table cells flatten into the running text of the page's paragraph — there is no <table> reconstruction. To recover the row-and-column structure, use PDF Table to JSON and render that JSON as a real table in your template or component.

Can I use the HTML output directly in a React component?

You'll want to clean it first. The download is a full document; for React, strip the head/shell and keep the <div class="page"> sections (or just the <p> text). Because only < and > are escaped in the output, sanitise with DOMPurify before rendering via dangerouslySetInnerHTML.

Are there any options for how content is extracted?

No. The tool has no settings — it extracts all text from every page and emits a fixed HTML structure the moment you drop a file. To extract only part of a document, run PDF Extract Pages first and convert the result.

Why does an entire page show up as one paragraph?

Text items are joined with spaces during extraction, so there are no double line-breaks inside a page, and the paragraph splitter (which looks for blank lines) keeps the whole page in one <p>. That's expected — split it into real paragraphs in your editor.

What happens to images embedded in the PDF?

They are not extracted. This is a text-only converter — no image files, no assets folder. To bring images back, render the relevant pages with PDF to PNG or PDF to JPG, host them, and add <img> tags during cleanup.

My extracted text is gibberish — what went wrong?

The PDF almost certainly uses a subset font with no Unicode mapping, so the underlying character codes don't map to readable letters. The page looks fine but the text layer is broken. Run PDF OCR on the PDF to read the rendered glyphs, then convert the OCR'd file.

Is my content uploaded to a server?

No. Everything runs in your browser via pdf.js — the PDF never leaves your device. Only an anonymous usage count is recorded when you're signed in, and you can opt out in account settings. This matters for gated or pre-launch content you don't want exposed.

How big a PDF can I extract from?

Free tier allows 2 MB and up to 50 pages; Pro raises it to 50 MB / 500 pages, and Pro + Media to 500 MB / 2,000 pages. For larger documents, split with PDF Split or extract the part you need first.

Can I get Markdown instead of HTML?

Yes — PDF to Markdown outputs .md with a ## Page N heading per page and sentence-aware line breaks, which is often cleaner to paste into Markdown-based CMSes or static-site generators than raw HTML.

Does the export preserve bold and italics?

No. Inline styling (bold, italic, colour, font) is not preserved — only the plain characters come across. Re-apply emphasis in your editor where it matters.

How is this different from PDF to Text?

PDF to Text gives you a .txt file with pages separated by blank lines and no markup. This tool wraps the same extracted text in HTML — page <div>s, <p> tags, and an inline stylesheet — so it's ready to view in a browser or paste into a web editor.

Privacy first

All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.
