How to extract PDF content as HTML markup
- Step 1: Drop the source PDF onto the converter. Add a single PDF to the PDF to HTML converter. Extraction begins immediately; there are no settings to configure first.
- Step 2: Extraction runs page by page. pdf.js walks every page and pulls its text items, joining them with spaces. Born-digital PDFs (Word, InDesign, browser exports) have the text layer this needs.
- Step 3: Check the preview for extraction quality. The first 5,000 characters of HTML appear in the result panel. Scan it for garbled characters or scrambled column order before you commit to importing; those are the two things worth catching early.
- Step 4: Download the HTML. Save the .html file. It contains every page as a <div class="page"> section, ready to slice up.
- Step 5: Strip the shell and keep the body. For a CMS, discard the DOCTYPE/head/inline style and keep only the <div class="page"> blocks (or just their <p> contents). Add your own CSS classes and heading tags as you paste.
- Step 6: Import into your CMS and add structure. Paste into WordPress, Contentful, Sanity, or Notion. Promote real headings, rebuild any tables, and re-insert images; none of those are reconstructed automatically.
Extraction matrix — what comes across and what doesn't
Plan your post-import cleanup around this. The converter is a text extractor, not a layout engine.
| Source element | Extracted to HTML? | Recommended follow-up |
|---|---|---|
| Body paragraphs | Yes — as <p> (usually one per page) | Split into real paragraphs in your editor |
| Headings / section titles | No — they become plain <p> text | Promote to <h2>/<h3> manually, or see PDF to Markdown |
| Tables | No <table> — cells flatten into text | Recover structure with PDF Table to JSON |
| Bulleted / numbered lists | Text only — no <ul>/<ol> | Rebuild list markup in the CMS editor |
| Images / diagrams | No — not extracted at all | Export with PDF to PNG, add <img> by hand |
| Bold / italic / colour | No — styling is not preserved | Re-apply inline emphasis in the editor |
Output anatomy
Every conversion produces this fixed structure; there are no options to change it.
| Element | Value |
|---|---|
| Doctype + root | <!DOCTYPE html> · <html lang="en"> |
| Title (placeholder) | <title>Converted PDF</title> — rename before use |
| Styling | Inline <style>: sans-serif, max-width:800px, .page bottom border, h2 #333 |
| Per page | <div class="page"><h2>Page N</h2> <p>text</p> </div> |
| Escaping | < → &lt;, > → &gt; (other characters such as & are left as-is) |
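Because the structure is fixed, a small script can turn the download into per-page records. The following is a minimal sketch, not part of the tool: it assumes the per-page markup matches the row above exactly, and the names `parse_pages`, `PAGE_RE`, and `PARA_RE` are hypothetical.

```python
import re

# Pattern follows the per-page structure documented in the table:
# <div class="page"><h2>Page N</h2> <p>text</p> </div>
PAGE_RE = re.compile(
    r'<div class="page">\s*<h2>Page (\d+)</h2>(.*?)</div>',
    re.DOTALL,
)
PARA_RE = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def parse_pages(document: str) -> list:
    """Return (page number, [paragraph text, ...]) for each page section."""
    pages = []
    for number, inner in PAGE_RE.findall(document):
        paragraphs = [
            # Undo only the two escapes the converter is documented to apply.
            p.replace("&lt;", "<").replace("&gt;", ">").strip()
            for p in PARA_RE.findall(inner)
        ]
        pages.append((int(number), paragraphs))
    return pages


sample = (
    '<div class="page"><h2>Page 1</h2> <p>Intro text with a &lt;b&gt; tag.</p> </div>'
    '<div class="page"><h2>Page 2</h2> <p>Second page.</p> </div>'
)
for number, paragraphs in parse_pages(sample):
    print(number, paragraphs)
```

A regular expression is enough here only because the page text has its angle brackets escaped, so a page section cannot contain a nested `</div>`. For markup you have already edited by hand, a real HTML parser is the safer choice.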
Tier limits on the input PDF
Checked before conversion runs.
| Tier | Max file size | Max pages |
|---|---|---|
| Free | 2 MB | 50 pages |
| Pro | 50 MB | 500 pages |
| Pro + Media | 500 MB | 2,000 pages |
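If you batch-prepare files, the limits in the table can be checked before you open the converter. This is a rough sketch under two assumptions the page does not confirm: that 1 MB means 2^20 bytes and that the limits are inclusive. The helper name `smallest_tier` is hypothetical, and you need to supply the page count yourself.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: the page does not say whether MB is 10^6 or 2^20 bytes


@dataclass(frozen=True)
class Tier:
    name: str
    max_bytes: int
    max_pages: int


# Limits copied from the table, smallest tier first.
TIERS = [
    Tier("Free", 2 * MB, 50),
    Tier("Pro", 50 * MB, 500),
    Tier("Pro + Media", 500 * MB, 2000),
]


def smallest_tier(size_bytes: int, page_count: int) -> Optional[str]:
    """Return the first tier whose limits fit the file, or None if none do."""
    for tier in TIERS:
        # Assumption: a file exactly at the limit is still accepted.
        if size_bytes <= tier.max_bytes and page_count <= tier.max_pages:
            return tier.name
    return None


print(smallest_tier(1 * MB, 30))    # Free
print(smallest_tier(1 * MB, 120))   # Pro
print(smallest_tier(600 * MB, 10))  # None
```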
Cookbook
Practical extraction-to-publish workflows, showing the exact markup you start from.
Whitepaper text into a WordPress post
Extract the text, keep only the body, and paste into the WordPress block editor — then re-add the headings the converter flattened.
1. Drop whitepaper.pdf → whitepaper.html
2. From the download, keep only: <div class="page"><h2>Page 1</h2><p>... full page text ...</p></div>
3. In WordPress: paste the <p> text into a Paragraph block.
4. Re-create headings: select the 'Executive Summary' line and convert that block to Heading (H2).
Recovering a table the converter flattened
Tables come across as a run of text inside the page paragraph. Use the sibling tool to get structured rows, then build the <table> separately.
PDF to HTML output (table is just text):
<p>Plan Price Seats Basic $9 3 Pro $29 10 ...</p>
Better: PDF Table to JSON →
[{"Plan":"Basic","Price":"$9","Seats":"3"},
{"Plan":"Pro","Price":"$29","Seats":"10"}]
Then render that JSON as a real <table> in your template.
Cleaning the markup for a headless CMS
Headless CMSes (Contentful, Sanity) want clean fragments, not a full document. Strip the shell programmatically.
Keep only the page divs (pseudo-cleanup):
- delete everything up to and including <body>
- delete </body></html>
- optionally remove the <h2>Page N</h2> labels
Result: a sequence of <div class="page"><p>...</p></div> fragments you can map to CMS content blocks.
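The pseudo-cleanup above could be scripted roughly as follows. This is a sketch that assumes the download has a single <body> element and the page labels look exactly like <h2>Page N</h2>; `strip_shell` is a hypothetical helper name, not something the tool provides.

```python
import re


def strip_shell(document: str, drop_page_labels: bool = True) -> str:
    """Return only the markup between <body> and </body>."""
    match = re.search(r"<body[^>]*>(.*?)</body>", document, re.DOTALL | re.IGNORECASE)
    # Fall back to the input unchanged if no <body> wrapper is found.
    body = match.group(1) if match else document
    if drop_page_labels:
        # Remove the generated "Page N" labels, which are not document content.
        body = re.sub(r"<h2>Page \d+</h2>\s*", "", body)
    return body.strip()


sample = (
    '<!DOCTYPE html><html lang="en"><head><title>Converted PDF</title>'
    "<style>.page{border-bottom:1px solid #ccc}</style></head><body>"
    '<div class="page"><h2>Page 1</h2> <p>First page text.</p> </div>'
    "</body></html>"
)
print(strip_shell(sample))
# <div class="page"><p>First page text.</p> </div>
```

Pass `drop_page_labels=False` if you want to keep the labels as anchors while you rebuild the real headings.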
Catching a garbled-font extraction early
Some PDFs embed subset fonts with no Unicode map; the preview surfaces the problem before you import bad text.
Preview shows:
<p>\ue001\ue0a2 ... unreadable glyphs ...</p>
This PDF's text layer is unrecoverable as text.
Fix: run PDF OCR on the PDF first, then re-convert (OCR reads the rendered glyphs, not the broken map).
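When you process many downloads, the same check can be approximated in code. The sketch below flags output whose characters fall largely in the Unicode Private Use Area (U+E000 to U+F8FF), as in the preview sample above, or are the replacement character U+FFFD. It is a heuristic with a made-up threshold, and it will not catch broken fonts that map to wrong but ordinary letters.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def private_use_ratio(markup: str) -> float:
    """Share of non-space characters that are Private Use Area or U+FFFD."""
    text = TAG_RE.sub("", markup)
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = sum(1 for c in chars if "\ue000" <= c <= "\uf8ff" or c == "\ufffd")
    return suspicious / len(chars)


def looks_garbled(markup: str, threshold: float = 0.2) -> bool:
    # The threshold is an arbitrary illustrative cut-off, not a value from the tool.
    return private_use_ratio(markup) >= threshold


print(looks_garbled("<p>\ue001\ue0a2 \ue010\ue011</p>"))    # True
print(looks_garbled("<p>Quarterly revenue grew 12%.</p>"))  # False
```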
Two-column report — fixing reading order
pdf.js returns items in internal order, which can interleave columns. Spot it in the preview and reorder after import.
Output paragraph (interleaved across two columns):
<p>Left col line 1 Right col line 1 Left col line 2 ...</p>
There is no de-column option. Manually split the text into the correct reading order in your editor, or convert one column at a time by cropping the PDF first with PDF Crop.
Edge cases and what actually happens
Tables are not rebuilt as HTML tables
By design: Tabular content flattens into the page's paragraph text; there is no <table> detection. To recover rows and columns, use PDF Table to JSON and render the structured data as a table in your own template.
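Rendering the structured rows could look like the sketch below. It assumes the JSON is a list of flat objects that share the keys of the first row, which is the shape shown in the cookbook example; `rows_to_table` is a hypothetical helper, and every cell is escaped because the values come from extracted text.

```python
import html
import json


def rows_to_table(rows: list) -> str:
    """Render a list of flat row objects as an HTML table, escaping every cell."""
    if not rows:
        return "<table></table>"
    # Column order is taken from the first row (an assumption about the JSON).
    columns = list(rows[0].keys())
    head = "".join(f"<th>{html.escape(str(c))}</th>" for c in columns)
    body = []
    for row in rows:
        cells = "".join(f"<td>{html.escape(str(row.get(c, '')))}</td>" for c in columns)
        body.append(f"<tr>{cells}</tr>")
    return f"<table><thead><tr>{head}</tr></thead><tbody>{''.join(body)}</tbody></table>"


rows = json.loads(
    '[{"Plan":"Basic","Price":"$9","Seats":"3"},'
    ' {"Plan":"Pro","Price":"$29","Seats":"10"}]'
)
print(rows_to_table(rows))
```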
Lists lose their bullet/number markup
By design: Bulleted and numbered lists extract as plain text inside <p>, not as <ul>/<ol>. Rebuild the list markup in your CMS editor after import.
Document headings become body text
By design: The converter does not infer heading levels; it only labels each page with <h2>Page N</h2>. Your section titles arrive as <p> text. Promote them by hand, or use PDF to Markdown for a per-page ## marker to post-process.
No images come across
By design: Image extraction is out of scope. No assets folder is written and figures simply vanish from the markup. Export them with PDF to PNG and add <img> tags during cleanup.
Scanned PDF yields empty paragraphs
Empty output: An image-only PDF has no text layer, so the page sections come out with no <p> content. Add a text layer with PDF OCR first, then re-convert.
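For long documents it can help to list which pages came out empty, since a PDF may mix scanned and born-digital pages. A minimal sketch, assuming the fixed <div class="page"> structure described under Output anatomy; `empty_pages` is a hypothetical name, and the check treats both a missing <p> and an empty <p></p> as empty.

```python
import re

PAGE_RE = re.compile(r'<div class="page">\s*<h2>Page (\d+)</h2>(.*?)</div>', re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def empty_pages(document: str) -> list:
    """Return the page numbers whose section contains no visible text."""
    empty = []
    for number, inner in PAGE_RE.findall(document):
        # Strip any remaining tags; if only whitespace is left, the page is empty.
        if not TAG_RE.sub("", inner).strip():
            empty.append(int(number))
    return empty


sample = (
    '<div class="page"><h2>Page 1</h2> <p>Readable text.</p> </div>'
    '<div class="page"><h2>Page 2</h2> <p></p> </div>'
    '<div class="page"><h2>Page 3</h2> </div>'
)
print(empty_pages(sample))  # [2, 3]
```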
Custom-encoded font produces unreadable text
Garbled: If a font is subset-embedded without a ToUnicode map, the extracted characters are wrong even though the PDF displays fine. The HTML text is unusable. OCR the PDF with PDF OCR to read the rendered glyphs instead.
Multi-column layout interleaves text
May interleave: Two-column documents can come out with columns zig-zagged together because pdf.js follows the PDF's internal item order. All the text is present but the reading order needs manual fixing after import.
Encrypted PDF can't be read
Fails to open: pdf.js cannot open a password-protected PDF, so extraction fails before it starts. Remove the password with PDF Unlock (you must know it), then convert.
Free-tier size or page cap exceeded
Blocked: Free tier allows 2 MB and 50 pages. A long content asset can easily exceed that. Upgrade to Pro, or extract the relevant section with PDF Extract Pages before converting.
Ampersand in source text left unescaped
Review: Only < and > are escaped. A literal & (e.g. 'R&D') passes through as-is. Browsers usually tolerate this, but XML/XHTML-based pipelines reject it, and an ampersand that happens to look like an entity will be misread. Run the markup through a formatter/linter, or do a find-replace for stray &, before importing to a strict CMS.
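The find-replace can be done with one regular expression that escapes only ampersands not already starting a character reference. This is a sketch with a hypothetical function name. One limitation to keep in mind: source text that literally contained something shaped like an entity (for example the characters `&copy;`) cannot be told apart from a real reference and is left alone.

```python
import re

# An ampersand that does not already start a named, decimal, or hex
# character reference such as &lt; &#38; or &#x26;
BARE_AMP_RE = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(markup: str) -> str:
    """Replace stray & with &amp; while leaving existing references intact."""
    return BARE_AMP_RE.sub("&amp;", markup)


print(escape_bare_ampersands("<p>R&D spend &lt; plan &amp; AT&T</p>"))
# <p>R&amp;D spend &lt; plan &amp; AT&amp;T</p>
```

The function is idempotent, so running it twice on the same fragment does not double-escape anything.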
Frequently asked questions
Will the heading hierarchy come out correctly?
No — the converter does not map font sizes to heading levels. The only heading it emits is a <h2>Page N</h2> label per page; your document's titles arrive as plain <p> text. You promote them to <h2>/<h3> during cleanup. If you want at least a per-page marker, PDF to Markdown writes a ## Page N heading you can build on.
Do PDF tables convert to HTML tables?
No. Table cells flatten into the running text of the page's paragraph — there is no <table> reconstruction. To recover the row-and-column structure, use PDF Table to JSON and render that JSON as a real table in your template or component.
Can I use the HTML output directly in a React component?
You'll want to clean it first. The download is a full document; for React, strip the head/shell and keep the <div class="page"> sections (or just the <p> text). Because only < and > are escaped in the output, sanitise it with DOMPurify before rendering via dangerouslySetInnerHTML.
Are there any options for how content is extracted?
No. The tool has no settings — it extracts all text from every page and emits a fixed HTML structure the moment you drop a file. To extract only part of a document, run PDF Extract Pages first and convert the result.
Why does an entire page show up as one paragraph?
Text items are joined with spaces during extraction, so there are no double line-breaks inside a page, and the paragraph splitter (which looks for blank lines) keeps the whole page in one <p>. That's expected — split it into real paragraphs in your editor.
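The effect is easy to reproduce in isolation. The sketch below is a model of the behaviour described in this answer, not the tool's actual code: `join_items` and `split_paragraphs` are hypothetical stand-ins for the join step and the blank-line splitter.

```python
import re


def join_items(items: list) -> str:
    # Mirrors the described behaviour: text items are joined with single spaces.
    return " ".join(items)


def split_paragraphs(text: str) -> list:
    # A splitter that breaks only on blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


items = ["Executive Summary", "Revenue grew in Q3.", "Outlook", "We expect growth."]
page_text = join_items(items)

# No blank lines survive the join, so the splitter returns a single paragraph.
print(len(split_paragraphs(page_text)))

# For contrast, text that does contain a blank line splits into two paragraphs.
print(len(split_paragraphs("First paragraph.\n\nSecond paragraph.")))
```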
What happens to images embedded in the PDF?
They are not extracted. This is a text-only converter — no image files, no assets folder. To bring images back, render the relevant pages with PDF to PNG or PDF to JPG, host them, and add <img> tags during cleanup.
My extracted text is gibberish — what went wrong?
The PDF almost certainly uses a subset font with no Unicode mapping, so the underlying character codes don't map to readable letters. The page looks fine but the text layer is broken. Run PDF OCR on the PDF to read the rendered glyphs, then convert the OCR'd file.
Is my content uploaded to a server?
No. Everything runs in your browser via pdf.js — the PDF never leaves your device. Only an anonymous usage count is recorded when you're signed in, and you can opt out in account settings. This matters for gated or pre-launch content you don't want exposed.
How big a PDF can I extract from?
Free tier allows 2 MB and up to 50 pages; Pro raises it to 50 MB / 500 pages, and Pro + Media to 500 MB / 2,000 pages. For larger documents, split with PDF Split or extract the part you need first.
Can I get Markdown instead of HTML?
Yes — PDF to Markdown outputs .md with a ## Page N heading per page and sentence-aware line breaks, which is often cleaner to paste into Markdown-based CMSes or static-site generators than raw HTML.
Does the export preserve bold and italics?
No. Inline styling (bold, italic, colour, font) is not preserved — only the plain characters come across. Re-apply emphasis in your editor where it matters.
How is this different from PDF to Text?
PDF to Text gives you a .txt file with pages separated by blank lines and no markup. This tool wraps the same extracted text in HTML — page <div>s, <p> tags, and an inline stylesheet — so it's ready to view in a browser or paste into a web editor.
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.