How to convert a PDF document to an HTML webpage
- Step 1: Open the converter and drop your PDF. Load the file into the PDF to HTML converter. It accepts a single PDF, and conversion starts automatically as soon as the file is added; there is no Convert button to press.
- Step 2: Let pdf.js read the text layer. Every page's embedded text is extracted with pdf.js. A born-digital PDF (exported from Word, Google Docs, InDesign, or a browser) has this layer; a scanned image does not (see the OCR note below).
- Step 3: Preview the generated HTML. The result panel shows the first 5,000 characters of the HTML source in a code block, plus stat tiles for input pages, input size, and output size. The full document is in the download.
- Step 4: Download the .html file. Save the single .html file (named after your PDF). It is fully self-contained: DOCTYPE, charset, inline style, and one section per page.
- Step 5: Restyle for your site. Open it in a code editor. Replace the inline <style> with your site's stylesheet, and target the .page and h2 selectors the converter emits to match your design.
- Step 6: Add the SEO essentials, then publish. Set a real <title> (the converter writes the placeholder "Converted PDF"), add a meta description and heading hierarchy, then publish and add the URL to your sitemap so Google can crawl it.
What the converter emits — exact output shape
The HTML structure is fixed: there are no options, so every conversion produces this skeleton.
| Part of the output | What it contains | Notes |
|---|---|---|
| Document shell | <!DOCTYPE html>, <html lang="en">, <head> with <meta charset="UTF-8"> and <title>Converted PDF</title> | The title is a fixed placeholder — rename it before publishing |
| Inline style | One <style> block: sans-serif body, max-width:800px, centred margins, .page with a bottom border, h2 coloured #333 | No external CSS file; replace it with your own stylesheet |
| Per-page section | <div class="page"><h2>Page N</h2> … </div> for each PDF page | The only heading is the page label <h2>; the document's own headings are not detected |
| Body text | Each page's text inside <p> tags, with < and > escaped to &lt; and &gt; | Almost always one <p> per page (see the paragraph note below) |
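The table above can be read as a small template. The following Python sketch rebuilds that skeleton from a list of per-page strings so you can see how the parts fit together. It is a reconstruction from the table, not the converter's source code: the function names are hypothetical, and the exact CSS values (border colour, margins) are assumptions.

```python
def escape_angle_brackets(text: str) -> str:
    """Escape only < and >, mirroring the behaviour described in the table."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def build_html(pages: list[str]) -> str:
    """Assemble the documented skeleton from per-page text (hypothetical helper)."""
    sections = []
    for number, text in enumerate(pages, start=1):
        # A page with no extracted text gets its label but no <p>.
        body = f"<p>{escape_angle_brackets(text)}</p>" if text.strip() else ""
        sections.append(f'<div class="page"><h2>Page {number}</h2>\n{body}</div>')
    # Assumed CSS values; only the general shape is documented.
    style = (
        "body{font-family:sans-serif;max-width:800px;margin:0 auto}"
        ".page{border-bottom:1px solid #ccc}h2{color:#333}"
    )
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en"><head><meta charset="UTF-8">\n'
        "<title>Converted PDF</title>\n"
        f"<style>{style}</style>\n"
        "</head><body>\n"
        + "\n".join(sections)
        + "\n</body></html>"
    )
```

Calling `build_html(["a < b", ""])` yields one page with an escaped paragraph and one page section that contains only its label.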
What is and isn't reconstructed
The tool extracts the text layer only. Anything visual or structural beyond plain text is not carried over.
| Feature | In the HTML output? | What to do instead |
|---|---|---|
| Body text | Yes — every page's text in <p> tags | — |
| Images / logos / charts | No — not extracted, no assets folder is created | Render pages to images with PDF to PNG or PDF to JPG and add <img> tags by hand |
| Headings (H1/H2/H3) from the document | No — only a <h2>Page N</h2> label per page | Promote headings manually, or convert with PDF to Markdown for ## heading markers per page |
| Tables as <table> | No — table cells flatten into the page's paragraph | Use PDF Table to JSON to recover row/column structure |
| Fonts, colours, exact layout | No — replaced by the inline default stylesheet | Treat HTML as reflowable text; keep the PDF for the pixel-perfect version |
File and page limits by tier
Enforced on the input PDF before conversion runs.
| Tier | Max file size | Max pages |
|---|---|---|
| Free | 2 MB | 50 pages |
| Pro | 50 MB | 500 pages |
| Pro + Media | 500 MB | 2,000 pages |
| Developer | 2 GB | 10,000 pages |
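If you are preparing files in bulk, the limits above are easy to check before you open the converter. The Python sketch below encodes the table and returns the lowest tier that accepts a given file. It assumes binary megabytes (1 MB = 1,048,576 bytes) and inclusive limits, neither of which the table states; the names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: the limits use binary megabytes


@dataclass(frozen=True)
class Tier:
    max_bytes: int
    max_pages: int


# Values copied from the tier table, ordered from lowest to highest tier.
TIERS = {
    "Free": Tier(2 * MB, 50),
    "Pro": Tier(50 * MB, 500),
    "Pro + Media": Tier(500 * MB, 2_000),
    "Developer": Tier(2 * 1024 * MB, 10_000),
}


def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier whose limits cover the file, or None if none does."""
    for name, tier in TIERS.items():
        if size_bytes <= tier.max_bytes and pages <= tier.max_pages:
            return name
    return None
```

A 1 MB file with 600 pages, for example, is small enough for the free tier by size but needs Pro + Media because of its page count.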
Cookbook
Real conversions and exactly what the generated HTML looks like for each. Output is abbreviated to show structure.
A two-page born-digital PDF
A PDF exported from Google Docs. Each page's text becomes one <p> inside a <div class="page">, under a <h2>Page N</h2> label.
Input: brochure.pdf (2 pages, exported from Google Docs)
Output (brochure.html, abbreviated):
<!DOCTYPE html>
<html lang="en"><head><meta charset="UTF-8">
<title>Converted PDF</title>
<style>body{font-family:sans-serif;max-width:800px;...}</style>
</head><body>
<div class="page"><h2>Page 1</h2>
<p>Acme Cloud Platform Overview ...</p></div>
<div class="page"><h2>Page 2</h2>
<p>Pricing starts at $29/month ...</p></div>
</body></html>
Renaming the placeholder title before publishing
Every conversion writes <title>Converted PDF</title>. For SEO you must replace it — Google uses the title tag in search results.
Before (as generated):
<title>Converted PDF</title>
After (edit in your code editor):
<title>Acme Cloud Platform — Overview & Pricing</title>
<meta name="description" content="Acme Cloud pricing, ...">
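If you convert many files, the same edit can be scripted. This is a minimal Python sketch that swaps the placeholder title for a real one and adds a meta description, escaping both values so that an ampersand or quote in your text does not break the markup. The function name is hypothetical, and it assumes the generated file contains the placeholder exactly as shown above.

```python
from html import escape

PLACEHOLDER = "<title>Converted PDF</title>"


def set_title_and_description(html_source: str, title: str, description: str) -> str:
    """Replace the placeholder <title> and add a meta description after it."""
    if PLACEHOLDER not in html_source:
        raise ValueError("placeholder title not found")
    head = (
        f"<title>{escape(title, quote=False)}</title>\n"
        f'<meta name="description" content="{escape(description, quote=True)}">'
    )
    # Replace only the first occurrence; body text is left untouched.
    return html_source.replace(PLACEHOLDER, head, 1)
```

Read the downloaded file as UTF-8, pass its contents through the function, and write the result back before publishing.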
Swapping the inline style for your site CSS
The tool ships a minimal default stylesheet inline. Replace it with a link to your own and style the .page and h2 selectors the converter emits.
Replace the generated <style>...</style> with:
<link rel="stylesheet" href="/css/site.css">
Then in site.css target the emitted classes:
.page { border-bottom: none; padding: 2rem 0; }
.page h2 { font-size: .75rem; text-transform: uppercase;
color: var(--muted); } /* de-emphasises page labels; use display: none to hide them */
Adding an image the converter skipped
Images are never extracted. To restore a logo or chart, render the page (or just that image area) to PNG with the sibling tool, host it, and add an <img> tag.
Step 1: PDF to PNG → page-1.png
Step 2: upload page-1.png to /assets/
Step 3: paste into the HTML where the image belongs:
<div class="page"><h2>Page 1</h2>
<img src="/assets/page-1.png" alt="Architecture diagram">
<p>Acme Cloud Platform Overview ...</p>
</div>
Scanned PDF returns an empty body
A photographed or scanned document has no text layer, so pdf.js finds nothing to extract — the page sections come out with no <p> content. OCR first.
Input: scanned-flyer.pdf (image-only)
Output body:
<div class="page"><h2>Page 1</h2></div> ← no <p>, no text
Fix: run PDF OCR first to add a real text layer, then convert the OCR'd PDF to HTML.
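A mixed PDF (some pages typed, some scanned) will have only some sections empty, which is easy to miss in a long file. The Python sketch below scans the generated HTML and lists the page numbers whose section has no paragraph text. It relies only on the documented structure (one <div class="page"> per page, text in <p> tags); the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PageTextScanner(HTMLParser):
    """Record, for each <div class="page">, whether any <p> text was seen."""

    def __init__(self) -> None:
        super().__init__()
        self.pages: list[bool] = []  # one flag per page section
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "page") in attrs:
            self.pages.append(False)
        elif tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Text inside the <h2> label is ignored; only <p> text counts.
        if self._in_p and self.pages and data.strip():
            self.pages[-1] = True


def empty_pages(html_source: str) -> list[int]:
    """Return 1-based numbers of page sections that contain no <p> text."""
    scanner = PageTextScanner()
    scanner.feed(html_source)
    scanner.close()
    return [n for n, has_text in enumerate(scanner.pages, start=1) if not has_text]
```

Any page number it returns is a candidate for OCR before you convert again.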
Edge cases and what actually happens
Scanned / image-only PDF
Empty output: If the PDF is a scan or photo with no embedded text layer, pdf.js extracts nothing and each <div class="page"> comes out with only its <h2>Page N</h2> label and no <p>. Run PDF OCR first to add a searchable text layer, then convert the OCR'd PDF here.
Images, logos and charts are dropped
By design: This converter extracts the text layer only. It never reads image XObjects and never writes an assets folder, so visual elements simply do not appear in the HTML. To bring them back, render pages with PDF to PNG and add <img> tags manually.
Document headings collapse into body text
By design: The tool does not infer heading levels from font size or weight; the only <h2> it emits is the per-page "Page N" label. A "Chapter 1" title at 24pt becomes ordinary <p> text. For per-page heading markers, PDF to Markdown emits a ## Page N heading you can post-process.
Each page renders as a single paragraph
Expected: Text items are joined with spaces during extraction, so there are no blank-line breaks inside a page. The paragraph splitter looks for double newlines and finds none, so a page's whole text usually lands in one <p>. Split it into real paragraphs by hand if you need finer structure.
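The single-paragraph behaviour follows directly from those two steps, as this short Python sketch shows. It is an illustration under assumptions, not the converter's code: the item list is made up, and the splitter's exact pattern is not documented, so a blank-line regex stands in for it.

```python
import re


def join_items(items: list[str]) -> str:
    """Join extracted text items with single spaces, as described above."""
    return " ".join(items)


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines (assumed pattern) and drop empty pieces."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


# Hypothetical text items for one page.
items = ["Acme Cloud", "Platform Overview", "Pricing starts at $29/month"]
page_text = join_items(items)
paragraphs = split_paragraphs(page_text)  # one element: nothing to split on
```

Because the joined text contains no newlines at all, the splitter has nothing to split on and returns the whole page as one paragraph.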
File larger than 2 MB on the free tier
Blocked: The free tier caps input at 2 MB. A long report, especially one with embedded images or fonts, can exceed that quickly. Upgrade to Pro (50 MB) or split the PDF first with PDF Split and convert each part separately.
More than 50 pages on the free tier
Blocked: The free tier converts up to 50 pages. Longer manuals need Pro (500 pages) or higher, or extract the section you need with PDF Extract Pages before converting.
Password-protected / encrypted PDF
Fails to open: pdf.js cannot read an encrypted PDF that needs a password to open, so conversion fails before any text is extracted. Remove the password first with PDF Unlock (you must know it), then convert the unlocked file.
Multi-column layout reads across columns
May interleave: pdf.js returns text in the PDF's internal item order, which for a two-column academic layout can zig-zag across columns mid-line. The text is all present in the HTML, but the reading order may be scrambled, so review and reorder by hand after conversion.
Custom-encoded or subset font with no Unicode map
Garbled: Some PDFs embed subset fonts without a ToUnicode map, so the extracted characters are wrong even though the page looks fine. The HTML will contain garbled text. OCR via PDF OCR is the reliable workaround, because it reads the rendered glyphs instead of the broken text layer.
Ampersands and quotes are left as-is
Review: Only < and > are escaped (to &lt; and &gt;). A literal & or stray quote in the source text is passed through unescaped. For valid, strict HTML, run the output through an HTML formatter/linter before publishing.
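If you would rather fix ampersands yourself than rely on a linter, one possible approach is a single regular-expression pass. The Python sketch below escapes any & that does not already start a character reference, so the &lt; and &gt; the converter wrote are left alone. The function name is hypothetical, and the pattern is a heuristic: source text that happens to look like a reference (for example a word followed directly by a semicolon) is left as-is.

```python
import re

# An ampersand NOT followed by a named, decimal, or hex reference and ";".
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(html_source: str) -> str:
    """Turn stray & into &amp; while keeping existing character references."""
    return _BARE_AMP.sub("&amp;", html_source)
```

Run it on the whole downloaded file; quotes inside <p> text are already valid HTML and need no change.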
Frequently asked questions
Are images from the PDF included in the HTML?
No. This tool extracts the text layer only — it does not read embedded images and does not create an assets folder. The HTML you get is text-only. To include images, render the pages to PNG or JPG with PDF to PNG or PDF to JPG, host them, and add <img> tags to the HTML yourself.
Will the heading structure (H1/H2/H3) be detected?
Not from the document. The only heading the converter emits is a <h2>Page N</h2> label at the top of each page section; your document's own headings come through as ordinary <p> text. Promote them manually after conversion, or use PDF to Markdown, which at least marks each page with a ## heading you can build on.
Will the HTML be indexed by Google?
Yes, it can be. The body text lives in real <p> tags, which are fully crawlable. Before publishing, replace the placeholder <title>Converted PDF</title> with a descriptive title, add a meta description, fix the heading hierarchy, and add the page to your sitemap so Googlebot can find it.
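Those pre-publish steps can be turned into a quick automated check. The Python sketch below reports which of them are still outstanding in an HTML file. It is deliberately simple (string and regex checks, not a full HTML parse), the function name is hypothetical, and treating "fix the heading hierarchy" as "the page has an <h1>" is an assumption about what a reasonable minimum looks like.

```python
import re


def publish_issues(html_source: str) -> list[str]:
    """List the pre-publish steps that still look unfinished."""
    issues = []
    match = re.search(r"<title>(.*?)</title>", html_source, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else ""
    if not title or title == "Converted PDF":
        issues.append("title is missing or still the placeholder")
    if not re.search(r'<meta\s+name="description"', html_source, re.IGNORECASE):
        issues.append("meta description is missing")
    # Assumption: a publishable page should have at least one <h1>.
    if not re.search(r"<h1[\s>]", html_source, re.IGNORECASE):
        issues.append("no <h1> heading")
    return issues
```

An empty list means the file has passed these three checks; it does not say anything about sitemap inclusion, which happens outside the file.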
Does the output keep the PDF's exact layout, fonts and colours?
No. HTML is reflowable text. The converter applies a small default stylesheet (sans-serif, 800px max width) and drops the PDF's fonts, colours, and pixel positioning. The content is faithful; the appearance is generic until you restyle it. Keep the PDF if you need a pixel-perfect copy.
Are there any options or settings?
No. The tool auto-runs the moment you drop a PDF — there is no Convert button, no page-range field, and no formatting choices. It extracts all text from every page and emits the fixed HTML skeleton. To work with a subset of pages, extract them first with PDF Extract Pages.
Why is each page just one big paragraph?
During extraction the text items are joined with spaces, so there are no blank lines inside a page. The paragraph splitter looks for double line-breaks and finds none, so the whole page lands in a single <p>. That is expected behaviour; split into multiple paragraphs by hand if you need them.
My scanned PDF produced empty page sections — why?
A scan is just images, with no embedded text for pdf.js to read, so each page section comes out empty. Run PDF OCR first to add a real text layer, then convert the OCR'd PDF here.
Is the document uploaded anywhere?
No. Conversion runs entirely in your browser via pdf.js; the PDF bytes never leave your device. Only an anonymous usage counter is recorded when you're signed in, which you can opt out of in account settings.
What's the file-size and page limit?
The free tier allows 2 MB and up to 50 pages. Pro raises that to 50 MB / 500 pages, Pro + Media to 500 MB / 2,000 pages, and Developer to 2 GB / 10,000 pages. If your PDF is over the limit, split it with PDF Split and convert the parts.
What's the difference between this and HTML to PDF?
Opposite directions. This tool turns a PDF into an HTML page. HTML to PDF does the reverse — it renders HTML content into a PDF document. Use that one when you want a downloadable PDF from a web page or template.
Can I automate this for a batch of PDFs?
Yes, with a Pro plan. pdf-to-html is available as a runner-builtin tool — pair the @jadapps/runner once, then POST each PDF to the local runner endpoint and collect the HTML. The conversion still runs locally on your machine, so the documents never reach JAD's servers.
Should I keep offering the PDF alongside the HTML page?
Usually yes — publish the HTML for search and mobile readers, and keep a 'Download PDF' link for anyone who wants the formatted, printable original. The HTML carries the indexable text; the PDF carries the exact layout and any images the converter skipped.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.