How to save a web page as clean markdown
- Step 1: Capture the page HTML the right way. For a server-rendered page, use View Source (Ctrl+U) and copy. For a JavaScript-rendered page (most modern sites), View Source is an empty shell; instead open DevTools, find the <article> or main content element, and right-click → Copy → Copy outerHTML.
- Step 2: Copy only the main content. This is the single most important step. Copying the whole <body> drags in nav, ads, comments, and <script>/<style> blocks that leak as text. Copy the smallest element that contains the article, usually <article>, <main>, or the content <div>.
- Step 3: Paste and run. Choose Paste text, drop the HTML in, and run. There are no options; Turndown converts deterministically. (Upload also works if you saved the HTML to a .html file.)
- Step 4: Scan for leaked junk. Check the Markdown for stray CSS (.class{...}) or script text; that's the tell that ad/analytics <script>/<style> came along. If so, re-copy a tighter element and convert again.
- Step 5: Save to your vault. Copy the Markdown to your notes app or download it as a .md file. Add a source-URL line and capture date at the top so future-you knows where it came from.
- Step 6: Tidy for storage. Optionally run the result through md-prettifier to normalize spacing, and md-link-validator to confirm the links you captured still resolve.
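The scan in Step 4 can be partly automated. The following is a minimal sketch and not part of the tool: `find_leaks` and its two regular expressions are hypothetical heuristics, so they may miss some leaks and may flag a few legitimate lines. Treat the result as a list of lines to eyeball, not a verdict.

```python
import re

# Heuristic patterns (assumptions, not taken from the tool):
# a single-line CSS rule such as ".ad{display:block}" ...
CSS_RULE = re.compile(r"^\s*[.#@]?[\w-]+[^{}]*\{[^{}]*\}\s*$")
# ... and a few common script idioms.
SCRIPT_HINT = re.compile(r"\b(function\s*\(|window\.|document\.|dataLayer|gtag\(|ga\()")


def find_leaks(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like leaked CSS or script."""
    hits = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # legitimate code blocks are skipped
            continue
        if in_fence:
            continue
        if CSS_RULE.match(line) or SCRIPT_HINT.search(line):
            hits.append((number, line))
    return hits
```

If `find_leaks` returns anything, the usual fix is the one in Step 4: re-copy a tighter element and convert again.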
Capture method by page type
Which capture method to use, and what to expect in the Markdown.
| Page type | How to capture | Result quality |
|---|---|---|
| Server-rendered article (blog, news) | View Source → copy <article> | Excellent — full content present |
| JS-rendered SPA (React/Vue/Next) | DevTools → Copy outerHTML of content element | Good — captures the rendered DOM |
| Whole <body> pasted | Not recommended | Noisy: nav/ads/script text leaks in |
| Paywalled / login-gated page | Capture only the HTML you can legitimately see | As good as the visible HTML |
| Infinite-scroll feed | Scroll to load, then copy the loaded section | Partial — only loaded items exist in DOM |
What converts cleanly vs. what leaks
Turndown does not auto-clean a page. Knowing what survives tells you how tight your copy needs to be.
| Element | Result | Action |
|---|---|---|
| <article> / <main> content | Clean Markdown | Copy this element |
| <nav>, <header>, <footer> | Links/text converted as-is | Exclude from your copy |
| <script> (ads, analytics) | Text leaks | Don't include; copy a tighter element |
| <style> blocks | CSS leaks as text | Don't include |
| <aside> (related links, ads) | Converted as content | Exclude from your copy |
| <table> with header | GFM pipe table | Kept (good) |
| <iframe> (videos, embeds) | Empty output | Note the URL manually if needed |
Tier limits for HTML input
Full-page HTML is large; the character count (not just bytes) is what's enforced.
| Plan | Max file size | Max characters | Files per run |
|---|---|---|---|
| Free | 1 MB | 500,000 | 1 |
| Pro | 10 MB | 5,000,000 | 10 |
| Pro-media | 50 MB | 20,000,000 | 50 |
| Developer | 500 MB | Unlimited | Unlimited |
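Because both a size and a character limit apply, it can help to measure a capture before pasting it. The sketch below is an illustration only: `LIMITS` copies the table above, `check_input` is a hypothetical helper, 1 MB is assumed to mean 1,000,000 bytes, and Python's `len()` counts Unicode code points, which may differ from how the tool counts characters.

```python
# (max_bytes, max_chars) per plan, copied from the table above.
# None means unlimited. Decimal megabytes are an assumption.
LIMITS = {
    "Free": (1_000_000, 500_000),
    "Pro": (10_000_000, 5_000_000),
    "Pro-media": (50_000_000, 20_000_000),
    "Developer": (500_000_000, None),
}


def check_input(html: str, plan: str = "Free") -> dict:
    """Report character and UTF-8 byte counts against a plan's limits."""
    max_bytes, max_chars = LIMITS[plan]
    chars = len(html)                  # Unicode code points
    size = len(html.encode("utf-8"))   # bytes once saved as UTF-8
    return {
        "chars": chars,
        "bytes": size,
        "chars_ok": max_chars is None or chars <= max_chars,
        "bytes_ok": size <= max_bytes,
    }
```

The two counts diverge on non-ASCII text: ten "é" characters are 10 characters but 20 bytes in UTF-8, which is why a file can pass one limit and fail the other.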
Cookbook
Real capture scenarios and the Markdown they produce. The lesson in most of them: copy the article, not the page.
Article copied cleanly via Copy outerHTML
Copying just the <article> element gives clean, readable Markdown with no nav or ad noise.
HTML in (from DevTools Copy outerHTML on <article>):
<article><h1>How DNS Works</h1><p>DNS resolves <strong>names</strong> to IPs.</p></article>
Markdown out:
# How DNS Works

DNS resolves **names** to IPs.
What happens when you paste the whole page
Include the analytics script and you get its source text in your notes. This is why copying a tighter element matters.
HTML in (whole page fragment):
<script>ga('send','pageview');</script>
<style>.ad{display:block}</style>
<p>Real content.</p>
Markdown out:
ga('send','pageview');
.ad{display:block}
Real content.
→ Re-copy just the <article> to avoid the leaked script/style text.
A data table from a reference page
Reference pages often have comparison tables. With a header row, they convert to a clean GFM table you can search in your notes.
HTML in:
<table><thead><tr><th>Port</th><th>Service</th></tr></thead>
<tbody><tr><td>443</td><td>HTTPS</td></tr></tbody></table>
Markdown out:
| Port | Service |
| --- | --- |
| 443 | HTTPS |
Add provenance to your saved note
The tool gives you the body Markdown; add a source header yourself so the archive is traceable.
After conversion, prepend in your vault:
> Source: https://example.com/how-dns-works
> Captured: 2026-06-13

# How DNS Works

DNS resolves **names** to IPs.
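If you save many pages, the prepend step can be scripted outside the tool. This is a minimal sketch under assumptions: `add_provenance` is a hypothetical helper name, and the header format simply mirrors the two-line blockquote in the example above.

```python
from datetime import date


def add_provenance(markdown: str, source_url: str, captured: date) -> str:
    """Prepend a source/date blockquote to converted Markdown."""
    header = (
        f"> Source: {source_url}\n"
        f"> Captured: {captured.isoformat()}\n"
        "\n"
    )
    # Drop leading blank lines so the header sits directly above the body.
    return header + markdown.lstrip("\n")
```

Passing the capture date explicitly, rather than calling `date.today()` inside the function, keeps the note accurate when you file it days after copying the HTML.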
A video embed leaves a gap
Tutorial pages with embedded videos lose the embed. Capture the URL so you can re-link it in your notes.
HTML in:
<iframe src="https://player.vimeo.com/video/12345"></iframe>
<p>Watch the walkthrough above.</p>
Markdown out:
Watch the walkthrough above.
→ The iframe produced nothing. Add the URL by hand:
[Walkthrough video](https://player.vimeo.com/video/12345)
Edge cases and what actually happens
It can't fetch a URL for you
Not supported: This is a converter, not a crawler. You must provide the HTML (View Source or Copy outerHTML). It never makes a network request to a page, which is also why it's fully private.
JavaScript-rendered pages come back empty from View Source
Expected: Modern SPAs build content client-side, so View Source returns an empty shell and the Markdown is nearly empty. Use DevTools → Copy outerHTML on the rendered element instead, which captures the live DOM.
Ads, nav, and cookie banners are not removed
By design: There is no readability/article-extraction step. Whatever HTML you paste is converted. The clean-up happens at capture time: copy only the <article>/main element, not the whole page.
Analytics/ad scripts leak as visible text
Leaked: If your copied HTML includes <script> (Google Analytics, ad tags), the script source appears as text in the Markdown, because Turndown has no rule to drop it. The same goes for CSS in <style>. Copy a tighter element to avoid them.
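When a tighter copy is not practical, one workaround is to remove the offending blocks from the HTML yourself before pasting. The sketch below is an assumption-laden illustration, not something the tool does: `strip_script_and_style` is a hypothetical helper, and it relies on a regular expression rather than a real HTML parser, so unusual markup (for example a literal "</script>" inside a string) can defeat it.

```python
import re

# Matches a whole <script>...</script> or <style>...</style> block,
# including attributes on the opening tag. Regex-based, so best effort only.
_BLOCK = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_script_and_style(html: str) -> str:
    """Return the HTML with script and style blocks removed."""
    return _BLOCK.sub("", html)
```

Copying only the article element remains the cleaner fix, since it also leaves out nav, ads, and banners that this function does not touch.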
Embedded videos and maps disappear
Dropped: <iframe> embeds (YouTube, Vimeo, Google Maps) produce empty output. If the embed matters for your archive, copy its src URL from the source HTML and add a Markdown link by hand.
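Collecting the embed URLs can be done with Python's standard `html.parser` before you convert. This is a sketch under assumptions: `IframeCollector` and `iframe_links` are hypothetical names, and the link label is something you supply, since the HTML does not say what the embed shows.

```python
from html.parser import HTMLParser


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:  # skip iframes without a usable src
                self.sources.append(src)


def iframe_links(html: str, label: str = "Embedded content") -> list[str]:
    """Return one Markdown link per iframe src found in the HTML."""
    collector = IframeCollector()
    collector.feed(html)
    collector.close()
    return [f"[{label}]({src})" for src in collector.sources]
```

Run it on the same HTML you are about to paste, then add the returned links to the converted Markdown where the embeds used to be.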
Lazy-loaded images may have placeholder src
Preserved: Sites that lazy-load images often keep the real URL in data-src and put a placeholder in src. Turndown reads src, so you may capture an image reference that points at the placeholder. Capture after images load, or copy the real URL from data-src manually.
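To see which images are affected before converting, you can list every <img> whose data-src differs from its src. A minimal sketch, assuming the site uses the data-src convention described above (some sites use other attribute names); `LazyImageFinder` and `find_lazy_images` are hypothetical names.

```python
from html.parser import HTMLParser


class LazyImageFinder(HTMLParser):
    """Record images whose data-src differs from their src."""

    def __init__(self):
        super().__init__()
        self.images = []  # (placeholder_src, real_url) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        real = attributes.get("data-src")
        if real and real != attributes.get("src"):
            self.images.append((attributes.get("src"), real))


def find_lazy_images(html: str) -> list[tuple]:
    """Return (placeholder_src, real_url) for each lazy-loaded image."""
    finder = LazyImageFinder()
    finder.feed(html)
    finder.close()
    return finder.images
```

Each pair tells you which placeholder URL to search for in the Markdown and which real URL to replace it with.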
Infinite-scroll feeds only capture what's loaded
Partial: Only DOM that exists at copy time is converted. For an infinite-scroll page, scroll to load the section you want before Copy outerHTML, or you'll archive a fraction of the content.
A reference table without a header stays as HTML
By design: Layout-only tables (no <thead>/<th>) are left as raw HTML in the Markdown, since the GFM rule needs a header. Repair them with md-table-repair if you need clean output.
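You can check ahead of time whether a capture contains tables that will stay as raw HTML. The sketch below only approximates the rule described above: it treats a table as having a header if it contains any <thead> or <th>, which may not match the converter's exact logic. `TableHeaderCheck` and `count_headerless_tables` are hypothetical names.

```python
from html.parser import HTMLParser


class TableHeaderCheck(HTMLParser):
    """Track, per table, whether a <thead> or <th> was seen."""

    def __init__(self):
        super().__init__()
        self.results = []  # one bool per closed table: True if it has a header
        self._open = []    # stack of flags for tables still open (handles nesting)

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._open.append(False)
        elif tag in ("thead", "th") and self._open:
            self._open[-1] = True

    def handle_endtag(self, tag):
        if tag == "table" and self._open:
            self.results.append(self._open.pop())


def count_headerless_tables(html: str) -> int:
    """Return how many closed tables contain neither <thead> nor <th>."""
    checker = TableHeaderCheck()
    checker.feed(html)
    checker.close()
    return checker.results.count(False)
```

A non-zero count is a hint to expect raw HTML in the output and to plan a repair pass afterwards.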
Whole-page capture exceeds the character limit
Rejected: Full HTML pages with inline SVG and scripts are huge and can blow past the Free 500,000-character cap. The tool reports the count and limit. Copy just the article (which also gives cleaner output) or upgrade for the 5,000,000-character Pro limit.
Frequently asked questions
Can I paste a URL and have it scrape the page?
No. It converts HTML you provide, not a URL it fetches. Use View Source (server-rendered) or DevTools Copy outerHTML (JS-rendered) to get the HTML, then paste it.
Will it handle JavaScript-rendered content?
Only if you give it the rendered HTML. View Source on a SPA returns an empty shell, so use DevTools → Copy outerHTML on the content element to capture the live DOM.
Does it strip ads, navigation, and scripts automatically?
No. There's no readability extraction. Worse, <script> and <style> text leaks into the output. The fix is to copy only the <article>/main element, not the whole page.
Is this for archiving to archive.org?
No — this produces a local Markdown copy for your own notes/archive. It's complementary to web-archive services, not a submission tool.
Is anything I capture sent to a server?
No. The conversion runs entirely in your browser, so the pages you capture stay on your device.
Why did a script's code end up in my notes?
You included a <script> block in the copied HTML. Turndown emits its text. Re-copy a tighter element (just the article) and convert again.
Do tables survive?
Tables with a header row become GFM pipe tables. Layout tables without a header stay as raw HTML — repair them with the table-repair tool if needed.
What about images?
Images become ![alt](url) references to the original URLs; files aren't downloaded. Lazy-loaded images may capture a placeholder src; grab the real URL from data-src if so.
Can I add the source URL automatically?
No — the tool outputs body Markdown only. Prepend a source/date line yourself in your notes app for provenance.
How big a page can I convert?
Free allows 500,000 characters. Full pages are large; copying just the article keeps you under the limit and produces cleaner Markdown. Pro raises it to 5,000,000.
Is this good for feeding an LLM?
Yes — Markdown is more token-efficient than HTML. For an LLM-focused workflow, see the dedicated guide at html-to-markdown-for-llm-input.
What format do I get out?
Markdown (.md). Copy it to your clipboard or download it. To convert Markdown back to HTML, use md-to-html.
Privacy first
All Markdown processing runs locally in your browser using JavaScript. No file is ever uploaded to JAD Apps servers — only metadata counters are saved for signed-in dashboard stats.