How to archive blog content as markdown
- Step 1: Capture each post's body HTML. View Source on the post (Ctrl+U) and copy, or in DevTools right-click the post's `<article>`/content element → Copy → Copy outerHTML. Capture the post body, not the whole page template, so nav and sidebars stay out of your archive.
- Step 2: Paste or upload and convert. Choose Paste text and drop the HTML in (or Upload file with a saved `.html`), then run. There are no options; Turndown produces the Markdown deterministically.
- Step 3: Save the images separately. The Markdown references images by their original URL, which dies with the host. Download each image (right-click → Save, or save the whole page's assets) into a folder alongside your `.md`, then rewrite paths with md-image-path-rewriter.
- Step 4: Add a metadata header. The tool outputs body Markdown only. Prepend a front-matter block with the post title, original publish date, source URL, and tags so the archive is self-describing. md-frontmatter-builder can scaffold it.
- Step 5: Re-link any embeds. `<iframe>` embeds (YouTube, CodePen, tweets) are dropped. Note their URLs from the source HTML and add them back as Markdown links so the archived post still points to the original media.
- Step 6: Commit to Git. Store the `.md` and its image folder in a Git repo. That gives you version history, off-machine backup (push to a remote), and a format that outlives any blogging platform.
What survives the archive vs. what you must capture separately
A durable archive needs more than body text. This is the gap list for blog backups.
| Blog element | In the Markdown? | Action for a complete archive |
|---|---|---|
| Post body (headings, text, lists) | Yes — converted | Done |
| Code blocks | Yes — fenced, with language if tagged | Done |
| Tables (with header row) | Yes — GFM pipe table | Header-less ones stay as HTML |
| Images | Reference only (`![alt](url)`) | Download files + rewrite paths |
| Post metadata (date, author, tags) | No | Add front matter manually |
| Comments | No (unless in the copied HTML) | Export comments separately if wanted |
| Embeds (YouTube, CodePen, tweets) | No — empty output | Note URLs and re-link |
Blog markup → Markdown
How typical blog HTML converts with this tool's Turndown config. Verified against the running converter.
| Blog markup | Markdown output | Notes |
|---|---|---|
| `<h2>Section</h2>` | `## Section` | ATX heading; id attribute dropped |
| `<blockquote>` | `> quote` | Pull quotes preserved |
| `<figure><img><figcaption>` | `![alt](src)` + caption paragraph | Caption kept as following text |
| `<pre><code class="language-py">` | Fenced `py` block | Language preserved for tech posts |
| `<a href="/2019/old-post/">` | `[text](/2019/old-post/)` | Internal links keep old slugs |
| `<iframe>` (embed) | (empty) | Re-link manually |
| `<del>` strikethrough | `~text~` | Single-tilde in this plugin version |
Tier limits for HTML input
Older posts with inline styling can be large; the character count is what's enforced.
| Plan | Max file size | Max characters | Files per run |
|---|---|---|---|
| Free | 1 MB | 500,000 | 1 |
| Pro | 10 MB | 5,000,000 | 10 |
| Pro-media | 50 MB | 20,000,000 | 50 |
| Developer | 500 MB | Unlimited | Unlimited |
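Before pasting a large post, it can help to check its character count against the table above. The following is a minimal sketch, not the tool's own check: the limits are copied from the table, the function name `smallest_plan_for` is hypothetical, and it assumes the tool counts characters the way Python's `len` does and treats the cap as inclusive. The separate file-size limit is not checked here.

```python
# Character caps copied from the tier table; None means unlimited.
PLAN_CHAR_LIMITS = {
    "Free": 500_000,
    "Pro": 5_000_000,
    "Pro-media": 20_000_000,
    "Developer": None,
}


def smallest_plan_for(html: str) -> str:
    """Return the first plan (smallest first) whose character cap fits the HTML."""
    count = len(html)  # assumption: the tool counts characters like len() does
    for plan, limit in PLAN_CHAR_LIMITS.items():
        if limit is None or count <= limit:
            return plan
    # Unreachable while the Developer plan is unlimited.
    raise ValueError("no plan fits")
```

For example, `smallest_plan_for(open("post.html", encoding="utf-8").read())` would tell you whether a saved post body fits under the Free cap.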
Cookbook
Real blog-archiving scenarios and the Markdown they produce — plus the manual steps that make the backup complete.
A blog post body converts cleanly
Headings, a pull quote, and a paragraph archive as clean, durable Markdown.
HTML in:
<article><h2>Why I Left Medium</h2>
<blockquote><p>Own your content.</p></blockquote>
<p>Here's what I learned.</p></article>

Markdown out:
## Why I Left Medium

> Own your content.

Here's what I learned.
Image reference that needs the file saved
The image converts to a reference pointing at the original host. When the blog dies, that URL 404s — so download the file and rewrite the path.
HTML in:
<img src="https://oldblog.com/wp-content/uploads/cover.jpg" alt="Cover">

Markdown out:
![Cover](https://oldblog.com/wp-content/uploads/cover.jpg)

Archive fix:
1. Save cover.jpg into ./images/
2. md-image-path-rewriter → ![Cover](./images/cover.jpg)
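If you would rather script the path rewrite, the idea can be sketched in a few lines of Python. This is an illustrative stand-in, not how md-image-path-rewriter works internally: the function name `localize_images` and the regex are assumptions, and it only handles absolute `http(s)` image URLs without a title attribute. Filename collisions between images are not resolved.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches ![alt](http://... or https://...) with no title part.
IMAGE_RE = re.compile(r'!\[([^\]]*)\]\((https?://[^)\s]+)\)')


def localize_images(markdown: str, folder: str = "./images") -> tuple[str, list[str]]:
    """Point remote image references at a local folder.

    Returns the rewritten Markdown plus the list of original URLs,
    so the caller knows which files still need downloading.
    """
    urls: list[str] = []

    def repl(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        urls.append(url)
        # Keep only the file name from the URL path, e.g. cover.jpg.
        name = PurePosixPath(urlparse(url).path).name
        return f"![{alt}]({folder}/{name})"

    return IMAGE_RE.sub(repl, markdown), urls
```

The returned URL list doubles as a download checklist for the image folder.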
Add a metadata header for a self-describing archive
Body Markdown alone loses the publish date and source. Prepend front matter so the archived post stands on its own.
After conversion, prepend (md-frontmatter-builder):
---
title: "Why I Left Medium"
date: 2019-08-14
source: https://oldblog.com/why-i-left-medium
tags: [blogging, ownership]
---

## Why I Left Medium

> Own your content.
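When archiving many posts, building this header by hand gets tedious. Here is a minimal sketch of generating it in Python; the function names `front_matter` and `with_front_matter` are hypothetical and this is not md-frontmatter-builder's implementation. It assumes tags are simple words that need no YAML quoting, and it only escapes backslashes and double quotes in the title.

```python
def front_matter(title: str, date: str, source: str, tags: list[str]) -> str:
    """Build a YAML front-matter block with the four fields used above."""
    # Escape backslashes and double quotes so the quoted title stays valid YAML.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "---",
        f'title: "{escaped}"',
        f"date: {date}",
        f"source: {source}",
        f"tags: [{', '.join(tags)}]",  # assumption: tags need no quoting
        "---",
        "",
    ]
    return "\n".join(lines)


def with_front_matter(body: str, **meta) -> str:
    """Prepend the front matter to converted Markdown, separated by a blank line."""
    return front_matter(**meta) + "\n" + body
```

Calling `with_front_matter(body, title=..., date=..., source=..., tags=[...])` on the converter's output yields a file shaped like the example above.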
A CodePen embed is lost — re-link it
Embedded demos vanish. Capture the URL from the source HTML and add a link so the archive still references the original.
HTML in:
<iframe src="https://codepen.io/user/embed/abcdef"></iframe>
<p>Live demo above.</p>

Markdown out:
Live demo above.

Archive fix (add the link by hand):
[Live demo on CodePen](https://codepen.io/user/pen/abcdef)
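To avoid missing an embed, you can list every `<iframe>` source in the HTML before converting. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and the generic link text is a placeholder you would replace by hand. It keeps each `src` exactly as found, so turning an embed URL into a nicer public URL (as in the CodePen example) remains a manual step.

```python
from html.parser import HTMLParser


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> in the fed HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def embed_links(html: str) -> list[str]:
    """Return one Markdown link per iframe, in document order."""
    parser = IframeCollector()
    parser.feed(html)
    parser.close()
    # Placeholder link text; rename each link by hand afterwards.
    return [f"[Embedded media]({src})" for src in parser.sources]
```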
A technical post keeps its code highlighting
For dev blogs, fenced blocks with language tags mean your archived tutorials still render with syntax highlighting if you republish.
HTML in:
<pre><code class="language-py">def hello():
    print('hi')</code></pre>
Markdown out:
```py
def hello():
    print('hi')
```
Edge cases and what actually happens
Images are not downloaded
Not handled: `<img>` becomes `![alt](url)` pointing at the original host. When the blog goes offline, those URLs break and your archive shows broken images. Download the files separately and rewrite paths with md-image-path-rewriter.
Post metadata is not captured
By design: publish date, author, and tags live in the page template or the CMS, not the body HTML, so they're not in the Markdown. Add a front-matter block manually with md-frontmatter-builder so the archive is dated and attributed.
Comments are not archived
Not handled: comment threads (Disqus, native, Webmentions) are usually loaded separately and won't be in the post body you copy. If comments matter, export them from the platform separately; this tool won't capture them.
Embeds (YouTube, CodePen, tweets) disappear
Dropped: `<iframe>` embeds produce empty output, so demos and videos vanish from the archive. Record each embed URL from the source HTML and re-add it as a Markdown link to keep the reference.
Internal links keep old slugs and will break
Preserved: links to other posts keep the original blog's URLs (`/2019/old-post/`). After the blog dies these 404. Decide whether to point them at archived copies, the Wayback Machine, or your new site, and fix them after conversion.
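If you settle on pointing old slugs at a new site, the rewrite can be scripted. This is a rough sketch under stated assumptions: the function name `repoint_internal_links` is hypothetical, `https://newsite.example` is a made-up base URL, and the new site is assumed to keep the same paths. Only root-relative links such as `/2019/old-post/` are touched; images and protocol-relative `//host/...` links are left alone.

```python
import re
from urllib.parse import urljoin

# [text](/path) but not ![alt](/path) and not protocol-relative //host/path.
LINK_RE = re.compile(r'(?<!!)\[([^\]]*)\]\((/(?!/)[^)\s]*)\)')


def repoint_internal_links(markdown: str, new_base: str) -> str:
    """Rewrite root-relative links so they resolve against new_base."""

    def repl(match: re.Match) -> str:
        text, path = match.group(1), match.group(2)
        return f"[{text}]({urljoin(new_base, path)})"

    return LINK_RE.sub(repl, markdown)
```

Because `urljoin` replaces the base path for root-relative targets, this approach suits a site base URL; a Wayback Machine prefix would need plain string concatenation instead.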
Theme `<style>`/`<script>` leaks if you copy the whole page
Leaked: copying the full page instead of the `<article>` brings inline `<style>`/`<script>` along, and their text leaks into the Markdown. Capture only the post body element for a clean archive.
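If you already saved whole pages and cannot easily re-copy just the post body, one fallback is to strip the `<style>` and `<script>` blocks before converting. The sketch below is an assumption-laden workaround, not part of the tool: the function name `strip_style_and_script` is hypothetical, and a regex is only a rough filter for HTML, so nav and sidebar markup will still come through.

```python
import re

# Non-greedy match of a whole <script>...</script> or <style>...</style> block.
BLOCK_RE = re.compile(
    r'<(script|style)\b[^>]*>.*?</\1\s*>',
    flags=re.DOTALL | re.IGNORECASE,
)


def strip_style_and_script(html: str) -> str:
    """Remove style and script elements, including their text content."""
    return BLOCK_RE.sub("", html)
```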
Old posts with classic-editor tables stay as HTML
By design: tables without a header row (common in old posts) aren't converted and remain raw `<table>` HTML. They still archive fine as text but won't render as clean tables; repair with md-table-repair if you care.
Only the loaded part of a paginated post is captured
Partial: if a post is split across pages or lazy-loads sections, only the HTML present in the DOM at copy time converts. Load every part (or capture each page) before converting so the archive is complete.
A media-heavy post exceeds the character limit
Rejected: old posts with inline base64 images or heavy styling can blow past the Free 500,000-character cap. The tool reports the count and limit. Strip inline data URIs (save those images as files) or upgrade to Pro's 5,000,000.
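Stripping data URIs by hand is error-prone, so here is a minimal sketch of doing it in Python. The function name `extract_inline_images` and the `inline-N` file naming are assumptions made for illustration. It handles only `data:image/...;base64,` URIs with unbroken base64 payloads, and it returns the decoded bytes rather than writing files, leaving storage to the caller.

```python
import base64
import re

# data:image/<subtype>;base64,<payload>
DATA_URI_RE = re.compile(r'data:image/([a-zA-Z0-9.+-]+);base64,([A-Za-z0-9+/=]+)')


def extract_inline_images(html: str, folder: str = "images") -> tuple[str, dict[str, bytes]]:
    """Replace inline base64 images with file paths.

    Returns the slimmed HTML and a mapping of path -> decoded image bytes.
    """
    files: dict[str, bytes] = {}

    def repl(match: re.Match) -> str:
        # "svg+xml" becomes "svg"; other subtypes are used as the extension as-is.
        ext = match.group(1).lower().split("+")[0]
        path = f"{folder}/inline-{len(files) + 1}.{ext}"
        files[path] = base64.b64decode(match.group(2))
        return path

    return DATA_URI_RE.sub(repl, html), files
```

After this pass, `len()` of the returned HTML shows how many characters the inline images were costing, and each entry in the mapping can be written to disk next to the `.md`.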
Frequently asked questions
Will my images be archived too?
No. Images become `![alt](url)` references to the original host, which break when the blog goes offline. Download the image files separately and rewrite the paths with md-image-path-rewriter.
Should I include the comments?
Comments are usually loaded separately and won't be in the post body you copy, so they aren't archived. If you want them, export them from the platform (e.g. Disqus) as a separate file.
What about blog metadata like date and author?
Not captured — the tool converts body HTML, and metadata lives in the template. Add a front-matter block manually with md-frontmatter-builder so the archive is self-describing.
Do embedded videos and demos survive?
No. <iframe> embeds produce empty output. Note their URLs from the source HTML and re-add them as Markdown links to preserve the reference.
Why Markdown for a long-term archive?
It's plain text — readable in any editor, diffable in Git, and importable into any future blog. No database, no proprietary format, no platform dependency.
Can I archive a whole blog at once?
Not in one run. Each file is one post body, and the number of files per run depends on your plan (1 on Free, 10 on Pro, 50 on Pro-media). Capture and convert posts individually, or use a platform export to feed each body through programmatically.
Do internal links still work after archiving?
They keep the original blog's URLs, which break once the blog is gone. Re-point them at archived copies or your new site after conversion.
Is anything I archive uploaded?
No. Conversion runs entirely in your browser, so you can archive private drafts and unpublished posts safely.
Will code samples keep their formatting?
Yes. <pre><code> becomes a fenced block, and a language-* class is kept as the fence language — important for dev-blog archives.
What's the best way to store the archive?
Put the .md files and an images folder in a Git repo and push to a remote. That gives you history, off-machine backup, and a durable format.
What if a post has inline base64 images?
Those inflate the character count and can exceed the Free 500,000 limit. Extract them to image files first, or upgrade to Pro's 5,000,000-character limit.
Can I re-publish the archive later?
Yes — that's the point. Markdown imports into Hugo, Astro, Ghost, and most platforms. Add front matter and rewrite image paths, and the post is ready to go live again.
Privacy first
All Markdown processing runs locally in your browser using JavaScript. No file is ever uploaded to JAD Apps servers — only metadata counters are saved for signed-in dashboard stats.