How to reduce PDF size by optimising large embedded CJK fonts
- Step 1: Check which fonts are the heavy ones. In Acrobat, open File → Properties → Fonts. CJK faces (e.g. 'MS Gothic', 'SimSun', 'Noto Sans CJK', 'WenQuanYi') are typically the largest entries. Compare against the total file size to confirm that fonts, not images, are the bottleneck.
- Step 2: Open the PDF Font Subsetter. Go to the PDF Font Subsetter. Processing is entirely local: documents in any language never leave the browser.
- Step 3: Mind the tier limit for big CJK files. A fully embedded CJK PDF can easily exceed the free tier's 2 MB cap. Pro raises it to 50 MB / 500 pages. If the upload is rejected, that's the size/page limit, not a processing failure.
- Step 4: Press Process. There are no options. The tool scans every page's text for used CJK (and Latin) codepoints, analyses each embedded font with fontkit, and re-saves with packed object streams. Large CJK documents take longer because every page's text content is parsed.
- Step 5: Verify CJK rendering and search. Open the output, confirm Chinese/Japanese/Korean characters render correctly, and test search on a known phrase to confirm the text layer survived.
- Step 6: Compress images separately if needed. If the CJK PDF also embeds figures or scans, run lossy compression for those, or lossless to keep everything selectable.
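Step 1's comparison of fonts against images can be written down as a small calculation. The following is a minimal sketch, assuming you already have rough byte counts for fonts and images from some inspection tool; the `SizeBreakdown` shape, the `findBottleneck` name, and the sample numbers are all hypothetical and are not something the Font Subsetter reports.

```typescript
// Hypothetical breakdown of where a PDF's bytes go. The numbers would come
// from whatever inspection tool you use; they are assumptions here.
interface SizeBreakdown {
  totalBytes: number;
  fontBytes: number;
  imageBytes: number;
}

type Bottleneck = "fonts" | "images" | "other";

// Returns the category that accounts for the largest share of the file.
function findBottleneck(b: SizeBreakdown): Bottleneck {
  const otherBytes = Math.max(0, b.totalBytes - b.fontBytes - b.imageBytes);
  if (b.fontBytes >= b.imageBytes && b.fontBytes >= otherBytes) return "fonts";
  if (b.imageBytes >= otherBytes) return "images";
  return "other";
}

const MB = 1024 * 1024;

// Illustrative numbers only: a short letter with a full CJK face embedded.
const letter: SizeBreakdown = { totalBytes: 6.4 * MB, fontBytes: 6.0 * MB, imageBytes: 0.1 * MB };
// Illustrative numbers only: a report dominated by chart images.
const report: SizeBreakdown = { totalBytes: 18 * MB, fontBytes: 1.5 * MB, imageBytes: 16 * MB };

console.log(findBottleneck(letter)); // fonts
console.log(findBottleneck(report)); // images
```

If the answer is "images", font work is unlikely to help much and a compressor is the better first step.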
Why CJK fonts dominate PDF size
Approximate glyph counts and the implication for embedded font size. Exact bytes depend on the face and outline complexity.
| Font type | Typical glyph count | Embedded size order | Optimisation payoff |
|---|---|---|---|
| Latin text face | 200–800 | 20–120 KB | Modest |
| Latin + extended/symbols | 1,000–3,000 | 100–400 KB | Moderate |
| CJK (common subset) | 3,000–9,000 | 1–6 MB | High |
| CJK (full face) | 15,000–40,000+ | 5–20 MB | Highest (in theory) |
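The "in theory" payoff for a full face follows from simple arithmetic on glyph counts. A minimal sketch, assuming a short document that uses 300 distinct characters from a 20,000-glyph face (both figures are illustrative and `unusedGlyphFraction` is a hypothetical helper); glyph counts are only a proxy for bytes, since outline complexity varies.

```typescript
// Fraction of a face's glyphs that a document never uses.
function unusedGlyphFraction(usedGlyphs: number, totalGlyphs: number): number {
  if (totalGlyphs <= 0) throw new RangeError("totalGlyphs must be positive");
  // Clamp so a bad input cannot produce a fraction outside 0..1.
  const used = Math.min(Math.max(usedGlyphs, 0), totalGlyphs);
  return 1 - used / totalGlyphs;
}

// Illustrative: 300 distinct characters used from a 20,000-glyph face.
const fraction = unusedGlyphFraction(300, 20000);
console.log((fraction * 100).toFixed(1) + "% of glyphs unused"); // 98.5% of glyphs unused
```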
What the tool does on a CJK document
The pipeline applied to a Chinese/Japanese/Korean PDF, step by step.
| Stage | CJK-specific behaviour |
|---|---|
| Codepoint scan | Collects every distinct CJK character that actually appears on any page, plus basic ASCII |
| Font analysis | Parses each embedded CJK FontFile2/FontFile3 with fontkit and computes the used-glyph subset |
| Re-save | Repacks the whole document with object streams — the source of the measurable reduction |
| Text layer | Preserves the ToUnicode map so CJK search and copy/paste keep working |
| Images | Untouched — compress separately if the file also has figures/scans |
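The codepoint scan stage can be pictured as building a set of distinct codepoints. This is a sketch under stated assumptions, not the tool's actual code: it assumes page text has already been extracted into strings, and `collectCodepoints` and the sample strings are hypothetical.

```typescript
// Collect the distinct codepoints used across all pages, always including
// printable ASCII (0x20 to 0x7E). Text extraction is assumed to have
// happened elsewhere; this function only sees plain strings.
function collectCodepoints(pageTexts: string[]): Set<number> {
  const used = new Set<number>();
  for (let cp = 0x20; cp <= 0x7e; cp++) used.add(cp);
  for (const pageText of pageTexts) {
    // for...of walks a string by Unicode code point, so characters outside
    // the BMP are not split into surrogate halves.
    for (const ch of pageText) {
      const cp = ch.codePointAt(0);
      if (cp !== undefined) used.add(cp);
    }
  }
  return used;
}

// Hypothetical two-page document mixing Japanese and Latin text.
const samplePages = ["拝啓 Hello", "敬具"];
const usedCodepoints = collectCodepoints(samplePages);
console.log(usedCodepoints.size); // 99 (95 printable ASCII + 4 distinct kanji)
```

However many pages repeat a character, it contributes one entry, which is why a long document in a single language often needs only a few thousand glyphs.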
PDF tier limits for CJK files
Big CJK PDFs frequently need Pro because of the file-size cap.
| Tier | Max file size | Max pages | Batch files |
|---|---|---|---|
| Free | 2 MB | 50 | 1 |
| Pro | 50 MB | 500 | 5 |
| Pro Media | 500 MB | 500 | 5 |
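The table can be turned into a quick pre-upload check. A minimal sketch: the limits are copied from the table, but `smallestTierFor`, the page counts in the examples, and the choice of 1 MB = 1024 × 1024 bytes are assumptions, since the page does not say how the tool counts megabytes.

```typescript
// Assumption: 1 MB is treated as 1024 * 1024 bytes.
const BYTES_PER_MB = 1024 * 1024;

interface Tier {
  name: string;
  maxBytes: number;
  maxPages: number;
  batchFiles: number;
}

// Limits from the tier table, ordered from smallest to largest allowance.
const TIERS: Tier[] = [
  { name: "Free", maxBytes: 2 * BYTES_PER_MB, maxPages: 50, batchFiles: 1 },
  { name: "Pro", maxBytes: 50 * BYTES_PER_MB, maxPages: 500, batchFiles: 5 },
  { name: "Pro Media", maxBytes: 500 * BYTES_PER_MB, maxPages: 500, batchFiles: 5 },
];

// Returns the first tier whose size and page limits both fit, or null.
function smallestTierFor(fileBytes: number, pageCount: number): string | null {
  for (const tier of TIERS) {
    if (fileBytes <= tier.maxBytes && pageCount <= tier.maxPages) return tier.name;
  }
  return null;
}

console.log(smallestTierFor(1.2 * BYTES_PER_MB, 12)); // Free
console.log(smallestTierFor(6.4 * BYTES_PER_MB, 2)); // Pro
console.log(smallestTierFor(18 * BYTES_PER_MB, 600)); // null (over every page limit)
```

A `null` result means no tier accepts the file as it stands; splitting it first is the way forward.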
Cookbook
CJK-specific scenarios. The tool has no settings, so what varies is the document and which characters it actually uses.
A short Japanese letter with a full Gothic face embedded
A two-page business letter embeds the full MS Gothic face but uses only a few hundred kana and kanji. The re-save repacks the document.
Input: letter_jp.pdf 6.4 MB (full MS Gothic embedded)
Process: Font Subsetter
Output: letter_jp.pdf smaller (structure repacked)
Search a kanji phrase in the output → still found (ToUnicode kept)
Note: a 6.4 MB file needs Pro (free cap is 2 MB)
A Chinese contract already subsetted by its toolchain
A contract produced by a modern CJK-aware generator already embeds a subset of SimSun. There's little left to remove — the output is close to the input.
Input: contract_zh.pdf 1.2 MB (SimSun, Embedded Subset)
Process: Font Subsetter
Output: contract_zh.pdf ≈ 1.1 MB (already lean)
Takeaway: many CJK toolchains subset already; small gain is normal.
A Korean report that's actually image-heavy
A report embeds Noto Sans CJK KR but most of its 18 MB is high-res charts. Font work barely moves the needle; the images are the problem.
Input: report_kr.pdf 18 MB (mostly chart images)
Process: Font Subsetter → tiny change (only structure repacked)
Better: /pdf-tools/pdf-compress-lossy with a target size
→ re-encodes the charts → multi-MB reduction
Confirming CJK search survives the re-save
The most important CJK check: text remains searchable because the Unicode mapping is preserved.
1. Process the CJK PDF
2. Open output → Ctrl/Cmd+F → type a known CJK phrase → match highlighted (search uses ToUnicode, not glyph IDs)
3. Select the phrase → copy → paste → characters intact
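Why search depends on the Unicode mapping rather than on glyph IDs can be shown with a toy model. This sketch is deliberately simplified: a real ToUnicode CMap is a stream with its own syntax, and the character codes, the `ToUnicodeMap` type, and `decodeText` below are hypothetical stand-ins.

```typescript
// Toy model only: a map from character code to Unicode text stands in for
// a real ToUnicode CMap.
type ToUnicodeMap = Map<number, string>;

// Decode the character codes drawn on a page into searchable text.
// Codes with no mapping become U+FFFD, the replacement character.
function decodeText(codes: number[], toUnicode: ToUnicodeMap): string {
  return codes.map((code) => toUnicode.get(code) ?? "\uFFFD").join("");
}

// Hypothetical codes; the point is that the map is carried over unchanged.
const sampleToUnicode: ToUnicodeMap = new Map([
  [0x0101, "契"],
  [0x0102, "約"],
  [0x0103, "書"],
]);
const pageCodes = [0x0101, 0x0102, 0x0103];

const decoded = decodeText(pageCodes, sampleToUnicode);
console.log(decoded.includes("契約")); // true

// Without the map, every code decodes to U+FFFD and search finds nothing.
console.log(decodeText(pageCodes, new Map()).includes("契約")); // false
```

In this model, a source PDF that lacks the map fails the search check both before and after processing.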
Mixed CJK + Latin document
A bilingual manual embeds both a CJK face and a Latin face. The tool collects codepoints from both scripts across all pages.
Codepoints collected: Latin (0x20–0x7E always) + every distinct
CJK char that appears + any extended Latin used
Process: Font Subsetter → both embedded faces analysed,
document re-saved with object streams
Edge cases and what actually happens
The CJK PDF exceeds the free tier size cap
Rejected: Fully embedded CJK PDFs commonly run several megabytes, which exceeds the free tier's 2 MB limit. The upload is rejected before processing. Use Pro (50 MB) or Pro Media (500 MB), or split the document first with PDF Split.
The font was already subsetted by the source toolchain
By design: Many CJK generators already embed only the glyphs used. In that case there's little to remove and the output is close to the input. The structural re-save still runs, but a dramatic 90%+ reduction is only plausible for PDFs that embed the full face.
Most of the file is images, not the CJK font
Use a compressor: Charts, scans, and figures aren't touched by font analysis. If a CJK report is image-heavy, this tool barely shrinks it. Use lossy compression for the images, or lossless to keep text selectable.
Vertical writing or complex shaping
Preserved: Vertical CJK text and advanced shaping are encoded in the content streams, which are preserved on the round-trip. The tool doesn't re-typeset text, so vertical, ruby, and shaped runs render the same after processing.
A CID-keyed font program won't parse
Preserved: Some CJK fonts use CID-keyed Type 0 structures that fontkit may not parse. If that happens, the font is left intact and the rest of the document is still re-saved, so you never lose glyphs to a parse failure.
Large page count makes processing slow
Slow but works: Every page's text content is parsed in the browser to collect codepoints. A long CJK document (hundreds of pages) takes time and memory but completes. Splitting first can speed it up.
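If you do split first, the page ranges are easy to plan. A minimal sketch, assuming 1-based inclusive ranges and an arbitrary chunk size of 200 pages; `pageChunks` is a hypothetical helper, and the right chunk size depends on your device's memory.

```typescript
// Split a page count into consecutive 1-based, inclusive page ranges.
function pageChunks(totalPages: number, chunkSize: number): Array<[number, number]> {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const ranges: Array<[number, number]> = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    ranges.push([start, Math.min(start + chunkSize - 1, totalPages)]);
  }
  return ranges;
}

// Illustrative: a 480-page document split into chunks of at most 200 pages.
console.log(JSON.stringify(pageChunks(480, 200))); // [[1,200],[201,400],[401,480]]
```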
Search stops working after processing
Investigate source: Search relies on the ToUnicode map, which this tool preserves. If CJK search fails in the output, it almost always failed in the source too (the original lacked a proper ToUnicode map). Re-generating the PDF with correct Unicode mapping is the fix, not this tool.
Encrypted CJK document
Supported: The tool loads the file with encryption ignored, so most protected CJK files can be processed. If a strongly encrypted file fails to load, remove the password with Remove Password first.
Frequently asked questions
How much smaller can a CJK PDF get?
If the source embeds the full CJK face, the theoretical payoff is large because most of those tens of thousands of glyphs are unused. In practice the measurable reduction comes from the object-stream re-save and depends on the source — files that already subset their fonts see only a small gain. The result panel shows the exact before/after size.
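To put a before/after pair into perspective, the relative reduction is a one-line calculation. A small sketch using the already-subsetted contract from the cookbook (about 1.2 MB in, about 1.1 MB out); `reductionPercent` is a hypothetical helper, not part of the tool.

```typescript
// Relative size reduction as a percentage. Both sizes must use the same unit.
function reductionPercent(beforeSize: number, afterSize: number): number {
  if (beforeSize <= 0) throw new RangeError("beforeSize must be positive");
  return ((beforeSize - afterSize) / beforeSize) * 100;
}

// Cookbook contract: roughly 1.2 MB before, roughly 1.1 MB after.
console.log(reductionPercent(1.2, 1.1).toFixed(1) + "%"); // 8.3%
```

A single-digit percentage like this is the normal outcome for a file whose fonts were already subsetted.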
How do I know if my PDF has a big embedded CJK font?
Open File → Properties → Fonts in Acrobat. CJK faces like SimSun, MS Gothic, Batang, or Noto Sans CJK that show 'Embedded' (not 'Embedded Subset') are the heavy ones. Compare the font's apparent weight to the total file size to confirm fonts, not images, are the bottleneck.
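Outside Acrobat, the same distinction is usually visible in the font names themselves: the PDF specification gives subset fonts a tag of six uppercase letters and a plus sign in front of the name. A minimal sketch, assuming you have a list of font names from some PDF inspection tool; the sample names and `isSubsetFontName` are hypothetical, and a name without a tag may also belong to a font that is not embedded at all.

```typescript
// Subset fonts carry a six-uppercase-letter tag and a plus sign before the
// font name, for example "ABCDEF+SimSun".
const SUBSET_TAG = /^[A-Z]{6}\+/;

function isSubsetFontName(baseFontName: string): boolean {
  return SUBSET_TAG.test(baseFontName);
}

// Hypothetical font names as they might appear in a PDF's font dictionaries.
const fontNames = ["ABCDEF+SimSun", "MS-Gothic", "QWERTY+NotoSansCJKkr-Regular"];

// Names without a subset tag are candidates for a fully embedded face.
const fullFaceCandidates = fontNames.filter((fontName) => !isSubsetFontName(fontName));
console.log(fullFaceCandidates); // [ 'MS-Gothic' ]
```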
Will CJK text still be searchable after processing?
Yes. Search uses the PDF's ToUnicode Unicode-mapping layer, not the font file's internal glyph IDs, and this tool preserves that map on the round-trip. If search fails in the output, it almost certainly failed in the source — the original lacked a proper Unicode map.
My CJK file is 8 MB — why won't it upload on the free tier?
The free tier caps PDFs at 2 MB and 50 pages, and big embedded CJK fonts blow past that easily. Use Pro (50 MB / 500 pages) or Pro Media (500 MB). The rejection happens before processing, so it's a limit, not a failure.
Does it handle vertical (top-to-bottom) Japanese text?
Yes. Vertical writing direction is part of the content streams, which are preserved on the re-save. The tool doesn't re-typeset text, so vertical layout, ruby, and shaped runs render identically afterward.
What if my CJK font is CID-keyed and won't parse?
If fontkit can't parse a CID-keyed Type 0 program, the tool leaves that font intact and re-saves the rest of the document. You won't lose the glyphs — at worst that one font isn't analysed for subsetting.
Can I subset CJK fonts in a PDF/A without breaking it?
The re-save preserves embedded fonts and the text layer, so a PDF/A's font requirement is respected. If you need to re-assert PDF/A tagging afterward, run the PDF to PDF/A converter and re-validate, since strict conformance is checked separately.
Why did my image-heavy Korean report barely shrink?
Because the bulk was images, not the font. Font analysis doesn't touch JPEG/PNG XObjects. Run lossy compression to re-encode the charts and figures, or lossless if you must keep everything selectable.
Is my confidential CJK document uploaded?
No. pdf-lib, pdf.js, and fontkit run in your browser tab and the document never leaves your device, which makes the tool appropriate for confidential business and legal text in any language.
Does it change the appearance of the characters?
No. The glyphs that appear in the document are preserved exactly, and the content streams are kept, so every character renders the same. Only the file's internal structure is repacked.
Can I process multiple CJK files at once?
Not on the free tier — one file per run. Pro allows 5 PDFs per batch. For a folder of CJK documents, Pro is the practical choice; otherwise process them one at a time.
Should I run this before or after compressing the images?
Order doesn't strictly matter, but for a CJK PDF with both heavy fonts and images, run a compressor for the images (lossy for size targets, lossless to keep text) and this tool for the structure. Running both is safe; the second pass simply finds less to do.
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.