Reduce PDF Size by Optimising Large CJK Fonts

How to reduce PDF size by optimising large embedded CJK fonts

Step 1
Check which fonts are the heavy ones. In Acrobat, open File → Properties → Fonts. CJK faces (e.g. 'MS Gothic', 'SimSun', 'Noto Sans CJK', 'WenQuanYi') are typically the largest entries. Compare them against the total file size to confirm that fonts, not images, are the bottleneck.
Step 2
Open the PDF Font Subsetter. Processing is entirely local, so documents in any language never leave the browser.
Step 3
Mind the tier limit for big CJK files. A fully embedded CJK PDF can easily exceed the free tier's 2 MB cap. Pro raises the limit to 50 MB / 500 pages. If the upload is rejected, the cause is the size or page limit, not a processing failure.
Step 4
Press Process. There are no options to set. The tool scans every page's text for used CJK (and Latin) codepoints, analyses each embedded font with fontkit, and re-saves the document with packed object streams. Large CJK documents take longer because every page's text content is parsed.
Step 5
Verify CJK rendering and search. Open the output, confirm that Chinese/Japanese/Korean characters render correctly, and search for a known phrase to confirm the text layer survived.
Step 6
Compress images separately if needed. If the CJK PDF also embeds figures or scans, run lossy compression for those, or lossless compression to keep everything selectable.
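
The comparison in Step 1 comes down to simple arithmetic. Below is a minimal sketch in Python, assuming you have already obtained approximate byte counts for fonts and images from some inspection tool. The function names and the sample figures are illustrative and are not output from the PDF Font Subsetter.

```python
from typing import Dict


def size_shares(total_bytes: int, font_bytes: int, image_bytes: int) -> Dict[str, float]:
    """Return each component's share of the file as a fraction of the total."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Whatever is neither font nor image data (content streams, metadata, ...).
    other_bytes = max(total_bytes - font_bytes - image_bytes, 0)
    return {
        "fonts": font_bytes / total_bytes,
        "images": image_bytes / total_bytes,
        "other": other_bytes / total_bytes,
    }


def bottleneck(shares: Dict[str, float]) -> str:
    """Name the component that accounts for the largest share."""
    return max(shares, key=shares.get)


# Hypothetical figures: a short letter dominated by a full CJK face,
# and a report dominated by chart images.
letter = size_shares(total_bytes=6_400_000, font_bytes=6_000_000, image_bytes=100_000)
report = size_shares(total_bytes=18_000_000, font_bytes=1_500_000, image_bytes=16_000_000)
print(bottleneck(letter), bottleneck(report))
```

If the largest share is images, Step 6 matters more than the font work.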

Why CJK fonts dominate PDF size

Approximate glyph counts and the implication for embedded font size. Exact bytes depend on the face and outline complexity.

Font type	Typical glyph count	Embedded size order	Optimisation payoff
Latin text face	200–800	20–120 KB	Modest
Latin + extended/symbols	1,000–3,000	100–400 KB	Moderate
CJK (common subset)	3,000–9,000	1–6 MB	High
CJK (full face)	15,000–40,000+	5–20 MB	Highest (in theory)
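
To see why the payoff for a full face is "highest (in theory)", a rough proportional estimate helps. The sketch below assumes every glyph costs about the same number of bytes and ignores fixed font tables, so treat the result as an order of magnitude only. The function name and the sample numbers are hypothetical, and the figure is not a prediction of what the tool will output.

```python
def estimated_subset_bytes(full_font_bytes: int, total_glyphs: int, used_glyphs: int) -> int:
    """Proportional estimate of a subset's size.

    Assumes a uniform per-glyph cost, which real fonts only approximate:
    complex kanji outlines are larger than kana or Latin outlines.
    """
    if total_glyphs <= 0 or not 0 <= used_glyphs <= total_glyphs:
        raise ValueError("need 0 <= used_glyphs <= total_glyphs and total_glyphs > 0")
    return round(full_font_bytes * used_glyphs / total_glyphs)


# Hypothetical case: a 10 MB face with 20,000 glyphs, of which 400 are used.
print(estimated_subset_bytes(10_000_000, 20_000, 400))
```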

What the tool does on a CJK document

The pipeline applied to a Chinese/Japanese/Korean PDF, step by step.

Stage	CJK-specific behaviour
Codepoint scan	Collects every distinct CJK character that actually appears on any page, plus basic ASCII
Font analysis	Parses each embedded CJK `FontFile2`/`FontFile3` with fontkit and computes the used-glyph subset
Re-save	Repacks the whole document with object streams — the source of the measurable reduction
Text layer	Preserves the ToUnicode map so CJK search and copy/paste keep working
Images	Untouched — compress separately if the file also has figures/scans
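
The codepoint scan in the first row can be sketched in a few lines. This is an illustration in Python, not the tool's actual code: it assumes the text of each page has already been extracted into a string, and `collect_codepoints` is a hypothetical name.

```python
from typing import Iterable, Set


def collect_codepoints(page_texts: Iterable[str]) -> Set[int]:
    """Collect every distinct codepoint used on any page, plus basic ASCII."""
    # Basic ASCII 0x20-0x7E is always included, matching the table above.
    codepoints = set(range(0x20, 0x7F))
    for text in page_texts:
        # Skip control characters such as newlines; they have no glyph.
        codepoints.update(ord(ch) for ch in text if ord(ch) >= 0x20)
    return codepoints


pages = ["東京 Tokyo", "東京都\n"]
used = collect_codepoints(pages)
print(len(used), ord("都") in used)
```

A set keeps each character once no matter how often it appears, which is why a long document in a small vocabulary still yields a small subset.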

PDF tier limits for CJK files

Big CJK PDFs frequently need Pro because of the file-size cap.

Tier	Max file size	Max pages	Batch files
Free	2 MB	50	1
Pro	50 MB	500	5
Pro Media	500 MB	500	5
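
The table translates directly into a pre-flight check. The sketch below encodes the three tiers from the table and returns the smallest one that accepts a given job. It assumes 1 MB means 1024 × 1024 bytes, which the page does not specify, and `smallest_tier` is a hypothetical helper, not part of the tool.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: the page does not say whether MB is 10^6 or 2^20 bytes


@dataclass(frozen=True)
class Tier:
    name: str
    max_bytes: int
    max_pages: int
    max_batch: int


# Values copied from the tier table above, ordered from smallest to largest.
TIERS = [
    Tier("Free", 2 * MB, 50, 1),
    Tier("Pro", 50 * MB, 500, 5),
    Tier("Pro Media", 500 * MB, 500, 5),
]


def smallest_tier(file_bytes: int, pages: int, batch: int = 1) -> Optional[str]:
    """Return the first tier whose limits fit the job, or None if none does."""
    for tier in TIERS:
        if (file_bytes <= tier.max_bytes
                and pages <= tier.max_pages
                and batch <= tier.max_batch):
            return tier.name
    return None


print(smallest_tier(int(6.4 * MB), pages=2))   # a 6.4 MB two-page letter
print(smallest_tier(int(1.2 * MB), pages=10))  # a 1.2 MB contract
```

A result of None means the document has to be split before any tier will accept it.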

Cookbook

CJK-specific scenarios. The tool has no settings, so what varies is the document and which characters it actually uses.

A short Japanese letter with a full Gothic face embedded

A two-page business letter embeds the full MS Gothic face but uses only a few hundred kana and kanji. The re-save repacks the document.

Input:   letter_jp.pdf    6.4 MB  (full MS Gothic embedded)
Process: Font Subsetter
Output:  letter_jp.pdf    smaller (structure repacked)

Search a kanji phrase in the output → still found (ToUnicode kept)
Note: a 6.4 MB file needs Pro (free cap is 2 MB)

A Chinese contract already subsetted by its toolchain

A contract produced by a modern CJK-aware generator already embeds a subset of SimSun. There's little left to remove — the output is close to the input.

Input:   contract_zh.pdf  1.2 MB  (SimSun, Embedded Subset)
Process: Font Subsetter
Output:  contract_zh.pdf  ≈ 1.1 MB  (already lean)

Takeaway: many CJK toolchains subset already; small gain is normal.
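
To put a number on "small gain", the before/after sizes can be turned into a percentage. A minimal sketch using the illustrative sizes from this scenario; `reduction_percent` is a hypothetical helper.

```python
def reduction_percent(before_bytes: int, after_bytes: int) -> float:
    """Percentage by which the file shrank (negative if it grew)."""
    if before_bytes <= 0:
        raise ValueError("before_bytes must be positive")
    return (before_bytes - after_bytes) / before_bytes * 100


# 1.2 MB in, roughly 1.1 MB out: a reduction of about 8%.
print(f"{reduction_percent(1_200_000, 1_100_000):.1f}%")
```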

A Korean report that's actually image-heavy

A report embeds Noto Sans CJK KR but most of its 18 MB is high-res charts. Font work barely moves the needle; the images are the problem.

Input:   report_kr.pdf    18 MB  (mostly chart images)
Process: Font Subsetter → tiny change (only structure repacked)

Better: /pdf-tools/pdf-compress-lossy with a target size
        → re-encodes the charts → multi-MB reduction

Confirming CJK search survives the re-save

The most important CJK check: text remains searchable because the Unicode mapping is preserved.

1. Process the CJK PDF
2. Open output → Ctrl/Cmd+F → type a known CJK phrase
   → match highlighted (search uses ToUnicode, not glyph IDs)
3. Select the phrase → copy → paste → characters intact

Mixed CJK + Latin document

A bilingual manual embeds both a CJK face and a Latin face. The tool collects codepoints from both scripts across all pages.

Codepoints collected: Latin (0x20–0x7E always) + every distinct
  CJK char that appears + any extended Latin used

Process: Font Subsetter → both embedded faces analysed,
         document re-saved with object streams
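
For a bilingual document it can be useful to see how the distinct characters split across scripts. The sketch below classifies distinct characters by a few well-known Unicode blocks. The block list is deliberately short (it omits CJK extensions, fullwidth forms, and punctuation, which land in "other"), and `count_by_script` is a hypothetical name, not a function of the tool.

```python
from typing import Dict

# A few Unicode blocks; everything else is counted as "other".
SCRIPT_RANGES = {
    "basic_latin": (0x0020, 0x007E),
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "cjk_ideograph": (0x4E00, 0x9FFF),
    "hangul": (0xAC00, 0xD7A3),
}


def count_by_script(text: str) -> Dict[str, int]:
    """Count distinct characters per script block."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    counts["other"] = 0
    for ch in set(text):  # distinct characters only
        cp = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= cp <= high:
                counts[name] += 1
                break
        else:
            counts["other"] += 1
    return counts


print(count_by_script("日本語マニュアル ABC"))
```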

Edge cases and what actually happens

The CJK PDF exceeds the free tier size cap

Rejected

Fully embedded CJK PDFs commonly run to several megabytes, which exceeds the free tier's 2 MB limit. The upload is rejected before processing. Use Pro (50 MB) or Pro Media (500 MB), or split the document first with PDF Split.

The font was already subsetted by the source toolchain

By design

Many CJK generators already embed only the glyphs used. In that case there is little to remove and the output is close to the input. The structural re-save still runs, but a dramatic reduction (90% or more) is only plausible for PDFs that embed the full face.

Most of the file is images, not the CJK font

Use a compressor

Charts, scans, and figures aren't touched by font analysis. If a CJK report is image-heavy, this tool barely shrinks it. Use lossy compression for the images, or lossless to keep text selectable.

Vertical writing or complex shaping

Preserved

Vertical CJK text and advanced shaping are encoded in the content streams, which are preserved on the round-trip. The tool doesn't re-lay-out text, so vertical/ruby/shaped runs render the same after processing.

A CID-keyed font program won't parse

Preserved

Some CJK fonts use CID-keyed Type 0 structures fontkit may not parse. If that happens, the font is left intact and the rest of the document is still re-saved — you never lose the glyphs to a parse failure.
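
The "leave it intact on failure" behaviour is an ordinary try/except pattern. The sketch below shows the idea in Python with a toy parser standing in for fontkit. `toy_parse` and `analyse_fonts` are hypothetical names, and the magic-number check is only a stand-in for real font parsing.

```python
from typing import Callable, Dict


def toy_parse(program: bytes) -> None:
    """Stand-in parser: accepts only TrueType/OpenType style headers."""
    if not program.startswith((b"\x00\x01\x00\x00", b"OTTO")):
        raise ValueError("unrecognised font program")


def analyse_fonts(fonts: Dict[str, bytes],
                  parse: Callable[[bytes], None]) -> Dict[str, str]:
    """Try to analyse each font; a parse failure never aborts the run."""
    outcome = {}
    for name, program in fonts.items():
        try:
            parse(program)
            outcome[name] = "analysed"
        except Exception:
            # The font program is not modified, so no glyphs are lost.
            outcome[name] = "left intact"
    return outcome


fonts = {
    "MS-Gothic": b"\x00\x01\x00\x00" + b"\x00" * 8,  # parses
    "SomeCIDFont": b"%!PS-Adobe",                   # does not parse
}
print(analyse_fonts(fonts, toy_parse))
```

Catching the failure per font, instead of around the whole loop, is what lets the rest of the document still be re-saved.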

Large page count makes processing slow

Slow but works

Every page's text content is parsed in the browser to collect codepoints. A long CJK document (hundreds of pages) takes time and memory but completes. Splitting first can speed it up.

Search stops working after processing

Investigate source

Search relies on the ToUnicode map, which this tool preserves. If CJK search fails in the output, it almost always failed in the source too (the original lacked a proper ToUnicode map). Re-generating the PDF with correct Unicode mapping is the fix, not this tool.

Encrypted CJK document

Supported

The tool loads the file with encryption ignored, so most protected CJK files can be processed. If a strongly encrypted file fails to load, remove the password with Remove Password first.

Frequently asked questions

How much smaller can a CJK PDF get?

If the source embeds the full CJK face, the theoretical payoff is large because most of those tens of thousands of glyphs are unused. In practice the measurable reduction comes from the object-stream re-save and depends on the source — files that already subset their fonts see only a small gain. The result panel shows the exact before/after size.

How do I know if my PDF has a big embedded CJK font?

Open File → Properties → Fonts in Acrobat. CJK faces like SimSun, MS Gothic, Batang, or Noto Sans CJK that show 'Embedded' (not 'Embedded Subset') are the heavy ones. Compare the font's apparent weight to the total file size to confirm fonts, not images, are the bottleneck.
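
If you can see the raw font names instead of Acrobat's dialog, the same distinction is visible in the name itself: by PDF convention a subset font's name starts with a six-letter uppercase tag followed by a plus sign, such as `ABCDEF+SimSun`. The check below is a small sketch of that convention. `is_subset_name` is a hypothetical helper, and a producer that ignores the convention would not be detected.

```python
import re

# Six uppercase letters and a plus sign mark a subset font name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def is_subset_name(base_font: str) -> bool:
    """True when the font name carries the conventional subset tag."""
    return SUBSET_TAG.match(base_font) is not None


for name in ["ABCDEF+SimSun", "MS-Gothic", "XYZABC+NotoSansCJKkr-Regular"]:
    label = "Embedded Subset" if is_subset_name(name) else "possibly full face"
    print(name, "->", label)
```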

Will CJK text still be searchable after processing?

Yes. Search uses the PDF's ToUnicode map, not the font file's internal glyph IDs, and this tool preserves that map on the round-trip. If search fails in the output, it almost certainly failed in the source too, because the original lacked a proper ToUnicode map.

My CJK file is 8 MB — why won't it upload on the free tier?

The free tier caps PDFs at 2 MB and 50 pages, and big embedded CJK fonts blow past that easily. Use Pro (50 MB / 500 pages) or Pro Media (500 MB). The rejection happens before processing, so it's a limit, not a failure.

Does it handle vertical (top-to-bottom) Japanese text?

Yes. Vertical writing direction is part of the content streams, which are preserved on the re-save. The tool doesn't re-typeset text, so vertical layout, ruby, and shaped runs render identically afterward.

What if my CJK font is CID-keyed and won't parse?

If fontkit can't parse a CID-keyed Type 0 program, the tool leaves that font intact and re-saves the rest of the document. You won't lose the glyphs — at worst that one font isn't analysed for subsetting.

Can I subset CJK fonts in a PDF/A without breaking it?

The re-save preserves embedded fonts and the text layer, so a PDF/A's font requirement is respected. If you need to re-assert PDF/A tagging afterward, run the PDF to PDF/A converter and re-validate, since strict conformance is checked separately.

Why did my image-heavy Korean report barely shrink?

Because the bulk was images, not the font. Font analysis doesn't touch JPEG/PNG XObjects. Run lossy compression to re-encode the charts and figures, or lossless if you must keep everything selectable.

Is my confidential CJK document uploaded?

No. pdf-lib, pdf.js, and fontkit run in your browser tab and the document never leaves your device, which makes the tool appropriate for confidential business and legal text in any language.

Does it change the appearance of the characters?

No. The glyphs that appear in the document are preserved exactly, and the content streams are kept, so every character renders the same. Only the file's internal structure is repacked.

Can I process multiple CJK files at once?

Not on the free tier — one file per run. Pro allows 5 PDFs per batch. For a folder of CJK documents, Pro is the practical choice; otherwise process them one at a time.

Should I run this before or after compressing the images?

Order doesn't strictly matter, but for a CJK PDF with both heavy fonts and images, run a compressor for the images (lossy for size targets, lossless to keep text) and this tool for the structure. Running both is safe; the second pass simply finds less to do.

Privacy first

All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.

