Split PDF Text into Semantic-Style Chunks

How to split PDF text into sentence-aware chunks for AI

Step 1
Open the chunker and load the PDF — Drop the document into the PDF to Text Chunks tool. It extracts the embedded text layer with pdf.js in your browser; no upload happens.
Step 2
Decide your granularity — Pick Max words per chunk based on topic density. Dense, fact-rich documents (specs, policies) reward smaller chunks (150-300); narrative or explanatory text tolerates larger ones (400-600). The default is 500.
Step 3
Set Max words per chunk — The single control accepts 50-2000 words. The chunker fills each chunk with whole sentences until adding the next would overshoot the target, then starts a new chunk with a 50-word overlap from the previous one.
Step 4
Generate the chunks — Run the tool. It returns a JSON array, one object per chunk, downloaded as <filename>.chunks.json.
Step 5
Inspect a few chunks for boundary quality — Open the JSON and read two or three chunks end-to-end. They should start and end on complete sentences. If a multi-column source produced interleaved text, the PDF needs reflow before chunking.
Step 6
Embed and index, keeping the metadata — Embed each text and store id, page, and pageRange with the vector so retrieval can surface and cite the right section.
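
The packing rule from Step 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual implementation: the function names `split_sentences` and `pack_sentences` are hypothetical, the sentence splitter is a naive regex on `. ! ?`, and the overlap is assumed to be the trailing words of the previous chunk (which may begin mid-sentence; the real tool may align it differently).

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after . ! ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pack_sentences(sentences: List[str], max_words: int = 500,
                   overlap: int = 50) -> List[Dict]:
    """Pack whole sentences up to max_words, seeding each new chunk
    with the last `overlap` words of the previous one (assumption)."""
    chunks: List[List[str]] = []
    current: List[str] = []  # words in the chunk being built
    fresh = 0                # words added since the overlap seed
    for sentence in sentences:
        words = sentence.split()
        # Flush only if the chunk already holds new content; a single
        # over-long sentence is never split and becomes its own chunk.
        if fresh and len(current) + len(words) > max_words:
            chunks.append(current)
            current = current[-overlap:] if overlap else []
            fresh = 0
        current.extend(words)
        fresh += len(words)
    if fresh:
        chunks.append(current)
    return [{"id": i, "text": " ".join(ws), "tokens": len(ws)}
            for i, ws in enumerate(chunks)]


sample = "One two three. Four five six! Seven eight nine? Ten eleven twelve."
demo = pack_sentences(split_sentences(sample), max_words=6, overlap=2)
for chunk in demo:
    print(chunk)
```

With a 6-word target and a 2-word overlap, the four sample sentences land in three chunks, and each later chunk opens with the last two words of the one before it.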

Fixed-size vs sentence-aware vs embedding-semantic

Where this tool sits. It is the middle row, and it is explicit about not being the third.

Approach	Split point	Coherence	This tool
Fixed-size	Every N characters/tokens	Cuts mid-sentence; fragments clauses	Not how this works
Sentence-aware (+overlap)	Sentence boundaries `. ! ?`, whole sentences packed to a word target, 50-word overlap	Each chunk is whole sentences; seams bridged	Yes — this is the implemented method
Embedding-semantic	Where an embedding model detects a topic shift	Highest, but costly and model-dependent	Not implemented — post-process if you need it

Output shape, field by field

Every chunk object the tool emits. Verified against the implementation.

Field	Type	Meaning
`id`	number	0-based positional index of the chunk in the array
`page`	number	The first page this chunk's text touches
`pageRange`	[number, number]	[start, end] page span the chunk covers
`text`	string	The chunk content — whole sentences joined by spaces
`tokens`	number	Whitespace word count of `text` (an estimate, not BPE tokens)
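
A quick sanity check of the downloaded JSON against the fields above might look like the following. It is a sketch, not an official validator: `validate_chunks` is a hypothetical helper, and the check that `page` equals the start of `pageRange` is an assumption inferred from the field descriptions.

```python
from typing import Dict, List

REQUIRED = {"id", "page", "pageRange", "text", "tokens"}


def validate_chunks(chunks: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    for pos, c in enumerate(chunks):
        missing = REQUIRED - set(c)
        if missing:
            problems.append(f"chunk {pos}: missing {sorted(missing)}")
            continue
        if c["id"] != pos:
            problems.append(f"chunk {pos}: id is {c['id']}, expected {pos}")
        start, end = c["pageRange"]
        # Assumption: `page` is the same value as the start of pageRange.
        if start > end or c["page"] != start:
            problems.append(f"chunk {pos}: inconsistent page fields")
        # `tokens` is documented as a whitespace word count of `text`.
        if c["tokens"] != len(c["text"].split()):
            problems.append(f"chunk {pos}: tokens does not match word count")
    return problems


good = [
    {"id": 0, "page": 1, "pageRange": [1, 1],
     "text": "Alpha beta gamma.", "tokens": 3},
    {"id": 1, "page": 1, "pageRange": [1, 2],
     "text": "Gamma. Delta epsilon.", "tokens": 3},
]
print(validate_chunks(good))
```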

Cookbook

Recipes for getting coherent, sentence-aligned chunks out of real documents.

Sentence-aligned chunks from a policy document

A 30-page policy with dense paragraphs. Default 500-word chunks keep each section's sentences together.

Max words per chunk: 500

[
  { "id": 0, "page": 1, "pageRange": [1, 1],
    "text": "This policy governs the use of ...", "tokens": 491 },
  { "id": 1, "page": 1, "pageRange": [1, 2],
    "text": "... access controls. Employees must ...", "tokens": 503 }
]
# Note id 1 starts with the overlap tail of id 0.
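
To confirm the overlap in your own output, you can measure how many leading words of one chunk repeat the tail of the previous one. The helper below is a hypothetical sketch that assumes the overlap is an exact word-for-word repeat; if the tool normalises whitespace or punctuation at the seam, the count may come out lower.

```python
def shared_overlap(prev_text: str, next_text: str, limit: int = 200) -> int:
    """Largest k such that the last k words of prev_text equal
    the first k words of next_text (0 if there is no overlap)."""
    a, b = prev_text.split(), next_text.split()
    for k in range(min(limit, len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


print(shared_overlap("a b c d e", "d e f g"))
```

Run it over consecutive pairs such as `chunks[i]["text"]` and `chunks[i + 1]["text"]`; in the browser tool you would expect values near 50.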

Why fixed-character splitting fails (and this doesn't)

Side-by-side of a sentence cut by a naive 200-char splitter vs the sentence-aware result.

Fixed 200-char split:
  chunk A: "...the renewal fee is due on the"
  chunk B: "first business day of each quarter."
  -> neither chunk answers 'when is the fee due?'

Sentence-aware split:
  chunk A: "...the renewal fee is due on the first
            business day of each quarter."
  -> the whole fact lives in one chunk.
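
The same contrast can be reproduced on a short string. This sketch uses a 60-character window instead of 200 so the effect shows on three sentences; the sample text and both helper functions are illustrative, and the regex splitter is a stand-in for the tool's sentence detection.

```python
import re

text = ("Accounts are reviewed annually. The renewal fee is due on the first "
        "business day of each quarter. Late payments incur a surcharge.")


def fixed_split(s: str, size: int) -> list:
    """Cut every `size` characters, ignoring sentence boundaries."""
    return [s[i:i + size] for i in range(0, len(s), size)]


def sentence_split(s: str) -> list:
    """Cut after . ! ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", s)


fixed = fixed_split(text, 60)
sentences = sentence_split(text)
fact = "due on the first business day of each quarter"

print("fixed split keeps the fact whole:   ", any(fact in p for p in fixed))
print("sentence split keeps the fact whole:", any(fact in p for p in sentences))
```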

Tighter chunks for fact-dense specs

Technical specifications pack many distinct facts per page; smaller chunks isolate each one for precise retrieval.

Max words per chunk: 200

A 12-page spec sheet -> ~25-35 chunks,
each one or two pages, each a focused cluster
of related sentences -> a query for one parameter
retrieves a chunk about that parameter, not the page.

Adding metadata your retriever can filter on

Augment each chunk with your own fields after generation — the tool's JSON is easy to enrich.

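import json
with open('security-policy-v4.chunks.json') as f:  # example output filename
    chunks = json.load(f)

for c in chunks:
    c['doc'] = 'security-policy-v4'
    # guess_section is your own helper mapping a page number to a section name
    c['section'] = guess_section(c['page'])
# now you can filter retrieval by doc/section
# while still citing pageRange in the answer.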

Reproducible runs via the API

For pipelines that must reproduce the same chunking, drive the tool through the runner with explicit options.

GET /api/v1/tools/pdf-to-chunks   # schema
POST 127.0.0.1:9789/v1/tools/pdf-to-chunks/run
{ "options": { "maxChunkSize": 350, "overlap": 60 } }
# deterministic for a given PDF + options

Edge cases and what actually happens

Calling it 'semantic' oversells it

By design

This is sentence-aware splitting with overlap, not embedding-based topic detection. It will not place a boundary precisely where the subject changes mid-paragraph. For most retrieval workloads that gap is invisible; if you genuinely need topic segmentation, run an embedding pass over these chunks afterward.

Text with no sentence punctuation won't split well

Source-dependent

Bullet lists, code blocks, and tables often lack . ! ?, so the splitter treats long stretches as one giant 'sentence' and packs them into one over-target chunk. Pre-extract structured content with PDF Table to JSON or accept larger list-heavy chunks.
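
One way to spot these over-target chunks after a run is to compare each chunk's `tokens` with the target you set. The helper below is a hypothetical sketch; the 20% slack is an arbitrary assumption so that chunks only slightly above the target are not flagged.

```python
from typing import Dict, List


def flag_oversized(chunks: List[Dict], max_words: int,
                   slack: float = 1.2) -> List[int]:
    """Return ids of chunks whose word count exceeds max_words * slack."""
    limit = max_words * slack
    return [c["id"] for c in chunks if c["tokens"] > limit]


sample = [
    {"id": 0, "tokens": 491},
    {"id": 1, "tokens": 503},
    {"id": 2, "tokens": 1340},  # e.g. a long table with no . ! ?
]
print(flag_oversized(sample, max_words=500))
```

Any id it returns is worth opening by hand to see whether the source was a list, table, or code block.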

Abbreviations create false sentence breaks

Minor noise

A period after 'Inc.', 'e.g.', or 'No. 5' can trigger an early split. The overlap softens the impact, and most embedding models tolerate it. If it matters for a specific corpus, normalise abbreviations before chunking.

Multi-column layout interleaves columns

Reading-order risk

Extraction follows pdf.js content order, not visual reading order. Two-column pages may zigzag between columns, breaking sentence continuity. Inspect a chunk from any multi-column document and reflow the PDF if the order is wrong.

Image-only PDF yields nothing to split

No text layer

There is no OCR. A scanned document has no extractable sentences, so the chunk array is empty. OCR it with PDF OCR first.

Overlap is fixed at 50 in the browser

Limited control

The web UI exposes only Max words per chunk; the 50-word overlap is not adjustable there. To tune overlap, drive the tool via the API/runner where overlap is a real option (clamped to half the chunk size).
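
The clamp can be mirrored client-side so your pipeline config records the overlap that will actually apply. This is a sketch under assumptions: the text only says the overlap is clamped to half the chunk size, so the floor division and the lower bound of 0 are guesses, and `clamp_overlap` is a hypothetical name.

```python
def clamp_overlap(max_chunk_size: int, overlap: int) -> int:
    """Clamp overlap to the range 0 .. max_chunk_size // 2 (assumed rule)."""
    return max(0, min(overlap, max_chunk_size // 2))


# Option names taken from the API example earlier on this page.
options = {"maxChunkSize": 350, "overlap": clamp_overlap(350, 60)}
print(options)
print(clamp_overlap(100, 80))
```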

tokens overstates how much fits in a model

Expected

tokens is a whitespace word count, not a tokenizer count. For English text under a typical BPE tokenizer, a chunk reporting 500 'tokens' is roughly 650 model tokens. When packing chunks into a prompt budget, convert with a ~1.3x factor and leave margin.
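
The conversion is simple arithmetic, sketched below. The 1.3 factor comes from the paragraph above; the 10% safety margin and both function names are assumptions for illustration, and a real tokenizer count will vary by model and language.

```python
from typing import Dict, List

WORDS_TO_TOKENS = 1.3  # rough factor from the text above


def estimate_model_tokens(word_count: int,
                          factor: float = WORDS_TO_TOKENS) -> int:
    """Convert the tool's word count into an approximate token count."""
    return round(word_count * factor)


def chunks_that_fit(chunks: List[Dict], budget_tokens: int,
                    margin: float = 0.1) -> List[Dict]:
    """Take chunks in order until the estimated total would exceed
    the budget minus a safety margin."""
    usable = budget_tokens * (1 - margin)
    picked, used = [], 0
    for c in chunks:
        cost = estimate_model_tokens(c["tokens"])
        if used + cost > usable:
            break
        picked.append(c)
        used += cost
    return picked


print(estimate_model_tokens(500))
```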

Page count exceeds the tier limit

413 limit

Free allows 50 pages, Pro 500, Pro+Media 2,000, and Developer 10,000. A long manual usually hits the page ceiling before the file-size limit. Split it with PDF Split (Fixed) and chunk each part, or upgrade the plan.

Frequently asked questions

How is semantic chunking different from fixed-size chunking?

Fixed-size chunking cuts at a character or token count regardless of where you are in a sentence, producing fragments. This tool splits at sentence boundaries and packs whole sentences up to a word target, so each chunk is a coherent block, and it carries a 50-word overlap so meaning at the seams isn't lost. It is sentence-aware rather than embedding-semantic; sentence-aware splitting is what most 'semantic chunking' searches are really after.

Is this real embedding-based semantic chunking?

No. It does not run an embedding model to detect topic shifts. It respects sentence boundaries and overlaps the seams, which gives most of the retrieval benefit without the compute cost or model-dependent boundaries of an embedding pass. If you need true topic-boundary detection, post-process these sentence-aware chunks with your own embedding model.

What does each chunk contain?

A JSON object with id (position), page (first page touched), pageRange ([start, end]), text (whole sentences joined by spaces), and tokens (a whitespace word count). The page fields make source citation straightforward.

Which embedding models pair well with these chunks?

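Most standard ones. OpenAI text-embedding-3, Cohere embed-v3, and HuggingFace sentence-transformers all produce stronger embeddings from coherent, sentence-aligned text than from mid-sentence fragments. Keep chunks within the model's input limit: the default 500 words (roughly 650 model tokens) fits easily in OpenAI text-embedding-3's 8,191-token input, but Cohere embed-v3 and many sentence-transformers models accept 512 tokens or fewer per input, so lower Max words per chunk for those and check the model's documented limit.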

Why is one chunk much larger than the rest?

Because a single sentence (or a punctuation-free stretch like a list or table) exceeded the target, and the tool never splits within a sentence. The long unit becomes its own chunk. Extract tables and lists separately if this happens often.

Can I change the overlap?

Not in the browser — it's fixed at 50 words there. The API/runner exposes an overlap option, clamped to at most half the chunk size. Call GET /api/v1/tools/pdf-to-chunks for the schema.

Does it handle scanned PDFs?

No OCR is performed, so a scan has no text to split. Run it through PDF OCR to add a text layer first, then chunk the OCR'd file.

Should I store the page number with each chunk?

Yes — keep page and pageRange as metadata on every vector. When a chunk is retrieved, your LLM answer can cite the exact page, and you can render a deep link into the source PDF.
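
Turning the stored metadata into a citation and a link could look like this. It is a sketch: `cite` and `deep_link` are hypothetical helpers, the URL is a placeholder, and the `#page=N` fragment is a PDF open parameter that most browser viewers honour but that is not guaranteed everywhere.

```python
from typing import Dict


def cite(chunk: Dict, doc_name: str) -> str:
    """Format a page citation from the chunk's pageRange."""
    start, end = chunk["pageRange"]
    pages = f"p. {start}" if start == end else f"pp. {start}-{end}"
    return f"{doc_name}, {pages}"


def deep_link(chunk: Dict, pdf_url: str) -> str:
    """Link to the first page the chunk touches."""
    return f"{pdf_url}#page={chunk['page']}"


hit = {"id": 1, "page": 1, "pageRange": [1, 2]}
print(cite(hit, "security-policy-v4"))
print(deep_link(hit, "https://example.com/security-policy-v4.pdf"))
```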

Is the document uploaded anywhere?

No. Text extraction and chunking happen in your browser with pdf.js. Nothing is sent to a server beyond an anonymous usage counter when signed in.

What about abbreviations breaking sentences early?

A period in 'Inc.' or 'e.g.' can cause an early boundary. The 50-word overlap absorbs most of the impact, and embedding retrieval is robust to it. Normalise abbreviations beforehand only if a specific corpus is unusually sensitive.

How do I make runs reproducible?

For a fixed PDF and fixed options the output is deterministic. Drive the tool through the runner with explicit maxChunkSize and overlap to lock the parameters into your pipeline config rather than clicking the UI each time.

How large a PDF can it process?

Free tier: 2 MB / 50 pages. Pro: 50 MB / 500 pages. Pro+Media: 500 MB / 2,000 pages. Developer: 2 GB / 10,000 pages. Split larger documents with PDF Split (Fixed) before chunking.
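
A pre-flight check against these limits can save a failed run. The table below copies the figures from the answer above; treating MB and GB as binary units is an assumption, and `smallest_tier` is a hypothetical helper rather than part of the tool.

```python
from typing import Optional

MB = 1024 * 1024  # assumption: binary megabytes; the page does not say

# (tier, max bytes, max pages) as listed above, smallest first
TIERS = [
    ("Free", 2 * MB, 50),
    ("Pro", 50 * MB, 500),
    ("Pro+Media", 500 * MB, 2000),
    ("Developer", 2048 * MB, 10000),
]


def smallest_tier(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first tier that fits both limits, or None if none does."""
    for name, max_bytes, max_pages in TIERS:
        if size_bytes <= max_bytes and pages <= max_pages:
            return name
    return None


print(smallest_tier(1 * MB, 120))
```

A `None` result means the document needs splitting before chunking, whatever the plan.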

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.

