How to split PDF text into sentence-aware chunks for AI
- Step 1: Open the chunker and load the PDF. Drop the document into the PDF to Text Chunks tool. It extracts the embedded text layer with pdf.js in your browser; no upload happens.
- Step 2: Decide your granularity. Pick Max words per chunk based on topic density. Dense, fact-rich documents (specs, policies) reward smaller chunks (150-300); narrative or explanatory text tolerates larger ones (400-600). The default is 500.
- Step 3: Set Max words per chunk. The single control accepts 50-2000 words. The chunker fills each chunk with whole sentences until adding the next would overshoot the target, then starts a new chunk with a 50-word overlap from the previous one.
- Step 4: Generate the chunks. Run the tool. It returns a JSON array, one object per chunk, downloaded as <filename>.chunks.json.
- Step 5: Inspect a few chunks for boundary quality. Open the JSON and read two or three chunks end-to-end. They should start and end on complete sentences. If a multi-column source produced interleaved text, the PDF needs reflow before chunking.
- Step 6: Embed and index, keeping the metadata. Embed each text and store id, page, and pageRange with the vector so retrieval can surface and cite the right section.
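The packing rule described in Step 3 can be sketched roughly as follows. This is an illustrative reconstruction, not the tool's source: the function names, the regex used to find sentence ends, and the choice to cut the overlap at the word level are all assumptions.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive split after . ! ? followed by whitespace (an assumed rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_chunks(text: str, max_words: int = 500, overlap_words: int = 50) -> list[str]:
    """Pack whole sentences up to max_words, seeding each new chunk with the
    last overlap_words words of the previous one."""
    chunks: list[str] = []
    current: list[str] = []  # pieces of the chunk being built
    count = 0                # words currently held in `current`
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            finished = " ".join(current)
            chunks.append(finished)
            # Assumption: the overlap is the trailing words of the finished chunk.
            tail = finished.split()[-overlap_words:] if overlap_words > 0 else []
            current = [" ".join(tail)] if tail else []
            count = len(tail)
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
demo_chunks = pack_chunks(demo, max_words=6, overlap_words=2)
```

A sentence longer than max_words is never cut in this sketch; it simply lands in an over-target chunk, which mirrors the behaviour described under the edge cases.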
Fixed-size vs sentence-aware vs embedding-semantic
Where this tool sits. It is the middle row, and it is explicit about not being the third.
| Approach | Split point | Coherence | This tool |
|---|---|---|---|
| Fixed-size | Every N characters/tokens | Cuts mid-sentence; fragments clauses | Not how this works |
| Sentence-aware (+overlap) | Sentence boundaries . ! ?, whole sentences packed to a word target, 50-word overlap | Each chunk is whole sentences; seams bridged | Yes — this is the implemented method |
| Embedding-semantic | Where an embedding model detects a topic shift | Highest, but costly and model-dependent | Not implemented — post-process if you need it |
Output shape, field by field
Every chunk object the tool emits. Verified against the implementation.
| Field | Type | Meaning |
|---|---|---|
| id | number | 0-based positional index of the chunk in the array |
| page | number | The first page this chunk's text touches |
| pageRange | [number, number] | [start, end] page span the chunk covers |
| text | string | The chunk content: whole sentences joined by spaces |
| tokens | number | Whitespace word count of text (an estimate, not BPE tokens) |
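For pipelines that consume the file, the table above might translate into a typed loader along these lines. The class and function names are hypothetical, and the consistency checks (page equals the start of pageRange, tokens equals a Python whitespace split) are inferences from the field descriptions rather than documented guarantees, so loosen them if they reject real output.

```python
import json
from typing import TypedDict


class Chunk(TypedDict):
    id: int               # 0-based position in the array
    page: int             # first page the text touches
    pageRange: list[int]  # [start, end]
    text: str
    tokens: int           # whitespace word count, not BPE tokens


def load_chunks(raw_json: str) -> list[Chunk]:
    """Parse the chunk array and check the fields agree with each other."""
    chunks: list[Chunk] = json.loads(raw_json)
    for position, chunk in enumerate(chunks):
        start, end = chunk["pageRange"]
        if chunk["id"] != position:
            raise ValueError(f"chunk {position}: id is {chunk['id']}")
        if start > end or chunk["page"] != start:
            raise ValueError(f"chunk {position}: inconsistent page fields")
        if chunk["tokens"] != len(chunk["text"].split()):
            raise ValueError(f"chunk {position}: tokens is not the word count")
    return chunks


sample = (
    '[{"id": 0, "page": 1, "pageRange": [1, 2], '
    '"text": "Alpha beta. Gamma.", "tokens": 3}]'
)
loaded = load_chunks(sample)
```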
Cookbook
Recipes for getting coherent, sentence-aligned chunks out of real documents.
Sentence-aligned chunks from a policy document
A 30-page policy with dense paragraphs. Default 500-word chunks keep each section's sentences together.
Max words per chunk: 500
[
{ "id": 0, "page": 1, "pageRange": [1, 1],
"text": "This policy governs the use of ...", "tokens": 491 },
{ "id": 1, "page": 1, "pageRange": [1, 2],
"text": "... access controls. Employees must ...", "tokens": 503 }
]
# Note: id 1 starts with the overlap tail of id 0.
Why fixed-character splitting fails (and this doesn't)
Side-by-side of a sentence cut by a naive 200-char splitter vs the sentence-aware result.
Fixed 200-char split:
chunk A: "...the renewal fee is due on the"
chunk B: "first business day of each quarter."
-> neither chunk answers 'when is the fee due?'
Sentence-aware split:
chunk A: "...the renewal fee is due on the first
business day of each quarter."
-> the whole fact lives in one chunk.
Tighter chunks for fact-dense specs
Technical specifications pack many distinct facts per page; smaller chunks isolate each one for precise retrieval.
Max words per chunk: 200
A 12-page spec sheet -> ~25-35 chunks, each spanning one or two pages,
each a focused cluster of related sentences
-> a query for one parameter retrieves a chunk about that parameter, not the page.
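A back-of-the-envelope way to sanity-check a chunk count like the one above: after the first chunk, each additional chunk contributes roughly max words minus overlap new words. The sketch below assumes about 400 words per page, which is a guess and not a figure from the tool, and it ignores the slack left when a sentence does not fit, so real counts tend to run a little higher.

```python
import math


def estimate_chunk_count(total_words: int, max_words: int = 200,
                         overlap_words: int = 50) -> int:
    """Rough chunk count, assuming every chunk is filled to the target."""
    if total_words <= 0:
        return 0
    if total_words <= max_words:
        return 1
    stride = max_words - overlap_words  # new words added per extra chunk
    if stride <= 0:
        raise ValueError("overlap_words must be smaller than max_words")
    return 1 + math.ceil((total_words - max_words) / stride)


# 12 pages at an assumed 400 words per page
spec_estimate = estimate_chunk_count(12 * 400)
```

With those assumptions the estimate comes out at 32, inside the 25-35 range quoted above.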
Adding metadata your retriever can filter on
Augment each chunk with your own fields after generation — the tool's JSON is easy to enrich.
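# chunks: the list loaded from <filename>.chunks.json
# guess_section: your own helper that maps a page number to a section name
for c in chunks:
    c['doc'] = 'security-policy-v4'
    c['section'] = guess_section(c['page'])
# now you can filter retrieval by doc/section
# while still citing pageRange in the answer.
Reproducible runs via the API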
For pipelines that must reproduce the same chunking, drive the tool through the runner with explicit options.
GET /api/v1/tools/pdf-to-chunks # schema
POST 127.0.0.1:9789/v1/tools/pdf-to-chunks/run
{ "options": { "maxChunkSize": 350, "overlap": 60 } }
# deterministic for a given PDF + options
Edge cases and what actually happens
Calling it 'semantic' oversells it
By design: This is sentence-aware splitting with overlap, not embedding-based topic detection. It will not place a boundary precisely where the subject changes mid-paragraph. For most retrieval workloads that gap is invisible; if you genuinely need topic segmentation, run an embedding pass over these chunks afterward.
Text with no sentence punctuation won't split well
Source-dependent: Bullet lists, code blocks, and tables often lack . ! ?, so the splitter treats long stretches as one giant 'sentence' and packs them into one over-target chunk. Pre-extract structured content with PDF Table to JSON or accept larger list-heavy chunks.
Abbreviations create false sentence breaks
Minor noise: A period after 'Inc.', 'e.g.', or 'No. 5' can trigger an early split. The overlap softens the impact, and most embedding models tolerate it. If it matters for a specific corpus, normalise abbreviations before chunking.
Multi-column layout interleaves columns
Reading-order risk: Extraction follows pdf.js content order, not visual reading order. Two-column pages may zigzag between columns, breaking sentence continuity. Inspect a chunk from any multi-column document and reflow the PDF if the order is wrong.
Image-only PDF yields nothing to split
No text layer: There is no OCR. A scanned document has no extractable sentences, so the chunk array is empty. OCR it with PDF OCR first.
Overlap is fixed at 50 in the browser
Limited control: The web UI exposes only Max words per chunk; the 50-word overlap is not adjustable there. To tune overlap, drive the tool via the API/runner where overlap is a real option (clamped to half the chunk size).
tokens overstates how much fits in a model
Expected: tokens is a word count. A chunk reporting 500 'tokens' is roughly 650 model tokens. When packing chunks into a prompt budget, convert with a ~1.3x factor and leave margin.
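The conversion can be sketched as a small budgeting helper. The function names and the 10% safety margin are illustrative choices; the 1.3x factor is the rule of thumb given above, not a measurement for any particular tokenizer.

```python
def estimated_model_tokens(word_count: int) -> int:
    """Apply the ~1.3x rule of thumb, rounding up (integer maths, no float drift)."""
    return (word_count * 13 + 9) // 10


def select_within_budget(chunks: list[dict], budget_tokens: int,
                         margin: float = 0.1) -> list[dict]:
    """Take chunks in ranked order until the estimated budget would be exceeded."""
    usable = budget_tokens * (1 - margin)  # keep headroom for estimate error
    picked: list[dict] = []
    used = 0
    for chunk in chunks:
        cost = estimated_model_tokens(chunk["tokens"])
        if used + cost > usable:
            break
        picked.append(chunk)
        used += cost
    return picked


ranked = [{"id": 3, "tokens": 500}, {"id": 7, "tokens": 500}, {"id": 1, "tokens": 500}]
fits = select_within_budget(ranked, budget_tokens=1500)
```

Here a 1,500-token budget holds two 500-word chunks rather than three, because each one is costed at about 650 model tokens. For exact counts, run the target model's own tokenizer over text instead.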
Page count exceeds the tier limit
413 limit: Free allows 50 pages, Pro 500, Pro+Media 2,000. A long manual hits the page ceiling first. Split it with PDF Split (Fixed) and chunk each part, or upgrade the plan.
Frequently asked questions
How is semantic chunking different from fixed-size chunking?
Fixed-size chunking cuts at a character or token count regardless of where you are in a sentence, producing fragments. This tool splits at sentence boundaries and packs whole sentences up to a word target, so each chunk is a coherent block — and it carries a 50-word overlap so meaning at the seams isn't lost. It is sentence-aware rather than embedding-semantic, which is what most 'semantic chunking' searches really want.
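The difference is easy to reproduce on a toy passage. The snippet below is only a demonstration with a made-up text and a naive regex sentence splitter; it is not how the tool itself is implemented.

```python
import re

text = (
    "The licence renews annually. The renewal fee is due on the first "
    "business day of each quarter. Late payments incur a surcharge."
)


def fixed_split(value: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring sentence structure."""
    return [value[i:i + size] for i in range(0, len(value), size)]


def sentence_split(value: str) -> list[str]:
    """Cut only after . ! ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", value.strip())


fact = "due on the first business day of each quarter"
fixed = fixed_split(text, 60)
sentences = sentence_split(text)

fact_survives_fixed = any(fact in piece for piece in fixed)
fact_survives_sentences = any(fact in piece for piece in sentences)
```

The 60-character cut lands in the middle of the fee sentence, so no fixed-size piece contains the whole fact, while the sentence-aligned split keeps it in one piece.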
Is this real embedding-based semantic chunking?
No. It does not run an embedding model to detect topic shifts. It respects sentence boundaries and overlaps the seams, which gives most of the retrieval benefit at none of the cost or unpredictability. If you need true topic-boundary detection, post-process these sentence-aware chunks with your own embedding model.
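One possible shape for that post-processing step, assuming you already have one embedding vector per chunk from a model of your choice: compare neighbouring chunks and treat a sharp drop in cosine similarity as a topic shift. The function names and the 0.5 threshold are illustrative and would need tuning per corpus.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_boundaries(embeddings: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Return each index i where chunk i looks unlike chunk i - 1."""
    return [
        i
        for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < threshold
    ]


# Toy 2-d vectors standing in for real embeddings
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
shifts = topic_boundaries(toy)
```

Because adjacent chunks share a 50-word overlap, their similarity is inflated slightly; a threshold tuned on non-overlapping text may be too lenient here.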
What does each chunk contain?
A JSON object with id (position), page (first page touched), pageRange ([start, end]), text (whole sentences joined by spaces), and tokens (a whitespace word count). The page fields make source citation straightforward.
Which embedding models pair well with these chunks?
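All standard ones: OpenAI text-embedding-3, Cohere embed-v3, and HuggingFace sentence-transformers all produce stronger embeddings from coherent, sentence-aligned text than from mid-sentence fragments. Keep chunks within the model's input limit. The default 500 words is roughly 650 model tokens, which fits long-input models such as OpenAI text-embedding-3 but exceeds models limited to about 512 tokens or fewer, including many sentence-transformers checkpoints; for those, lower Max words per chunk so the text is not truncated.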
Why is one chunk much larger than the rest?
Because a single sentence (or a punctuation-free stretch like a list or table) exceeded the target, and the tool never splits within a sentence. The long unit becomes its own chunk. Extract tables and lists separately if this happens often.
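A quick way to see how often this happens in a given file is to flag chunks whose tokens value is well past the target. The helper name and the slack allowance are assumptions; some slack is sensible because the cookbook sample itself reports a 503-word chunk against a 500-word target.

```python
def find_oversized(chunks: list[dict], max_words: int = 500,
                   slack_words: int = 50) -> list[int]:
    """Return ids of chunks whose word count exceeds the target plus slack."""
    return [c["id"] for c in chunks if c["tokens"] > max_words + slack_words]


report = find_oversized([
    {"id": 0, "tokens": 491},
    {"id": 1, "tokens": 503},   # slightly over target, within slack
    {"id": 2, "tokens": 1240},  # likely a table or list with no sentence punctuation
])
```

If more than a handful of ids come back, the source probably contains tables or lists that are better extracted separately.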
Can I change the overlap?
Not in the browser — it's fixed at 50 words there. The API/runner exposes an overlap option, clamped to at most half the chunk size. Call GET /api/v1/tools/pdf-to-chunks for the schema.
Does it handle scanned PDFs?
No OCR is performed, so a scan has no text to split. Run it through PDF OCR to add a text layer first, then chunk the OCR'd file.
Should I store the page number with each chunk?
Yes — keep page and pageRange as metadata on every vector. When a chunk is retrieved, your LLM answer can cite the exact page, and you can render a deep link into the source PDF.
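As a sketch of what that looks like in practice, the helper below builds one record per chunk with the page fields kept alongside the vector, plus a small citation formatter. The record layout, the function names, and the embed callable are placeholders for whatever your vector store and embedding client expect.

```python
from typing import Callable, Sequence


def build_records(chunks: list[dict], embed: Callable[[str], Sequence[float]],
                  doc: str) -> list[dict]:
    """Pair each chunk's vector with the metadata needed for citation."""
    return [
        {
            "id": f"{doc}:{c['id']}",
            "vector": list(embed(c["text"])),
            "metadata": {"doc": doc, "page": c["page"], "pageRange": c["pageRange"]},
        }
        for c in chunks
    ]


def cite(metadata: dict) -> str:
    """Format a page citation from the stored pageRange."""
    start, end = metadata["pageRange"]
    return f"p. {start}" if start == end else f"pp. {start}-{end}"


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call, used only to make the demo run."""
    return [float(len(text))]


records = build_records(
    [
        {"id": 0, "page": 1, "pageRange": [1, 1], "text": "First chunk.", "tokens": 2},
        {"id": 1, "page": 1, "pageRange": [1, 2], "text": "Second chunk.", "tokens": 2},
    ],
    embed=fake_embed,
    doc="security-policy-v4",
)
citation = cite(records[1]["metadata"])
```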
Is the document uploaded anywhere?
No. Text extraction and chunking happen in your browser with pdf.js. Nothing is sent to a server beyond an anonymous usage counter when signed in.
What about abbreviations breaking sentences early?
A period in 'Inc.' or 'e.g.' can cause an early boundary. The 50-word overlap absorbs most of the impact, and embedding retrieval is robust to it. Normalise abbreviations beforehand only if a specific corpus is unusually sensitive.
How do I make runs reproducible?
For a fixed PDF and fixed options the output is deterministic. Drive the tool through the runner with explicit maxChunkSize and overlap to lock the parameters into your pipeline config rather than clicking the UI each time.
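One way to lock the parameters is to generate the request body from code and commit it with the pipeline. The sketch below builds only the options payload; how the PDF itself is attached and how the request is sent are not described here, so those parts are left out. The helper name is hypothetical, and rounding the half-size clamp down is an assumption.

```python
import json


def build_run_body(max_chunk_size: int, overlap: int) -> str:
    """Serialise the options deterministically, mirroring the overlap clamp locally."""
    # Assumption: "half the chunk size" rounds down.
    clamped = max(0, min(overlap, max_chunk_size // 2))
    body = {"options": {"maxChunkSize": max_chunk_size, "overlap": clamped}}
    return json.dumps(body, sort_keys=True)


body = build_run_body(350, 60)
clamped_body = build_run_body(100, 80)
```

Clamping on the client as well means the stored config records the overlap that is actually applied, not just the value that was requested.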
How large a PDF can it process?
Free tier: 2 MB / 50 pages. Pro: 50 MB / 500 pages. Pro+Media: 500 MB / 2,000 pages. Developer: 2 GB / 10,000 pages. Split larger documents with PDF Split (Fixed) before chunking.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.