How to chunk a PDF document for a RAG pipeline
- Step 1: Open the PDF chunker and drop your document. Load the PDF into the PDF to Text Chunks tool. Extraction and chunking run in your browser via pdf.js, so nothing is uploaded. The tool reads the document's embedded text layer; a scanned, image-only PDF has no text to chunk (see Step 3).
- Step 2: Set Max words per chunk for your embedding model. The single control is Max words per chunk (50–2000, default 500). For OpenAI text-embedding-3-small/-large or Cohere embed-v3, 200–500 words is a sound default: it stays comfortably inside the model's input limit while keeping each chunk focused enough for precise retrieval.
- Step 3: Confirm the document has a real text layer. If the source is a scan, run it through PDF OCR first to add a searchable text layer, then bring the OCR'd PDF back here. The chunker does not OCR; it only sees text that pdf.js can extract.
- Step 4: Generate and download the chunks JSON. The tool produces a JSON array and downloads it as <filename>.chunks.json. Each element is { id, page, pageRange, text, tokens }, where tokens is a whitespace word count, not a model-tokenizer count.
- Step 5: Embed each chunk's text. Iterate the array, send each text to your embedding API in batches, and keep id, page, and pageRange as metadata on the resulting vector. Re-embed the whole document if you re-chunk with a different size, because id values are positional and not stable across runs.
- Step 6: Upsert into your vector store and test retrieval. Upsert the vectors into Pinecone, Weaviate, Chroma, Qdrant, or pgvector, then run a few known questions and inspect which page/pageRange values come back. If retrieved chunks cut off before the full answer, raise Max words per chunk; if hits mix in unrelated material, lower it.
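Steps 5 and 6 can be sketched as a small batching helper. This is a minimal sketch, not the tool's code: the batched and to_records helpers, the batch size of 2, and the "paper" id prefix are illustrative assumptions, and the embedding call is left out so the snippet runs offline.

```python
import json
from typing import Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def to_records(chunks: list[dict], doc_id: str) -> list[dict]:
    """Pair each chunk's text with the metadata worth keeping on its vector."""
    return [
        {
            "vector_id": f"{doc_id}-{c['id']}",  # hypothetical id scheme
            "text": c["text"],
            "metadata": {"page": c["page"], "pageRange": c["pageRange"]},
        }
        for c in chunks
    ]


# Inline sample in the tool's output shape; in practice, json.load the .chunks.json file.
sample = json.loads("""[
  {"id": 0, "page": 1, "pageRange": [1, 2], "text": "First chunk.", "tokens": 2},
  {"id": 1, "page": 2, "pageRange": [2, 2], "text": "Second chunk.", "tokens": 2},
  {"id": 2, "page": 3, "pageRange": [3, 4], "text": "Third chunk.", "tokens": 2}
]""")

records = to_records(sample, doc_id="paper")
batches = list(batched(records, size=2))  # each batch would go to one embedding request
```

Each batch carries the texts for one embedding request plus the metadata to attach to the returned vectors, in the same order.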
What the chunker actually does (and doesn't)
Verified against the tool's implementation. The browser UI exposes Max words per chunk only; overlap is fixed at 50 unless you call the API.
| Aspect | Behaviour | Notes for RAG |
|---|---|---|
| Split unit | Sentence boundaries — regex (?<=[.!?])\s+ | Sentences are never cut mid-way; a single sentence longer than the target becomes its own oversized chunk |
| Chunk size control | Max words per chunk — 50 to 2000, default 500 | It is a word target, not a model-token target; multiply by ~1.3 for a rough BPE-token estimate |
| Overlap | Fixed 50 words in the browser; configurable via API (overlap) | Overlap is taken from the trailing sentences of the previous chunk, clamped to at most half the target |
| tokens field | Whitespace-separated word count of the chunk text | Use it for budgeting, but don't treat it as exact GPT/Claude tokens |
| Metadata | id (0-based, positional), page, pageRange [start,end] | page is the first page the chunk touches; pageRange spans every page it covers |
| OCR | None — embedded text layer only | Scanned PDFs must go through PDF OCR first or they yield zero chunks |
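The behaviour in the table can be approximated in a few lines. This is a reconstruction from the description above, not the tool's source: the function names are hypothetical, page tracking is omitted, and how the real chunker treats an overlap tail that sits in front of an oversized sentence is an assumption.

```python
import re

# Sentence boundary taken from the table above: whitespace that follows . ! or ?
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def word_count(text: str) -> int:
    """Whitespace-separated word count, matching how the tokens field is described."""
    return len(text.split())


def trailing_overlap(sentences: list[str], overlap: int) -> list[str]:
    """Return the longest run of trailing sentences totalling at most `overlap` words."""
    tail: list[str] = []
    total = 0
    for sentence in reversed(sentences):
        n = word_count(sentence)
        if total + n > overlap:
            break
        tail.insert(0, sentence)
        total += n
    return tail


def chunk_sentences(text: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    """Group whole sentences into chunks of roughly `max_words` words."""
    overlap = min(overlap, max_words // 2)  # clamp to half the target
    sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
    chunks: list[str] = []
    buffer: list[str] = []
    used = 0           # words currently in the buffer
    new_sentences = 0  # sentences added since the last flush (excludes carried overlap)
    for sentence in sentences:
        n = word_count(sentence)
        # Flush before a sentence that would push the buffer past the target.
        if new_sentences and used + n > max_words:
            chunks.append(" ".join(buffer))
            buffer = trailing_overlap(buffer, overlap)
            used = sum(word_count(s) for s in buffer)
            new_sentences = 0
        buffer.append(sentence)  # sentences are never split, even when oversized
        used += n
        new_sentences += 1
    if new_sentences:
        chunks.append(" ".join(buffer))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
chunks = chunk_sentences(text, max_words=6, overlap=3)
```

With a 6-word target and 3-word overlap, each chunk holds two sentences and the second sentence of one chunk reappears as the first sentence of the next.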
File and page limits by plan
PDF-family limits from the platform's tier table. The chunker reads every page, so a large book can hit the page ceiling before the byte ceiling.
| Plan | Max file size | Max pages | Files per batch |
|---|---|---|---|
| Free | 2 MB | 50 pages | 1 |
| Pro | 50 MB | 500 pages | 5 |
| Pro + Media | 500 MB | 2,000 pages | 50 |
| Developer | 2 GB | 10,000 pages | unlimited |
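A pre-flight check against the tier table might look like the sketch below. The limits are copied from the table; the smallest_plan helper is hypothetical, and treating MB and GB as binary units (1024-based) is an assumption the table does not settle.

```python
from typing import Optional

MB = 1024 * 1024  # assumption: the table's "MB" is treated as 1024 * 1024 bytes
GB = 1024 * MB

# (plan, max bytes, max pages), ordered from the smallest tier to the largest
PLAN_LIMITS = [
    ("Free", 2 * MB, 50),
    ("Pro", 50 * MB, 500),
    ("Pro + Media", 500 * MB, 2_000),
    ("Developer", 2 * GB, 10_000),
]


def smallest_plan(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first plan whose byte and page ceilings both fit, or None."""
    for name, max_bytes, max_pages in PLAN_LIMITS:
        if size_bytes <= max_bytes and pages <= max_pages:
            return name
    return None


# A 1 MB file with 300 pages passes the Free byte limit but not its page limit.
plan = smallest_plan(1 * MB, 300)
```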
Cookbook
Concrete RAG chunking recipes. The JSON shown is the tool's real output shape — id, page, pageRange, text, tokens.
Default 500-word chunks for a research-paper corpus
The out-of-the-box setting. A 10-page paper yields a handful of chunks, each a run of whole sentences with a 50-word tail carried into the next.
Max words per chunk: 500 (overlap: 50, fixed in browser)
Output (first element of <paper>.chunks.json):
[
  {
    "id": 0,
    "page": 1,
    "pageRange": [1, 2],
    "text": "Retrieval-augmented generation combines a parametric ...",
    "tokens": 498
  },
  ...
]
Smaller chunks for high-precision Q&A retrieval
FAQ-style or policy lookups benefit from tighter chunks so a single retrieved chunk maps to a single answer. Drop Max words per chunk to ~200.
Max words per chunk: 200
A 10-page handbook -> ~12-18 chunks instead of ~5. Each chunk stays within one or two pages, so the retrieved pageRange points the user to a precise section.
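The chunk counts quoted here follow from simple arithmetic: each chunk after the first contributes roughly (target minus overlap) new words. The sketch below is an estimate only, assuming chunks fill to the target (sentence boundaries make that approximate) and a hypothetical 2,250-word handbook.

```python
import math


def estimate_chunk_count(total_words: int, max_words: int, overlap: int = 50) -> int:
    """Rough chunk count when each chunk re-uses `overlap` words of the previous one."""
    if total_words <= 0:
        return 0
    if total_words <= max_words:
        return 1
    overlap = min(overlap, max_words // 2)  # same clamp the tool describes
    stride = max_words - overlap            # new words contributed per chunk
    return math.ceil((total_words - overlap) / stride)


handbook_words = 2250  # hypothetical word count for a 10-page handbook
at_500 = estimate_chunk_count(handbook_words, max_words=500)
at_200 = estimate_chunk_count(handbook_words, max_words=200)
```

Under those assumptions the estimate comes out at 5 chunks for a 500-word target and 15 for a 200-word target, in line with the ranges above.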
Loop the chunks into an embedding + upsert pipeline
The JSON array drops straight into a batch-embed loop. Keep the chunker's metadata on each vector for citation.
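import json
import os
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Assumption: a Pinecone index named "papers" already exists (1536 dimensions).
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("papers")
with open("paper.chunks.json", encoding="utf-8") as f:
    chunks = json.load(f)

for c in chunks:
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=c["text"]
    ).data[0].embedding
    # Pinecone metadata takes strings, numbers, booleans, and string lists, so pageRange is split.
    index.upsert([(f"paper-{c['id']}", vec, {
        "page": c["page"], "pageStart": c["pageRange"][0],
        "pageEnd": c["pageRange"][1], "text": c["text"],
    })])
Automate it locally with the runner (overlap configurable here)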
The browser locks overlap at 50; the API exposes it. Pair the runner once and processing stays on your machine — the document never leaves your network.
# Discover the option schema
GET /api/v1/tools/pdf-to-chunks
# Run locally via the paired @jadapps/runner
POST 127.0.0.1:9789/v1/tools/pdf-to-chunks/run
{
"options": { "maxChunkSize": 400, "overlap": 80 }
}
# -> same {id,page,pageRange,text,tokens} JSON array
Citing the source page in a RAG answer
Because every chunk carries page provenance, your prompt can ask the LLM to cite it and you can render a deep link.
Retrieved chunk metadata: { page: 7, pageRange: [7, 8] }
Prompt scaffold:
"Answer using the context. Cite the page in [p.N] form."
Context: <chunk.text>
Model: "... renewal is automatic [p.7]."
Edge cases and what actually happens
Scanned / image-only PDF produces zero chunks
No text layer: The chunker reads only the embedded text layer via pdf.js. A scanned contract or photographed page has no extractable text, so it yields an empty or near-empty chunk array. Run it through PDF OCR first to add a text layer, then chunk the OCR'd output.
A single sentence is longer than Max words per chunk
By design: Sentences are never split mid-way. If one sentence exceeds your target (a long enumerated clause, a dense legal recital), it becomes its own chunk that overruns the target. This is intentional: a coherent over-target chunk beats a truncated fragment for retrieval.
tokens is words, not model tokens
Expected: The tokens field counts whitespace-separated words, not BPE tokens. A 500-word chunk is roughly 650-700 GPT/Claude tokens. Leave headroom when you size chunks against an embedding model's hard input limit.
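A budgeting check based on the ~1.3 tokens-per-word rule of thumb could look like this. It is a rough sketch: the ratio is an approximation that varies by tokenizer and language, and the 512-token limit and 20% headroom are hypothetical values chosen for illustration, so use your model's tokenizer when you need an exact count.

```python
import math


def estimated_tokens(word_count: int, tokens_per_word: float = 1.3) -> int:
    """Convert the chunker's word count into a rough model-token estimate."""
    return math.ceil(word_count * tokens_per_word)


def fits_model_limit(chunk: dict, model_limit: int, headroom: float = 0.2) -> bool:
    """True when the estimate stays under the limit with `headroom` held in reserve."""
    budget = model_limit * (1 - headroom)
    return estimated_tokens(chunk["tokens"]) <= budget


small = {"id": 0, "tokens": 300}  # the tokens field is a word count
large = {"id": 1, "tokens": 500}
small_ok = fits_model_limit(small, model_limit=512)
large_ok = fits_model_limit(large, model_limit=512)
```

Here the 300-word chunk passes and the 500-word chunk does not, which is the signal to lower Max words per chunk for a model with that limit.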
Multi-column or magazine layout interleaves text
Reading-order risk: Text comes out in pdf.js content-stream order, not reconstructed reading order. Two-column papers can interleave left and right columns, mid-sentence. Sanity-check a chunk from a multi-column source; if order is scrambled, the PDF needs reflow before chunking.
Free tier file or page ceiling hit
413 limit: Free tier caps PDFs at 2 MB and 50 pages. A long book exceeds the page limit before the byte limit. Upgrade to Pro (50 MB / 500 pages) or split the document with PDF Split (Fixed) first, then chunk each part.
overlap can't exceed half the chunk size
Clamped: Via the API, overlap is clamped to at most half the target word count. Requesting overlap 400 on a 400-word target is silently reduced to 200, so consecutive chunks can never be more than half-redundant.
Re-chunking changes every id
Positional ids: id is a 0-based positional index, not a content hash. Re-running with a different Max words per chunk renumbers everything, so re-embed the whole document rather than diffing by id. Use a content hash if you need stable identifiers across runs.
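One way to derive a content-based identifier is to hash the chunk text. The sketch below is an assumption about how you might do it, not something the tool provides: the stable_chunk_id helper, the 16-character truncation, and the document-name prefix are all illustrative.

```python
import hashlib


def stable_chunk_id(doc_name: str, text: str, length: int = 16) -> str:
    """Derive an id from the document name and whitespace-normalized chunk text."""
    normalized = " ".join(text.split())  # ignore spacing differences between runs
    digest = hashlib.sha256(f"{doc_name}\n{normalized}".encode("utf-8")).hexdigest()
    return digest[:length]


first = stable_chunk_id("paper.pdf", "Renewal is automatic.")
same_text = stable_chunk_id("paper.pdf", "Renewal  is\nautomatic.")
other_text = stable_chunk_id("paper.pdf", "Renewal is manual.")
```

A hashed id only stays the same for chunks whose text is unchanged, so it helps with re-runs at the same setting and with de-duplication; changing Max words per chunk still alters most chunk texts and therefore most ids.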
Ligatures or spacing artefacts in extracted text
Source-dependent: Some PDFs store fi/fl as ligature glyphs or insert spaces between every letter. pdf.js extracts what's there; the chunker doesn't clean it. If embeddings look off, pre-clean the text or extract via PDF to Text and inspect before embedding.
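For the ligature case, Unicode compatibility normalization is a common pre-cleaning step. This is a minimal sketch using Python's standard library; the expand_ligatures name is hypothetical, and it does not address letter-spaced text, which needs a separate heuristic.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Apply NFKC normalization, which maps ligature glyphs to plain letters."""
    return unicodedata.normalize("NFKC", text)


# "\ufb01" is the single-glyph "fi" ligature and "\ufb02" is the "fl" ligature.
cleaned = expand_ligatures("\ufb01nal work\ufb02ow")
```

NFKC also folds other compatibility characters (full-width forms, superscripts, non-breaking spaces), so check that this is acceptable for your corpus before applying it to every chunk.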
Tables are flattened into the sentence stream
Use a sibling tool: Tabular data has no . ! ? boundaries, so a table is swept into surrounding sentences and chunks awkwardly. Extract tables separately with PDF Table to JSON and index them as structured records alongside the prose chunks.
Frequently asked questions
Is this true semantic chunking?
No, and the tool is deliberately honest about it. It is sentence-aware chunking with a sliding overlap: it respects . ! ? boundaries and carries trailing context into the next chunk, which is what the large majority of RAG pipelines actually need. It does not compute embeddings to detect topic shifts, so it is not 'semantic' in the embedding sense. If you need topic-boundary detection, post-process these chunks with your own embedding model.
What chunk size should I pick for my embedding model?
Set Max words per chunk to 200-500 for general retrieval. text-embedding-3-small/large, Cohere embed-v3, and HuggingFace sentence-transformers all handle that comfortably. Smaller chunks (150-250 words) give sharper retrieval for Q&A; larger chunks (500+) give more surrounding context per hit. Remember the field is words, not tokens — multiply by ~1.3 for a token estimate.
How much overlap do the chunks have?
50 words, carried from the tail of each chunk into the start of the next. The browser UI keeps this fixed; only the API/runner exposes an overlap option. Overlap is clamped to at most half the chunk size, so chunks can never be more than 50% redundant.
Does each chunk tell me which page it came from?
Yes. Every chunk has page (the first page it touches) and pageRange ([start, end] covering every page it spans). Store both as vector metadata so a retrieved answer can cite the exact page.
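Turning that metadata into a citable prompt can be as simple as labelling each context block with its page. The sketch below assumes retrieved chunks arrive as dicts with page, pageRange, and text; the context_block and build_prompt helpers and the exact prompt wording are illustrative, not part of the tool.

```python
def context_block(chunk: dict) -> str:
    """Label a chunk with its first page so the model can cite it as [p.N]."""
    start, end = chunk["pageRange"]
    span = f"page {start}" if start == end else f"pages {start}-{end}"
    return f"[p.{chunk['page']}] ({span})\n{chunk['text']}"


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble an instruction, the labelled context blocks, and the question."""
    context = "\n\n".join(context_block(c) for c in chunks)
    return (
        "Answer using the context. Cite the page in [p.N] form.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = [{"page": 7, "pageRange": [7, 8], "text": "Renewal is automatic."}]
prompt = build_prompt("How does renewal work?", retrieved)
```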
Can it chunk a scanned PDF?
Not directly — it reads only the embedded text layer and does no OCR, so a pure scan yields no chunks. Run the scan through PDF OCR to add a text layer first, then chunk the result.
Is my document uploaded to a server?
No. Extraction and chunking run in your browser via pdf.js. The PDF never leaves your device; only an anonymous usage counter is recorded when you're signed in. For automated runs, the paired runner processes files locally on your own machine.
What vector databases does the output work with?
Any store that accepts a vector plus metadata. The output is a plain JSON array of text plus metadata, so it loads into Pinecone, Weaviate, Chroma, Qdrant, Milvus, or pgvector with a trivial loop: embed each text, then upsert with id/page/pageRange as metadata.
How big a PDF can I chunk?
Free tier allows 2 MB and 50 pages; Pro 50 MB and 500 pages; Pro+Media 500 MB and 2,000 pages; Developer 2 GB and 10,000 pages. For very long books, split with PDF Split (Fixed) and chunk each part.
Why did one chunk come out much bigger than my target?
Because a single sentence exceeded the target and sentences are never split. The chunker flushes the buffer before the oversized sentence, then that long sentence forms its own chunk. A coherent over-target chunk retrieves far better than a mid-sentence cut.
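To find these chunks after a run, compare each chunk's tokens field (a word count) with the target you set. This is a small sketch with a hypothetical helper name; it assumes only oversized sentences push a chunk past the target, which the tool's description implies but does not guarantee.

```python
def oversized_chunks(chunks: list[dict], max_words: int) -> list[dict]:
    """Return id, page, and overrun (in words) for chunks above the target."""
    return [
        {"id": c["id"], "page": c["page"], "overrun": c["tokens"] - max_words}
        for c in chunks
        if c["tokens"] > max_words
    ]


sample = [
    {"id": 0, "page": 1, "pageRange": [1, 1], "text": "...", "tokens": 480},
    {"id": 1, "page": 2, "pageRange": [2, 3], "text": "...", "tokens": 640},
]
flagged = oversized_chunks(sample, max_words=500)
```

Flagged chunks are worth a manual look: a long legal recital may be fine as is, while a run of text with no sentence punctuation may point to a table or an extraction problem.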
How do I handle tables inside the PDF?
Pull them out separately. Tables have no sentence punctuation, so they get smeared into adjacent prose. Use PDF Table to JSON to extract tabular data as structured rows and index those alongside your text chunks.
Are the chunk ids stable if I re-process the file?
No. id is a positional 0-based index. Changing Max words per chunk renumbers everything, so re-embed the whole document on re-chunk. If you need IDs that survive re-runs, hash the chunk text yourself.
Can I run this as part of an automated ingestion job?
Yes. GET /api/v1/tools/pdf-to-chunks returns the option schema; pair the @jadapps/runner once and POST your PDF to 127.0.0.1:9789/v1/tools/pdf-to-chunks/run with { maxChunkSize, overlap }. Files are processed locally on your machine, so confidential documents never reach JAD's servers — ideal for a scheduled corpus re-index.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.