How to chunk a PDF document to fit LLM context windows
- Step 1: Note your model's context window and reserve headroom. GPT-4o is 128K tokens, Claude models run 200K, Gemini 1.5 Pro up to 1M. Decide how much to reserve for the system prompt, instructions, and the response, then chunk well below the remainder.
- Step 2: Load the PDF into the chunker. Open PDF to Text Chunks and drop the document. Extraction is local via pdf.js; nothing uploads.
- Step 3: Set Max words per chunk below your budget. Because tokens is a word count (~1.3x for model tokens), pick a word target that, times 1.3, comfortably fits your reserved budget. For a 100K-token working budget you have huge headroom; for tight per-call budgets, keep chunks at 300-600 words.
- Step 4: Generate the chunks JSON. Run the tool. Each chunk is { id, page, pageRange, text, tokens }, downloaded as <filename>.chunks.json.
- Step 5: Map your prompt over each chunk. Send each chunk's text to the LLM with a consistent instruction (summarise / extract entities / answer about this section). Include the previous chunk's pageRange if you want the model to maintain continuity.
- Step 6: Reduce the partial outputs into a final answer. Collect every chunk's response, then issue a final prompt asking the model to synthesise the partials into one coherent output, citing the page fields you carried through.
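The sizing in Steps 1 and 3 can be sketched as a small helper. This is a minimal illustration, not part of the tool: the function name, the 80% safety margin, and the parameter names are assumptions, while the 1.3 ratio and the 2000-word UI maximum come from this guide.

```python
UI_MAX_WORDS = 2000  # largest "Max words per chunk" the UI accepts, per this guide


def max_words_per_chunk(call_budget_tokens, system_tokens, reply_tokens,
                        safety_percent=80):
    """Return a word target that leaves room for the prompt and the reply.

    safety_percent is an assumed margin, not a value defined by the tool.
    """
    available = call_budget_tokens - system_tokens - reply_tokens
    if available <= 0:
        raise ValueError("prompt and reply reservations exceed the budget")
    # Divide by the ~1.3 words-to-tokens ratio using integer math (x10 / 13).
    ceiling_words = available * 10 // 13
    # Stay below the theoretical ceiling, and never above the UI maximum.
    target = ceiling_words * safety_percent // 100
    return min(target, UI_MAX_WORDS)


print(max_words_per_chunk(8000, 800, 2000))  # 2000 (clamped to the UI maximum)
print(max_words_per_chunk(2000, 300, 400))   # 800
```

The ratio is a rule of thumb for English prose, so treat the result as a starting point and lower it for code-heavy or non-English documents.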
Words vs model tokens vs context windows
Why you size in words but budget in model tokens. The ~1.3x is a rule of thumb for English prose.
| Max words per chunk | Approx. model tokens (x1.3) | Fits comfortably in |
|---|---|---|
| 200 | ~260 | Any model; very tight per-call budgets |
| 500 (default) | ~650 | GPT-4o, Claude, Gemini with room to spare |
| 1000 | ~1,300 | All major models; fewer, larger calls |
| 2000 (UI max) | ~2,600 | Large windows; minimise number of calls |
Context windows of common models (for sizing)
Reserve headroom for your system prompt and the response — never chunk right up to the limit.
| Model | Context window | Practical chunk strategy |
|---|---|---|
| GPT-4o | 128K tokens | Big headroom; chunk for attention quality, not just fit |
| Claude (200K) | 200K tokens | Fewer, larger chunks; overlap keeps continuity |
| Gemini 1.5 Pro | up to 1M tokens | Often no need to chunk for fit — chunk for retrieval/cost |
| Smaller / local models | 4K-32K tokens | Chunk tightly (200-500 words) and leave generous reply room |
Cookbook
Map-reduce patterns for processing a long PDF chunk by chunk.
Map-reduce summary of a long report
Summarise each chunk, then summarise the summaries — the standard pattern for a document longer than one comfortable call.
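    import json

    # llm(prompt) stands for your own call to the model; it is not defined here.
    with open('report.chunks.json') as f:
        chunks = json.load(f)

    # Map: summarise each chunk and tag the partial with its page range.
    partials = []
    for c in chunks:
        summary = llm(
            f"Summarise this section (pp.{c['pageRange']}):\n{c['text']}")
        partials.append(f"[pp.{c['pageRange']}] {summary}")

    # Reduce: merge the tagged partials into one summary.
    final = llm(
        "Combine these section summaries into one coherent\n"
        "summary, preserving page citations:\n" + '\n'.join(partials))
Per-chunk extraction with page citations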
Pull structured facts from each chunk and keep the page so the answer is traceable.
    # Reuses json, chunks and llm from the previous example.
    collected = []
    for c in chunks:
        facts = llm(
            "Extract every deadline as a JSON array of {date, what, page}. "
            f"Page hint: {c['page']}.\n{c['text']}")
        # each fact carries the page it came from
        collected.extend(json.loads(facts))
Leaving headroom for the response
Size the chunk so chunk + prompt + reply all fit. Words x1.3 gives model tokens.
    Model budget for one call: 8000 tokens
    Reserve: system prompt 800 + reply 2000 = 2800
    Available for context: 5200 tokens
    5200 / 1.3 ~= 4000 words of headroom
    -> Max words per chunk: 1500 is safe (leaves slack)
Carrying overlap to keep continuity
The built-in 50-word overlap already bridges seams; for narrative tasks you can also prepend the previous chunk's tail explicitly.
    prev_tail = ''
    for c in chunks:
        # prepend the previous chunk's tail so the model sees across the seam
        prompt = (prev_tail + '\n' + c['text']).strip()
        out = llm('Continue the running summary:\n' + prompt)
        prev_tail = ' '.join(c['text'].split()[-50:])  # last 50 words
Automating the split before a batch LLM job
Produce the chunks programmatically so the LLM step runs unattended.
    GET /api/v1/tools/pdf-to-chunks    # schema
    POST 127.0.0.1:9789/v1/tools/pdf-to-chunks/run
    { "options": { "maxChunkSize": 800 } }
    # returns the JSON array -> feed to your map-reduce job
Edge cases and what actually happens
Sizing chunks in words but budgeting in model tokens
Expected: tokens is a word count. If you size Max words per chunk right up to your model-token budget, the real BPE count (~1.3x) will overflow once you add the system prompt and response. Always convert words to model tokens and leave headroom.
A fact spans two chunks during sequential processing
Mitigated: The 50-word overlap carries the tail of each chunk into the next, so most boundary-spanning facts survive. For long-range dependencies (a definition introduced 20 pages before it's used), no chunker fixes it; include a running summary or document-level metadata in each prompt.
Oversized single sentence exceeds your per-call budget
By design: Sentences are never split, so one very long sentence can produce a chunk larger than your target. If that chunk's word count x1.3 exceeds a tight model budget, lowering the target won't help; pre-process the source to break the sentence, or use a model with a bigger window.
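One way to catch such chunks before sending anything is to scan the chunks file and flag entries whose estimated model tokens exceed the per-call budget. This is a sketch under stated assumptions: the function name is hypothetical, the sample values (including the string form of pageRange) are made up for illustration, and only the field names and the 1.3 ratio come from this guide.

```python
WORDS_TO_TOKENS = 1.3  # rule-of-thumb ratio for English prose


def oversized_chunks(chunks, token_budget):
    """Return (id, estimated_tokens) for chunks that would overflow the budget."""
    flagged = []
    for c in chunks:
        # The tokens field is a word count, so scale it to estimate model tokens.
        estimate = round(c["tokens"] * WORDS_TO_TOKENS)
        if estimate > token_budget:
            flagged.append((c["id"], estimate))
    return flagged


# Made-up sample in the { id, page, pageRange, text, tokens } shape.
sample = [
    {"id": 0, "page": 1, "pageRange": "1-2", "text": "short section", "tokens": 480},
    {"id": 1, "page": 3, "pageRange": "3", "text": "one very long sentence", "tokens": 1250},
]
print(oversized_chunks(sample, token_budget=1000))  # [(1, 1625)]
```

Flagged chunks are the ones to break up in the source or route to a larger-window model.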
Scanned PDF gives the model nothing to read
No text layer: No OCR is performed; a scan yields empty chunks. OCR it first with PDF OCR, then chunk and feed to the LLM.
Overlap fixed at 50 words in the browser
Limited control: Narrative tasks sometimes want larger overlap for continuity. The UI fixes it at 50; the API exposes overlap (capped at half the chunk size). Alternatively, prepend the previous chunk's tail in your prompt code.
Multi-column layout breaks reading order in a chunk
Reading-order risk: Extraction follows pdf.js content order. A two-column page can interleave columns, so the model reads a scrambled sequence. Inspect a chunk; reflow multi-column PDFs before chunking for LLM input.
Aggregating many partials drifts from the source
Reduce-step risk: Map-reduce over many chunks can lose fidelity in the reduce step. Keep pageRange on every partial and ask the final prompt to cite pages, so you can spot-check the synthesis against the original PDF.
Document fits the window but costs too much whole
Cost strategy: Even with a 200K or 1M window, sending the entire document every query is expensive. Chunk + retrieve only the relevant pieces (a RAG pattern) instead of stuffing the full text into every call.
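To make the retrieve-then-send idea concrete, here is a deliberately naive sketch that ranks chunks by keyword overlap with the question and keeps the top few. The function names and sample data are hypothetical, and production RAG setups typically use embeddings rather than word overlap; this only shows the shape of the pattern.

```python
import re


def words_in(text):
    """Lower-cased set of alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(chunks, question, k=3):
    """Rank chunks by how many distinct question words they contain."""
    query = words_in(question)
    scored = [(len(query & words_in(c["text"])), c) for c in chunks]
    scored = [pair for pair in scored if pair[0] > 0]   # drop chunks with no match
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [c for _, c in scored[:k]]


# Made-up sample chunks for illustration.
docs = [
    {"id": 0, "text": "The contract renewal deadline is 30 June."},
    {"id": 1, "text": "Office seating plan and parking."},
    {"id": 2, "text": "Payment deadline for invoices is 15 July."},
]
best = top_chunks(docs, "When is the renewal deadline?", k=2)
print([c["id"] for c in best])  # [0, 2]
```

Only the selected chunks then go into the prompt, which keeps per-query cost tied to the answer rather than to the document length.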
Page or file size over the tier limit
413 limit: Free allows 50 pages / 2 MB. A long PDF usually hits the page ceiling first. Split with PDF Split (Fixed) and chunk each part, or upgrade to Pro (500 pages).
Frequently asked questions
How do I size chunks to fit my LLM's context window?
Decide your per-call budget, subtract the system prompt and expected response, then convert the remainder to words by dividing by ~1.3 (since the tool's tokens field is words, and model tokens run higher). Set Max words per chunk below that. For a 100K-token window you have enormous headroom; for tight budgets keep chunks at 300-600 words.
Is the tokens field the same as GPT/Claude tokens?
No. It is a whitespace word count, not BPE tokens. Multiply by roughly 1.3 to estimate model tokens. Never size chunks right up to the model limit using the raw tokens number — you'll overflow once the prompt and reply are added.
Should I overlap context between chunks?
The tool already carries a 50-word overlap from each chunk into the next, which preserves continuity for most tasks. For long narrative summaries you can additionally prepend the previous chunk's tail in your own prompt code. The API lets you raise the built-in overlap.
How do I aggregate responses from multiple chunks?
Use map-reduce: run the same instruction over each chunk (the map step), collect the partial outputs, then issue one final prompt asking the model to synthesise them into a single coherent answer (the reduce step). Carry pageRange through so the final answer cites pages.
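When there are too many partials for one reduce prompt, the reduce step can itself be run in batches. The sketch below is one possible shape, not the tool's behaviour: reduce_partials, the max_words limit, and the llm callable are all assumptions, and the fake model exists only so the example runs on its own.

```python
def reduce_partials(partials, llm, max_words=1500):
    """Combine partial outputs in word-limited batches until one answer remains."""
    if not partials:
        raise ValueError("nothing to reduce")
    while len(partials) > 1:
        batches, current, count = [], [], 0
        for p in partials:
            words = len(p.split())
            if current and count + words > max_words:
                batches.append(current)
                current, count = [], 0
            current.append(p)
            count += words
        batches.append(current)
        if len(batches) == len(partials):
            # Every partial fills a batch alone; merge in pairs so the list shrinks.
            batches = [partials[i:i + 2] for i in range(0, len(partials), 2)]
        partials = [
            llm("Combine these partial answers, keeping page citations:\n"
                + "\n".join(batch))
            for batch in batches
        ]
    return partials[0]


# Stand-in for a real model call, used only to make the sketch runnable.
calls = []


def fake_llm(prompt):
    calls.append(prompt)
    return "merged"


result = reduce_partials(["a b c"] * 4, fake_llm, max_words=6)
print(result, len(calls))  # merged 3
```

Each extra reduce layer is another chance to drift from the source, so carrying page ranges in every partial matters more as the tree gets deeper.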
What if a fact spans two chunks?
The 50-word overlap usually keeps it whole in at least one chunk. For dependencies that span many pages, include a running summary or document metadata in each prompt — chunking alone can't connect distant references.
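A running summary can be carried from call to call roughly as follows. This is a hedged sketch: map_with_running_summary and the prompt wording are illustrative, llm is whatever function calls your model, and the sample chunks (including the string form of pageRange) are made up.

```python
def map_with_running_summary(chunks, llm):
    """Process chunks in order, giving each call the summary built so far."""
    summary = ""
    for c in chunks:
        prompt = (
            "Summary of the document so far:\n"
            + (summary or "(nothing yet)") + "\n\n"
            + f"New section (pages {c['pageRange']}):\n{c['text']}\n\n"
            + "Update the summary to include the new section."
        )
        summary = llm(prompt)  # the reply becomes the context for the next chunk
    return summary


# Stand-in model that records prompts so the flow can be inspected.
seen = []


def fake_llm(prompt):
    seen.append(prompt)
    return f"summary v{len(seen)}"


pages = [
    {"pageRange": "1-2", "text": "Term X is defined here."},
    {"pageRange": "21-22", "text": "Term X is applied here."},
]
print(map_with_running_summary(pages, fake_llm))  # summary v2
```

Because each prompt contains the previous reply, a definition from an early chunk can still reach a call made many pages later, at the cost of strictly sequential processing.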
Can I just use Gemini's 1M window and skip chunking?
You can for pure fit, but stuffing a whole long document into every call is expensive and can dilute the model's attention. Chunking plus retrieval (sending only relevant pieces) is usually cheaper and more accurate even on large-window models.
Does it work on scanned PDFs?
No — there is no OCR, so a scan produces empty chunks. Run PDF OCR first to add a text layer, then chunk and feed to your LLM.
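A quick guard before any LLM call can catch this case. The check below is an assumption about what "empty chunks" looks like in the file (either an empty array or chunks whose text is blank), and the function name and word threshold are hypothetical.

```python
def looks_scanned(chunks, min_words=5):
    """True when the chunk list carries almost no text, as an un-OCRed scan would."""
    total_words = sum(len(c.get("text", "").split()) for c in chunks)
    return total_words < min_words


print(looks_scanned([]))                                   # True
print(looks_scanned([{"text": ""}, {"text": "  "}]))       # True
print(looks_scanned([{"text": "A normal page of extracted text here."}]))  # False
```

If the check fires, run OCR first rather than paying for model calls on empty input.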
How do I keep the model's answer traceable to the source?
Each chunk carries page and pageRange. Pass them into your prompts and ask the model to cite pages in its output, then spot-check against the original PDF. This is the cheapest guard against hallucinated synthesis.
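One cautious variant is to stamp the page fields from the chunk itself onto each extracted fact, instead of trusting a page number the model wrote. The helper below is a sketch: parse_facts is a hypothetical name, the sample reply and chunk values are invented, and it assumes the model was asked for a JSON array of objects.

```python
import json


def parse_facts(raw, chunk):
    """Parse a JSON-array reply and overwrite page fields with the chunk's own."""
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed reply; real code might log it and retry
    if not isinstance(facts, list):
        return []
    stamped = []
    for fact in facts:
        if isinstance(fact, dict):
            # The chunk's metadata wins over whatever page the model claimed.
            stamped.append({**fact,
                            "page": chunk["page"],
                            "pageRange": chunk["pageRange"]})
    return stamped


# Made-up chunk metadata and model reply for illustration.
chunk = {"id": 7, "page": 4, "pageRange": "4-5"}
reply = '[{"date": "30 June", "what": "renewal", "page": 99}]'
print(parse_facts(reply, chunk))
```

In this example the model's claimed page 99 is replaced by the chunk's page 4, so every citation in the final answer points at a page that really was in the prompt.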
What's the largest PDF I can chunk?
Free: 2 MB / 50 pages. Pro: 50 MB / 500 pages. Pro+Media: 500 MB / 2,000 pages. Developer: 2 GB / 10,000 pages. Split bigger documents with PDF Split (Fixed).
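The limits above can be checked up front with a small lookup. This is only an illustration: the function name is hypothetical, 2 GB is taken as 2048 MB, and treating the limits as inclusive is an assumption.

```python
# (tier name, max megabytes, max pages), copied from the limits listed above.
TIERS = [
    ("Free", 2, 50),
    ("Pro", 50, 500),
    ("Pro+Media", 500, 2000),
    ("Developer", 2048, 10000),  # assumes 2 GB = 2048 MB
]


def smallest_tier(size_mb, pages):
    """Return the first tier whose size and page limits both fit, else None."""
    for name, max_mb, max_pages in TIERS:
        if size_mb <= max_mb and pages <= max_pages:
            return name
    return None  # too large for any tier: split the PDF first


print(smallest_tier(1.5, 120))  # Pro
```

A 1.5 MB file with 120 pages fits Free on size but not on pages, which is the page-ceiling case described earlier.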
Is my document sent anywhere by the chunker?
No. Chunking runs in your browser via pdf.js. You decide when chunk text goes to your LLM provider; the chunker itself never uploads the file.
Why is one chunk bigger than my target?
A single sentence longer than the target becomes its own chunk, since sentences are never split. If that overruns a tight model budget, the fix is to break the sentence in the source or use a larger context window, not a smaller target.
Can I run the split automatically before an LLM batch job?
Yes. Read the schema from GET /api/v1/tools/pdf-to-chunks, pair the @jadapps/runner, and POST the PDF to 127.0.0.1:9789/v1/tools/pdf-to-chunks/run. The JSON array feeds straight into your unattended map-reduce pipeline, with files processed locally.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.