Chunk a PDF for LLM Context Windows — Free Tool

How to chunk a PDF document to fit LLM context windows

Step 1
Note your model's context window and reserve headroom: GPT-4o is 128K tokens, Claude models run 200K, and Gemini 1.5 Pro goes up to 1M. Decide how much to reserve for the system prompt, instructions, and the response, then chunk well below the remainder.
Step 2
Load the PDF into the chunker: open PDF to Text Chunks and drop the document. Extraction is local via pdf.js; nothing uploads.
Step 3
Set Max words per chunk below your budget: the tool's tokens field is a word count, and model tokens run about 1.3x higher, so pick a word target that, times 1.3, comfortably fits your reserved budget. For a 100K-token working budget you have huge headroom; for tight per-call budgets, keep chunks at 300-600 words.
Step 4
Generate the chunks JSON: run the tool. Each chunk is { id, page, pageRange, text, tokens }, downloaded as <filename>.chunks.json.
Step 5
Map your prompt over each chunk: send each chunk's text to the LLM with a consistent instruction (summarise / extract entities / answer about this section). Include the previous chunk's pageRange if you want the model to maintain continuity.
Step 6
Reduce the partial outputs into a final answer: collect every chunk's response, then issue a final prompt asking the model to synthesise the partials into one coherent output that cites the page fields you carried through.

Words vs model tokens vs context windows

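You size chunks in words but budget in model tokens, because the tool's tokens field is a whitespace word count rather than a BPE token count. The ~1.3x multiplier is a rule of thumb for English prose.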

Max words per chunk	Approx. model tokens (x1.3)	Fits comfortably in
200	~260	Any model; very tight per-call budgets
500 (default)	~650	GPT-4o, Claude, Gemini with room to spare
1000	~1,300	All major models; fewer, larger calls
2000 (UI max)	~2,600	Large windows; minimise number of calls
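
The conversion in the table can be reproduced with a small helper. This is a minimal sketch, assuming the ~1.3x rule of thumb holds for your text; the function name is made up for illustration and is not part of the tool.

```python
import math

TOKENS_PER_WORD = 1.3  # rule of thumb for English prose, as in the table above


def estimate_model_tokens(word_count: int, ratio: float = TOKENS_PER_WORD) -> int:
    """Estimate model tokens from a word count, rounding up to stay cautious."""
    return math.ceil(word_count * ratio)


for words in (200, 500, 1000, 2000):
    print(words, "words is roughly", estimate_model_tokens(words), "model tokens")
```

For non-English text or text heavy in code and numbers, the real ratio can differ, so treat the result as an estimate rather than a guarantee.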

Context windows of common models (for sizing)

Reserve headroom for your system prompt and the response — never chunk right up to the limit.

Model	Context window	Practical chunk strategy
GPT-4o	128K tokens	Big headroom; chunk for attention quality, not just fit
Claude (200K)	200K tokens	Fewer, larger chunks; overlap keeps continuity
Gemini 1.5 Pro	up to 1M tokens	Often no need to chunk for fit — chunk for retrieval/cost
Smaller / local models	4K-32K tokens	Chunk tightly (200-500 words) and leave generous reply room

Cookbook

Map-reduce patterns for processing a long PDF chunk by chunk.

Map-reduce summary of a long report

Summarise each chunk, then summarise the summaries — the standard pattern for a document longer than one comfortable call.

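import json

# llm() is a placeholder for your own model call: prompt string in, text out.
with open('report.chunks.json') as f:
    chunks = json.load(f)
partials = []
for c in chunks:
    summary = llm(
      f"Summarise this section (pp.{c['pageRange']}):\n{c['text']}")
    # prefix the page range so the reduce step has citations to preserve
    partials.append(f"[pp.{c['pageRange']}] {summary}")
final = llm(
  "Combine these section summaries into one coherent\n"
  "summary, preserving page citations:\n" + '\n'.join(partials))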

Per-chunk extraction with page citations

Pull structured facts from each chunk and keep the page so the answer is traceable.

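import json

collected = []
for c in chunks:
    facts = llm(
      "Extract every deadline as a JSON array of objects "
      "{date, what, page}. "
      f"Page hint: {c['page']}.\n{c['text']}")
    # each fact carries the page it came from; json.loads raises if the
    # model returns anything other than valid JSON, so validate or retry
    collected.extend(json.loads(facts))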

Leaving headroom for the response

Size the chunk so chunk + prompt + reply all fit. Words x1.3 gives model tokens.

Model budget for one call: 8000 tokens
Reserve: system prompt 800 + reply 2000 = 2800
Available for context: 5200 tokens
5200 / 1.3 ~= 4000 words of headroom
-> Max words per chunk: 1500 is safe (leaves slack)
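
The same arithmetic can be written as a function so it can be rerun for other budgets. This is a sketch under the assumptions used above (a fixed ~1.3 tokens per word); the function and parameter names are illustrative only.

```python
def max_context_words(call_budget: int,
                      system_prompt_tokens: int,
                      reply_tokens: int,
                      tokens_per_word: float = 1.3) -> int:
    """Return the ceiling, in words, for chunk text in a single call."""
    available = call_budget - system_prompt_tokens - reply_tokens
    if available <= 0:
        raise ValueError("reserved tokens use up the whole call budget")
    # Convert the remaining model tokens back to words.
    return int(available / tokens_per_word)


ceiling = max_context_words(8000, system_prompt_tokens=800, reply_tokens=2000)
print("Ceiling in words:", ceiling)
```

The result is a ceiling, not a target: set Max words per chunk well below it, as the 1500-word choice above does.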

Carrying overlap to keep continuity

The built-in 50-word overlap already bridges seams; for narrative tasks you can also prepend the previous chunk's tail explicitly.

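prev_tail = ''
outputs = []
for c in chunks:
    # prepend the previous chunk's tail, separated from the new text by a space
    prompt = (prev_tail + ' ' + c['text']).strip()
    outputs.append(llm('Continue the running summary:\n' + prompt))
    prev_tail = ' '.join(c['text'].split()[-50:])  # last 50 words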

Automating the split before a batch LLM job

Produce the chunks programmatically so the LLM step runs unattended.

GET /api/v1/tools/pdf-to-chunks        # schema
POST 127.0.0.1:9789/v1/tools/pdf-to-chunks/run
{ "options": { "maxChunkSize": 800 } }
# returns the JSON array -> feed to your map-reduce job

Edge cases and what actually happens

Sizing chunks in words but budgeting in model tokens

Expected

tokens is a word count. If you size Max words per chunk right up to your model-token budget, the real BPE count (~1.3x) will overflow once you add the system prompt and response. Always convert words to model tokens and leave headroom.
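
One way to catch this before sending anything is to convert each chunk's tokens field to an estimated model-token count and compare it with the space left for context. The sketch below assumes the chunk shape { id, page, pageRange, text, tokens } described earlier; the function name and the budget figure are illustrative.

```python
import math


def chunks_over_budget(chunks, context_budget_tokens, tokens_per_word=1.3):
    """Return (id, estimated_tokens) for chunks whose estimate exceeds the budget."""
    over = []
    for c in chunks:
        # "tokens" in the chunk JSON is a word count, so scale it up.
        estimated = math.ceil(c["tokens"] * tokens_per_word)
        if estimated > context_budget_tokens:
            over.append((c["id"], estimated))
    return over


sample = [
    {"id": 1, "tokens": 500},
    {"id": 2, "tokens": 4100},
]
print(chunks_over_budget(sample, context_budget_tokens=5200))
```

Here context_budget_tokens is what remains after subtracting the system prompt and reply reserve from the per-call budget.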

A fact spans two chunks during sequential processing

Mitigated

The 50-word overlap carries the tail of each chunk into the next, so most boundary-spanning facts survive. For long-range dependencies (a definition introduced 20 pages before it's used), no chunker fixes it — include a running summary or document-level metadata in each prompt.
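
A running summary can be threaded through the prompts roughly as follows. This is a minimal sketch, not the tool's behaviour: llm is a stand-in for your own model call (prompt string in, text out), and the prompt wording and the 200-word cap are arbitrary choices.

```python
def summarise_with_running_context(chunks, llm, max_summary_words=200):
    """Process chunks in order, feeding the summary so far into each prompt."""
    running = ""
    for c in chunks:
        prompt = (
            "Summary of the document so far:\n"
            f"{running or '(nothing yet)'}\n\n"
            f"Next section (pp. {c['pageRange']}):\n{c['text']}\n\n"
            "Rewrite the summary so it also covers this section."
        )
        # Cap the summary length so it cannot crowd out the chunk text.
        running = " ".join(llm(prompt).split()[:max_summary_words])
    return running
```

Because each call depends on the previous one, this variant runs sequentially and cannot be parallelised the way a plain map step can.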

Oversized single sentence exceeds your per-call budget

By design

Sentences are never split, so one very long sentence can produce a chunk larger than your target. If that chunk's word count x1.3 exceeds a tight model budget, lowering the target won't help; pre-process the source to break the sentence, or use a model with a bigger window.
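
A further option, which is not something the tool does for you, is to split an oversized chunk's text downstream in your own code before sending it. The sketch below cuts purely by word count, so it will break mid-sentence; the function name is hypothetical.

```python
def split_by_words(text: str, max_words: int) -> list[str]:
    """Split text into consecutive pieces of at most max_words words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


print(split_by_words("one two three four five", 2))
```

Pieces produced this way share the original chunk's page and pageRange, so carry those fields along if you need citations.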

Scanned PDF gives the model nothing to read

No text layer

No OCR is performed; a scan yields empty chunks. OCR it first with PDF OCR, then chunk and feed to the LLM.
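
A quick guard before spending LLM calls is to check whether the chunks contain any text at all. This sketch assumes each chunk has a text field as described earlier, and treats an empty array the same as all-empty chunks; the function name is illustrative.

```python
def looks_like_scan(chunks) -> bool:
    """True when no chunk contains any non-whitespace text."""
    return all(not c["text"].strip() for c in chunks)


sample_chunks = [{"id": 1, "text": "   "}, {"id": 2, "text": ""}]
if looks_like_scan(sample_chunks):
    print("No extractable text: run OCR first, then chunk again.")
```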

Overlap fixed at 50 words in the browser

Limited control

Narrative tasks sometimes want larger overlap for continuity. The UI fixes it at 50; the API exposes overlap (capped at half the chunk size). Alternatively prepend the previous chunk's tail in your prompt code.

Multi-column layout breaks reading order in a chunk

Reading-order risk

Extraction follows pdf.js content order. A two-column page can interleave columns, so the model reads a scrambled sequence. Inspect a chunk; reflow multi-column PDFs before chunking for LLM input.

Aggregating many partials drifts from the source

Reduce-step risk

Map-reduce over many chunks can lose fidelity in the reduce step. Keep pageRange on every partial and ask the final prompt to cite pages, so you can spot-check the synthesis against the original PDF.
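
One way to keep pageRange attached is to store each partial as a small record and render the page range into the reduce prompt. This is a sketch under assumptions: the record keys and the prompt wording are illustrative, and pageRange is formatted as-is, whatever shape the tool emits.

```python
def build_reduce_prompt(partials) -> str:
    """Render partial outputs, each tagged with its page range, into one prompt."""
    lines = [f"[pp. {p['pageRange']}] {p['summary']}" for p in partials]
    return (
        "Combine these section summaries into one coherent summary. "
        "Cite the bracketed page ranges for every claim.\n"
        + "\n".join(lines)
    )


partials = [
    {"pageRange": "1-3", "summary": "Scope and definitions."},
    {"pageRange": "4-9", "summary": "Delivery schedule and penalties."},
]
print(build_reduce_prompt(partials))
```

With the page ranges visible in the final prompt, any cited claim can be spot-checked against those pages of the original PDF.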

Document fits the window but costs too much whole

Cost strategy

Even with a 200K or 1M window, sending the entire document with every query is expensive. Chunk and retrieve only the relevant pieces (a RAG pattern) instead of stuffing the full text into every call.
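
The retrieval step can be illustrated with a deliberately naive keyword scorer. Production RAG setups usually rank by embedding similarity instead; this stand-in only shows the shape of "pick the relevant chunks, send only those", and the function name is made up.

```python
import re


def top_chunks(chunks, query: str, k: int = 3):
    """Rank chunks by how many of their words appear in the query; keep the top k."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for c in chunks:
        words = re.findall(r"\w+", c["text"].lower())
        score = sum(1 for w in words if w in query_terms)
        scored.append((score, c))
    # Sort by score only; the sort is stable, so ties keep document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored if score > 0][:k]


docs = [
    {"id": 1, "text": "The deadline for filing is 3 May."},
    {"id": 2, "text": "Background and history."},
    {"id": 3, "text": "Filing deadline extended; deadline now June."},
]
print([c["id"] for c in top_chunks(docs, "filing deadline", k=2)])
```

Only the returned chunks go into the prompt, so cost scales with k rather than with the length of the document.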

Page or file size over the tier limit

413 limit

Free allows 50 pages / 2 MB. A long PDF hits the page ceiling first. Split with PDF Split (Fixed) and chunk each part, or upgrade to Pro (500 pages).

Frequently asked questions

How do I size chunks to fit my LLM's context window?

Decide your per-call budget, subtract the system prompt and expected response, then convert the remainder to words by dividing by ~1.3 (since the tool's tokens field is words, and model tokens run higher). Set Max words per chunk below that. For a 100K-token window you have enormous headroom; for tight budgets keep chunks at 300-600 words.

Is the tokens field the same as GPT/Claude tokens?

No. It is a whitespace word count, not BPE tokens. Multiply by roughly 1.3 to estimate model tokens. Never size chunks right up to the model limit using the raw tokens number — you'll overflow once the prompt and reply are added.

Should I overlap context between chunks?

The tool already carries a 50-word overlap from each chunk into the next, which preserves continuity for most tasks. For long narrative summaries you can additionally prepend the previous chunk's tail in your own prompt code. The API lets you raise the built-in overlap.

How do I aggregate responses from multiple chunks?

Use map-reduce: run the same instruction over each chunk (the map step), collect the partial outputs, then issue one final prompt asking the model to synthesise them into a single coherent answer (the reduce step). Carry pageRange through so the final answer cites pages.

What if a fact spans two chunks?

The 50-word overlap usually keeps it whole in at least one chunk. For dependencies that span many pages, include a running summary or document metadata in each prompt — chunking alone can't connect distant references.

Can I just use Gemini's 1M window and skip chunking?

You can for pure fit, but stuffing a whole long document into every call is expensive and can dilute the model's attention. Chunking plus retrieval (sending only relevant pieces) is usually cheaper and more accurate even on large-window models.

Does it work on scanned PDFs?

No — there is no OCR, so a scan produces empty chunks. Run PDF OCR first to add a text layer, then chunk and feed to your LLM.

How do I keep the model's answer traceable to the source?

Each chunk carries page and pageRange. Pass them into your prompts and ask the model to cite pages in its output, then spot-check against the original PDF. This is the cheapest guard against hallucinated synthesis.

What's the largest PDF I can chunk?

Free: 2 MB / 50 pages. Pro: 50 MB / 500 pages. Pro+Media: 500 MB / 2,000 pages. Developer: 2 GB / 10,000 pages. Split bigger documents with PDF Split (Fixed).

Is my document sent anywhere by the chunker?

No. Chunking runs in your browser via pdf.js. You decide when chunk text goes to your LLM provider; the chunker itself never uploads the file.

Why is one chunk bigger than my target?

A single sentence longer than the target becomes its own chunk, since sentences are never split. If that overruns a tight model budget, the fix is to break the sentence in the source or use a larger context window, not a smaller target.

Can I run the split automatically before an LLM batch job?

Yes. Read the schema from GET /api/v1/tools/pdf-to-chunks, pair the @jadapps/runner, and POST the PDF to 127.0.0.1:9789/v1/tools/pdf-to-chunks/run. The JSON array feeds straight into your unattended map-reduce pipeline, with files processed locally.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.
