Chunk a PDF for a RAG Pipeline — Free Browser Tool

How to chunk a PDF document for a RAG pipeline

Step 1: Open the PDF chunker and drop your document
Load the PDF into the PDF to Text Chunks tool. Extraction and chunking run in your browser via pdf.js, so nothing is uploaded. The tool reads the document's embedded text layer; a scanned, image-only PDF has no text to chunk (see Step 3).
Step 2: Set Max words per chunk for your embedding model
The single control is Max words per chunk (50–2000, default 500). For OpenAI text-embedding-3-small/-large, 200–500 words is a sound default: it sits far inside the roughly 8k-token input limit while keeping each chunk focused enough for precise retrieval. For models with a shorter window, such as the 512-token limit documented for Cohere embed-v3, stay near 200–350 words, because 500 words is roughly 650 tokens.
Step 3: Confirm the document has a real text layer
If the source is a scan, run it through PDF OCR first to add a searchable text layer, then bring the OCR'd PDF back here. The chunker does not OCR; it only sees text pdf.js can extract.
Step 4: Generate and download the chunks JSON
The tool produces a JSON array and downloads it as <filename>.chunks.json. Each element is { id, page, pageRange, text, tokens } where tokens is a whitespace word count, not a model-tokenizer count.
Step 5: Embed each chunk's text
Iterate the array, send each text to your embedding API in batches, and keep id, page, and pageRange as metadata on the resulting vector. Re-embed the whole document if you re-chunk with a different size, because id values are positional and not stable across runs.
Step 6: Upsert into your vector store and test retrieval
Upsert the vectors into Pinecone, Weaviate, Chroma, Qdrant, or pgvector, then run a few known questions and inspect which page/pageRange come back. If answers arrive truncated, raise Max words per chunk so each hit carries more context; if retrieval pulls in unfocused, off-topic text, lower it.

What the chunker actually does (and doesn't)

Verified against the tool's implementation. The browser UI exposes Max words per chunk only; overlap is fixed at 50 unless you call the API.

Aspect	Behaviour	Notes for RAG
Split unit	Sentence boundaries — regex `(?<=[.!?])\s+`	Sentences are never cut mid-way; a single sentence longer than the target becomes its own oversized chunk
Chunk size control	`Max words per chunk` — 50 to 2000, default 500	It is a word target, not a model-token target; multiply by ~1.3 for a rough BPE-token estimate
Overlap	Fixed 50 words in the browser; configurable via API (`overlap`)	Overlap is taken from the trailing sentences of the previous chunk, clamped to at most half the target
`tokens` field	Whitespace-separated word count of the chunk text	Use it for budgeting, but don't treat it as exact GPT/Claude tokens
Metadata	`id` (0-based, positional), `page`, `pageRange [start,end]`	`page` is the first page the chunk touches; `pageRange` spans every page it covers
OCR	None — embedded text layer only	Scanned PDFs must go through PDF OCR first or they yield zero chunks
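
The behaviour in the table can be approximated in a few lines. This is a minimal sketch, not the tool's source: the function name `chunk_sentences` and the exact packing order are assumptions, and only the split regex, the word target, and the half-target overlap clamp come from the table.

```python
import re

SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")  # split regex quoted in the table


def chunk_sentences(text: str, max_words: int = 500, overlap: int = 50) -> list:
    """Greedy sentence packing with a trailing-sentence overlap (illustrative)."""
    overlap = min(overlap, max_words // 2)  # clamp to at most half the target
    sentences = [s for s in SENTENCE_SPLIT.split(text.strip()) if s]
    chunks = []
    buffer = []    # sentences in the chunk being built
    fresh = False  # True once buffer holds a sentence not yet emitted

    def words(parts):
        return sum(len(p.split()) for p in parts)

    for sentence in sentences:
        n = len(sentence.split())
        if n > max_words:
            # An oversized sentence is never split: flush, then emit it alone.
            if fresh:
                chunks.append(" ".join(buffer))
            chunks.append(sentence)
            buffer, fresh = [], False
            continue
        if fresh and words(buffer) + n > max_words:
            chunks.append(" ".join(buffer))
            # Carry whole trailing sentences while they fit the overlap budget.
            tail = []
            for prev in reversed(buffer):
                if words(tail) + len(prev.split()) > overlap:
                    break
                tail.insert(0, prev)
            if words(tail) + n > max_words:
                tail = []  # drop the overlap rather than overrun the target
            buffer, fresh = tail, False
        buffer.append(sentence)
        fresh = True
    if fresh:
        chunks.append(" ".join(buffer))
    return chunks
```

Running it on a few short sentences with a small target makes the overlap visible: the last sentence of one chunk reappears as the first sentence of the next.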

File and page limits by plan

PDF-family limits from the platform's tier table. The chunker reads every page, so a large book can hit the page ceiling before the byte ceiling.

Plan	Max file size	Max pages	Files per batch
Free	2 MB	50 pages	1
Pro	50 MB	500 pages	5
Pro + Media	500 MB	2,000 pages	50
Developer	2 GB	10,000 pages	unlimited
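
A pre-flight check against the table above can save a failed run. The sketch below is illustrative: the helper name `smallest_plan` is hypothetical, and treating MB and GB as binary units is an assumption, since the table does not say which convention applies.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: binary megabytes


@dataclass(frozen=True)
class Plan:
    name: str
    max_bytes: int
    max_pages: int


# Values copied from the tier table, ordered from smallest to largest.
PLANS = [
    Plan("Free", 2 * MB, 50),
    Plan("Pro", 50 * MB, 500),
    Plan("Pro + Media", 500 * MB, 2_000),
    Plan("Developer", 2 * 1024 * MB, 10_000),
]


def smallest_plan(size_bytes: int, pages: int) -> Optional[str]:
    """Return the first plan whose file-size and page ceilings both fit."""
    for plan in PLANS:
        if size_bytes <= plan.max_bytes and pages <= plan.max_pages:
            return plan.name
    return None  # exceeds every tier: split the document first
```

A 1 MB file with 300 pages fits the Free byte limit but not its page limit, so the check returns "Pro".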

Cookbook

Concrete RAG chunking recipes. The JSON shown is the tool's real output shape — id, page, pageRange, text, tokens.

Default 500-word chunks for a research-paper corpus

The out-of-the-box setting. A 10-page paper yields a handful of chunks, each a run of whole sentences with a 50-word tail carried into the next.

Max words per chunk: 500  (overlap: 50, fixed in browser)

Output (first element of <paper>.chunks.json):
[
  {
    "id": 0,
    "page": 1,
    "pageRange": [1, 2],
    "text": "Retrieval-augmented generation combines a parametric ...",
    "tokens": 498
  },
  ...
]
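
Before embedding, it can help to sanity-check the downloaded array against the shape shown above. This is a sketch under the documented assumptions (0-based positional id, page equal to the first entry of pageRange); `validate_chunks` is a hypothetical helper, not part of the tool.

```python
REQUIRED_KEYS = {"id", "page", "pageRange", "text", "tokens"}


def validate_chunks(chunks: list) -> list:
    """Return human-readable problems; an empty list means the array looks sane."""
    problems = []
    if not chunks:
        problems.append("empty array: the PDF may have no text layer")
    for pos, chunk in enumerate(chunks):
        missing = REQUIRED_KEYS - chunk.keys()
        if missing:
            problems.append(f"chunk {pos}: missing keys {sorted(missing)}")
            continue
        start, end = chunk["pageRange"]
        if chunk["id"] != pos:
            problems.append(f"chunk {pos}: id {chunk['id']} is not positional")
        if chunk["page"] != start or start > end:
            problems.append(
                f"chunk {pos}: page {chunk['page']} does not match "
                f"pageRange {chunk['pageRange']}"
            )
        if not chunk["text"].strip():
            problems.append(f"chunk {pos}: empty text")
    return problems
```

Load the file with `json.load` and pass the result in; an empty return value means the array matches the documented shape.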

Smaller chunks for high-precision Q&A retrieval

FAQ-style or policy lookups benefit from tighter chunks so a single retrieved chunk maps to a single answer. Drop Max words per chunk to ~200.

Max words per chunk: 200

A 10-page handbook -> ~12-18 chunks instead of ~5.
Each chunk stays within one or two pages, so the
retrieved pageRange points the user to a precise section.
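
The chunk counts above follow from simple arithmetic: after the first chunk, each new chunk advances by the target minus the overlap. The estimate below is a rough sketch that assumes chunks fill exactly to the target; real counts drift because sentences are kept whole. The 2,500-word handbook in the usage note (about 250 words per page) is an assumed figure, not one from the tool.

```python
import math


def estimate_chunk_count(total_words: int, max_words: int, overlap: int = 50) -> int:
    """Approximate number of chunks for a document of total_words words."""
    if total_words <= 0:
        return 0
    overlap = min(overlap, max_words // 2)  # same clamp the chunker applies
    if total_words <= max_words:
        return 1
    step = max_words - overlap  # new words contributed by each later chunk
    return 1 + math.ceil((total_words - max_words) / step)
```

For an assumed 2,500-word handbook, `estimate_chunk_count(2500, 200)` gives 17 and `estimate_chunk_count(2500, 500)` gives 6, in line with the ranges quoted above.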

Loop the chunks into an embedding + upsert pipeline

The JSON array drops straight into a batch-embed loop. Keep the chunker's metadata on each vector for citation.

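import json
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("rag-chunks")  # placeholder key and index name
with open("paper.chunks.json", encoding="utf-8") as f:
    chunks = json.load(f)

for c in chunks:
    vec = client.embeddings.create(
        model="text-embedding-3-small", input=c["text"]
    ).data[0].embedding
    # Pinecone metadata lists may only hold strings, so store the range as two numbers.
    index.upsert(vectors=[(f"paper-{c['id']}", vec, {
        "page": c["page"],
        "pageStart": c["pageRange"][0], "pageEnd": c["pageRange"][1],
        "text": c["text"],
    })])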
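
The loop above sends one request per chunk. A batched variant is sketched below with the embedding call injected as a plain function, so the batching logic can be exercised without network access. The names `batched`, `embed_in_batches`, and the `embed` parameter are hypothetical, and the record layout is one possible shape rather than a required one.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    chunks: Sequence[dict],
    embed: Callable[[List[str]], List[List[float]]],
    doc_name: str = "paper",
    batch_size: int = 64,
) -> List[dict]:
    """Embed chunk texts batch by batch and attach page metadata to each vector."""
    records = []
    for batch in batched(chunks, batch_size):
        vectors = embed([c["text"] for c in batch])  # one API call per batch
        for c, vec in zip(batch, vectors):
            records.append({
                "id": f"{doc_name}-{c['id']}",
                "values": vec,
                "metadata": {
                    "page": c["page"],
                    "pageStart": c["pageRange"][0],
                    "pageEnd": c["pageRange"][1],
                    "text": c["text"],
                },
            })
    return records
```

With the OpenAI client, `embed` could wrap a single `embeddings.create` call that passes the list of texts as `input` and returns the `embedding` of each item in `data`.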

Automate it locally with the runner (overlap configurable here)

The browser locks overlap at 50; the API exposes it. Pair the runner once and processing stays on your machine — the document never leaves your network.

# Discover the option schema
GET /api/v1/tools/pdf-to-chunks

# Run locally via the paired @jadapps/runner
POST 127.0.0.1:9789/v1/tools/pdf-to-chunks/run
{
  "options": { "maxChunkSize": 400, "overlap": 80 }
}
# -> same {id,page,pageRange,text,tokens} JSON array

Citing the source page in a RAG answer

Because every chunk carries page provenance, your prompt can ask the LLM to cite it and you can render a deep link.

Retrieved chunk metadata: { page: 7, pageRange: [7, 8] }

Prompt scaffold:
"Answer using the context. Cite the page in [p.N] form."
Context: <chunk.text>

Model: "... renewal is automatic [p.7]."
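
One way to assemble that scaffold from retrieved chunks is sketched here. The function name `build_prompt` and the exact layout of the context block are assumptions; only the instruction sentence and the [p.N] citation form come from the recipe above.

```python
def build_prompt(question: str, retrieved: list) -> str:
    """Prefix each chunk with its page tag so the model can copy it as a citation."""
    blocks = [f"[p.{c['page']}] {c['text']}" for c in retrieved]
    return (
        "Answer using the context. Cite the page in [p.N] form.\n\n"
        "Context:\n" + "\n\n".join(blocks) + f"\n\nQuestion: {question}"
    )
```

Tagging each context block with its own page keeps citations accurate when several chunks from different pages are retrieved for one question.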

Edge cases and what actually happens

Scanned / image-only PDF produces zero chunks

No text layer

The chunker reads only the embedded text layer via pdf.js. A scanned contract or photographed page has no extractable text, so it yields an empty or near-empty chunk array. Run it through PDF OCR first to add a text layer, then chunk the OCR'd output.

A single sentence is longer than Max words per chunk

By design

Sentences are never split mid-way. If one sentence exceeds your target (a long enumerated clause, a dense legal recital), it becomes its own chunk that overruns the target. This is intentional — a coherent over-target chunk beats a truncated fragment for retrieval.

tokens is words, not model tokens

Expected

The tokens field counts whitespace-separated words, not BPE tokens. A 500-word chunk is roughly 650-700 GPT/Claude tokens. Leave headroom when you size chunks against an embedding model's hard input limit.
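
The headroom calculation can be made explicit. The sketch below uses the ~1.3 tokens-per-word ratio quoted in this guide; the ratio varies by language and tokenizer, the 10% safety margin is an assumed default, and both function names are hypothetical.

```python
import math

WORDS_TO_TOKENS = 1.3  # rough ratio quoted in this guide, not an exact tokenizer count


def estimate_tokens(word_count: int, ratio: float = WORDS_TO_TOKENS) -> int:
    """Rough model-token estimate for a chunk's whitespace word count."""
    return round(word_count * ratio)


def max_words_for(token_limit: int, ratio: float = WORDS_TO_TOKENS,
                  headroom: float = 0.9) -> int:
    """Largest Max words per chunk that stays under token_limit with a margin."""
    return math.floor(token_limit * headroom / ratio)
```

For example, `estimate_tokens(500)` gives 650, and `max_words_for(512)` gives 354, so a model with a 512-token window calls for a setting of about 350 words or fewer.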

Multi-column or magazine layout interleaves text

Reading-order risk

Text comes out in pdf.js content-stream order, not reconstructed reading order. Two-column papers can interleave left and right columns, mid-sentence. Sanity-check a chunk from a multi-column source; if order is scrambled, the PDF needs reflow before chunking.

Free tier file or page ceiling hit

413 limit

Free tier caps PDFs at 2 MB and 50 pages. A long book exceeds the page limit before the byte limit. Upgrade to Pro (50 MB / 500 pages) or split the document with PDF Split (Fixed) first, then chunk each part.

overlap can't exceed half the chunk size

Clamped

Via the API, overlap is clamped to at most half the target word count. Requesting overlap 400 on a 400-word target is silently reduced to 200, so consecutive chunks can never be more than half-redundant.

Re-chunking changes every id

Positional ids

id is a 0-based positional index, not a content hash. Re-running with a different Max words per chunk renumbers everything, so re-embed the whole document rather than diffing by id. Use a content hash if you need stable identifiers across runs.
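
A content-derived identifier can be built with the standard library. This is one possible scheme, not something the tool produces: `stable_id` is a hypothetical helper, and the whitespace normalization and 16-character truncation are arbitrary choices.

```python
import hashlib


def stable_id(doc_name: str, text: str, length: int = 16) -> str:
    """Derive an id from the chunk text so identical text always maps to one id."""
    normalized = " ".join(text.split())  # ignore whitespace-only differences
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{doc_name}-{digest[:length]}"
```

Such ids stay the same when the same file is re-processed with the same settings. Changing the chunk size still changes the chunk text, so the hashes change too and a full re-embed is still needed.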

Ligatures or spacing artefacts in extracted text

Source-dependent

Some PDFs store fi/fl as ligature glyphs or insert spaces between every letter. pdf.js extracts what's there; the chunker doesn't clean it. If embeddings look off, pre-clean the text or extract via PDF to Text and inspect before embedding.
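
A light pre-clean pass might look like the following sketch. It relies on Unicode NFKC normalization to expand ligature code points and collapses runs of whitespace; it does not repair text with a space between every letter, which needs source-specific handling. `clean_extracted` is a hypothetical helper name.

```python
import re
import unicodedata


def clean_extracted(text: str) -> str:
    """Expand ligature glyphs and tidy whitespace before embedding."""
    text = unicodedata.normalize("NFKC", text)  # U+FB01 becomes "fi", U+FB02 becomes "fl"
    text = text.replace("\u00ad", "")           # drop soft hyphens left by the layout
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace runs
```

Apply it to each chunk's text before sending it to the embedding API, and keep the original text in metadata if exact quotes matter.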

Tables are flattened into the sentence stream

Use a sibling tool

Tabular data has no . ! ? boundaries, so a table is swept into surrounding sentences and chunks awkwardly. Extract tables separately with PDF Table to JSON and index them as structured records alongside the prose chunks.

Frequently asked questions

Is this true semantic chunking?

No. It is sentence-aware chunking with a sliding overlap: it respects . ! ? boundaries and carries trailing context into the next chunk, which is enough for many RAG pipelines. It does not compute embeddings to detect topic shifts, so it is not 'semantic' in the embedding sense. If you need topic-boundary detection, post-process these chunks with your own embedding model.

What chunk size should I pick for my embedding model?

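Set Max words per chunk to 200-500 for general retrieval. OpenAI text-embedding-3-small/large handles that range comfortably. Models with shorter windows need smaller chunks: about 350 words or fewer for a 512-token limit (documented for Cohere embed-v3 and common among HuggingFace sentence-transformers models), since text past the limit is typically truncated or rejected. Smaller chunks (150-250 words) give sharper retrieval for Q&A; larger chunks (500+) give more surrounding context per hit. Remember the field is words, not tokens: multiply by ~1.3 for a token estimate.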

How much overlap do the chunks have?

50 words, carried from the tail of each chunk into the start of the next. The browser UI keeps this fixed; only the API/runner exposes an overlap option. Overlap is clamped to at most half the chunk size, so chunks can never be more than 50% redundant.

Does each chunk tell me which page it came from?

Yes. Every chunk has page (the first page it touches) and pageRange ([start, end] covering every page it spans). Store both as vector metadata so a retrieved answer can cite the exact page.

Can it chunk a scanned PDF?

Not directly — it reads only the embedded text layer and does no OCR, so a pure scan yields no chunks. Run the scan through PDF OCR to add a text layer first, then chunk the result.

Is my document uploaded to a server?

No. Extraction and chunking run in your browser via pdf.js. The PDF never leaves your device; only an anonymous usage counter is recorded when you're signed in. For automated runs, the paired runner processes files locally on your own machine.

What vector databases does the output work with?

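Any store that accepts vectors with metadata. The output is a plain JSON array of text plus metadata, so it loads into Pinecone, Weaviate, Chroma, Qdrant, Milvus, or pgvector with a short loop: embed each text, then upsert with id/page/pageRange as metadata. Some stores restrict metadata types (Pinecone, for example, does not accept lists of numbers), so you may need to store pageRange as two numeric fields.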

How big a PDF can I chunk?

Free tier allows 2 MB and 50 pages; Pro 50 MB and 500 pages; Pro+Media 500 MB and 2,000 pages; Developer 2 GB and 10,000 pages. For very long books, split with PDF Split (Fixed) and chunk each part.

Why did one chunk come out much bigger than my target?

Because a single sentence exceeded the target and sentences are never split. The chunker flushes the buffer before the oversized sentence, then that long sentence forms its own chunk. A coherent over-target chunk retrieves far better than a mid-sentence cut.

How do I handle tables inside the PDF?

Pull them out separately. Tables have no sentence punctuation, so they get smeared into adjacent prose. Use PDF Table to JSON to extract tabular data as structured rows and index those alongside your text chunks.

Are the chunk ids stable if I re-process the file?

No. id is a positional 0-based index. Changing Max words per chunk renumbers everything, so re-embed the whole document on re-chunk. If you need IDs that survive re-runs, hash the chunk text yourself.

Can I run this as part of an automated ingestion job?

Yes. GET /api/v1/tools/pdf-to-chunks returns the option schema; pair the @jadapps/runner once and POST your PDF to 127.0.0.1:9789/v1/tools/pdf-to-chunks/run with { maxChunkSize, overlap }. Files are processed locally on your machine, so confidential documents never reach JAD's servers — ideal for a scheduled corpus re-index.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.


