Chunk a PDF for a Knowledge Base — Free Online Tool

How to chunk a PDF for ingestion into a knowledge base

Step 1
Decide whether you even need to chunk yourself — If your KB is a managed assistant that chunks on upload (e.g. OpenAI's file search), upload the PDF directly. Use this tool when you run a custom pipeline (LlamaIndex, LangChain, your own vector search) and want control over chunk boundaries.
Step 2
Load the PDF into the chunker — Open PDF to Text Chunks and drop the document. Extraction runs in your browser via pdf.js — the file is not uploaded.
Step 3
Pick a chunk size for your retrieval strategy — Hybrid and keyword-heavy KBs often prefer slightly larger chunks (400-600 words) so each holds enough lexical signal; pure dense retrieval can go smaller (200-400). Set Max words per chunk accordingly (default 500).
Step 4
Generate the chunks JSON — The tool emits { id, page, pageRange, text, tokens } per chunk, downloaded as <filename>.chunks.json.
Step 5
Build KB nodes with stable ids and metadata — Map each chunk to a LlamaIndex Node or LangChain Document. Attach page/pageRange and your own document id; derive a stable node id (doc + content hash), since the chunker's id renumbers on re-chunk.
Step 6
Ingest and test with real questions — Index the nodes, then ask the questions your users will. Confirm the cited page/pageRange are correct and the chunk size gives precise hits; adjust and re-ingest if answers are too broad or too fragmented.
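
Before building nodes in Step 5, it can help to confirm that the downloaded file has the shape described in Step 4. The following is a minimal sketch, assuming the file is a JSON array of objects with the five fields listed above; the function name `validate_chunks` and the sample values are illustrative, not part of the tool.

```python
import json

# Field names as listed in Step 4
REQUIRED_FIELDS = ("id", "page", "pageRange", "text", "tokens")

def validate_chunks(chunks):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for i, chunk in enumerate(chunks):
        missing = [f for f in REQUIRED_FIELDS if f not in chunk]
        if missing:
            problems.append(f"chunk {i}: missing {', '.join(missing)}")
        elif not chunk["text"].strip():
            problems.append(f"chunk {i}: empty text")
    return problems

# Inline sample standing in for the contents of <filename>.chunks.json
# (values are made up for illustration)
sample = json.loads("""[
  {"id": 0, "page": 1, "pageRange": [1, 1], "text": "PTO accrues monthly.", "tokens": 3},
  {"id": 1, "page": 2, "text": "Missing a field.", "tokens": 3}
]""")

problems = validate_chunks(sample)
print(problems)
```

An empty result means every chunk has all five fields and non-blank text, so the mapping in Step 5 should not hit a missing key.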

Where this tool fits per knowledge-base platform

Use the chunker when you control ingestion; skip it when the platform chunks for you.

KB platform	Does it chunk for you?	Use this tool?
OpenAI Assistants / file search	Yes — chunks internally on upload	Optional — only if you want your own boundaries
LlamaIndex (custom)	You choose the node parser	Yes — feed chunks as Nodes with page metadata
LangChain (custom)	You choose the text splitter	Yes — feed chunks as Documents
Custom vector search	No — you own chunking	Yes — this is the chunking step
Managed RAG SaaS	Usually yes	Optional — check whether you can override chunking

Chunk JSON to LlamaIndex / LangChain mapping

How the emitted fields land in the two most common frameworks.

Chunk field	LlamaIndex	LangChain
`text`	`TextNode.text`	`Document.page_content`
`page`	`metadata['page']`	`metadata['page']`
`pageRange`	`metadata['page_range']`	`metadata['page_range']`
`id`	Compose `node_id` (don't reuse raw)	Compose stable id (don't reuse raw)
`tokens`	Diagnostics only (word count)	Diagnostics only (word count)

Cookbook

Recipes for turning chunk JSON into an indexed, citable knowledge base.

LlamaIndex nodes from the chunk JSON

Build TextNodes with page metadata and a content-stable id so re-ingestion is safe.

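import hashlib
import json
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
# Load the chunker's output
with open('handbook.chunks.json', encoding='utf-8') as f:
    chunks = json.load(f)
# Hash the text so ids survive re-chunking; 'hb-' is the document prefix
nodes = [TextNode(
    text=c['text'],
    id_='hb-' + hashlib.sha1(c['text'].encode('utf-8')).hexdigest()[:12],
    metadata={'page': c['page'], 'page_range': c['pageRange']}
) for c in chunks]
# Embeds with whichever embedding model LlamaIndex is configured to use
index = VectorStoreIndex(nodes)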

LangChain documents with page metadata

Each chunk becomes a Document; page fields ride along for citation.

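import json
from langchain_core.documents import Document
with open('handbook.chunks.json', encoding='utf-8') as f:
    chunks = json.load(f)
docs = [Document(
    page_content=c['text'],
    metadata={'page': c['page'],
              'page_range': c['pageRange'],
              'source': 'handbook.pdf'}
) for c in chunks]
# vectorstore: any LangChain VectorStore you have already created
vectorstore.add_documents(docs)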

Citing the source page in a KB answer

Because every chunk carries provenance, the assistant can ground its answer.

Retrieved node metadata: { page: 12, page_range: [12, 13] }

System prompt: "Answer from the context and cite
  the handbook page as [p.N]."
Assistant: "PTO accrues at 1.5 days/month [p.12]."
# a reviewer can open p.12 and verify.
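
One way to make the citation easy for the model to reproduce is to prefix each retrieved chunk with its tag before it goes into the prompt. This is a minimal sketch, assuming metadata keys `page` and `page_range` as in the mapping table; the helper names and the `[p.N-M]` range form are illustrative additions, not something the tool emits.

```python
def format_citation(metadata):
    """Turn node metadata into a [p.N] tag, or [p.N-M] when the chunk spans pages."""
    page_range = metadata.get("page_range")
    if page_range and len(page_range) == 2 and page_range[0] != page_range[1]:
        return f"[p.{page_range[0]}-{page_range[1]}]"
    return f"[p.{metadata['page']}]"

def build_context(retrieved):
    """Join (text, metadata) pairs into a context string with citation tags."""
    return "\n\n".join(f"{format_citation(meta)} {text}" for text, meta in retrieved)

retrieved = [
    ("PTO accrues at 1.5 days/month.", {"page": 12, "page_range": [12, 13]}),
    ("Unused PTO rolls over once.", {"page": 14, "page_range": [14, 14]}),
]
print(build_context(retrieved))
```

If you prefer the single-page `[p.N]` style shown above, drop the range branch and always cite `metadata['page']`.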

Tables go to a structured index, not the prose chunks

Keep tabular knowledge as structured records so numeric lookups stay accurate.

# prose -> pdf-to-chunks -> text nodes
# tables -> pdf-table-to-json -> structured rows
# index both; route table-style questions to the
# structured store, narrative questions to the chunks.

Re-ingesting an updated handbook safely

Use content-stable ids and a version tag so an updated PDF doesn't orphan old nodes.

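import hashlib
# chunks: the list loaded from <filename>.chunks.json
version = 'handbook-2026Q2'
node_ids = []
for c in chunks:
    digest = hashlib.sha1(c['text'].encode('utf-8')).hexdigest()[:12]
    node_ids.append(version + '-' + digest)
# delete old version's nodes, then upsert new ones,
# rather than relying on the chunker's positional id.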
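
The delete-then-upsert flow can be sketched end to end against a plain dict standing in for the vector store. This is an illustration under assumptions: `store`, `node_id`, and `reingest` are hypothetical names, and a real store would use its own delete and upsert calls.

```python
import hashlib

def node_id(version, text):
    """Version tag plus a short content hash, as in the recipe above."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return f"{version}-{digest}"

def reingest(store, old_version, new_version, chunks):
    """Delete every node of old_version, then upsert the new version's nodes."""
    stale = [key for key in store if key.startswith(old_version + "-")]
    for key in stale:
        del store[key]
    for chunk in chunks:
        store[node_id(new_version, chunk["text"])] = chunk
    return store

# A dict standing in for the vector store (hypothetical)
store = {
    node_id("handbook-2026Q1", "PTO accrues at 1 day/month."): {"text": "PTO accrues at 1 day/month."},
    node_id("handbook-2026Q1", "Expenses are due in 30 days."): {"text": "Expenses are due in 30 days."},
}
new_chunks = [
    {"text": "PTO accrues at 1.5 days/month."},
    {"text": "Expenses are due in 30 days."},
]
reingest(store, "handbook-2026Q1", "handbook-2026Q2", new_chunks)
print(sorted(store))
```

Matching on `old_version + "-"` rather than the bare tag avoids deleting nodes of a different version whose tag merely starts with the same characters.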

Edge cases and what actually happens

KB platform already chunks for you

May be redundant

Managed assistants like OpenAI's file search chunk on upload. Running this tool first is unnecessary unless you specifically want your own boundaries and page metadata. Use it for custom pipelines (LlamaIndex, LangChain, bespoke search) where you own chunking.

Using the positional id as a KB node key

Re-ingest hazard

id renumbers whenever you re-chunk, so a handbook update can shuffle ids and orphan or overwrite the wrong nodes. Derive node ids from document id + content hash, scoped to a version tag.
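
The hazard is easy to see in a toy example. The sketch below is simplified: it assumes unchanged passages produce identical chunk text after the update, which will not hold for chunks whose boundaries shift. The `stable_id` helper and the `hb` prefix are illustrative.

```python
import hashlib

def stable_id(doc_id, text):
    """Document id plus a short hash of the chunk text."""
    return f"{doc_id}-{hashlib.sha1(text.encode('utf-8')).hexdigest()[:12]}"

# Chunk texts before and after a section is inserted into the handbook
before = ["Intro.", "PTO policy.", "Expenses."]
after = ["Intro.", "New remote-work policy.", "PTO policy.", "Expenses."]

# Positional ids: every chunk after the insertion point gets a new number
positional_before = {text: i for i, text in enumerate(before)}
positional_after = {text: i for i, text in enumerate(after)}
moved = [t for t in before if positional_before[t] != positional_after[t]]
print(moved)  # ['PTO policy.', 'Expenses.']

# Content-hash ids: unchanged chunks keep the same id
stable_before = {stable_id("hb", t) for t in before}
stable_after = {stable_id("hb", t) for t in after}
print(stable_before <= stable_after)  # True
```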

Scanned manual has no text to index

No text layer

The chunker does no OCR. A scanned policy or runbook yields empty chunks, leaving gaps in your KB. Run image-only documents through PDF OCR before chunking.
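
A quick check on the chunk JSON can flag a scanned document before it reaches the index. This is a minimal sketch; the `min_chars` threshold of 20 and the function name are assumptions chosen for illustration.

```python
def empty_chunk_report(chunks, min_chars=20):
    """Return (share of near-empty chunks, sorted pages they came from)."""
    if not chunks:
        # No chunks at all is treated as fully empty
        return 1.0, []
    empty = [c for c in chunks if len(c.get("text", "").strip()) < min_chars]
    pages = sorted({c["page"] for c in empty})
    return len(empty) / len(chunks), pages

sample = [
    {"page": 1, "text": "Employees accrue PTO at 1.5 days per month."},
    {"page": 2, "text": ""},
    {"page": 3, "text": "   "},
    {"page": 4, "text": "Expense reports are due within 30 days."},
]
share, pages = empty_chunk_report(sample)
print(share, pages)  # 0.5 [2, 3]
```

A high share suggests the PDF is image-only, and the page list shows where OCR is needed.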

Tables flattened into prose chunks

Use a sibling tool

Benefit schedules, rate cards, and spec tables lack sentence punctuation and chunk poorly, so numeric lookups miss. Extract them with PDF Table to JSON and index the structured rows separately.

Headers/footers repeat across every chunk

Retrieval noise

Running headers ('Confidential — Internal Use') and footers get pulled into the text on every page and dilute embeddings. Strip boilerplate before ingestion, or post-process the chunk text, since the chunker keeps everything pdf.js extracts.
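
Post-processing can be as simple as removing known header and footer phrases from each chunk's text. The sketch below assumes you already know the boilerplate strings; the phrase "Internal Use Only" and the helper name are hypothetical. It tolerates irregular whitespace, since extracted text does not always preserve the original spacing.

```python
import re

def strip_boilerplate(text, phrases):
    """Remove each known phrase, matching any whitespace between its words."""
    for phrase in phrases:
        words = phrase.split()
        if not words:
            continue  # skip empty phrases, which would match everywhere
        pattern = r"\s+".join(re.escape(word) for word in words)
        text = re.sub(pattern, " ", text)
    # Collapse the gaps left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

raw = "Internal Use Only PTO accrues monthly. Internal  Use\nOnly"
print(strip_boilerplate(raw, ["Internal Use Only"]))  # PTO accrues monthly.
```

Run it over `chunk['text']` before embedding, and keep the phrase list short so legitimate sentences are not clipped.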

Multi-column knowledge doc scrambles order

Reading-order risk

Two-column manuals can interleave columns in a chunk because extraction follows the order in which text appears in the PDF's content, not the visual reading order. Verify a sample and reflow the source if needed before building the KB.

tokens not equal to model tokens

Expected

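The tokens field is a word count for budgeting, not the BPE count your embedder or LLM uses. As a rough rule for English text, multiply by about 1.3 when checking chunk sizes against a model limit, and use the model's own tokenizer when you need an exact count.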
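
The budgeting check can be sketched as follows. It applies the 1.3 multiplier as a heuristic only; the function names and the 512-token limit are illustrative assumptions, and the limit should be replaced with your embedder's actual maximum.

```python
import math

def estimate_model_tokens(word_count, ratio=1.3):
    """Rough model-token estimate from the chunker's word count."""
    return math.ceil(word_count * ratio)

def over_limit(chunks, limit, ratio=1.3):
    """Return ids of chunks whose estimated token count exceeds the limit."""
    return [c["id"] for c in chunks
            if estimate_model_tokens(c["tokens"], ratio) > limit]

sample = [
    {"id": "a", "tokens": 300},  # about 390 model tokens
    {"id": "b", "tokens": 500},  # about 650 model tokens
]
print(over_limit(sample, limit=512))  # ['b']
```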

Overlap fixed at 50 in the browser

Limited control

The browser tool applies a fixed 50-word overlap. For KBs that benefit from denser overlap between neighbouring nodes, use the API, where overlap is configurable (capped at half the chunk size).
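
To see what the half-chunk cap means in practice, here is a small illustration using plain word windows. It is not the tool's algorithm, which is sentence-aware; `clamp_overlap` and `word_windows` are hypothetical helpers that only show how overlap and the cap interact.

```python
def clamp_overlap(requested, max_words):
    """Cap the overlap at half the chunk size, never below zero."""
    return max(0, min(requested, max_words // 2))

def word_windows(words, max_words, overlap):
    """Split a word list into windows that share `overlap` words with the previous one."""
    overlap = clamp_overlap(overlap, max_words)
    step = max_words - overlap  # always at least 1 when max_words >= 1
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break
    return windows

print(clamp_overlap(400, 500))  # 250
windows = word_windows(list(range(10)), max_words=4, overlap=3)
print(windows)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The requested overlap of 3 is reduced to 2 because the window holds 4 words, so each window repeats the last two words of its predecessor.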

Manual exceeds the tier page/byte limit

413 limit

Free allows 50 pages / 2 MB; Pro 500 pages / 50 MB. A thick handbook hits the page ceiling first. Split with PDF Split (Fixed) and ingest each part, or upgrade the plan.
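
A pre-flight check against these limits might look like the sketch below. It assumes 1 MB means 1024 × 1024 bytes, which the page does not specify; the tier keys and function names are illustrative.

```python
# Limits as stated above; the byte conversion is an assumption
TIER_LIMITS = {
    "free": {"pages": 50, "bytes": 2 * 1024 * 1024},
    "pro": {"pages": 500, "bytes": 50 * 1024 * 1024},
}

def check_limits(page_count, size_bytes, tier="free"):
    """Return a list of limit violations for the given tier."""
    limits = TIER_LIMITS[tier]
    problems = []
    if page_count > limits["pages"]:
        problems.append(f"pages: {page_count} > {limits['pages']}")
    if size_bytes > limits["bytes"]:
        problems.append(f"bytes: {size_bytes} > {limits['bytes']}")
    return problems

def parts_needed(page_count, tier="free"):
    """How many fixed-size parts a split must produce to fit the page limit."""
    per_part = TIER_LIMITS[tier]["pages"]
    return -(-page_count // per_part)  # ceiling division

print(check_limits(120, 1_500_000, "free"))  # ['pages: 120 > 50']
print(parts_needed(120, "free"))  # 3
```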

Frequently asked questions

Can I use this to populate an OpenAI Assistant's knowledge base?

You can, but you usually don't need to. The OpenAI Assistants API and file search chunk the PDF for you on upload, so direct upload is simplest. Use this tool when you run a custom pipeline (LlamaIndex, LangChain, your own vector search) and want control over chunk boundaries and page metadata.

How do I get the chunks into LlamaIndex or LangChain?

Map each chunk to a node/document: chunk text becomes TextNode.text / Document.page_content, and page/pageRange go into metadata. Derive a stable id from document id + content hash rather than reusing the chunker's positional id. A short loop is all it takes.

How should I handle tables in the document?

Extract them separately with PDF Table to JSON. Tables lack sentence punctuation and chunk poorly, so flattening them into prose makes numeric lookups unreliable. Index the structured rows alongside the text chunks and route table-style questions there.

What's the best retrieval strategy for chunked PDFs?

Hybrid search (dense vector similarity plus BM25 keyword matching) generally outperforms pure vector search for document KBs, because keyword signal catches exact terms (product codes, policy numbers) that embeddings blur. Slightly larger chunks (400-600 words) carry more lexical signal for the keyword side.
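
One common way to merge the dense and keyword result lists is reciprocal rank fusion. The page does not prescribe a fusion method, so treat this as an illustrative sketch; the chunk ids are made up, and k=60 is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists; each appearance adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by id for a deterministic order
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["c1", "c2", "c3"]  # from vector similarity
bm25_hits = ["c3", "c1", "c4"]   # from keyword search
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that rank well in both lists rise to the top, which is how an exact product code matched by BM25 can surface alongside semantically similar passages.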

Will retrieved answers cite the source page?

Yes, if you keep the chunk's page/pageRange as node metadata and instruct the assistant to cite it. A reviewer can then open that page in the original PDF and verify the answer — important for support, HR, and compliance knowledge.

Is this real semantic chunking?

It is sentence-aware chunking with overlap, not embedding-based topic segmentation. It keeps sentences whole and overlaps adjacent chunks so context carries across each boundary, which is what most knowledge-base retrieval needs. If you want topic-boundary detection, run an embedding pass over these chunks afterward.

Does it chunk scanned PDFs?

No — it reads only the embedded text layer and does no OCR. A scanned manual yields empty chunks. OCR it with PDF OCR first, then chunk the searchable output.

How do I re-ingest an updated document without breaking the KB?

Tag a version (e.g. handbook-2026Q2), derive node ids from version + content hash, delete the old version's nodes, then upsert the new ones. Don't rely on the chunker's positional id, which renumbers on every re-chunk.

What chunk size should I use?

Start at the default 500 words and adjust based on retrieval behaviour. If answers are too broad, use smaller chunks; if they are fragmented or miss context, use larger ones. Hybrid KBs lean a little larger for keyword signal; dense-only KBs can go smaller.

Is my knowledge content uploaded anywhere?

No. Chunking runs in your browser with pdf.js, so internal handbooks and runbooks never leave your device. For automated ingestion, the paired runner processes files locally on your own machine.

What about repeated headers and footers?

pdf.js extracts everything on the page, so running headers and footers appear in the text and can dilute embeddings. Strip boilerplate before ingestion or clean the chunk text afterward; the chunker doesn't remove it for you.

Can I automate KB ingestion of many PDFs?

Yes. Read the schema from GET /api/v1/tools/pdf-to-chunks, pair the @jadapps/runner, and POST each PDF to 127.0.0.1:9789/v1/tools/pdf-to-chunks/run. Files are processed locally, so a whole library of proprietary documents can be chunked and ingested without leaving your network.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.

