How to chunk a PDF for ingestion into a knowledge base
- Step 1: Decide whether you even need to chunk yourself. If your KB is a managed assistant that chunks on upload (e.g. OpenAI's file search), upload the PDF directly. Use this tool when you run a custom pipeline (LlamaIndex, LangChain, your own vector search) and want control over chunk boundaries.
- Step 2: Load the PDF into the chunker. Open PDF to Text Chunks and drop the document. Extraction runs in your browser via pdf.js; the file is not uploaded.
- Step 3: Pick a chunk size for your retrieval strategy. Hybrid and keyword-heavy KBs often prefer slightly larger chunks (400-600 words) so each holds enough lexical signal; pure dense retrieval can go smaller (200-400). Set Max words per chunk accordingly (default 500).
- Step 4: Generate the chunks JSON. The tool emits { id, page, pageRange, text, tokens } per chunk, downloaded as <filename>.chunks.json.
- Step 5: Build KB nodes with stable ids and metadata. Map each chunk to a LlamaIndex Node or LangChain Document. Attach page/pageRange and your own document id; derive a stable node id (doc + content hash), since the chunker's id renumbers on re-chunk.
- Step 6: Ingest and test with real questions. Index the nodes, then ask the questions your users will. Confirm the cited page/pageRange are correct and the chunk size gives precise hits; adjust and re-ingest if answers are too broad or too fragmented.
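When adjusting chunk size between ingestion runs, it can help to see how large the emitted chunks actually are. The following is a minimal sketch, assuming the downloaded file is a JSON array of chunk objects and that tokens holds the per-chunk word count; load_chunks and summarize_chunk_sizes are hypothetical helper names, not part of the tool.

```python
import json
import statistics


def load_chunks(path):
    """Read a <filename>.chunks.json file (assumed to be a JSON array)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize_chunk_sizes(chunks):
    """Return count, min, median and max of the per-chunk word counts."""
    sizes = [c["tokens"] for c in chunks]
    if not sizes:
        return {"count": 0, "min": 0, "median": 0, "max": 0}
    return {
        "count": len(sizes),
        "min": min(sizes),
        "median": statistics.median(sizes),
        "max": max(sizes),
    }
```

A median far below the configured maximum may suggest that many chunks end early, for example at short pages, which is worth knowing before changing Max words per chunk.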
Where this tool fits per knowledge-base platform
Use the chunker when you control ingestion; skip it when the platform chunks for you.
| KB platform | Does it chunk for you? | Use this tool? |
|---|---|---|
| OpenAI Assistants / file search | Yes — chunks internally on upload | Optional — only if you want your own boundaries |
| LlamaIndex (custom) | You choose the node parser | Yes — feed chunks as Nodes with page metadata |
| LangChain (custom) | You choose the text splitter | Yes — feed chunks as Documents |
| Custom vector search | No — you own chunking | Yes — this is the chunking step |
| Managed RAG SaaS | Usually yes | Optional — check whether you can override chunking |
Chunk JSON to LlamaIndex / LangChain mapping
How the emitted fields land in the two most common frameworks.
| Chunk field | LlamaIndex | LangChain |
|---|---|---|
| text | TextNode.text | Document.page_content |
| page | metadata['page'] | metadata['page'] |
| pageRange | metadata['page_range'] | metadata['page_range'] |
| id | Compose node_id (don't reuse raw) | Compose stable id (don't reuse raw) |
| tokens | Diagnostics only (word count) | Diagnostics only (word count) |
Cookbook
Recipes for turning chunk JSON into an indexed, citable knowledge base.
LlamaIndex nodes from the chunk JSON
Build TextNodes with page metadata and a content-stable id so re-ingestion is safe.
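import hashlib
import json
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
# Load the chunk list emitted by the tool.
with open('handbook.chunks.json', encoding='utf-8') as f:
    chunks = json.load(f)
# One TextNode per chunk; the id comes from a content hash, not the chunker's positional id.
nodes = [TextNode(
    text=c['text'],
    id_='hb-' + hashlib.sha1(c['text'].encode()).hexdigest()[:12],
    metadata={'page': c['page'], 'page_range': c['pageRange']},
) for c in chunks]
# Embeds with whatever embedding model LlamaIndex is configured to use.
index = VectorStoreIndex(nodes)
LangChain documents with page metadata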
Each chunk becomes a Document; page fields ride along for citation.
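from langchain_core.documents import Document
# `chunks` is the list loaded from handbook.chunks.json, as in the LlamaIndex recipe above.
docs = [Document(
    page_content=c['text'],
    metadata={'page': c['page'],
              'page_range': c['pageRange'],
              'source': 'handbook.pdf'},
) for c in chunks]
# `vectorstore` is any LangChain vector store you have already initialised.
vectorstore.add_documents(docs)
Citing the source page in a KB answer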
Because every chunk carries provenance, the assistant can ground its answer.
Retrieved node metadata: { page: 12, page_range: [12, 13] }
System prompt: "Answer from the context and cite
the handbook page as [p.N]."
Assistant: "PTO accrues at 1.5 days/month [p.12]."
# a reviewer can open p.12 and verify.
Tables go to a structured index, not the prose chunks
Keep tabular knowledge as structured records so numeric lookups stay accurate.
# prose  -> pdf-to-chunks     -> text nodes
# tables -> pdf-table-to-json -> structured rows
# index both; route table-style questions to the
# structured store, narrative questions to the chunks.
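The routing step can be as simple as a keyword check in front of the two stores. This is only an illustrative sketch: the hint list and the route_question name are assumptions, and a production router would more likely use a classifier or let the LLM pick the store.

```python
import re

# Hypothetical hint words that suggest a numeric or lookup-style question.
TABLE_HINTS = re.compile(
    r"\b(how much|how many|rate|price|cost|limit|maximum|minimum)\b",
    re.IGNORECASE,
)


def route_question(question):
    """Return 'tables' for lookup-style questions, 'chunks' for narrative ones."""
    return "tables" if TABLE_HINTS.search(question) else "chunks"
```

Questions the heuristic sends to the wrong store are a useful test set when you later replace it with something smarter.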
Re-ingesting an updated handbook safely
Use content-stable ids and a version tag so an updated PDF doesn't orphan old nodes.
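import hashlib
version = 'handbook-2026Q2'
for c in chunks:
    # Version tag + content hash: stable across re-chunking, unlike the positional id.
    digest = hashlib.sha1(c['text'].encode()).hexdigest()[:12]
    node_id = version + '-' + digest
# delete old version's nodes, then upsert new ones,
# rather than relying on the chunker's positional id.
Edge cases and what actually happens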
KB platform already chunks for you
May be redundant: Managed assistants like OpenAI's file search chunk on upload. Running this tool first is unnecessary unless you specifically want your own boundaries and page metadata. Use it for custom pipelines (LlamaIndex, LangChain, bespoke search) where you own chunking.
Using the positional id as a KB node key
Re-ingest hazard: The positional id renumbers whenever you re-chunk, so a handbook update can shuffle ids and orphan or overwrite the wrong nodes. Derive node ids from document id + content hash, scoped to a version tag.
Scanned manual has no text to index
No text layer: The tool does no OCR. A scanned policy or runbook yields empty chunks, leaving gaps in your KB. Run image-only documents through PDF OCR before chunking.
Tables flattened into prose chunks
Use a sibling tool: Benefit schedules, rate cards, and spec tables lack sentence punctuation and chunk poorly, so numeric lookups miss. Extract them with PDF Table to JSON and index the structured rows separately.
Headers/footers repeat across every chunk
Retrieval noise: Running headers ('Confidential — Internal Use') and footers get pulled into the text on every page and dilute embeddings. Strip boilerplate before ingestion, or post-process the chunk text, since the chunker keeps everything pdf.js extracts.
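Post-processing the chunk text can be a short pass over the JSON. The sketch below assumes you already know the exact header and footer strings for your document; strip_boilerplate is a hypothetical helper, and it also collapses runs of whitespace (including line breaks) into single spaces.

```python
import re


def strip_boilerplate(text, phrases):
    """Remove each known boilerplate phrase, then normalise whitespace."""
    for phrase in phrases:
        text = text.replace(phrase, " ")
    return re.sub(r"\s+", " ", text).strip()


def clean_chunks(chunks, phrases):
    """Return copies of the chunks with boilerplate removed from 'text'."""
    return [{**c, "text": strip_boilerplate(c["text"], phrases)} for c in chunks]
```

Because stable node ids are derived from the text, clean the chunks before hashing so the ids reflect what you actually index.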
Multi-column knowledge doc scrambles order
Reading-order risk: Two-column manuals can interleave columns in a chunk because extraction follows content order, not reading order. Verify a sample and reflow the source if needed before building the KB.
tokens not equal to model tokens
Expected: The tokens field is a word count for budgeting, not the BPE count your embedder or LLM uses. Multiply by ~1.3 when checking chunk sizes against a model limit.
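As a rough budgeting check, the ~1.3 multiplier can be applied directly to the tokens field. This is a sketch under that rule of thumb only; the function names are hypothetical, and the real count depends on your model's tokenizer and the language of the text.

```python
import math


def estimate_model_tokens(word_count, ratio=1.3):
    """Approximate model tokens from a word count (rule of thumb, rounded up)."""
    return math.ceil(word_count * ratio)


def fits_limit(chunk, model_limit, ratio=1.3):
    """True if the chunk's estimated token count is within model_limit."""
    return estimate_model_tokens(chunk["tokens"], ratio) <= model_limit
```

For a hard guarantee, count tokens with the tokenizer your embedder actually uses rather than relying on the estimate.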
Overlap fixed at 50 in the browser
Limited control: In the browser, overlap is fixed at 50 words. For KBs that benefit from denser overlap between neighbouring nodes, use the API, where overlap is configurable (capped at half the chunk size).
Manual exceeds the tier page/byte limit
413 limit: Free allows 50 pages / 2 MB; Pro allows 500 pages / 50 MB. A thick handbook hits the page ceiling first. Split with PDF Split (Fixed) and ingest each part, or upgrade the plan.
Frequently asked questions
Can I use this to populate an OpenAI Assistant's knowledge base?
You can, but you usually don't need to. The OpenAI Assistants API and file search chunk the PDF for you on upload, so direct upload is simplest. Use this tool when you run a custom pipeline (LlamaIndex, LangChain, your own vector search) and want control over chunk boundaries and page metadata.
How do I get the chunks into LlamaIndex or LangChain?
Map each chunk to a node/document: chunk text becomes TextNode.text / Document.page_content, and page/pageRange go into metadata. Derive a stable id from document id + content hash rather than reusing the chunker's positional id. A short loop is all it takes.
How should I handle tables in the document?
Extract them separately with PDF Table to JSON. Tables lack sentence punctuation and chunk poorly, so flattening them into prose makes numeric lookups unreliable. Index the structured rows alongside the text chunks and route table-style questions there.
What's the best retrieval strategy for chunked PDFs?
Hybrid search (dense vector similarity plus BM25 keyword matching) generally outperforms pure vector search for document KBs, because keyword signal catches exact terms (product codes, policy numbers) that embeddings blur. Slightly larger chunks (400-600 words) carry more lexical signal for the keyword side.
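One common way to merge the dense and BM25 result lists is reciprocal rank fusion (RRF). The text above does not name a fusion method, so RRF is an assumption here, as is the conventional constant k=60; the inputs are plain lists of node ids, best match first, from whichever retrievers you run.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit scores 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] = scores.get(node_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))
```

RRF uses only ranks, so it avoids having to normalise cosine similarities against BM25 scores, which live on different scales.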
Will retrieved answers cite the source page?
Yes, if you keep the chunk's page/pageRange as node metadata and instruct the assistant to cite it. A reviewer can then open that page in the original PDF and verify the answer — important for support, HR, and compliance knowledge.
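A small helper pair can label the retrieved context with pages and then check that the answer only cites pages that were actually retrieved. This is a sketch, assuming each retrieved node is a dict with text and a metadata dict holding page and a two-element page_range, and that citations use the [p.N] form; both function names are hypothetical.

```python
import re

CITATION = re.compile(r"\[p\.(\d+)\]")


def build_context(retrieved):
    """Prefix each retrieved text with its page label for the prompt."""
    return "\n\n".join(
        f"[p.{node['metadata']['page']}] {node['text']}" for node in retrieved
    )


def unsupported_citations(answer, retrieved):
    """Return cited pages that fall outside every retrieved page range."""
    cited = {int(page) for page in CITATION.findall(answer)}
    covered = set()
    for node in retrieved:
        first, last = node["metadata"]["page_range"]
        covered.update(range(first, last + 1))
    return sorted(cited - covered)
```

An empty list from unsupported_citations does not prove the answer is right, only that every cited page was part of the retrieved context.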
Is this real semantic chunking?
It is sentence-aware chunking with overlap, not embedding-based topic segmentation. It keeps sentences whole and bridges seams, which is what most knowledge-base retrieval needs. If you want topic-boundary detection, run an embedding pass over these chunks afterward.
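To make the idea concrete, here is a minimal sketch of sentence-aware chunking with overlap. It is not the tool's implementation: the regex sentence splitter, the function name, and the choice to carry whole trailing sentences forward as overlap are all assumptions made for illustration.

```python
import re


def chunk_sentences(text, max_words=500, overlap_words=50):
    """Group whole sentences into chunks, repeating trailing sentences as overlap."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is used.
            carried, carried_words = [], 0
            for previous in reversed(current):
                size = len(previous.split())
                if carried_words + size > overlap_words:
                    break
                carried.insert(0, previous)
                carried_words += size
            current, count = carried, carried_words
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Sentences are never split, so a single sentence longer than max_words produces an oversized chunk in this sketch.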
Does it chunk scanned PDFs?
No — it reads only the embedded text layer and does no OCR. A scanned manual yields empty chunks. OCR it with PDF OCR first, then chunk the searchable output.
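Before ingesting, a quick scan of the chunk JSON can flag documents that probably need OCR. This is a sketch, assuming empty or whitespace-only text marks a page without a text layer; pages_needing_ocr is a hypothetical helper name.

```python
def pages_needing_ocr(chunks):
    """Return the sorted page numbers of chunks whose text is empty."""
    return sorted({c["page"] for c in chunks if not c.get("text", "").strip()})
```

If the file contains no chunks at all, treat the whole document as a candidate for OCR rather than relying on this check.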
How do I re-ingest an updated document without breaking the KB?
Tag a version (e.g. handbook-2026Q2), derive node ids from version + content hash, delete the old version's nodes, then upsert the new ones. Don't rely on the chunker's positional id, which renumbers on every re-chunk.
What chunk size should I use?
Start at the default 500 words and adjust by retrieval behaviour. Too-broad answers mean smaller chunks; fragmented or missing context means larger chunks. Hybrid KBs lean a little larger for keyword signal; dense-only KBs can go smaller.
Is my knowledge content uploaded anywhere?
No. Chunking runs in your browser with pdf.js, so internal handbooks and runbooks never leave your device. For automated ingestion, the paired runner processes files locally on your own machine.
What about repeated headers and footers?
pdf.js extracts everything on the page, so running headers and footers appear in the text and can dilute embeddings. Strip boilerplate before ingestion or clean the chunk text afterward; the chunker doesn't remove it for you.
Can I automate KB ingestion of many PDFs?
Yes. Read the schema from GET /api/v1/tools/pdf-to-chunks, pair the @jadapps/runner, and POST each PDF to 127.0.0.1:9789/v1/tools/pdf-to-chunks/run. Files are processed locally, so a whole library of proprietary documents can be chunked and ingested without leaving your network.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.