How to convert a PDF to text chunks for vector database ingestion
- Step 1: Load the PDF into the chunker. Open PDF to Text Chunks and drop your document. Extraction runs locally via pdf.js; nothing is uploaded ahead of your embedding step.
- Step 2: Pick a chunk size for your index. Set Max words per chunk (50-2000). Smaller chunks raise precision and index size; larger chunks return more context per hit. 200-500 words suits most general document indexes.
- Step 3: Generate and download the JSON. The tool emits an array of { id, page, pageRange, text, tokens } objects as <filename>.chunks.json.
- Step 4: Build a stable id for upserts. Don't reuse the chunker's positional id as your vector key across re-runs. Compose a stable key like <docId>-<sha1(text)>, or <docId>-<id> scoped to one ingestion version, so re-chunking doesn't collide with old vectors.
- Step 5: Embed each chunk in batches. Loop over the array, send each chunk's text to your embedding API (batch to respect rate limits), and attach page, pageRange, and text as metadata on each vector.
- Step 6: Upsert and verify with a test query. Upsert into your vector DB, then run known queries and confirm the returned page/pageRange values point at the right sections. Adjust Max words per chunk if hits are too broad or too narrow.
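Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal sketch rather than the tool's own code: the helper names stable_key and batched, the "v1" version tag, and the batch size are assumptions chosen for illustration.

```python
import hashlib
from typing import Iterator, List


def stable_key(doc_id: str, version: str, text: str) -> str:
    """Hypothetical helper: doc id + version tag + first 12 hex chars of SHA-1(text)."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return f"{doc_id}-{version}-{digest}"


def batched(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Sample data in the shape the tool emits.
chunks = [
    {"id": 0, "page": 1, "pageRange": [1, 1], "text": "First chunk.", "tokens": 2},
    {"id": 1, "page": 2, "pageRange": [2, 3], "text": "Second chunk.", "tokens": 2},
    {"id": 2, "page": 3, "pageRange": [3, 3], "text": "Third chunk.", "tokens": 2},
]

keys = [stable_key("manual", "v1", c["text"]) for c in chunks]
groups = list(batched(chunks, 2))  # each group would be one embedding request
```

Because the key depends only on the document id, the version tag, and the chunk text, re-running the chunker with the same settings reproduces the same keys, while a new version tag keeps old and new vectors apart.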
Mapping the chunk JSON to vector-store metadata
Suggested field mapping. The tool emits the fields in the left column; the middle column shows how to use each one in most vector DBs.
| Chunk field | Use as | Why |
|---|---|---|
| text | Embedding input + stored text metadata | Embed it, and keep a copy so you can show or re-rank the snippet |
| page | Metadata page | Single-number citation for the answer |
| pageRange | Metadata page_start / page_end | Spans multi-page chunks; lets you deep-link a range |
| id | Part of a composite vector key, not the key alone | Positional: renumbers on re-chunk, so scope it to a doc + version |
| tokens | Diagnostics / budgeting only | Word count, not the model's billed token count |
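The mapping in the table can be expressed as one small function. This is a sketch under assumptions: to_metadata, doc_id, version, and the output name word_count are illustrative and not part of the tool's output.

```python
from typing import Any, Dict


def to_metadata(chunk: Dict[str, Any], doc_id: str, version: str) -> Dict[str, Any]:
    """Flatten one chunk into scalar metadata fields (field names are assumptions)."""
    page_start, page_end = chunk["pageRange"]
    return {
        "doc_id": doc_id,
        "version": version,
        "page": chunk["page"],
        "page_start": page_start,
        "page_end": page_end,
        "text": chunk["text"],
        # Renamed on purpose: the tool's `tokens` field is a word count.
        "word_count": chunk["tokens"],
    }


meta = to_metadata(
    {"id": 1, "page": 2, "pageRange": [2, 3], "text": "Example text.", "tokens": 2},
    doc_id="manual",
    version="v1",
)
```

The positional id is left out of the metadata deliberately; it belongs in the composite key, scoped to a document and version.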
Common vector databases and how the output fits
The JSON array is database-agnostic; these are the typical upsert idioms.
| Vector DB | Upsert idiom | Metadata note |
|---|---|---|
| Pinecone | index.upsert([(id, vec, metadata)]) | Store page/pageRange/text in metadata |
| Weaviate | batch.add_data_object(props, class) | Map page/pageRange to object properties |
| Chroma | collection.add(ids, embeddings, metadatas, documents) | documents = chunk text, metadatas = page fields |
| Qdrant | upsert(points=[PointStruct(id, vector, payload)]) | page/pageRange go in payload |
| pgvector | INSERT ... (embedding, page, page_range, text) | vector column + ordinary columns for metadata |
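One practical wrinkle with the table above: Qdrant accepts only unsigned integers or UUIDs as point ids, so a string key such as manual-v1-0123456789ab cannot be used there directly. A possible workaround, sketched below, derives a deterministic UUID from the composite key with uuid5. The helper name point_id and the choice of namespace are assumptions.

```python
import uuid

# Assumption: any fixed namespace works, as long as it never changes between runs.
KEY_NAMESPACE = uuid.NAMESPACE_URL


def point_id(composite_key: str) -> str:
    """Map a composite string key to a deterministic UUID string."""
    return str(uuid.uuid5(KEY_NAMESPACE, composite_key))


a = point_id("manual-v1-0123456789ab")
b = point_id("manual-v1-0123456789ab")  # same key, same UUID
c = point_id("manual-v2-0123456789ab")  # new version tag, different UUID
```

Keeping the original composite key in the payload as well makes it easier to trace a point back to its document and version.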
Cookbook
End-to-end recipes from chunk JSON to a populated vector index.
The output you upsert
The exact shape the tool produces, ready to map onto vector-store records.
[
  { "id": 0, "page": 1, "pageRange": [1, 1],
    "text": "...", "tokens": 487 },
  { "id": 1, "page": 2, "pageRange": [2, 3],
    "text": "...", "tokens": 502 }
]
Pinecone upsert with a stable composite id
Use a doc-scoped, content-hashed id so re-chunking never silently overwrites the wrong vector.
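import hashlib, json, openai
from pinecone import Pinecone

# Placeholder API key and index name; substitute your own.
index = Pinecone(api_key='YOUR_PINECONE_KEY').Index('manuals')

with open('manual.chunks.json') as f:
    chunks = json.load(f)

batch = []
for c in chunks:
    # Content-hashed key: identical text always maps to the same id
    vid = 'manual-' + hashlib.sha1(c['text'].encode()).hexdigest()[:12]
    vec = openai.embeddings.create(
        model='text-embedding-3-small', input=c['text']).data[0].embedding
    batch.append((vid, vec, {
        'page': c['page'], 'page_start': c['pageRange'][0],
        'page_end': c['pageRange'][1], 'text': c['text']}))
index.upsert(vectors=batch)
pgvector schema and insert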
Store the embedding and the citation metadata as ordinary columns.
CREATE TABLE doc_chunks (
  id         text PRIMARY KEY,
  page       int,
  page_start int,
  page_end   int,
  body       text,
  embedding  vector(1536)
);
-- insert one row per chunk; page_start/page_end
-- come straight from pageRange.
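The insert itself might look like the sketch below, which only prepares the SQL and the parameter tuple so it can run without a database. The helper names vector_literal and row_params are assumptions, as is the upsert behaviour on conflict; the embedding is passed as pgvector's text form ('[x,y,z]') and cast on the server.

```python
from typing import Any, Dict, List, Tuple

# Parameterised upsert for the doc_chunks table defined above.
INSERT_SQL = (
    "INSERT INTO doc_chunks (id, page, page_start, page_end, body, embedding) "
    "VALUES (%s, %s, %s, %s, %s, %s::vector) "
    "ON CONFLICT (id) DO UPDATE SET "
    "page = EXCLUDED.page, page_start = EXCLUDED.page_start, "
    "page_end = EXCLUDED.page_end, body = EXCLUDED.body, "
    "embedding = EXCLUDED.embedding"
)


def vector_literal(vec: List[float]) -> str:
    """Format an embedding in pgvector's text form, e.g. '[0.25,-0.5,1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def row_params(key: str, chunk: Dict[str, Any], vec: List[float]) -> Tuple[Any, ...]:
    """Build the parameter tuple for INSERT_SQL from one chunk and its embedding."""
    page_start, page_end = chunk["pageRange"]
    return (key, chunk["page"], page_start, page_end, chunk["text"], vector_literal(vec))


params = row_params(
    "manual-v1-0123456789ab",
    {"id": 1, "page": 2, "pageRange": [2, 3], "text": "Example text.", "tokens": 2},
    [0.25, -0.5, 1.0],  # toy 3-dimensional vector; the real column expects 1536 values
)
# With a DB-API driver such as psycopg: cursor.execute(INSERT_SQL, params)
```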
Chroma with documents + metadatas
Chroma keeps the chunk text in documents and the page fields in metadatas.
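# Assumes `collection` is an existing Chroma collection and `embeds` holds
# one embedding per chunk, in the same order as `chunks`.
collection.add(
    ids=[f"doc-{c['id']}" for c in chunks],
    documents=[c['text'] for c in chunks],
    # Scalar metadata values are the safest choice, so pageRange is split in two
    metadatas=[{'page': c['page'],
                'page_start': c['pageRange'][0],
                'page_end': c['pageRange'][1]} for c in chunks],
    embeddings=embeds,
)
Batch a folder of PDFs through the runner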
For a directory of reports, automate the chunking locally and feed each result to your embed-and-upsert job.
GET /api/v1/tools/pdf-to-chunks        # schema
for each report.pdf:
    POST 127.0.0.1:9789/v1/tools/pdf-to-chunks/run
    { "options": { "maxChunkSize": 400, "overlap": 60 } }
# files processed locally; nothing leaves your network
Edge cases and what actually happens
Using id as the vector primary key
Upsert hazard: id is a 0-based positional index that renumbers whenever you re-chunk with a different size. Use it as a vector key and a re-ingest will overwrite the wrong vectors or orphan old ones. Compose a stable key (doc id + content hash) instead.
tokens mismatched against model token billing
Expected: tokens is a whitespace word count, not the BPE count your embedding model charges for. A 500-'token' chunk is roughly 650 model tokens. Budget and size against the model limit with a ~1.3x conversion.
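The ~1.3x conversion is easy to apply before embedding. The sketch below is a rough estimate only: the function names are assumptions, and the limit of 8000 is an illustrative value, not a figure for any particular model.

```python
def estimate_model_tokens(word_count: int, ratio: float = 1.3) -> int:
    """Rough model-token estimate from the chunker's whitespace word count."""
    return round(word_count * ratio)


def fits_model_limit(word_count: int, model_limit: int, ratio: float = 1.3) -> bool:
    """True if the estimated token count stays within the model's input limit."""
    return estimate_model_tokens(word_count, ratio) <= model_limit


estimate = estimate_model_tokens(500)  # the 500-word chunk from the text above
ok = fits_model_limit(500, model_limit=8000)        # 8000 is an illustrative limit
too_big = fits_model_limit(7000, model_limit=8000)  # about 9100 estimated tokens
```

For an exact count, run the text through the tokenizer that matches your embedding model instead of relying on the ratio.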
Scanned PDF embeds blank vectors
No text layer: With no OCR, a scanned page yields no text, and embedding empty strings pollutes your index with meaningless vectors. Check that chunks are non-empty; run PDF OCR first when the source is a scan.
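A non-empty check takes only a few lines. In this sketch, the helper name split_empty is an assumption; it separates usable chunks from blank ones and reports which pages came back empty so you know what to OCR.

```python
from typing import Any, Dict, List, Tuple


def split_empty(chunks: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[int]]:
    """Return (chunks with real text, sorted page numbers of blank chunks)."""
    usable = [c for c in chunks if c.get("text", "").strip()]
    empty_pages = sorted({c["page"] for c in chunks if not c.get("text", "").strip()})
    return usable, empty_pages


sample = [
    {"id": 0, "page": 1, "pageRange": [1, 1], "text": "Real content.", "tokens": 2},
    {"id": 1, "page": 2, "pageRange": [2, 2], "text": "   ", "tokens": 0},
    {"id": 2, "page": 3, "pageRange": [3, 3], "text": "", "tokens": 0},
]
usable, empty_pages = split_empty(sample)
```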
Duplicate chunks across overlapping PDFs
Index bloat: Ingesting two PDFs that share boilerplate (headers, legal footers) creates near-duplicate vectors that crowd retrieval. Deduplicate by hashing text before upsert and skipping repeats; the chunker does not deduplicate across files.
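Hash-based deduplication could look like the sketch below. The normalisation step (lowercasing and collapsing whitespace) is an assumption; it catches repeats that differ only in case or spacing, not chunks that are merely similar.

```python
import hashlib
from typing import Any, Dict, List


def text_fingerprint(text: str) -> str:
    """SHA-1 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def dedupe(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first chunk for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for chunk in chunks:
        fingerprint = text_fingerprint(chunk["text"])
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(chunk)
    return unique


combined = [
    {"id": 0, "page": 1, "pageRange": [1, 1], "text": "Confidential. Do not copy.", "tokens": 4},
    {"id": 1, "page": 1, "pageRange": [1, 1], "text": "Quarterly results improved.", "tokens": 3},
    {"id": 0, "page": 1, "pageRange": [1, 1], "text": "confidential.  do not copy.", "tokens": 4},
]
unique = dedupe(combined)
```

To deduplicate across files, run dedupe over the concatenated chunk lists of all PDFs in the ingestion, or persist the seen fingerprints between runs.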
Multi-column source scrambles chunk text
Reading-order risk: Extraction is in pdf.js content order. A two-column layout can interleave columns inside a chunk, degrading the embedding. Verify a sample chunk; reflow the PDF if order is wrong before you embed at scale.
Overlap not adjustable in the browser
Limited control: The UI fixes overlap at 50 words. To tune redundancy between neighbouring vectors, run via the API where overlap is configurable (capped at half the chunk size).
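When building the API request yourself, it may help to apply the same cap on the client side so the value you send is the value that takes effect. In this sketch, the option names maxChunkSize and overlap come from the runner example on this page; the helper name build_options is hypothetical, and reusing the browser's 50-2000 range for the API is an assumption.

```python
import json


def build_options(max_chunk_size: int, overlap: int) -> dict:
    """Build the request options, capping overlap at half the chunk size."""
    # Assumption: the API accepts the same 50-2000 range as the browser UI.
    if not 50 <= max_chunk_size <= 2000:
        raise ValueError("max_chunk_size must be between 50 and 2000")
    capped = max(0, min(overlap, max_chunk_size // 2))
    return {"options": {"maxChunkSize": max_chunk_size, "overlap": capped}}


payload = json.dumps(build_options(400, 60))  # within the cap, sent unchanged
capped = build_options(100, 80)               # 80 exceeds half of 100, so it becomes 50
```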
Tables stored as one garbled chunk
Use a sibling tool: A table lacks sentence punctuation, so it lands in one over-target, hard-to-embed chunk. Extract it with PDF Table to JSON and index structured rows separately for far better recall on tabular questions.
File exceeds the tier byte or page limit
413 limit: The Free tier allows 2 MB / 50 pages; Pro allows 50 MB / 500 pages. A big PDF hits one of these ceilings before you ever embed it. Split with PDF Split (Fixed) or upgrade the plan.
Re-embedding needed after dimension change
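Index rebuild: The chunker is independent of your embedding model, but if you switch models you must re-embed every chunk and rebuild the index, because the new vectors usually have different dimensions and are not comparable with the old ones even when the dimensions match. Re-run the chunker as well if you want chunk boundaries that match the new ingestion.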
Frequently asked questions
What metadata should I store per chunk?
At minimum page and pageRange from the chunk, plus the chunk text itself for display or re-ranking. Add your own document id and version. That combination lets a retrieved vector cite the exact source page and lets you manage re-ingestion cleanly.
Can I use the chunk id as my vector key?
Not directly across runs. id is positional and renumbers when you re-chunk, so reusing it risks overwriting unrelated vectors. Build a composite, content-stable key such as <docId>-<sha1(text)[:12]>.
How much overlap is built in?
50 words, carried from the tail of each chunk into the next, fixed in the browser UI. The API exposes a configurable overlap (clamped to half the chunk size) if you need denser or sparser redundancy between neighbouring vectors.
What's the right chunk size for a vector DB?
200-500 words covers most cases. Smaller chunks raise precision and index size; larger chunks give more context per hit but blur the embedding. Tune Max words per chunk and test retrieval against known queries.
Should I deduplicate chunks before upserting?
Yes, when ingesting multiple PDFs that share boilerplate. Hash each chunk's text, skip repeats, and you avoid near-duplicate vectors crowding your top-k results. The tool does not deduplicate across files.
Which vector databases does this work with?
All of them. The output is a generic JSON array, so it maps onto Pinecone, Weaviate, Chroma, Qdrant, Milvus, and pgvector with a short loop — embed text, store the page fields as metadata/payload/columns.
Is tokens the same as the tokens my embedding model counts?
No. tokens is a whitespace word count. Your embedding model tokenises with BPE, so the true count is higher — roughly 1.3x. Use tokens for rough budgeting only, and keep margin under the model's hard input limit.
Does it OCR scanned PDFs before chunking?
No. It reads only the embedded text layer. A scan produces empty chunks and would seed your index with blank vectors. OCR with PDF OCR first.
How do I handle tables that I want to query?
Extract them separately. PDF Table to JSON returns structured rows you can index as their own records, which retrieve far better than a table flattened into prose chunks.
Is my document uploaded before embedding?
No. Chunking happens in your browser via pdf.js; the PDF stays on your device. You control when, where, and whether the chunk text is sent to an embedding provider.
Can I automate ingestion of many PDFs?
Yes. Pair the @jadapps/runner, read the schema from GET /api/v1/tools/pdf-to-chunks, and POST each PDF to 127.0.0.1:9789/v1/tools/pdf-to-chunks/run. Files are processed locally, so a folder of confidential reports can be chunked without leaving your network.
What happens if I switch embedding models later?
You must re-embed and rebuild the index, because vectors from different models usually differ in dimension and are not comparable even when the dimensions match. Re-run the chunker too if you want boundaries that match the new ingestion, and use a fresh version tag in your composite ids so old and new vectors don't collide.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.