Convert a PDF to Chunks for a Vector Database

How to convert a PDF to text chunks for vector database ingestion

Step 1
Load the PDF into the chunker: open PDF to Text Chunks and drop your document. Extraction runs locally via pdf.js, so nothing is uploaded ahead of your embedding step.
Step 2
Pick a chunk size for your index: set Max words per chunk (50-2000). Smaller chunks raise precision but also index size; larger chunks give more context per hit at the cost of a blurrier embedding. 200-500 words suits most general document indexes.
Step 3
Generate and download the JSON: the tool emits an array of { id, page, pageRange, text, tokens } as <filename>.chunks.json.
Step 4
Build a stable id for upserts: don't reuse the chunker's positional id as your vector key across re-runs. Compose a stable key like <docId>-<sha1(text)>, or scope <docId>-<id> to one ingestion version, so re-chunking doesn't collide with old vectors.
Step 5
Embed each chunk in batches: loop over the array, send text to your embedding API (batch to respect rate limits), and attach page, pageRange, and text as metadata on each vector.
Step 6
Upsert and verify with a test query: upsert into your vector DB, then run known queries and confirm the returned page/pageRange point at the right sections. Adjust Max words per chunk if hits are too broad or too narrow.
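
The batching in Step 5 can be sketched as a small loop. This is a minimal sketch, not the tool's own code: `batched`, `embed_in_batches`, `embed_fn`, and the default batch size of 100 are illustrative assumptions, and `embed_fn` stands in for whatever call your embedding provider exposes.

```python
from typing import Callable, Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    chunks: list[dict],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # assumption: tune to your provider's rate limits
) -> list[tuple[dict, list[float]]]:
    """Pair every chunk with its vector, calling embed_fn once per batch."""
    pairs: list[tuple[dict, list[float]]] = []
    for group in batched(chunks, batch_size):
        vectors = embed_fn([c["text"] for c in group])
        if len(vectors) != len(group):
            raise ValueError("embed_fn must return one vector per input text")
        pairs.extend(zip(group, vectors))
    return pairs
```

Keeping the provider call behind `embed_fn` means the same loop can be exercised with a fake embedder before any paid API is involved.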

Mapping the chunk JSON to vector-store metadata

Suggested field mapping. The tool emits the left column; the right is what most vector DBs expect.

Chunk field	Use as	Why
`text`	Embedding input + stored `text` metadata	Embed it, and keep a copy so you can show or re-rank the snippet
`page`	Metadata `page`	Single-number citation for the answer
`pageRange`	Metadata `page_start` / `page_end`	Spans multi-page chunks; lets you deep-link a range
`id`	Part of a composite vector key, not the key alone	Positional — renumbers on re-chunk, so scope it to a doc + version
`tokens`	Diagnostics / budgeting only	Word count, not the model's billed token count
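
The mapping in the table can be sketched as one small function. The names `to_metadata`, `doc_id`, `version`, and `word_count` are assumptions chosen for illustration; only `id`, `page`, `pageRange`, `text`, and `tokens` come from the tool's output.

```python
def to_metadata(chunk: dict, doc_id: str, version: str) -> dict:
    """Flatten one chunk into scalar metadata fields."""
    page_start, page_end = chunk["pageRange"]
    return {
        "doc_id": doc_id,        # your own identifier, not emitted by the tool
        "version": version,      # your own ingestion tag
        "page": chunk["page"],
        "page_start": page_start,
        "page_end": page_end,
        "text": chunk["text"],
        "word_count": chunk["tokens"],  # whitespace words, diagnostics only
    }
```

Splitting `pageRange` into two integers keeps every metadata value a scalar, which is the most portable shape across stores.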

Common vector databases and how the output fits

The JSON array is database-agnostic; these are the typical upsert idioms.

Vector DB	Upsert idiom	Metadata note
Pinecone	`index.upsert([(id, vec, metadata)])`	Store page/pageRange/text in `metadata`
Weaviate	`batch.add_data_object(props, class)`	Map page/pageRange to object properties
Chroma	`collection.add(ids, embeddings, metadatas, documents)`	`documents` = chunk text, `metadatas` = page fields
Qdrant	`upsert(points=[PointStruct(id, vector, payload)])`	page/pageRange go in `payload`
pgvector	`INSERT ... (id, page, page_start, page_end, body, embedding)`	vector column + ordinary columns for metadata
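
One wrinkle with the idioms above: Qdrant accepts only unsigned integers or UUIDs as point ids, so a string key such as <docId>-<sha1(text)> needs converting first. A minimal sketch using only the standard library follows; the function name `qdrant_point_id` and the choice of `uuid.NAMESPACE_URL` as the namespace are assumptions, and any fixed namespace would work as long as it never changes.

```python
import uuid


def qdrant_point_id(composite_key: str) -> str:
    """Map a string key to a deterministic UUID string (same key, same id)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, composite_key))
```

Because UUIDv5 is a hash of the namespace and the name, re-running ingestion with the same composite key targets the same point instead of creating a new one.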

Cookbook

End-to-end recipes from chunk JSON to a populated vector index.

The output you upsert

The exact shape the tool produces, ready to map onto vector-store records.

[
  { "id": 0, "page": 1, "pageRange": [1, 1],
    "text": "...", "tokens": 487 },
  { "id": 1, "page": 2, "pageRange": [2, 3],
    "text": "...", "tokens": 502 }
]

Pinecone upsert with a stable composite id

Use a doc-scoped, content-hashed id so re-chunking never silently overwrites the wrong vector.

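import hashlib
import json

from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key='YOUR_API_KEY').Index('manuals')  # placeholder key and index name
with open('manual.chunks.json', encoding='utf-8') as f:
    chunks = json.load(f)

for start in range(0, len(chunks), 100):  # embed and upsert in batches of 100
    group = chunks[start:start + 100]
    resp = client.embeddings.create(
        model='text-embedding-3-small', input=[c['text'] for c in group])
    index.upsert(vectors=[
        ('manual-' + hashlib.sha1(c['text'].encode()).hexdigest()[:12],
         item.embedding,
         {'page': c['page'], 'page_start': c['pageRange'][0],
          'page_end': c['pageRange'][1], 'text': c['text']})
        for c, item in zip(group, resp.data)])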

pgvector schema and insert

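Store the embedding in a vector column and the citation metadata in ordinary columns.

CREATE EXTENSION IF NOT EXISTS vector;  -- enable pgvector once per database
CREATE TABLE doc_chunks (
  id text PRIMARY KEY,               -- composite key, e.g. <docId>-<sha1(text)[:12]>
  page int, page_start int, page_end int,
  body text, embedding vector(1536)  -- 1536 = text-embedding-3-small dimensions
);
-- Insert one row per chunk; page_start/page_end come straight from pageRange.
-- $6 is the embedding as a pgvector literal such as '[0.12, -0.03, ...]'.
INSERT INTO doc_chunks (id, page, page_start, page_end, body, embedding)
VALUES ($1, $2, $3, $4, $5, $6::vector)
ON CONFLICT (id) DO UPDATE SET embedding = EXCLUDED.embedding, body = EXCLUDED.body;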

Chroma with documents + metadatas

Chroma keeps the chunk text in documents and the page fields in metadatas.

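import hashlib
import chromadb

# Collection name and path are placeholders; `chunks` is the parsed .chunks.json
# array and `embeds` holds one vector per chunk, in the same order.
collection = chromadb.PersistentClient(path='./chroma').get_or_create_collection('manual')
collection.add(
  ids=['manual-' + hashlib.sha1(c['text'].encode()).hexdigest()[:12] for c in chunks],
  documents=[c['text'] for c in chunks],
  # Flatten pageRange: scalar metadata values are the safe choice in Chroma.
  metadatas=[{'page': c['page'], 'page_start': c['pageRange'][0],
              'page_end': c['pageRange'][1]} for c in chunks],
  embeddings=embeds,
)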

Batch a folder of PDFs through the runner

For a directory of reports, automate the chunking locally and feed each result to your embed-and-upsert job.

GET /api/v1/tools/pdf-to-chunks            # schema
for each report.pdf:
  POST 127.0.0.1:9789/v1/tools/pdf-to-chunks/run
  { "options": { "maxChunkSize": 400, "overlap": 60 } }
# files processed locally; nothing leaves your network

Edge cases and what actually happens

Using id as the vector primary key

Upsert hazard

id is a 0-based positional index that renumbers whenever you re-chunk with a different size. Use it as a vector key and a re-ingest will overwrite the wrong vectors or orphan old ones. Compose a stable key (doc id + content hash) instead.
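
A short sketch of why the content hash survives renumbering. The helper name `stable_key`, the two sample runs, and the 12-character truncation are illustrative assumptions, not output from the tool.

```python
import hashlib


def stable_key(doc_id: str, text: str, length: int = 12) -> str:
    """Compose <doc_id>-<truncated sha1 of the chunk text>."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:length]
    return f"{doc_id}-{digest}"


# Hypothetical runs: the second run gains a chunk at the front,
# so every positional id shifts by one.
first_run = [{"id": 0, "text": "Intro."}, {"id": 1, "text": "Safety notes."}]
second_run = [
    {"id": 0, "text": "Preface."},
    {"id": 1, "text": "Intro."},
    {"id": 2, "text": "Safety notes."},
]

keys_first = {c["text"]: stable_key("manual", c["text"]) for c in first_run}
keys_second = {c["text"]: stable_key("manual", c["text"]) for c in second_run}
```

Chunks whose text is unchanged keep their key even though `id` moved. Chunks whose text did change get new keys, so the old vectors still need deleting, for example by filtering on your own document id and version.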

tokens mismatched against model token billing

Expected

tokens is a whitespace word count, not the BPE count your embedding model charges for. A 500-'token' chunk is roughly 650 model tokens. Budget and size against the model limit with a ~1.3x conversion.
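
The conversion can be sketched as a pre-flight check. The function names, the 0.9 safety margin, and the `model_limit` parameter are assumptions; look up the actual input limit for the model you use, and treat 1.3 as a rough ratio that varies with language and content.

```python
import math

WORDS_TO_TOKENS = 1.3  # rough ratio quoted above, not an exact figure


def estimated_model_tokens(word_count: int, ratio: float = WORDS_TO_TOKENS) -> int:
    """Estimate model tokens from the chunk's whitespace word count."""
    return math.ceil(word_count * ratio)


def fits_model(word_count: int, model_limit: int, margin: float = 0.9) -> bool:
    """True when the estimate stays under `margin` of the model's input limit."""
    return estimated_model_tokens(word_count) <= model_limit * margin
```

Running `fits_model(chunk["tokens"], model_limit)` over the array before embedding flags chunks that risk being rejected or truncated by the provider.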

Scanned PDF embeds blank vectors

No text layer

With no OCR, a scanned page yields no text — and embedding empty strings pollutes your index with meaningless vectors. Check that chunks are non-empty; OCR with PDF OCR first when the source is a scan.
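
A minimal guard for this case, run before any embedding call. The helper names and the `min_words` threshold are assumptions for illustration.

```python
def non_empty_chunks(chunks: list[dict], min_words: int = 1) -> list[dict]:
    """Keep only chunks whose text has at least `min_words` whitespace words."""
    return [c for c in chunks if len(c["text"].split()) >= min_words]


def looks_scanned(chunks: list[dict]) -> bool:
    """True when no chunk carries any text, including an empty chunk array."""
    return len(non_empty_chunks(chunks)) == 0
```

If `looks_scanned` returns true, stop the ingestion job and OCR the source rather than upserting whatever is left.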

Duplicate chunks across overlapping PDFs

Index bloat

Ingesting two PDFs that share boilerplate (headers, legal footers) creates near-duplicate vectors that crowd retrieval. Deduplicate by hashing text before upsert and skipping repeats — the chunker does not deduplicate across files.
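
A sketch of hash-based deduplication across files. The function name `dedupe_chunks` and the normalisation rule (collapse whitespace, lowercase) are assumptions; this catches repeats that are identical after normalisation, not paraphrased near-duplicates.

```python
import hashlib
from typing import Optional


def dedupe_chunks(chunks: list[dict], seen: Optional[set] = None) -> list[dict]:
    """Drop chunks whose normalised text hash is already in `seen`.

    Pass the same `seen` set for every file so repeats are caught across PDFs.
    """
    seen = set() if seen is None else seen
    unique = []
    for chunk in chunks:
        normalized = " ".join(chunk["text"].split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(chunk)
    return unique
```

For a folder of reports, create one `seen = set()` up front and call `dedupe_chunks(chunks, seen)` on each file's array before embedding, which also saves the embedding cost of the skipped repeats.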

Multi-column source scrambles chunk text

Reading-order risk

Extraction is in pdf.js content order. A two-column layout can interleave columns inside a chunk, degrading the embedding. Verify a sample chunk; reflow the PDF if order is wrong before you embed at scale.

Overlap not adjustable in the browser

Limited control

The UI fixes overlap at 50 words. To tune redundancy between neighbouring vectors, run via the API where overlap is configurable (capped at half the chunk size).
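
To preview what overlap an API request would effectively get, the cap can be modelled client-side. This is a sketch under assumptions: the name `clamp_overlap`, the floor division for odd chunk sizes, and the treatment of negative values as zero are guesses, since the text above only states that overlap is capped at half the chunk size.

```python
def clamp_overlap(overlap: int, max_chunk_size: int) -> int:
    """Cap overlap at half the chunk size and never go below zero."""
    return max(0, min(overlap, max_chunk_size // 2))
```

With a chunk size of 400, a requested overlap of 60 passes through unchanged, while a request of 300 would come back as 200 under this model.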

Tables stored as one garbled chunk

Use a sibling tool

A table lacks sentence punctuation, so it lands in one over-target, hard-to-embed chunk. Extract it with PDF Table to JSON and index structured rows separately for far better recall on tabular questions.

File exceeds the tier byte or page limit

413 limit

Free: 2 MB / 50 pages. Pro: 50 MB / 500 pages. A big PDF hits one ceiling before you ever embed it. Split with PDF Split (Fixed) or upgrade the plan.
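
A pre-check against these ceilings can be sketched as below. The names `TierLimit`, `TIERS`, and `limit_exceeded` are hypothetical, and treating 1 MB as 1,000,000 bytes is an assumption; the service may count 1,048,576 bytes instead.

```python
from dataclasses import dataclass
from typing import Optional

MB = 1_000_000  # assumption: decimal megabytes


@dataclass(frozen=True)
class TierLimit:
    max_bytes: int
    max_pages: int


TIERS = {
    "free": TierLimit(max_bytes=2 * MB, max_pages=50),
    "pro": TierLimit(max_bytes=50 * MB, max_pages=500),
}


def limit_exceeded(size_bytes: int, page_count: int, tier: str = "free") -> Optional[str]:
    """Return "bytes" or "pages" for the first ceiling hit, else None."""
    limit = TIERS[tier]
    if size_bytes > limit.max_bytes:
        return "bytes"
    if page_count > limit.max_pages:
        return "pages"
    return None
```

The byte size is available from `os.path.getsize(path)`; the page count has to come from whichever PDF library you already use.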

Re-embedding needed after dimension change

Index rebuild

The chunker is independent of your embedding model, but if you switch models (different vector dimensions) you must re-embed every chunk and rebuild the index — re-run the chunker so chunk boundaries match the new ingestion.

Frequently asked questions

What metadata should I store per chunk?

At minimum page and pageRange from the chunk, plus the chunk text itself for display or re-ranking. Add your own document id and version. That combination lets a retrieved vector cite the exact source page and lets you manage re-ingestion cleanly.

Can I use the chunk id as my vector key?

Not directly across runs. id is positional and renumbers when you re-chunk, so reusing it risks overwriting unrelated vectors. Build a composite, content-stable key such as <docId>-<sha1(text)[:12]>.

How much overlap is built in?

50 words, carried from the tail of each chunk into the next, fixed in the browser UI. The API exposes a configurable overlap (clamped to half the chunk size) if you need denser or sparser redundancy between neighbouring vectors.

What's the right chunk size for a vector DB?

200-500 words covers most cases. Smaller chunks raise precision and index size; larger chunks give more context per hit but blur the embedding. Tune Max words per chunk and test retrieval against known queries.
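
Testing against known queries can be reduced to one number per chunk size. In this sketch, `results` maps each query to the metadata of its top-k hits and `expected` maps each query to the page that should be found; both structures, the function names, and the `page_start`/`page_end` metadata keys are assumptions about how you stored and queried your index.

```python
def page_hit(hit_metadata: dict, expected_page: int) -> bool:
    """True when the expected page falls inside the hit's page range."""
    return hit_metadata["page_start"] <= expected_page <= hit_metadata["page_end"]


def hit_rate(results: dict[str, list[dict]], expected: dict[str, int]) -> float:
    """Share of queries whose returned hits include the expected page."""
    if not expected:
        return 0.0
    hits = sum(
        any(page_hit(meta, page) for meta in results.get(query, []))
        for query, page in expected.items()
    )
    return hits / len(expected)
```

Re-ingesting at two or three chunk sizes and comparing `hit_rate` on the same query set gives a more defensible choice than picking a size by feel.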

Should I deduplicate chunks before upserting?

Yes, when ingesting multiple PDFs that share boilerplate. Hash each chunk's text, skip repeats, and you avoid near-duplicate vectors crowding your top-k results. The tool does not deduplicate across files.

Which vector databases does this work with?

Any store that accepts a vector plus metadata. The output is a generic JSON array, so it maps onto Pinecone, Weaviate, Chroma, Qdrant, Milvus, and pgvector with a short loop: embed text, then store the page fields as metadata, payload, or columns.

Is tokens the same as the tokens my embedding model counts?

No. tokens is a whitespace word count. Your embedding model tokenises with BPE, so the true count is higher — roughly 1.3x. Use tokens for rough budgeting only, and keep margin under the model's hard input limit.

Does it OCR scanned PDFs before chunking?

No. It reads only the embedded text layer. A scan produces empty chunks and would seed your index with blank vectors. OCR with PDF OCR first.

How do I handle tables that I want to query?

Extract them separately. PDF Table to JSON returns structured rows you can index as their own records, which retrieve far better than a table flattened into prose chunks.

Is my document uploaded before embedding?

No. Chunking happens in your browser via pdf.js; the PDF stays on your device. You control when, where, and whether the chunk text is sent to an embedding provider.

Can I automate ingestion of many PDFs?

Yes. Pair the @jadapps/runner, read the schema from GET /api/v1/tools/pdf-to-chunks, and POST each PDF to 127.0.0.1:9789/v1/tools/pdf-to-chunks/run. Files are processed locally, so a folder of confidential reports can be chunked without leaving your network.

What happens if I switch embedding models later?

You must re-embed and rebuild the index because vector dimensions differ. Re-run the chunker too if you want boundaries that match the new ingestion, and use a fresh version tag in your composite ids so old and new vectors don't collide.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.

