How to extract PDF text for search engine or site indexing
- Step 1: Extract the text. Drop a born-digital PDF onto PDF to Plain Text. It extracts every page automatically and gives you a .txt download, which is your raw indexing input.
- Step 2: Confirm it isn't a scan. Check the preview. If a document you know has text comes out blank, it's image-only; route it through OCR first, then re-extract.
- Step 3: Split on the page boundary. Pages are separated by a blank line (double newline). Split on it to keep per-page provenance, so a search hit can point users to the exact page.
- Step 4: Strip recurring noise. Remove running headers, footers, and page numbers; they extract verbatim and otherwise pollute relevance. A regex that drops the repeated header line per page works well.
- Step 5: Chunk if your engine prefers smaller documents. For long PDFs, break the cleaned text into fields/sections (or use PDF to Text Chunks for overlapping segments) so each indexed document is focused and ranks well.
- Step 6: Ingest into your search engine. Push the cleaned text (plus metadata like title, URL, page) into Elasticsearch, OpenSearch, Algolia, Typesense, or Meilisearch via your indexing client.
Mapping extracted text into common search engines
The .txt is the input; each engine wants it as a document field. These are typical patterns, not tool features.
| Search engine | How the text lands | Recommended pre-processing |
|---|---|---|
| Elasticsearch / OpenSearch | A text-mapped field on each document | Strip headers/footers; one doc per page or per section |
| Algolia | A searchable attribute (records have size limits) | Chunk long PDFs so records stay within size caps |
| Typesense | A string field in the collection schema | Clean noise; add page number as a separate field |
| Meilisearch | A searchable field in each document | Chunk by page/section; keep a stable document id |
| Custom / database FTS | A TEXT column plus a derived index (e.g. a Postgres tsvector) | Normalise whitespace before building the index |
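As one illustration of the first row, the sketch below turns a list of page texts into a request body in the newline-delimited JSON format used by the Elasticsearch / OpenSearch `_bulk` endpoint (an action line followed by a source line per document). The index name `docs`, the key `handbook`, and the field names `page` and `body` are assumptions chosen for the example, not anything the tool produces.

```python
import json

def to_bulk_ndjson(pages, index_name="docs", doc_key="handbook"):
    """Build a _bulk request body: one action line plus one source line per page.

    index_name, doc_key and the field names are illustrative assumptions.
    """
    lines = []
    for page_no, body in enumerate(pages, start=1):
        action = {"index": {"_index": index_name, "_id": f"{doc_key}#p{page_no}"}}
        source = {"page": page_no, "body": body}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk format expects a newline after the final line as well.
    return "\n".join(lines) + "\n"
```

The returned string would typically be sent as the body of a `POST /_bulk` request with an NDJSON content type; the other engines in the table take plain JSON documents through their own clients.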
Extraction behaviour relevant to indexing
What the tool does (and doesn't) so your pipeline is built on accurate assumptions.
| Aspect | Behaviour |
|---|---|
| Output | UTF-8 .txt, blank line between pages |
| Headers / footers / page numbers | Extracted verbatim — strip them yourself before indexing |
| Cleaning / dedup | None built in — extraction is faithful |
| Scanned pages | Blank — OCR required first |
| Multi-column | May interleave; clean or reflow before indexing |
| Free tier (ad-hoc) | 2 MB / 50 pages |
| Large corpus | Script ingestion; this tool is per-document |
Cookbook
Recipes for turning extracted PDF text into clean, indexable documents.
Index a single PDF page-by-page for deep links
Keep one search document per page so a hit can link to the exact page (#page=N). The blank-line separator makes this trivial.
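# Extract handbook.pdf to handbook.txt with the tool first.
from pathlib import Path

def index(doc):
    print(doc)  # placeholder: replace with your indexing client call

text = Path('handbook.txt').read_text(encoding='utf-8')
pages = text.split('\n\n')  # a blank line marks each page boundary
for i, page_text in enumerate(pages, start=1):
    index({'id': f'handbook#p{i}',
           'url': f'/docs/handbook.pdf#page={i}',
           'page': i, 'body': page_text})

Strip running headers and page numbers before indexing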
Repeated header/footer text hurts relevance and clutters snippets. Remove the recurring lines per page.
Raw page text:
ACME Handbook — Confidential 7
Onboarding starts on day one ...

After cleanup (drop the header/number line):
Onboarding starts on day one ...
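A minimal sketch of one way to find those recurring lines automatically, assuming the pages have already been split into a list of strings. It masks digit runs so that a header carrying a page number compares equal across pages, then drops any line that recurs on a large share of pages. The function name and the `min_share` threshold are hypothetical, and the heuristic can also remove genuine body lines that repeat on most pages, so review the output on a sample first.

```python
import re
from collections import Counter

def strip_recurring_lines(pages, min_share=0.6):
    """Drop lines whose digit-masked form recurs on at least min_share of pages."""
    def mask(line):
        # Replace digit runs so "Confidential 7" and "Confidential 8" match.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each masked line once per page, ignoring empty lines.
        counts.update({mask(l) for l in page.splitlines() if l.strip()})

    # Require at least two pages so a single-page file is left untouched.
    threshold = max(2, min_share * len(pages))
    recurring = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if mask(l) not in recurring]
        cleaned.append("\n".join(kept))
    return cleaned
```

Because the function returns one cleaned string per input page, page numbering is preserved for deep links.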
OCR a scanned manual, then index it
Image-only PDFs are invisible to search until they have a text layer. OCR first, extract second, index third.
1. scanned-manual.pdf → /pdf-tools/pdf-ocr → ocr.pdf
2. ocr.pdf → /pdf-tools/pdf-to-text → ocr.txt
3. clean + chunk ocr.txt → index into your engine
Normalise whitespace for a Postgres full-text index
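Collapse runs of spaces/newlines so to_tsvector builds a tidy index and snippets read cleanly. This also removes the blank-line page separators, so split into pages first if you need per-page documents.
import re
from pathlib import Path

text = Path('doc.txt').read_text(encoding='utf-8')
clean = re.sub(r'\s+', ' ', text).strip()  # any whitespace run becomes one space
# Then, via your database client with clean bound as a parameter (placeholder
# syntax varies by client; add a WHERE clause to target one row):
# UPDATE docs SET body = :clean, tsv = to_tsvector('english', :clean);

Decide: one blob vs. chunked documents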
Short PDFs can be one indexed document; long ones rank better when chunked. Use the chunker for overlapping segments.
Short FAQ (2 pages): index as one document
Long policy (40 pages): /pdf-tools/pdf-to-chunks → overlapping segments with page ranges → index each
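If you would rather chunk in your own script, overlapping windows can be sketched as below. This is a plain word-window splitter written for illustration; it is not the algorithm behind PDF to Text Chunks and does not track sentence boundaries or page ranges. The name `chunk_words` and the default `size` and `overlap` values are assumptions.

```python
def chunk_words(text, size=200, overlap=40):
    """Split text into chunks of up to `size` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a phrase that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some words twice.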
Edge cases and what actually happens
Scanned PDFs are invisible to search
Empty output: An image-only PDF has no text layer, so extraction returns nothing and the document never appears in search. This is the classic "my PDFs aren't searchable" bug. Run scans through OCR before extracting and indexing.
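A pipeline can catch this before indexing rather than after. The sketch below assumes pages are joined with exactly one blank line (`"\n\n"`) and that a page's own text contains no blank lines; both function names are hypothetical.

```python
def blank_pages(text):
    """Return 1-based page numbers whose extracted text is empty or whitespace only."""
    pages = text.split("\n\n")
    return [n for n, page in enumerate(pages, start=1) if not page.strip()]

def looks_image_only(text):
    """True when no page produced any text at all."""
    return not text.strip()
```

A file where `looks_image_only` is true is a candidate for OCR as a whole; a non-empty list from `blank_pages` on an otherwise normal file suggests individual scanned pages (or genuinely empty ones) worth checking.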
Running headers/footers inflate relevance scores
Needs cleaning: The same header or footer on every page is extracted on every page, so a query that matches it returns every page and snippets look repetitive. Strip recurring lines per page before indexing; the tool won't do this for you.
Multi-column pages interleave
May interleave: Two-column layouts can mix columns line by line because runs are joined in pdf.js order. Indexed text then contains scrambled phrases that hurt exact-phrase matching. Reflow or detect columns before indexing column-heavy material.
Page number lost during indexing
Preserved if you split: Page provenance is recoverable because pages are separated by a blank line. If you index one document per page (split on the blank line), you can attach the page number and deep-link to it. Concatenate first and you lose that mapping.
Large corpus exceeds the free per-file limits
Blocked: The free tier handles 2 MB / 50 pages per file, which is fine for spot-checks but not for bulk ingestion. For a real corpus, script the extraction at the page-text level and run it server-side or on Pro+ limits.
Records exceed the engine's size cap
Split required: Algolia and similar engines cap record size; a 40-page PDF as one record will be rejected or truncated. Chunk the text (per page or via PDF to Text Chunks) so each indexed document fits.
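When even a single page is too large, a size-aware splitter is one option. The sketch below measures UTF-8 bytes rather than characters, since size caps are usually expressed in bytes. The function name is hypothetical and `max_bytes` is a value you would take from your engine's documentation; leave headroom for the other fields and JSON overhead in the record.

```python
def split_to_byte_limit(text, max_bytes):
    """Split text on whitespace into pieces whose UTF-8 size is at most max_bytes.

    A single word longer than max_bytes is kept whole, so handle that case
    separately if your input can contain it.
    """
    pieces, current, current_size = [], [], 0
    for word in text.split():
        word_size = len(word.encode("utf-8"))
        # +1 accounts for the joining space when the piece is non-empty.
        added = word_size + (1 if current else 0)
        if current and current_size + added > max_bytes:
            pieces.append(" ".join(current))
            current, current_size = [word], word_size
        else:
            current.append(word)
            current_size += added
    if current:
        pieces.append(" ".join(current))
    return pieces
```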
Output encoding mismatch in the index
Supported: Output is UTF-8, which every modern search engine expects, so accented and non-Latin terms index correctly, provided the source font had a proper Unicode map. If you see mojibake, the problem is upstream in the PDF's font encoding, not the extraction.
Duplicate documents from re-indexing
Use stable ids: Re-running extraction and indexing without a stable document id creates duplicates. Derive the id from a stable key (file path + page) so re-indexing updates rather than duplicates. This is a pipeline concern, not a tool limitation.
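One way to derive such an id is to hash the file path together with the page number, which yields a fixed-length string that is safe to use as a key in most engines. The function name and the `#page=` key layout are assumptions for illustration; a readable id such as `handbook#p3` works just as well if your engine accepts it.

```python
import hashlib

def stable_doc_id(file_path, page):
    """Derive a deterministic id from the file path and 1-based page number."""
    key = f"{file_path}#page={page}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the id depends only on the path and page, indexing the same page again targets the same document. Note that renaming or moving the file changes the id, so keep paths stable or key on another durable identifier.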
Frequently asked questions
What encoding is the output text?
UTF-8 (no BOM). That's exactly what Elasticsearch, OpenSearch, Algolia, Typesense, Meilisearch, and database full-text engines expect, so accented and non-Latin terms index correctly — assuming the source PDF embedded a font with a proper Unicode mapping.
Should I pre-process the text before indexing?
Yes. The tool extracts faithfully, including running headers, footers, and page numbers. Strip those recurring lines, normalise whitespace, and (for long PDFs) chunk the text. Clean input means better relevance and tidier search snippets.
Can I keep the page number so search results deep-link to a page?
Yes. Pages are separated by a blank line in the output, so if you split on it and index one document per page you can store the page number and link to file.pdf#page=N. If you concatenate everything into one document first, that mapping is lost.
How does this handle scanned PDFs in my library?
It can't extract them — image-only pages have no text layer and come out blank, so they'd never appear in search. Run scans through the PDF OCR tool first to add a text layer, then extract and index.
Can I use this for a RAG pipeline?
Yes — extracted plain text is the starting point for chunking and embedding. For RAG specifically, PDF to Text Chunks gives you sentence-aware, overlapping chunks with page ranges, which is usually a better fit than one big text blob.
Is this tool a search engine or an indexer?
No — it only produces the text. You feed that text into your own search engine (Elasticsearch, Algolia, Typesense, etc.) or knowledge base. Think of it as the extraction stage of your ingestion pipeline.
Can it process my whole document library at once?
The web tool is per-file and capped at 2 MB / 50 pages on the free tier (50 MB / 500 pages on Pro). For a large corpus you'll want a scripted pipeline; the per-page extraction behaviour described here is the model to replicate.
Will multi-column PDFs index correctly?
Single-column documents index cleanly. Multi-column pages can interleave columns because runs are joined in pdf.js order, which scrambles phrases and hurts exact-phrase search. Reflow or column-detect before indexing column-heavy content.
How do I avoid duplicate documents when I re-index?
Use a stable document id derived from a stable key (e.g. file path plus page number). Then re-running extraction and indexing updates the existing document instead of creating a duplicate. That's a pipeline decision, not something the tool controls.
Should I index one document per PDF or per page?
Per page (or per section) usually ranks better for long PDFs and enables page-level deep links, and it keeps records within engine size caps. Short PDFs (an FAQ, a one-pager) are fine as a single document. The blank-line separator makes per-page splitting easy.
Does the file get uploaded to extract it?
No. Extraction runs in your browser via pdf.js — the file never leaves your device. That matters when you're indexing internal or confidential documents and don't want them passing through a third-party server.
What about tables and figures in the PDFs?
Table cell text is extracted but flattened into lines (structure is lost), and figure/chart content that's an image isn't extracted at all. For structured table data use PDF Table to JSON; for text inside images, OCR first.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.