How to extract PDF text for NLP and text analysis
- Step 1: Extract the document text. Drop a born-digital PDF onto PDF to Plain Text. It auto-extracts and gives you a UTF-8 .txt file, your raw corpus input.
- Step 2: Verify it's not a scan. If a paper you know contains text extracts blank, it's image-only. Run it through OCR first, then re-extract; otherwise it adds nothing useful to your corpus.
- Step 3: Load it in Python as UTF-8. Read with text = open('paper.txt', encoding='utf-8').read(). Split on blank lines (text.split('\n\n')) if you want page-level units.
- Step 4: Clean the boilerplate. Strip running headers, footers, and page numbers with regex; join hyphenated line-break splits (re.sub(r'(\w)-\s+(\w)', r'\1\2', text)); collapse whitespace.
- Step 5: Tokenize and analyse. Pass the cleaned string to spaCy (nlp(text)), NLTK, or a Hugging Face tokenizer for NER, sentiment, topic modelling, or summarization.
- Step 6: Chunk for embeddings if needed. For transformer context limits or vector DBs, segment with PDF to Text Chunks (sentence-aware, overlapping, with page ranges) instead of one giant string.
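Steps 3 and 4 can be combined into one small function. The sketch below is a minimal illustration, assuming pages are separated by a blank line as described in Step 3; `clean_pages` is a hypothetical helper name, not something the tool provides. It splits into pages first, because collapsing whitespace on the whole string would erase the page breaks.

```python
import re

def clean_pages(text: str) -> list[str]:
    """Split extracted text into pages, then clean each page separately."""
    cleaned = []
    for page in text.split("\n\n"):
        # Drop empty lines and lines that are only a page number.
        lines = [ln for ln in page.splitlines()
                 if ln.strip() and not ln.strip().isdigit()]
        page_text = " ".join(lines)
        # Rejoin words hyphenated across a line break: "infor- mation" -> "information".
        page_text = re.sub(r"(\w)-\s+(\w)", r"\1\2", page_text)
        # Collapse runs of whitespace into single spaces.
        page_text = re.sub(r"\s+", " ", page_text).strip()
        # Keep empty pages too, so list index + 1 stays equal to the page number.
        cleaned.append(page_text)
    return cleaned

sample = "Intro to infor-\nmation retrieval\n1\n\nSecond   page text\n2"
print(clean_pages(sample))
# ['Intro to information retrieval', 'Second page text']
```

Each element of the returned list can then go to spaCy, NLTK, or a tokenizer on its own, or be joined with spaces when you need one string.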
Extracted text → NLP framework
How the UTF-8 .txt feeds common Python NLP tooling. These are usage patterns, not tool features.
| Framework | Entry point | Typical use |
|---|---|---|
| spaCy | nlp(text) | Tokenization, POS, dependency parse, NER |
| NLTK | word_tokenize(text) | Classic tokenization, stopwords, stemming |
| Hugging Face transformers | tokenizer(text, truncation=True) | Embeddings, classification, summarization |
| gensim | list of token lists | LDA topic modelling, word2vec |
| pandas | one row per page/chunk | Tabular text analytics, feature engineering |
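The "list of token lists" entry in the gensim row is the least obvious of these shapes, so here is a rough standard-library sketch of producing it from page strings. The regex tokenizer and the tiny stopword set are placeholders chosen for illustration, and `to_token_lists` is a hypothetical name; in practice you would likely swap in spaCy or NLTK tokenization.

```python
import re

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def to_token_lists(pages: list[str]) -> list[list[str]]:
    """Lowercase each page, split it into word tokens, and drop stopwords."""
    token_lists = []
    for page in pages:
        tokens = re.findall(r"\w+", page.lower())
        token_lists.append([t for t in tokens if t not in STOPWORDS])
    return token_lists

pages = ["The extraction of text is faithful.", "Topic models need clean tokens."]
print(to_token_lists(pages))
# [['extraction', 'text', 'faithful'], ['topic', 'models', 'need', 'clean', 'tokens']]
```

A nested list like this is the input shape that gensim's dictionary and word2vec classes accept, with one inner list per document or page.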
Cleaning checklist before tokenization
The tool extracts faithfully; these are the artefacts you handle in code. Each is real PDF-extraction noise.
| Artefact | Why it appears | Fix in code |
|---|---|---|
| Running headers / footers | They're text runs on every page | Drop recurring lines per page (regex) |
| Page numbers inline | Extracted like any other text | Strip lines that are bare numbers |
| Hyphenated word breaks (infor- mation) | Line-wrap hyphenation in the source | re.sub(r'(\w)-\s+(\w)', r'\1\2', t) |
| Irregular whitespace | Runs joined with spaces; pages with \n\n | re.sub(r'\s+', ' ', t).strip() |
| Interleaved columns | pdf.js run order on multi-column pages | Column-detect / reflow before tokenizing |
Cookbook
Python-oriented recipes for turning extracted PDF text into clean NLP input.
Load and tokenize a paper with spaCy
The minimal path from extracted .txt to a parsed document. Read as UTF-8, hand the string to spaCy.
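```python
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

with open('paper.txt', encoding='utf-8') as f:
    text = f.read()

doc = nlp(text)
# Named entities as (text, label) pairs
ents = [(e.text, e.label_) for e in doc.ents]
```
De-hyphenate words split across line breaks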
PDFs often hyphenate at line ends (infor- mation). Rejoin them so tokens and embeddings are correct.
```python
import re

with open('report.txt', encoding='utf-8') as f:
    t = f.read()

t = re.sub(r'(\w)-\s+(\w)', r'\1\2', t)  # infor- mation -> information
t = re.sub(r'\s+', ' ', t).strip()       # normalise whitespace (this also removes the \n\n page breaks)
```
Strip running headers and page numbers
Boilerplate repeated on every page skews term frequencies and topic models. Remove the recurring header and bare-number lines.
```python
with open('doc.txt', encoding='utf-8') as f:
    pages = f.read().split('\n\n')  # one entry per page

clean = []
for p in pages:
    lines = [line for line in p.splitlines()
             if line.strip() and not line.strip().isdigit()  # skip blank lines and bare page numbers
             and 'Confidential' not in line]                 # drop the known header
    clean.append(' '.join(lines))
corpus = ' '.join(clean)
```
Build a per-page DataFrame for analysis
Use the blank-line page breaks to make one row per page, then run sentiment or length features per page.
```python
import pandas as pd

with open('doc.txt', encoding='utf-8') as f:
    pages = f.read().split('\n\n')

df = pd.DataFrame({'page': range(1, len(pages) + 1), 'text': pages})
df['n_tokens'] = df['text'].str.split().str.len()  # whitespace-delimited token count per page
```
Chunk for a transformer's context window
Long papers exceed model context limits. Use the chunker for sentence-aware overlapping segments instead of truncating.
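```python
# paper.pdf -> /pdf-tools/pdf-to-chunks (target ~500 tokens, overlap ~50)
#           -> chunks.json with text + pageRange per chunk
import json

def embed(text):
    """Placeholder: replace with your own embedding model or API call."""
    raise NotImplementedError

with open('paper.chunks.json', encoding='utf-8') as f:
    chunks = json.load(f)

for c in chunks:
    embed(c['text'])  # each chunk fits the model context; overlap preserves context at the seams
```
Edge cases and what actually happens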
Scanned papers extract to nothing
Empty output: Image-only PDFs have no text layer, so they extract blank and silently add nothing to your corpus. Many older or photocopied papers are scans. Run them through OCR first, and expect some OCR error, which you may want to screen out with a confidence filter.
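Because a blank extraction fails silently, it can help to flag suspicious files before they join the corpus. The check below is only a heuristic sketch: `looks_image_only` is a hypothetical helper, and the 50-characters-per-page threshold is an assumption you should tune on your own documents.

```python
def looks_image_only(text: str, min_chars_per_page: int = 50) -> bool:
    """Flag extractions whose pages carry almost no characters on average."""
    if not text.strip():
        return True  # nothing but whitespace came out
    pages = text.split("\n\n")
    total_chars = sum(len(page.strip()) for page in pages)
    return total_chars / len(pages) < min_chars_per_page

print(looks_image_only(""))                            # True
print(looks_image_only("x" * 200 + "\n\n" + "y" * 300))  # False
```

Files that trip the check are candidates for OCR; a few legitimately sparse documents (title pages, slide decks) may trip it too, so treat the result as a prompt to look, not a verdict.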
Two-column academic layout interleaves
May interleave: ML and science PDFs are usually two-column, and the tool joins runs in pdf.js order, so left and right columns can interleave line by line. That corrupts sentence boundaries and n-grams. Detect and reflow columns before tokenizing, or you'll model scrambled text.
Hyphenated line breaks split tokens
Needs cleaning: Line-wrap hyphenation produces tokens like infor- + mation. Left unfixed, your vocabulary fragments and frequencies are wrong. De-hyphenate with a regex (see the cookbook) as a standard cleaning step.
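One caveat with the plain regex is that it also fuses genuine compounds that happened to wrap at their hyphen, so well- known becomes wellknown. A possible refinement, sketched here under the assumption that you have some word list to check against, keeps the hyphen when both halves are known words and the joined form is not. `dehyphenate` and `known_words` are illustrative names, and the small set below stands in for a real vocabulary.

```python
import re

def dehyphenate(text: str, known_words: set[str]) -> str:
    """Rejoin line-break hyphenation, keeping hyphens in likely compounds."""
    def fix(match: re.Match) -> str:
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in known_words:
            return joined                 # infor- mation -> information
        if left.lower() in known_words and right.lower() in known_words:
            return f"{left}-{right}"      # well- known -> well-known
        return joined                     # unknown either way: fall back to joining
    return re.sub(r"(\w+)-\s+(\w+)", fix, text)

known = {"information", "well", "known", "retrieval"}
print(dehyphenate("well- known infor- mation retrieval", known))
# well-known information retrieval
```

When neither rule applies, the function behaves like the original one-line regex, so it is never worse than the simple version on words outside the list.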
Ligatures and special characters
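Usually preserved: fi/fl ligatures, curly quotes, and em dashes come through correctly when the font has a proper Unicode map (most modern PDFs). Where ligatures arrive as single Unicode ligature characters (such as U+FB01), NFKC normalisation in Python folds them to plain letters. A few older PDFs map glyphs to private-use code points instead; NFKC leaves those unchanged, so they need an explicit replacement table if your model is sensitive to exact characters.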
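To see what normalisation does and does not fix, the sketch below applies NFKC with the standard `unicodedata` module and then lists any private-use code points that remain. `normalise_chars` is a hypothetical helper name; what to substitute for each leftover code point depends on the font, so that mapping is left to you.

```python
import unicodedata

def normalise_chars(text: str) -> tuple[str, list[str]]:
    """Apply NFKC and report private-use code points still present afterwards."""
    normalised = unicodedata.normalize("NFKC", text)
    # Unicode general category "Co" marks private-use code points.
    leftovers = sorted({f"U+{ord(ch):04X}" for ch in normalised
                        if unicodedata.category(ch) == "Co"})
    return normalised, leftovers

text, leftovers = normalise_chars("de\ufb01ne the work\ufb02ow \ue000")
print(text.startswith("define the workflow"))  # True
print(leftovers)                               # ['U+E000']
```

An empty leftovers list means NFKC was enough; anything in it is a glyph you will have to map by hand or recover through OCR.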
Non-English / multilingual corpus
Supported: Any language with an embedded, Unicode-mapped font extracts in its native script, including Cyrillic, Greek, Arabic, and CJK. Match your tokenizer/model to the language. If a CJK PDF extracts as boxes, the font lacks a Unicode map and OCR is the workaround.
Headers/footers skew term frequencies
Needs cleaning: A header repeated on 40 pages becomes 40 occurrences of that phrase, enough to distort TF-IDF and topic models. Strip recurring boilerplate per page before building features; the tool extracts it faithfully and won't remove it for you.
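If you don't know the header text in advance, one approach is to treat any line that recurs on most pages as boilerplate. This is a sketch under two assumptions: pages are already split on blank lines, and a 60% recurrence threshold is reasonable for your documents. `drop_recurring_lines` is a hypothetical name. Digits are masked before comparison so that footers like "Page 3 of 40" count as the same line.

```python
import re
from collections import Counter

def _key(line: str) -> str:
    """Comparison key: trimmed line with every digit run replaced by '#'."""
    return re.sub(r"\d+", "#", line.strip())

def drop_recurring_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines whose key appears on at least min_fraction of pages."""
    seen_on = Counter()
    for page in pages:
        # A set counts each distinct line once per page.
        seen_on.update({_key(ln) for ln in page.splitlines() if ln.strip()})
    threshold = max(2, min_fraction * len(pages))
    boilerplate = {key for key, count in seen_on.items() if count >= threshold}
    return ["\n".join(ln for ln in page.splitlines() if _key(ln) not in boilerplate)
            for page in pages]

pages = ["ACME Report\nRevenue grew.\nPage 1 of 3",
         "ACME Report\nCosts fell.\nPage 2 of 3",
         "ACME Report\nOutlook is stable.\nPage 3 of 3"]
print(drop_recurring_lines(pages))
# ['Revenue grew.', 'Costs fell.', 'Outlook is stable.']
```

The digit masking can over-match, for example numbered captions that appear on most pages, so it is worth spot-checking what was removed on a new corpus.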
Document exceeds free-tier page/size caps
Blocked: Free extraction is capped at 2 MB / 50 pages per file. A large monograph is blocked with an upgrade prompt. Split it with PDF Split by Range and extract the parts, or use Pro+ limits for bigger corpus documents.
Equations, code listings, and figures
Lossy: Math typeset as glyphs may extract as disordered symbols; code in a monospace block may lose indentation; figure text that's an image isn't extracted at all. For figure/equation-image text, use OCR; for clean code/table structure, use the dedicated converters.
Frequently asked questions
Will ligatures and special characters extract correctly?
Most modern PDFs map ligatures (fi, fl), curly quotes, and dashes to the right Unicode characters, so they extract correctly. If ligatures arrive as single ligature characters and your model is character-sensitive, run a Unicode normalization (NFKC) pass in Python. A minority of older PDFs use private-use code points, which NFKC does not change; replace those with an explicit mapping. Always check the preview when exact characters matter.
Can I extract text from PDFs in multiple languages?
Yes — any language whose font is embedded with a proper Unicode mapping extracts in its native script (Latin, Cyrillic, Greek, Arabic, CJK, etc.). Just pair it with a tokenizer/model for that language. If a non-Latin PDF extracts as boxes, the font lacks a Unicode map and you'll need OCR.
How should I handle extraction from scanned PDFs in a corpus?
Run scanned (image-only) PDFs through the PDF OCR tool first to add a text layer, then extract. OCR introduces some error, so consider a quality/confidence filter before adding OCR'd documents to a corpus you'll train or model on.
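As a rough illustration of such a filter, the sketch below scores a document by the share of tokens that look like ordinary words. This is a crude proxy and not an OCR confidence score; the function names and the 0.8 threshold are assumptions for the example, and number-heavy documents will score low even when the OCR is fine.

```python
def wordlike_ratio(text: str) -> float:
    """Share of whitespace-separated tokens that are purely alphabetic
    once surrounding punctuation is trimmed."""
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = sum(1 for t in tokens if t.strip(".,;:!?()[]\"'").isalpha())
    return wordlike / len(tokens)

def keep_for_corpus(text: str, min_ratio: float = 0.8) -> bool:
    """Keep a document only if enough of its tokens look like words."""
    return wordlike_ratio(text) >= min_ratio

print(keep_for_corpus("The results show a clear improvement over the baseline."))  # True
print(keep_for_corpus("Th3 r#sults sh0w a cl3ar 1mprov3ment"))                      # False
```

If your OCR engine reports its own per-word or per-page confidence, that signal is usually a better basis for filtering than a text-only heuristic like this one.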
What encoding does the output use?
UTF-8 (no BOM), which every major NLP library handles. Pass the encoding explicitly with open(path, encoding='utf-8'), because Python's default text encoding can depend on the platform and locale; tokenizers then handle multibyte characters without extra work.
Do I still need to clean the text before NLP?
Yes. Extraction is faithful, which means running headers, footers, page numbers, and hyphenated line breaks all come through. Standard hygiene — strip boilerplate, de-hyphenate, normalise whitespace — is on you, and the cookbook above has ready-to-use snippets.
How do I deal with hyphenated words split across lines?
Rejoin them with a regex such as re.sub(r'(\w)-\s+(\w)', r'\1\2', text), which turns infor- mation back into information. Do this before tokenizing so your vocabulary and frequency counts aren't fragmented.
How are pages separated, and can I use that?
Pages are separated by a blank line (double newline). Split on \n\n to get one unit per page — useful for page-level features, provenance, or building a per-page DataFrame for analysis.
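For provenance, the page split makes it easy to answer "which page did this come from?". The helper below is a small sketch with a hypothetical name, `pages_containing`; it flattens whitespace inside each page so that a phrase broken across lines still matches.

```python
def pages_containing(text: str, phrase: str) -> list[int]:
    """Return 1-based page numbers whose text contains the phrase (case-insensitive)."""
    needle = " ".join(phrase.lower().split())
    hits = []
    for number, page in enumerate(text.split("\n\n"), start=1):
        flat = " ".join(page.lower().split())  # collapse line breaks within the page
        if needle in flat:
            hits.append(number)
    return hits

sample = "Attention is all\nyou need.\n\nMethods.\n\nWe revisit attention here."
print(pages_containing(sample, "attention is all you need"))  # [1]
print(pages_containing(sample, "attention"))                  # [1, 3]
```

Do this lookup on the raw extracted text, before any step that collapses whitespace across the whole document, since that would remove the page boundaries.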
Will two-column papers extract in reading order?
Not reliably. The tool joins runs in pdf.js order and doesn't reconstruct columns, so two-column academic PDFs can interleave. Detect and reflow columns before tokenizing, otherwise sentence boundaries and n-grams will be wrong.
Should I extract one big string or chunk it?
For classic NLP (NER, sentiment over a doc), one cleaned string is fine. For embeddings, RAG, or transformer context limits, chunk it — PDF to Text Chunks gives sentence-aware, overlapping segments with page ranges, which beats blindly truncating.
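To make "sentence-aware, overlapping" concrete, here is a minimal local sketch of the idea. It is not the algorithm PDF to Text Chunks uses: it splits sentences with a naive regex, measures size in whitespace words instead of model tokens, and does not track page ranges. `chunk_sentences` and its parameters are hypothetical.

```python
import re

def chunk_sentences(text: str, max_words: int = 200,
                    overlap_sentences: int = 1) -> list[str]:
    """Group whole sentences into chunks of roughly max_words words, repeating
    the last overlap_sentences sentences at the start of the next chunk."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_sentences(text, max_words=6, overlap_sentences=1):
    print(chunk)
# One two three. Four five six.
# Four five six. Seven eight nine.
# Seven eight nine. Ten eleven twelve.
```

A single sentence longer than max_words still becomes its own oversized chunk, so a hard token limit needs an extra check with the target model's tokenizer.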
Can I use the output with Hugging Face tokenizers?
Yes — pass the UTF-8 string straight to a tokenizer (tokenizer(text, truncation=True, ...)). For long documents, chunk first so you don't silently truncate past the model's max length.
Is my corpus uploaded anywhere?
No. Extraction runs in your browser via pdf.js, so licensed or proprietary corpus documents stay on your device. The result panel confirms 0 bytes uploaded.
What about equations, code, and tables?
Math glyphs can extract as disordered symbols, code can lose indentation, and table structure flattens into lines. For structured tables use PDF Table to JSON; for equation or figure text that's an image, OCR it first.
Privacy first
All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.