Extract PDF Text for NLP and Text Analysis

How to extract PDF text for NLP processing and text analysis

Step 1
Extract the document text: Drop a born-digital PDF onto PDF to Plain Text. It extracts automatically and gives you a UTF-8 .txt file, your raw corpus input.
Step 2
Verify it's not a scan: If a paper that you know contains text extracts as blank output, it is image-only. Run it through OCR first, then re-extract; otherwise it enters your corpus as an empty document.
Step 3
Load it in Python as UTF-8: Read with text = open('paper.txt', encoding='utf-8').read(). Split on blank lines (text.split('\n\n')) if you want page-level units.
Step 4
Clean the boilerplate: Strip running headers, footers, and page numbers with regex; join hyphenated line-break splits (re.sub(r'(\w)-\s+(\w)', r'\1\2', text)); collapse whitespace last, after any page-level splitting, because collapsing it erases the blank-line page breaks.
Step 5
Tokenize and analyse: Pass the cleaned string to spaCy (nlp(text)), NLTK, or a Hugging Face tokenizer for NER, sentiment, topic modelling, or summarization.
Step 6
Chunk for embeddings if needed: For transformer context limits or vector DBs, segment with PDF to Text Chunks (sentence-aware, overlapping, with page ranges) instead of one giant string.
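The loading step can be sketched as a small helper. This is a minimal sketch, assuming the blank-line page separator described in Step 3; `load_pages` is a hypothetical name chosen for illustration, not part of any tool.

```python
from pathlib import Path


def load_pages(path):
    """Read an extracted .txt file as UTF-8 and return one string per page.

    Assumes pages are separated by a blank line (two newline characters).
    Pages that are empty after stripping are dropped.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [page.strip() for page in text.split("\n\n") if page.strip()]
```

Dropping empty pages keeps later per-page statistics clean, but it also shifts page numbers if the source had genuinely blank pages, so keep the unfiltered list if you need exact provenance.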

Extracted text → NLP framework

How the UTF-8 .txt feeds common Python NLP tooling. These are usage patterns, not tool features.

Framework	Entry point	Typical use
spaCy	`nlp(text)`	Tokenization, POS, dependency parse, NER
NLTK	`word_tokenize(text)`	Classic tokenization, stopwords, stemming
Hugging Face transformers	`tokenizer(text, truncation=True)`	Embeddings, classification, summarization
gensim	list of token lists	LDA topic modelling, word2vec
pandas	one row per page/chunk	Tabular text analytics, feature engineering
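The "list of token lists" shape in the gensim row means one list of string tokens per document or page. The sketch below builds that shape with a plain regex tokenizer, which is an assumption made to keep the example dependency-free; in practice you might swap in spaCy or NLTK tokens. `to_token_lists` is a hypothetical helper name.

```python
import re

TOKEN_RE = re.compile(r"\w+")


def to_token_lists(pages):
    """Lowercase each page and split it into a list of word tokens."""
    return [TOKEN_RE.findall(page.lower()) for page in pages]


pages = ["Topic models need tokens.", "Word2vec needs them too."]
token_lists = to_token_lists(pages)
# token_lists[0] -> ['topic', 'models', 'need', 'tokens']
```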

Cleaning checklist before tokenization

The tool extracts faithfully; these are the artefacts you handle in code. Each is real PDF-extraction noise.

Artefact	Why it appears	Fix in code
Running headers / footers	They're text runs on every page	Drop recurring lines per page (regex)
Page numbers inline	Extracted like any other text	Strip lines that are bare numbers
Hyphenated word breaks (`infor- mation`)	Line-wrap hyphenation in the source	`re.sub(r'(\w)-\s+(\w)', r'\1\2', t)`
Irregular whitespace	Runs joined with spaces; pages with `\n\n`	`re.sub(r'\s+', ' ', t).strip()` per page (run on the whole file, it also erases the `\n\n` page breaks)
Interleaved columns	pdf.js run order on multi-column pages	Column-detect / reflow before tokenizing
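Two of the fixes in the table interact: collapsing whitespace over the whole file also removes the `\n\n` page separator. One way to keep page boundaries is to clean page by page and rejoin, as in this minimal sketch. The function names are hypothetical, and the regexes are the ones from the table.

```python
import re


def clean_page(page):
    """De-hyphenate and collapse whitespace inside a single page."""
    page = re.sub(r"(\w)-\s+(\w)", r"\1\2", page)
    return re.sub(r"\s+", " ", page).strip()


def clean_document(text):
    """Clean every page but keep the blank-line separator between pages."""
    pages = [clean_page(p) for p in text.split("\n\n")]
    return "\n\n".join(p for p in pages if p)


raw = "Infor- mation   retrieval\n\nSecond   page"
cleaned = clean_document(raw)
# cleaned == 'Information retrieval\n\nSecond page'
```

A word hyphenated across a page break is not rejoined by this approach, since each page is cleaned in isolation.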

Cookbook

Python-oriented recipes for turning extracted PDF text into clean NLP input.

Load and tokenize a paper with spaCy

The minimal path from extracted .txt to a parsed document. Read as UTF-8, hand the string to spaCy.

import spacy
nlp = spacy.load('en_core_web_sm')
text = open('paper.txt', encoding='utf-8').read()
doc = nlp(text)
ents = [(e.text, e.label_) for e in doc.ents]

De-hyphenate words split across line breaks

PDFs often hyphenate at line ends (infor- mation). Rejoin them so tokens and embeddings are correct.

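import re
with open('report.txt', encoding='utf-8') as f:
    t = f.read()
# Join words split by a line-wrap hyphen: infor- mation -> information.
# Caveat: a real compound that wraps after its hyphen (well- known) is joined too.
t = re.sub(r'(\w)-\s+(\w)', r'\1\2', t)
# Collapse all whitespace runs; this also removes the blank-line page breaks.
t = re.sub(r'\s+', ' ', t).strip()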
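The plain regex joins every hyphen-plus-whitespace break, so a genuine compound such as "well- known" becomes "wellknown". A more conservative variant, sketched below, joins only when the joined form is a known word and otherwise keeps the hyphen. The `vocabulary` set is an assumption: it could come from a word list or from words seen unhyphenated elsewhere in your corpus. `dehyphenate` is a hypothetical helper name.

```python
import re

HYPHEN_BREAK = re.compile(r"(\w+)-\s+(\w+)")


def dehyphenate(text, vocabulary):
    """Join 'infor- mation' only when the joined word is in `vocabulary`.

    Otherwise keep the hyphen and drop the whitespace
    ('well- known' -> 'well-known').
    """
    def fix(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in vocabulary:
            return joined
        return f"{left}-{right}"

    return HYPHEN_BREAK.sub(fix, text)


vocab = {"information", "retrieval"}
result = dehyphenate("infor- mation about well- known retrie- val", vocab)
# result == 'information about well-known retrieval'
```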

Strip running headers and page numbers

Boilerplate repeated on every page skews term frequencies and topic models. Remove the recurring header and bare-number lines.

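with open('doc.txt', encoding='utf-8') as f:
    pages = f.read().split('\n\n')   # one string per page
clean = []
for page in pages:
    # Assumes line breaks survive within a page. Keep non-empty lines that
    # are neither a bare page number nor the known header.
    lines = [line for line in page.splitlines()
             if line.strip() and not line.strip().isdigit()
             and 'Confidential' not in line]   # 'Confidential' stands in for your header text
    clean.append(' '.join(lines))
corpus = ' '.join(clean)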
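When the header text is not known in advance, recurring lines can be detected by counting how many pages each line appears on. This is a minimal sketch under two assumptions: line breaks survive within each page, and a line present on at least 60% of pages is boilerplate (the `min_share` default is an arbitrary starting point). Both function names are hypothetical.

```python
from collections import Counter


def find_recurring_lines(pages, min_share=0.6):
    """Return lines that appear on at least `min_share` of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(min_share * len(pages)))
    return {line for line, n in counts.items() if n >= threshold}


def strip_boilerplate(pages, min_share=0.6):
    """Drop recurring lines and bare page numbers from every page."""
    recurring = find_recurring_lines(pages, min_share)
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines()
                if line.strip()
                and line.strip() not in recurring
                and not line.strip().isdigit()]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Quarterly Report\nRevenue grew.\n1",
    "ACME Quarterly Report\nCosts fell.\n2",
    "ACME Quarterly Report\nOutlook is stable.\n3",
]
stripped = strip_boilerplate(pages)
# stripped == ['Revenue grew.', 'Costs fell.', 'Outlook is stable.']
```

Headers that embed a changing value (for example a page number or date inside the header line) will not match exactly and need a regex instead.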

Build a per-page DataFrame for analysis

Use the blank-line page breaks to make one row per page, then run sentiment or length features per page.

import pandas as pd
pages = open('doc.txt', encoding='utf-8').read().split('\n\n')
df = pd.DataFrame({'page': range(1, len(pages)+1), 'text': pages})
df['n_tokens'] = df['text'].str.split().str.len()

Chunk for a transformer's context window

Long papers exceed model context limits. Use the chunker for sentence-aware overlapping segments instead of truncating.

# paper.pdf -> /pdf-tools/pdf-to-chunks (target ~500 tokens, overlap ~50)
# -> paper.chunks.json with text + pageRange per chunk
import json
with open('paper.chunks.json', encoding='utf-8') as f:
    chunks = json.load(f)
for c in chunks:
    # embed() is a placeholder for your own embedding function
    embed(c['text'])   # each chunk fits the model context; overlap preserves context at seams
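If you need to chunk already-extracted text in code instead, the idea of sentence-aware overlapping segments can be sketched as below. This is an illustration only and not the chunker tool's algorithm: it splits sentences with a simple punctuation regex, budgets by whitespace-separated words as a rough proxy for model tokens, and overlaps by whole sentences. `chunk_sentences` and its parameters are hypothetical.

```python
import re


def chunk_sentences(text, max_words=500, overlap_sentences=1):
    """Greedily pack whole sentences into chunks of roughly max_words words.

    Each new chunk starts with the last `overlap_sentences` sentences of the
    previous one. A single sentence longer than max_words becomes its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > max_words:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = chunk_sentences(
    "One two three. Four five six. Seven eight nine.",
    max_words=6,
    overlap_sentences=1,
)
# demo == ['One two three. Four five six.', 'Four five six. Seven eight nine.']
```

Because the overlap is added after the budget check, a chunk can exceed `max_words` slightly; leave headroom below the model's real limit.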

Edge cases and what actually happens

Scanned papers extract to nothing

Empty output

Image-only PDFs have no text layer, so they extract blank and silently add nothing to your corpus. Many older or photocopied papers are scans. Run them through OCR first. Expect some OCR error, and consider filtering OCR'd documents by recognition confidence before adding them.
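Blank extractions are easy to miss in a batch, so a quick triage pass helps. The sketch below flags any extracted text with very few letters or digits as a likely scan. The threshold of 20 characters is an arbitrary assumption, and `looks_blank` is a hypothetical helper name.

```python
def looks_blank(text, min_alnum=20):
    """True when an extracted text has fewer than `min_alnum` letters or digits."""
    return sum(ch.isalnum() for ch in text) < min_alnum


extracted = {
    "scan.txt": "\n\n\n\n",
    "paper.txt": "Attention mechanisms weigh tokens by relevance to a query.",
}
needs_ocr = [name for name, text in extracted.items() if looks_blank(text)]
# needs_ocr == ['scan.txt']
```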

Two-column academic layout interleaves

May interleave

ML and science PDFs are usually two-column, and the tool joins runs in pdf.js order, so left and right columns can interleave line by line. That corrupts sentence boundaries and n-grams. Detect/reflow columns before tokenizing, or you'll model scrambled text.

Hyphenated line breaks split tokens

Needs cleaning

Line-wrap hyphenation produces tokens like infor- + mation. Left unfixed, your vocabulary fragments and frequencies are wrong. De-hyphenate with a regex (see the cookbook) as a standard cleaning step.

Ligatures and special characters

Usually preserved

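ﬁ/ﬂ ligatures, curly quotes, and em dashes come through correctly when the font has a proper Unicode map (most modern PDFs). If the ligatures arrive as the single characters ﬁ (U+FB01) and ﬂ (U+FB02) and your model is sensitive to exact characters, NFKC normalisation in Python folds them into plain fi and fl; it leaves curly quotes and em dashes unchanged. A few older PDFs map glyphs to private-use code points instead, which NFKC cannot repair: those need a font-specific replacement table or OCR.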
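The difference between compatibility ligatures and private-use code points can be checked with the standard library. This is a minimal sketch using `unicodedata`; the helper names are hypothetical, and U+E001 is just an example private-use code point.

```python
import unicodedata


def normalise_ligatures(text):
    """Apply NFKC, which folds compatibility characters such as U+FB01 into 'fi'."""
    return unicodedata.normalize("NFKC", text)


def private_use_chars(text):
    """Return the set of private-use code points (category 'Co') in the text."""
    return {ch for ch in text if unicodedata.category(ch) == "Co"}


sample = "ef\ufb01cient \u201cwork\ufb02ow\u201d \ue001"
fixed = normalise_ligatures(sample)
# fixed == 'efficient \u201cworkflow\u201d \ue001'
# Ligatures are folded; the curly quotes and U+E001 are untouched.
leftover = private_use_chars(fixed)
# leftover == {'\ue001'}
```

NFKC also rewrites other compatibility characters (for example non-breaking spaces and superscript digits), so apply it deliberately when exact characters matter downstream.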

Non-English / multilingual corpus

Supported

Any language with an embedded, Unicode-mapped font extracts in its native script — Cyrillic, Greek, Arabic, CJK. Match your tokenizer/model to the language. If a CJK PDF extracts as boxes, the font lacks a Unicode map and OCR is the workaround.
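To route documents in a mixed corpus to the right tokenizer, a rough script check can be a first filter. The sketch below labels each letter by the first word of its Unicode character name. That is a heuristic assumption, and script is not the same as language (Latin script covers many languages), so treat the result as a hint. The function names are hypothetical.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by the first word of their Unicode name (a rough script label)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script label, or None if the text has no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None
```

For example, `dominant_script("Привет мир")` returns `'CYRILLIC'`, while text with no letters at all returns `None`, which is worth flagging for a manual look.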

Headers/footers skew term frequencies

Needs cleaning

A header repeated on 40 pages becomes 40 occurrences of that phrase — enough to distort TF-IDF and topic models. Strip recurring boilerplate per page before building features; the tool extracts it faithfully and won't remove it for you.

Document exceeds free-tier page/size caps

Blocked

Free extraction is capped at 2 MB / 50 pages per file. A large monograph is blocked with an upgrade prompt. Split it with PDF Split by Range and extract parts, or use Pro+ limits for bigger corpus documents.

Equations, code listings, and figures

Lossy

Math typeset as glyphs may extract as disordered symbols; code in a monospace block may lose indentation; figure text that's an image isn't extracted at all. For figure/equation-image text, OCR; for clean code/table structure, use the dedicated converters.

Frequently asked questions

Will ligatures and special characters extract correctly?

Most modern PDFs map ligatures (ﬁ, ﬂ), curly quotes, and dashes to the right Unicode characters, so they extract correctly. If your model is character-sensitive, a Unicode normalization (NFKC) pass in Python folds the single-character ligatures ﬁ and ﬂ into plain fi and fl. A minority of older PDFs use private-use code points instead; NFKC leaves those unchanged, so they need a font-specific replacement table or OCR. Always check the preview when exact characters matter.

Can I extract text from PDFs in multiple languages?

Yes — any language whose font is embedded with a proper Unicode mapping extracts in its native script (Latin, Cyrillic, Greek, Arabic, CJK, etc.). Just pair it with a tokenizer/model for that language. If a non-Latin PDF extracts as boxes, the font lacks a Unicode map and you'll need OCR.

How should I handle extraction from scanned PDFs in a corpus?

Run scanned (image-only) PDFs through the PDF OCR tool first to add a text layer, then extract. OCR introduces some error, so consider a quality/confidence filter before adding OCR'd documents to a corpus you'll train or model on.

What encoding does the output use?

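UTF-8 (no BOM), which every major Python NLP library handles once the file is decoded to a str. Pass the encoding explicitly with open(path, encoding='utf-8'), because the default encoding of open() can depend on the platform locale; tokenizers then handle multibyte characters without extra work.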

Do I still need to clean the text before NLP?

Yes. Extraction is faithful, which means running headers, footers, page numbers, and hyphenated line breaks all come through. Standard hygiene — strip boilerplate, de-hyphenate, normalise whitespace — is on you, and the cookbook above has ready-to-use snippets.

How do I deal with hyphenated words split across lines?

Rejoin them with a regex such as re.sub(r'(\w)-\s+(\w)', r'\1\2', text), which turns infor- mation back into information. Do this before tokenizing so your vocabulary and frequency counts aren't fragmented.

How are pages separated, and can I use that?

Pages are separated by a blank line (double newline). Split on \n\n to get one unit per page — useful for page-level features, provenance, or building a per-page DataFrame for analysis.
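For provenance, the page separator lets you map a term back to page numbers. This is a minimal sketch, assuming the raw extraction is split before any whitespace cleaning so that page positions still match the PDF; `pages_containing` is a hypothetical helper name.

```python
def pages_containing(text, term):
    """Return 1-based page numbers whose text contains `term` (case-insensitive)."""
    pages = text.split("\n\n")
    needle = term.lower()
    return [i for i, page in enumerate(pages, start=1) if needle in page.lower()]


doc = "Intro to transformers\n\nRelated work\n\nTransformers in practice"
hits = pages_containing(doc, "transformers")
# hits == [1, 3]
```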

Will two-column papers extract in reading order?

Not reliably. The tool joins runs in pdf.js order and doesn't reconstruct columns, so two-column academic PDFs can interleave. Detect and reflow columns before tokenizing, otherwise sentence boundaries and n-grams will be wrong.

Should I extract one big string or chunk it?

For classic NLP (NER, sentiment over a doc), one cleaned string is fine. For embeddings, RAG, or transformer context limits, chunk it — PDF to Text Chunks gives sentence-aware, overlapping segments with page ranges, which beats blindly truncating.

Can I use the output with Hugging Face tokenizers?

Yes — pass the UTF-8 string straight to a tokenizer (tokenizer(text, truncation=True, ...)). For long documents, chunk first so you don't silently truncate past the model's max length.

Is my corpus uploaded anywhere?

No. Extraction runs in your browser via pdf.js, so licensed or proprietary corpus documents stay on your device. The result panel confirms 0 bytes uploaded.

What about equations, code, and tables?

Math glyphs can extract as disordered symbols, code can lose indentation, and table structure flattens into lines. For structured tables use PDF Table to JSON; for equation or figure text that's an image, OCR it first.

Privacy first

All PDF processing runs locally in your browser using PDF-lib and pdf.js. No file is ever uploaded — only metadata counters are saved for signed-in dashboard stats.
