Clean PDF-Extracted CSV Data — Strip Special Characters Online

How to strip special characters from PDF-extracted CSV data

Step 1
Extract the PDF table to CSV — Use Tabula, pdfplumber, Camelot, or Acrobat to export the table as CSV. The cleaner also accepts XLSX/XLS/ODS if your extractor wrote a spreadsheet.
Step 2
Drop the file onto the stripper — Free: 2 MB / 500 rows. Pro: 100 MB / 100,000 rows. PapaParse auto-detects the delimiter from the first rows.
Step 3
Keep all four boxes on — Letters, Digits, Spaces, Punctuation default to on. This removes soft hyphens, NBSPs, bullets, and control characters while keeping the real text. The strip applies to every data cell — there is no per-column selector.
Step 4
Run Strip special chars — Soft hyphens, non-breaking spaces, bullets, and other non-keep characters are deleted. The header row is left untouched.
Step 5
Review the preview for word-fusion — Check the first-10-row preview. Watch for words that ran together where a non-breaking space was removed (extractors often substitute NBSP for a normal space), and confirm ligatures rendered acceptably (they are kept, not expanded).
Step 6
Download and use the cleaned CSV — Download writes <name>.stripped.csv as UTF-8 (or .stripped.xlsx for a spreadsheet). Import into your database, spreadsheet, or analysis pipeline.
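
For readers who want to reproduce the same flow in a script, the steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the tool's own code (which runs in the browser with PapaParse): `strip_csv_text`, `stripped_name`, and `demo_strip` are hypothetical names, and the demo callable only deletes soft hyphens and NBSPs rather than applying the full keep-list.

```python
import csv
import io
from pathlib import Path
from typing import Callable


def strip_csv_text(csv_text: str, strip_cell: Callable[[str], str]) -> str:
    """Apply strip_cell to every data cell; leave the header row untouched."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, data = rows[0], rows[1:]
    cleaned = [[strip_cell(cell) for cell in row] for row in data]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)      # header is written back exactly as read
    writer.writerows(cleaned)
    return out.getvalue()


def stripped_name(filename: str) -> str:
    """Mirror the <name>.stripped.csv naming described in Step 6."""
    return Path(filename).stem + ".stripped.csv"


def demo_strip(cell: str) -> str:
    # Stand-in for the real keep-list (assumption): delete soft hyphens and NBSPs only.
    return cell.replace("\u00ad", "").replace("\u00a0", "")


source = "id,te\u00adrm\n1,inter\u00adnational\n2,12\u00a0345\n"
result = strip_csv_text(source, demo_strip)
```

Note that the soft hyphen in the header cell survives in `result`, which matches the header-protection behaviour described in Step 4.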

PDF-extraction artefacts and how the stripper handles them

Verified character by character against the keep-pattern, all four boxes on. 'Removed' means deleted entirely.

Artefact	Where it comes from	Kept or removed	Result
Soft hyphen U+00AD	Justified / hyphenated text	Removed	`co¬operate` (soft hyphen shown as ¬) → `cooperate`
Non-breaking space U+00A0	Layout spacing	Removed (not converted)	`12 345` (NBSP) → `12345`
Bullet `•`, dingbats	List and decorative glyphs	Removed	`• Item` → ` Item` (the space after the bullet remains)
Ligature `ﬁ` U+FB01	Typeset fi/fl combinations	Kept (it is a letter)	`ﬁnally` stays `ﬁnally` — NOT expanded to fi
Control characters	Extraction-engine residue	Removed	Invisible junk deleted
Em/en dash `—` `–`	Typeset ranges and breaks	Removed	`10—20` → `1020`; `pp. 3–5` → `pp. 35`
Zero-width space U+200B	Soft-wrap hints	Removed	Joins split words
Accented letters `é ü`	Genuine extracted text	Kept	Preserved
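
The table's behaviour can be approximated in a few lines of Python. This is a sketch under assumptions, not the tool's actual keep-pattern: `KEPT_PUNCTUATION` is hypothetical and contains only the three marks this page confirms are kept (period, comma, ASCII hyphen), digits are assumed to mean Unicode decimal digits, and `is_kept` and `strip_cell` are illustrative names.

```python
import unicodedata

# Hypothetical kept-punctuation set: only the marks this page confirms as kept.
KEPT_PUNCTUATION = set(".,-")


def is_kept(ch: str) -> bool:
    """Return True for letters, decimal digits, the regular space, or kept punctuation."""
    category = unicodedata.category(ch)
    return (
        category.startswith("L")    # any Unicode letter, ligatures included
        or category == "Nd"         # decimal digits (assumption)
        or ch == " "                # only U+0020, so NBSP is not kept
        or ch in KEPT_PUNCTUATION
    )


def strip_cell(text: str) -> str:
    """Delete every character that is not on the keep-list."""
    return "".join(ch for ch in text if is_kept(ch))


print(strip_cell("co\u00adoperate"))   # cooperate
print(strip_cell("12\u00a0345"))       # 12345
print(strip_cell("10\u201420"))        # 1020
```

An explicit punctuation set is used instead of the Unicode punctuation categories because the bullet and the em/en dashes are themselves classified as punctuation by Unicode, yet the table lists them as removed.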

What the stripper can't fix — use these instead

Structural and substitution tasks need a different tool or step.

Problem from PDF extraction	Stripper handles it?	What to use
Cells split across the wrong columns	No — structural	Re-extract with better table detection (Camelot lattice mode), or fix manually
Expand ligature `ﬁ` to `fi`	No — `ﬁ` is kept as a letter	Unicode NFKC normalisation upstream before export
Convert NBSP to a regular space	No — NBSP is deleted	csv-cleaner (hidden-whitespace normalise)
Replace one specific glyph (e.g. `•` → `-`)	No — it deletes, not substitutes	csv-find-replace
Merge wrapped rows back into one record	No	Manual fix or a dedicated extraction post-processor

Cookbook

Real PDF-to-CSV rows, before and after. Each shows the typographic artefact and exactly what the keep-list does — including where it cannot help.

Soft hyphen scattered through justified text

Example

Justified PDF columns insert soft hyphens that pdfplumber preserves. Invisible in a viewer, they break exact matching. The stripper deletes them, rejoining the word.

Input (soft hyphen shown as ¬):
id,term
1,inter¬national
2,manage¬ment

Output (all boxes on):
id,term
1,international
2,management
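
Because soft hyphens are invisible, it can help to confirm they are present before stripping. The sketch below counts format and control characters in a cell using Python's standard `unicodedata` module; `find_invisibles` is a hypothetical helper name, not part of any tool mentioned here.

```python
import unicodedata
from collections import Counter


def find_invisibles(text: str) -> Counter:
    """Count format (Cf) and control (Cc) characters, keyed by code point."""
    counts: Counter = Counter()
    for ch in text:
        if unicodedata.category(ch) in ("Cf", "Cc"):
            counts[f"U+{ord(ch):04X}"] += 1
    return counts


cell = "inter\u00adnational"
print(cell == "international")   # False: the soft hyphen breaks exact matching
print(find_invisibles(cell))     # Counter({'U+00AD': 1})
```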

Non-breaking space inside a number — the gotcha

Example

Extractors often emit a non-breaking space as a thousands separator. The stripper removes it (only regular space is kept), which collapses the number — usually fine for numeric parsing, but verify it is what you want.

Input (NBSP shown as ~):
id,population
1,12~345
2,1~200~000

Output (all boxes on):
id,population
1,12345
2,1200000

If you wanted '12 345' kept, fold NBSP → space first
with /tool/csv-cleaner.
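
If you handle this step in a script instead, folding NBSP to a regular space is a one-line replacement. A minimal sketch, assuming plain Python strings; `fold_nbsp` and `parse_grouped_int` are hypothetical names, and how csv-cleaner implements the same fold is not stated on this page.

```python
NBSP = "\u00a0"


def fold_nbsp(text: str) -> str:
    """Replace each non-breaking space with a regular space (U+0020)."""
    return text.replace(NBSP, " ")


def parse_grouped_int(text: str) -> int:
    """Parse an integer whose thousands groups are separated by NBSPs or spaces."""
    return int(fold_nbsp(text).replace(" ", ""))


print(fold_nbsp("12\u00a0345"))                 # 12 345
print(parse_grouped_int("1\u00a0200\u00a0000"))  # 1200000
```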

Bullet glyphs from a list layout

Example

A bulleted list extracted into a cell carries • markers. They are deleted because they are symbols, leaving the text with a leading space and a double space where the bullets were.

Input:
id,features
1,• Waterproof • Wireless

Output (all boxes on):
id,features
1, Waterproof  Wireless

Tidy the spacing with /tool/csv-whitespace-trimmer.
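
A scripted equivalent of that tidy-up might look like the sketch below. It assumes the goal is to collapse runs of regular spaces and trim the ends; `tidy_spaces` is a hypothetical name, and the exact rules csv-whitespace-trimmer applies are not described on this page.

```python
import re


def tidy_spaces(text: str) -> str:
    """Collapse runs of regular spaces into one and trim both ends."""
    return re.sub(r" {2,}", " ", text).strip(" ")


print(tidy_spaces(" Waterproof  Wireless"))   # Waterproof Wireless
```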

Ligature is KEPT, not expanded — set expectations

Example

A frequent misconception: that the tool turns ﬁ into fi. It does not — ﬁ (U+FB01) is a Unicode letter, so the keep-list preserves it unchanged. If you need real fi, normalise upstream.

Input:
id,word
1,ﬁnally
2,ﬂoor

Output (all boxes on) — UNCHANGED:
id,word
1,ﬁnally
2,ﬂoor

Use Unicode NFKC normalisation before export to get
'finally' and 'floor'.
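
In Python, NFKC normalisation is available through the standard `unicodedata` module. The sketch below shows it on the two ligatures above; `expand_ligatures` is a hypothetical wrapper name.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Apply NFKC, which maps compatibility characters such as U+FB01 to 'fi'."""
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("\ufb01nally"))   # finally
print(expand_ligatures("\ufb02oor"))     # floor
```

NFKC is broader than ligature expansion: it also folds NBSP to a regular space and turns superscript digits into plain digits, so review the result if those distinctions matter in your data.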

Em dash in a page-range cell

Example

Typeset ranges use em/en dashes (— –), which are not in the kept punctuation set, so they are deleted and the surrounding tokens fuse. Decide whether to replace them with a hyphen instead.

Input:
id,pages
1,pp. 10—24
2,Vol. 3–5

Output (all boxes on):
id,pages
1,pp. 1024
2,Vol. 35

To turn '—' into '-' instead of deleting it, use
/tool/csv-find-replace.
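
The same substitution can be scripted before stripping. A minimal sketch using Python's built-in `str.translate`; `DASH_TO_HYPHEN` and `dashes_to_hyphens` are hypothetical names.

```python
# Map em dash (U+2014) and en dash (U+2013) to the ASCII hyphen.
DASH_TO_HYPHEN = str.maketrans({"\u2014": "-", "\u2013": "-"})


def dashes_to_hyphens(text: str) -> str:
    """Replace typeset dashes with ASCII hyphens so ranges survive stripping."""
    return text.translate(DASH_TO_HYPHEN)


print(dashes_to_hyphens("pp. 10\u201424"))   # pp. 10-24
print(dashes_to_hyphens("Vol. 3\u20135"))    # Vol. 3-5
```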

Errors and edge cases

Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.

Ligatures are kept, never expanded to ASCII

Preserved

ﬁ (U+FB01), ﬂ (U+FB02), and other typographic ligatures are classified as letters by \p{L}, so they survive unchanged. The tool does NOT turn them into fi/fl. For that, apply Unicode NFKC normalisation before exporting the CSV.

Non-breaking space is deleted, fusing tokens

Expected

Extractors substitute NBSP (U+00A0) for spacing; the stripper deletes it because only the regular space is kept. A value such as '12 345' spaced with an NBSP becomes '12345', and 'New York' becomes 'NewYork'. If you need the space preserved, fold NBSP to a regular space first with csv-cleaner.

Em/en dashes are removed, joining tokens

Expected

— and – are not in the kept punctuation set (only the ASCII hyphen - is), so 10—20 becomes 1020. To convert a dash to a hyphen instead of deleting it, use csv-find-replace.

Column misalignment from extraction is not fixed

Not fixed

If the extractor split a cell across the wrong columns, the stripper cannot repair it — it only edits cell contents, not structure. Re-extract with better table detection (e.g. Camelot lattice mode) or correct the columns manually.

Bullet removal leaves leading/double spaces

Expected

Deleting a • that was followed by a space leaves a leading or double space. The stripper does not collapse whitespace; chain csv-whitespace-trimmer to clean it up.

Header row is never stripped

Preserved

The first row is protected. If the extractor put artefacts (soft hyphens, NBSPs) in the header cells, they survive. Clean the header separately with csv-find-replace or remove and re-add it.

Decimal points and hyphens in extracted figures survive

Preserved

Periods, commas, and ASCII hyphens are kept punctuation, so 19.99, 1,200, and 2026-01-15 pass through intact as long as Digits and Punctuation stay on. Don't untick those for tabular figures.

Currency and math symbols are deleted

Expected

€, £, $, %, ±, ×, ÷ are not in the keep-set and are removed, which can strip meaning from financial or scientific tables. Use csv-find-replace if a symbol must be preserved or substituted.

File over the free limit is blocked

Blocked

Free is 2 MB / 500 rows; Pro is 100 MB / 100,000 rows. A large multi-page extraction may exceed the free limit; split it with csv-row-splitter or upgrade before stripping.

In-cell line breaks from wrapped text are removed

Expected

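If a cell holds a multi-line value (extracted wrapped text, properly quoted), the newline is not in the keep-set and is deleted, concatenating the lines. Because the strip applies to every data cell, preserving line breaks means keeping that column out of the file you strip, for example by isolating the other columns with csv-column-filter first.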
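
If the concatenation is the problem rather than the line breaks themselves, one option is to turn each in-cell line break into a regular space before stripping, so the lines stay separated. A minimal sketch, assuming you pre-process the cells in Python; `flatten_linebreaks` is a hypothetical name.

```python
def flatten_linebreaks(text: str) -> str:
    """Join the lines of a multi-line cell with single regular spaces."""
    # str.splitlines handles \n, \r\n, and other Unicode line boundaries.
    return " ".join(text.splitlines())


print(flatten_linebreaks("Wrapped text\nfrom a PDF cell"))   # Wrapped text from a PDF cell
```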

Frequently asked questions

Why does 'finally' come out of a PDF as 'ﬁnally'?

PDFs use the typographic ligature ﬁ (U+FB01) for the fi pair. Important: this tool does NOT expand it to 'fi' — ﬁ is a Unicode letter and is kept as-is. To get real 'fi', apply Unicode NFKC normalisation before exporting the CSV.

What is a soft hyphen and does the tool remove it?

A soft hyphen (U+00AD) is an invisible hyphenation hint from typeset text. Yes, it is removed, because it is not a letter, digit, space, or kept punctuation. That rejoins a word such as 'inter¬national' (soft hyphen shown as ¬) into 'international'.

Does it fix column-alignment problems from extraction?

No. Cells split across the wrong columns are a structural issue the stripper can't touch — it only edits cell contents. Re-extract with better table detection (e.g. Camelot lattice mode) or fix the columns manually.

What happens to non-breaking spaces?

They are deleted entirely, because only the regular space (U+0020) is kept. A value spaced with NBSPs will lose those gaps and tokens fuse. Fold NBSP to a regular space first with csv-cleaner if you need to preserve spacing.

Are em and en dashes preserved?

No. Only the ASCII hyphen is kept; em (—) and en (–) dashes are removed. To convert them to a hyphen instead of deleting, use csv-find-replace.

Does it remove bullet points and dingbats?

Yes. Bullets (•) and dingbats are symbols, so they are deleted. You may be left with a leading or double space where the bullet was — clean it with the whitespace trimmer.

Are accented or non-Latin letters from the PDF kept?

Yes. The Letters class uses \p{L}, so accented Latin and all other scripts are preserved while symbols and invisibles are removed.

Can I clean only one column of the extracted table?

No. The strip applies to all data cells. Isolate a column with csv-column-filter first, or use csv-find-replace for targeted edits.

Is the extracted document data uploaded anywhere?

No. Parsing and stripping run entirely in your browser via PapaParse. No document or CSV is uploaded.

What file types and limits apply?

CSV, XLSX, XLS, ODS. Free: 2 MB and 500 data rows; Pro: 100 MB and 100,000 rows. Larger files are blocked when you load them.

Will it remove currency or math symbols from a financial/scientific table?

Yes — €, £, $, %, ±, × and similar are deleted because they are not in the keep-set. If a symbol must survive, use csv-find-replace to preserve or substitute it precisely.

How do I tidy the double spaces left after stripping bullets and dashes?

Run csv-whitespace-trimmer on the stripped file. This tool deletes characters but never collapses adjacent spaces.

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.

