How to strip special characters from PDF-extracted CSV data
- Step 1: Extract the PDF table to CSV. Use Tabula, pdfplumber, Camelot, or Acrobat to export the table as CSV. The cleaner also accepts XLSX/XLS/ODS if your extractor wrote a spreadsheet.
- Step 2: Drop the file onto the stripper. Free: 2 MB / 500 rows. Pro: 100 MB / 100,000 rows. PapaParse auto-detects the delimiter from the first rows.
- Step 3: Keep all four boxes on. Letters, Digits, Spaces, and Punctuation default to on. This removes soft hyphens, NBSPs, bullets, and control characters while keeping the real text. The strip applies to every data cell; there is no per-column selector.
- Step 4: Run Strip special chars. Soft hyphens, non-breaking spaces, bullets, and other non-keep characters are deleted. The header row is left untouched.
- Step 5: Review the preview for word-fusion. Check the first-10-row preview. Watch for words that ran together where a non-breaking space was removed (extractors often substitute NBSP for a normal space), and confirm ligatures rendered acceptably (they are kept, not expanded).
- Step 6: Download and use the cleaned CSV. Download writes <name>.stripped.csv as UTF-8 (or .stripped.xlsx for a spreadsheet). Import into your database, spreadsheet, or analysis pipeline.
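Before running the strip, it can help to know which artefacts the extracted file actually contains, so the preview in Step 5 holds no surprises. The following is a minimal Python sketch run outside the tool, not part of the stripper itself; audit_csv and the ARTEFACTS labels are hypothetical names chosen for illustration.

```python
import csv
import io
from collections import Counter

# Artefacts named in this guide; the labels are illustrative only.
ARTEFACTS = {
    "\u00ad": "soft hyphen",
    "\u00a0": "non-breaking space",
    "\u200b": "zero-width space",
    "\u2022": "bullet",
    "\u2014": "em dash",
    "\u2013": "en dash",
    "\ufb01": "ligature fi",
    "\ufb02": "ligature fl",
}

def audit_csv(text: str) -> Counter:
    """Count artefact characters in data cells (the header row is skipped)."""
    counts = Counter()
    rows = csv.reader(io.StringIO(text, newline=""))
    next(rows, None)  # skip the header, which the stripper leaves untouched
    for row in rows:
        for cell in row:
            for ch in cell:
                if ch in ARTEFACTS:
                    counts[ARTEFACTS[ch]] += 1
    return counts
```

A non-zero count for non-breaking spaces or dashes is the cue to decide whether deleting them (and fusing the tokens around them) is acceptable for that file.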
PDF-extraction artefacts and how the stripper handles them
Verified character by character against the keep-pattern, all four boxes on. 'Removed' means deleted entirely.
| Artefact | Where it comes from | Kept or removed | Result |
|---|---|---|---|
| Soft hyphen U+00AD | Justified / hyphenated text | Removed | co¬operate (¬ marks the soft hyphen) → cooperate |
| Non-breaking space U+00A0 | Layout spacing | Removed (not converted) | 12 345 (NBSP) → 12345 |
| Bullet •, dingbats | List and decorative glyphs | Removed | • Item → Item |
| Ligature ﬁ U+FB01 | Typeset fi/fl combinations | Kept (it is a letter) | ﬁnally stays ﬁnally; NOT expanded to fi |
| Control characters | Extraction-engine residue | Removed | Invisible junk deleted |
| Em/en dash — – | Typeset ranges and breaks | Removed | 10—20 → 1020; pp. 3–5 → pp. 35 |
| Zero-width space U+200B | Soft-wrap hints | Removed | Joins split words |
| Accented letters é ü | Genuine extracted text | Kept | Preserved |
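The keep-list behaviour in the table can be approximated in a few lines of Python. This is a minimal sketch, not the tool's code: strip_cell and KEPT_PUNCTUATION are hypothetical names, Unicode general categories stand in for the \p{L} letter class, digits are assumed to mean decimal digits, and the punctuation set holds only the three marks this page confirms as kept (period, comma, ASCII hyphen).

```python
import unicodedata

# Assumption: only the marks this guide confirms as kept.
# The tool's real punctuation set may be larger.
KEPT_PUNCTUATION = set(".,-")

def strip_cell(value: str) -> str:
    """Delete every character outside the keep-list (no substitution)."""
    kept = []
    for ch in value:
        category = unicodedata.category(ch)
        if category.startswith("L"):    # letters, including ligatures and accents
            kept.append(ch)
        elif category == "Nd":          # decimal digits (assumed digit class)
            kept.append(ch)
        elif ch == " ":                 # only U+0020; NBSP is not kept
            kept.append(ch)
        elif ch in KEPT_PUNCTUATION:
            kept.append(ch)
    return "".join(kept)
```

Because characters are deleted rather than replaced, anything that sat between two tokens (an NBSP, an em dash) leaves those tokens fused, which is the behaviour the table describes.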
What the stripper can't fix — use these instead
Structural and substitution tasks need a different tool or step.
| Problem from PDF extraction | Stripper handles it? | What to use |
|---|---|---|
| Cells split across the wrong columns | No (structural) | Re-extract with better table detection (Camelot lattice mode), or fix manually |
| Expand ligature ﬁ to fi | No (ﬁ is kept as a letter) | Unicode NFKC normalisation upstream before export |
| Convert NBSP to a regular space | No (NBSP is deleted) | csv-cleaner (hidden-whitespace normalise) |
| Replace one specific glyph (e.g. • → -) | No (it deletes, it does not substitute) | csv-find-replace |
| Merge wrapped rows back into one record | No | Manual fix or a dedicated extraction post-processor |
Cookbook
Real PDF-to-CSV rows, before and after. Each shows the typographic artefact and exactly what the keep-list does — including where it cannot help.
Soft hyphen scattered through justified text
Example: Justified PDF columns insert soft hyphens that pdfplumber preserves. Invisible in a viewer, they break exact matching. The stripper deletes them, rejoining the word.
Input (soft hyphen shown as ¬):
id,term
1,inter¬national
2,manage¬ment
Output (all boxes on):
id,term
1,international
2,management
Non-breaking space inside a number — the gotcha
Example: Extractors often emit a non-breaking space as a thousands separator. The stripper removes it (only the regular space is kept), which collapses the number. That is usually fine for numeric parsing, but verify it is what you want.
Input (NBSP shown as ~):
id,population
1,12~345
2,1~200~000
Output (all boxes on):
id,population
1,12345
2,1200000
If you wanted '12 345' kept, fold NBSP → space first with /tool/csv-cleaner.
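If you pre-process the file yourself rather than with csv-cleaner, the NBSP decision can be made explicit. The sketch below is one possible approach, not what csv-cleaner does internally: fold_nbsp and its join_digit_groups flag are hypothetical. It removes an NBSP that sits between two digits (a thousands separator) and turns every other NBSP into a regular space.

```python
import re

NBSP = "\u00a0"
# An NBSP with a digit on both sides, e.g. inside 1~200~000.
_NBSP_BETWEEN_DIGITS = re.compile(r"(?<=\d)\u00a0(?=\d)")

def fold_nbsp(value: str, join_digit_groups: bool = True) -> str:
    """Fold NBSP to a regular space, optionally joining digit groups first."""
    if join_digit_groups:
        value = _NBSP_BETWEEN_DIGITS.sub("", value)
    return value.replace(NBSP, " ")
```

With join_digit_groups=False every NBSP becomes a regular space, so a value such as 12~345 survives the stripper as '12 345'.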
Bullet glyphs from a list layout
Example: A bulleted list extracted into a cell carries • markers. They are deleted because they are symbols, leaving the text plus the space that followed each bullet (a leading space and a double space in this row).
Input:
id,features
1,• Waterproof • Wireless
Output (all boxes on):
id,features
1, Waterproof  Wireless
Tidy the spacing with /tool/csv-whitespace-trimmer.
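For reference, the tidy-up that follows bullet removal amounts to trimming the ends and collapsing runs of whitespace. This is a minimal sketch of that idea in Python, assuming a single-line cell; tidy_spaces is a hypothetical helper and not csv-whitespace-trimmer's actual implementation.

```python
def tidy_spaces(value: str) -> str:
    """Trim leading/trailing whitespace and collapse inner runs to one space."""
    # str.split() with no argument splits on any run of whitespace
    # and drops empty pieces, so joining with " " normalises the spacing.
    return " ".join(value.split())
```

Note that this also collapses tabs and line breaks, so it is only appropriate where a cell is meant to hold one line of text.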
Ligature is KEPT, not expanded — set expectations
Example: A frequent misconception is that the tool turns ﬁ into fi. It does not: ﬁ (U+FB01) is a Unicode letter, so the keep-list preserves it unchanged. If you need a real f followed by i, normalise upstream.
Input:
id,word
1,ﬁnally
2,ﬂoor
Output (all boxes on), UNCHANGED:
id,word
1,ﬁnally
2,ﬂoor
Use Unicode NFKC normalisation before export to get 'finally' and 'floor'.
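The upstream normalisation can be done with Python's standard unicodedata module. A minimal sketch, with expand_ligatures as a hypothetical wrapper name:

```python
import unicodedata

def expand_ligatures(value: str) -> str:
    """Apply NFKC, which decomposes ligatures such as U+FB01 into plain letters."""
    return unicodedata.normalize("NFKC", value)
```

NFKC is broader than ligature expansion: it also maps a non-breaking space to a regular space and rewrites other compatibility characters (a superscript ² becomes a plain 2, for example), so check that those side effects are acceptable for your table before applying it to every cell.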
Em dash in a page-range cell
Example: Typeset ranges use em/en dashes (— –), which are not in the kept punctuation set, so they are deleted and the surrounding tokens fuse. Decide whether to replace them with a hyphen instead.
Input:
id,pages
1,pp. 10—24
2,Vol. 3–5
Output (all boxes on):
id,pages
1,pp. 1024
2,Vol. 35
To turn '—' into '-' instead of deleting it, use /tool/csv-find-replace.
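The same substitution can be scripted before the file reaches the stripper. The following is a small sketch using str.translate; dashes_to_hyphen and DASH_TO_HYPHEN are hypothetical names, and this is an alternative to csv-find-replace rather than a description of it.

```python
# Map em dash (U+2014) and en dash (U+2013) to the ASCII hyphen,
# which the keep-list preserves.
DASH_TO_HYPHEN = str.maketrans({"\u2014": "-", "\u2013": "-"})

def dashes_to_hyphen(value: str) -> str:
    """Replace typeset dashes with an ASCII hyphen so ranges stay readable."""
    return value.translate(DASH_TO_HYPHEN)
```

After this step, 'pp. 10—24' reaches the stripper as 'pp. 10-24' and passes through intact.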
Errors and edge cases
Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.
Ligatures are kept, never expanded to ASCII
Preserved: ﬁ (U+FB01), ﬂ (U+FB02), and other typographic ligatures are classified as letters by \p{L}, so they survive unchanged. The tool does NOT turn them into fi/fl. For that, apply Unicode NFKC normalisation before exporting the CSV.
Non-breaking space is deleted, fusing tokens
Expected: Extractors substitute NBSP (U+00A0) for spacing; the stripper deletes it because only the regular space is kept. NBSP-separated values such as 12 345 and New York fuse into 12345 and NewYork. If you need the space preserved, fold NBSP to a regular space first with csv-cleaner.
Em/en dashes are removed, joining tokens
Expected: — and – are not in the kept punctuation set (only the ASCII hyphen - is), so 10—20 becomes 1020. To convert a dash to a hyphen instead of deleting it, use csv-find-replace.
Column misalignment from extraction is not fixed
Not fixed: If the extractor split a cell across the wrong columns, the stripper cannot repair it, because it only edits cell contents, not structure. Re-extract with better table detection (e.g. Camelot lattice mode) or correct the columns manually.
Bullet removal leaves leading/double spaces
Expected: Deleting a • that was followed by a space leaves a leading or double space. The stripper does not collapse whitespace; chain csv-whitespace-trimmer to clean it up.
Header row is never stripped
Preserved: The first row is protected. If the extractor put artefacts (soft hyphens, NBSPs) in the header cells, they survive. Clean the header separately with csv-find-replace or remove and re-add it.
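If you would rather fix the header in a script, the first row can be cleaned on its own while the data rows pass through unchanged. This is a minimal sketch under stated assumptions: clean_header and HEADER_FIXES are hypothetical names, and the choice to delete soft hyphens and zero-width spaces but turn NBSP into a regular space is one reasonable policy, not the tool's.

```python
import csv
import io

# Delete soft hyphens and zero-width spaces; turn NBSP into a regular space.
HEADER_FIXES = str.maketrans({"\u00ad": None, "\u200b": None, "\u00a0": " "})

def clean_header(text: str) -> str:
    """Return the CSV text with invisible artefacts removed from row 1 only."""
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if rows:
        rows[0] = [cell.translate(HEADER_FIXES) for cell in rows[0]]
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Clean header names matter most when a later import step matches columns by exact name, since an invisible soft hyphen makes 'name' and 'na¬me' different keys.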
Decimal points and hyphens in extracted figures survive
Preserved: Periods, commas, and ASCII hyphens are kept punctuation, so 19.99, 1,200, and 2026-01-15 pass through intact as long as Digits and Punctuation stay on. Don't untick those for tabular figures.
Currency and math symbols are deleted
Expected: €, £, $, %, ±, ×, ÷ are not in the keep-set and are removed, which can strip meaning from financial or scientific tables. Use csv-find-replace if a symbol must be preserved or substituted.
File over the free limit is blocked
Blocked: Free is 2 MB / 500 rows; Pro is 100 MB / 100,000 rows. A large multi-page extraction may exceed the free limit; split it with csv-row-splitter or upgrade before stripping.
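A quick local check can tell you whether an extraction is likely to fit before you drop it onto the tool. This is a rough sketch only: fits_free_tier is a hypothetical helper, a megabyte is assumed to be 1024 × 1024 bytes, and the row limit is assumed to count data rows excluding the header; the tool's own accounting may differ.

```python
import csv
import io

FREE_MAX_BYTES = 2 * 1024 * 1024  # assumption: binary megabytes
FREE_MAX_ROWS = 500               # assumption: data rows, header excluded

def fits_free_tier(text: str) -> bool:
    """Estimate whether CSV text is within the free size and row limits."""
    size = len(text.encode("utf-8"))
    rows = list(csv.reader(io.StringIO(text, newline="")))
    # Skip the header and ignore blank lines when counting data rows.
    data_rows = sum(1 for row in rows[1:] if row)
    return size <= FREE_MAX_BYTES and data_rows <= FREE_MAX_ROWS
```

Parsing with the csv module rather than counting newlines keeps a quoted multi-line cell from being counted as several rows.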
In-cell line breaks from wrapped text are removed
Expected: If a cell holds a multi-line value (extracted wrapped text, properly quoted), the newline is not in the keep-set and is deleted, concatenating the lines. If you need to preserve line breaks, do not strip that column.
Frequently asked questions
Why does 'finally' come out of a PDF as 'ﬁnally'?
PDFs use the typographic ligature ﬁ (U+FB01) for the fi pair. Important: this tool does NOT expand it to 'fi', because ﬁ is a Unicode letter and is kept as-is. To get a real 'fi', apply Unicode NFKC normalisation before exporting the CSV.
What is a soft hyphen and does the tool remove it?
A soft hyphen (U+00AD) is an invisible hyphenation hint from typeset text. Yes — it is removed, because it is not a letter, digit, space, or kept punctuation. That rejoins words like 'inter-national' into 'international'.
Does it fix column-alignment problems from extraction?
No. Cells split across the wrong columns are a structural issue the stripper can't touch — it only edits cell contents. Re-extract with better table detection (e.g. Camelot lattice mode) or fix the columns manually.
What happens to non-breaking spaces?
They are deleted entirely, because only the regular space (U+0020) is kept. A value spaced with NBSPs will lose those gaps and tokens fuse. Fold NBSP to a regular space first with csv-cleaner if you need to preserve spacing.
Are em and en dashes preserved?
No. Only the ASCII hyphen is kept; em (—) and en (–) dashes are removed. To convert them to a hyphen instead of deleting, use csv-find-replace.
Does it remove bullet points and dingbats?
Yes. Bullets (•) and dingbats are symbols, so they are deleted. You may be left with a leading or double space where the bullet was — clean it with the whitespace trimmer.
Are accented or non-Latin letters from the PDF kept?
Yes. The Letters class uses \p{L}, so accented Latin and all other scripts are preserved while symbols and invisibles are removed.
Can I clean only one column of the extracted table?
No. The strip applies to all data cells. Isolate a column with csv-column-filter first, or use csv-find-replace for targeted edits.
Is the extracted document data uploaded anywhere?
No. Parsing and stripping run entirely in your browser via PapaParse. No document or CSV is uploaded.
What file types and limits apply?
CSV, XLSX, XLS, ODS. Free: 2 MB and 500 data rows; Pro: 100 MB and 100,000 rows. Larger files are blocked at upload.
Will it remove currency or math symbols from a financial/scientific table?
Yes — €, £, $, %, ±, × and similar are deleted because they are not in the keep-set. If a symbol must survive, use csv-find-replace to preserve or substitute it precisely.
How do I tidy the double spaces left after stripping bullets and dashes?
Run csv-whitespace-trimmer on the stripped file. This tool deletes characters but never collapses adjacent spaces.
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.