How to normalise a multilingual CSV by stripping special characters
- Step 1: Export the multilingual CSV. Download it from your PIM, CMS, CRM, or translation-management system. CSV, XLSX, XLS, and ODS are all accepted.
- Step 2: Drop the file onto the stripper. Free tier: 2 MB / 500 rows. Pro: 100 MB / 100,000 rows. PapaParse auto-detects the delimiter, which helps with EU-locale semicolon exports.
- Step 3: Keep Letters, Digits, Spaces, Punctuation on. All four boxes default to on. Crucially, keep Letters on: that is what preserves every script's characters via \p{L}. There is no separate 'safe mode' and no per-column control; the strip applies to all data cells.
- Step 4: Run Strip special chars. Emoji, symbols, and invisibles are deleted; letters of all scripts, digits, spaces, and common punctuation remain.
- Step 5: Spot-check localised columns. In the first-10-row preview, confirm accented and non-Latin names read correctly. Watch for words that fused because a non-breaking space was removed (see the edge cases).
- Step 6: Download and import. Download writes <name>.stripped.csv as UTF-8 (the safest encoding for multilingual data). Import into your target system.
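The strip described in the steps above can be approximated offline. The following is a minimal Python sketch, not the tool's actual implementation: the function names, the `KEPT_PUNCTUATION` set, and the use of Unicode category `Nd` for digits are all assumptions made for illustration. Python's built-in `re` module has no `\p{L}`, so the sketch tests the Unicode general category instead. Note that combining marks are not in category L, so this sketch drops them; the real keep-pattern may behave differently.

```python
import csv
import io
import unicodedata

# Assumption: the tool's real punctuation set is not documented here;
# this small set stands in for "common punctuation".
KEPT_PUNCTUATION = set(".,;:!?'\"()-/&")


def keep_char(ch, letters=True, digits=True, spaces=True, punctuation=True):
    """Return True if ch belongs to one of the enabled keep classes."""
    category = unicodedata.category(ch)
    if letters and category.startswith("L"):  # analogue of \p{L}
        return True
    if digits and category == "Nd":  # assumption: decimal digits of any script
        return True
    if spaces and ch == " ":  # regular space U+0020 only
        return True
    if punctuation and ch in KEPT_PUNCTUATION:
        return True
    return False


def strip_cell(cell, **classes):
    """Delete every character that is not in an enabled keep class."""
    return "".join(ch for ch in cell if keep_char(ch, **classes))


def strip_csv(text, **classes):
    """Strip every data cell; the header row is passed through unchanged."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(rows[0])  # header left as-is
    for row in rows[1:]:
        writer.writerow([strip_cell(cell, **classes) for cell in row])
    return out.getvalue()
```

Calling `strip_cell("19,99", punctuation=False)` shows why Digits and Punctuation should stay on: with the comma removed, the value collapses into a run of digits.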
How scripts and invisibles are handled (all boxes on)
Verified against the keep-pattern. The Letters class is \p{L}, which is script-agnostic.
| Input | Category | Kept or removed | Result |
|---|---|---|---|
| José, Müller, Niño | Accented Latin letters | Kept | Unchanged |
| 中文, 日本語, 한국어 | CJK letters | Kept | Unchanged |
| العربية, עברית | Arabic / Hebrew letters | Kept | Unchanged |
| Привет, Ελληνικά | Cyrillic / Greek letters | Kept | Unchanged |
| Zero-width space U+200B | Invisible | Removed | Joins surrounding characters |
| Non-breaking space U+00A0 | Invisible space variant | Removed | New York (with NBSP) → NewYork |
| Soft hyphen U+00AD | Invisible hyphenation hint | Removed | co¬operate (soft hyphen shown as ¬) → cooperate |
| Emoji 😀, 🇫🇷 | Pictographic | Removed | Deleted from the cell |
| Curly quotes “ ” ‘ ’ | Typographic punctuation | Removed (not folded) | «bonjour»-style quotes deleted |
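To see why each row in the table lands where it does, it can help to look up the Unicode general category of a sample character. The sketch below uses Python's standard `unicodedata` module; the `describe` helper and the sample list are illustrative names, not part of the tool. Categories beginning with `L` are letters; the removed characters fall into other categories such as `Cf` (format), `Zs` (space separator), `So` (other symbol), and `Pi`/`Pf` (initial/final quote punctuation).

```python
import unicodedata

# Sample characters from the table, written as escapes so invisibles are explicit.
SAMPLES = [
    ("\u00e9", "accented Latin letter"),
    ("\u4e2d", "CJK ideograph"),
    ("\u0639", "Arabic letter"),
    ("\u200b", "zero-width space"),
    ("\u00a0", "non-breaking space"),
    ("\u00ad", "soft hyphen"),
    ("\U0001F600", "emoji"),
    ("\u201c", "left curly quote"),
    ("\u00ab", "left guillemet"),
]


def describe(ch):
    """Return the code point and general category, e.g. 'U+00E9 Ll'."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)}"


if __name__ == "__main__":
    for ch, label in SAMPLES:
        print(f"{describe(ch):<12} {label}")
```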
What this tool does NOT do (and where to go instead)
Honest limits so you reach for the right tool for normalisation tasks the stripper cannot perform.
| Task | Does this tool do it? | Use instead |
|---|---|---|
| Fold curly quotes to straight ASCII | No — it deletes them | csv-cleaner (smart-quote normalise) |
| Fold NBSP to a regular space | No — it deletes NBSP entirely | csv-cleaner (hidden-whitespace normalise) |
| Change file encoding (e.g. to UTF-16) | No — output is always UTF-8 | A dedicated encoder after download |
| Transliterate é → e or CJK → pinyin | No — letters are kept as-is | A transliteration library; not a CSV micro-tool |
| Lower/upper-case localised text | No | csv-case-converter |
Cookbook
Real multilingual rows, before and after. The point of each: legitimate non-Latin and accented letters survive; only noise is deleted — but watch the NBSP and curly-quote cases.
Accents and CJK preserved, emoji removed
Example: A mixed-locale customer table has Spanish, German, and Japanese names plus a stray emoji from a signup form. With Letters on, all the names survive; only the emoji goes.
Input:
    id,name
    1,José García 🇪🇸
    2,Jürgen Müller
    3,田中 太郎
Output (all boxes on):
    id,name
    1,José García
    2,Jürgen Müller
    3,田中 太郎
Zero-width space breaking a duplicate check
Example: Two records look identical but a zero-width space hides in one, so a multilingual dedup misses it. The stripper removes the invisible character so the values become genuinely equal.
Input (U+200B shown as ·):
    id,city
    1,Mü·nchen
    2,München
These won't match in a dedup.
Output (all boxes on):
    id,city
    1,München
    2,München
Now run /tool/csv-deduplicator to collapse them.
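The mismatch is easy to reproduce in a few lines of Python. This is only an illustration of the comparison problem, with hypothetical variable and function names; it removes U+200B alone rather than applying the full strip.

```python
ZERO_WIDTH_SPACE = "\u200b"

# The two city values from the example; the first hides a U+200B after "ü".
with_invisible = "M\u00fc\u200bnchen"
clean = "M\u00fcnchen"


def remove_zero_width_space(value):
    """Delete every zero-width space (U+200B) from the value."""
    return value.replace(ZERO_WIDTH_SPACE, "")


def distinct_values(values):
    """Count distinct values the way an exact-match dedup would see them."""
    return len(set(values))


if __name__ == "__main__":
    raw = [with_invisible, clean]
    print(distinct_values(raw))  # the raw values compare as different
    print(distinct_values([remove_zero_width_space(v) for v in raw]))
```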
Non-breaking space fuses two words — the gotcha
Example: A place name typed with a non-breaking space (common from web copy-paste) loses the NBSP entirely, because only the regular space is kept. The two words run together. This is the case where you should fold NBSP to a space first instead.
Input (NBSP shown as ~):
    id,place
    1,New~York
    2,São~Paulo
Output (all boxes on):
    id,place
    1,NewYork
    2,SãoPaulo
To keep them separate, fold NBSP → space first with /tool/csv-cleaner, then strip.
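If you would rather do the fold in a script than in csv-cleaner, the operation itself is a one-character replacement. A minimal sketch, assuming only U+00A0 needs folding (other space variants are not handled here, and `fold_nbsp` is a hypothetical name):

```python
NBSP = "\u00a0"


def fold_nbsp(value):
    """Replace each non-breaking space (U+00A0) with a regular space (U+0020)."""
    return value.replace(NBSP, " ")


if __name__ == "__main__":
    for place in ["New\u00a0York", "S\u00e3o\u00a0Paulo"]:
        print(repr(fold_nbsp(place)))
```

Folding before stripping matters because the order is not reversible: once the NBSP has been deleted, there is nothing left to mark where the word boundary was.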
Curly quotes in localised text are deleted
Example: French guillemets and curly quotes used around localised phrases are removed, not converted. Decide whether that loss is acceptable or whether you want them folded to straight quotes via csv-cleaner.
Input:
    id,phrase
    1,“Bonjour”
    2,‘Hola’
Output (all boxes on):
    id,phrase
    1,Bonjour
    2,Hola
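For comparison, folding rather than deleting can be sketched with a translation table. The mapping below is an assumption for illustration: in particular, mapping guillemets to a straight double quote is a choice made here, and csv-cleaner's own smart-quote rules may differ.

```python
# Hypothetical fold table: typographic quotes mapped to straight ASCII quotes.
QUOTE_FOLDS = str.maketrans({
    "\u201c": '"',  # left double curly quote
    "\u201d": '"',  # right double curly quote
    "\u2018": "'",  # left single curly quote
    "\u2019": "'",  # right single curly quote
    "\u00ab": '"',  # left guillemet (assumed mapping)
    "\u00bb": '"',  # right guillemet (assumed mapping)
})


def fold_quotes(value):
    """Replace curly quotes and guillemets with straight ASCII quotes."""
    return value.translate(QUOTE_FOLDS)


if __name__ == "__main__":
    print(fold_quotes("\u201cBonjour\u201d"))
    print(fold_quotes("\u2018Hola\u2019"))
```

A folded cell now contains straight double quotes, so write it back out with a CSV writer that escapes quotes rather than joining fields by hand.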
Soft hyphen removed from a hyphenated translation
Example: Translation tools insert soft hyphens (U+00AD) as hyphenation hints. They are invisible but break exact matching. The stripper deletes them, rejoining the word.
Input (soft hyphen shown as ¬):
    id,word
    1,Zusammen¬arbeit
Output (all boxes on):
    id,word
    1,Zusammenarbeit
Errors and edge cases
Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.
Letters of every script are preserved
Supported: Because the Letters class is \p{L}, accented Latin, CJK, Arabic, Hebrew, Cyrillic, Greek, and other scripts' letters are all kept. This is the core reason to use this tool for multilingual data rather than an ASCII-only stripper.
Non-breaking space is deleted, joining words
Expected: Only the regular space (U+0020) is kept; the non-breaking space (U+00A0) is removed entirely, so New York written with an NBSP becomes NewYork. If NBSPs are real separators in your data, fold them to regular spaces first with csv-cleaner.
Curly quotes and guillemets are deleted, not folded
By design: “ ” ‘ ’ « » are removed because they are not in the kept punctuation set. If localised quotation marks carry meaning, normalise them with csv-cleaner instead of stripping.
There is no 'safe mode' toggle
By design: Older descriptions mention a 'safe mode'; the actual UI has four keep checkboxes (Letters, Digits, Spaces, Punctuation), all on by default. Keeping Letters on is the equivalent of preserving real characters; there is no separate safe-mode switch.
Encoding is not changed — output is UTF-8
By design: The tool always downloads UTF-8 and does not convert encodings. UTF-8 is the safest choice for multilingual data, but if your target needs UTF-16 or a legacy codepage, convert after download with a dedicated encoder.
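If no dedicated encoder is at hand, the conversion after download can be sketched in Python. The function names and file names here are hypothetical. The sketch works on raw bytes so that line endings in the CSV are passed through untouched.

```python
from pathlib import Path


def reencode_bytes(data, target="utf-16"):
    """Decode UTF-8 bytes and re-encode them in the target encoding."""
    return data.decode("utf-8").encode(target)


def reencode_file(src, dst, target="utf-16"):
    """Read a UTF-8 file and write a copy in the target encoding."""
    Path(dst).write_bytes(reencode_bytes(Path(src).read_bytes(), target))


if __name__ == "__main__":
    # Hypothetical file names for illustration.
    reencode_file("customers.stripped.csv", "customers.utf16.csv")
```

With a legacy codepage as the target (for example `cp1252`), Python raises `UnicodeEncodeError` for any character the codepage cannot represent, which is a useful early warning for multilingual data.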
Combining marks could be affected if separated
Edge: Precomposed accented letters (NFC, e.g. a single é) are single letters and are kept. Decomposed forms (NFD: base letter + combining accent) are two code points: a base letter, which matches \p{L}, and a combining mark, which belongs to the separate Mark category (\p{M}) rather than \p{L}. Both are expected to be kept, but verify that rare decomposed data renders correctly after stripping.
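The difference between the two forms can be inspected with Python's standard `unicodedata` module. This sketch only shows how the same visible é is represented; it makes no claim about what the tool's keep-pattern does with the combining mark. The `categories` helper is an illustrative name.

```python
import unicodedata

# The same visible character in both normalisation forms.
composed = unicodedata.normalize("NFC", "e\u0301")   # single code point U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # "e" followed by U+0301


def categories(text):
    """Return the Unicode general category of each code point in text."""
    return [unicodedata.category(ch) for ch in text]


if __name__ == "__main__":
    print(len(composed), categories(composed))
    print(len(decomposed), categories(decomposed))
```

Normalising a column to NFC before stripping turns decomposed sequences into single letters wherever a precomposed form exists, which sidesteps the question for most European text.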
Header row is left as-is
Preserved: Localised or English column names in the first row are never modified, including any invisibles in them. If a header carries a BOM or zero-width character, clean it separately with csv-find-replace.
Digits/Punctuation off would damage numbers and dates
Data loss: International data still includes IDs, prices, and dates. Turning Digits or Punctuation off removes those characters globally (19,99 → 1999 or worse). Keep both on for normalisation; use csv-find-replace for targeted numeric edits.
File over the tier limit is blocked
Blocked: Free is 2 MB / 500 rows; Pro is 100 MB / 100,000 rows. Large multilingual catalogues often exceed the free tier; split with csv-row-splitter or upgrade.
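A quick pre-flight check can tell you whether a file is likely to be blocked before you try it. This is a rough sketch under stated assumptions: it treats 2 MB as 2 × 1024 × 1024 bytes and counts rows excluding the header, and the tool's exact accounting may differ. The constant and function names are hypothetical.

```python
import csv
import io

FREE_MAX_BYTES = 2 * 1024 * 1024  # assumption: "2 MB" read as 2 MiB
FREE_MAX_ROWS = 500               # data rows, header excluded


def fits_free_tier(csv_text):
    """Return True if the CSV text is within the assumed free-tier limits."""
    size = len(csv_text.encode("utf-8"))
    rows = list(csv.reader(io.StringIO(csv_text)))
    data_rows = max(len(rows) - 1, 0)  # first row is the header
    return size <= FREE_MAX_BYTES and data_rows <= FREE_MAX_ROWS


if __name__ == "__main__":
    sample = "id,name\n" + "".join(f"{i},x\n" for i in range(600))
    print(fits_free_tier(sample))
```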
Right-to-left text is preserved but bidi marks may go
Edge: Arabic and Hebrew letters are kept, but explicit bidirectional control marks (LRM/RLM, U+200E/U+200F) are not letters and are removed. This usually has no visual effect but can change directionality in rare mixed-direction strings, so verify RTL columns after stripping.
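To find out in advance which cells carry bidi marks or other invisible format characters, you can scan for code points in Unicode category `Cf`, which includes LRM, RLM, zero-width space, soft hyphen, and U+FEFF. A minimal sketch, with a hypothetical helper name:

```python
import unicodedata


def find_format_chars(text):
    """List (index, code point, name) for every format character (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


if __name__ == "__main__":
    # A mixed-direction value with a right-to-left mark after the Latin text.
    print(find_format_chars("abc\u200f\u05e2\u05d1\u05e8\u05d9\u05ea"))
```

Running this over RTL columns before stripping gives you a list of the cells worth re-checking afterwards.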
Frequently asked questions
Will this remove é, ü, or ñ from names?
No. The Letters class uses \p{L}, which matches every Unicode letter including accented Latin. As long as Letters is checked (the default), accented characters are preserved.
What about Chinese, Japanese, Arabic, or Cyrillic text?
All preserved. \p{L} covers letters of every script — CJK, Arabic, Hebrew, Cyrillic, Greek, and more. Only emoji, symbols, and invisibles are removed.
Is there a 'safe mode'?
No. The interface has four keep checkboxes: Letters, Digits, Spaces, Punctuation, all on by default. Keeping Letters on is what preserves real characters — there is no separate safe-mode toggle.
Does it change the file encoding?
It always outputs UTF-8 and does not convert encodings. UTF-8 is the safest format for multilingual data. For another encoding, convert after download.
Why did 'New York' become 'NewYork'?
It was typed with a non-breaking space (U+00A0), which the tool removes because only the regular space is kept. Fold NBSPs to regular spaces first with csv-cleaner if you need to keep word separation.
Does it fold curly quotes to straight quotes?
No — it deletes them. Curly quotes and guillemets are removed. For folding rather than deletion, use csv-cleaner's smart-quote normalisation.
Does it remove zero-width spaces and soft hyphens?
Yes. Zero-width space (U+200B), soft hyphen (U+00AD), and BOM (U+FEFF) are all deleted because they are not letters, digits, spaces, or kept punctuation.
Can I limit the strip to certain columns?
No. The strip covers all data cells (the header row is excluded). Isolate columns with csv-column-filter first, or use csv-find-replace for targeted changes.
Will it transliterate accented or CJK text to ASCII?
No. Letters are kept as-is; there is no transliteration. é stays é, and CJK stays CJK. For transliteration you need a different tool.
Is my international customer data uploaded?
No. Processing is entirely in-browser via PapaParse. PII never leaves your machine.
What are the size and row limits?
Free: 2 MB and 500 data rows. Pro: 100 MB and 100,000 rows. Larger files are blocked at upload.
How do I dedup multilingual rows after cleaning?
Run csv-deduplicator on the stripped file. Removing zero-width and invisible characters first makes near-identical rows compare equal, so the dedup catches the duplicates it would otherwise miss.
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.