Normalise a Multilingual CSV — Strip Special Characters Online

How to normalise a multilingual CSV by stripping special characters

Step 1
Export the multilingual CSV — Download from your PIM, CMS, CRM, or translation-management system. CSV, XLSX, XLS, and ODS are all accepted.
Step 2
Drop the file onto the stripper — Free tier: 2 MB / 500 rows. Pro: 100 MB / 100,000 rows. PapaParse auto-detects the delimiter, which is helpful for EU-locale semicolon exports.
Step 3
Keep Letters, Digits, Spaces, Punctuation on — All four boxes default to on. Crucially, keep Letters on — that is what preserves every script's characters via \p{L}. There is no separate 'safe mode' and no per-column control; the strip applies to all data cells.
Step 4
Run Strip special chars — Emoji, symbols, and invisibles are deleted; letters of all scripts, digits, spaces, and common punctuation remain.
Step 5
Spot-check localised columns — In the first-10-row preview, confirm accented and non-Latin names read correctly. Watch for words that fused because a non-breaking space was removed (see the edge cases).
Step 6
Download and import — Download writes <name>.stripped.csv as UTF-8 (the safest encoding for multilingual data). Import into your target system.
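
The strip step in the procedure above can be sketched as one keep-pattern built from the four checkboxes. This is a minimal TypeScript illustration, not the tool's actual source. Only `\p{L}` for Letters and the regular space for Spaces come from this page; the digit class (`\p{Nd}`), the ASCII punctuation set, and the names `KeepOptions`, `buildStripper`, and `stripRows` are assumptions made for the example.

```typescript
// Hypothetical option names mirroring the four keep checkboxes.
interface KeepOptions {
  letters: boolean;
  digits: boolean;
  spaces: boolean;
  punctuation: boolean;
}

// ASSUMPTION: a small ASCII punctuation set; the tool's real list may differ.
// Written as regex character-class source, so the hyphen is escaped.
const ASSUMED_PUNCTUATION = ".,;:!?'\"()\\-/@_&";

function buildStripper(opts: KeepOptions): (cell: string) => string {
  let kept = "";
  if (opts.letters) kept += "\\p{L}"; // every Unicode letter, any script
  if (opts.digits) kept += "\\p{Nd}"; // ASSUMPTION: decimal digits
  if (opts.spaces) kept += " "; // only the regular space U+0020
  if (opts.punctuation) kept += ASSUMED_PUNCTUATION;
  // Delete anything outside the kept classes; with every box off, delete all.
  const source = kept === "" ? "[\\s\\S]" : "[^" + kept + "]";
  const remove = new RegExp(source, "gu");
  return (cell: string) => cell.replace(remove, "");
}

// Apply the stripper to data cells only; the header row (index 0) is left as-is.
function stripRows(rows: string[][], strip: (cell: string) => string): string[][] {
  return rows.map((row, i) => (i === 0 ? row : row.map((cell) => strip(cell))));
}

const stripAll = buildStripper({ letters: true, digits: true, spaces: true, punctuation: true });
const cleaned = stripRows(
  [["id", "name"], ["1", "José García 🇪🇸"], ["2", "Jürgen Müller"], ["3", "田中 太郎"]],
  stripAll
);
```

Building the pattern as a negated character class means anything not explicitly kept is deleted, which is why invisibles and emoji disappear without being listed one by one.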

How scripts and invisibles are handled (all boxes on)

Verified against the keep-pattern. The Letters class is \p{L}, which is script-agnostic.

Input	Category	Kept or removed	Result
`José`, `Müller`, `Niño`	Accented Latin letters	Kept	Unchanged
`中文`, `日本語`, `한국어`	CJK letters	Kept	Unchanged
`العربية`, `עברית`	Arabic / Hebrew letters	Kept	Unchanged
`Привет`, `Ελληνικά`	Cyrillic / Greek letters	Kept	Unchanged
Zero-width space U+200B	Invisible	Removed	Joins surrounding characters
Non-breaking space U+00A0	Invisible space variant	Removed	`New York` (with NBSP) → `NewYork`
Soft hyphen U+00AD	Invisible hyphenation hint	Removed	`co¬operate` (soft hyphen shown as ¬) → `cooperate`
Emoji `😀`, `🇫🇷`	Pictographic	Removed	Deleted from the cell
Curly quotes and guillemets `“ ” ‘ ’ « »`	Typographic punctuation	Removed (not folded)	`“Bonjour”` → `Bonjour`; `«bonjour»` → `bonjour`
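
The letter rows of the table can be checked directly against the Unicode `\p{L}` property. The snippet below is a small TypeScript sketch using the standard `u`-flag property escapes; the variable names are illustrative and it checks only the Unicode categories, not the tool itself.

```typescript
// True when every character of the string is a Unicode letter.
const onlyLetters = new RegExp("^\\p{L}+$", "u");
const anyLetter = new RegExp("\\p{L}", "u");

const letterSamples = [
  "José", "Müller", "Niño",
  "中文", "日本語", "한국어",
  "العربية", "עברית",
  "Привет", "Ελληνικά",
];
const allLetters = letterSamples.every((s) => onlyLetters.test(s));

// Zero-width space, non-breaking space, soft hyphen, and an emoji:
// none of them is in \p{L}, so a letters-based keep-pattern drops them.
const nonLetters = ["\u200B", "\u00A0", "\u00AD", "😀"];
const noneAreLetters = nonLetters.every((c) => !anyLetter.test(c));
```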

What this tool does NOT do (and where to go instead)

Honest limits so you reach for the right tool for normalisation tasks the stripper cannot perform.

Task	Does this tool do it?	Use instead
Fold curly quotes to straight ASCII	No — it deletes them	csv-cleaner (smart-quote normalise)
Fold NBSP to a regular space	No — it deletes NBSP entirely	csv-cleaner (hidden-whitespace normalise)
Change file encoding (e.g. to UTF-16)	No — output is always UTF-8	A dedicated encoder after download
Transliterate `é` → `e` or CJK → pinyin	No — letters are kept as-is	A transliteration library; not a CSV micro-tool
Lower/upper-case localised text	No	csv-case-converter

Cookbook

Real multilingual rows, before and after. The point of each: legitimate non-Latin and accented letters survive; only noise is deleted — but watch the NBSP and curly-quote cases.

Accents and CJK preserved, emoji removed

Example

A mixed-locale customer table has Spanish, German, and Japanese names plus a stray emoji from a signup form. With Letters on, all the names survive; only the emoji goes, leaving behind the regular space that preceded it.

Input:
id,name
1,José García 🇪🇸
2,Jürgen Müller
3,田中 太郎

Output (all boxes on):
id,name
1,José García 
2,Jürgen Müller
3,田中 太郎

Zero-width space breaking a duplicate check

Example

Two records look identical but a zero-width space hides in one, so a multilingual dedup misses it. The stripper removes the invisible character so the values become genuinely equal.

Input (U+200B shown as ·):
id,city
1,Mü·nchen
2,München

These won't match in a dedup.

Output (all boxes on):
id,city
1,München
2,München

Now run /tool/csv-deduplicator to collapse them.
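
Why the dedup misses the pair, and why stripping fixes it, can be shown in a few lines. This is a hedged TypeScript sketch: `dedupeValues` is a hypothetical helper, not the csv-deduplicator's implementation, and it removes only the three invisibles named on this page.

```typescript
// Zero-width space, soft hyphen, and BOM.
const hiddenChars = new RegExp("[\\u200B\\u00AD\\uFEFF]", "gu");

const withHidden = "Mü" + "\u200B" + "nchen"; // looks like "München" on screen
const plain = "München";

// Exact comparison fails while the invisible character is present.
const equalBefore = withHidden === plain;
const equalAfter = withHidden.replace(hiddenChars, "") === plain;

// Hypothetical dedup: compare values after removing invisibles.
function dedupeValues(values: string[]): string[] {
  const unique: string[] = [];
  for (let i = 0; i < values.length; i++) {
    const key = values[i].replace(hiddenChars, "");
    if (unique.indexOf(key) === -1) unique.push(key);
  }
  return unique;
}

const cities = dedupeValues([withHidden, plain]);
```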

Non-breaking space fuses two words — the gotcha

Example

A place name typed with a non-breaking space (common from web copy-paste) loses the NBSP entirely, because only the regular space is kept. The two words run together. This is the case where you should fold NBSP to a space first instead.

Input (NBSP shown as ~):
id,place
1,New~York
2,São~Paulo

Output (all boxes on):
id,place
1,NewYork
2,SãoPaulo

To keep them separate, fold NBSP → space first with /tool/csv-cleaner, then strip.
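
The order of operations matters here, as the following TypeScript sketch shows. It is not csv-cleaner's code: `foldNbsp` and `stripCell` are hypothetical helpers, and the keep-pattern (letters, decimal digits, the regular space, a short ASCII punctuation list) is an assumption standing in for the tool's real one.

```typescript
const NBSP = "\u00A0";

// Fold every non-breaking space into a regular space.
const foldNbsp = (cell: string): string => cell.split(NBSP).join(" ");

// ASSUMED keep-pattern: anything outside these classes is deleted.
const notKept = new RegExp("[^\\p{L}\\p{Nd} .,;:!?'\"()\\-]", "gu");
const stripCell = (cell: string): string => cell.replace(notKept, "");

const place = "New" + NBSP + "York";

const strippedOnly = stripCell(place); // NBSP deleted, words fuse
const foldedThenStripped = stripCell(foldNbsp(place)); // separator survives
```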

Curly quotes in localised text are deleted

Example

French guillemets and curly quotes used around localised phrases are removed, not converted. Decide whether that loss is acceptable or whether you want them folded to straight quotes via csv-cleaner.

Input:
id,phrase
1,“Bonjour”
2,‘Hola’

Output (all boxes on):
id,phrase
1,Bonjour
2,Hola
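
If the quotes should survive as straight ASCII quotes, folding has to happen before the strip. The TypeScript sketch below illustrates the idea under stated assumptions: `foldQuotes` is a hypothetical stand-in for csv-cleaner's smart-quote normalisation, and the keep-pattern is assumed to include the straight `'` and `"` characters.

```typescript
// Map curly double and single quotes to their straight ASCII forms.
const foldQuotes = (cell: string): string =>
  cell.replace(/[\u201C\u201D]/g, '"').replace(/[\u2018\u2019]/g, "'");

// ASSUMED keep-pattern that retains straight quotes but not curly ones.
const outsideKeepSet = new RegExp("[^\\p{L}\\p{Nd} .,;:!?'\"()\\-]", "gu");
const stripQuotedCell = (cell: string): string => cell.replace(outsideKeepSet, "");

const phrase = "\u201CBonjour\u201D"; // “Bonjour”

const deleted = stripQuotedCell(phrase); // quotes removed
const folded = stripQuotedCell(foldQuotes(phrase)); // quotes kept as ASCII
```

A cell that contains a straight double quote has to be quoted and escaped when the CSV is written, which CSV serialisers normally handle.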

Soft hyphen removed from a hyphenated translation

Example

Translation tools insert soft hyphens (U+00AD) as hyphenation hints. They are invisible but break exact matching. The stripper deletes them, rejoining the word.

Input (soft hyphen shown as ¬):
id,word
1,Zusammen¬arbeit

Output (all boxes on):
id,word
1,Zusammenarbeit

Errors and edge cases

Behaviours, limits, and silent failures you may hit with multilingual files. Find the entry that matches what you see, then apply the fix it describes.

Letters of every script are preserved

Supported

Because the Letters class is \p{L}, accented Latin, CJK, Arabic, Hebrew, Cyrillic, Greek, and other scripts' letters are all kept. This is the core reason to use this tool for multilingual data rather than an ASCII-only stripper.

Non-breaking space is deleted, joining words

Expected

Only the regular space (U+0020) is kept; the non-breaking space (U+00A0) is removed entirely, so New York written with an NBSP becomes NewYork. If NBSPs are real separators in your data, fold them to regular spaces first with csv-cleaner.

Curly quotes and guillemets are deleted, not folded

By design

“ ” ‘ ’ « » are removed because they are not in the kept punctuation set. If localised quotation marks carry meaning, normalise them with csv-cleaner instead of stripping.

There is no 'safe mode' toggle

By design

Older descriptions mention a 'safe mode'; the actual UI has four keep checkboxes (Letters, Digits, Spaces, Punctuation), all on by default. Keeping Letters on is the equivalent of preserving real characters — there is no separate safe-mode switch.

Encoding is not changed — output is UTF-8

By design

The tool always downloads UTF-8 and does not convert encodings. UTF-8 is the safest choice for multilingual data, but if your target needs UTF-16 or a legacy codepage, convert after download with a dedicated encoder.
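
For a target that needs UTF-16, the conversion after download could look like the sketch below. It is an illustration only, assuming the downloaded UTF-8 file has already been read into a string; `encodeUtf16le` is a hypothetical helper, and writing the little-endian variant with a BOM is a choice made for the example.

```typescript
// Encode a string as UTF-16LE bytes with a leading byte order mark.
// JavaScript strings are already sequences of UTF-16 code units, so each
// unit is written out as two bytes, low byte first.
function encodeUtf16le(text: string): Uint8Array {
  const bytes = new Uint8Array(2 + text.length * 2);
  bytes[0] = 0xff; // BOM U+FEFF, little-endian
  bytes[1] = 0xfe;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    bytes[2 + i * 2] = unit & 0xff;
    bytes[3 + i * 2] = unit >> 8;
  }
  return bytes;
}

const encoded = encodeUtf16le("é田");
```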

Combining marks could be affected if separated

Edge

Precomposed accented letters (NFC, e.g. a single é) are single letters and are kept. Decomposed forms (NFD: base letter + combining accent) keep the base letter and the combining mark, although the mark falls in the separate mark category (\p{M}) rather than \p{L}, so verify that rare decomposed data renders correctly after stripping.
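
One way to sidestep the question is to compose the text to NFC before stripping, so each accented letter is a single `\p{L}` code point. The TypeScript sketch below uses the standard `String.prototype.normalize`; it shows the Unicode behaviour only and makes no claim about how the tool treats a standalone mark.

```typescript
const decomposed = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT
const composed = decomposed.normalize("NFC"); // single code point U+00E9

const singleLetter = new RegExp("^\\p{L}$", "u");
const singleMark = new RegExp("^\\p{M}$", "u");

// The combining accent on its own is a mark, not a letter.
const accentIsLetter = singleLetter.test("\u0301");
const accentIsMark = singleMark.test("\u0301");

// After NFC composition the whole character is one letter.
const composedIsLetter = singleLetter.test(composed);
```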

Header row is left as-is

Preserved

Localised or English column names in the first row are never modified, including any invisibles in them. If a header carries a BOM or zero-width character, clean it separately with csv-find-replace.
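
What such a header clean-up amounts to can be sketched in TypeScript as follows. This is not csv-find-replace itself: `cleanHeader` is a hypothetical helper, and the list of invisibles it removes (BOM, zero-width space, zero-width non-joiner and joiner) is an assumption for the example.

```typescript
// BOM plus the zero-width space, non-joiner, and joiner.
const headerInvisibles = /[\uFEFF\u200B\u200C\u200D]/g;

// Remove invisible characters from each column name in the header row.
function cleanHeader(headerRow: string[]): string[] {
  return headerRow.map((column) => column.replace(headerInvisibles, ""));
}

// A BOM glued to the first column name and a zero-width space inside another.
const fixedHeader = cleanHeader(["\uFEFFid", "na\u200Bme", "ville"]);
```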

Digits/Punctuation off would damage numbers and dates

Data loss

International data still includes IDs, prices, and dates. Turning Digits or Punctuation off removes those characters globally (19,99 → 1999 or worse). Keep both on for normalisation; use csv-find-replace for targeted numeric edits.
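
The damage is easy to reproduce. In this TypeScript sketch the two patterns are assumed stand-ins for the tool's keep-pattern with Punctuation off and with Digits off; the punctuation list is illustrative.

```typescript
// ASSUMED pattern with Punctuation unchecked: letters, digits, spaces only.
const punctuationOff = new RegExp("[^\\p{L}\\p{Nd} ]", "gu");
// ASSUMED pattern with Digits unchecked: letters, spaces, punctuation only.
const digitsOff = new RegExp("[^\\p{L} .,;:!?'\"()\\-]", "gu");

const price = "19,99";
const isoDate = "2024-05-01";

const priceNoPunct = price.replace(punctuationOff, ""); // decimal comma lost
const dateNoPunct = isoDate.replace(punctuationOff, ""); // separators lost
const priceNoDigits = price.replace(digitsOff, ""); // only the comma is left
```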

File over the tier limit is blocked

Blocked

Free is 2 MB / 500 rows; Pro is 100 MB / 100,000 rows. Large multilingual catalogues often exceed the free tier; split them with csv-row-splitter or upgrade.
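
Splitting by row count can be pictured with the TypeScript sketch below. It is not csv-row-splitter's implementation: `splitRows` is a hypothetical helper that assumes the first row is the header and repeats it in every chunk so each piece stays importable on its own.

```typescript
// Split parsed rows into chunks of at most maxDataRows data rows each,
// repeating the header row at the top of every chunk.
function splitRows(rows: string[][], maxDataRows: number): string[][][] {
  const header = rows[0];
  const data = rows.slice(1);
  const chunks: string[][][] = [];
  for (let start = 0; start < data.length; start += maxDataRows) {
    chunks.push([header].concat(data.slice(start, start + maxDataRows)));
  }
  return chunks;
}
```

With a limit of 500, a catalogue of 1,201 data rows would yield three chunks of 500, 500, and 201 data rows.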

Right-to-left text is preserved but bidi marks may go

Edge

Arabic and Hebrew letters are kept, but explicit bidirectional control marks (LRM/RLM, U+200E/U+200F) are not letters and are removed. This usually has no visual effect but can change directionality in rare mixed-direction strings — verify RTL columns after stripping.
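
To know which rows to verify, the file can be scanned for these marks before stripping. The TypeScript sketch below is one possible pre-check; `rowsWithBidiMarks` is a hypothetical helper and it looks only for LRM and RLM, the two marks named above.

```typescript
const BIDI_MARKS = /[\u200E\u200F]/; // LRM and RLM

// Return the indexes of data rows (header at index 0 is skipped)
// that contain at least one explicit bidi mark.
function rowsWithBidiMarks(rows: string[][]): number[] {
  const hits: number[] = [];
  for (let i = 1; i < rows.length; i++) {
    if (rows[i].some((cell) => BIDI_MARKS.test(cell))) hits.push(i);
  }
  return hits;
}

const flaggedRows = rowsWithBidiMarks([
  ["id", "label"],
  ["1", "עברית"],
  ["2", "ABC" + "\u200F" + "123"],
]);
```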

Frequently asked questions

Will this remove é, ü, or ñ from names?

No. The Letters class uses \p{L}, which matches every Unicode letter including accented Latin. As long as Letters is checked (the default), accented characters are preserved.

What about Chinese, Japanese, Arabic, or Cyrillic text?

All preserved. \p{L} covers letters of every script — CJK, Arabic, Hebrew, Cyrillic, Greek, and more. Only emoji, symbols, and invisibles are removed.

Is there a 'safe mode'?

No. The interface has four keep checkboxes: Letters, Digits, Spaces, Punctuation, all on by default. Keeping Letters on is what preserves real characters — there is no separate safe-mode toggle.

Does it change the file encoding?

It always outputs UTF-8 and does not convert encodings. UTF-8 is the safest format for multilingual data. For another encoding, convert after download.

Why did 'New York' become 'NewYork'?

It was typed with a non-breaking space (U+00A0), which the tool removes because only the regular space is kept. Fold NBSPs to regular spaces first with csv-cleaner if you need to keep word separation.

Does it fold curly quotes to straight quotes?

No — it deletes them. Curly quotes and guillemets are removed. For folding rather than deletion, use csv-cleaner's smart-quote normalisation.

Does it remove zero-width spaces and soft hyphens?

Yes. Zero-width space (U+200B), soft hyphen (U+00AD), and BOM (U+FEFF) are all deleted because they are not letters, digits, spaces, or kept punctuation.

Can I limit the strip to certain columns?

No. The strip covers all data cells (the header row is excluded). Isolate columns with csv-column-filter first, or use csv-find-replace for targeted changes.

Will it transliterate accented or CJK text to ASCII?

No. Letters are kept as-is; there is no transliteration. é stays é, and CJK stays CJK. For transliteration you need a different tool.

Is my international customer data uploaded?

No. Processing is entirely in-browser via PapaParse. PII never leaves your machine.

What are the size and row limits?

Free: 2 MB and 500 data rows. Pro: 100 MB and 100,000 rows. Larger files are blocked at upload.

How do I dedup multilingual rows after cleaning?

Run csv-deduplicator on the stripped file. Removing zero-width and invisible characters first makes near-identical rows compare equal, so the dedup catches the duplicates it would otherwise miss.

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.

