Clean Scraped Product Data CSV — Strip Special Characters Online

How to strip special characters from a scraped product data CSV

Step 1
Export the scraped product CSV — Save the output from your scraping run as .csv. The tool also accepts .xlsx, .xls, and .ods — it converts the first sheet to rows before stripping, and downloads back to the same format.
Step 2
Drop the file onto the stripper — The free tier accepts files up to 2 MB / 500 data rows; Pro raises this to 100 MB / 100,000 rows. PapaParse auto-detects the delimiter (comma, semicolon, or tab) from the first rows.
Step 3
Choose which character classes to keep — Four checkboxes, all on by default: Letters (incl. accents), Digits (0–9), Spaces, Punctuation (.,!?@-_ etc.). Leave all four on for a standard product clean. Note there is no per-column selection — the strip applies to every column except the header row.
Step 4
Run Strip special chars — Every character not in your keep-set is deleted from each cell. Emoji, currency symbols, ™/®, and zero-width characters disappear; letters, digits, spaces, and common punctuation remain.
Step 5
Review the preview and counts — The result panel shows cells modified, data-row count, and a preview of the first 10 rows. Scan product titles to confirm nothing legitimate was lost (e.g. a + in 2+ pack is removed — see the edge cases).
Step 6
Download and import into your PIM — Download writes <name>.stripped.csv (UTF-8) — or .stripped.xlsx if you uploaded a spreadsheet. Map columns and ingest into your PIM or product database.
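
The six steps above can be approximated offline. The sketch below is a minimal Python stand-in, not the tool's own code: `strip_csv`, `keep`, and the delimiter guess (the most frequent candidate in the first line) are hypothetical simplifications, and Unicode categories starting with `L` are assumed to play the role of `\p{L}`.

```python
import csv
import io
import unicodedata

# Punctuation kept by default, as listed in the keep-list table.
KEPT_PUNCTUATION = set(".,-_@/()!?:;'\"")


def keep(ch: str) -> bool:
    """Default keep-set: Unicode letters, ASCII digits, U+0020, listed punctuation."""
    return (
        unicodedata.category(ch).startswith("L")
        or ch in "0123456789"
        or ch == " "
        or ch in KEPT_PUNCTUATION
    )


def strip_csv(text: str) -> tuple[str, int]:
    """Return (stripped CSV text, number of data cells modified)."""
    first_line = text.splitlines()[0]
    # Assumption: a crude stand-in for delimiter auto-detection.
    delimiter = max(",;\t", key=first_line.count)
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]  # the header row is never modified

    modified = 0
    cleaned = []
    for row in data:
        new_row = ["".join(c for c in cell if keep(c)) for cell in row]
        modified += sum(old != new for old, new in zip(row, new_row))
        cleaned.append(new_row)

    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cleaned)
    return out.getvalue(), modified


raw = "sku;title\nEB-100;Wireless Earbuds 🎧 Acme™\nEB-200;Smart Watch ⌚ Pro®\n"
stripped, modified_cells = strip_csv(raw)
print(stripped)
print("cells modified:", modified_cells)
```

The modified-cell counter mirrors the count shown in the result panel, which makes it easy to compare a local run against the tool's preview.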

The four keep-list options (what survives vs. what is removed)

The tool is a whitelist: a character is kept only if it matches at least one enabled class. All four checkboxes are ON by default. Behaviour verified against the tool's buildPattern logic.

Checkbox	Characters kept when ON	Examples removed if OFF	Default
Letters (incl. accents)	`\p{L}` — every Unicode letter: ASCII `a-z`/`A-Z`, accented Latin (`é ü ñ ç`), and all scripts (CJK, Cyrillic, Arabic, Greek)	All alphabetic text — turning OFF leaves only digits/spaces/punctuation, rarely wanted for product data	On
Digits (0–9)	ASCII digits `0` through `9`	`SKU-12345` becomes `SKU-`; sizes, quantities, model numbers lose their digits — keep ON for product data	On
Spaces	The regular space character (U+0020) only	`Wireless Earbuds` becomes `WirelessEarbuds` — word boundaries collapse	On
Punctuation (.,!?@-_ etc.)	Exactly this set: `. , - _ @ / ( ) ! ? : ; ' "`	`Earbuds, Black` becomes `Earbuds Black`; hyphenated SKUs lose the hyphen	On
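
The whitelist logic in the table can be sketched as a small predicate. This is an illustration under stated assumptions, not the tool's `buildPattern` source: `KeepSet` and `strip_cell` are hypothetical names, and the "no class enabled" branch reproduces the fallback described under the edge cases.

```python
import unicodedata
from dataclasses import dataclass

PUNCTUATION = ".,-_@/()!?:;'\""  # the exact kept set from the table


@dataclass(frozen=True)
class KeepSet:
    """One flag per checkbox; all on by default."""
    letters: bool = True
    digits: bool = True
    spaces: bool = True
    punctuation: bool = True

    def keeps(self, ch: str) -> bool:
        if not (self.letters or self.digits or self.spaces or self.punctuation):
            # No class enabled: every character matches, so nothing is removed.
            return True
        return bool(
            (self.letters and unicodedata.category(ch).startswith("L"))
            or (self.digits and ch in "0123456789")
            or (self.spaces and ch == " ")
            or (self.punctuation and ch in PUNCTUATION)
        )


def strip_cell(cell: str, keep: KeepSet = KeepSet()) -> str:
    """Delete every character that no enabled class keeps."""
    return "".join(ch for ch in cell if keep.keeps(ch))


print(strip_cell("Acme™ Pro"))                                # all boxes on
print(strip_cell("SKU-12345", KeepSet(digits=False)))         # Digits off
print(strip_cell("Earbuds, Black", KeepSet(punctuation=False)))
```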

Common scraped-data characters and their fate (all boxes on)

With the default keep-set, these are the outcomes for characters that show up in storefront-scraped CSVs. Verified by running the keep-pattern character by character.

Character	Where it comes from	Kept or removed	Result on a sample cell
Emoji `🎧` `😀`	Marketing-styled product titles	Removed	`Earbuds 🎧` → `Earbuds` (trailing space remains)
`™` `®` `©`	Brand names in titles	Removed	`Acme™ Pro` → `Acme Pro`
`€` `£` `$` `%`	Prices scraped into title/description cells	Removed	`50% off €19` → `50 off 19`
`&` `#` `*` `+` `=`	Spec sheets, bullet markers	Removed	`2+ pack` → `2 pack`; `A&B` → `AB`
`[` `]` `{` `}`	Template residue, JSON-in-cell	Removed	`size [M]` → `size M`
Zero-width space U+200B	Anti-scrape injection	Removed (silently)	Rejoins split words: `Ear·buds` (U+200B shown as ·) → `Earbuds`
Curly quotes `“ ” ‘ ’`	Copied from rendered HTML	Removed (not converted)	`“Pro”` → `Pro`; `it’s` → `its`
Accented letters `é ü ñ`	International product names	Kept	`Café Crème` → `Café Crème`
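
Before running a strip, it can help to list which characters would be removed and how often. The following sketch assumes the default keep-set from the table above; `is_kept` and `removed_report` are hypothetical helper names, and the sample cells are made up for illustration.

```python
import unicodedata
from collections import Counter

PUNCTUATION = ".,-_@/()!?:;'\""


def is_kept(ch: str) -> bool:
    """Default keep-set: letters, ASCII digits, regular space, listed punctuation."""
    return (
        unicodedata.category(ch).startswith("L")
        or ch in "0123456789"
        or ch == " "
        or ch in PUNCTUATION
    )


def removed_report(cells: list[str]) -> list[tuple[str, str, int]]:
    """Return (code point, Unicode name, count) for every character that would be removed."""
    counts = Counter(ch for cell in cells for ch in cell if not is_kept(ch))
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), n)
        for ch, n in counts.most_common()
    ]


sample = ["50% off €19", "Acme™ Pro", "AB\u200b123", "2+ pack, 50%"]
report = removed_report(sample)
for code, name, n in report:
    print(code, name, n)
```

Listing code points and names makes invisible characters such as U+200B show up in the report even though they cannot be seen in the cell.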

Cookbook

Before/after rows from real scraped product feeds. Notice that this tool deletes — it does not transliterate or substitute — so plan the keep-set with that in mind.

Emoji and trademark symbols in marketing titles

Example

Storefront teams add emoji and ™ to titles for visual punch. Those characters can break PIM validators and storefront search. The default keep-set removes them, leaving plain readable text. Watch for the double space left where the emoji used to be.

Input:
sku,title
EB-100,Wireless Earbuds 🎧 Acme™
EB-200,Smart Watch ⌚ Pro®

Output (all boxes on):
sku,title
EB-100,Wireless Earbuds  Acme
EB-200,Smart Watch  Pro

Note the double space where the emoji sat — run
/tool/csv-whitespace-trimmer afterwards to collapse it.

Prices and percent signs scraped into the wrong column

Example

Scrapers often grab a price string into a description cell. The stripper removes €, %, and $ but keeps the digits — which can leave a misleading bare number. Decide whether you actually want digits kept here, or strip this column with csv-find-replace targeting the symbol only.

Input:
sku,blurb
P1,Now 50% off — only €19.99!

Output (all boxes on):
sku,blurb
P1,Now 50 off  only 19.99!

The em dash and € are gone; '50' and '19.99' remain.
If you only want the symbol removed, use /tool/csv-find-replace.

Zero-width space breaking a SKU join

Example

Anti-scrape scripts inject U+200B into product codes so copied SKUs silently differ from the catalogue. The character is invisible in most editors and fatal for exact-match joins. The stripper removes it because it is not a letter, digit, space, or kept punctuation.

Input (U+200B shown as ·):
sku,name
AB·123,Bluetooth Speaker
AB123,Bluetooth Speaker

These look identical but won't join.

Output (all boxes on):
sku,name
AB123,Bluetooth Speaker
AB123,Bluetooth Speaker

Now both SKUs match the catalogue value AB123.
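
To confirm that a failed join is caused by invisible characters, a quick check like the one below can flag them. It is a sketch that relies on Unicode's format category (`Cf`), which covers U+200B and U+FEFF; `invisible_chars` is a hypothetical helper name.

```python
import unicodedata


def invisible_chars(value: str) -> list[str]:
    """Code points of format characters (Unicode category Cf) found in value."""
    return [f"U+{ord(ch):04X}" for ch in value if unicodedata.category(ch) == "Cf"]


skus = ["AB\u200b123", "AB123", "\ufeffAB123"]
for sku in skus:
    print(repr(sku), invisible_chars(sku))
```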

Keeping non-Latin product names while removing symbols

Example

A scrape of a multi-region store mixes CJK and Latin titles plus emoji. Because the keep-list uses \p{L}, every script's letters survive; only the emoji and symbols go. This is the key reason to use this tool over an ASCII-only stripper.

Input:
sku,title
JP-1,ワイヤレスイヤホン 🎧
FR-1,Écouteurs sans fil ®

Output (all boxes on):
sku,title
JP-1,ワイヤレスイヤホン 
FR-1,Écouteurs sans fil

Both output titles end in a trailing space where the symbol sat; run
/tool/csv-whitespace-trimmer afterwards to remove it.

Turning Digits off destroys model numbers — don't

Example

A common mistake: unticking Digits hoping to drop stray numbers. It removes ALL digits everywhere, gutting SKUs, sizes, and model numbers. Shown as a cautionary before/after.

Input:
sku,model
MW-300,XR-2000 v3

With Digits UNCHECKED:
sku,model
MW-,XR- v

Keep Digits ON for product data. To remove only specific
numeric noise, use /tool/csv-find-replace instead.

Errors and edge cases

Expected behaviours and silent failures you may run into with this tool. Find the entry that matches your symptom and apply the fix it describes.

Emoji removal leaves a double space

Expected

An emoji surrounded by spaces (Earbuds 🎧 Pro) is deleted but the spaces on either side stay, so you get Earbuds  Pro, with two spaces between the words. The stripper does not collapse whitespace. Run csv-whitespace-trimmer afterwards to tidy the gaps.
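
If you post-process the download yourself, the tidy-up pass might look like this. It is a sketch of one reasonable behaviour (collapse runs of regular spaces, then trim the ends); whether csv-whitespace-trimmer does exactly the same is an assumption, and `collapse_spaces` is a hypothetical name.

```python
import re


def collapse_spaces(cell: str) -> str:
    """Collapse runs of U+0020 to a single space and trim both ends."""
    return re.sub(r" {2,}", " ", cell).strip(" ")


print(collapse_spaces("Wireless Earbuds  Acme"))
print(repr(collapse_spaces("ワイヤレスイヤホン ")))
```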

Curly quotes are deleted, not converted to ASCII quotes

By design

“Pro” becomes Pro and it’s becomes its — the smart quotes are removed entirely because they are not in the kept punctuation set (which contains straight ' and " only). If you need curly quotes folded to straight ASCII instead of deleted, use csv-cleaner's smart-quote normalisation.
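
If you would rather fold curly quotes before stripping, a pre-pass along these lines keeps the apostrophe in words like it's. This is a minimal sketch with a hypothetical `fold_quotes` helper; it is not a description of how csv-cleaner implements its normalisation.

```python
# Map the four common curly quotes to their straight ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also used as an apostrophe)
})


def fold_quotes(cell: str) -> str:
    """Replace curly quotes with straight ones, which the default keep-set retains."""
    return cell.translate(CURLY_TO_STRAIGHT)


print(fold_quotes("\u201cPro\u201d"))
print(fold_quotes("it\u2019s"))
```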

Ligatures like `ﬁ` survive untouched

Preserved

A scraped title containing the typographic ligature ﬁ (U+FB01) keeps it, because \p{L} classifies ligatures as letters. This tool does NOT expand ﬁ to fi. If you need that, normalise the text upstream or use a Unicode NFKC step before exporting the CSV.
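
The NFKC step mentioned above can be sketched with Python's standard `unicodedata` module. Note that NFKC is broader than ligature expansion: it also rewrites other compatibility characters, for example ™ becomes the two letters TM, so treat this as a sketch to test on a sample first. `expand_ligatures` is a hypothetical name.

```python
import unicodedata


def expand_ligatures(cell: str) -> str:
    """Apply Unicode NFKC normalisation, which expands ligatures such as U+FB01."""
    return unicodedata.normalize("NFKC", cell)


print(expand_ligatures("\ufb01lter cartridge"))  # ligature fi followed by "lter"
print(expand_ligatures("Acme\u2122"))            # trade mark sign becomes "TM"
```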

Mojibake is mangled, not fixed

Not fixed

If a price scrape produced mojibake like â‚¬19 (a mis-decoded €), the stripper removes ‚ and ¬ but keeps the letter â, leaving â19. It cannot reconstruct the original character. Fix encoding at the source, or see the fix-encoding-artefacts guide.
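
When the mojibake comes from UTF-8 bytes that were decoded as Windows-1252, reversing that round trip before stripping often recovers the text. The sketch below assumes exactly that cause; other encoding mix-ups need a different pair of codecs, and some byte values have no Windows-1252 mapping, in which case the cell is returned unchanged. `repair_mojibake` is a hypothetical name.

```python
def repair_mojibake(cell: str) -> str:
    """Undo a UTF-8-read-as-cp1252 mis-decode; return the input unchanged on failure."""
    try:
        return cell.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return cell


print(repair_mojibake("\u00e2\u201a\u00ac19"))  # mis-decoded euro sign followed by 19
print(repair_mojibake("Caf\u00c3\u00a9"))       # mis-decoded e-acute
print(repair_mojibake("Café"))                  # already correct, left alone
```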

`&`, `+`, `#`, `%`, `=` are all removed

By design

These common spec-sheet characters are not in the kept punctuation set, so 2+ pack, A&B, 50% off, model #3 lose those symbols. If a symbol carries meaning in your data, either accept the loss or use csv-find-replace to handle it precisely instead.

Header row is left exactly as-is

Preserved

The first row is treated as a header and is never modified, even if it contains special characters. If your scrape produced a headerless CSV, the first data row is therefore treated as the header and left unstripped. Add a header row before uploading so that every data row is cleaned.
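
For a headerless scrape, one way to add a header before uploading is sketched below. The generated names (`col1`, `col2`, ...) and the `add_header` helper are placeholders chosen for illustration; a comma delimiter is assumed unless another is passed in.

```python
import csv
import io


def add_header(text: str, delimiter: str = ",") -> str:
    """Prepend a generated header row so the first data row is no longer skipped."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    width = max((len(row) for row in rows), default=0)
    header = [f"col{i}" for i in range(1, width + 1)]

    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


headerless = "EB-100,Wireless Earbuds 🎧\nEB-200,Smart Watch ⌚\n"
with_header = add_header(headerless)
print(with_header)
```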

Unticking all four boxes is a no-op

No change

If no class is enabled the keep-pattern falls back to /./, which matches every character — so nothing is removed. The result will show 0 cells modified. Re-enable at least one class for the strip to do anything.

File exceeds the free tier limit

Blocked

Free accounts cap at 2 MB and 500 data rows; a large scrape feed will be blocked at upload. Split the file with csv-row-splitter, trim to a sample with csv-row-limiter, or upgrade to Pro (100 MB / 100,000 rows).
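
Splitting a large feed into pieces that fit the 500-row free tier can also be done locally. The sketch below repeats the header in every chunk and only enforces the row count, not the 2 MB size cap; `split_rows` is a hypothetical helper and is not the csv-row-splitter tool.

```python
import csv
import io


def split_rows(text: str, max_rows: int = 500) -> list[str]:
    """Split CSV text into chunks of at most max_rows data rows, each with the header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]

    chunks = []
    for start in range(0, len(data), max_rows):
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(data[start:start + max_rows])
        chunks.append(out.getvalue())
    return chunks


feed = "sku,title\n" + "".join(f"S{i},Item {i}\n" for i in range(1200))
parts = split_rows(feed)
print([part.count("\n") - 1 for part in parts])  # data rows per chunk
```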

Non-breaking space joins two words

Expected

A non-breaking space (U+00A0) scraped from HTML is removed entirely because only the regular space (U+0020) is kept. Out of stock written with NBSPs becomes Outofstock. If your data uses NBSPs as real separators, replace them with regular spaces using csv-find-replace before stripping.
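
The NBSP replacement can be sketched as a single translation pass run before stripping. Only U+00A0 comes from the description above; including the narrow no-break space (U+202F) is an added assumption, since it is often scraped in the same places. `normalise_spaces` is a hypothetical name.

```python
# Map non-breaking space variants to the regular space that the stripper keeps.
SPACE_LIKE = str.maketrans({
    "\u00a0": " ",  # no-break space
    "\u202f": " ",  # narrow no-break space (assumption: also worth folding)
})


def normalise_spaces(cell: str) -> str:
    """Replace non-breaking spaces with U+0020 so word boundaries survive the strip."""
    return cell.translate(SPACE_LIKE)


print(normalise_spaces("Out\u00a0of\u00a0stock"))
print(normalise_spaces("1\u202f299"))
```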

Numbers that were stored with thousands separators lose them

Expected

A scraped price 1,299 keeps its comma, because the comma is kept punctuation. 1 299 keeps the separator only if it is a regular space (U+0020); a non-breaking or narrow no-break space is removed, giving 1299. 1’299 (Swiss separator typed as a curly apostrophe) also collapses to 1299, while a straight ' survives. Review numeric columns and standardise separators before any downstream parsing.

Frequently asked questions

Is this a blacklist or a whitelist stripper?

A whitelist. You choose which character classes to keep (Letters, Digits, Spaces, Punctuation) and every character outside that set is deleted. There is no list of 'bad' characters to remove — anything not explicitly kept is gone.

Will it remove accented characters like é or ü from product names?

No. The Letters option uses \p{L}, which matches every Unicode letter including accented Latin and all other scripts. As long as Letters is checked (the default), Café and Müller survive intact.

Does it convert curly quotes to straight quotes?

No — it deletes them. Curly quotes are not in the kept punctuation set, so “Pro” becomes Pro. For conversion (folding curly to straight) rather than deletion, use the csv-cleaner tool's smart-quote normalisation.

Can I clean only specific columns?

Not in this tool's interface — the strip applies to every column except the header row. If you need column-scoped edits, isolate columns first with csv-column-filter, or use csv-find-replace for targeted changes.

What happens to emoji in titles?

Emoji are removed because they are not letters, digits, spaces, or kept punctuation. Any surrounding spaces remain, so you may end up with a double space where the emoji was — run the whitespace trimmer afterwards.

Does it remove zero-width spaces and BOM bytes?

Yes. Zero-width space (U+200B), the BOM (U+FEFF), and similar invisibles are not in any keep class, so they are deleted. This is the main fix for SKUs that won't match your catalogue.

What file types can I upload?

CSV plus XLSX, XLS, and ODS spreadsheets. Spreadsheets are converted to rows for stripping and downloaded back in the same format; CSV downloads as UTF-8 with a .stripped.csv suffix.

Is the file uploaded anywhere?

No. Parsing and stripping run entirely in your browser via PapaParse. Scraped pricing and product data never leave your machine.

What are the size and row limits?

Free: 2 MB and 500 data rows. Pro: 100 MB and 100,000 rows. Files over the limit are blocked at upload — split or sample them first.

Why did a hyphen in my SKU disappear?

It shouldn't, if Punctuation is checked — the hyphen - is in the kept set. If you unticked Punctuation, hyphens, periods, commas, and slashes are all removed. Re-enable Punctuation for SKU-style data.

Will it fix garbled mojibake like Ã© or â‚¬?

No. It may remove some bytes of a mojibake sequence but it cannot reconstruct the original character, and it often leaves a partial mess. Fix encoding at the export step or with csv-cleaner, then strip.

How do I get rid of the double spaces left behind?

Chain the csv-whitespace-trimmer tool after stripping. This stripper deletes characters but never collapses adjacent spaces, so a separate trim pass is the clean way to tidy the output.

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.

