How to strip special characters from a scraped product data CSV
- Step 1: Export the scraped product CSV. Save the output from your scraping run as .csv. The tool also accepts .xlsx, .xls, and .ods: it converts the first sheet to rows before stripping, and downloads back to the same format.
- Step 2: Drop the file onto the stripper. The free tier accepts files up to 2 MB / 500 data rows; Pro raises this to 100 MB / 100,000 rows. PapaParse auto-detects the delimiter (comma, semicolon, or tab) from the first rows.
- Step 3: Choose which character classes to keep. Four checkboxes, all on by default: Letters (incl. accents), Digits (0–9), Spaces, Punctuation (.,!?@-_ etc.). Leave all four on for a standard product clean. Note there is no per-column selection: the strip applies to every column except the header row.
- Step 4: Run Strip special chars. Every character not in your keep-set is deleted from each cell. Emoji, currency symbols, ™/®, and zero-width characters disappear; letters, digits, spaces, and common punctuation remain.
- Step 5: Review the preview and counts. The result panel shows cells modified, data-row count, and a preview of the first 10 rows. Scan product titles to confirm nothing legitimate was lost (e.g. the + in 2+ pack is removed; see the edge cases).
- Step 6: Download and import into your PIM. Download writes <name>.stripped.csv (UTF-8), or .stripped.xlsx if you uploaded a spreadsheet. Map columns and ingest into your PIM or product database.
The four keep-list options (what survives vs. what is removed)
The tool is a whitelist: a character is kept only if it matches at least one enabled class. All four checkboxes are ON by default. Behaviour verified against the tool's buildPattern logic.
| Checkbox | Characters kept when ON | Examples removed if OFF | Default |
|---|---|---|---|
| Letters (incl. accents) | \p{L} — every Unicode letter: ASCII a-z/A-Z, accented Latin (é ü ñ ç), and all scripts (CJK, Cyrillic, Arabic, Greek) | All alphabetic text — turning OFF leaves only digits/spaces/punctuation, rarely wanted for product data | On |
| Digits (0–9) | ASCII digits 0 through 9 | SKU-12345 becomes SKU-; sizes, quantities, model numbers lose their digits — keep ON for product data | On |
| Spaces | The regular space character (U+0020) only | Wireless Earbuds becomes WirelessEarbuds — word boundaries collapse | On |
| Punctuation (.,!?@-_ etc.) | Exactly this set: . , - _ @ / ( ) ! ? : ; ' " | Earbuds, Black becomes Earbuds Black; hyphenated SKUs lose the hyphen | On |
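The whitelist in the table can be sketched in Python roughly as follows. This is an illustrative approximation, not the tool's actual buildPattern code: the names `keeps` and `strip_cell` are hypothetical, and `\p{L}` is approximated with Unicode general categories starting with `L`, since Python's built-in `re` module has no `\p{L}`. The all-off branch mirrors the fallback described under the edge cases.

```python
import unicodedata

# Assumed to match the Punctuation row of the table exactly.
KEPT_PUNCTUATION = set(".,-_@/()!?:;'\"")


def keeps(ch, letters=True, digits=True, spaces=True, punctuation=True):
    """Return True if a single character survives the strip."""
    if not (letters or digits or spaces or punctuation):
        return True  # no class enabled: everything matches, nothing is removed
    if letters and unicodedata.category(ch).startswith("L"):
        return True  # any Unicode letter, all scripts
    if digits and ch in "0123456789":
        return True  # ASCII digits only
    if spaces and ch == " ":
        return True  # regular space U+0020 only
    if punctuation and ch in KEPT_PUNCTUATION:
        return True
    return False


def strip_cell(cell, **options):
    """Delete every character that no enabled class keeps."""
    return "".join(ch for ch in cell if keeps(ch, **options))


print(strip_cell("Acme\u2122 Pro"))             # Acme Pro
print(strip_cell("SKU-12345", digits=False))    # SKU-
```

Passing `punctuation=False` or `spaces=False` reproduces the "Examples removed if OFF" column in the same way.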
Common scraped-data characters and their fate (all boxes on)
With the default keep-set, these are the outcomes for characters that show up in storefront-scraped CSVs. Verified by running the keep-pattern character by character.
| Character | Where it comes from | Kept or removed | Result on a sample cell |
|---|---|---|---|
| Emoji 🎧 😀 | Marketing-styled product titles | Removed | Earbuds 🎧 → Earbuds (trailing space remains) |
| ™ ® © | Brand names in titles | Removed | Acme™ Pro → Acme Pro |
| € £ $ % | Prices scraped into title/description cells | Removed | 50% off €19 → 50 off 19 |
| & # * + = | Spec sheets, bullet markers | Removed | 2+ pack → 2 pack; A&B → AB |
| [ ] { } | Template residue, JSON-in-cell | Removed | size [M] → size M |
| Zero-width space U+200B | Anti-scrape injection | Removed (silently) | Joins split words: Ear·buds → Earbuds (U+200B shown as ·) |
| Curly quotes “ ” ‘ ’ | Copied from rendered HTML | Removed (not converted) | “Pro” → Pro; it’s → its |
| Accented letters é ü ñ | International product names | Kept | Café Crème → Café Crème |
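To see why each character in the table lands where it does, a small check like the one below can print its Unicode general category next to the verdict. It is a sketch under the same assumptions as the table (all four boxes on); `fate` is a hypothetical helper name, and the letter test uses general categories beginning with `L` as a stand-in for `\p{L}`.

```python
import unicodedata

KEPT_PUNCTUATION = set(".,-_@/()!?:;'\"")  # assumed punctuation keep-set


def fate(ch):
    """Return ('kept' | 'removed', general category) for one character."""
    category = unicodedata.category(ch)
    kept = (
        category.startswith("L")      # letters of any script
        or ch in "0123456789"         # ASCII digits
        or ch == " "                  # regular space only
        or ch in KEPT_PUNCTUATION
    )
    return ("kept" if kept else "removed", category)


# Trademark sign, euro sign, zero-width space, left curly quote, e-acute.
for sample in ["\u2122", "\u20ac", "\u200b", "\u201c", "\u00e9"]:
    verdict, category = fate(sample)
    print(f"U+{ord(sample):04X} {category} {verdict}")
```

The zero-width space reports category `Cf` (format character), which is why it falls outside every keep class even though nothing is visible on screen.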
Cookbook
Before/after rows from real scraped product feeds. Notice that this tool deletes — it does not transliterate or substitute — so plan the keep-set with that in mind.
Emoji and trademark symbols in marketing titles
Example: Storefront teams add emoji and ™ to titles for visual punch. Those characters break PIM validators and storefront search. The default keep-set removes them, leaving plain readable text. Watch the double space left where the emoji used to be.
Input:
    sku,title
    EB-100,Wireless Earbuds 🎧 Acme™
    EB-200,Smart Watch ⌚ Pro®
Output (all boxes on):
    sku,title
    EB-100,Wireless Earbuds  Acme
    EB-200,Smart Watch  Pro
Note the double space where the emoji sat; run /tool/csv-whitespace-trimmer afterwards to collapse it.
Prices and percent signs scraped into the wrong column
Example: Scrapers often grab a price string into a description cell. The stripper removes €, %, and $ but keeps the digits, which can leave a misleading bare number. Decide whether you actually want digits kept here, or clean this column with csv-find-replace targeting the symbol only.
Input:
    sku,blurb
    P1,Now 50% off — only €19.99!
Output (all boxes on):
    sku,blurb
    P1,Now 50 off  only 19.99!
The %, the em dash, and the € are gone; '50' and '19.99' remain. If you only want the symbol removed, use /tool/csv-find-replace.
Zero-width space breaking a SKU join
Example: Anti-scrape scripts inject U+200B into product codes so copied SKUs silently differ from the catalogue. The character is invisible in most editors and fatal for exact-match joins. The stripper removes it because it is not a letter, digit, space, or kept punctuation.
Input (U+200B shown as ·):
    sku,name
    AB·123,Bluetooth Speaker
    AB123,Bluetooth Speaker
These look identical but won't join.
Output (all boxes on):
    sku,name
    AB123,Bluetooth Speaker
    AB123,Bluetooth Speaker
Now both SKUs match the catalogue value AB123.
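Before stripping, it can help to confirm that an invisible character really is the reason a SKU fails to join. The sketch below lists format characters (Unicode category `Cf`, which includes U+200B and U+FEFF) with their positions; `invisible_chars` is a hypothetical helper, not part of the tool.

```python
import unicodedata


def invisible_chars(value):
    """List (index, code point) for every format character in a string."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(value)
        if unicodedata.category(ch) == "Cf"
    ]


scraped_sku = "AB\u200b123"    # looks like AB123 on screen
catalogue_sku = "AB123"

print(scraped_sku == catalogue_sku)      # False
print(invisible_chars(scraped_sku))      # [(2, 'U+200B')]
print(invisible_chars(catalogue_sku))    # []
```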
Keeping non-Latin product names while removing symbols
Example: A scrape of a multi-region store mixes CJK and Latin titles plus emoji. Because the keep-list uses \p{L}, every script's letters survive; only the emoji and symbols go. This is the key reason to use this tool over an ASCII-only stripper.
Input:
    sku,title
    JP-1,ワイヤレスイヤホン 🎧
    FR-1,Écouteurs sans fil ®
Output (all boxes on):
    sku,title
    JP-1,ワイヤレスイヤホン
    FR-1,Écouteurs sans fil
Turning Digits off destroys model numbers — don't
Example: A common mistake is unticking Digits in the hope of dropping stray numbers. It removes ALL digits everywhere, gutting SKUs, sizes, and model numbers. Shown as a cautionary before/after.
Input:
    sku,model
    MW-300,XR-2000 v3
With Digits UNCHECKED:
    sku,model
    MW-,XR- v
Keep Digits ON for product data. To remove only specific numeric noise, use /tool/csv-find-replace instead.
Errors and edge cases
Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.
Emoji removal leaves a double space
Expected: An emoji surrounded by spaces (Earbuds 🎧 Pro) is deleted but the spaces stay, so you get Earbuds  Pro with two spaces. The stripper does not collapse whitespace. Run csv-whitespace-trimmer afterwards to tidy the gaps.
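If you post-process the stripped file in a script rather than with the trimmer tool, a whitespace pass can look like the sketch below. It shows one plausible behaviour (collapse runs of regular spaces, then trim the ends); how csv-whitespace-trimmer itself behaves is not documented here, so treat `collapse_spaces` as a hypothetical stand-in.

```python
import re


def collapse_spaces(cell):
    """Collapse runs of regular spaces and trim leading/trailing ones."""
    return re.sub(r" {2,}", " ", cell).strip(" ")


print(collapse_spaces("Earbuds  Pro"))   # Earbuds Pro
print(collapse_spaces("Earbuds "))       # Earbuds
```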
Curly quotes are deleted, not converted to ASCII quotes
By design: “Pro” becomes Pro and it’s becomes its. The smart quotes are removed entirely because they are not in the kept punctuation set (which contains straight ' and " only). If you need curly quotes folded to straight ASCII instead of deleted, use csv-cleaner's smart-quote normalisation.
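For scripted pipelines, folding curly quotes to straight ones before the strip keeps the apostrophe in words like it's. The mapping below is a minimal sketch covering only the four quote characters named above; `fold_quotes` is a hypothetical helper and is not a description of how csv-cleaner implements its normalisation.

```python
# Map the four common curly quotes to their straight ASCII equivalents.
CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
})


def fold_quotes(cell):
    """Replace curly quotes with straight quotes, leaving everything else."""
    return cell.translate(CURLY_TO_STRAIGHT)


print(fold_quotes("\u201cPro\u201d"))   # "Pro"
print(fold_quotes("it\u2019s"))         # it's
```

Because straight ' and " are in the kept punctuation set, the folded text then passes through the stripper unchanged.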
Ligatures like `ﬁ` survive untouched
Preserved: A scraped title containing the typographic ligature ﬁ (U+FB01) keeps it, because \p{L} classifies ligatures as letters. This tool does NOT expand ﬁ to fi. If you need that, normalise the text upstream or use a Unicode NFKC step before exporting the CSV.
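An upstream NFKC step can be sketched with Python's standard `unicodedata` module. The helper name `expand_compat` is hypothetical. Note that NFKC rewrites every compatibility character, not just ligatures, so check that the side effects suit your data.

```python
import unicodedata


def expand_compat(cell):
    """Apply Unicode NFKC normalisation, which expands ligatures."""
    return unicodedata.normalize("NFKC", cell)


print(expand_compat("Pro\ufb01le"))   # Profile (U+FB01 becomes f + i)
print(expand_compat("Acme\u2122"))    # AcmeTM  (trademark sign becomes the letters TM)
```

The second line shows the main trade-off: after NFKC the trademark sign is two ordinary letters, so the stripper will keep them instead of removing the symbol.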
Mojibake is mangled, not fixed
Not fixed: If a price scrape produced mojibake like â‚¬19 (a mis-decoded €), the stripper removes ‚ and ¬ but keeps the letter â, leaving â19. It cannot reconstruct the original character. Fix encoding at the source, or see the fix-encoding-artefacts guide.
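When the mojibake comes from UTF-8 bytes that were decoded as Windows-1252 (an assumption; other encoding mix-ups need a different pair of codecs), reversing that round trip before stripping often recovers the original text. `repair_mojibake` below is a hypothetical helper that leaves a cell untouched whenever the round trip does not apply.

```python
def repair_mojibake(cell):
    """Undo 'UTF-8 read as Windows-1252' damage; return the input if not applicable."""
    try:
        return cell.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252, or the bytes are
        # not valid UTF-8, so it was probably not this kind of mojibake.
        return cell


print(repair_mojibake("\u00e2\u201a\u00ac19"))   # €19
print(repair_mojibake("Caf\u00e9"))              # Café (already correct, unchanged)
```

Run a repair like this before the strip, not after: once ‚ and ¬ have been deleted, the byte sequence is incomplete and can no longer be decoded.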
`&`, `+`, `#`, `%`, `=` are all removed
By design: These common spec-sheet characters are not in the kept punctuation set, so 2+ pack, A&B, 50% off, and model #3 lose those symbols. If a symbol carries meaning in your data, either accept the loss or use csv-find-replace to handle it precisely instead.
Header row is left exactly as-is
Preserved: The first row is treated as a header and is never modified, even if it contains special characters. If your scrape produced a headerless CSV, the first data row is treated as the header and left unstripped. Add a header row before uploading so that every data row is cleaned.
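The header rule can be illustrated with a short script that strips every row except the first and counts modified cells. This is a sketch of the described behaviour using Python's `csv` module, not the tool's PapaParse-based code; `strip_cell` and `strip_csv` are hypothetical names and the keep-set is the assumed default (all boxes on).

```python
import csv
import io
import unicodedata

KEPT_PUNCTUATION = set(".,-_@/()!?:;'\"")  # assumed punctuation keep-set


def strip_cell(cell):
    """Keep letters, ASCII digits, regular spaces, and the kept punctuation."""
    return "".join(
        ch for ch in cell
        if unicodedata.category(ch).startswith("L")
        or ch in "0123456789" or ch == " " or ch in KEPT_PUNCTUATION
    )


def strip_csv(text):
    """Strip all rows except the first; return (csv_text, cells_modified)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    modified = 0
    for index, row in enumerate(csv.reader(io.StringIO(text))):
        cleaned = row if index == 0 else [strip_cell(cell) for cell in row]
        modified += sum(before != after for before, after in zip(row, cleaned))
        writer.writerow(cleaned)
    return out.getvalue(), modified


# The header keeps its trademark sign; the data row loses it.
print(strip_csv("sku,title\u2122\nEB-100,Acme\u2122 Pro\n"))
# A headerless file: the only row is treated as a header and left alone.
print(strip_csv("P1,50% off\n"))
```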
Unticking all four boxes is a no-op
No change: If no class is enabled the keep-pattern falls back to /./, which matches every character, so nothing is removed. The result will show 0 cells modified. Re-enable at least one class for the strip to do anything.
File exceeds the free tier limit
Blocked: Free accounts cap at 2 MB and 500 data rows; a large scrape feed will be blocked at upload. Split the file with csv-row-splitter, trim to a sample with csv-row-limiter, or upgrade to Pro (100 MB / 100,000 rows).
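If you would rather split the feed in a script before uploading, the idea is to repeat the header at the top of each chunk and cap the data rows at the free-tier limit. The sketch below works on rows already parsed into lists; `chunk_rows` is a hypothetical helper and does not check the 2 MB size cap.

```python
def chunk_rows(rows, size=500):
    """Split [header, *data] into chunks of at most `size` data rows, each with the header."""
    header, data = rows[0], rows[1:]
    return [[header] + data[start:start + size] for start in range(0, len(data), size)]


rows = [["sku", "title"]] + [[f"SKU-{n}", "Item"] for n in range(1200)]
chunks = chunk_rows(rows)
print([len(chunk) - 1 for chunk in chunks])   # [500, 500, 200] data rows per chunk
```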
Non-breaking space joins two words
Expected: A non-breaking space (U+00A0) scraped from HTML is removed entirely because only the regular space (U+0020) is kept. Out of stock written with NBSPs becomes Outofstock. If your data uses NBSPs as real separators, replace them with regular spaces using csv-find-replace before stripping.
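In a scripted pre-pass the same replacement is a single call. The sketch below also returns how many NBSPs were found, which is handy for deciding whether a column needs the fix at all; `nbsp_to_space` is a hypothetical helper and handles U+00A0 only.

```python
NBSP = "\u00a0"  # non-breaking space


def nbsp_to_space(cell):
    """Replace NBSPs with regular spaces; return (fixed_text, nbsp_count)."""
    return cell.replace(NBSP, " "), cell.count(NBSP)


print(nbsp_to_space("Out\u00a0of\u00a0stock"))   # ('Out of stock', 2)
```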
Numbers that were stored with thousands separators lose them
Expected: The outcome depends on the separator. A scraped price 1,299 keeps the comma (comma is kept punctuation), and 1 299 written with a regular space keeps the space, but 1’299 (Swiss separator written as a curly apostrophe) drops the apostrophe to give 1299. Review numeric columns and standardise separators before any downstream parsing.
Frequently asked questions
Is this a blacklist or a whitelist stripper?
A whitelist. You choose which character classes to keep (Letters, Digits, Spaces, Punctuation) and every character outside that set is deleted. There is no list of 'bad' characters to remove — anything not explicitly kept is gone.
Will it remove accented characters like é or ü from product names?
No. The Letters option uses \p{L}, which matches every Unicode letter including accented Latin and all other scripts. As long as Letters is checked (the default), Café and Müller survive intact.
Does it convert curly quotes to straight quotes?
No — it deletes them. Curly quotes are not in the kept punctuation set, so “Pro” becomes Pro. For conversion (folding curly to straight) rather than deletion, use the csv-cleaner tool's smart-quote normalisation.
Can I clean only specific columns?
Not in this tool's interface — the strip applies to every column except the header row. If you need column-scoped edits, isolate columns first with csv-column-filter, or use csv-find-replace for targeted changes.
What happens to emoji in titles?
Emoji are removed because they are not letters, digits, spaces, or kept punctuation. Any surrounding spaces remain, so you may end up with a double space where the emoji was — run the whitespace trimmer afterwards.
Does it remove zero-width spaces and BOM bytes?
Yes. Zero-width space (U+200B), the BOM (U+FEFF), and similar invisibles are not in any keep class, so they are deleted. This is the main fix for SKUs that won't match your catalogue.
What file types can I upload?
CSV plus XLSX, XLS, and ODS spreadsheets. Spreadsheets are converted to rows for stripping and downloaded back in the same format; CSV downloads as UTF-8 with a .stripped.csv suffix.
Is the file uploaded anywhere?
No. Parsing and stripping run entirely in your browser via PapaParse. Scraped pricing and product data never leave your machine.
What are the size and row limits?
Free: 2 MB and 500 data rows. Pro: 100 MB and 100,000 rows. Files over the limit are blocked at upload — split or sample them first.
Why did a hyphen in my SKU disappear?
It shouldn't, if Punctuation is checked — the hyphen - is in the kept set. If you unticked Punctuation, hyphens, periods, commas, and slashes are all removed. Re-enable Punctuation for SKU-style data.
Will it fix garbled mojibake like Ã© or â‚¬?
No. It may remove some bytes of a mojibake sequence but it cannot reconstruct the original character, and it often leaves a partial mess. Fix encoding at the export step or with csv-cleaner, then strip.
How do I get rid of the double spaces left behind?
Chain the csv-whitespace-trimmer tool after stripping. This stripper deletes characters but never collapses adjacent spaces, so a separate trim pass is the clean way to tidy the output.
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.