How to strip non-ASCII and control characters from Excel data before Elasticsearch indexing
- Step 1: Export the source data to a spreadsheet. Pull the documents you intend to index into an .xlsx or .csv file (one row per document, columns = fields). Drop it onto the tool; the first sheet is read into rows.
- Step 2: Keep all four toggles for index-safe text. The defaults (Letters + Digits + Spaces + Punctuation) keep searchable text and accents while deleting zero-width characters, control bytes, and symbols that pollute the index.
- Step 3: Decide how to handle word-joining whitespace. Spaces-on keeps only the ASCII space; NBSP is deleted, which can fuse two words. If you need NBSP to become a real space so tokens split correctly, run the whitespace trimmer first, then this tool.
- Step 4: Run the strip. Click Strip special chars. Every data cell is filtered; field names (header row) are left untouched so your mapping stays valid.
- Step 5: Check token-affecting changes. Use the Cells modified stat and the preview to confirm that only noise was removed and that multilingual titles survived; these are the cells that most affect search relevance.
- Step 6: Export and bulk-index. Download .stripped.csv (or .stripped.xlsx), convert it to your _bulk NDJSON, and index. The inverted index now holds clean tokens.
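The conversion in Step 6 happens outside the tool. A minimal sketch of it in Python, assuming the stripped CSV has a header row of field names and that every row should be indexed into a single index; the function name `csv_to_bulk_ndjson` and the index name `products` are illustrative, not part of the tool:

```python
import csv
import io
import json


def csv_to_bulk_ndjson(csv_text: str, index_name: str) -> str:
    """Turn CSV text (header row + one row per document) into a _bulk body."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # Action line: tells _bulk to index the following line into index_name.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: ensure_ascii=False keeps accented and CJK text readable.
        lines.append(json.dumps(row, ensure_ascii=False))
    if not lines:
        return ""
    # The _bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical two-row export; "products" is an assumed index name.
sample = "sku,title\nA1,Café Headphones\nA2,東京タワー\n"
body = csv_to_bulk_ndjson(sample, "products")
print(body, end="")
```

Every value is indexed as a string here; numeric or date fields would need casting, or a mapping that coerces them, before the request is sent.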
How noise characters affect the index — and what this tool does
Effect on Elasticsearch/OpenSearch indexing and the tool's behaviour with all four toggles on.
| Character | Effect on the index | Removed by defaults? | Result |
|---|---|---|---|
| Zero-width space U+200B | Title indexed as one un-matchable token | Yes | Hidden character removed; the text closes up, so add a space-normalisation pass if two tokens are needed |
| NBSP U+00A0 | Two words fused into one term | Yes (deleted) | Run space-normalisation first to split them |
| Control bytes (C0/C1) | Can break _bulk NDJSON parsing | Yes | Bulk request stays valid |
| Emoji 🚀 | Bloats term dictionary, rarely queried | Yes | Leaner index |
| Box-drawing / symbols | Noise tokens | Yes | Cleaner analysis |
| Accented letter é, ñ | Valid token (with icu/folding) | No — kept | Multilingual search preserved |
| CJK ideograph 中 | Tokenised by cjk/icu analyzer | No — kept | CJK search preserved |
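To see which of the characters in the table are hiding in a value before running the tool, a short audit can list them by Unicode category. This is a sketch only: the keep-set below (letters `L*`, decimal digits `Nd`, punctuation `P*`, ASCII space) is an assumption about what the four default toggles retain, and `find_noise` is a hypothetical helper, not the tool's code.

```python
import unicodedata


def find_noise(text: str) -> list[tuple[int, str, str]]:
    """Return (position, code point, category) for characters outside the keep-set."""
    hits = []
    for pos, ch in enumerate(text):
        cat = unicodedata.category(ch)
        # Assumed defaults: letters (L*), digits (Nd), punctuation (P*), ASCII space.
        kept = ch == " " or cat[0] in ("L", "P") or cat == "Nd"
        if not kept:
            hits.append((pos, f"U+{ord(ch):04X}", cat))
    return hits


print(find_noise("Wireless\u200bHeadphones"))  # [(8, 'U+200B', 'Cf')]
print(find_noise("New\u00a0Arrival"))          # [(3, 'U+00A0', 'Zs')]
print(find_noise("Fast\x01 charging"))         # [(4, 'U+0001', 'Cc')]
```

The category codes explain the table: `Cf` is a format character, `Zs` a non-ASCII space separator, and `Cc` a control character.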
Toggle presets by field type
Match the toggles to how each field will be analysed in the index.
| Field | Letters | Digits | Spaces | Punctuation |
|---|---|---|---|---|
| Full-text title/body (multilingual) | On | On | On | On |
| Keyword/ID field (exact match) | On | On | Off | Off |
| Numeric code field | Off | On | Off | Off |
| Tag field (letters only) | On | Off | On | Off |
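The presets can be mirrored in a script for a pipeline that cannot use the browser tool. The sketch below copies the four rows of the table; how each toggle maps to Unicode categories (`L*` for Letters, `Nd` for Digits, `P*` for Punctuation, ASCII space only for Spaces) is an assumption, and the names `PRESETS` and `strip_cell` are hypothetical.

```python
import unicodedata

# Presets from the table above: (letters, digits, spaces, punctuation).
PRESETS = {
    "full_text": (True, True, True, True),
    "keyword": (True, True, False, False),
    "numeric_code": (False, True, False, False),
    "tag": (True, False, True, False),
}


def strip_cell(text: str, preset: str = "full_text") -> str:
    """Delete every character the preset does not allow; nothing is substituted."""
    letters, digits, spaces, punctuation = PRESETS[preset]

    def keep(ch: str) -> bool:
        cat = unicodedata.category(ch)
        return (
            (letters and cat.startswith("L"))
            or (digits and cat == "Nd")
            or (spaces and ch == " ")
            or (punctuation and cat.startswith("P"))
        )

    return "".join(ch for ch in text if keep(ch))


print(strip_cell("SKU-42 🚀", "keyword"))       # SKU42
print(strip_cell("SKU-42 🚀", "numeric_code"))  # 42
print(strip_cell("New Arrival!", "tag"))        # New Arrival
```

Because characters are deleted rather than replaced, removing an emoji that sits between two spaces leaves both spaces behind in this sketch.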
Cookbook
Before/after focused on how the analyzer tokenises the cell. Hidden characters shown as escapes; they are invisible in the source.
Zero-width space hiding a title from search
A product title carried a zero-width space from a CMS paste. The standard analyzer indexed 'Wireless\u200bHeadphones' as one token, so a search for 'headphones' never hit it. Defaults delete the ZWSP.
Input (CSV):
title
Wireless\u200bHeadphones
Indexed token (before): [wireless\u200bheadphones] (no match)
Output (defaults):
title
WirelessHeadphones
Indexed token (after): [wirelessheadphones]
(Tip: add a space-normalisation pass if you need two tokens.)
NBSP fusing two words
An NBSP between 'New' and 'Arrival' fused them into one term. Spaces-on deletes the NBSP, so words still fuse here — the fix is to convert NBSP to space first, then strip the rest.
Input:
tag
New\u00a0Arrival
This tool alone (defaults): NewArrival (still one token)
Recommended: whitespace trimmer (NBSP -> space) THEN this tool: New Arrival -> tokens [new] [arrival]
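The order of operations in this recipe can be checked with a few lines of Python. This is a rough sketch: `normalise_spaces` is a hypothetical stand-in for the whitespace trimmer (its collapse-and-trim behaviour is assumed), and `lower().split()` only approximates what an analyzer does.

```python
import re


def normalise_spaces(text: str) -> str:
    """Turn NBSP into an ASCII space, then collapse runs and trim the ends."""
    text = text.replace("\u00a0", " ")
    return re.sub(r" {2,}", " ", text).strip(" ")


fused = "New\u00a0Arrival".replace("\u00a0", "")  # what deletion alone does
split = normalise_spaces("New\u00a0Arrival")      # convert first, then strip

print(fused.lower().split())  # ['newarrival']
print(split.lower().split())  # ['new', 'arrival']
```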
Control byte breaking the _bulk payload
A control character in a description corrupted the NDJSON line, aborting the bulk batch with a parse error. Removing control bytes makes the payload valid.
Input (\x01 = control):
desc
Fast\x01 charging
_bulk error (before): mapper_parsing_exception / invalid JSON
Output (defaults):
desc
Fast charging
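The failure mode can be reproduced locally with Python's `json` module, which, like a strict JSON parser, rejects a raw control character inside a string. The sketch assumes the NDJSON line was built by string concatenation; a serialiser such as `json.dumps` would escape the byte as `\u0001` instead of emitting it raw. The pattern covers C0, DEL, and C1, and `remove_controls` is a hypothetical helper.

```python
import json
import re

# C0 (U+0000-U+001F), DEL (U+007F) and C1 (U+0080-U+009F) control characters.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def remove_controls(value: str) -> str:
    """Delete control characters without inserting a replacement."""
    return CONTROL_CHARS.sub("", value)


raw_line = '{"desc": "Fast\x01 charging"}'  # hand-built line with a raw control byte
try:
    json.loads(raw_line)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False

clean_line = json.dumps({"desc": remove_controls("Fast\x01 charging")})
print(parsed_ok)               # False
print(json.loads(clean_line))  # {'desc': 'Fast charging'}
```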
Keep multilingual content for ICU analysis
Documents in mixed scripts must keep their letters so the icu_analyzer tokenises correctly. Defaults preserve accented Latin and CJK while removing emoji noise.
Input:
title
Café déjà vu 🚀 東京タワー
Output (defaults):
title
Café déjà vu 東京タワー
Sanitise an XLSX catalog before indexing
A 9,000-row product catalog in XLSX. Clean all data cells, keep field-name headers, then export and convert to _bulk NDJSON.
Input: catalog.xlsx (Sheet1)
header: sku | title | description
row: A1 | Headphones | Loud 🔊
Download: catalog.stripped.xlsx
row becomes: A1 | Headphones | Loud
Edge cases and what actually happens
Multilingual letters are preserved for the analyzer
Preserved: \p{L} keeps accented Latin, Greek, Cyrillic, and CJK, so icu/cjk analyzers still receive real tokens. This tool is not an ASCII-only strip that would destroy non-English search.
Zero-width space removed, restoring matches
Resolved: U+200B is deleted, so a title that was one un-matchable token becomes searchable. Note that deletion closes up text: if the ZWSP sat between two words, they fuse into one token; add a space-normalisation pass if you need them split.
NBSP is deleted, not converted to a space
Expected: Spaces-on keeps only the ASCII space, so NBSP is removed and adjacent words fuse. For correct token boundaries, convert NBSP→space in the whitespace trimmer first, then run this tool.
Control bytes that break _bulk are removed
Resolved: C0/C1 control characters are never kept, so the cleaned export produces valid NDJSON and the bulk indexing batch no longer aborts on parse errors.
Header (field-name) row preserved
Preserved: Row 1 stays verbatim so your columns still line up with index field mappings. Sanitise a dirty field name with the header rename tool.
Emoji removed from term dictionary
Expected: Emoji are symbols and are deleted with defaults, trimming low-value terms from the inverted index. There is no option to keep them.
Removal closes up text
Expected: Deleted characters leave no placeholder; surrounding text joins. This is why a ZWSP or NBSP between words causes fusion, so verify token-critical fields in the preview.
Multi-sheet workbook
First sheet only: Only the first sheet is read and exported. Move the documents you intend to index to the first sheet.
Over the tier limit
Rejected: Free: 5 MB / 10,000 rows / 1 file. Larger document sets need Pro (50 MB / 100,000 rows / 5 files), Pro-media (200 MB / 500,000 rows), or Developer (500 MB / unlimited rows).
Decomposed accents
Edge: Precomposed é is kept; a decomposed base letter plus combining mark may lose the mark (it is \p{M}, not \p{L}). Normalise documents to NFC before stripping and indexing for consistent tokens.
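The difference between the two forms is easy to demonstrate with Python's standard `unicodedata` module. In this sketch, `letters_only` is a hypothetical stand-in for the Letters toggle that keeps only category `L*` characters; the tool's exact handling of combining marks is not confirmed here.

```python
import unicodedata

decomposed = "Cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single precomposed U+00E9


def letters_only(text: str) -> str:
    """Keep only Unicode letters (category L*), dropping marks and everything else."""
    return "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))


print(letters_only(decomposed))  # Cafe  (the combining mark is Mn, so it is dropped)
print(letters_only(composed))    # Café  (the accent is part of the letter)
```

Running the NFC step before the strip is what lets the accent survive.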
Frequently asked questions
Why does a product never appear in search even though it's indexed?
A very common cause is an invisible zero-width space (U+200B) inside the title, so the analyzer indexed the whole phrase as one un-matchable token. This tool deletes the ZWSP so normal tokens are produced again.
Will sanitising break my non-English documents?
No. The Letters toggle keeps all Unicode letters (\p{L}), so accented Latin and CJK survive and your icu/cjk analyzers still get real tokens. Only noise is removed.
Does it remove emoji and symbols from the index?
Yes. Emoji and box-drawing symbols are deleted with default toggles, which keeps the term dictionary lean without affecting real search terms.
Does deleting an NBSP fix or break tokenisation?
It removes the NBSP but does not insert a space, so two words it joined will fuse into one token. For correct boundaries, convert NBSP→space with the whitespace trimmer first, then run this tool for the remaining noise.
Will it stop my _bulk request from failing?
If the failure is caused by control bytes in field values, yes — those are removed, producing valid NDJSON. Mapping/type errors are unrelated and need index-side fixes.
Are my field names (headers) changed?
No. The header row is preserved so your columns still match the index mapping. Clean a problematic field name with the header rename tool.
Should I keep punctuation for full-text fields?
Usually yes — keep all four toggles on for title/body fields; the standard analyzer handles ASCII punctuation. For keyword/exact-match fields, untick Spaces and Punctuation for tighter values.
What output do I get to feed my indexer?
CSV input → .stripped.csv (easy to convert to NDJSON); XLSX input → .stripped.xlsx. Both are produced in-browser.
Is the source data uploaded anywhere?
No. Parsing and stripping run entirely in the browser, so pre-index source data never leaves your machine.
Can I sanitise just the title field?
Not directly — the filter runs on all data columns. To target one field, export it alone or pull it first with the regex extractor.
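Exporting one field on its own can also be done with a few lines of Python before the file reaches the tool. This is a sketch under the assumption that the source is a CSV with a header row; `extract_column` is a hypothetical helper, and merging the cleaned column back by row position is left to the caller.

```python
import csv
import io


def extract_column(csv_text: str, column: str) -> str:
    """Return a one-column CSV (header + values) for the named field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column])
    for row in reader:
        writer.writerow([row[column]])
    return out.getvalue()


# Hypothetical catalog with the same columns as the cookbook example.
catalog = "sku,title,description\nA1,Headphones,Loud\nA2,Speaker,Bass\n"
print(extract_column(catalog, "title"), end="")
```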
How many documents can I sanitise at once?
Free: 10,000 rows / 5 MB / 1 file. Pro: 100,000 rows / 50 MB / 5 files. Pro-media: 500,000 rows / 200 MB / 20 files. Developer: unlimited rows / 500 MB.
Does it deduplicate documents too?
No. It only removes characters. Hidden characters often make otherwise identical titles look distinct, so sanitise first, then deduplicate with the deduplicator or fuzzy-match with the fuzzy dedup tool.
Privacy first
Every JAD Excel tool runs entirely in your browser using SheetJS and ExcelJS. Your spreadsheets, formulas, and data never leave your device — verified by zero outbound network requests during processing.