Sanitize Excel Data Before Elasticsearch Indexing — Remove Special Characters

How to strip non-ASCII and control characters from Excel data before Elasticsearch indexing

Step 1
Export the source data to a spreadsheet — Pull the documents you intend to index into an .xlsx or .csv (one row per document, columns = fields). Drop it onto the tool; the first sheet is read into rows.
Step 2
Keep all four toggles for index-safe text — Defaults (Letters + Digits + Spaces + Punctuation) keep searchable text and accents while deleting zero-width characters, control bytes, and symbols that pollute the index.
Step 3
Decide how to handle word-joining whitespace — Spaces-on keeps only the ASCII space; NBSP is deleted, which can fuse two words. If you need NBSP to become a real space so tokens split correctly, run the whitespace trimmer first, then this tool.
Step 4
Run the strip — Click Strip special chars. Every data cell is filtered; field names (header row) are left untouched so your mapping stays valid.
Step 5
Check token-affecting changes — Use the Cells modified stat and the preview to confirm only noise was removed and that multilingual titles survived — these are the cells that most affect search relevance.
Step 6
Export and bulk-index — Download .stripped.csv (or .stripped.xlsx), convert to your _bulk NDJSON, and index. The inverted index now holds clean tokens.
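The conversion step in Step 6 happens outside the tool. A minimal sketch of it, assuming the stripped CSV has a header row of field names and that the target index is called `products` (a hypothetical name), could look like this:

```python
import csv
import io
import json


def csv_to_bulk_ndjson(csv_text: str, index_name: str) -> str:
    """Turn CSV text (header row = field names) into a _bulk request body."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Action line: index the document that follows into the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: one JSON document per line, escaped by the serializer.
        lines.append(json.dumps(row, ensure_ascii=False))
    # The _bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


sample = "sku,title\nA1,Wireless Headphones\nA2,Café déjà vu\n"
body = csv_to_bulk_ndjson(sample, "products")
print(body, end="")
```

Every value arrives as a string here; numeric or date fields would need converting before indexing, depending on your mapping.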

How noise characters affect the index — and what this tool does

Effect on Elasticsearch/OpenSearch indexing and the tool's behaviour with all four toggles on.

Character	Effect on the index	Removed by defaults?	Result
Zero-width space U+200B	Title indexed as one un-matchable token	Yes	Hidden character removed; words it separated close up into one token
NBSP U+00A0	Two words fused into one term	Yes (deleted)	Run space-normalisation first to split them
Control bytes (C0/C1)	Can break `_bulk` NDJSON parsing	Yes	Bulk request stays valid
Emoji 🚀	Bloats term dictionary, rarely queried	Yes	Leaner index
Box-drawing / symbols	Noise tokens	Yes	Cleaner analysis
Accented letter é, ñ	Valid token (with icu/folding)	No — kept	Multilingual search preserved
CJK ideograph 中	Tokenised by cjk/icu analyzer	No — kept	CJK search preserved
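The behaviour in the table can be approximated with Unicode general categories. This is a sketch, not the tool's actual implementation: it assumes Letters means any `L*` category, Digits means `Nd`, Spaces means only the ASCII space, and Punctuation means any `P*` category.

```python
import unicodedata


def strip_special(text: str) -> str:
    """Approximate the default toggles: keep letters, digits, ASCII space, punctuation."""
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):      # letters in any script (like \p{L})
            kept.append(ch)
        elif cat == "Nd":            # decimal digits
            kept.append(ch)
        elif ch == " ":              # ASCII space only; NBSP (Zs) is dropped
            kept.append(ch)
        elif cat.startswith("P"):    # punctuation
            kept.append(ch)
        # Everything else (Cc, Cf, So, Mn, ...) is deleted with no placeholder.
    return "".join(kept)


print(strip_special("Wireless\u200bHeadphones"))  # WirelessHeadphones
print(strip_special("New\u00a0Arrival"))          # NewArrival
print(strip_special("Fast\x01 charging"))         # Fast charging
```

Under these assumptions currency and math symbols are also deleted, because they fall in the `S*` categories alongside emoji and box-drawing characters.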

Toggle presets by field type

Match the toggles to how each field will be analysed in the index.

Field	Letters	Digits	Spaces	Punctuation
Full-text title/body (multilingual)	On	On	On	On
Keyword/ID field (exact match)	On	On	Off	Off
Numeric code field	Off	On	Off	Off
Tag field (letters only)	On	Off	On	Off
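To see what each preset does to a value, the four toggles can be modelled as booleans. The preset names and the category rules below are assumptions made for illustration; the tool itself exposes the toggles as checkboxes.

```python
import unicodedata

# Presets mirror the table: (letters, digits, spaces, punctuation).
PRESETS = {
    "full_text": (True, True, True, True),
    "keyword_id": (True, True, False, False),
    "numeric_code": (False, True, False, False),
    "tag": (True, False, True, False),
}


def strip_with_preset(text: str, preset: str) -> str:
    """Keep only the character classes that the chosen preset switches on."""
    letters, digits, spaces, punctuation = PRESETS[preset]
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        keep = (
            (letters and cat.startswith("L"))
            or (digits and cat == "Nd")
            or (spaces and ch == " ")
            or (punctuation and cat.startswith("P"))
        )
        if keep:
            out.append(ch)
    return "".join(out)


print(strip_with_preset("SKU-00 42/A", "keyword_id"))    # SKU0042A
print(strip_with_preset("SKU-00 42/A", "numeric_code"))  # 0042
print(strip_with_preset("new arrival!", "tag"))          # new arrival
```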

Cookbook

Before/after focused on how the analyzer tokenises the cell. Hidden characters shown as escapes; they are invisible in the source.

Zero-width space hiding a title from search

A product title carried a zero-width space from a CMS paste. The standard analyzer indexed 'Wireless\u200bHeadphones' as one token, so a search for 'headphones' never hit it. Defaults delete the ZWSP.

Input (CSV):
title
Wireless\u200bHeadphones

Indexed token (before): [wireless\u200bheadphones]  (no match)

Output (defaults):
title
WirelessHeadphones
Indexed token (after): [wirelessheadphones]
(Tip: add a space-normalisation pass if you need two tokens.)
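Because the ZWSP is invisible, it helps to list hidden characters by code point before deciding how to handle them. A small sketch, assuming "hidden" means anything that is not a letter, digit, punctuation mark, or ASCII space:

```python
import unicodedata


def hidden_characters(text: str) -> list[tuple[int, str, str]]:
    """List (position, code point, name) for characters outside the kept classes."""
    found = []
    for pos, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if ch == " " or cat[0] in "LP" or cat == "Nd":
            continue
        # Control characters have no Unicode name, so supply a fallback.
        name = unicodedata.name(ch, "<no name>")
        found.append((pos, f"U+{ord(ch):04X}", name))
    return found


print(hidden_characters("Wireless\u200bHeadphones"))
# [(8, 'U+200B', 'ZERO WIDTH SPACE')]
```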

NBSP fusing two words

An NBSP between 'New' and 'Arrival' fused them into one term. Spaces-on deletes the NBSP, so words still fuse here — the fix is to convert NBSP to space first, then strip the rest.

Input:
tag
New\u00a0Arrival

This tool alone (defaults): NewArrival  (still one token)

Recommended: whitespace trimmer (NBSP -> space) THEN this tool:
New Arrival  -> tokens [new] [arrival]
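The order of the two passes matters, and a short sketch shows why. The `normalise_spaces` function below is a stand-in for the whitespace trimmer (its real behaviour may differ), and splitting on the ASCII space is a rough stand-in for the analyzer.

```python
import unicodedata


def normalise_spaces(text: str) -> str:
    """Turn every space separator (category Zs, which includes NBSP) into an ASCII space."""
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)


def delete_non_ascii_spaces(text: str) -> str:
    """Mimic Spaces-on: keep the ASCII space, delete other space separators."""
    return "".join(
        ch for ch in text if ch == " " or unicodedata.category(ch) != "Zs"
    )


raw = "New\u00a0Arrival"

strip_only = delete_non_ascii_spaces(raw)
normalise_first = delete_non_ascii_spaces(normalise_spaces(raw))

print(strip_only.lower().split(" "))       # ['newarrival']
print(normalise_first.lower().split(" "))  # ['new', 'arrival']
```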

Control byte breaking the _bulk payload

A control character in a description corrupted the NDJSON line, aborting the bulk batch with a parse error. Removing control bytes makes the payload valid.

Input (\x01 = control):
desc
Fast\x01 charging

_bulk error (before): mapper_parsing_exception / invalid JSON

Output (defaults):
desc
Fast charging
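JSON does not allow raw control characters inside strings, which is why a hand-assembled NDJSON line fails to parse. The sketch below uses Python's `json` module as a stand-in for the server-side parser; the exact Elasticsearch error text will differ.

```python
import json

value = "Fast\x01 charging"

# Hand-built line: the raw control byte sits inside the JSON string.
hand_built = '{"desc": "' + value + '"}'
try:
    json.loads(hand_built)
    hand_built_ok = True
except json.JSONDecodeError:
    hand_built_ok = False

# Serializer-built line: the control character is escaped as \u0001.
serialised = json.dumps({"desc": value})
round_trip = json.loads(serialised)

print(hand_built_ok)  # False
print(serialised)     # {"desc": "Fast\u0001 charging"}
```

A serializer keeps the payload valid, but the escaped control character still ends up in the indexed value, so stripping it beforehand remains worthwhile.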

Keep multilingual content for ICU analysis

Documents in mixed scripts must keep their letters so the icu_analyzer tokenises correctly. Defaults preserve accented Latin and CJK while removing emoji noise.

Input:
title
Café déjà vu 🚀
東京タワー

Output (defaults):
title
Café déjà vu 
東京タワー

Sanitise an XLSX catalog before indexing

A 9,000-row product catalog in XLSX. Clean all data cells, keep field-name headers, then export and convert to _bulk NDJSON.

Input: catalog.xlsx (Sheet1)
header: sku | title | description
row:    A1  | Headphones | Loud 🔊

Download: catalog.stripped.xlsx
row becomes: A1 | Headphones | Loud

Edge cases and what actually happens

Multilingual letters are preserved for the analyzer

Preserved

\p{L} keeps accented Latin, Greek, Cyrillic, and CJK, so icu/cjk analyzers still receive real tokens. This tool is not an ASCII-only strip that would destroy non-English search.

Zero-width space removed, restoring matches

Resolved

U+200B is deleted, so a title that was one un-matchable token becomes searchable. Note that deletion closes up text — if the ZWSP sat between two words, they fuse into one token; add a space-normalisation pass if you need them split.

NBSP is deleted, not converted to a space

Expected

Spaces-on keeps only the ASCII space, so NBSP is removed and adjacent words fuse. For correct token boundaries, convert NBSP→space in the whitespace trimmer first, then run this.

Control bytes that break _bulk are removed

Resolved

C0/C1 control characters are never kept, so NDJSON built from the cleaned export carries no raw control bytes and the bulk indexing batch no longer aborts on that parse error.

Header (field-name) row preserved

Preserved

Row 1 stays verbatim so your columns still line up with index field mappings. Sanitise a dirty field name with the header rename tool.

Emoji removed from term dictionary

Expected

Emoji are symbols and are deleted with defaults, trimming low-value terms from the inverted index. There is no option to keep them.

Removal closes up text

Expected

Deleted characters leave no placeholder; surrounding text joins. This is why a ZWSP/NBSP between words causes fusion — verify token-critical fields in the preview.

Multi-sheet workbook

First sheet only

Only the first sheet is read and exported. Move the documents you intend to index to the first sheet.

Over the tier limit

Rejected

Free: 5 MB / 10,000 rows / 1 file. Larger document sets need Pro (50 MB / 100,000 rows / 5 files), Pro-media (200 MB / 500,000 rows), or Developer (500 MB / unlimited rows).

Decomposed accents

Edge

Precomposed é is kept; a decomposed base + combining mark may lose the mark (it is \p{M}). Normalise documents to NFC before stripping so the accent survives and tokens stay consistent.
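The difference is easy to reproduce with Python's `unicodedata` module. The `keep_letters` filter here is a simplified stand-in for the Letters toggle, assuming it keeps only `L*` categories:

```python
import unicodedata


def keep_letters(text: str) -> str:
    """Keep only characters whose general category is a letter (L*)."""
    return "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))


decomposed = "Cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (category Mn)
composed = unicodedata.normalize("NFC", decomposed)

print(keep_letters(decomposed))  # Cafe
print(keep_letters(composed))    # Café
```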

Frequently asked questions

Why does a product never appear in search even though it's indexed?

A very common cause is an invisible zero-width space (U+200B) inside the title, so the analyzer indexed the whole phrase as one un-matchable token. This tool deletes the ZWSP so normal tokens are produced again.

Will sanitising break my non-English documents?

No. The Letters toggle keeps all Unicode letters (\p{L}), so accented Latin and CJK survive and your icu/cjk analyzers still get real tokens. Only noise is removed.

Does it remove emoji and symbols from the index?

Yes. Emoji and box-drawing symbols are deleted with default toggles, which keeps the term dictionary lean without affecting real search terms.

Does deleting an NBSP fix or break tokenisation?

It removes the NBSP but does not insert a space, so two words it joined will fuse into one token. For correct boundaries, convert NBSP→space with the whitespace trimmer first, then run this tool for the remaining noise.

Will it stop my _bulk request from failing?

If the failure is caused by control bytes in field values, yes — those are removed, producing valid NDJSON. Mapping/type errors are unrelated and need index-side fixes.

Are my field names (headers) changed?

No. The header row is preserved so your columns still match the index mapping. Clean a problematic field name with the header rename tool.

Should I keep punctuation for full-text fields?

Usually yes — keep all four toggles on for title/body fields; the standard analyzer handles ASCII punctuation. For keyword/exact-match fields, untick Spaces and Punctuation for tighter values.

What output do I get to feed my indexer?

CSV input → .stripped.csv (easy to convert to NDJSON); XLSX input → .stripped.xlsx. Both are produced in-browser.

Is the source data uploaded anywhere?

No. Parsing and stripping run entirely in the browser, so pre-index source data never leaves your machine.

Can I sanitise just the title field?

Not directly — the filter runs on all data columns. To target one field, export it alone or pull it first with the regex extractor.

How many documents can I sanitise at once?

Free: 10,000 rows / 5 MB / 1 file. Pro: 100,000 rows / 50 MB / 5 files. Pro-media: 500,000 rows / 200 MB / 20 files. Developer: unlimited rows / 500 MB.

Does it deduplicate documents too?

No. It only removes characters. Hidden characters are often what make duplicate titles look distinct, so once they are gone, deduplicate with the deduplicator or fuzzy-match with the fuzzy dedup tool.

Privacy first

Every JAD Excel tool runs entirely in your browser using SheetJS and ExcelJS. Your spreadsheets, formulas, and data never leave your device — verified by zero outbound network requests during processing.
