How to find and delete near-duplicate Excel rows using Levenshtein similarity
- Step 1: Open the Fuzzy Deduplicator and drop your file. Drag an .xlsx or .csv onto the tool above. It reads the first sheet only and uses the top row as headers. Values are read as formatted text (numbers and dates become their displayed string), and blank cells become empty strings.
- Step 2: Type the key column name into the Key column field. The Key column control is a free-text input, not a dropdown: type the exact header of the column to compare on, e.g. company_name. Spelling and case must match the header in your file. Only that one column is scored; all other columns ride along untouched.
- Step 3: Set the similarity threshold. Enter a number from 50 to 100 (default 85). This is the minimum similarity at which two values are treated as the same. The helper text under the field reminds you: 85% removes near-duplicates like Acme Corp/Acme Corporation; 95% only removes very close matches. Treat the helper text as a rough guide: under the scoring formula that exact pair lands well below 85% (7 added characters against a 16-character string), so confirm with the report.
- Step 4: Process the file. The tool walks rows top to bottom. Each value is compared against the values already kept (the representative of each cluster); the first kept value scoring at or above your threshold wins, and the current row is removed. There is no separate preview-then-confirm step: deduplication happens when you process.
- Step 5: Read the removed-rows report. The results panel shows how many rows were removed and kept, and previews up to 5 removed rows in the form Row N "value" ≈ "matchedValue" (score%). The downloadable report text lists up to 50, and the underlying findings hold up to 200, so you can audit what collapsed before trusting the output.
- Step 6: Download the deduplicated .xlsx. Download deduped-fuzzy.xlsx, a single sheet named Deduped containing only the kept rows, with all original columns intact. If a cluster collapsed wrongly, raise the threshold and re-run; if it missed duplicates, lower it.
Exact dedup vs. fuzzy dedup on the same column
Why Excel Remove Duplicates leaves rows that Fuzzy Dedup catches. Scores are normalized Levenshtein similarity (case-/whitespace-insensitive) for the example pairs.
| Value pair | Excel Remove Duplicates | Fuzzy similarity | Removed at 85%? |
|---|---|---|---|
| Acme Corp / Acme Corp (with trailing space) | Kept both (trailing space differs) | 100% (trimmed before compare) | Yes |
| Acme Corp / acme corp | One removed (Excel's comparison ignores case) | 100% (lowercased before compare) | Yes |
| Acme Corp / Acme Corporation | Kept both | 56% (oration adds 7 characters; the longer string has 16) | No, needs a threshold of 56 or lower |
| Netflix Inc / Netflix, Inc | Kept both | 92% (one inserted comma) | Yes |
| Microsoft / Microsft | Kept both | 89% (one deleted letter) | Yes |
| Jackson / Jason | Kept both | 71% (two edits in 7 chars) | No (kept distinct) |
Threshold guidance by data type
The threshold (50–100, default 85) is the only similarity knob. Lower = more aggressive (more false positives); higher = stricter (more missed duplicates). Pick by re-running and reading the report.
| Data type | Suggested threshold | Why |
|---|---|---|
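| Company / vendor names with legal suffixes | 55–80 | The suffix is a large fraction of the string length, so variants score low: Acme Corp vs Acme Corporation is only 56%, and Acme vs Acme Corporation (25%) is below the 50 minimum and cannot be caught |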
| Short tidy names already trimmed/cased | 90–95 | Most variation is single-character typos; a high bar avoids merging genuinely different short names |
| Personal names | 90–95 | Jackson/Jason (71%) and Jan/Jon (67%) are only one or two edits apart, so keep the bar high to avoid false merges |
| Free-text addresses | 75–85 | Abbreviations (St/Street) and reordering create moderate distance |
| Structured codes / SKUs / IDs | 100 (or use exact dedup) | Near-match on codes creates false positives — prefer the exact csv-deduplicator instead |
What the tool reads, writes, and limits
Ground-truth behavior of the Fuzzy Deduplicator. Free tier cannot run this tool — it is Pro-gated.
| Aspect | Behavior |
|---|---|
| Input formats | .xlsx or .csv — first sheet only, first row as headers |
| Options | keyColumn (free-text, required) and threshold (number 50–100, default 85) — nothing else |
| Algorithm | 100 × (1 − Levenshtein edit distance ÷ length of longer string), rounded; both values lowercased + trimmed first |
| Survivor rule | First occurrence of each cluster is kept; later near-duplicates removed |
| Output | Binary .xlsx (deduped-fuzzy.xlsx), one sheet Deduped, all columns preserved |
| Tier gate | Pro tier minimum (Free is blocked at processing) |
| Pro limits | 50 MB file · 100,000 rows · 5 files |
| Pro-media limits | 200 MB file · 500,000 rows |
| Developer limits | 500 MB file · unlimited rows |
Cookbook
Real near-duplicate patterns, the threshold that catches them, and what the report shows. Row numbers in the report are 1-based and include the header row, so the first data row is Row 2.
Trailing space and case differences (caught at any threshold)
Because the tool lowercases and trims both values before scoring, casing and surrounding whitespace are invisible to it — these always score 100% and collapse even at threshold 100. This is the class of duplicate Excel Remove Duplicates silently keeps.
Input (column: company_name)
    company_name
    Acme Corp
    acme corp
    Acme Corp     <- this value has a trailing space
threshold: 85 (or even 100)

Report
    2 near-duplicate row(s) removed · 1 rows kept.
    Row 3 "acme corp" ≈ "Acme Corp" (100%)
    Row 4 "Acme Corp " ≈ "Acme Corp" (100%)

Output deduped-fuzzy.xlsx (sheet: Deduped)
    company_name
    Acme Corp
Legal-suffix variants need a lower threshold
Acme Corporation differs from Acme Corp by adding oration (7 inserted characters), and the longer string has 16 characters, so the pair scores 100 × (1 − 7/16) ≈ 56%, well below the default. Lower the threshold to about 55% to collapse suffix variants, then sanity-check the report before downloading.

Input (column: company_name)
    company_name
    Acme Corp
    Acme Corporation
    Acme Inc
threshold: 85 -> 0 removed (all kept; 56% < 85)
threshold: 55 ->

Report
    2 near-duplicate row(s) removed · 1 rows kept.
    Row 3 "Acme Corporation" ≈ "Acme Corp" (56%)
    Row 4 "Acme Inc" ≈ "Acme Corp" (56%)

Warning: at 55% "Acme Inc" also collapses into "Acme Corp" (4 edits against 9 characters). Verify the report; these may be different legal entities.
Single-character typos in a clean column
For a column that is already trimmed and consistently cased, most remaining duplicates are one-character typos. An 88–90% threshold catches them in values of about 10 characters or more without merging genuinely different short names. Shorter values need a lower bar: one edit in 6 characters scores 83%, and a swapped pair of letters such as Lodnon counts as two edits under plain Levenshtein, so it scores 67%.

Input (column: city)
    city
    London
    Lodnon
    Liverpool
threshold: 88 -> 0 removed (all kept; 67% < 88)
threshold: 65 ->

Report
    1 near-duplicate row(s) removed · 2 rows kept.
    Row 3 "Lodnon" ≈ "London" (67%)
First-occurrence-wins decides which row survives
The survivor is always the first row of the cluster in file order. If you want a specific record (most complete, most recent) to win, sort the column so that record appears first before processing.
Input (column: name), note the order
    name,notes
    J. Smith,sparse record
    John Smith,full record
threshold: 70

Report
    Row 3 "John Smith" ≈ "J. Smith" (?) removed

Output keeps "J. Smith" (the sparse record): first wins.
Fix: sort so the full record is first, then re-run, OR use exact dedup downstream and keep both for manual merge.
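The sort-first fix can be done before upload. The following is a minimal sketch, assuming the rows are already loaded as dictionaries and that "most complete" means "most non-empty cells"; the function name and the sample columns (email, phone) are hypothetical and not part of the tool.

```python
from typing import Dict, List


def most_complete_first(rows: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Return rows ordered so that rows with more non-empty cells come first.

    sorted() is stable, so rows with the same number of filled cells
    keep their original relative order.
    """
    def filled(row: Dict[str, object]) -> int:
        return sum(
            1 for cell in row.values()
            if cell is not None and str(cell).strip()
        )

    return sorted(rows, key=filled, reverse=True)


# Hypothetical sample data: the sparse record comes first in the file.
rows = [
    {"name": "J. Smith", "email": "", "phone": ""},
    {"name": "John Smith", "email": "john@example.com", "phone": "555-0100"},
]
for row in most_complete_first(rows):
    print(row["name"])
# John Smith
# J. Smith
```

Sorting changes the row order of the output file as well, so keep an index column if the original order matters.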
Empty key cells are a special case
A blank key value scores 100% against another blank (both empty → equal) but 0% against any non-empty value. So multiple blank-key rows collapse to one, while a blank never merges into a real name. Remove empty rows first if blanks are noise.
Input (column: company_name)
    company_name,id
    ,1
    ,2
    Acme,3
threshold: 85

Report
    1 near-duplicate row(s) removed · 2 rows kept.
    Row 3 "" ≈ "" (100%) <- second blank collapses into first

Output keeps one blank row + Acme. To drop blank-key rows entirely, run an empty-row pass first via /tool/csv-empty-row-remover.
Edge cases and what actually happens
Free tier user tries to run the tool
Pro required: Fuzzy Dedup is gated at Pro tier. The processor throws "Fuzzy Deduplicator requires Pro tier." for Free accounts before any rows are read. Free tier's Excel limits (5 MB / 10,000 rows / 1 file) are never reached because the tool won't run. Upgrade to Pro for 50 MB / 100,000 rows / 5 files.
Key column name typed wrong or doesn't exist
Empty matches: The Key column is free text and must match a header exactly. If you type a name no header matches, every row reads an empty string for the key. All blanks score 100% against each other, so the tool collapses the entire file to a single row. Always copy the header verbatim and check that the report's kept count looks sane before downloading.
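A mistyped header can be caught before processing. This is a minimal sketch for CSV input, assuming an exact, case-sensitive comparison against the first row; the function name is hypothetical, and whether the tool itself trims header cells is not something this check knows.

```python
import csv
import io


def key_column_exists(csv_text: str, key_column: str) -> bool:
    """Return True only if key_column equals one of the header cells exactly."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file has no header row
    return key_column in header


sample = "company_name,id\nAcme Corp,1\nacme corp,2\n"
print(key_column_exists(sample, "company_name"))  # True
print(key_column_exists(sample, "Company_Name"))  # False (case differs)
print(key_column_exists(sample, "company"))       # False (partial name)
```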
Empty key column field
Rejected: If the Key column field is left blank, the processor throws "Key column is required for fuzzy deduplication." Nothing is processed. Enter the exact header to proceed.
Threshold below 50 or above 100
Clamped by control: The threshold input is bounded (min=50, max=100). Values outside that range aren't part of the design: 50% is the lowest similarity the UI offers (already aggressive), and at 100% only values that are identical after normalization are removed. There is no 0% setting that would collapse everything.
Matching is not transitive clustering
By design: Each row is compared only to the representatives already kept (the first member of each cluster), not to every other row. So if A≈B and B≈C but A is not ≈C, the result depends on order: C is tested against A (the kept representative), not B. This greedy single-pass approach is fast and predictable but can split a chain differently than a full transitive cluster would.
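The order dependence is easier to see in code. This is a sketch of a greedy representative-based pass as described here, not the tool's actual source; the function names are hypothetical, and a toy numeric rule stands in for "similarity at or above the threshold" so the chain effect is obvious.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def greedy_dedup(
    values: List[T], is_match: Callable[[T, T], bool]
) -> Tuple[List[T], List[Tuple[T, T]]]:
    """Keep the first value of each cluster; test later values only against kept ones."""
    kept: List[T] = []
    removed: List[Tuple[T, T]] = []  # (removed value, representative it matched)
    for value in values:
        for representative in kept:
            if is_match(value, representative):
                removed.append((value, representative))
                break  # the first matching representative wins
        else:
            kept.append(value)  # nothing matched, so this value starts a new cluster
    return kept, removed


def close(x: int, y: int) -> bool:
    """Toy stand-in for a similarity test: numbers within 5 of each other match.

    It is not transitive: 0 matches 4 and 4 matches 8, but 0 does not match 8.
    """
    return abs(x - y) <= 5


print(greedy_dedup([0, 4, 8], close))  # ([0, 8], [(4, 0)])
print(greedy_dedup([4, 0, 8], close))  # ([4], [(0, 4), (8, 4)])
```

The same three values collapse to two rows or to one depending only on which of them appears first.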
Multi-sheet workbook
First sheet only: The reader (fileToRows) processes sheet index 0 only. Data on other tabs is ignored and absent from the output. Move the target data to the first sheet, or export the single sheet to CSV, before deduplicating.
Number/date key columns
Compared as text: Values are read with formatting applied (raw: false), so a key column of numbers or dates is compared as its displayed string. 1000 vs 1,000 or 2026-01-01 vs 01/01/2026 may not score as you expect because formatting differences are real edits. Standardize dates first via excel-date-standardizer.
You wanted exact, byte-for-byte dedup
Wrong tool: Fuzzy Dedup never compares exact bytes. It lowercases, trims, and scores by edit distance. For strict exact deduplication on IDs, emails, or SKUs where any near-match is a false positive, use the exact csv-deduplicator (the excel-deduplicator entry redirects there).
You wanted to JOIN two files, not dedup one
Wrong tool: This tool deduplicates rows within one file on one column. To match approximate keys across two separate files and merge their columns, use excel-fuzzy-merger (Developer tier).
Output XLSX preserves only kept rows
By design: deduped-fuzzy.xlsx contains the kept rows on a single sheet named Deduped, with all original columns. Removed rows are not in the file; they live only in the on-screen/report listing (up to 200 retained, up to 50 in the text report). Save the report separately if you need an audit trail.
Frequently asked questions
What similarity algorithm is used?
Levenshtein edit distance, normalized by the length of the longer string: 100 × (1 − distance / max(len_a, len_b)), rounded to a whole percent. Before comparing, both values are lowercased and trimmed, so identical-after-normalize values score 100%.
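To check a pair by hand before picking a threshold, the formula can be reproduced in a few lines. This is a minimal sketch of the scoring as described here, not the tool's code; the function names are hypothetical, and rounding half up and returning 100 for two blank values are assumptions based on the behavior this page describes.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Whole-percent similarity after lowercasing and trimming both values."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # assumption: two blank values are treated as equal
    score = 100 * (1 - levenshtein(a, b) / longest)
    return int(score + 0.5)  # assumption: round half up


print(similarity("Acme Corp", " acme corp "))     # 100
print(similarity("Netflix Inc", "Netflix, Inc"))  # 92
print(similarity("Microsoft", "Microsft"))        # 89
print(similarity("Jackson", "Jason"))             # 71
```

A pair is treated as a duplicate when this score is at or above the threshold, so the last pair survives at the default 85.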
Is the matching case-sensitive?
No. Both values are lowercased and trimmed before scoring, so Acme Corp, acme corp, and Acme Corp with a trailing space all score 100% against each other and collapse, even at a 100% threshold.
Does it keep the first or the last occurrence?
The first occurrence of each cluster is kept; every later near-duplicate is removed. The survivor is decided by file order, so sort your data if you want a specific record (e.g. the most complete) to win.
Can I preview matches before they're deleted?
Not as a separate confirm step. Deduplication happens when you process; the results panel then shows what was removed — a count plus up to 5 previewed rows as Row N "value" ≈ "matched" (score%), with up to 50 in the report text and 200 retained underneath. To change the outcome, adjust the threshold and re-run.
Can I deduplicate on more than one column?
No — the tool scores exactly one Key column. For a composite key (e.g. name + email), build a combined column first (for CSV, see csv-column-merger workflows) and point the Key column at it.
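For CSV input, a combined column can also be built with a short script before upload. This is a sketch under stated assumptions: the function name, the dedup_key column name, and the " | " separator are all hypothetical choices, not something the tool requires.

```python
import csv
import io
from typing import List


def add_composite_key(
    csv_text: str,
    columns: List[str],
    new_column: str = "dedup_key",
    sep: str = " | ",
) -> str:
    """Return the CSV with an extra column joining the given columns' values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + [new_column]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, lineterminator="\n", extrasaction="ignore"
    )
    writer.writeheader()
    for row in reader:
        # Missing cells come back as None, so fall back to an empty string.
        row[new_column] = sep.join((row.get(c) or "").strip() for c in columns)
        writer.writerow(row)
    return out.getvalue()


sample = "name,email\nAcme Corp,info@acme.com\nAcme Corp,sales@acme.com\n"
print(add_composite_key(sample, ["name", "email"]))
```

Type dedup_key into the Key column field afterwards. Because the score is computed over the whole combined string, a long shared part (such as the name) can mask a difference in a short part.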
Why did Excel's Remove Duplicates leave these rows but this tool removed them?
Excel's Remove Duplicates only deletes rows whose values match exactly (apart from letter case). Acme Corp and Acme Corporation, or a value with a trailing space, differ by edits or whitespace, so Excel keeps both. Fuzzy Dedup scores them by similarity and collapses the ones at or above your threshold.
What file types and sizes are supported?
.xlsx and .csv, first sheet only. The tool requires Pro tier: Pro allows 50 MB / 100,000 rows / 5 files, Pro-media 200 MB / 500,000 rows, and Developer 500 MB / unlimited rows. Free tier cannot run this tool.
Is it slow on large files?
It compares each row against the growing list of kept representatives, so cost grows with the number of distinct values, not the full n². On a heavily duplicated 50,000-row vendor list it stays fast; on a 100,000-row list of mostly-unique values it does more comparisons. It runs in your browser, so a busy main thread can make large files feel sluggish.
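A rough count of comparisons makes the difference concrete. This is back-of-the-envelope arithmetic for the greedy pass described here, not a measurement of the tool; the function names and the figure of 500 clusters are hypothetical.

```python
def worst_case_comparisons(rows: int) -> int:
    """All values distinct: row i is compared with the i - 1 rows kept before it."""
    return rows * (rows - 1) // 2


def comparison_upper_bound(rows: int, clusters: int) -> int:
    """Each row is checked against at most `clusters` kept representatives."""
    return rows * clusters


print(worst_case_comparisons(100_000))      # 4999950000
print(comparison_upper_bound(50_000, 500))  # 25000000
```

A mostly-unique 100,000-row column therefore approaches five billion scored pairs, while a heavily duplicated list with a few hundred clusters stays in the tens of millions.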
Does my data leave my computer?
No. Reading, scoring, and writing all happen in your browser via SheetJS. The deduplicated .xlsx is generated client-side and downloaded — nothing is uploaded.
What does the output file contain?
A single sheet named Deduped holding only the kept rows, with every original column preserved unchanged. Removed rows are not written to the file — they appear only in the report.
How do I undo a merge that was too aggressive?
There's no in-place undo; the output is a new file and your input is untouched. Raise the threshold (e.g. from 70% to 90%) and re-process the original to keep more distinct rows, then compare the two reports.
What if two genuinely different short names score above my threshold?
Short strings are sensitive — Jon vs Tom is two edits in three characters (~33%), but Jan vs Jon is one edit (~67%). For short personal names use a 90–95% threshold and review the report; if a real pair still collides, the only lever is the threshold (there are no per-pair exclusions).
Privacy first
Every JAD Excel tool runs entirely in your browser using SheetJS and ExcelJS. Your spreadsheets, formulas, and data never leave your device — verified by zero outbound network requests during processing.