Remove Near-Duplicate Rows from Excel with Fuzzy Matching

How to find and delete near-duplicate Excel rows using Levenshtein similarity

Step 1
Open the Fuzzy Deduplicator and drop your file — Drag an .xlsx or .csv onto the tool above. It reads the first sheet only and uses the top row as headers. Values are read as formatted text (numbers and dates become their displayed string), and blank cells become empty strings.
Step 2
Type the key column name into the Key column field — The Key column control is a free-text input, not a dropdown — type the exact header of the column to compare on, e.g. company_name. Spelling and case must match the header in your file. Only that one column is scored; all other columns ride along untouched.
Step 3
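Set the similarity threshold — Enter a number from 50 to 100 (default 85). This is the minimum similarity at which two values are treated as the same. The helper text under the field reads: 85% removes near-duplicates like Acme Corp / Acme Corporation; 95% only removes very close matches. Treat that as a rule of thumb: under the scoring formula described further down, Acme Corp / Acme Corporation scores about 56%, so that particular pair needs a much lower threshold.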
Step 4
Process the file — The tool walks rows top to bottom. Each value is compared against the values already kept (the representative of each cluster); the first kept value scoring at or above your threshold wins, and the current row is removed. There is no separate preview-then-confirm step — deduplication happens when you process.
Step 5
Read the removed-rows report — The results panel shows how many rows were removed and kept, and previews up to 5 removed rows in the form Row N "value" ≈ "matchedValue" (score%). The downloadable report text lists up to 50, and the underlying findings hold up to 200, so you can audit what collapsed before trusting the output.
Step 6
Download the deduplicated .xlsx — The file deduped-fuzzy.xlsx holds a single sheet named Deduped containing only the kept rows, with all original columns intact. If a cluster collapsed wrongly, raise the threshold and re-run; if it missed duplicates, lower it.
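Steps 4 and 5 can be sketched in a few lines of Python. This is an illustration of the greedy pass as described here, not the tool's source: the function names (`levenshtein`, `similarity`, `dedupe`) are made up for the example, and round-half-up is an assumption, since the page only says scores are rounded.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Percent similarity after lowercasing and trimming both values."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two blanks are treated as identical
    matching = longest - levenshtein(a, b)
    # Integer arithmetic for round-half-up (assumed rounding mode).
    return (200 * matching + longest) // (2 * longest)


def dedupe(values, threshold=85):
    """Greedy single pass: compare each value to kept representatives only."""
    kept = []     # first member of each cluster, in file order
    removed = []  # (row_number, value, matched_value, score)
    for index, value in enumerate(values):
        for rep in kept:
            score = similarity(value, rep)
            if score >= threshold:
                # +2 because the header is Row 1 and data starts at Row 2.
                removed.append((index + 2, value, rep, score))
                break
        else:
            kept.append(value)
    return kept, removed


kept, removed = dedupe(["Acme Corp", "acme corp", "Acme Corp "], threshold=100)
report = [f'Row {row} "{value}" ≈ "{rep}" ({score}%)'
          for row, value, rep, score in removed]
```

Here `kept` ends up as `["Acme Corp"]` and `report` holds one line each for Row 3 and Row 4, both at 100%, because lowercasing and trimming happen before scoring.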

Exact dedup vs. fuzzy dedup on the same column

Why Excel Remove Duplicates leaves rows that Fuzzy Dedup catches. Scores are normalized Levenshtein similarity (case-/whitespace-insensitive) for the example pairs.

Value pair	Excel Remove Duplicates	Fuzzy similarity	Removed at 85%?
`Acme Corp` / `Acme Corp ` (trailing space)	Kept both (trailing space differs)	100% (trimmed before compare)	Yes
`Acme Corp` / `acme corp`	Removes one (Excel's comparison ignores case)	100% (lowercased before compare)	Yes
`Acme Corp` / `Acme Corporation`	Kept both	~56% (`oration` adds 7 characters, giving a 16-character string)	No (needs a threshold of 56% or lower)
`Netflix Inc` / `Netflix, Inc`	Kept both	~92% (one inserted comma)	Yes
`Microsoft` / `Microsft`	Kept both	~89% (one deleted letter)	Yes
`Jackson` / `Jason`	Kept both	~71% (two edits in 7 chars)	No (kept distinct)
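The similarity column can be reproduced with a short sketch of normalized Levenshtein similarity. The helper names are hypothetical and round-half-up is an assumption; the formula itself (edit distance divided by the length of the longer value, after lowercasing and trimming) is the one this page describes.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Percent similarity after lowercasing and trimming both values."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    matching = longest - levenshtein(a, b)
    return (200 * matching + longest) // (2 * longest)  # round half up


pairs = [
    ("Acme Corp", "Acme Corp "),        # 100
    ("Acme Corp", "acme corp"),         # 100
    ("Acme Corp", "Acme Corporation"),  # 56  (7 edits / 16 characters)
    ("Netflix Inc", "Netflix, Inc"),    # 92  (1 edit / 12 characters)
    ("Microsoft", "Microsft"),          # 89  (1 edit / 9 characters)
    ("Jackson", "Jason"),               # 71  (2 edits / 7 characters)
]
scores = [similarity(a, b) for a, b in pairs]
```

Because the divisor is the longer string, the same number of edits costs far more in a short value than in a long one.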

Threshold guidance by data type

The threshold (50–100, default 85) is the only similarity knob. Lower = more aggressive (more false positives); higher = stricter (more missed duplicates). Pick by re-running and reading the report.

Data type	Suggested threshold	Why
Company / vendor names with legal suffixes	55–80	`Acme Corp` vs `Acme Corporation` scores only about 56% because the suffix is a large fraction of the string length; review the report closely at thresholds this low
Short tidy names already trimmed/cased	90–95	Most variation is single-character typos; a high bar avoids merging genuinely different short names
Personal names	90–95	`Jackson`/`Jason` (71%) and `Jan`/`Jon` (67%) are close in edit distance; keep the bar high to avoid false merges
Free-text addresses	75–85	Abbreviations (`St`/`Street`) and reordering create moderate distance
Structured codes / SKUs / IDs	100 (or use exact dedup)	Near-match on codes creates false positives — prefer the exact csv-deduplicator instead

What the tool reads, writes, and limits

Ground-truth behavior of the Fuzzy Deduplicator. Free tier cannot run this tool — it is Pro-gated.

Aspect	Behavior
Input formats	`.xlsx` or `.csv` — first sheet only, first row as headers
Options	`keyColumn` (free-text, required) and `threshold` (number 50–100, default 85) — nothing else
Algorithm	Levenshtein edit distance ÷ length of longer string, ×100, rounded; both values lowercased + trimmed first
Survivor rule	First occurrence of each cluster is kept; later near-duplicates removed
Output	Binary `.xlsx` (`deduped-fuzzy.xlsx`), one sheet `Deduped`, all columns preserved
Tier gate	Pro tier minimum (Free is blocked at processing)
Pro limits	50 MB file · 100,000 rows · 5 files
Pro-media / Developer	Pro-media: 200 MB / 500,000 rows · Developer: 500 MB / unlimited rows

Cookbook

Real near-duplicate patterns, the threshold that catches them, and what the report shows. Row numbers in the report are 1-based and include the header row, so the first data row is Row 2.

Trailing space and case differences (caught at any threshold)

Because the tool lowercases and trims both values before scoring, casing and surrounding whitespace are invisible to it: these always score 100% and collapse even at threshold 100. Excel Remove Duplicates already ignores case, but it silently keeps values that differ only by leading or trailing spaces.

Input (column: company_name)
company_name
Acme Corp
acme corp
Acme Corp 

threshold: 85 (or even 100)

Report
2 near-duplicate row(s) removed · 1 rows kept.
Row 3 "acme corp" ≈ "Acme Corp" (100%)
Row 4 "Acme Corp " ≈ "Acme Corp" (100%)

Output deduped-fuzzy.xlsx (sheet: Deduped)
company_name
Acme Corp

Legal-suffix variants need a lower threshold

Acme Corporation differs from Acme Corp by adding oration (7 inserted characters), giving a 16-character string and a score of about 56% (100 × (1 − 7/16)), well below the default. Lower the threshold to about 55% to collapse suffix variants, then sanity-check the report before downloading.

Input (column: company_name)
company_name
Acme Corp
Acme Corporation
Acme Inc

threshold: 85  ->  0 removed (all kept; ~56% < 85)
threshold: 55  ->
Report
2 near-duplicate row(s) removed · 1 rows kept.
Row 3 "Acme Corporation" ≈ "Acme Corp" (56%)
Row 4 "Acme Inc" ≈ "Acme Corp" (56%)

Warning: at 55% "Acme Inc" also collapses into "Acme Corp".
Verify the report; these may be different legal entities.

Single-character typos in a clean column

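For a column that is already trimmed and consistently cased, most remaining duplicates are small typos. One wrong, missing, or extra character scores 100 × (1 − 1/length), so an 88–90% threshold catches it only in values of about nine characters or more; in a six-letter value it scores 83%. A swap of two adjacent letters, as in Lodnon, counts as two substitutions under plain Levenshtein distance and scores 67%, so it needs a lower bar still.

Input (column: city)
city
London
Lodnon
Liverpool

threshold: 88  ->  0 removed (all kept; 67% < 88)
threshold: 65  ->
Report
1 near-duplicate row(s) removed · 2 rows kept.
Row 3 "Lodnon" ≈ "London" (67%)

At a threshold this low, check the report for short names
that merged but are genuinely different.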

First-occurrence-wins decides which row survives

The survivor is always the first row of the cluster in file order. If you want a specific record (most complete, most recent) to win, sort the column so that record appears first before processing.

Input (column: name)  -- note the order
name,notes
J. Smith,sparse record
John Smith,full record

threshold: 70
Report: Row 3 "John Smith" ≈ "J. Smith" (70%)  removed
Output keeps "J. Smith" (the sparse record) — first wins.

Fix: sort so the full record is first, then re-run, OR
use exact dedup downstream and keep both for manual merge.
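One way to make the fuller record win is to reorder the rows before processing. The sketch below assumes "most complete" means "most non-empty cells"; the `completeness` helper and the sample columns are hypothetical, and any other ranking (for example a last-updated date) would slot into the same `sorted` call.

```python
def completeness(row: dict) -> int:
    """Number of non-empty cells in a row (the assumed ranking)."""
    return sum(1 for value in row.values() if str(value).strip())


rows = [
    {"name": "J. Smith", "email": "", "phone": ""},
    {"name": "John Smith", "email": "john@example.com", "phone": "555-0100"},
    {"name": "Acme Corp", "email": "", "phone": "555-0199"},
]

# sorted() is stable, so rows with equal completeness keep their file order.
ordered = sorted(rows, key=completeness, reverse=True)
names = [row["name"] for row in ordered]
```

After sorting, `John Smith` precedes `J. Smith`, so a first-occurrence-wins pass keeps the full record. Note that this reorders the whole file, not only the rows inside one cluster.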

Empty key cells are a special case

A blank key value scores 100% against another blank (both empty → equal) but 0% against any non-empty value. So multiple blank-key rows collapse to one, while a blank never merges into a real name. Remove empty rows first if blanks are noise.

Input (column: company_name)
company_name,id
,1
,2
Acme,3

threshold: 85

Report
1 near-duplicate row(s) removed · 2 rows kept.
Row 3 "" ≈ "" (100%)   <- second blank collapses into first

Output keeps one blank row + Acme. To drop blank-key rows
entirely, run an empty-row pass first via
/tool/csv-empty-row-remover.

Edge cases and what actually happens

Free tier user tries to run the tool

Pro required

Fuzzy Dedup is gated at Pro tier: for Free accounts the processor throws "Fuzzy Deduplicator requires Pro tier." before any rows are read. Free tier's Excel limits (5 MB / 10,000 rows / 1 file) are never reached because the tool won't run. Upgrade to Pro for 50 MB / 100,000 rows / 5 files.

Key column name typed wrong or doesn't exist

Empty matches

The Key column is free text and must match a header exactly. If you type a name no header matches, every row reads an empty string for the key — all blanks score 100% against each other, so the tool collapses the entire file to a single row. Always copy the header verbatim and check the report's kept count looks sane before downloading.
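A quick pre-flight check on the header row avoids this failure. The sketch below works on CSV text with the standard library; `resolve_key_column` is a hypothetical helper for use before uploading, not something the tool provides.

```python
import csv
import io


def resolve_key_column(csv_text: str, key_column: str) -> str:
    """Return key_column if it matches a header exactly, else raise."""
    headers = next(csv.reader(io.StringIO(csv_text)), [])
    if key_column in headers:
        return key_column
    # Offer a hint when only case or surrounding spaces differ.
    wanted = key_column.strip().lower()
    near = [h for h in headers if h.strip().lower() == wanted]
    hint = f" Did you mean {near[0]!r}?" if near else ""
    raise ValueError(f"No header named {key_column!r}.{hint}")


sample = "company_name,id\nAcme,1\nAcme Corp,2\n"
ok = resolve_key_column(sample, "company_name")

try:
    resolve_key_column(sample, "Company_Name")
    message = ""
except ValueError as error:
    message = str(error)
```

The second call fails with a message that suggests `company_name`, which is the cue to copy the header verbatim into the Key column field.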

Empty key column field

Rejected

If the Key column field is left blank, the processor throws "Key column is required for fuzzy deduplication." and nothing is processed. Enter the exact header to proceed.

Threshold below 50 or above 100

Clamped by control

The threshold input is bounded min=50 max=100. Values outside that range aren't part of the design — 50% is the lowest similarity the UI offers (already aggressive), and 100% keeps only exact-after-normalize matches. There is no 0% setting that would collapse everything.

Matching is not transitive clustering

By design

Each row is compared only to the representatives already kept (the first member of each cluster), not to every other row. So if A≈B and B≈C but A is not ≈C, the result depends on order: C is tested against A (the kept representative), not B. This greedy single-pass approach is fast and predictable but can split a chain differently than a full transitive cluster would.
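The order dependence is easy to see in a small sketch of the greedy pass. The helper names are hypothetical and round-half-up is assumed; the three names form a chain in which the neighbours score 90% but the two ends score only 80%.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (ca != cb),
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Percent similarity after lowercasing and trimming both values."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    matching = longest - levenshtein(a, b)
    return (200 * matching + longest) // (2 * longest)  # round half up


def kept_rows(values, threshold):
    """Keep a value only if no kept representative reaches the threshold."""
    kept = []
    for value in values:
        if not any(similarity(value, rep) >= threshold for rep in kept):
            kept.append(value)
    return kept


first = kept_rows(["Jon Smith", "John Smith", "John Smyth"], 85)
# -> ['Jon Smith', 'John Smyth']  (John Smyth is only 80% like Jon Smith)
second = kept_rows(["John Smith", "Jon Smith", "John Smyth"], 85)
# -> ['John Smith']               (both others are 90% like John Smith)
```

The same three values produce two kept rows or one depending purely on which appears first, which is why sorting before processing matters.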

Multi-sheet workbook

First sheet only

The reader (fileToRows) processes sheet index 0 only. Data on other tabs is ignored and absent from the output. Move the target data to the first sheet, or export the single sheet to CSV, before deduplicating.

Number/date key columns

Compared as text

Values are read with formatting applied (raw: false), so a key column of numbers or dates is compared as its displayed string. 1000 vs 1,000 or 2026-01-01 vs 01/01/2026 may not score as you expect because formatting differences are real edits. Standardize dates first via excel-date-standardizer.
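To see why formatting matters, the sketch below scores two displayed forms of the same number and then normalizes two date strings before comparing. Everything here is illustrative: the helper names are hypothetical, round-half-up is assumed, and `to_iso` only understands the two formats listed, with `01/01/2026` read as month/day/year.

```python
from datetime import datetime


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,
                current[j - 1] + 1,
                previous[j - 1] + (ca != cb),
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Percent similarity after lowercasing and trimming both values."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    matching = longest - levenshtein(a, b)
    return (200 * matching + longest) // (2 * longest)  # round half up


def to_iso(text: str, formats=("%Y-%m-%d", "%m/%d/%Y")) -> str:
    """Rewrite a recognised date as YYYY-MM-DD; leave anything else alone."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return text


number_score = similarity("1000", "1,000")  # 80: the comma is a real edit
date_score = similarity(to_iso("2026-01-01"), to_iso("01/01/2026"))  # 100
```

At the default threshold of 85, `1000` and `1,000` would stay as separate rows, while the two dates match only once they share one format.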

You wanted exact, byte-for-byte dedup

Wrong tool

Fuzzy Dedup never compares exact bytes — it lowercases, trims, and scores by edit distance. For strict exact deduplication on IDs, emails, or SKUs where any near-match is a false positive, use the exact csv-deduplicator (the excel-deduplicator entry redirects there).

You wanted to JOIN two files, not dedup one

Wrong tool

This tool deduplicates rows within one file on one column. To match approximate keys across two separate files and merge their columns, use excel-fuzzy-merger (Developer tier).

Output XLSX preserves only kept rows

By design

deduped-fuzzy.xlsx contains the kept rows on a single sheet named Deduped, with all original columns. Removed rows are not in the file — they live only in the on-screen/report listing (up to 200 retained, up to 50 in the text report). Save the report separately if you need an audit trail.

Frequently asked questions

What similarity algorithm is used?

Levenshtein edit distance, normalized by the length of the longer string: 100 × (1 − distance / max(len_a, len_b)), rounded to a whole percent. Before comparing, both values are lowercased and trimmed, so identical-after-normalize values score 100%.

Is the matching case-sensitive?

No. Both values are lowercased and trimmed before scoring, so Acme Corp, acme corp, and Acme Corp with a trailing space all score 100% against each other and collapse, even at a 100% threshold.

Does it keep the first or the last occurrence?

The first occurrence of each cluster is kept; every later near-duplicate is removed. The survivor is decided by file order, so sort your data if you want a specific record (e.g. the most complete) to win.

Can I preview matches before they're deleted?

Not as a separate confirm step. Deduplication happens when you process; the results panel then shows what was removed — a count plus up to 5 previewed rows as Row N "value" ≈ "matched" (score%), with up to 50 in the report text and 200 retained underneath. To change the outcome, adjust the threshold and re-run.

Can I deduplicate on more than one column?

No — the tool scores exactly one Key column. For a composite key (e.g. name + email), build a combined column first (for CSV, see csv-column-merger workflows) and point the Key column at it.
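For a CSV, the combined column can be built with the standard library before uploading. This is a minimal sketch: `add_composite_key`, the `dedupe_key` column name, and the ` | ` separator are all choices made for the example.

```python
import csv
import io


def add_composite_key(csv_text, columns, key_name="dedupe_key", sep=" | "):
    """Append a column that joins the given columns for each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(reader.fieldnames) + [key_name],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        # Missing cells come back as None; treat them as empty strings.
        row[key_name] = sep.join((row[c] or "").strip() for c in columns)
        writer.writerow(row)
    return out.getvalue()


source = "name,email\nJohn Smith,john@example.com\n"
combined = add_composite_key(source, ["name", "email"])
```

Type `dedupe_key` into the Key column field afterwards. Keep in mind that a longer combined value dilutes each individual edit, so the threshold may need adjusting.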

Why did Excel's Remove Duplicates leave these rows but this tool removed them?

Excel's Remove Duplicates only removes rows whose values match exactly, apart from letter case. Acme Corp and Acme Corporation, or a value with a trailing space, differ by edits or whitespace, so Excel keeps both. Fuzzy Dedup scores them by similarity and collapses the ones above your threshold.

What file types and sizes are supported?

.xlsx and .csv, first sheet only. The tool requires Pro tier: Pro allows 50 MB / 100,000 rows / 5 files, Pro-media 200 MB / 500,000 rows, and Developer 500 MB / unlimited rows. Free tier cannot run this tool.

Is it slow on large files?

It compares each row against the growing list of kept representatives, so the work is roughly rows × kept representatives rather than always n². On a heavily duplicated 50,000-row vendor list the kept list stays short and it stays fast; on a 100,000-row list of mostly unique values the kept list grows with the file and the work approaches n²/2 comparisons (about 5 billion). It runs in your browser, so a busy main thread can make large files feel sluggish.
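The difference between those two cases can be counted directly. The sketch below tallies comparisons for a greedy pass with a pluggable matcher; to keep it short it uses an exact match after normalizing as a stand-in for the similarity score, which does not change how many representatives each row is checked against in these two extreme cases.

```python
def count_comparisons(values, is_match):
    """Run the greedy pass and count value-vs-representative checks."""
    kept, comparisons = [], 0
    for value in values:
        for rep in kept:
            comparisons += 1
            if is_match(value, rep):
                break  # first matching representative wins
        else:
            kept.append(value)
    return comparisons, len(kept)


def same_after_normalize(a: str, b: str) -> bool:
    """Stand-in matcher: equal once lowercased and trimmed."""
    return a.strip().lower() == b.strip().lower()


duplicated = count_comparisons(["Acme"] * 1000, same_after_normalize)
# -> (999, 1): every row is checked against a single representative
unique = count_comparisons(
    [f"vendor-{i}" for i in range(1000)], same_after_normalize
)
# -> (499500, 1000): n * (n - 1) / 2 checks when nothing collapses
```

Each check in the real tool is also a full edit-distance computation, so long key values multiply the cost further.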

Does my data leave my computer?

No. Reading, scoring, and writing all happen in your browser via SheetJS. The deduplicated .xlsx is generated client-side and downloaded — nothing is uploaded.

What does the output file contain?

A single sheet named Deduped holding only the kept rows, with every original column preserved unchanged. Removed rows are not written to the file — they appear only in the report.

How do I undo a merge that was too aggressive?

There's no in-place undo; the output is a new file and your input is untouched. Raise the threshold (e.g. from 70% to 90%) and re-process the original to keep more distinct rows, then compare the two reports.

What if two genuinely different short names score above my threshold?

Short strings are sensitive — Jon vs Tom is two edits in three characters (~33%), but Jan vs Jon is one edit (~67%). For short personal names use a 90–95% threshold and review the report; if a real pair still collides, the only lever is the threshold (there are no per-pair exclusions).

Privacy first

Every JAD Excel tool runs entirely in your browser using SheetJS and ExcelJS. Your spreadsheets, formulas, and data never leave your device — verified by zero outbound network requests during processing.
