Fuzzy deduplication vs Excel Remove Duplicates: key differences
- Step 1: Start with exact dedup. Run Excel's Data → Remove Duplicates (or the exact csv-deduplicator) on the columns where identical values are true duplicates. This clears the cheap, unambiguous cases first.
- Step 2: Decide which column needs fuzzy matching. Pick the single text column where near-duplicates hide: company name, vendor name, contact name, or address. Fuzzy Dedup scores exactly one column, so choose deliberately.
- Step 3: Open Fuzzy Dedup and set the Key column. Drop the exact-deduped file onto the tool and type that column's exact header into the Key column field (it's free text, not a dropdown).
- Step 4: Choose a threshold and process. The default is 85; use 90–95 for short or personal values and 65–80 for legal-suffix variants. Process the file; the tool keeps the first row of each near-duplicate cluster and removes the rest.
- Step 5: Read the report and validate. Check the summary ({removedCount} removed · {keptCount} kept) and the previewed lines of the form Row N "value" ≈ "matched" (score%) for false merges. Adjust the threshold and re-run if needed.
- Step 6: Download the combined result. Download deduped-fuzzy.xlsx: exact-identical rows are already gone from step 1, near-duplicates are collapsed in this pass, and all columns are preserved.
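The validation in Step 5 can be partly scripted once the report text is copied out of the tool. The sketch below assumes report lines follow the shape quoted above (Row N "value" ≈ "matched" (score%)); the function names, the sample lines, and the 3-point review margin are hypothetical and not part of the tool.

```python
import re

# Assumed line shape, taken from the report preview format described above.
REPORT_LINE = re.compile(r'Row (\d+) "(.*)" ≈ "(.*)" \((\d+)%\)')


def parse_report_line(line):
    """Return the fields of one report line, or None if it does not match."""
    m = REPORT_LINE.search(line)
    if m is None:
        return None
    row, value, matched, score = m.groups()
    return {"row": int(row), "value": value, "matched": matched, "score": int(score)}


def borderline(lines, threshold, margin=3):
    """Parsed entries whose score is within `margin` points above the threshold."""
    entries = (parse_report_line(line) for line in lines)
    return [e for e in entries if e is not None and e["score"] - threshold <= margin]


# Made-up sample lines for illustration only.
report = [
    'Row 3 "Netflix Inc" ≈ "Netflix, Inc" (92%)',
    'Row 7 "Acme Corp." ≈ "Acme Corp" (90%)',
    'Row 9 "Initeck" ≈ "Initech" (86%)',
]
for entry in borderline(report, threshold=85):
    print(entry["row"], entry["value"], "->", entry["matched"], entry["score"])
```

Merges that only just cleared the threshold are the ones most worth checking by eye before you accept the output.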
Excel Remove Duplicates vs Fuzzy Dedup, feature by feature
Exact dedup compares many columns byte-for-byte; Fuzzy Dedup compares one column by normalized Levenshtein similarity after lowercase+trim.
| Feature | Excel Remove Duplicates | Fuzzy Dedup (this tool) |
|---|---|---|
| Match type | Exact (values must be identical) | Approximate (Levenshtein similarity %) |
| Case sensitivity | Case-insensitive for text by default | Always case-insensitive (lowercased first) |
| Whitespace | Significant (trailing space = different) | Trimmed before compare (ignored) |
| Columns compared | Any number you tick | Exactly one (the Key column) |
| Tuning | None — exact or not | Threshold 50–100 (default 85) |
| Survivor | First occurrence | First occurrence of each cluster |
| Order dependence | None for exact match | Yes — greedy single-pass, not transitive |
| Output | In-place in the workbook | New .xlsx (sheet Deduped) + removal report |
| Where it runs | Excel desktop app | Browser (SheetJS), Pro tier |
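The scoring summarized above (trim, lowercase, then normalized Levenshtein similarity) can be sketched as follows. The normalization by the longer string's length is an assumption: it reproduces the ~92% and ~80% figures quoted on this page, but the tool's exact formula is not documented here, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed score: 100 * (1 - distance / longer length), after trim + lowercase."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty values are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(round(similarity("Netflix, Inc", "Netflix Inc")))  # 92
print(round(similarity("12345", "12346")))               # 80
print(round(similarity("Acme ", "acme")))                # 100
```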
Which tool for which data
Pick exact when any difference matters; pick fuzzy when typos and variants are the duplicates.
| Data | Use | Reason |
|---|---|---|
| Product SKUs / part numbers | Exact (Remove Duplicates / csv-deduplicator) | Codes are structured — a near-match is a false positive |
| Email addresses | Exact | a@x.com vs a@x.con are different addresses, not duplicates |
| Record IDs / order numbers | Exact | IDs must match precisely |
| Company / vendor names | Fuzzy (low threshold) | Legal suffixes and abbreviations create real near-duplicates |
| Contact / person names | Fuzzy (high threshold) | Nicknames and typos; keep the bar high to protect real people |
| Free-text addresses | Fuzzy (mid threshold) | Abbreviations and reordering create moderate distance |
Cost and capacity
Exact dedup is a hash pass; fuzzy compares against accumulated representatives. Fuzzy Dedup is Pro-gated.
| Aspect | Exact (Remove Duplicates) | Fuzzy Dedup |
|---|---|---|
| Algorithmic cost | ≈ linear (hashing) | Each row vs. kept representatives × edit distance |
| Tier | Built into Excel / Free csv tool | Pro tier minimum (Free blocked) |
| Capacity (Pro) | n/a | 50 MB · 100,000 rows · 5 files |
| Capacity (Developer) | n/a | 500 MB · unlimited rows |
| Best run order | First (cheap) | Second (on the survivors) |
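To make the cost row concrete, the sketch below counts how many pairwise checks a greedy single pass performs. It assumes the pass stops at the first matching representative, which is an assumption about the tool, and it uses plain equality as a stand-in matcher so the counts can be verified by hand.

```python
def greedy_comparisons(values, is_match):
    """Count pairwise checks made by a greedy single pass over `values`."""
    kept = []
    comparisons = 0
    for value in values:
        for rep in kept:
            comparisons += 1
            if is_match(value, rep):
                break  # duplicate of an existing representative; stop looking
        else:
            kept.append(value)  # no representative matched; this row becomes one
    return comparisons, len(kept)


def same(a, b):
    # Stand-in matcher; a real run would compute an edit-distance score here,
    # which costs far more per check than an equality test.
    return a == b


mostly_unique = [f"vendor-{i}" for i in range(1000)]
heavily_duplicated = [f"vendor-{i % 10}" for i in range(1000)]

print(greedy_comparisons(mostly_unique, same))       # (499500, 1000)
print(greedy_comparisons(heavily_duplicated, same))  # (5490, 10)
```

The same 1,000 rows need roughly 90 times more checks when almost every value is distinct, which is why shrinking the input with an exact pass first pays off.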
Cookbook
Side-by-side outcomes on the same rows, showing what each approach keeps. Fuzzy report row numbers are 1-based including the header row.
The case Remove Duplicates misses entirely
Netflix, Inc and Netflix Inc differ by one comma — byte-different, so Excel keeps both. Fuzzy scores them ~92% and collapses them at the default threshold.
    Input (column: company)
      company
      Netflix, Inc
      Netflix Inc

    Excel Remove Duplicates
      -> keeps BOTH (not identical)

    Fuzzy Dedup, threshold 85
      Report: Row 3 "Netflix Inc" ≈ "Netflix, Inc" (92%) removed
      Output: one row, "Netflix, Inc"
The case fuzzy can over-merge
Exact dedup never confuses two real values; fuzzy can. 12345 vs 12346 (two SKUs) score ~80% — a 75% threshold would wrongly merge them. This is why codes belong to exact dedup.
    Input (column: sku)
      sku
      12345
      12346

    Excel Remove Duplicates
      -> keeps BOTH (correct)

    Fuzzy Dedup, threshold 75
      Report: Row 3 "12346" ≈ "12345" (80%) removed   [WRONG]

    Lesson: never fuzzy-dedup structured codes.
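The reason short codes are so exposed can be seen from the arithmetic. Assuming the score is normalized by the longer value's length (an assumption that matches the figures quoted on this page, not a documented formula), a single-character edit in a value of length n scores:

```latex
\[
\text{score} = 100 \times \left(1 - \frac{1}{n}\right)
\qquad\Rightarrow\qquad
n = 5:\ 80\%, \quad n = 10:\ 90\%, \quad n = 20:\ 95\%
\]
```

The shorter the value, the larger the drop per edit, so two distinct five-character codes already sit at 80% while a one-character typo in a 20-character company name still scores 95%.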
The recommended two-pass workflow
Exact first clears the easy identical rows; fuzzy then works on a smaller set with fewer comparisons and a cleaner report. The combined result is more accurate than either pass alone.
    Raw: 10,000 vendor rows

    Pass 1: exact dedup (csv-deduplicator on vendor_name)
      -> 8,800 rows (1,200 byte-identical removed)

    Pass 2: Fuzzy Dedup on vendor_name, threshold 80
      -> 8,310 rows (490 near-duplicates removed)

    Report previews:
      Row .. "Acme Corp." ≈ "Acme Corp" (90%) removed
      "Acme Corp" vs "Acme Corporation" (67%): below 80, so both rows are kept
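The two passes can be sketched end to end on a tiny list. This is an illustration under stated assumptions, not the tool's implementation: the score is assumed to be 100 × (1 − distance / longer length) after trim and lowercase, a row is assumed to be removed when it scores at or above the threshold against the first matching representative, and all function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed score in percent, computed after trim + lowercase."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


def exact_dedup(values):
    """Pass 1: keep the first occurrence of each identical value."""
    seen, kept = set(), []
    for value in values:
        if value not in seen:
            seen.add(value)
            kept.append(value)
    return kept


def fuzzy_dedup(values, threshold):
    """Pass 2: greedy single pass against the representatives kept so far."""
    kept, removed = [], []
    for value in values:
        match = next((rep for rep in kept if similarity(value, rep) >= threshold), None)
        if match is None:
            kept.append(value)
        else:
            removed.append((value, match, round(similarity(value, match))))
    return kept, removed


raw = ["Acme Corp", "Acme Corp", "Acme Corp.", "Globex", "Globex", "Initech"]
survivors = exact_dedup(raw)
kept, removed = fuzzy_dedup(survivors, threshold=80)
print(survivors)  # ['Acme Corp', 'Acme Corp.', 'Globex', 'Initech']
print(kept)       # ['Acme Corp', 'Globex', 'Initech']
print(removed)    # [('Acme Corp.', 'Acme Corp', 90)]
```

Because the identical rows are gone before the fuzzy pass starts, the removal list contains only the genuine near-match.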
Whitespace and case: a tie that fuzzy wins for free
Excel treats a trailing space as a difference; fuzzy trims and lowercases first, so it collapses what Excel keeps — without any extra option.
    Input (column: name)
      name
      "Acme"
      "Acme "   (trailing space)
      "acme"

    Excel Remove Duplicates
      -> keeps 2: "Acme" and "Acme " (the trailing space makes them different; "acme" is dropped because Excel ignores case)

    Fuzzy Dedup, threshold 100
      Report: 2 removed · 1 kept (both score 100% vs "Acme")
      Output: "Acme"
Multi-column exact key beats single-column fuzzy
When the true duplicate requires several columns to match exactly (e.g. first+last+DOB), exact dedup over all three is correct. Fuzzy only scores one column, so it can't express that compound rule.
    Goal: dedup where First, Last, AND DOB all match exactly.

    Excel Remove Duplicates: tick First, Last, DOB -> exact compound key
    Fuzzy Dedup: scores ONE column only; it can't AND three columns.

    Workaround: concatenate First|Last|DOB into a key column, then run fuzzy at
    threshold 100 to mimic an exact compound match (but you've lost the speed of exact dedup).
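The concatenation workaround can be prepared before the file reaches the tool. The sketch below builds a First|Last|DOB key and shows what a threshold-100 pass on that key amounts to; the column names come from the example above, while the helper names and sample rows are hypothetical.

```python
def compound_key(row, columns=("First", "Last", "DOB"), sep="|"):
    """Join the chosen columns into one key, trimming and lowercasing each field."""
    return sep.join(str(row[c]).strip().lower() for c in columns)


def dedup_on_key(rows):
    """Keep the first row for each distinct compound key."""
    seen, kept = set(), []
    for row in rows:
        key = compound_key(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"First": "Ann", "Last": "Lee", "DOB": "1990-01-02"},
    {"First": "ann", "Last": "Lee ", "DOB": "1990-01-02"},  # same person, messy input
    {"First": "Ann", "Last": "Lee", "DOB": "1991-01-02"},   # different DOB, must survive
]
for row in dedup_on_key(rows):
    print(compound_key(row))
```

Each field is trimmed before joining because a trim applied only to the finished key would presumably leave the space inside "Lee |1990-01-02" in place, and the two rows would then no longer score 100.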
Edge cases and what actually happens
Using fuzzy on structured codes
False merge risk: SKUs, order numbers, and IDs differ by single characters that are meaningful. Fuzzy scoring sees 12345 and 12346 as ~80% similar and may merge them. Keep codes on exact dedup: Excel Remove Duplicates or the csv-deduplicator.
Expecting fuzzy to compare multiple columns
Single column only: Excel Remove Duplicates can tick many columns; Fuzzy Dedup scores exactly one Key column. For a compound rule (e.g. first AND last AND DOB), exact dedup is the right tool, or concatenate the columns into a single key first.
Fuzzy result depends on row order
Order-dependent: Fuzzy Dedup is a greedy single pass. Each row is compared to the representatives already kept, so reordering rows can change which cluster a borderline value joins. Exact dedup has no such order sensitivity. Sort intentionally before fuzzy-deduping.
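A three-row example shows the order effect. The sketch assumes the same scoring used elsewhere on this page (100 × (1 − distance / longer length) after trim and lowercase) and a first-match greedy pass; both are assumptions about the tool, and the names are made up.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed score in percent, computed after trim + lowercase."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


def fuzzy_dedup(values, threshold=85):
    """Greedy single pass: keep a row only if no kept representative matches it."""
    kept = []
    for value in values:
        if not any(similarity(value, rep) >= threshold for rep in kept):
            kept.append(value)
    return kept


# "Jon Smith" ~ "John Smith" (90) and "John Smith" ~ "John Smyth" (90),
# but "Jon Smith" vs "John Smyth" is only 80: the matches are not transitive.
print(fuzzy_dedup(["Jon Smith", "John Smith", "John Smyth"]))  # ['Jon Smith', 'John Smyth']
print(fuzzy_dedup(["John Smith", "Jon Smith", "John Smyth"]))  # ['John Smith']
```

The same three values collapse to two rows or to one depending only on which of them comes first, so the sort order decides the representative.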
Fuzzy is Pro-gated; exact is free
Pro required: Excel Remove Duplicates is built into Excel, and the exact csv-deduplicator is available on lower tiers. Fuzzy Dedup throws the error "Fuzzy Deduplicator requires Pro tier." for Free users. Budget for Pro if your workflow needs approximate matching.
Threshold too low merges distinct values
False merge: Below ~75% the false-positive rate climbs fast, especially on short strings. There is no per-pair exclusion list; the only control is the single threshold. Start high (90–95%) and lower carefully while reading the report.
Threshold too high misses real duplicates
Missed duplicates: At 95–100% only near-identical values collapse. Legal-suffix variants (Acme Corp vs Acme Corporation, ~67%) survive. If exact-after-normalize isn't enough, you must lower the threshold, because there is no synonym or abbreviation dictionary.
Multi-sheet workbook
First sheet only: Excel Remove Duplicates operates on the active selection; Fuzzy Dedup reads only the first sheet of the file you give it. Put the data you want deduped on the first tab before loading the file.
Combining exact + fuzzy in the wrong order
Inefficient: Running fuzzy first on the full dataset does more comparisons than necessary and clutters the report with trivially identical pairs. Run exact dedup first to shrink the data, then fuzzy on the survivors.
Output is a new file, not in-place
By design: Excel Remove Duplicates edits the sheet in place; Fuzzy Dedup produces a separate deduped-fuzzy.xlsx (sheet Deduped) and leaves your input untouched. That makes re-running with a different threshold safe, because your original is preserved.
You actually need to merge two files
Wrong tool: Neither Remove Duplicates nor Fuzzy Dedup joins separate files. To approximately match keys across two datasets and combine their columns, use excel-fuzzy-merger (Developer tier).
Frequently asked questions
Is fuzzy dedup slower than exact dedup?
Yes. Exact dedup is essentially a single hashing pass. Fuzzy compares each row against the accumulated cluster representatives and computes Levenshtein distance, so cost grows with the number of distinct values. It runs in the browser; on a heavily-duplicated list it stays fast, on a large mostly-unique list it does more work.
Which is better for product SKUs?
Exact dedup. SKUs are structured codes where 12345 and 12346 are different products — a near-match is a false positive. Use Excel Remove Duplicates or the exact csv-deduplicator for codes; reserve fuzzy for names and free text.
Can Excel Remove Duplicates catch 'Acme Corp' vs 'Acme Corporation'?
No. Remove Duplicates only removes values that match exactly (ignoring case), so it keeps both. Fuzzy Dedup scores them by similarity (~67%) and collapses them if your threshold is low enough. That gap is the whole reason to use fuzzy dedup.
Does Excel Remove Duplicates ignore case like Fuzzy Dedup?
Excel's Remove Duplicates treats text as case-insensitive by default, but it is still whitespace-sensitive — a trailing space makes values different. Fuzzy Dedup is both case-insensitive AND whitespace-insensitive (it lowercases and trims before scoring).
Should I run both, and in what order?
Yes — run exact dedup first (Remove Duplicates or csv-deduplicator) to clear identical rows cheaply, then Fuzzy Dedup on the survivors. This minimizes fuzzy comparisons and keeps the removal report focused on genuine near-matches.
Can Fuzzy Dedup compare multiple columns at once?
No — it scores exactly one Key column. To mimic a multi-column exact key, concatenate the columns into one (e.g. First|Last|DOB) and set the threshold to 100. For true compound exact matching, Excel Remove Duplicates ticking several columns is simpler.
Why did fuzzy give different results when I reordered my rows?
Fuzzy Dedup is a greedy single pass — each row is matched against the representatives already kept, not against every other row, so order affects which cluster a borderline value joins. Exact dedup has no order sensitivity. Sort deliberately before running fuzzy.
Does either tool merge field values from the duplicates?
No. Both keep the first occurrence's row and discard the rest. Unique data on a removed row (a phone the survivor lacks) is not merged in. Fuzzy Dedup lists removals in its report so you can reconcile manually.
What tiers do I need?
Excel Remove Duplicates is part of Excel; the exact csv-deduplicator runs on lower tiers. Fuzzy Dedup requires Pro tier (50 MB / 100,000 rows / 5 files), with more on Pro-media and Developer. Free tier cannot run Fuzzy Dedup.
Is there an undo for fuzzy dedup?
Fuzzy Dedup writes a new deduped-fuzzy.xlsx and never alters your input, so re-running with a higher threshold is the undo. Excel Remove Duplicates edits in place — use Ctrl+Z or keep a copy of the workbook before running it.
Can I set different thresholds per row or per pair?
No. There is a single global threshold (50–100) and no per-pair override or exclusion list. If a specific false merge keeps happening, the only levers are raising the threshold or pre-processing the column so the two values are more distinct.
Where does the data go?
Nowhere external. Excel Remove Duplicates runs in your desktop app; Fuzzy Dedup runs in your browser via SheetJS and downloads the result locally. Neither uploads your spreadsheet.
Privacy first
Every JAD Excel tool runs entirely in your browser using SheetJS and ExcelJS. Your spreadsheets, formulas, and data never leave your device — verified by zero outbound network requests during processing.