How to deduplicate supplier names in excel before procurement analysis
- Step 1Export the vendor master — Pull the supplier/vendor list from your ERP (SAP, Oracle, Coupa, Ariba) as
.xlsxor.csv, ensuring the vendor-name column is present. The tool reads the first sheet only. - Step 2Sort so your canonical name is first — Because the first row of each cluster survives, sort the file so the spelling you want as the master record sits at the top of each group (e.g. the full legal name, or the one already in your ERP standard).
- Step 3Type the vendor column into the Key column field — The Key column is free text — type the exact header, e.g.
supplier_nameorvendor_name. Only that column is scored; spend amounts, GL codes, and contract IDs ride along untouched. - Step 4Set a lower threshold for legal-suffix variants — Default is 85, but for vendor names try 70–80 to catch
Acme CorpvsAcme Corporation. Enter a value from 50 to 100 and expect to re-run to find the sweet spot for your data. - Step 5Process and audit the consolidation report — The panel shows
{removedCount} removed · {keptCount} keptand previews up to 5 consolidations (up to 50 in the downloadable report). Check for over-merges likeAcme Corpcolliding withAcme Incat a low threshold. - Step 6Download and apply to spend data — Download
deduped-fuzzy.xlsx(sheetDeduped, master rows with all columns). Use the consolidated list to normalize vendor names in your spend cube. If real distinct vendors merged, raise the threshold and re-run.
Vendor-name variants and how they score
Normalized Levenshtein similarity (case-/whitespace-insensitive). Legal-suffix variants score lower than you'd expect because the suffix is a big fraction of the string.
| Variant pair | Approx. similarity | Removed at 80% / 90%? |
|---|---|---|
Microsoft Corp / microsoft corp | 100% | Yes / Yes |
Microsoft Corp / Microsoft, Inc. | ~64% | No / No |
Acme Corp / Acme Corp. | ~90% | Yes / Yes |
Acme Corp / Acme Corporation | ~67% | No / No |
IBM Corp / International Business Machines | ~10% | No / No (need a synonym table) |
Acme Corp / Acme Inc | ~75% | No / No — but YES at 70% (over-merge risk) |
Threshold strategy for procurement names
Lower thresholds catch suffix variants but risk merging distinct entities that share a stem. The threshold is the only knob (50–100, default 85).
| Goal | Threshold | Risk to watch |
|---|---|---|
Collapse Corp./Corp/casing only | 88–95 | Misses Corporation spelled out |
| Catch full legal-suffix variants | 65–75 | May merge Acme Corp with Acme Inc (different entities) |
| Balanced for mixed data | 78–82 | Review the report for stem collisions |
| Codes (DUNS, tax ID) | Use exact dedup | Near-match on identifiers is always wrong |
Capacity for vendor masters
Fuzzy Dedup is Pro-gated. Large vendor masters need Pro or above.
| Tier | File / rows / files | Note |
|---|---|---|
| Free | Cannot run (Pro-gated) | Throws requires Pro tier |
| Pro | 50 MB · 100,000 rows · 5 files | Fits most vendor masters |
| Pro-media | 200 MB · 500,000 rows · 20 files | Large multi-entity masters |
| Developer | 500 MB · unlimited rows | Enterprise-scale consolidation |
Cookbook
Real procurement vendor-name patterns, the threshold that handles each, and the consolidation report. Report row numbers are 1-based and count the header row.
Casing and punctuation: collapse for free
Microsoft Corp, MICROSOFT CORP, and Microsoft Corp. differ only by case, whitespace, or a period. Lowercasing and trimming handle case/space automatically; the period costs one edit (~96%), so they all collapse at a high threshold.
Input (column: vendor_name) vendor_name,spend Microsoft Corp,12000 MICROSOFT CORP,3000 Microsoft Corp.,5000 threshold: 90 Report 2 near-duplicate row(s) removed · 1 rows kept. Row 3 "MICROSOFT CORP" ≈ "Microsoft Corp" (100%) Row 4 "Microsoft Corp." ≈ "Microsoft Corp" (96%) Output keeps "Microsoft Corp" / spend 12000 (first row). NOTE: spend is NOT summed — see edge cases.
Spelled-out legal suffix needs a low threshold
Acme Corporation vs Acme Corp scores only ~67% because oration is seven inserted characters. The default 85% leaves them split; drop to ~65% to consolidate — then double-check you didn't also merge a real different entity.
Input (column: vendor_name) vendor_name Acme Corp Acme Corporation threshold: 85 -> 0 removed (67% < 85) threshold: 65 -> Report 1 near-duplicate row(s) removed · 1 rows kept. Row 3 "Acme Corporation" ≈ "Acme Corp" (67%) Review: at 65% an unrelated "Acme Inc" (75%) would ALSO merge — confirm in the report before trusting it.
Over-merge warning: same stem, different entity
Lowering the threshold to catch suffix variants can sweep in genuinely different companies that share a stem. Acme Corp and Acme Inc (~75%) are likely the same business; Apex Ltd and Apex LLC might be two different legal entities. Always read the report.
Input (column: vendor_name) vendor_name Apex Ltd Apex LLC threshold: 70 Report 1 near-duplicate row(s) removed · 1 rows kept. Row 3 "Apex LLC" ≈ "Apex Ltd" (~75%) If Apex Ltd and Apex LLC are separate entities, this is a FALSE consolidation — raise the threshold to ~90% to keep them distinct.
Parent-child relationships are out of scope
Fuzzy Dedup matches the name string only. Frito-Lay and PepsiCo score near zero, so they never merge — correct for string matching, but it means brand-to-parent roll-up needs a separate reference table.
Input (column: vendor_name) vendor_name Frito-Lay PepsiCo threshold: 50 (most aggressive) Report: 0 removed (similarity far below 50%) To roll Frito-Lay up to PepsiCo you need a mapping table (parent ↔ child), not fuzzy matching. Consider building it separately and joining on it.
Make the full legal name the master
First-occurrence-wins means the survivor is whichever spelling is first. To keep the full legal name as your canonical master, sort it to the top of each group before processing.
Before (abbreviation first): vendor_name,duns MSFT Corp,081466849 Microsoft Corporation,081466849 Sort so the full name is first, then Fuzzy Dedup (threshold ~70): vendor_name,duns Microsoft Corporation,081466849 <- kept (canonical) MSFT Corp,081466849 <- removed The tool has no "prefer longest name" option — sorting is the control.
Edge cases and what actually happens
Spend is not summed when rows collapse
By designFuzzy Dedup removes duplicate rows and keeps the first one as-is — it does NOT aggregate. If Microsoft Corp ($12k) and Microsoft Corporation ($5k) merge, the output keeps only the $12k row; the $5k is gone from the file. To total consolidated spend, first normalize the names with this tool, then sum in a pivot via excel-pivot-generator or your BI tool.
Legal suffix variants survive at default threshold
Missed duplicatesAcme Corp vs Acme Corporation scores ~67% — below the 85% default — so they stay split. Lower the threshold to ~65% to catch spelled-out suffixes, then audit the report for collateral merges of unrelated stems.
Different entities with the same stem
False mergeA low threshold can merge Acme Corp with Acme Inc, or Apex Ltd with Apex LLC, which may be distinct legal entities. The tool only sees the string. Use a higher threshold and the report to keep genuinely separate vendors apart.
Parent-child / brand hierarchy
Out of scopeFrito-Lay and PepsiCo will never fuzzy-match — they're different strings. Vendor hierarchy (subsidiaries, brands, DBAs) requires a reference mapping table, not string similarity. Build that separately and join on it.
Free tier procurement user
Pro requiredThe processor throws Fuzzy Deduplicator requires Pro tier. Vendor masters also commonly exceed Free's 10,000-row Excel cap. Pro gives 100,000 rows / 50 MB / 5 files; Pro-media and Developer go higher for enterprise masters.
Vendor column header typed wrong
Empty matchesThe Key column is free text and must match a header exactly. A mismatch makes every key empty; all blanks score 100% and the whole master collapses to one row. Copy the header verbatim and verify the kept count is sensible.
DUNS / tax ID deduplication
Wrong approachIdentifiers like DUNS or tax IDs must match exactly — a near-match is a different vendor. Don't fuzzy-dedup them; use exact dedup (csv-deduplicator) on the identifier column, and reserve fuzzy matching for the name.
Survivor isn't your canonical spelling
Order-dependentFirst-occurrence-wins, so an abbreviation can survive over the full legal name if it appears first. Sort your canonical spelling to the top of each group before processing — there is no "prefer longest name" setting.
Multiple ERP exports to reconcile
Wrong toolTo match suppliers across two separate exports (e.g. an AP file vs a contract register) by approximate name and combine their columns, use excel-fuzzy-merger (Developer tier), not this single-file deduper.
Multi-sheet ERP export
First sheet onlyMany ERP exports include summary or metadata tabs. The tool reads only the first sheet. Move the vendor rows to the first sheet, or export them alone, before deduplicating.
Frequently asked questions
What about vendors with the same parent company?
Fuzzy Dedup works on the name string only — Frito-Lay and PepsiCo score near zero and never merge. Parent-child or brand hierarchy needs a separate reference/mapping table; this tool consolidates name spelling variants, not corporate structure.
Can I run this on 50,000 vendor rows?
Yes, on Pro tier (50 MB / 100,000 rows / 5 files). Pro-media handles 500,000 rows and Developer is unlimited. Free tier cannot run the tool. Performance depends on how many distinct names there are, since each row is compared to the kept representatives.
Does it sum spend for consolidated vendors?
No. It removes duplicate rows and keeps the first one unchanged — amounts are not aggregated. Use it to normalize the vendor names, then total spend in a pivot (excel-pivot-generator) or your BI/ERP after the names are clean.
What threshold catches 'Corp' vs 'Corporation'?
Around 65%, because Corporation adds seven characters to Corp — a large edit distance relative to the string length. The default 85% won't catch it. Lower the threshold to ~65–70% and review the report for unintended merges of different entities.
How do I keep the full legal name as the master record?
Sort your file so the preferred spelling is the first row of each group before processing — first-occurrence-wins keeps it. The tool has no "prefer longest name" or "choose canonical" option; sorting is the only lever.
Will it merge two genuinely different companies with similar names?
It can, at low thresholds — Apex Ltd and Apex LLC (~75%) could merge even if they're separate entities. The tool only sees the string. Use a higher threshold, read the consolidation report, and keep distinct entities apart manually.
Which ERP exports work?
Any that produce .xlsx or .csv — SAP, Oracle, Coupa, Ariba, and others. The tool is column-agnostic; point the Key column at whatever the export calls the vendor name (supplier_name, vendor_name, Vendor).
Is the matching case- and whitespace-sensitive?
No. Both values are lowercased and trimmed before scoring, so MICROSOFT CORP, Microsoft Corp , and microsoft corp all score 100% and collapse — regardless of threshold.
Can I deduplicate on vendor name AND tax ID together?
Not directly — one Key column only. Concatenate name and tax ID into a single column (e.g. Microsoft Corp|GB123456789) and dedup on that; or exact-dedup the tax ID separately with csv-deduplicator since IDs should match exactly.
What does the output contain?
deduped-fuzzy.xlsx, one sheet named Deduped, with the kept (master) rows and all their original columns — spend, GL codes, contract IDs — preserved. Removed variant rows are listed in the report, not in the file.
Is my vendor master uploaded anywhere?
No. All reading, scoring, and writing happen in your browser via SheetJS. Vendor names, spend, and IDs stay on your machine; the clean .xlsx downloads locally.
Can I undo an over-aggressive consolidation?
Yes — your input file is never modified. Re-process the original with a higher threshold to keep more vendors distinct, and compare the two consolidation reports before applying either to your spend data.
Privacy first
Every JAD Excel tool runs entirely in your browser using SheetJS and ExcelJS. Your spreadsheets, formulas, and data never leave your device — verified by zero outbound network requests during processing.