How to find duplicate files inside an archive
- Step 1: Open the Duplicate File Detector. Go to redundancy-analyzer. It is a Pro-tier tool, so sign in on a Pro plan or higher. Free accounts cannot run it (the archive family Free cap is 50 MB / 500 entries / 1 file, and this tool's minimum tier is Pro).
- Step 2: Drop in a single archive. Drag one archive onto the drop zone. This tool reads ONE archive at a time: it is not a batch tool and does not accept folders. Supported inputs are detected by magic bytes: ZIP, GZIP, TAR, tar.gz, 7z, RAR, bz2, xz, tar.bz2, tar.xz and ISO.
- Step 3: Set the Top-N groups slider. The only control is a range slider labelled Top-N groups (pairLimit), from 10 to 500 in steps of 10, default 100. It caps how many duplicate groups the report returns; the groups with the most wasted bytes are kept. Leave it at 100 for most archives; raise it if you suspect many small duplicate sets.
- Step 4: Run the analysis. The tool extracts every entry, computes a SHA-256 digest of each, and builds a map from digest to file list. Groups with two or more members are duplicates. Directory entries (paths ending in /) are skipped and never hashed.
- Step 5: Read the JSON report. Output is JSON with totalEntries, duplicateGroups, totalWastedBytes, totalWastedHuman, and a groups array. Each group has its hash, count, perFileSize, wastedBytes, and a files list of {name, size}. The summary metrics panel shows Duplicate groups and Wasted.
- Step 6: Act on the findings with a sibling tool. This tool only reports; it never edits your archive. To actually drop the redundant copies, use selective-extractor to pull only the files you want and re-zip with folder-to-zip, or compare two builds with archive-diff.
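The analysis in Steps 4 and 5 can be reproduced offline as a rough cross-check. The sketch below is not the tool's code: it uses Python's zipfile and hashlib as stand-ins for fflate and the browser's SHA-256, handles ZIP only, and the function name find_duplicates is hypothetical. It assumes totalEntries counts files only and that the totals cover only the returned groups; totalWastedHuman is left out.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def find_duplicates(zip_bytes: bytes, pair_limit: int = 100) -> dict:
    """Group ZIP entries by SHA-256 of their content (hypothetical helper)."""
    by_digest = defaultdict(list)
    total_entries = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.filename.endswith("/"):
                continue  # directory marker: skipped and never hashed
            total_entries += 1  # assumption: only real files are counted
            data = archive.read(info)
            digest = hashlib.sha256(data).hexdigest()
            by_digest[digest].append({"name": info.filename, "size": len(data)})

    groups = []
    for digest, files in by_digest.items():
        if len(files) < 2:
            continue  # a digest seen once is not a duplicate
        per_file = files[0]["size"]
        groups.append({
            "hash": digest,
            "count": len(files),
            "perFileSize": per_file,
            "wastedBytes": per_file * (len(files) - 1),
            "files": files,
        })

    # Keep the highest-waste groups, then cut the list at pair_limit.
    groups.sort(key=lambda g: g["wastedBytes"], reverse=True)
    groups = groups[:pair_limit]
    return {
        "totalEntries": total_entries,
        "duplicateGroups": len(groups),
        # assumption: totals are summed over the returned groups only
        "totalWastedBytes": sum(g["wastedBytes"] for g in groups),
        "groups": groups,
    }
```

Calling find_duplicates(open("site-export.zip", "rb").read()) on a local copy should give group counts comparable to the tool's report, as long as the assumptions above hold.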
What you can drop in
Formats the analyzer can read, the engine that handles each, and how it is detected. All reading is browser-side; libarchive formats are read-only.
| Input format | Engine | Detected by | Notes |
|---|---|---|---|
| ZIP (.zip) | fflate (or zip.js if encrypted) | Magic 50 4B (PK) | Directory entries (trailing /) are skipped and never hashed |
| GZIP (.gz) | fflate | Magic 1F 8B | Single-member stream; yields exactly one inner file, so duplicates need a multi-file container |
| TAR (.tar) | fflate (tar parser) | Header offset check | Plain uncompressed tar; many entries, ideal for dedup analysis |
| tar.gz / tar.bz2 / tar.xz | fflate (gz) / libarchive (bz2, xz) | Outer compression magic | Decompresses then walks the inner tar's entries |
| 7z (.7z) | libarchive WASM | Magic 37 7A BC AF 27 1C | Read-only; the analyzer never writes 7z, it only inspects |
| RAR (.rar) | libarchive WASM | RAR signature | Read-only inspection; encrypted RAR cannot be analyzed (no password input) |
| bz2 / xz / ISO | libarchive WASM | Magic bytes | ISO walks the disc image's files; bz2/xz are single-stream like gz |
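Detection by magic bytes, as listed in the table, can be sketched as a lookup over well-known signatures. The function name detect_format and the order of checks are assumptions for illustration; the signatures are the published ones for each format, and the tool's own detection code may differ.

```python
def detect_format(data: bytes) -> str:
    """Guess an archive format from its leading bytes (illustrative sketch)."""
    signatures = [
        # (offset, magic bytes, format name)
        (0, b"\x50\x4b", "zip"),                    # "PK"
        (0, b"\x1f\x8b", "gzip"),
        (0, b"\x37\x7a\xbc\xaf\x27\x1c", "7z"),
        (0, b"\x52\x61\x72\x21\x1a\x07", "rar"),    # "Rar!" 1A 07
        (0, b"\x42\x5a\x68", "bz2"),                # "BZh"
        (0, b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
        (257, b"ustar", "tar"),                     # tar header offset check
        (32769, b"CD001", "iso"),                   # ISO 9660 volume descriptor
    ]
    for offset, magic, name in signatures:
        if data[offset:offset + len(magic)] == magic:
            return name
    return "unknown"
```

A tar.gz is reported as gzip here because only the outer compression layer is visible until the stream is decompressed, which matches the "Outer compression magic" row.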
The Top-N groups slider (the only option)
The analyzer exposes exactly one control. Values shown are the real min/max/step/default from the option schema.
| Property | Value | Effect |
|---|---|---|
| UI control | Range slider, label "Top-N groups" | Drag to set how many groups are returned; a number badge shows the current value |
| Schema name | pairLimit | Maps to opts.pairLimit in the processor |
| Minimum | 10 | Smallest report — only the 10 biggest-waste groups |
| Maximum | 500 | Largest report this tool will return in one pass |
| Step | 10 | Slider snaps in increments of 10 |
| Default | 100 | Used when you do not touch the slider |
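If you drive the analyzer from a script or want to reason about valid values, the slider's range can be modelled as clamp-then-snap. This is a sketch using the min/max/step/default from the table; normalize_pair_limit is a hypothetical helper, and how the real UI treats out-of-range input is an assumption.

```python
PAIR_LIMIT_MIN = 10
PAIR_LIMIT_MAX = 500
PAIR_LIMIT_STEP = 10
PAIR_LIMIT_DEFAULT = 100


def normalize_pair_limit(value=None) -> int:
    """Clamp a requested Top-N value to 10..500 and snap it to a step of 10."""
    if value is None:
        return PAIR_LIMIT_DEFAULT  # slider untouched
    clamped = max(PAIR_LIMIT_MIN, min(PAIR_LIMIT_MAX, int(value)))
    # Integer round-half-up to the nearest step, measured from the minimum.
    steps = (clamped - PAIR_LIMIT_MIN + PAIR_LIMIT_STEP // 2) // PAIR_LIMIT_STEP
    return PAIR_LIMIT_MIN + steps * PAIR_LIMIT_STEP
```

For example, a requested value of 47 snaps to 50, and anything above 500 is held at 500.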
Tier limits for the archive family
Per-archive caps from lib/tier-limits.ts. Note the entry-count cap, not just file size. This tool's minimum tier is Pro, so Free cannot run it.
| Tier | Max archive size | Max entries | Files per run | Can run this tool? |
|---|---|---|---|---|
| Free | 50 MB | 500 | 1 | No — tool requires Pro |
| Pro | 500 MB | 50,000 | 20 | Yes (1 archive at a time here) |
| Pro-media | 2 GB | 500,000 | 100 | Yes |
| Developer | 2 GB | 500,000 | Unlimited | Yes |
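The table can be read as a pre-flight check that runs before any extraction. The sketch below encodes the caps from the table; the dictionary layout, the check_archive name, and the use of binary units (1 MB = 1024 × 1024 bytes) are assumptions, not the contents of lib/tier-limits.ts.

```python
MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB

# Caps copied from the tier table; keys and field names are hypothetical.
TIER_LIMITS = {
    "free": {"max_bytes": 50 * MB, "max_entries": 500, "can_run": False},
    "pro": {"max_bytes": 500 * MB, "max_entries": 50_000, "can_run": True},
    "pro-media": {"max_bytes": 2 * GB, "max_entries": 500_000, "can_run": True},
    "developer": {"max_bytes": 2 * GB, "max_entries": 500_000, "can_run": True},
}


def check_archive(tier: str, size_bytes: int, entry_count: int) -> str:
    """Return "ok", or the first reason the archive would be refused."""
    limits = TIER_LIMITS[tier]
    if not limits["can_run"]:
        return "tier"      # tool requires Pro or higher
    if size_bytes > limits["max_bytes"]:
        return "size"      # archive larger than the tier allows
    if entry_count > limits["max_entries"]:
        return "entries"   # too many entries, even if the file is small
    return "ok"
```

Note that the entry-count check is independent of size: a small archive with 60,000 entries is still refused on Pro.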
Cookbook
Real-world reports from typical archives. Output is trimmed JSON; sizes are illustrative but the shape and fields are exactly what the tool returns.
Vendored logo duplicated across theme folders
A site export ZIP shipped the same 240 KB logo into three theme directories. Names differ, bytes are identical, so all three share one SHA-256 and form a single group with two wasted copies.
Input: site-export.zip (3,140 entries)
Report (excerpt):
{
"totalEntries": 3140,
"duplicateGroups": 1,
"totalWastedBytes": 491520,
"totalWastedHuman": "480.0 KB",
"groups": [
{
"hash": "9f2c...",
"count": 3,
"perFileSize": 245760,
"wastedBytes": 491520,
"files": [
{"name":"themes/aurora/logo.png","size":245760},
{"name":"themes/dusk/assets/logo.png","size":245760},
{"name":"themes/noir/img/logo.png","size":245760}
]
}
]
}
.DS_Store noise across a Mac-zipped project
macOS Finder scatters a .DS_Store into many folders. They are not all identical (each can differ), but the empty ones collapse into one group with zero wasted bytes — useful to confirm before pruning.
Input: project-mac.zip
Report (excerpt):
{
"duplicateGroups": 2,
"totalWastedBytes": 0,
"totalWastedHuman": "0 B",
"groups": [
{
"hash": "e3b0c442...", // SHA-256 of zero bytes
"count": 12,
"perFileSize": 0,
"wastedBytes": 0,
"files": [ /* 12 empty placeholder files */ ]
}
]
}
Note: empty files all share the SHA-256 of the empty string,
so they group together but cost no space (wastedBytes 0).
Tightening a noisy report with the slider
An archive with thousands of tiny duplicate sets returns a huge report at the default 100. Drop the slider to 10 to focus only on the biggest-waste groups for a quick cleanup decision.
Same archive, two runs:
pairLimit = 100 (default):
  duplicateGroups: 100 (capped)
  totalWastedHuman: "31.4 MB"
pairLimit = 10:
  duplicateGroups: 10 (capped)
  totalWastedHuman: "27.9 MB"
The top 10 groups already account for ~89% of the waste, so the slider lets you ignore the long tail of trivial duplicates.
Backup snapshot overlap
A tar.gz holding two daily snapshots of the same tree is mostly redundant. The analyzer surfaces unchanged files as duplicate pairs, up to the Top-N cap, showing how little actually changed between snapshots.
Input: backups-2026-06.tar.gz (two snapshot trees)
Report summary:
{
"totalEntries": 18402,
"duplicateGroups": 100, // capped at pairLimit
"totalWastedBytes": 612843776,
"totalWastedHuman": "584.4 MB"
}
Most entries are byte-identical across the two days —
strong case for incremental backups instead of full snapshots.
Confirming a clean release archive
Before publishing, run the analyzer on the release ZIP. A clean build should report zero duplicate groups — if it does not, something got vendored twice.
Input: release-v2.4.0.zip
Report:
{
"totalEntries": 842,
"duplicateGroups": 0,
"totalWastedBytes": 0,
"totalWastedHuman": "0 B",
"groups": []
}
Empty groups array = no byte-identical files. Ship it.
Edge cases and what actually happens
Encrypted ZIP entries
Rejected: The analyzer extracts with no password (it calls the extractor without one), so an archive with any encrypted entry throws "Archive contains encrypted entries... Provide a password to extract." There is no password input on this tool. Decrypt first with multi-format-extractor (which accepts a password) and analyze the plain output.
Two files, same name, different bytes
By design: These are NOT grouped. Grouping is purely by SHA-256 of content, so same-name-different-content files have different digests and stay separate. Only byte-identical files ever share a group.
Empty files all match
Expected: Every zero-byte file produces the same SHA-256 (the digest of the empty string), so all empty files in the archive collapse into one group with perFileSize: 0 and wastedBytes: 0. They are duplicates by definition but cost no space.
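The shared digest is easy to confirm yourself. This short check uses Python's hashlib rather than the browser API, but SHA-256 gives the same result on any platform; the variable names are illustrative.

```python
import hashlib

# SHA-256 of zero bytes: the digest every empty file in an archive shares.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()
print(EMPTY_SHA256)
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

# Twelve empty files form one group, but the waste formula yields zero.
count, per_file_size = 12, 0
wasted_bytes = per_file_size * (count - 1)
print(wasted_bytes)  # 0
```

The leading characters match the truncated "e3b0c442..." hash that appears in reports for empty-file groups.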
Single-stream archive (gz/bz2/xz)
By design: A bare .gz, .bz2 or .xz decompresses to exactly one inner file, so there is nothing to compare and the report shows zero duplicate groups. Use a multi-file container (zip, tar, 7z) to find duplicates.
More groups than the slider allows
Truncated: If the archive has more duplicate groups than pairLimit, only the highest-waste groups are returned (sorted by wasted bytes descending, then sliced). duplicateGroups reflects the returned count, not the true total. Raise the slider toward 500 if you need more.
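To see what truncation hides, it helps to compare the returned groups against the full list. The sketch below assumes you already have a wasted-bytes figure for every group (the tool itself does not report the true total); top_n_coverage and its output keys are hypothetical.

```python
from typing import List


def top_n_coverage(wasted_per_group: List[int], pair_limit: int) -> dict:
    """Sort by wasted bytes descending, slice, and measure what was kept."""
    ranked = sorted(wasted_per_group, reverse=True)
    kept = ranked[:pair_limit]
    true_total = sum(ranked)
    return {
        "trueGroups": len(ranked),       # what the archive really contains
        "returnedGroups": len(kept),     # what duplicateGroups would show
        "returnedWasted": sum(kept),
        # Share of all waste covered by the returned groups.
        "coverage": sum(kept) / true_total if true_total else 1.0,
    }
```

When coverage is already close to 1.0 at a small limit, raising the slider adds only low-value groups to the report.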
Directory entries in the archive
Skipped: Entries whose paths end in / (folder markers) are not hashed and never appear in any group. Only real files are compared.
Archive over the tier cap
Rejected: Pro allows 500 MB and 50,000 entries per archive; Pro-media and Developer allow 2 GB and 500,000 entries. An archive past your tier's size or entry cap is rejected before analysis. Split it with archive-splitter or upgrade.
Corrupt or unrecognized archive
Failed: If the bytes do not match any known signature and a last-resort ZIP read fails, you get "Could not detect or extract archive format." Verify the file with archive-integrity-tester first; repair a damaged ZIP with corrupted-zip-repair.
Free-tier account
Blocked: This tool's minimum tier is Pro. Free accounts cannot run it at all, regardless of archive size. Upgrade to Pro to use the Duplicate File Detector.
Very large entry counts
Supported: Hashing is done one entry at a time in the browser; large archives are CPU-bound, so a 50,000-entry archive takes noticeably longer than a small one. It still completes, since there is no per-entry timeout; just give the tab time and avoid backgrounding it.
Frequently asked questions
Does the tool match by filename or by content?
By content only. It computes a SHA-256 digest of each entry's exact bytes and groups files whose digests are equal. Filenames, paths, and timestamps are ignored when deciding duplicates.
What hash does it use?
SHA-256, via the browser's built-in crypto.subtle.digest. No SHA-256 collision is known and finding one is computationally infeasible, so identical digests can be treated as identical bytes.
Is anything uploaded?
No. Extraction and hashing run entirely in your browser using fflate, zip.js, and libarchive WASM. Your archive never leaves your machine.
What formats can I analyze?
ZIP, GZIP, TAR, tar.gz, tar.bz2, tar.xz, 7z, RAR, bz2, xz, and ISO. ZIP/GZIP/TAR use fflate; 7z/RAR/bz2/xz/ISO use libarchive WASM (read-only).
Can it remove the duplicates for me?
No — it only reports. To delete redundant copies, extract just the files you want with selective-extractor and re-zip, or compare builds with archive-diff.
What does wastedBytes mean?
For a group, wastedBytes = perFileSize x (count - 1) — the space you would recover by keeping one copy and removing the rest. The report sums these into totalWastedBytes / totalWastedHuman.
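The formula doubles as a consistency check on a saved report. The sketch below walks a parsed report and flags any group whose wastedBytes does not equal perFileSize × (count − 1); check_report is a hypothetical helper, and comparing the total against the sum of listed groups assumes the report is complete rather than an excerpt.

```python
def check_report(report: dict) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    groups = report.get("groups", [])
    for group in groups:
        expected = group["perFileSize"] * (group["count"] - 1)
        if group["wastedBytes"] != expected:
            problems.append(
                f"{group['hash']}: wastedBytes {group['wastedBytes']} != {expected}"
            )
    # Assumption: totalWastedBytes is the sum over the groups in the report.
    total = sum(g["wastedBytes"] for g in groups)
    if report.get("totalWastedBytes") != total:
        problems.append(f"totalWastedBytes {report.get('totalWastedBytes')} != {total}")
    return problems


# The logo example: three 245,760-byte copies, so two copies are waste.
logo_report = {
    "totalWastedBytes": 491520,
    "groups": [
        {"hash": "9f2c...", "count": 3, "perFileSize": 245760, "wastedBytes": 491520}
    ],
}
print(check_report(logo_report))  # []
```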
Why are my empty files all in one group?
Every zero-byte file has the same SHA-256, so they all match. The group shows perFileSize: 0 and wastedBytes: 0 — they are duplicates but free.
What is the Top-N groups slider?
It caps how many duplicate groups the report returns (10 to 500, default 100, in steps of 10). The highest-waste groups are kept; the rest are dropped.
Can I analyze an encrypted ZIP?
No. This tool runs the extractor without a password, so encrypted entries cause an error. Decrypt first with multi-format-extractor (which has a password field), then analyze the result.
How big an archive can I use?
Pro: up to 500 MB and 50,000 entries. Pro-media and Developer: up to 2 GB and 500,000 entries. The tool requires at least Pro.
Does it process folders or multiple archives at once?
No. It reads one archive per run and does not accept folders or batches. For batch extraction see batch-extraction-manager.
Where do I act on the results?
Pair it with selective-extractor (to keep only wanted files), folder-to-zip (to re-pack), archive-diff (to compare two archives), or archive-size-analyzer (to see size by path).
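Before handing off to one of those tools, you can turn a saved report into a list of paths to leave out. This is a sketch, not part of any JAD tool: redundant_paths is a hypothetical helper, and keeping the alphabetically first path in each group is an arbitrary choice you may want to change.

```python
import json


def redundant_paths(report_json: str) -> list:
    """List every path except one keeper per duplicate group."""
    report = json.loads(report_json)
    extra = []
    for group in report["groups"]:
        names = sorted(f["name"] for f in group["files"])
        extra.extend(names[1:])  # keep names[0], flag the remaining copies
    return extra


sample = json.dumps({
    "groups": [{
        "hash": "9f2c...",
        "files": [
            {"name": "themes/aurora/logo.png", "size": 245760},
            {"name": "themes/dusk/assets/logo.png", "size": 245760},
            {"name": "themes/noir/img/logo.png", "size": 245760},
        ],
    }]
})
print(redundant_paths(sample))
# ['themes/dusk/assets/logo.png', 'themes/noir/img/logo.png']
```

Everything not in the returned list is what you would select in selective-extractor before re-packing.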
Privacy first
Every JAD Archive tool runs entirely in your browser using fflate, @zip.js/zip.js, and the libarchive WASM bridge. Your archives never leave your device — verified by zero outbound network requests during processing.