How to use the duplicate file detector in developer workflows
- Step 1. Grab the artifact: Take the build output you want to inspect, such as a dist.zip, a Docker layer layer.tar.gz, a release.7z, or a backup.tar. The analyzer reads one archive per run; it does not accept folders or batches.
- Step 2. Open the analyzer: Go to redundancy-analyzer (Pro tier) and drop the artifact. Format is detected by magic bytes, so you do not pick a type.
- Step 3. Tune Top-N for the artifact's noise: The default of 100 groups suits most bundles. For a sprawling node_modules with hundreds of tiny duplicate sets, drop the slider to 10–20 to see only the heaviest offenders; raise it toward 500 for a full inventory.
- Step 4. Read the ranked JSON: The top groups are your biggest wins. Each shows the SHA-256, copy count, per-file size, wasted bytes, and every path, which is enough to decide what to single-source.
- Step 5. Act, because the tool only reports: It never edits the archive. Re-pack keeping only one copy with selective-extractor + folder-to-zip, or fix the build config (e.g. hoist the dependency) so the duplication does not recur.
- Step 6. Track it across builds: There is no API for this browser tool, so for build-over-build tracking, run it manually on each artifact, or compare two artifacts directly with archive-diff to see what changed.
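Step 6 can be partly scripted once each report has been downloaded by hand. The following is a minimal Python sketch, assuming each report is saved as JSON carrying the top-level fields documented in this guide (`totalWastedBytes`, `duplicateGroups`). The function names and the sample numbers are hypothetical, not part of the tool.

```python
import json


def load_report(path: str) -> dict:
    """Load a report that was downloaded manually from the browser tool."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def waste_delta(old_report: dict, new_report: dict) -> dict:
    """Compare two reports using their documented top-level fields."""
    return {
        "wastedBytesDelta": new_report["totalWastedBytes"] - old_report["totalWastedBytes"],
        "groupsDelta": new_report["duplicateGroups"] - old_report["duplicateGroups"],
    }


# Inline sample values (illustrative only, not real tool output)
old = {"totalWastedBytes": 64172851, "duplicateGroups": 100}
new = {"totalWastedBytes": 2621440, "duplicateGroups": 4}
print(waste_delta(old, new))
```

Because the totals cover only the returned groups, compare reports produced with the same Top-N setting; otherwise the delta mixes a real change with a change in the cap.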
Where developer bloat hides — and what the analyzer shows
Typical duplication patterns in build outputs and the follow-up tool to fix each.
| Pattern | Why it happens | What the report shows | Fix with |
|---|---|---|---|
| Same dep in many node_modules trees | No hoisting / nested installs | One group, high count, big wasted bytes | Hoist; re-pack via selective-extractor |
| Shared asset copied per route | Build copies instead of referencing | Group across route folders | Reference once; folder-to-zip |
| Repeated LICENSE / README | One per package | Many small groups, low waste each | Usually leave; informational |
| Two snapshots in one backup | Full instead of incremental | Most files grouped as pairs | Switch to incremental backups |
| Duplicate fonts / icons | Multiple themes vendor same files | Grouped by content across themes | Single-source the asset |
Output JSON shape (for scripting)
Exact top-level and per-group fields the tool emits, so you can parse the report in a script after manual download.
| Field | Level | Meaning |
|---|---|---|
| totalEntries | top | Count of files hashed (directories excluded) |
| duplicateGroups | top | Number of groups returned (capped by Top-N) |
| totalWastedBytes / totalWastedHuman | top | Recoverable space across returned groups |
| hash | group | SHA-256 of the group's identical content |
| count | group | How many copies are in this group |
| perFileSize | group | Size of one copy in bytes |
| wastedBytes | group | perFileSize x (count - 1) |
| files[] | group | Array of {name, size} for each copy |
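A script that consumes the report can sanity-check each group against the relationships in the table above. This is a small Python sketch, assuming a group is a dict with exactly the documented fields; `check_group` is a hypothetical helper, and the sample values are taken from the cookbook example in this guide.

```python
def check_group(group: dict) -> bool:
    """True if wastedBytes equals perFileSize x (count - 1)
    and files[] lists one entry per copy."""
    expected_waste = group["perFileSize"] * (group["count"] - 1)
    return (
        group["wastedBytes"] == expected_waste
        and len(group["files"]) == group["count"]
    )


sample_group = {
    "hash": "a3f1...",
    "count": 3,
    "perFileSize": 1310720,
    "wastedBytes": 2621440,
    "files": [
        {"name": "apps/web/node_modules/lib/index.js", "size": 1310720},
        {"name": "apps/admin/node_modules/lib/index.js", "size": 1310720},
        {"name": "packages/ui/node_modules/lib/index.js", "size": 1310720},
    ],
}
print(check_group(sample_group))
```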
Capabilities vs constraints
What it does and what it does not, so you plan the workflow correctly.
| Capability | Supported? | Detail |
|---|---|---|
| Read inside archive, no extract to disk | Yes | fflate / zip.js / libarchive WASM in browser |
| Remove duplicates | No | Report-only; re-pack with sibling tools |
| CLI / API / CI step | No | Browser-only tool, no server-side path |
| Batch / folder input | No | One archive per run |
| Encrypted archives | No | No password input |
| Compare two artifacts | No | Use archive-diff for that |
Cookbook
Developer-shaped runs against real artifact shapes. JSON is trimmed; field names are exact.
Vendored dependency across node_modules
A monorepo bundle installed the same package in three workspaces. The analyzer groups every identical file in that package, exposing the duplication a hoist would remove.
Input: dist.zip (monorepo bundle)
Report top group:
{
  "hash": "a3f1...",
  "count": 3,
  "perFileSize": 1310720,
  "wastedBytes": 2621440,
  "files": [
    {"name": "apps/web/node_modules/lib/index.js", "size": 1310720},
    {"name": "apps/admin/node_modules/lib/index.js", "size": 1310720},
    {"name": "packages/ui/node_modules/lib/index.js", "size": 1310720}
  ]
}
Fix: hoist lib to the root, drop the nested copies.
Shared image copied into every route
The build copied the hero image into each page's asset folder. Identical bytes, different paths, one group.
Input: site-dist.tar.gz
Report (excerpt):
hash: "7c20..." count: 11 perFileSize: 512000
wastedBytes: 5120000 ("4.9 MB")
files: pages/*/assets/hero.jpg (11 copies)
Fix: reference one /assets/hero.jpg instead of copying.
Focusing on the heaviest offenders
A node_modules-heavy artifact returns hundreds of tiny duplicate sets. Drop the slider to surface only the big wins.
pairLimit = 100: duplicateGroups 100, totalWastedHuman "61.2 MB"
pairLimit = 15: duplicateGroups 15, totalWastedHuman "54.8 MB"
The top 15 groups hold ~90% of the waste, so fix those first.
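The share of waste held by the top groups can also be computed from a downloaded report. This is a Python sketch, assuming the groups are available as a list of dicts with the documented `wastedBytes` field; `waste_share` and the sample numbers are hypothetical.

```python
def waste_share(groups: list[dict], top_k: int) -> float:
    """Fraction of total wasted bytes held by the top_k highest-waste groups."""
    ranked = sorted(groups, key=lambda g: g["wastedBytes"], reverse=True)
    total = sum(g["wastedBytes"] for g in ranked)
    if total == 0:
        return 0.0  # no duplicates, so there is nothing to rank
    return sum(g["wastedBytes"] for g in ranked[:top_k]) / total


# Illustrative groups only; a real report would carry the full group objects
sample_groups = [{"wastedBytes": n} for n in (50, 500, 100, 300, 50)]
print(waste_share(sample_groups, 2))
```

A share that is already high at a small `top_k` suggests the long tail can be left alone.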
Backup snapshot redundancy
A nightly full backup tar holds yesterday's tree too. Nearly every file groups as a pair, quantifying the case for incremental backups.
Input: nightly-2026-06-12.tar
Report summary:
totalEntries: 24,118
duplicateGroups: 100 (capped)
totalWastedHuman: "1.41 GB"
Most files unchanged since yesterday -> go incremental.
Confirming a clean release after a fix
After hoisting the duplicated dependency, re-run on the new artifact to confirm the waste is gone.
Before fix: dist.zip -> totalWastedHuman "61.2 MB"
After hoist + re-pack: dist.zip -> duplicateGroups 0, totalWastedHuman "0 B"
Regression check passed: no byte-identical duplicates remain.
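If the before and after reports are both downloaded, the pass/fail decision can be expressed as a small gate. This is a Python sketch under the assumption that the report exposes `totalWastedBytes` as documented; `is_clean`, its budget parameter, and the sample values are hypothetical.

```python
def is_clean(report: dict, max_wasted_bytes: int = 0) -> bool:
    """True if the report's total waste is within the allowed budget."""
    return report["totalWastedBytes"] <= max_wasted_bytes


# Illustrative values only, not real tool output
before = {"duplicateGroups": 100, "totalWastedBytes": 64172851}
after = {"duplicateGroups": 0, "totalWastedBytes": 0}
print(is_clean(before), is_clean(after))
```

With a non-zero budget, keep in mind that a capped report totals only the returned groups, so the real waste may be higher than the figure being tested.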
Edge cases and what actually happens
Expecting a CLI or API call
Not available: Archive tools are browser-only with no server-side path, so there is no endpoint or binary to call from CI. Use the tool interactively, then script against the downloaded JSON if you need automation around it.
Same filename, different content
By design: dist/a/chunk.js and dist/b/chunk.js are only grouped if byte-identical. Different content means a different SHA-256, so they stay separate and you will not accidentally treat a changed file as a duplicate.
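The grouping rule can be illustrated with a short Python sketch. It is not the tool's code: it is a standard-library stand-in that handles ZIP only, and `group_by_content` is a hypothetical name. It shows why same-named files with different bytes stay apart while byte-identical copies are grouped.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def group_by_content(zip_bytes: bytes) -> dict[str, list[str]]:
    """Map SHA-256 hex digest -> paths, keeping only digests seen more than once."""
    by_hash = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directories are not hashed
            digest = hashlib.sha256(zf.read(info)).hexdigest()
            by_hash[digest].append(info.filename)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}


# Build a tiny in-memory ZIP to exercise the rule
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("dist/a/chunk.js", "console.log('a');")
    zf.writestr("dist/b/chunk.js", "console.log('b');")  # same name, different bytes
    zf.writestr("dist/a/logo.svg", "<svg/>")
    zf.writestr("dist/b/logo.svg", "<svg/>")  # byte-identical copy

groups = group_by_content(buf.getvalue())
print(list(groups.values()))
```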
Single-stream gzip artifact
By design: A bare .gz (one inner file) cannot have duplicates and returns zero groups. Inspect a multi-file container (zip, tar, 7z) to find duplication.
node_modules with thousands of tiny dupes
Truncated: If duplicate groups exceed the Top-N value, only the highest-waste ones are returned. Lower the slider to focus, or raise it to 500 for a fuller inventory; the long tail of trivial sets is usually safe to ignore.
Repeated LICENSE/README files
Expected: These form groups with high copy counts but little waste each. They are usually informational and too small to be worth removing. Focus on the top groups by wasted bytes.
Encrypted artifact
Rejected: There is no password input, so encrypted entries error out. Decrypt with multi-format-extractor first, then analyze the plaintext artifact.
Artifact over the tier cap
Rejected: Pro allows 500 MB / 50,000 entries; higher tiers allow 2 GB / 500,000 entries. A huge build artifact may need splitting with archive-splitter or a higher tier.
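A local pre-flight check can catch an oversized artifact before you open the browser. This is a Python sketch with several assumptions: "MB" is treated as 1024 × 1024 bytes, the caps are treated as inclusive, and entries are counted with directories included, none of which the tool's documentation confirms. The helper names are hypothetical and the entry counter handles ZIP only.

```python
import io
import zipfile

# Pro-tier caps quoted in this guide. Treating "MB" as 1024 * 1024 bytes is an assumption.
PRO_MAX_BYTES = 500 * 1024 * 1024
PRO_MAX_ENTRIES = 50_000


def fits_pro_tier(size_bytes: int, entry_count: int) -> bool:
    """True if both the archive size and its entry count are within the Pro caps."""
    return size_bytes <= PRO_MAX_BYTES and entry_count <= PRO_MAX_ENTRIES


def zip_entry_count(zip_bytes: bytes) -> int:
    """Count entries in a ZIP held in memory (directories included)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return len(zf.infolist())


# Tiny in-memory ZIP standing in for a real artifact
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("index.js", "export {};")
data = buf.getvalue()
print(fits_pro_tier(len(data), zip_entry_count(data)))
```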
Wanting the tool to delete the dupes
By design: It only reports. Re-pack keeping one copy via selective-extractor and folder-to-zip, or fix the build config so the duplication never enters the artifact.
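Before re-packing, the downloaded report can be turned into a list of candidate paths to leave out. This is a Python sketch, assuming each group carries the documented `files[]` array of `{name, size}`. `redundant_paths` is a hypothetical helper, and keeping the first listed copy is an arbitrary choice: which copy should survive depends on what the build actually references.

```python
def redundant_paths(groups: list[dict]) -> list[str]:
    """For each group, keep the first listed path and return the remaining ones."""
    drop = []
    for group in groups:
        names = [f["name"] for f in group["files"]]
        drop.extend(names[1:])  # names[0] is the copy that stays
    return drop


# Group taken from the cookbook example in this guide
sample_groups = [
    {
        "hash": "a3f1...",
        "files": [
            {"name": "apps/web/node_modules/lib/index.js", "size": 1310720},
            {"name": "apps/admin/node_modules/lib/index.js", "size": 1310720},
            {"name": "packages/ui/node_modules/lib/index.js", "size": 1310720},
        ],
    }
]
print(redundant_paths(sample_groups))
```

Treat the result as a review list rather than a delete list, since removing a path without updating whatever imports it will break the artifact.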
Comparing two build artifacts
Out of scope: This analyzes one archive. To see what changed between two builds, use archive-diff.
Free-tier developer account
Blocked: The tool requires Pro, so free accounts cannot run it. Upgrade to use it on build artifacts.
Frequently asked questions
Can I call this from CI or a script?
No. Archive tools run only in the browser with no server-side path, so there is no API or CLI. Run it interactively and parse the downloaded JSON in your own script if needed.
Will it group same-named files that differ?
No. Grouping is by SHA-256 of content. Files with the same name but different bytes have different digests and are not grouped.
Does it delete or fix the duplicates?
No, it reports only. Use selective-extractor plus folder-to-zip to re-pack a single copy, or fix the build config (e.g. hoist the dependency).
How do I read the output programmatically?
Download the JSON. It has totalEntries, duplicateGroups, totalWastedBytes, totalWastedHuman, and a groups array of {hash, count, perFileSize, wastedBytes, files[]}.
What artifacts can it read?
ZIP, GZIP, TAR, tar.gz, tar.bz2, tar.xz, 7z, RAR, bz2, xz, and ISO — covering deploy bundles, Docker layers, and release packages.
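The how-to steps note that the format is detected by magic bytes rather than by file extension. The tool's own signature table is not documented here, so the following Python sketch uses the widely published signatures for these formats as an illustration; `sniff_format` is a hypothetical name.

```python
def sniff_format(data: bytes) -> str:
    """Guess a container format from well-known signatures at fixed offsets."""
    signatures = [
        (0, b"PK\x03\x04", "zip"),        # local file header (an empty ZIP starts differently)
        (0, b"\x1f\x8b", "gzip"),
        (0, b"BZh", "bzip2"),
        (0, b"\xfd7zXZ\x00", "xz"),
        (0, b"7z\xbc\xaf\x27\x1c", "7z"),
        (0, b"Rar!\x1a\x07", "rar"),
        (257, b"ustar", "tar"),           # POSIX tar magic inside the first header block
        (32769, b"CD001", "iso"),         # ISO 9660 volume descriptor identifier
    ]
    for offset, magic, name in signatures:
        if data[offset:offset + len(magic)] == magic:
            return name
    return "unknown"


print(sniff_format(b"PK\x03\x04" + b"\x00" * 26))
print(sniff_format(b"\x1f\x8b\x08\x00"))
```

A compressed tar such as tar.gz is reported by its outer layer (gzip here); recognizing the inner tar would require decompressing the start of the stream first.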
Why is my gzip artifact showing no duplicates?
A bare .gz holds a single inner file, so there is nothing to compare. Use a multi-file container like zip or tar.
How do I see only the biggest wins?
Lower the Top-N groups slider (down to 10). Groups are sorted by wasted bytes, so you keep the heaviest offenders.
Can it diff two builds?
Not this tool. Use archive-diff to compare two artifacts.
Is my proprietary artifact uploaded?
No. Hashing runs locally in the browser; nothing is uploaded or written to disk.
What's the size limit?
Pro: 500 MB / 50,000 entries. Pro-media and Developer: 2 GB / 500,000 entries. Minimum tier is Pro.
Does it handle encrypted artifacts?
No — there is no password field. Decrypt first with multi-format-extractor.
Is the output stable across runs?
Yes. SHA-256 is deterministic, so the same artifact yields the same groups and digests every time.
Privacy first
Every JAD Archive tool runs entirely in your browser using fflate, @zip.js/zip.js, and the libarchive WASM bridge. Your archives never leave your device — verified by zero outbound network requests during processing.