Duplicate Finder for Developers — Trim node_modules & Build Artifacts

How to use the duplicate file detector in developer workflows

Step 1
Grab the artifact: Take the build output you want to inspect, such as a dist.zip, a Docker layer (layer.tar.gz), a release .7z, or a backup .tar. The analyzer reads one archive per run; it does not accept folders or batches.
Step 2
Open the analyzer: Go to redundancy-analyzer (Pro tier) and drop the artifact. Format is detected by magic bytes, so you do not pick a type.
Step 3
Tune Top-N for the artifact's noise: The default of 100 groups suits most bundles. For a sprawling node_modules with hundreds of tiny duplicate sets, drop the slider to 10–20 to see only the heaviest offenders; raise it toward 500 for a fuller inventory.
Step 4
Read the ranked JSON: The top groups are your biggest wins. Each shows the SHA-256, copy count, per-file size, wasted bytes, and every path, which is enough to decide what to single-source.
Step 5
Act, because the tool only reports: It never edits the archive. Re-pack keeping only one copy with selective-extractor + folder-to-zip, or fix the build config (e.g. hoist the dependency) so the duplication does not recur.
Step 6
Track it across builds: There is no API for this browser tool, so for build-over-build tracking, run it manually on each artifact, or compare two artifacts directly with archive-diff to see what changed.
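
Step 2 says the format is detected by magic bytes rather than by file extension. The sketch below illustrates that idea in Python using widely documented signatures; it is not the analyzer's own code, and the function name sniff_format and the signature list are assumptions made for illustration.

```python
# Well-known leading signatures for a few archive formats (illustrative subset).
SIGNATURES = [
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
    (b"Rar!\x1a\x07", "rar"),
]


def sniff_format(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    # POSIX tar has no leading signature; "ustar" sits at offset 257.
    if header[257:262] == b"ustar":
        return "tar"
    return "unknown"
```

Reading the first 512 bytes of an artifact and passing them to sniff_format is enough for every signature above, which is why a renamed or extension-less file can still be recognized.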

Where developer bloat hides — and what the analyzer shows

Typical duplication patterns in build outputs and the follow-up tool to fix each.

Pattern	Why it happens	What the report shows	Fix with
Same dep in many node_modules trees	No hoisting / nested installs	One group, high count, big wasted bytes	Hoist; re-pack via selective-extractor
Shared asset copied per route	Build copies instead of referencing	Group across route folders	Reference once; folder-to-zip
Repeated LICENSE / README	One per package	Many small groups, low waste each	Usually leave; informational
Two snapshots in one backup	Full instead of incremental	Most files grouped as pairs	Switch to incremental backups
Duplicate fonts / icons	Multiple themes vendor same files	Grouped by content across themes	Single-source the asset

Output JSON shape (for scripting)

Exact top-level and per-group fields the tool emits, so you can parse the report in a script after manual download.

Field	Level	Meaning
totalEntries	top	Count of files hashed (directories excluded)
duplicateGroups	top	Number of groups returned (capped by Top-N)
totalWastedBytes / totalWastedHuman	top	Recoverable space across returned groups
hash	group	SHA-256 of the group's identical content
count	group	How many copies are in this group
perFileSize	group	Size of one copy in bytes
wastedBytes	group	perFileSize x (count - 1)
files[]	group	Array of {name, size} for each copy
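
The field table above is enough to write a small loader that sanity-checks a downloaded report before other scripts rely on it. This is a minimal sketch: the function names are hypothetical, and the top-level key "groups" is assumed from the FAQ wording ("a groups array"), so adjust it if your download uses a different key.

```python
import json


def load_report(path: str) -> dict:
    """Parse a report that was downloaded manually from the analyzer."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def check_report(report: dict) -> list[str]:
    """Return a list of inconsistencies; an empty list means the report adds up."""
    problems = []
    groups = report.get("groups", [])  # key name is an assumption
    if report["duplicateGroups"] != len(groups):
        problems.append("duplicateGroups does not match the number of groups")
    for g in groups:
        # Documented relation: wastedBytes = perFileSize x (count - 1)
        expected = g["perFileSize"] * (g["count"] - 1)
        if g["wastedBytes"] != expected:
            problems.append(f"{g['hash']}: wastedBytes {g['wastedBytes']} != {expected}")
        if len(g["files"]) != g["count"]:
            problems.append(f"{g['hash']}: files[] length does not match count")
    if report["totalWastedBytes"] != sum(g["wastedBytes"] for g in groups):
        problems.append("totalWastedBytes is not the sum of group wastedBytes")
    return problems
```

A non-empty result usually means the JSON was trimmed or edited by hand after download, which is worth knowing before you build on it.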

Capabilities vs constraints

What it does and what it does not, so you plan the workflow correctly.

Capability	Supported?	Detail
Read inside archive, no extract to disk	Yes	fflate / zip.js / libarchive WASM in browser
Remove duplicates	No	Report-only; re-pack with sibling tools
CLI / API / CI step	No	Browser-only tool, no server-side path
Batch / folder input	No	One archive per run
Encrypted archives	No	No password input
Compare two artifacts	No	Use archive-diff for that

Cookbook

Developer-shaped runs against real artifact shapes. JSON is trimmed; field names are exact.

Vendored dependency across node_modules

A monorepo bundle installed the same package in three workspaces. The analyzer groups every identical file in that package, exposing the duplication a hoist would remove.

Input: dist.zip (monorepo bundle)

Report top group:
{
  "hash": "a3f1...",
  "count": 3,
  "perFileSize": 1310720,
  "wastedBytes": 2621440,
  "files": [
    {"name":"apps/web/node_modules/lib/index.js","size":1310720},
    {"name":"apps/admin/node_modules/lib/index.js","size":1310720},
    {"name":"packages/ui/node_modules/lib/index.js","size":1310720}
  ]
}
Fix: hoist lib to the root, drop the nested copies.
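
A report for a real monorepo lists one group per duplicated file, so it helps to roll groups up to the package that owns them before deciding what to hoist. The sketch below does that from the paths in files[]; the helper names are hypothetical, and it assumes forward-slash paths like the ones in the excerpt above.

```python
from collections import defaultdict
from typing import Optional


def package_of(path: str) -> Optional[str]:
    """Return the package name for a path inside node_modules, else None."""
    parts = path.split("/")
    if "node_modules" not in parts:
        return None
    # Use the innermost node_modules so nested installs resolve correctly.
    idx = len(parts) - 1 - parts[::-1].index("node_modules")
    rest = parts[idx + 1:]
    if not rest:
        return None
    if rest[0].startswith("@") and len(rest) > 1:
        return "/".join(rest[:2])  # scoped package such as @scope/name
    return rest[0]


def waste_by_package(groups: list) -> dict:
    """Sum wastedBytes per package for groups whose copies all share one package."""
    totals = defaultdict(int)
    for g in groups:
        names = {package_of(f["name"]) for f in g["files"]}
        if len(names) == 1 and None not in names:
            totals[names.pop()] += g["wastedBytes"]
    return dict(totals)
```

Sorting the result by value gives a hoist shortlist. Groups whose copies span several packages or sit outside node_modules are skipped on purpose, since a hoist would not remove them.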

Shared image copied into every route

The build copied the hero image into each page's asset folder. Identical bytes, different paths, one group.

Input: site-dist.tar.gz

Report (excerpt):
  hash: "7c20..."  count: 11  perFileSize: 512000
  wastedBytes: 5120000  ("4.9 MB")
  files: pages/*/assets/hero.jpg  (11 copies)

Fix: reference one /assets/hero.jpg instead of copying.

Focusing on the heaviest offenders

A node_modules-heavy artifact returns hundreds of tiny duplicate sets. Lower the Top-N slider (shown as pairLimit in the runs below) to surface only the big wins.

pairLimit = 100:  duplicateGroups 100, totalWastedHuman "61.2 MB"
pairLimit = 15:   duplicateGroups 15,  totalWastedHuman "54.8 MB"

The top 15 groups hold about 90% of the waste reported at 100 groups (54.8 MB of 61.2 MB), so fix those first.
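
The same comparison can be made from a single downloaded report instead of two runs. This sketch computes the share of reported waste held by the top N groups and the smallest N that reaches a target share; the function names are hypothetical, and the shares are relative to the groups present in the report, not to anything cut off by Top-N.

```python
def top_n_share(groups: list, n: int) -> float:
    """Fraction of reported wasted bytes held by the n heaviest groups."""
    ranked = sorted((g["wastedBytes"] for g in groups), reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n]) / total


def smallest_n_for(groups: list, target: float) -> int:
    """Smallest number of groups whose combined waste reaches the target share."""
    ranked = sorted((g["wastedBytes"] for g in groups), reverse=True)
    total = sum(ranked)
    running = 0
    for i, wasted in enumerate(ranked, start=1):
        running += wasted
        if total and running / total >= target:
            return i
    return len(ranked)
```

Running smallest_n_for with a target of 0.9 on a report taken at the default of 100 groups suggests a slider value to use on the next run.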

Backup snapshot redundancy

A nightly full backup tar holds yesterday's tree too. Nearly every file groups as a pair, quantifying the case for incremental backups.

Input: nightly-2026-06-12.tar

Report summary:
  totalEntries: 24,118
  duplicateGroups: 100 (capped)
  totalWastedHuman: "1.41 GB"

Most files unchanged since yesterday -> go incremental.

Confirming a clean release after a fix

After hoisting the duplicated dependency, re-run on the new artifact to confirm the waste is gone.

Before fix: dist.zip -> totalWastedHuman "61.2 MB"
After hoist + re-pack:
  dist.zip -> duplicateGroups 0, totalWastedHuman "0 B"

Regression check passed: no byte-identical duplicates remain.
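
If you keep the JSON from both runs, the before and after reports can be compared by digest to see which groups the fix removed. A minimal sketch, assuming the top-level key "groups" and a hypothetical function name:

```python
def compare_reports(before: dict, after: dict) -> dict:
    """Summarize what changed between two downloaded reports, matched by hash."""
    before_hashes = {g["hash"] for g in before.get("groups", [])}
    after_hashes = {g["hash"] for g in after.get("groups", [])}
    return {
        # Negative delta means the newer artifact wastes fewer bytes.
        "wasted_delta": after["totalWastedBytes"] - before["totalWastedBytes"],
        "fixed": sorted(before_hashes - after_hashes),
        "remaining": sorted(before_hashes & after_hashes),
        "new": sorted(after_hashes - before_hashes),
    }
```

Because each report is capped by Top-N, a digest listed under "fixed" may only have dropped below the cutoff. The comparison is most trustworthy when the newer report returns fewer groups than the slider allows.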

Edge cases and what actually happens

Expecting a CLI or API call

Not available

Archive tools are browser-only with no server-side path, so there is no endpoint or binary to call from CI. Use the tool interactively, then script against the downloaded JSON if you need automation around it.
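
One way to script against the downloaded JSON is to keep a running history file, one row per artifact you analyzed by hand. The sketch below appends the top-level totals to a CSV; the file layout, the build label, and the function name are assumptions for illustration.

```python
import csv
from pathlib import Path

FIELDS = ["build", "totalEntries", "duplicateGroups", "totalWastedBytes"]


def append_history(csv_path: str, build: str, report: dict) -> None:
    """Append one summary row per analyzed artifact, writing a header on first use."""
    path = Path(csv_path)
    is_new = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        row = {"build": build}
        row.update({key: report[key] for key in FIELDS[1:]})
        writer.writerow(row)
```

The analysis itself still happens in the browser; this only records the numbers afterwards so a trend across builds is visible.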

Same filename, different content

By design

dist/a/chunk.js and dist/b/chunk.js are only grouped if byte-identical. Different content means different SHA-256, so they stay separate — you will not accidentally treat a changed file as a duplicate.
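
Grouping by content digest can be sketched in a few lines with Python's hashlib. This illustrates the technique, not the analyzer's implementation; the function name is hypothetical and the output mirrors the documented group fields.

```python
import hashlib
from collections import defaultdict


def group_by_content(files: dict) -> list:
    """Group a {path: bytes} mapping by SHA-256 and keep only true duplicates."""
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[hashlib.sha256(data).hexdigest()].append((name, len(data)))
    groups = []
    for digest, copies in by_hash.items():
        if len(copies) < 2:
            continue  # unique content is never reported, whatever its filename
        size = copies[0][1]
        groups.append({
            "hash": digest,
            "count": len(copies),
            "perFileSize": size,
            "wastedBytes": size * (len(copies) - 1),
            "files": [{"name": n, "size": s} for n, s in copies],
        })
    groups.sort(key=lambda g: g["wastedBytes"], reverse=True)
    return groups
```

The path never enters the hash, so two chunk.js files with different bytes land in different buckets, while identical bytes under different names land in the same one.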

Single-stream gzip artifact

By design

A bare .gz (one inner file) cannot have duplicates and returns zero groups. Inspect a multi-file container (zip, tar, 7z) to find duplication.

node_modules with thousands of tiny dupes

Truncated

If duplicate groups exceed the Top-N value, only the highest-waste ones are returned. Lower the slider to focus, or raise it to 500 for a fuller inventory; the long tail of trivial sets is usually safe to ignore.

Repeated LICENSE/README files

Expected

These group with small per-file waste and large counts. Usually informational, not worth removing — they are tiny. Focus on the top groups by wasted bytes.
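
When parsing a downloaded report, these groups can be set aside so they do not clutter the list you act on. A small sketch, with the set of file stems and the function name chosen for illustration:

```python
import posixpath

INFORMATIONAL_STEMS = {"license", "readme"}  # assumed list; extend as needed


def is_informational(group: dict) -> bool:
    """True when every copy in the group is a LICENSE or README style file."""
    def stem(name: str) -> str:
        base = posixpath.basename(name).lower()
        return base.split(".", 1)[0]

    files = group["files"]
    return bool(files) and all(stem(f["name"]) in INFORMATIONAL_STEMS for f in files)
```

Filtering with `[g for g in groups if not is_informational(g)]` leaves the groups worth a build change.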

Encrypted artifact

Rejected

No password input — encrypted entries error out. Decrypt with multi-format-extractor first, then analyze the plaintext artifact.

Artifact over the tier cap

Rejected

Pro: 500 MB / 50,000 entries; higher tiers 2 GB / 500,000. A huge build artifact may need splitting with archive-splitter or a higher tier.
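
A local pre-check can tell you whether an artifact is likely to fit before you open the browser. The sketch below covers zip and tar-family archives with Python's standard library. It assumes that "MB" means 1024 x 1024 bytes (consistent with 5,120,000 bytes being shown as "4.9 MB" in the cookbook) and that every archive member counts toward the entry cap; both are assumptions, as are the function names.

```python
import os
import tarfile
import zipfile

PRO_MAX_BYTES = 500 * 1024 * 1024  # assumed interpretation of "500 MB"
PRO_MAX_ENTRIES = 50_000


def count_entries(path: str) -> int:
    """Count members in a zip or tar-family archive; other formats are not handled."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return len(zf.infolist())
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:  # default mode detects gz/bz2/xz compression
            return sum(1 for _ in tf)
    raise ValueError(f"unsupported archive for this sketch: {path}")


def fits_pro_tier(path: str) -> bool:
    """True when both the file size and the entry count are within the Pro caps."""
    return (os.path.getsize(path) <= PRO_MAX_BYTES
            and count_entries(path) <= PRO_MAX_ENTRIES)
```

If fits_pro_tier returns False, that is the cue to split the artifact or use a higher tier before trying the analyzer.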

Wanting the tool to delete the dupes

By design

It only reports. Re-pack keeping one copy via selective-extractor and folder-to-zip, or fix the build config so the duplication never enters the artifact.

Comparing two build artifacts

Out of scope

This analyzes one archive. To see what changed between two builds, use archive-diff.

Free-tier developer account

Blocked

The tool requires Pro. Free accounts cannot run it. Upgrade to use it on build artifacts.

Frequently asked questions

Can I call this from CI or a script?

No. Archive tools run only in the browser with no server-side path, so there is no API or CLI. Run it interactively and parse the downloaded JSON in your own script if needed.

Will it group same-named files that differ?

No. Grouping is by SHA-256 of content. Files with the same name but different bytes have different digests and are not grouped.

Does it delete or fix the duplicates?

No, it reports only. Use selective-extractor plus folder-to-zip to re-pack a single copy, or fix the build config (e.g. hoist the dependency).

How do I read the output programmatically?

Download the JSON. It has totalEntries, duplicateGroups, totalWastedBytes/Human, and a groups array of {hash, count, perFileSize, wastedBytes, files[]}.

What artifacts can it read?

ZIP, GZIP, TAR, tar.gz, tar.bz2, tar.xz, 7z, RAR, bz2, xz, and ISO — covering deploy bundles, Docker layers, and release packages.

Why is my gzip artifact showing no duplicates?

A bare .gz holds a single inner file, so there is nothing to compare. Use a multi-file container like zip or tar.

How do I see only the biggest wins?

Lower the Top-N groups slider (down to 10). Groups are sorted by wasted bytes, so you keep the heaviest offenders.

Can it diff two builds?

Not this tool. Use archive-diff to compare two artifacts.

Is my proprietary artifact uploaded?

No. Hashing runs locally in the browser; nothing is uploaded or written to disk.

What's the size limit?

Pro: 500 MB / 50,000 entries. Pro-media and Developer: 2 GB / 500,000 entries. Minimum tier is Pro.

Does it handle encrypted artifacts?

No — there is no password field. Decrypt first with multi-format-extractor.

Is the output stable across runs?

Yes. SHA-256 is deterministic, so the same artifact yields the same groups and digests every time.

Privacy first

Every JAD Archive tool runs entirely in your browser using fflate, @zip.js/zip.js, and the libarchive WASM bridge. Your archives never leave your device — verified by zero outbound network requests during processing.
