How to use the duplicate file detector for security & compliance
- Step 1: Confirm tier and handling rules. The tool is Pro-tier. Everything runs locally and no file leaves the machine, so it suits confidential bundles, but confirm that your policy permits opening evidence in a browser tab.
- Step 2: Open the analyzer and drop the bundle. Go to redundancy-analyzer and drop one evidence or deliverable archive. It reads ZIP, 7z, RAR, tar.*, bz2, xz, and ISO, and it analyzes one archive at a time.
- Step 3: Keep the result complete for audits. Raise the Top-N groups slider toward 500 for a thorough audit so that low-waste but relevant duplicate sets are not silently dropped. For compliance, completeness usually beats brevity.
- Step 4: Record the digests. Each group reports its SHA-256, the duplicate count, and every file path. Copy these into your audit working notes; they are the values you would compare against a known-good manifest or a chain-of-custody record.
- Step 5: Cross-check against a manifest. If you have a supplied checksum manifest, compute per-file SHA-256 with checksum-generator and reconcile. The analyzer tells you which files are identical to each other; the checksum generator tells you whether each matches the expected value.
- Step 6: Do not modify the evidence. The analyzer never writes to the archive. If a downstream task needs a de-duplicated copy, produce it separately (e.g. with selective-extractor) and document it as a derived artifact, leaving the original untouched.
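To make the grouping in Steps 2 to 4 concrete, the following is a minimal sketch of content-based duplicate detection for a ZIP bundle, written with Python's standard library. It is not the analyzer's implementation (which runs in the browser); the function name `group_by_sha256` is hypothetical, and only the `name` and `size` field names are borrowed from the report excerpt in this guide.

```python
import hashlib
import io
import zipfile
from collections import defaultdict


def group_by_sha256(archive):
    """Map SHA-256 digest -> list of entries, keeping only digests seen 2+ times.

    `archive` is a path or a binary file-like object. The archive is opened
    read-only and nothing is extracted to disk.
    """
    by_hash = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            digest = hashlib.sha256()
            with zf.open(info) as entry:
                # Hash in 1 MiB chunks so large entries are not held in memory.
                for chunk in iter(lambda: entry.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(
                {"name": info.filename, "size": info.file_size}
            )
    return {h: files for h, files in by_hash.items() if len(files) > 1}


# Illustrative in-memory bundle: the same log captured under two paths.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("host-a/var/log/auth.log", b"x" * 1024)
    zf.writestr("collection-2/auth.log", b"x" * 1024)
    zf.writestr("unique.txt", b"different")
demo.seek(0)
demo_groups = group_by_sha256(demo)
```

Each key in the result is the digest you would record in your working notes, and each value lists every path that carries that content.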
Security & compliance scenarios
Common audit situations and what the analyzer surfaces. It reports content identity only — it does not assert provenance or trust.
| Scenario | What the analyzer shows | Follow-up tool |
|---|---|---|
| Evidence bundle with duplicate captures | Same log/file hashed twice grouped under one SHA-256 | Document; do not modify original |
| SBOM package vendoring same dep repeatedly | Identical dependency files grouped, wasted bytes totalled | file-type-breakdown |
| Vendor deliverable with copy-pasted libs | Same library under different paths in one group | archive-metadata-extractor |
| Verify a file matches a supplied digest | Per-group SHA-256 to compare | checksum-generator |
| Compare two release artifacts | Not this tool (single archive) | archive-diff |
Why content hashing matters for tamper questions
What the SHA-256 grouping does and does not tell you. Be precise in audit notes.
| Question | Analyzer answers it? | How |
|---|---|---|
| Are these two files byte-identical? | Yes | Same SHA-256 = identical content |
| How much of the bundle is redundant? | Yes | totalWastedBytes / totalWastedHuman |
| Does file X match a known-good digest? | Indirectly | Compare the reported hash to your manifest |
| Who created the file / is it signed? | No | Use archive-signing-info / archive-metadata-extractor |
| Has any file been altered vs a baseline? | No (single archive) | Use archive-diff against the baseline |
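The redundancy figure can be checked by hand. The sketch below assumes that a group's wasted bytes equal its per-file size times the number of extra copies, which is consistent with the report excerpt in this guide (2 copies of a 4,194,304-byte file give 4,194,304 wasted bytes) but is not confirmed as the tool's exact formula. The helper names and the 1024-based unit formatting are assumptions.

```python
def wasted_bytes(count, per_file_size):
    """Bytes saved if only one copy of a duplicate group were kept."""
    return max(count - 1, 0) * per_file_size


def human_size(num_bytes):
    """Format a byte count with 1024-based units (assumed, not confirmed)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{int(value)} B" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1024


def summarize(groups):
    """Total the waste across groups that carry `count` and `perFileSize`."""
    total = sum(wasted_bytes(g["count"], g["perFileSize"]) for g in groups)
    return {"totalWastedBytes": total, "totalWastedHuman": human_size(total)}
```

Recomputing the total from the recorded counts and sizes is a cheap sanity check before the figure goes into an audit note.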
Handling caps for sensitive bundles
Tier limits. The tool is Pro-minimum; Free cannot run it.
| Tier | Max bundle size | Max entries | Suitable for |
|---|---|---|---|
| Free | n/a (tool requires Pro) | n/a | Not available |
| Pro | 500 MB | 50,000 | Most evidence bundles / deliverables |
| Pro-media | 2 GB | 500,000 | Large captures, disk images, ISOs |
| Developer | 2 GB | 500,000 (unlimited files) | Internal audit automation seats |
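For automation seats it can help to check a bundle against the caps in the table before an analyst opens it. This is a minimal sketch with the table values hard-coded; `check_caps` and `TierCap` are hypothetical names, and treating MB and GB as binary units is an assumption, since the table does not say which is meant.

```python
from dataclasses import dataclass

MB = 1024 * 1024  # assumption: binary units
GB = 1024 * MB


@dataclass(frozen=True)
class TierCap:
    max_bytes: int
    max_entries: int


# Values copied from the tier table; Free is absent because it cannot run the tool.
TIER_CAPS = {
    "pro": TierCap(500 * MB, 50_000),
    "pro-media": TierCap(2 * GB, 500_000),
    "developer": TierCap(2 * GB, 500_000),
}


def check_caps(tier, size_bytes, entry_count):
    """Return the reasons a bundle would be rejected (empty list if it fits)."""
    cap = TIER_CAPS.get(tier.lower())
    if cap is None:
        return [f"tier '{tier}' cannot run the analyzer"]
    problems = []
    if size_bytes > cap.max_bytes:
        problems.append(f"size {size_bytes} exceeds cap {cap.max_bytes}")
    if entry_count > cap.max_entries:
        problems.append(f"entries {entry_count} exceed cap {cap.max_entries}")
    return problems
```

For example, `check_caps("pro", 600 * MB, 60_000)` returns two reasons, while the same bundle passes on `"pro-media"`.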
Cookbook
Audit-shaped examples. Paths are illustrative; the report fields are exactly what the tool emits.
Same evidence file captured twice
An IR bundle accidentally included the same system log under two collection paths. The analyzer groups them by identical SHA-256, so the auditor knows it is one artifact, not two independent captures.
Input: ir-bundle-2026-06-12.zip
Report (excerpt):
{
  "duplicateGroups": 1,
  "groups": [{
    "hash": "c1f5a9b2...",
    "count": 2,
    "perFileSize": 4194304,
    "wastedBytes": 4194304,
    "files": [
      {"name": "host-a/var/log/auth.log", "size": 4194304},
      {"name": "collection-2/auth.log", "size": 4194304}
    ]
  }]
}
Audit note: identical content (one capture), not two sources.
SBOM package vendoring the same dependency
A delivery vendored an identical crypto library into multiple component folders. Grouped by content, the redundancy and its inconsistent placement become obvious.
Input: vendor-sbom.tar.gz
Report summary:
duplicateGroups: 14
totalWastedHuman: "38.2 MB"
top group: libssl bytes appear in 6 component dirs
Finding: same dependency vendored 6x under different paths -> flag for single-sourcing + provenance review.
Reconciling against a supplied checksum manifest
The analyzer says which files are identical to each other; pair it with checksum-generator to confirm each matches the vendor's published digest.
1) Analyzer: files A and B share hash 7d1e...
2) checksum-generator on the bundle:
A 7d1e...
B 7d1e...
3) Vendor manifest expects: 7d1e... -> MATCH
Conclusion: A and B are the approved artifact, just duplicated.
Read-only audit of a delivered ISO
A vendor shipped a disc image. libarchive WASM reads ISO entries so you can audit for duplicated payloads without mounting or extracting the image.
Input: appliance-2.1.iso
Report:
totalEntries: 9,210
duplicateGroups: 7
totalWastedHuman: "112.4 MB"
No files written to disk; original ISO untouched.
Confirming a hardened bundle has no redundancy
Before sign-off, an empty groups array proves the deliverable carries no byte-identical duplicates — a clean, minimal package.
Input: signed-release.zip
Report:
duplicateGroups: 0
totalWastedHuman: "0 B"
groups: []
Clean: no redundant content. Proceed to signing review.
Edge cases and what actually happens
Encrypted evidence archive
Rejected. The analyzer extracts without a password and errors on any encrypted entry. For encrypted evidence, decrypt under your handling procedure with multi-format-extractor (which accepts a password), document the step, then analyze the cleartext copy.
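If you want to know in advance whether a ZIP bundle will be rejected, the encryption flag can be read from the archive's central directory without decrypting anything. This is a sketch using Python's `zipfile` module, not part of the analyzer; `encrypted_entries` is a hypothetical helper and it covers ZIP only.

```python
import io
import zipfile

ENCRYPTED_FLAG = 0x1  # bit 0 of the ZIP general-purpose flag marks an encrypted entry


def encrypted_entries(archive):
    """Return the names of ZIP entries whose encryption flag is set.

    Only the central directory is read; no entry is decrypted or extracted.
    """
    with zipfile.ZipFile(archive) as zf:
        return [i.filename for i in zf.infolist() if i.flag_bits & ENCRYPTED_FLAG]


# Illustrative cleartext bundle built in memory: no entry is flagged.
cleartext = io.BytesIO()
with zipfile.ZipFile(cleartext, "w") as zf:
    zf.writestr("notes.txt", b"cleartext")
cleartext.seek(0)
flagged = encrypted_entries(cleartext)
```

A non-empty result means the bundle needs the documented decryption step before analysis.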
Chain-of-custody forbids modification
Preserved. The analyzer is strictly read-only and never writes to the archive, so running it does not alter evidence. Any de-duplicated output must be produced as a separate derived artifact by another tool, not by this one.
Identical hash is treated as identical content
By design. The same SHA-256 means byte-identical content, because SHA-256 collisions are computationally infeasible. State it as content identity in audit notes, not as proof of provenance or trust.
Provenance or signature questions
Out of scope. The analyzer reports content identity, not who made a file or whether it is signed. Use archive-signing-info or archive-metadata-extractor for those.
Comparing two bundles for tamper
Out of scope. This tool analyzes a single archive. To detect changes between a baseline and a current bundle, use archive-diff.
Bundle exceeds tier cap
Rejected. Pro allows 500 MB / 50,000 entries; higher tiers allow 2 GB / 500,000 entries. Oversized evidence must be split with archive-splitter and analyzed in parts, or handled on a higher tier.
Empty files in the bundle
Expected. All zero-byte files share one SHA-256 and group together with zero wasted bytes. This is not a finding; note it so it is not misread as suspicious duplication.
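The empty-file digest is a fixed, well-known value, so such a group is easy to recognize in a report. A short sketch follows; `is_empty_file_group` is a hypothetical helper that assumes a group carries the `hash` and `perFileSize` fields shown in the report excerpt.

```python
import hashlib

# SHA-256 of zero bytes: identical for every empty file, whatever its name or path.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855


def is_empty_file_group(group):
    """True when a duplicate group is just the set of zero-byte files."""
    return group["hash"] == EMPTY_SHA256 and group["perFileSize"] == 0
```

Filtering these groups out, or annotating them, keeps placeholder files from being mistaken for duplicated evidence.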
More duplicate sets than the slider returns
Truncated. If duplicate groups exceed the Top-N value, only the highest-waste ones are returned. For a complete audit, raise the slider to 500 and record that the report may still be capped if the true count is higher.
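The truncation behaviour described here can be sketched as a sort followed by a slice. The function below is an illustration only: `top_n_groups` and the `totalGroups` and `truncated` fields are hypothetical and are not fields the analyzer is known to emit.

```python
def top_n_groups(groups, n=500):
    """Keep the n highest-waste groups and say whether anything was cut."""
    ranked = sorted(groups, key=lambda g: g["wastedBytes"], reverse=True)
    return {
        "groups": ranked[:n],
        "totalGroups": len(ranked),     # hypothetical field
        "truncated": len(ranked) > n,   # hypothetical field
    }
```

Recording an explicit truncation flag alongside the results makes it clear in the audit trail whether a report covers every duplicate set.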
Free-tier analyst account
Blocked. The tool requires Pro. Free accounts cannot run it. Provision at least a Pro seat for compliance use.
Browser blocks WebAssembly
Failed. The libarchive-backed formats (7z/RAR/bz2/xz/ISO) need WebAssembly. A hardened browser policy or extension that blocks WASM will prevent reading those formats. ZIP/GZIP/TAR still work via fflate.
Frequently asked questions
Is the bundle uploaded anywhere?
No. Extraction and SHA-256 hashing run entirely in your browser. Nothing is uploaded and nothing is decompressed to disk, which suits confidential evidence.
Can I quote the hashes in an audit report?
Yes. Each group's hash is the SHA-256 of that file's bytes — a stable, quotable content digest you can compare against a manifest.
Does running it modify the evidence?
No. The tool is read-only; it never writes back to the archive, so it is safe under chain-of-custody rules.
Does it prove a file is authentic or signed?
No. It proves content identity between files. For signatures or provenance, use archive-signing-info or archive-metadata-extractor.
How do I verify a file against a known-good digest?
Use checksum-generator to compute each file's SHA-256, then compare to your supplied manifest. The analyzer separately tells you which files are identical to each other.
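The reconciliation step can also be sketched offline. The example below assumes the supplied manifest uses the common `sha256sum` text layout (one `<hex digest>  <path>` pair per line) and that the bundle is a ZIP; `parse_manifest` and `reconcile` are hypothetical names, and this is not how checksum-generator itself works.

```python
import hashlib
import io
import zipfile


def parse_manifest(text):
    """Parse sha256sum-style lines into {path: lowercase hex digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, _, path = line.partition(" ")
        # sha256sum marks binary mode with a leading '*' before the path.
        expected[path.strip().lstrip("*")] = digest.lower()
    return expected


def reconcile(archive, manifest_text):
    """Return {path: MATCH | MISMATCH | UNLISTED | MISSING} for a ZIP bundle."""
    expected = parse_manifest(manifest_text)
    results = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Reads each entry fully into memory; fine for a sketch.
            actual = hashlib.sha256(zf.read(info)).hexdigest()
            want = expected.pop(info.filename, None)
            if want is None:
                results[info.filename] = "UNLISTED"
            else:
                results[info.filename] = "MATCH" if want == actual else "MISMATCH"
    # Anything still in the manifest was never found in the bundle.
    for path in expected:
        results[path] = "MISSING"
    return results
```

Reporting UNLISTED and MISSING separately from MISMATCH distinguishes files the vendor never declared from files whose content differs from the published digest.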
Can it compare a bundle to a baseline?
Not directly — it analyzes one archive. Use archive-diff to compare two archives for changes.
What formats can I audit?
ZIP, GZIP, TAR, tar.gz, tar.bz2, tar.xz, 7z, RAR, bz2, xz, and ISO — covering most vendor deliverables and disc images.
What if the evidence is encrypted?
The analyzer cannot read encrypted archives (no password field). Decrypt under your procedure with multi-format-extractor, then analyze the cleartext copy.
How large a bundle can I analyze?
Pro: 500 MB / 50,000 entries. Pro-media and Developer: 2 GB / 500,000 entries. The tool requires Pro or higher.
Why are empty files grouped?
Every zero-byte file has the same SHA-256, so they all match with zero wasted bytes. Note it so it is not misread as a finding.
Is the result deterministic?
Yes. SHA-256 is fixed and the algorithm is stable, so the same bundle produces the same digests and groups on every run.
Can I run it without per-seat licensing?
It runs in the browser on a Pro (or higher) account — there is no separate per-analyst install or licence beyond your plan tier.
Privacy first
Every JAD Archive tool runs entirely in your browser using fflate, @zip.js/zip.js, and the libarchive WASM bridge. Your archives never leave your device — verified by zero outbound network requests during processing.