How to detect duplicate files by SHA-256 content fingerprint
- Step 1: Decide which hash you'll dedup on. Use SHA-256 unless you're matching an existing index. It's the sha256 field in the report, a 64-hex-char string. SHA-256 is the right default: long enough that accidental collisions don't happen, fast enough for everyday files.
- Step 2: Fingerprint the first candidate. Drop a file onto the dropzone and run. The bytes are read into memory and digested locally. Copy the sha256 value, or download the <filename>.hashes.json report, and record it next to the filename in a spreadsheet or note.
- Step 3: Fingerprint each remaining candidate. Each run hashes the first dropped file only, so repeat per file: drop, run, record. Build up a list of filename -> sha256 pairs as you go. (For a guided two-file comparison instead of a list, see the cookbook.)
- Step 4: Sort by hash to surface duplicate groups. Sort your filename -> sha256 list by the hash column. Any two rows with the same SHA-256 are byte-for-byte identical files, which makes them a duplicate group. Rows with a unique hash are one-of-a-kind.
- Step 5: Keep one, delete the rest of each group. Within a duplicate group, keep whichever copy has the name/location you want and delete the others. They are exact byte copies, so nothing is lost. Files with a unique hash are not duplicates; leave them alone.
- Step 6: Re-fingerprint after cleanup to confirm. After deleting, fingerprint the survivor again and confirm its SHA-256 is unchanged. A matching digest proves you kept an intact copy and didn't accidentally truncate or alter it during the cleanup.
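The collect, sort, and group steps above can also be sketched in a few lines of Python for a terminal spot-check. This is a minimal sketch, not the tool's own code: the helper names `sha256_of` and `duplicate_groups` are hypothetical, and it assumes each file fits comfortably in memory.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path):
    """Return the lowercase hex SHA-256 of a file's raw bytes."""
    # Whole-file read: fine for a sketch, assumes the file fits in memory.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def duplicate_groups(paths):
    """Map each SHA-256 that occurs more than once to its sorted file list."""
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[sha256_of(p)].append(str(p))
    # Hashes seen only once are one-of-a-kind files, so they are dropped here.
    return {h: sorted(names) for h, names in by_hash.items() if len(names) > 1}
```

Calling `duplicate_groups` on a list of paths returns only the groups worth acting on; deciding which copy to keep in each group is still up to you.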
Why filename, size, and date can't prove a duplicate — but the hash can
Common signals people rely on for dedup, and why only a content hash is decisive.
| Signal | Same value means… | Reliable for dedup? |
|---|---|---|
| Filename | Nothing — copies get renamed; different files share names | No — copy (1).jpg could be identical or unrelated |
| File size | Possibly the same bytes — but two different files can match by size | No — a same-size pair is a *candidate*, not a confirmed duplicate |
| Modified date | Nothing — copying or syncing rewrites timestamps | No — a true copy often has a brand-new date |
| SHA-256 digest | Byte-for-byte identical contents | Yes — a match is a true duplicate; a mismatch is genuinely different |
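To make the file-size row concrete, here is a small sketch in which two byte strings have exactly the same length but different contents. The function name `classify_pair` and its return labels are illustrative assumptions, not part of the tool.

```python
import hashlib


def classify_pair(a: bytes, b: bytes) -> str:
    """Classify two byte strings the way the signal table reasons about them."""
    if len(a) != len(b):
        # Different sizes can never be byte-identical.
        return "different size"
    if hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest():
        return "duplicate"
    # Same size but different digest: only a candidate, not a duplicate.
    return "same size, different content"


first = b"A" * 1024
second = b"B" * 1024
verdict = classify_pair(first, second)  # sizes match, contents do not
```

Size is still useful as a cheap pre-filter: only same-size pairs need hashing at all.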
Reading the report for dedup
Every run returns all four digests. For deduplication, the SHA-256 line is the one to key on; the others are there for matching legacy indexes.
| Report field | Length | Use for dedup |
|---|---|---|
| sha256 | 64 hex chars | Primary key — sort/group on this to find duplicates |
| md5 | 32 hex chars | Only if reconciling against an older asset manager that indexed on MD5 |
| sha1 | 40 hex chars | Only if matching a system that stored SHA-1 content keys |
| sha512 | 128 hex chars | Higher-assurance dedup of critical archives; same conclusion as SHA-256 |
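The field lengths in the table follow directly from each algorithm's output size. A minimal sketch using Python's standard `hashlib` produces a dictionary with the same four keys; the function name `fingerprint` is hypothetical, and the dictionary only approximates the report shape described here.

```python
import hashlib


def fingerprint(data: bytes) -> dict:
    """Return the four lowercase hex digests for one blob of raw bytes."""
    return {
        "sha1": hashlib.sha1(data).hexdigest(),      # 40 hex chars
        "sha256": hashlib.sha256(data).hexdigest(),  # 64 hex chars
        "sha512": hashlib.sha512(data).hexdigest(),  # 128 hex chars
        "md5": hashlib.md5(data).hexdigest(),        # 32 hex chars
    }


report = fingerprint(b"example bytes")
lengths = {name: len(value) for name, value in report.items()}
```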
Limits and scope
Per-file, in-memory hashing. The tool fingerprints one file per run; the dedup logic is your comparison of the digests.
| Property | Value | Notes |
|---|---|---|
| File-size limit (Free / Pro / Pro-media / Developer) | 10 MB / 100 MB / 500 MB / 2 GB | Whole file is read into memory; a file over your cap is rejected before hashing |
| Files per run | 1 (the first dropped file) | Fingerprint one at a time and collect the digests yourself |
| Output | JSON { sha1, sha256, sha512, md5 }, lowercase hex | Copy, or download as <filename>.hashes.json |
| Options | None | No normalization — the raw bytes are hashed exactly as they are |
Cookbook
Practical deduplication workflows. The tool produces one SHA-256 per file; deciding what's a duplicate comes down to comparing those digests. CLI equivalents are shown so you can spot-check from a terminal.
Confirm two same-size photos are actually the same shot
Your camera roll has two 4.2 MB JPEGs that look identical. Same size isn't proof. Fingerprint each: matching SHA-256 means delete one with confidence; different SHA-256 means they're distinct files (maybe a burst or a re-edit) — keep both.
    IMG_0421.jpg     -> sha256: 7d865e959b2466918c9863afca942d0f...
    IMG_0421 (1).jpg -> sha256: 7d865e959b2466918c9863afca942d0f...
    Identical -> true duplicate, safe to delete one.
    Terminal cross-check: sha256sum IMG_0421.jpg "IMG_0421 (1).jpg"
Build a hash index of a document folder in a spreadsheet
To dedup a dump of client documents, fingerprint each file and record filename + SHA-256 in two columns. Sorting on the hash column groups every duplicate together. One-of-a-kind files have a hash that appears exactly once.
    filename             sha256
    -------------------- ------------------------------------------
    contract.pdf         9b74c9897bac770ffc029102a200c5de...
    contract-copy.pdf    9b74c9897bac770ffc029102a200c5de...  <- dup
    nda.pdf              0a0a9f2a6772942557ab5355d76af442...
    Sort by sha256 -> the two matching rows are the duplicate.
    Keep contract.pdf, delete contract-copy.pdf.
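If you'd rather generate the two-column index than type it, the following sketch writes a CSV you can open in any spreadsheet. The function names and the extra "duplicate" column are assumptions made for illustration; the tool itself only reports digests.

```python
import csv
import hashlib
import io
from collections import Counter
from pathlib import Path


def build_index(paths):
    """Return (filename, sha256) rows sorted by hash, then by name."""
    rows = [
        (Path(p).name, hashlib.sha256(Path(p).read_bytes()).hexdigest())
        for p in paths
    ]
    rows.sort(key=lambda row: (row[1], row[0]))
    return rows


def index_to_csv(rows):
    """Render the rows as CSV text, flagging hashes that occur more than once."""
    counts = Counter(digest for _, digest in rows)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["filename", "sha256", "duplicate"])
    for name, digest in rows:
        writer.writerow([name, digest, "yes" if counts[digest] > 1 else "no"])
    return buf.getvalue()
```

Because the rows are already sorted by hash, duplicate groups appear as adjacent lines in the output.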
Catch a near-duplicate that is NOT a byte duplicate
A re-exported or re-compressed copy of an image looks the same to your eye but has different bytes, so its SHA-256 differs. The tool correctly reports them as distinct — content hashing finds exact duplicates, not visually-similar ones.
    original.png -> sha256: 2c26b46b68ffc68ff99b453c1d304134...
    resized.png  -> sha256: fcde2b2edba56bf408601fb721fe9b5c...
    Different -> NOT a byte duplicate (re-encoded/edited).
    Keep both; a hash only matches exact copies.
Reconcile against a legacy asset manager keyed on MD5
An older DAM exported a manifest of MD5 content keys. Your job is to find which local files are already in it. Fingerprint each local file and match its md5 line against the manifest — the matching digest is already in the report.
    Local file -> md5: e2fc714c4727ee9395f324cd2e7f331f
    DAM manifest contains: e2fc714c4727ee9395f324cd2e7f331f asset_88213
    Match -> this file is already catalogued (a duplicate of asset_88213).
    (Use SHA-256 for new dedup work; MD5 only to honor the old index.)
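The lookup against a legacy manifest might be sketched like this. The manifest layout (one "md5 asset_id" pair per line, separated by whitespace) is an assumption based on the example above, and `parse_manifest` and `already_catalogued` are hypothetical names; adapt the parsing to whatever your DAM actually exports.

```python
import hashlib


def parse_manifest(text: str) -> dict:
    """Build an md5 -> asset id lookup from 'md5 asset_id' lines (assumed format)."""
    index = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # Lowercase the key so it compares equal to hashlib's lowercase hex.
            index[parts[0].lower()] = parts[1]
    return index


def already_catalogued(data: bytes, index: dict):
    """Return the matching asset id, or None if the bytes are not in the manifest."""
    return index.get(hashlib.md5(data).hexdigest())
```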
Verify the survivor is intact after deleting duplicates
After purging duplicate copies, re-fingerprint the file you kept and confirm its SHA-256 still equals what you recorded before cleanup. A match proves the survivor is whole; a mismatch means it was altered or truncated during the operation.
    Before cleanup -> sha256: 7d865e959b2466918c9863afca942d0f...
    After cleanup  -> sha256: 7d865e959b2466918c9863afca942d0f...
    Unchanged -> survivor is intact.
    Differ -> the kept copy was modified; restore from backup.
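The before/after check can be scripted as a one-function sketch. `is_intact` is a hypothetical helper; it assumes you saved the full 64-character digest before cleanup, not a truncated prefix.

```python
import hashlib
from pathlib import Path


def is_intact(path, recorded_sha256: str) -> bool:
    """True if the file's current SHA-256 equals the digest recorded earlier."""
    current = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    # Normalize whitespace and case in the recorded value before comparing.
    return current == recorded_sha256.strip().lower()
```

A `False` result means the bytes changed since the digest was recorded, which is the cue to restore from backup.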
Edge cases and what actually happens
Two photos look identical but hash differently
By design. The tool hashes raw bytes, not pixels. A re-saved, re-compressed, resized, or re-encoded image has different bytes, and therefore a different SHA-256, even if it looks the same on screen. Content hashing finds exact duplicates only. It will not group a JPEG and a PNG of the same scene, or two different JPEG quality settings.
Same content, but one file has extra metadata
Not a duplicate. If one copy carries EXIF, an ID3 tag, or an XMP block the other lacks, the bytes differ and so does the hash: they are not byte-identical. To compare the media content while ignoring metadata you'd first strip it (for images via gps-geotag-remover, for MP3 tags via audio-id3-ghoster) and then re-hash both.
Text files that look the same won't match
Expected. A CRLF (Windows) vs LF (Unix) line ending, a UTF-8 BOM, or a trailing newline added by an editor changes the bytes and flips the digest. "Looks the same" is not "is the same." If duplicate text files refuse to match, suspect line endings or encoding before assuming the hash is wrong.
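A short sketch shows how each of these invisible differences produces a distinct digest. The sample text and variable names are illustrative only.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Lowercase hex SHA-256 of raw bytes, with no normalization."""
    return hashlib.sha256(data).hexdigest()


lf = "line one\nline two\n".encode("utf-8")       # Unix line endings
crlf = "line one\r\nline two\r\n".encode("utf-8")  # Windows line endings
with_bom = b"\xef\xbb\xbf" + lf                    # UTF-8 BOM prepended
no_trailing_newline = lf.rstrip(b"\n")             # final newline removed

digests = {
    "lf": sha256_hex(lf),
    "crlf": sha256_hex(crlf),
    "with_bom": sha256_hex(with_bom),
    "no_trailing_newline": sha256_hex(no_trailing_newline),
}
# All four variants render as the same two lines of text, yet no two digests match.
all_distinct = len(set(digests.values())) == len(digests)
```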
Multiple files dropped at once
First file only. The dropzone is multi-select, but each run fingerprints files[0] only and returns one report. Drop and run per file to build your index, or use the server-safe runner path to script batch fingerprinting of a whole folder.
A file is larger than your tier's limit
Rejected: too large. The whole file is read into memory before hashing, so the cap is enforced up front: Free 10 MB, Pro 100 MB, Pro-media 500 MB, Developer 2 GB. A file over your cap is rejected with an "exceeds the … limit for your plan" error before any digest is computed.
You only need to compare two specific files
Use the paired tool. Manually collecting two hashes works, but file-integrity-monitor takes both files at once, tells you directly whether they're byte-identical, and reports the first differing byte offset when they aren't. That is faster than eyeballing two SHA-256 strings.
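For intuition about what a "first differing byte offset" is, here is a minimal sketch. It is not file-integrity-monitor's implementation: the function name `first_difference`, the 0-based offset convention, and the treatment of a length mismatch are all assumptions.

```python
def first_difference(a: bytes, b: bytes):
    """Return None if a and b are identical, else the 0-based offset where they first differ."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        # One input is a prefix of the other: they diverge where the shorter one ends.
        return min(len(a), len(b))
    return None
```

For example, `first_difference(b"abc", b"abd")` returns 2, and two identical inputs return `None`.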
Empty files all share one hash
Expected. Every zero-byte file produces the same well-known digest (e3b0c4… for SHA-256, d41d8c… for MD5). That's correct: they really are byte-identical (both empty). Don't treat a folder of empty placeholders as a meaningful duplicate group.
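You can reproduce the empty-input digests with Python's standard `hashlib` and use them to filter empty placeholders out of a duplicate report. The helper name `is_empty_file_digest` is hypothetical.

```python
import hashlib

# Digests of zero bytes of input; these are the same on every machine.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def is_empty_file_digest(digest: str) -> bool:
    """True if the digest is the SHA-256 or MD5 of an empty file."""
    return digest.strip().lower() in (EMPTY_SHA256, EMPTY_MD5)
```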
No file dropped before running
Error: no file. There is no text-paste mode; the tool needs a file. Running with an empty dropzone throws "No file provided." Drop a file first, then run.
You want to detect tampering over time, not duplicates
Different goal. Dedup answers "are these two files the same now?" To answer "did this one file change since last week?" you need a saved baseline hash to compare against later: capture the SHA-256 now and re-fingerprint later, or use file-integrity-monitor to diff two copies.
Frequently asked questions
Why hash files instead of comparing names or sizes to find duplicates?
Names and dates lie: a true copy gets renamed and re-dated when you move or sync it, and an unrelated file can happen to share a name. File size is a slightly better hint but still not proof, because two different files can be exactly the same size. A SHA-256 digest is derived from every byte, so different digests guarantee different contents, and identical digests mean identical contents for all practical purposes (an accidental collision is effectively impossible). It's the only signal here that actually proves duplication.
Which hash should I use for deduplication?
SHA-256. It's long enough (64 hex chars) that two different real-world files matching by accident is effectively impossible, and it's the modern default. Use SHA-512 if you want extra assurance on critical archives — it reaches the same conclusion. Only key on MD5 or SHA-1 if you're reconciling against an existing index that already used those; all four digests are in every report.
Will it find photos that look the same but were edited or resized?
No — and that's correct behavior. Content hashing finds files that are byte-for-byte identical. A resized, re-compressed, cropped, or re-exported image has different bytes and therefore a different hash, even though it looks the same. This tool catches exact duplicates (the same file copied twice), not visually similar or perceptually-near images. For that you'd need a perceptual-hash tool, which this is not.
Two copies of my document won't match — what happened?
Some byte differs. Common culprits: one copy was opened and re-saved (rewriting metadata or compression), one has a different line ending or a BOM if it's text, or one carries embedded metadata the other doesn't. The contents may look identical to you while the bytes aren't. If you want to compare only the visible content of, say, two images, strip metadata first and re-hash.
Can I dedup a whole folder in one go?
Not in a single browser run — each run fingerprints the first dropped file only. Fingerprint each file and collect the digests into a spreadsheet, then sort on the hash column to surface duplicate groups. For true batch fingerprinting of many files at once, use the server-safe runner path, which scripts the same hashing without files leaving your machine.
Are my files uploaded when I hash them?
No. Each file is read into memory in your browser and digested locally with the Web Crypto API — nothing is uploaded. That's why you can safely dedup a private photo library, a client's confidential documents, or an unpublished dataset. The only server-side record for signed-in users is a usage counter, never file content.
Is a SHA-256 match ever a false positive?
For practical purposes, no. SHA-256 has 2^256 possible outputs; the chance of two different files you actually own colliding by accident is astronomically smaller than a hardware failure silently corrupting your comparison. A SHA-256 match means the files are the same bytes. (Deliberate adversarial collisions are a separate concern and aren't relevant to deduplicating your own files.)
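To put a number on "astronomically smaller", the standard birthday (union) bound says that among n files the chance of any accidental collision is at most n(n-1)/2 divided by 2^256. The sketch below evaluates that bound exactly; it assumes SHA-256 behaves like an ideal hash with uniformly random outputs, and the one-billion-file library is an arbitrary illustration.

```python
from fractions import Fraction


def collision_upper_bound(n_files: int, bits: int = 256) -> Fraction:
    """Birthday (union) bound: P(any accidental collision) <= C(n, 2) / 2**bits."""
    pairs = n_files * (n_files - 1) // 2  # number of distinct file pairs
    return Fraction(pairs, 2 ** bits)


# A hypothetical library of one billion files.
bound = float(collision_upper_bound(10 ** 9))  # about 4.3e-60
```

Even at that scale the bound is dozens of orders of magnitude below any realistic hardware error rate, which is why a match can be treated as proof of identical bytes.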
Do empty files all count as duplicates of each other?
Yes, because they genuinely are byte-identical — every zero-byte file produces the same digest (e3b0c4… for SHA-256). That's not a bug. Just be aware that a pile of empty placeholder files will all collapse into one "duplicate group," which usually isn't what you care about.
Can I compare files in different folders or on different drives?
Yes — the hash doesn't depend on where a file lives. Fingerprint a file from one folder and a file from another (or from a USB drive, a download, a backup), and if the SHA-256 matches, they're identical regardless of path. Location, drive, and filesystem are irrelevant; only the bytes matter.
What's the difference between this and the file integrity monitor?
This tool gives you one file's fingerprint, which you then compare yourself against other fingerprints — ideal for building an index across many files. file-integrity-monitor is purpose-built for exactly two files: drop both and it tells you immediately whether they're byte-identical, plus the first byte offset where they differ. Use this for many-file dedup, the integrity monitor for a focused two-file comparison.
How big a file can I fingerprint for dedup?
Up to your tier's security file-size limit: Free 10 MB, Pro 100 MB, Pro-media 500 MB, Developer 2 GB. The whole file is read into memory before hashing — there's no streaming — so very large media files are bounded by both the tier cap and your browser's available memory. A file over the cap is rejected before any digest runs.
Can I automate dedup hashing in a script?
Yes. The fingerprinter is server-safe, so it runs through a paired @jadapps/runner without files leaving your machine. GET /api/v1/tools/multi-hash-fingerprinter returns the schema; POST each file to the local runner at http://127.0.0.1:9789/v1/tools/multi-hash-fingerprinter/run and collect the sha256 fields. The server-safe response also includes sizeBytes. Install the runner from /docs/runner.
Privacy first
Every JAD Security operation runs entirely in your browser. Files, passwords, and PGP private keys never leave your device — verified by zero outbound network requests during processing.