How to hash an email column for deterministic matching
- Step 1: Agree on the salt and normalisation with the other party. Both sides must use the identical salt or the tokens won't match. Also agree on normalisation (at minimum, lowercase and trim the email), because Jane@Acme.com and jane@acme.com hash to different tokens. Pin down both before anyone runs the tool.
- Step 2: Normalise the email column first. The anonymizer hashes the value as-is; it does not lowercase or trim for you. Run csv-case-converter to lowercase the email column and csv-whitespace-trimmer to strip stray spaces, so both parties feed identical inputs into the hash.
- Step 3: Drop the normalised CSV onto the anonymizer. It parses in your browser, so the plaintext list is never uploaded. If auto-detect pre-filled an email hash rule, great; otherwise add one explicitly.
- Step 4: Add a hash rule on the email column and set the shared salt. Add a rule: column = your email header, strategy = hash. Type the agreed shared salt into the Hash salt field (or set a per-rule salt). The salt is prepended to every value before hashing, so it must match the other party's exactly, byte for byte.
- Step 5: Anonymize and download the token file. Click Anonymize CSV. The email column is now 16-char hex tokens; other columns pass through unless you added rules. Download <name>.anon.csv. This is the file you can safely exchange: it has tokens, not addresses.
- Step 6: Both sides exchange token files and join on the token. Each party sends only its hashed file. Join the two on the token column to find the overlap. Because the hash is deterministic and the salt matched, a shared customer's token is identical on both sides. Mismatches mean different salt or different normalisation, so re-check both.
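The steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's code. It assumes the 32-bit FNV-1a variant (two 8-hex-char digests give the 16-char token), UTF-8 input bytes, character-level reversal for the second digest, and forward-then-reversed order. The tool may differ on any of these, so tokens from this sketch should not be expected to equal the tool's output. The names `fnv1a_32`, `normalise`, `token` and `hash_column` are hypothetical.

```python
import csv
import io

FNV_OFFSET = 0x811C9DC5  # 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193   # 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Plain 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def normalise(email: str) -> str:
    """Step 2: trim and lowercase so both parties hash identical strings."""
    return email.strip().lower()


def token(value: str, salt: str) -> str:
    """Steps 4-5: prepend the salt, hash forward and reversed, join as hex.

    Assumption: UTF-8 bytes and character-level reversal.
    """
    salted = salt + value
    forward = fnv1a_32(salted.encode("utf-8"))
    backward = fnv1a_32(salted[::-1].encode("utf-8"))
    return f"{forward:08x}{backward:08x}"


def hash_column(csv_text: str, column: str, salt: str) -> str:
    """Replace one column with tokens; every other column passes through."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = token(normalise(row[column]), salt)
        writer.writerow(row)
    return out.getvalue()


# Party A's raw value has stray case and spaces; party B's is already clean.
party_a = "email,ltv_band\n Jane@Acme.com ,high\n"
party_b = "email,ltv_band\njane@acme.com,low\n"
print(hash_column(party_a, "email", "shared-2026"))
print(hash_column(party_b, "email", "shared-2026"))
```

Both printed files carry the same token in the email column, because normalisation runs before the salted hash on each side.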
What makes two tokens match
Every factor in the table below must be identical on both sides. If any one differs between parties, the same email produces different tokens and the match fails.
| Factor | Must be identical? | How to guarantee it |
|---|---|---|
| The raw value (after normalisation) | Yes | Lowercase + trim both sides before hashing |
| The salt | Yes | Agree on one shared secret salt, byte-for-byte |
| The strategy | Yes (must be hash) | Only hash is deterministic across files; not mask/sequential |
| Encoding of the email text | Yes | Use the same encoding so the byte content matches (UTF-8 both sides) |
Why the other strategies don't work for matching
Matching needs same-input-same-output across two separately-processed files. Only hash provides it.
| Strategy | Deterministic across files? | Usable for matching? |
|---|---|---|
| hash | Yes — same value + salt → same token | Yes |
| mask | Yes but lossy — collisions likely (**** for many values) | No — too many collisions |
| redact | All values become [REDACTED] | No — every row identical |
| sequential | Position-based; differs by file order | No — not value-based |
| drop | Column removed | No — nothing to match on |
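To make the table concrete, the sketch below uses two hypothetical stand-ins: `mask` (every character becomes a star) and `sequential` (ids assigned by row position). Neither is the tool's actual implementation; they only mimic the behaviour the table describes.

```python
def mask(value: str) -> str:
    """Hypothetical mask: replace every character with '*'."""
    return "*" * len(value)


def sequential(values: list[str]) -> dict[str, str]:
    """Hypothetical sequential strategy: ids follow row order, not content."""
    return {value: f"id-{i}" for i, value in enumerate(values, start=1)}


# Mask: two different customers become indistinguishable.
print(mask("jane@acme.com") == mask("john@acme.com"))  # True

# Sequential: the same customer gets a different id in each file,
# because the two files list her at different row positions.
file_a = sequential(["jane@acme.com", "zoe@globex.com"])
file_b = sequential(["zoe@globex.com", "jane@acme.com"])
print(file_a["jane@acme.com"], file_b["jane@acme.com"])  # id-1 id-2
```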
Tier limits
Browser-side CSV limits. The CSV Anonymizer is Pro.
| Limit | Free | Pro |
|---|---|---|
| Max file size | 2 MB | 100 MB |
| Max rows | 500 | 100,000 |
| Batch files | 2 | 10 |
Cookbook
Matching recipes. Tokens are illustrative of the deterministic 16-char hex format; normalise before hashing for reliable matches.
Two parties match shared customers without sharing emails
Example: Both sides normalise (lowercase + trim), then hash the email with the agreed salt. They exchange only the token columns and join on the token.

    Party A (salt: shared-2026):
      jane@acme.com → a3f10b9c4e7d2118
      zoe@globex.com → 6b21fe09c7a4d530
    Party B (salt: shared-2026):
      jane@acme.com → a3f10b9c4e7d2118
      amir@umbrella.com → 88c4...
    Join on token → jane@acme.com is the shared customer (a3f10b9c4e7d2118 present in both).
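The join itself is a set intersection on the token column. A minimal sketch, using the illustrative tokens from this recipe; the helper name `overlap` is hypothetical, and party B's second token is truncated in the recipe, so it is kept as the literal string shown there:

```python
def overlap(tokens_a, tokens_b):
    """Split two token lists into shared and one-sided tokens."""
    a, b = set(tokens_a), set(tokens_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


party_a_tokens = ["a3f10b9c4e7d2118", "6b21fe09c7a4d530"]
party_b_tokens = ["a3f10b9c4e7d2118", "88c4..."]  # second token truncated in the recipe

result = overlap(party_a_tokens, party_b_tokens)
print(result["shared"])  # ['a3f10b9c4e7d2118']
```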
Normalisation mismatch causes a missed match
Example: If one side forgets to lowercase, the same person hashes to two different tokens and the match is lost. This is the #1 failure mode.

    Party A (no normalisation): Jane@Acme.com → 1c77...e9
    Party B (lowercased): jane@acme.com → a3f1...18
    1c77...e9 ≠ a3f1...18 → Jane is NOT counted as shared.
    Fix: both lowercase + trim BEFORE hashing (csv-case-converter then csv-whitespace-trimmer).
Different salts produce non-matching token spaces
Example: The salt must be identical. A different salt yields entirely different tokens for the same email, so nothing matches.

    Party A salt 'alpha': jane@acme.com → 4d90...
    Party B salt 'beta' : jane@acme.com → e2b7...
    4d90... ≠ e2b7... → zero matches even though Jane is in both.
    Fix: agree one shared salt; exchange it over a secure channel.
Hash a phone instead of email
Example: The same deterministic behaviour works for any key. Normalise the phone format first (strip +, spaces, brackets and hyphens) so both sides hash identical strings.

    Normalise: +1 (555) 123-4567 → 15551234567 (csv-find-replace)
    Rule: phone → hash (salt: shared-2026)
    15551234567 → 7a02c9e1b4f3d680
    Both parties normalise the same way → tokens match on the phone.
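If you normalise phones outside the tool, a digits-only rule reproduces the transformation in this recipe. A minimal sketch; `normalise_phone` is a hypothetical helper, and keeping only ASCII digits is one possible convention that both parties would need to agree on:

```python
import re


def normalise_phone(raw: str) -> str:
    """Keep ASCII digits only, dropping +, spaces, brackets and hyphens."""
    return re.sub(r"[^0-9]", "", raw)


print(normalise_phone("+1 (555) 123-4567"))  # 15551234567
print(normalise_phone("1 555 123 4567"))     # 15551234567
```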
Keep a join key plus an analytic column for the match output
Example: Hash the email but leave a non-PII metric column untouched, so after matching you can compare metrics for the shared cohort.

    Input:
      email,ltv_band
      jane@acme.com,high
      zoe@globex.com,low
    Rule: email → hash (salt: shared-2026); ltv_band → no rule
    Output:
      email,ltv_band
      a3f10b9c4e7d2118,high
      6b21fe09c7a4d530,low
    → exchange this; matched rows let you compare ltv_band by cohort.
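Once both token files are in hand, the metric comparison is a keyed join. A minimal sketch using the output from this recipe as party A's file; party B's file and its `medium` band are invented for illustration, and the helper names are hypothetical:

```python
import csv
import io


def by_token(csv_text: str, key: str = "email") -> dict:
    """Index the rows of a token file by the hashed key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def compare_metric(a_csv: str, b_csv: str, metric: str = "ltv_band") -> dict:
    """For each shared token, pair party A's metric with party B's."""
    a, b = by_token(a_csv), by_token(b_csv)
    return {t: (a[t][metric], b[t][metric]) for t in sorted(a.keys() & b.keys())}


party_a = "email,ltv_band\na3f10b9c4e7d2118,high\n6b21fe09c7a4d530,low\n"
party_b = "email,ltv_band\na3f10b9c4e7d2118,medium\n"  # hypothetical counterpart file

print(compare_metric(party_a, party_b))
# {'a3f10b9c4e7d2118': ('high', 'medium')}
```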
Errors and edge cases
Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.
Normalisation differences silently break matches
Match failure: The hash is computed on the value exactly as it appears. Jane@Acme.com, jane@acme.com with a trailing space, and jane@acme.com are three different inputs producing three different tokens. The tool does not lowercase or trim for you. Both parties must normalise identically (lowercase with csv-case-converter and trim with csv-whitespace-trimmer) before hashing, or shared customers will be missed.
Salt mismatch yields zero matches
Match failure: The salt is prepended to every value before hashing, so any difference (even a trailing space in the salt) puts the two parties in different token spaces and nothing matches. Agree on the exact salt string and confirm it byte-for-byte. If you see zero overlap where you expected some, suspect the salt first.
FNV-1a is not cryptographically secure
Security limitation: The hash is FNV-1a, chosen for speed and determinism, not security. With a known or guessable salt, a party could brute-force tokens back to common emails. Keep the salt secret to both parties only, exchange it over a secure channel, and don't treat the token file as safe to publish openly. For adversarial settings requiring cryptographic guarantees, use a vetted private-set-intersection protocol, not this tool.
Mask or sequential used by mistake instead of hash
Wrong strategy: Only hash is deterministic and value-based across separately-processed files. Mask collides heavily (many values become identical star patterns), redact makes every row the same, and sequential depends on row order; none can match across two files. If your join returns garbage, confirm both sides used the hash strategy on the same column.
Low-entropy keys are easy to reverse
Security limitation: Hashing a column with few possible values (a 5-digit zip, a small status set) lets the other party precompute tokens for every possible value even with the salt, since the salt is shared. Deterministic hashing only hides high-entropy values like full emails well. Don't rely on hashing to protect low-cardinality fields from your matching counterparty.
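The precomputation attack is short enough to show in full. The sketch below builds a reverse lookup for every 5-digit zip using the shared salt. It assumes 32-bit FNV-1a over UTF-8 bytes with a forward and a reversed digest; `token` and `build_lookup` are hypothetical names, not the tool's code.

```python
FNV_OFFSET, FNV_PRIME = 0x811C9DC5, 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a."""
    h = FNV_OFFSET
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & 0xFFFFFFFF
    return h


def token(value: str, salt: str) -> str:
    """Assumed token layout: forward digest then reversed digest, as hex."""
    salted = salt + value
    forward = fnv1a_32(salted.encode("utf-8"))
    backward = fnv1a_32(salted[::-1].encode("utf-8"))
    return f"{forward:08x}{backward:08x}"


def build_lookup(salt: str) -> dict:
    """Hash all 100,000 possible zips once; map each token back to its zip."""
    return {token(f"{n:05d}", salt): f"{n:05d}" for n in range(100_000)}


# The counterparty knows the salt, so the whole value space is enumerable.
lookup = build_lookup("shared-2026")
received = token("90210", "shared-2026")  # a token seen in the exchanged file
print(lookup[received])
```

The table takes about a second to build in plain Python, which is why a shared salt offers no protection for low-cardinality columns.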
Adding a rule turns off auto-detect
Behaviour to know: Auto-detect runs only with zero explicit rules. The moment you add your email hash rule, auto-detect stops, so any other PII columns you assumed were being hashed pass through in plaintext into the file you exchange. Add explicit rules (or drop) for every column you don't want to share, and check the applied-rules chips.
Encoding differences change the bytes hashed
Match failure: If one party's file is UTF-8 and the other's encodes an accented address as a different byte sequence, the inputs differ and the tokens won't match for international addresses. Ensure both sides export and process as UTF-8. ASCII-only emails are unaffected.
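The byte-level difference is easy to inspect. A minimal sketch with an illustrative address; `reencode_to_utf8` is a hypothetical helper, and Latin-1 stands in for whatever legacy encoding the other file happens to use:

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


email = "jos\u00e9@example.com"  # accented local part (é)
as_utf8 = email.encode("utf-8")
as_latin1 = email.encode("latin-1")

print(as_utf8 == as_latin1)                               # False: é is 2 bytes vs 1
print(reencode_to_utf8(as_latin1, "latin-1") == as_utf8)  # True once standardised

ascii_email = "jane@acme.com"
print(ascii_email.encode("utf-8") == ascii_email.encode("latin-1"))  # True
```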
Free tier blocks a real matching list
Blocked: This is a Pro tool; free CSV limits are 2 MB / 500 rows. A real customer list usually exceeds that. On Pro you get 100 MB / 100,000 rows. For larger lists, split with csv-row-splitter, hash each chunk with the same salt, and concatenate; the per-row tokens are independent of chunking.
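Chunking is safe because each token depends only on its own row and the salt. The sketch below checks that property under the same assumptions as elsewhere on this page (32-bit FNV-1a, UTF-8 bytes, forward plus reversed digest); `token` and `chunks` are hypothetical names.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a."""
    h = 0x811C9DC5
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF
    return h


def token(value: str, salt: str) -> str:
    """Assumed token layout: forward digest then reversed digest, as hex."""
    salted = salt + value
    return (f"{fnv1a_32(salted.encode('utf-8')):08x}"
            f"{fnv1a_32(salted[::-1].encode('utf-8')):08x}")


def chunks(rows: list, size: int):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


salt = "shared-2026"
emails = [f"user{n}@example.com" for n in range(10)]

whole = [token(e, salt) for e in emails]
pieces = [token(e, salt) for chunk in chunks(emails, 3) for e in chunk]
print(whole == pieces)  # True
```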
Empty cells hash to a value too
Behaviour to know: A blank email cell is hashed as the salt-plus-empty-string, producing a consistent token for 'no email'. If both files have blanks, those rows will 'match' on the empty-string token. Filter out rows with no key before hashing (use csv-empty-row-remover or csv-column-filter) to avoid spurious matches.
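If you pre-process the file in a script rather than with the tools named above, the filter is a single pass. A minimal sketch; `drop_keyless` is a hypothetical helper, and it treats whitespace-only cells as blank:

```python
import csv
import io


def drop_keyless(rows: list, key: str) -> list:
    """Keep only rows whose key cell is non-empty after trimming."""
    return [row for row in rows if (row.get(key) or "").strip()]


csv_text = "email,ltv_band\njane@acme.com,high\n,low\n   ,medium\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

kept = drop_keyless(rows, "email")
print([row["email"] for row in kept])  # ['jane@acme.com']
```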
Frequently asked questions
Is the email hashing deterministic?
Yes — that's the core property. The same email plus the same salt always produces the same 16-character hex token. That's what lets two parties hash their own lists independently and then match on the token. The hash is two FNV-1a digests of the salted value (forward and reversed) concatenated, computed entirely from the input, with no randomness.
Why do my two files not match even though I know there's overlap?
Almost always one of two things: the salt differs between the two runs, or the email wasn't normalised identically (case or whitespace). The hash is computed on the exact value, so Jane@Acme.com and jane@acme.com differ. Confirm both sides used the exact same salt and both lowercased + trimmed the email before hashing.
Does the tool lowercase and trim the email for me?
No — it hashes the value exactly as it appears. You must normalise first. Lowercase the email column with csv-case-converter and strip whitespace with csv-whitespace-trimmer before adding the hash rule, and make sure the other party applies the identical normalisation.
How secret does the salt need to be?
Both matching parties share it, but no one else should have it. The salt is prepended before hashing, so without it an outsider can't precompute tokens for known emails. Exchange it over a secure channel. Note that because it's a fast FNV-1a hash and the salt is shared between you, the counterparty could in theory brute-force low-entropy values — full emails are reasonably safe, low-cardinality fields are not.
Can I match on something other than email?
Yes — hash any key column the same way: phone, customer id, device id. Just normalise the format consistently on both sides first (e.g. strip + and spaces from phones with csv-find-replace) so the inputs are byte-identical before hashing.
Is this a secure private-set-intersection protocol?
No. Deterministic salted hashing is a lightweight, practical approach, but it's not a cryptographic PSI protocol and FNV-1a isn't a secure hash. It's fine for cooperative parties matching reasonably high-entropy keys with a shared secret salt. For adversarial settings or regulatory requirements, use a vetted PSI implementation rather than this tool.
What if some rows have a blank email?
A blank value still gets hashed (salt + empty string), producing one consistent 'no email' token. If both files contain blanks, those rows will appear to match each other. Remove keyless rows before hashing with csv-empty-row-remover or filter them out with csv-column-filter to avoid false matches.
Can I use a different salt for a second key in the same file?
Yes — each rule can carry its own salt, which overrides the global salt. So you can hash email with salt A (matched against counterparty X) and phone with salt B (matched against counterparty Y) in the same run. The global Hash salt field applies to any hash rule that doesn't set its own.
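The override logic amounts to "use the rule's salt if it has one, else the global salt". A minimal sketch; the rule dictionaries and the `effective_salt` helper are hypothetical and do not reflect the tool's real rule schema:

```python
from typing import Optional


def effective_salt(rule_salt: Optional[str], global_salt: str) -> str:
    """A per-rule salt wins; rules without one fall back to the global salt."""
    return rule_salt if rule_salt else global_salt


# Hypothetical rule shapes, for illustration only.
rules = [
    {"column": "email", "strategy": "hash", "salt": "salt-A"},
    {"column": "phone", "strategy": "hash", "salt": "salt-B"},
    {"column": "customer_id", "strategy": "hash"},  # no salt of its own
]

resolved = {r["column"]: effective_salt(r.get("salt"), "global-salt") for r in rules}
print(resolved)
# {'email': 'salt-A', 'phone': 'salt-B', 'customer_id': 'global-salt'}
```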
Will the raw email list be uploaded anywhere?
No. PapaParse parses and hashes the file in your browser; only the tokenised .anon.csv is produced. Your plaintext list never reaches a server, which is the whole point of doing the hashing client-side before any exchange. A single no-content usage counter is kept server-side for signed-in stats.
Does international/accented email text affect matching?
It can, if the two files encode the characters differently. Hashing operates on the byte content, so an accented local part must be the same encoding (UTF-8) on both sides to produce the same token. Pure-ASCII emails are unaffected. Standardise encoding before hashing if your list has international addresses.
Can I keep extra columns for analysis after matching?
Yes — only columns with a rule are transformed. Hash the email but leave non-PII metric columns ruleless, and they pass through unchanged. After matching on the token, you can compare those metrics across the shared cohort. Just be careful not to leave a column that itself re-identifies people.
Can I automate the hashing for a recurring match job?
Yes — GET /api/v1/tools/csv-anonymizer returns the schema; pair the @jadapps/runner and POST to 127.0.0.1:9789/v1/tools/csv-anonymizer/run with a fixed salt. It runs locally so the raw list never reaches JAD's servers. A common setup: both partners run a scheduled normalise → hash → exchange-tokens flow weekly to refresh the overlap.
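A scheduled job could build the local POST like this. Only the host, port and path come from the answer above. The `http` scheme, the JSON content type and every payload field below are assumptions made for illustration, so fetch the real schema from the GET endpoint before relying on any of them.

```python
import json
import urllib.request

# Path from the answer above; the http scheme is an assumption.
RUN_URL = "http://127.0.0.1:9789/v1/tools/csv-anonymizer/run"


def build_run_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the local runner."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        RUN_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical payload: field names are placeholders, not the real schema.
payload = {
    "salt": "shared-2026",
    "rules": [{"column": "email", "strategy": "hash"}],
}
request = build_run_request(payload)
print(request.get_method(), request.full_url)

# To actually send it (requires the runner to be listening locally):
# with urllib.request.urlopen(request) as response:
#     result = response.read()
```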
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.