Hash an Email Column in a CSV for Deterministic Matching

How to hash an email column for deterministic matching

Step 1
Agree on the salt and normalisation with the other party — Both sides must use the identical salt or the tokens won't match. Also agree on normalisation — at minimum lowercase and trim the email — because Jane@Acme.com and jane@acme.com hash to different tokens. Pin down both before anyone runs the tool.
Step 2
Normalise the email column first — The anonymizer hashes the value as-is; it does not lowercase or trim for you. Run csv-case-converter to lowercase the email column and csv-whitespace-trimmer to strip stray spaces, so both parties feed identical inputs into the hash.
Step 3
Drop the normalised CSV onto the anonymizer — It parses in your browser — the plaintext list is never uploaded. If auto-detect pre-filled an email hash rule, great; otherwise add one explicitly.
Step 4
Add a hash rule on the email column and set the shared salt — Add a rule: column = your email header, strategy = hash. Type the agreed shared salt into the Hash salt field (or set a per-rule salt). The salt is prepended to every value before hashing, so it must match the other party's exactly — byte for byte.
Step 5
Anonymize and download the token file — Click Anonymize CSV. The email column is now 16-char hex tokens; other columns pass through unless you added rules. Download <name>.anon.csv. This is the file you can safely exchange — it has tokens, not addresses.
Step 6
Both sides exchange token files and join on the token — Each party sends only its hashed file. Join the two on the token column to find the overlap. Because the hash is deterministic and the salt matched, a shared customer's token is identical on both sides. Mismatches mean different salt or different normalisation — re-check both.
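
The normalisation agreed in Step 1 and applied in Step 2 can be pinned down in a few lines, so both parties can check they mean the same thing by "lowercase and trim". This is a minimal Python sketch, not the behaviour of csv-case-converter or csv-whitespace-trimmer; the helper names `normalise_email` and `normalise_column` are hypothetical.

```python
import csv
import io


def normalise_email(value: str) -> str:
    """Strip surrounding whitespace, then lowercase."""
    return value.strip().lower()


def normalise_column(csv_text: str, column: str) -> str:
    """Return the CSV with one column normalised; other columns are untouched."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = normalise_email(row[column])
        writer.writerow(row)
    return out.getvalue()


sample = "email,ltv_band\n Jane@Acme.com ,high\nzoe@globex.com,low\n"
normalised = normalise_column(sample, "email")
print(normalised)
```

Running the same function over both parties' files before hashing removes case and whitespace as a source of mismatches.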

What makes two tokens match

All four conditions must hold. If any one differs between parties, the same email produces different tokens and the match fails.

Factor	Must be identical?	How to guarantee it
The raw value (after normalisation)	Yes	Lowercase + trim both sides before hashing
The salt	Yes	Agree on one shared secret salt, byte-for-byte
The strategy	Yes (must be hash)	Only `hash` is deterministic across files; not mask/sequential
Encoding of the email text	Yes	Use the same encoding so the byte content matches (UTF-8 both sides)

Why the other strategies don't work for matching

Matching needs same-input-same-output across two separately-processed files. Only hash provides it.

Strategy	Deterministic across files?	Usable for matching?
hash	Yes — same value + salt → same token	Yes
mask	Yes but lossy — collisions likely (`****` for many values)	No — too many collisions
redact	All values become `[REDACTED]`	No — every row identical
sequential	Position-based; differs by file order	No — not value-based
drop	Column removed	No — nothing to match on

Tier limits

Browser-side CSV limits by tier. The CSV Anonymizer is a Pro tool.

Limit	Free	Pro
Max file size	2 MB	100 MB
Max rows	500	100,000
Batch files	2	10

Cookbook

Matching recipes. Tokens are illustrative of the deterministic 16-char hex format; normalise before hashing for reliable matches.

Two parties match shared customers without sharing emails

Example

Both sides normalise (lowercase+trim), then hash the email with the agreed salt. They exchange only the token columns and join on the token.

Party A (salt: shared-2026):
  jane@acme.com → a3f10b9c4e7d2118
  zoe@globex.com → 6b21fe09c7a4d530

Party B (salt: shared-2026):
  jane@acme.com → a3f10b9c4e7d2118
  amir@umbrella.com → 88c4...

Join on token → jane@acme.com is the shared customer
(a3f10b9c4e7d2118 present in both).
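
The join itself is a set intersection on the token column. The sketch below assumes each party's exchanged file keeps the header `email` for the token column. The second token in Party B's file is a made-up placeholder, because the recipe above truncates it.

```python
import csv
import io


def tokens_in(csv_text: str, column: str) -> set[str]:
    """Collect the non-empty tokens from one column of a CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader if row[column]}


party_a = "email\na3f10b9c4e7d2118\n6b21fe09c7a4d530\n"
# Placeholder token for a customer that only Party B has.
party_b = "email\na3f10b9c4e7d2118\n0000000000000000\n"

shared = tokens_in(party_a, "email") & tokens_in(party_b, "email")
print(sorted(shared))
```

Only tokens present in both files survive the intersection, so neither side learns anything about the other's unmatched rows beyond their tokens.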

Normalisation mismatch causes a missed match

Example

If one side forgets to lowercase, the same person hashes to two different tokens and the match is lost. This is the #1 failure mode.

Party A (no normalisation): Jane@Acme.com → 1c77...e9
Party B (lowercased):       jane@acme.com → a3f1...18

1c77...e9 ≠ a3f1...18 → Jane is NOT counted as shared.

Fix: both lowercase + trim BEFORE hashing
(csv-case-converter then csv-whitespace-trimmer).

Different salts produce non-matching token spaces

Example

The salt must be identical. A different salt yields entirely different tokens for the same email, so nothing matches.

Party A salt 'alpha': jane@acme.com → 4d90...
Party B salt 'beta' : jane@acme.com → e2b7...

4d90... ≠ e2b7... → zero matches even though Jane is in both.

Fix: agree one shared salt; exchange it over a secure channel.

Hash a phone instead of email

Example

The same deterministic behaviour works for any key. Normalise the phone format first (strip the +, spaces, parentheses, and hyphens so only digits remain) so both sides hash identical strings.

Normalise: +1 (555) 123-4567 → 15551234567 (csv-find-replace)

Rule: phone → hash (salt: shared-2026)
  15551234567 → 7a02c9e1b4f3d680

Both parties normalise the same way → tokens match on the phone.
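
A digits-only rule is easy to state precisely. This is a small Python sketch of the normalisation shown above, not the behaviour of csv-find-replace; `normalise_phone` is a hypothetical helper name.

```python
import re


def normalise_phone(raw: str) -> str:
    """Keep ASCII digits only, dropping +, spaces, parentheses and hyphens."""
    return re.sub(r"[^0-9]", "", raw)


print(normalise_phone("+1 (555) 123-4567"))
```

Both parties also need to agree on whether the country code is always present, since 5551234567 and 15551234567 are different inputs.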

Keep a join key plus an analytic column for the match output

Example

Hash the email but leave a non-PII metric column untouched, so after matching you can compare metrics for the shared cohort.

Input:
email,ltv_band
jane@acme.com,high
zoe@globex.com,low

Rule: email → hash (salt: shared-2026); ltv_band → no rule

Output:
email,ltv_band
a3f10b9c4e7d2118,high
6b21fe09c7a4d530,low

→ exchange this; matched rows let you compare ltv_band by cohort.
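
Once both token files are in hand, the metric columns can be lined up per matched token. In this sketch the counterparty file and its `ltv_band` values are invented for illustration, and the shared header names are an assumption.

```python
import csv
import io


def index_by_token(csv_text: str, key: str) -> dict[str, dict[str, str]]:
    """Map each token to its full row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


ours = index_by_token(
    "email,ltv_band\na3f10b9c4e7d2118,high\n6b21fe09c7a4d530,low\n", "email"
)
# Hypothetical counterparty file with one overlapping token.
theirs = index_by_token("email,ltv_band\na3f10b9c4e7d2118,medium\n", "email")

matched = [
    (tok, ours[tok]["ltv_band"], theirs[tok]["ltv_band"])
    for tok in sorted(ours.keys() & theirs.keys())
]
print(matched)
```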

Errors and edge cases

Known failure modes and silent errors for this workflow. Find the entry that matches your symptom and apply the fix it describes.

Normalisation differences silently break matches

Match failure

The hash is computed on the value exactly as it appears. Jane@Acme.com, jane@acme.com with a trailing space, and jane@acme.com are three different inputs producing three different tokens. The tool does not lowercase or trim for you. Both parties must normalise identically before hashing (lowercase with csv-case-converter, trim with csv-whitespace-trimmer), or shared customers will be missed.

Salt mismatch yields zero matches

Match failure

The salt is prepended to every value before hashing, so any difference (even a trailing space in the salt) puts the two parties in different token spaces and nothing matches. Agree on the exact salt string and confirm it byte-for-byte. If you see zero overlap where you expected some, suspect the salt first.

FNV-1a is not cryptographically secure

Security limitation

The hash is FNV-1a, chosen for speed and determinism, not security. With a known or guessable salt, a party could brute-force tokens back to common emails. Keep the salt secret to both parties only, exchange it over a secure channel, and don't treat the token file as safe to publish openly. For adversarial settings requiring cryptographic guarantees, use a vetted private-set-intersection protocol, not this tool.

Mask or sequential used by mistake instead of hash

Wrong strategy

Only hash is deterministic and value-based across separately-processed files. Mask collides heavily (many values become identical star patterns), redact makes every row the same, and sequential depends on row order — none can match across two files. If your join returns garbage, confirm both sides used the hash strategy on the same column.

Low-entropy keys are easy to reverse

Security limitation

Hashing a column with few possible values (a 5-digit zip, a small status set) lets the other party precompute tokens for every possible value. The salt does not prevent this, because the counterparty holds the same salt. Deterministic hashing only hides high-entropy values such as full emails well. Don't rely on hashing to protect low-cardinality fields from your matching counterparty.
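
To make the risk concrete, the sketch below enumerates every 5-digit zip and builds a token-to-value lookup in about a second. The `token` function is a stand-in built from the FAQ's description (salt prepended, two FNV-1a digests concatenated). Its byte-level details are assumptions, so its output is not expected to equal the tool's tokens, but the attack works the same way for any fast deterministic hash with a known salt.

```python
from collections import defaultdict


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def token(value: str, salt: str) -> str:
    """Stand-in token: assumed layout, not the tool's exact algorithm."""
    salted = salt + value
    forward = fnv1a_32(salted.encode("utf-8"))
    backward = fnv1a_32(salted[::-1].encode("utf-8"))
    return f"{forward:08x}{backward:08x}"


SALT = "shared-2026"  # the counterparty holds this too

# Precompute a token for every possible 5-digit zip.
candidates = defaultdict(list)
for n in range(100_000):
    zip_code = f"{n:05d}"
    candidates[token(zip_code, SALT)].append(zip_code)

observed = token("90210", SALT)  # a token seen in the exchanged file
recovered = candidates[observed]
print(recovered)
```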

Adding a rule turns off auto-detect

Behaviour to know

Auto-detect runs only with zero explicit rules. The moment you add your email hash rule, auto-detect stops, so any other PII columns you assumed were being hashed pass through in plaintext into the file you exchange. Add explicit rules (or drop) for every column you don't want to share, and check the applied-rules chips.

Encoding differences change the bytes hashed

Match failure

If one party's file is UTF-8 and another's mangles an accented address into a different byte sequence, the inputs differ and the tokens won't match for international addresses. Ensure both sides export and process as UTF-8. ASCII-only emails are unaffected.
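
The byte difference is easy to see directly. This short Python sketch uses an invented address with one accented character and encodes it two ways; Latin-1 is chosen here only as an example of a non-UTF-8 encoding.

```python
address = "jos\u00e9@example.com"  # invented address containing an accented e

utf8_bytes = address.encode("utf-8")
latin1_bytes = address.encode("latin-1")

print(len(utf8_bytes), len(latin1_bytes))  # the accented letter takes 2 bytes in UTF-8, 1 in Latin-1
print(utf8_bytes == latin1_bytes)

ascii_address = "jane@acme.com"
print(ascii_address.encode("utf-8") == ascii_address.encode("latin-1"))
```

Because the hash sees bytes, the two encodings of the accented address are different inputs, while the ASCII-only address is identical under both.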

Free tier blocks a real matching list

Blocked

This is a Pro tool; free CSV limits are 2 MB / 500 rows. A real customer list usually exceeds that. On Pro you get 100 MB / 100,000 rows. For larger lists, split with csv-row-splitter, hash each chunk with the same salt, and concatenate — the per-row tokens are independent of chunking.

Empty cells hash to a value too

Behaviour to know

A blank email cell is hashed as the salt-plus-empty-string, producing a consistent token for 'no email'. If both files have blanks, those rows will 'match' on the empty-string token. Filter out rows with no key before hashing (use csv-empty-row-remover or csv-column-filter) to avoid spurious matches.
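
Filtering keyless rows before hashing can be sketched as follows. This is an illustration in Python, not the behaviour of csv-empty-row-remover or csv-column-filter. `drop_blank_keys` is a hypothetical helper, and treating whitespace-only cells as blank is an assumption you should agree on with the other party.

```python
import csv
import io


def drop_blank_keys(csv_text: str, column: str) -> str:
    """Return the CSV without rows whose key column is empty or whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if (row[column] or "").strip():
            writer.writerow(row)
    return out.getvalue()


sample = "email,ltv_band\njane@acme.com,high\n,low\n   ,mid\n"
filtered = drop_blank_keys(sample, "email")
print(filtered)
```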

Frequently asked questions

Is the email hashing deterministic?

Yes — that's the core property. The same email plus the same salt always produces the same 16-character hex token. That's what lets two parties hash their own lists independently and then match on the token. The hash is two FNV-1a digests of the salted value (forward and reversed) concatenated, computed entirely from the input, with no randomness.
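
One reading of that description can be sketched in Python. The 32-bit FNV-1a constants are the standard ones. Everything else is an assumption made for illustration: UTF-8 encoding, reversing the salted string character by character, and the order in which the two digests are concatenated. Tokens from this sketch are therefore not guaranteed to equal the tool's output.

```python
FNV_OFFSET_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193   # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def token(value: str, salt: str) -> str:
    """Assumed layout: 8 hex chars for salt+value, then 8 for that string reversed."""
    salted = salt + value  # the salt is prepended, as the guide states
    forward = fnv1a_32(salted.encode("utf-8"))
    backward = fnv1a_32(salted[::-1].encode("utf-8"))
    return f"{forward:08x}{backward:08x}"


salt = "shared-2026"
first = token("jane@acme.com", salt)
second = token("jane@acme.com", salt)   # same input and salt give the same token
changed = token("Jane@acme.com", salt)  # a one-character case change gives a different token
print(first, second, changed)
```

Nothing in the computation depends on row position, time, or randomness, which is why two parties running it independently arrive at the same token for the same normalised value and salt.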

Why do my two files not match even though I know there's overlap?

Almost always one of two things: the salt differs between the two runs, or the email wasn't normalised identically (case or whitespace). The hash is computed on the exact value, so Jane@Acme.com and jane@acme.com differ. Confirm both sides used the exact same salt and both lowercased + trimmed the email before hashing.

Does the tool lowercase and trim the email for me?

No — it hashes the value exactly as it appears. You must normalise first. Lowercase the email column with csv-case-converter and strip whitespace with csv-whitespace-trimmer before adding the hash rule, and make sure the other party applies the identical normalisation.

How secret does the salt need to be?

Both matching parties share it, but no one else should have it. The salt is prepended before hashing, so without it an outsider can't precompute tokens for known emails. Exchange it over a secure channel. Note that because it's a fast FNV-1a hash and the salt is shared between you, the counterparty could in theory brute-force low-entropy values — full emails are reasonably safe, low-cardinality fields are not.

Can I match on something other than email?

Yes. Hash any key column the same way: phone, customer id, device id. Just normalise the format consistently on both sides first (e.g. strip the +, spaces, and punctuation from phones with csv-find-replace) so the inputs are byte-identical before hashing.

Is this a secure private-set-intersection protocol?

No. Deterministic salted hashing is a lightweight, practical approach, but it's not a cryptographic PSI protocol and FNV-1a isn't a secure hash. It's fine for cooperative parties matching reasonably high-entropy keys with a shared secret salt. For adversarial settings or regulatory requirements, use a vetted PSI implementation rather than this tool.

What if some rows have a blank email?

A blank value still gets hashed (salt + empty string), producing one consistent 'no email' token. If both files contain blanks, those rows will appear to match each other. Remove keyless rows before hashing with csv-empty-row-remover or filter them out with csv-column-filter to avoid false matches.

Can I use a different salt for a second key in the same file?

Yes — each rule can carry its own salt, which overrides the global salt. So you can hash email with salt A (matched against counterparty X) and phone with salt B (matched against counterparty Y) in the same run. The global Hash salt field applies to any hash rule that doesn't set its own.

Will the raw email list be uploaded anywhere?

No. PapaParse parses the file in your browser and the hashing runs there too; only the tokenised .anon.csv is produced. Your plaintext list never reaches a server, which is the whole point of doing the hashing client-side before any exchange. A single no-content usage counter is kept server-side for signed-in stats.

Does international/accented email text affect matching?

It can, if the two files encode the characters differently. Hashing operates on the byte content, so an accented local part must be the same encoding (UTF-8) on both sides to produce the same token. Pure-ASCII emails are unaffected. Standardise encoding before hashing if your list has international addresses.

Can I keep extra columns for analysis after matching?

Yes — only columns with a rule are transformed. Hash the email but leave non-PII metric columns ruleless, and they pass through unchanged. After matching on the token, you can compare those metrics across the shared cohort. Just be careful not to leave a column that itself re-identifies people.

Can I automate the hashing for a recurring match job?

Yes — GET /api/v1/tools/csv-anonymizer returns the schema; pair the @jadapps/runner and POST to 127.0.0.1:9789/v1/tools/csv-anonymizer/run with a fixed salt. It runs locally so the raw list never reaches JAD's servers. A common setup: both partners run a scheduled normalise → hash → exchange-tokens flow weekly to refresh the overlap.

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.

