Spotting Spam with Similarity Hashes

Published in ASecuritySite: When Bob Met Alice

5 min read · Dec 17, 2022

Many researchers have looked for ways to hash a string so that the resulting hashes can be compared for similarity. Content can then be searched for and matched in its hashed form rather than as the original text, which makes it more efficient to store and to process. A good example is spam detection, where we receive two spam emails that differ only in the target’s contact name (the text is taken from a real spam email):

Dear Sir, target@home !!Reference #PP-003-AC7-993-014
Account Security Alret. 
We need your help resolving an issue with your account.
What's going on?
Your debit or credit card issuer let us know that someone used your 
card without your permission. We want to make sure that you authorised 
any recent PayPal payments.

and:

Dear Sir, other@home !!Reference #PP-003-AC7-993-014
Account Security Alret. 
We need your help resolving an issue with your account.
What's going on?
Your debit or credit card issuer let us know that someone used your 
card without your permission. We want to make sure that you authorised 
any recent PayPal payments.

With this, we can compute a similarity score between an incoming message and a known spam message. If the score goes over a certain threshold, we can quarantine the message.
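To make the idea concrete, here is a minimal sketch of a SimHash-style fingerprint (the approach usually attributed to Charikar) applied to the two messages above. It is an illustration under stated assumptions, not a production filter: the word tokenizer, the use of MD5 as a mixing function, the 64-bit fingerprint size, the function names and the 0.9 threshold are all choices made for this example.

```python
import hashlib
import re

BITS = 64  # fingerprint size (an assumption for this sketch)


def simhash(text: str, bits: int = BITS) -> int:
    """Build a SimHash-style fingerprint from the word tokens of a text."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Map each token to a 64-bit integer; MD5 is only a mixing function here.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if counts[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def similarity(a: int, b: int, bits: int = BITS) -> float:
    """Return 1 - (Hamming distance / bits), so 1.0 means identical fingerprints."""
    distance = bin(a ^ b).count("1")
    return 1.0 - distance / bits


BODY = (
    "!!Reference #PP-003-AC7-993-014\n"
    "Account Security Alret.\n"
    "We need your help resolving an issue with your account.\n"
    "What's going on?\n"
    "Your debit or credit card issuer let us know that someone used your\n"
    "card without your permission. We want to make sure that you authorised\n"
    "any recent PayPal payments.\n"
)
spam_a = "Dear Sir, target@home " + BODY
spam_b = "Dear Sir, other@home " + BODY

THRESHOLD = 0.9  # hypothetical cut-off; tune it on real data

score = similarity(simhash(spam_a), simhash(spam_b))
print(f"similarity = {score:.3f}, quarantine = {score >= THRESHOLD}")
```

Because only one token differs between the two messages, most bit positions receive the same votes and the fingerprints should differ in only a few bits. Only the fingerprint of the known spam message needs to be stored, not its text.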


Charikar

Written by Prof Bill Buchanan OBE FRSE