Hunting Phishers with Elastic Stack and Certificate Transparency

Leandro Velasco
7 min read · Oct 1, 2019


A few years ago, Google launched the Certificate Transparency project. It was designed to improve the TLS/SSL certificate system by providing an open framework for monitoring and auditing certificates in near real time. By allowing independent parties to inspect the certificates being issued around the world, breached Certificate Authorities issuing rogue certificates become easy to spot. In short, security incidents like the DigiNotar hack could be avoided. Moreover, the Certificate Transparency project does not only benefit end users browsing the web; it also presents good opportunities for security researchers to hunt on domain names. In essence, one could see Certificate Transparency as a stream of certificates being issued across the world. With the increasing adoption of HTTPS and HTTP/2, the volume of certificates in this stream has grown considerably over the last months. So how could we use this data as security researchers? In this blog we are going to explore how to hunt for phishing/malicious sites within this data set.

So let's start from the beginning: how do we get access to the Certificate Transparency data? First of all, this data is open and publicly accessible. Even better, there are open source projects that poll the servers, parse the certificates, and, in some cases, alert when potential phishing is detected. So access is not something to worry about. In our case, we decided to implement our own Certificate Transparency collector and parser so we could have more control over the data we gather. We then push it into an Elastic Stack for data enrichment and analysis.
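To give an idea of the parsing step, here is a minimal Python sketch of flattening a CT log entry into the fields used in the queries below (`domain`, `CN`, `signer.CN`). The message layout is an assumption loosely modeled on what CT streaming libraries emit; adapt it to your own collector's output.

```python
# Sketch: flatten a CT log entry into the fields we index in Elasticsearch.
# The input structure is an assumption modeled on certstream-style messages.

def extract_fields(message):
    leaf = message["data"]["leaf_cert"]
    cn = leaf["subject"].get("CN", "")
    # Naive registrable-domain extraction: keep the last two labels.
    # A real pipeline should use the Public Suffix List instead.
    domain = ".".join(cn.split(".")[-2:]) if cn else ""
    chain = message["data"].get("chain") or []
    signer_cn = chain[0]["subject"].get("CN", "") if chain else ""
    return {
        "CN": cn,
        "domain": domain,
        "all_domains": leaf.get("all_domains", []),
        "signer.CN": signer_cn,
    }

# Hypothetical sample entry, for illustration only.
sample = {
    "data": {
        "leaf_cert": {
            "subject": {"CN": "login.twitter.com.example.ga"},
            "all_domains": ["login.twitter.com.example.ga"],
        },
        "chain": [{"subject": {"CN": "Some Issuing CA"}}],
    }
}
doc = extract_fields(sample)
print(doc["domain"])     # example.ga
print(doc["signer.CN"])  # Some Issuing CA
```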

Common techniques used by Malicious Actors

So, ready to hunt? Well… it would be nice to know what to hunt for first. In this blog we are going to explore some common techniques that attackers use to deceive users and disguise themselves as legit websites. For a better understanding of each technique, let's first present the “disguise technique” and then show how it could be hunted using Kibana queries.

Typosquatting

This technique abuses the fact that users sometimes mistype the URL of a website or simply do not pay enough attention to its correct spelling. A user might therefore think they are in the right location, trust it, and input their credentials and/or sensitive information such as credit card or insurance details.

As an example case study, let's focus on Twitter copycats. First of all, we know that the official domain is “twitter.com” and its certificate is signed by “DigiCert SHA2 High Assurance Server CA”, as shown in the following picture.

One approach to detect such malicious sites would be to use a regex expression to search for odd domains. In this case we are not interested in subdomains, so we are going to focus on the field “domain”. The idea is to list all domains that contain the word twitter surrounded by numbers, “-”, and/or “.”. The reasoning behind this expression is that most users might ignore the fact that the top level domain is “.com” or that there should be no numbers or words in front of the domain. Kibana query:

NOT signer.CN.keyword:"DigiCert SHA2 High Assurance Server CA"
AND (domain:/(.*\.)?[0-9]*twitter[0-9]*(\..*)?/)

NOTE: In order to rule out the official “twitter.com” certificate we added the following filter to the query: NOT signer.CN.keyword:"DigiCert SHA2 High Assurance Server CA"
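The same pattern can be prototyped outside Kibana before committing it to a dashboard. A small Python sketch of the regex logic (the candidate domains are made up for illustration; the blog's query filters on the signer instead, which this sketch simplifies to a domain allowlist):

```python
import re

# Mirrors the Kibana regex: the word "twitter" optionally surrounded by
# digits, with any extra labels before or after separated by dots.
TYPOSQUAT = re.compile(r"^(.*\.)?[0-9]*twitter[0-9]*(\..*)?$")

# Hypothetical candidates, for illustration only.
candidates = [
    "twitter.com",        # the legitimate domain also matches; we filter it out
    "twitter-login.ga",   # no match: "-login" sits inside the same label
    "0twitter0.com",      # match
    "login.twitter.cm",   # match
    "example.com",        # no match
]

OFFICIAL = {"twitter.com"}
hits = [d for d in candidates if TYPOSQUAT.match(d) and d not in OFFICIAL]
print(hits)  # ['0twitter0.com', 'login.twitter.cm']
```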

Using the same detection technique, we could look for Twitter API copycats by adding another regex expression, as shown in the following query:

NOT signer.CN.keyword:"DigiCert SHA2 High Assurance Server CA"
AND CN:/(.*\.)?[0-9]*twitter[0-9]*(\..*)?/
AND CN:/(.*\.)?[0-9]*api[0-9]*(\..*)?/

Another approach for hunting sites that use typosquatting would be “fuzzy searching”. This is a native functionality of Elasticsearch that allows searching for terms that are “similar” to the provided search term. It is important to note that in this query we are using the Elasticsearch datatype “keyword”. This field stores the value before Elasticsearch tokenizes it; in other words, it holds the raw value and can be searched using exact terms or regex. Kibana query:

NOT signer.CN.keyword:"DigiCert SHA2 High Assurance Server CA"
AND CN.keyword:twitter.com~1
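Under the hood, `twitter.com~1` matches terms within Levenshtein (edit) distance 1 of `twitter.com` (by default Elasticsearch also counts a transposition of two adjacent characters as a single edit, which the plain version below omits). A quick Python sketch of the idea, with invented sample domains:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

target = "twitter.com"
# Hypothetical candidates, for illustration only.
candidates = ["twitter.com", "twiter.com", "twitters.com", "tvvitter.com", "example.com"]
close = [d for d in candidates if d != target and edit_distance(d, target) <= 1]
print(close)  # ['twiter.com', 'twitters.com']
```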

Appending malicious domain to a legit one

Another common technique that malicious actors use to lure users onto their sites is to append an attacker-controlled domain to the target domain. An example of this would be “www.example.com-malicious-co.ga”. Here the attacker is exploiting the fact that users do not read the full FQDN; they just look at the first part of the URL and assume it is the right site. This technique is particularly effective on small screens such as those of smartphones or tablets, as can be seen in the following picture.

So, how can we hunt for sites that attempt such a luring technique? Regex expressions to the rescue! In this case we can search for terms that include “twitter.com” somewhere other than the end of the string. Kibana query:

NOT signer.CN.keyword:"DigiCert SHA2 High Assurance Server CA"
AND CN.keyword:/twitter\.com[-.].+/
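The same check can be prototyped in Python: flag names where the brand's domain appears but is immediately followed by more characters, so it cannot be where the FQDN actually ends. The sample domains are invented for illustration:

```python
import re

# Mirrors the Kibana regex: "twitter.com" followed by "-" or "." and more
# text, i.e. the brand is present but is not the registrable domain.
APPENDED = re.compile(r"twitter\.com[-.].+")

# Hypothetical candidates, for illustration only.
candidates = [
    "twitter.com",                      # legitimate, no match
    "www.twitter.com-security-co.ga",   # match: brand glued to attacker domain
    "twitter.com.login.example.tk",     # match: brand used as a subdomain
    "nottwitter.org",                   # no match
]
hits = [d for d in candidates if APPENDED.search(d)]
print(hits)  # ['www.twitter.com-security-co.ga', 'twitter.com.login.example.tk']
```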

Another way to get similar results would be to use fuzzy search on a more flexible field. If instead of the “CN.keyword” field we use a tokenized field (available out of the box in most Elastic Stack setups), we get a broader set of results. This also increases the number of false positives, but it surfaces quite a few interesting results. Kibana query:

NOT signer.CN.keyword:"DigiCert SHA2 High Assurance Server CA"
AND CN:twitter~1

Punycoded copycats

Lastly, one of the most interesting techniques is the use of punycode as a disguise to look like the official site an attacker is trying to spoof. But what is punycode? In a few words, punycode was designed to represent “International Domain Names” within the limited set of characters allowed in URLs. In other words, domain names that contain characters (normally encoded with Unicode) such as ü, ñ, ç, or any character from a non-Latin alphabet, can be encoded with punycode. So a domain like “hellô_therë.com” will be encoded as “xn--hell_ther-34a7g.com”. But what if the attacker tried to do something smarter, something like the following?

xn--appe-i9b[.]com > appɫe[.]com

Because there are many characters that closely resemble ASCII characters, attackers have a wide range of possibilities to exploit this feature.

So how do we hunt for these sites? Unfortunately there is no built-in Logstash or Elasticsearch feature to help us here. What we did is create our own Logstash enricher, written in Ruby, that detects punycoded domains, creates a field with the domain name represented in Unicode, and creates another field with the transliteration of the domain name to ASCII. Because we now have a field with the closest ASCII interpretation of the punycoded domain, we can use all the aforementioned queries to find suspicious sites that might be trying to spoof a particular website. Kibana Query:

NOT signer.CN.keyword:"DigiCert SHA2 High Assurance Server CA"  
AND (antiPunycodedDomain:/(.*\.)?[0-9]*twitter[0-9]*(\..*)?/
OR antiPunycodedDomain:twitter~1)
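The decoding and transliteration logic of such an enricher can be sketched in a few lines of Python (ours is a Ruby Logstash filter; the homoglyph table below is a tiny illustrative subset, and a real enricher needs a much larger confusables mapping, e.g. derived from Unicode's confusables data):

```python
import unicodedata

# Tiny, hand-picked homoglyph subset for illustration; a production enricher
# should use a full Unicode "confusables" table instead.
HOMOGLYPHS = {"ɫ": "l", "ɑ": "a", "е": "e", "о": "o", "р": "p"}

def decode_punycode_domain(domain):
    """Decode the xn-- labels of a domain back to Unicode."""
    labels = []
    for label in domain.split("."):
        if label.startswith("xn--"):
            labels.append(label[4:].encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)

def transliterate_to_ascii(domain):
    """Best-effort ASCII rendering: homoglyph map first, then NFKD stripping."""
    mapped = "".join(HOMOGLYPHS.get(c, c) for c in domain)
    normalized = unicodedata.normalize("NFKD", mapped)
    return "".join(c for c in normalized if ord(c) < 128)

unicode_domain = decode_punycode_domain("xn--appe-i9b.com")
print(unicode_domain)                         # appɫe.com
print(transliterate_to_ascii(unicode_domain)) # apple.com
```

The second field is what makes the `antiPunycodedDomain` queries above possible: the spoofed brand reappears in plain ASCII and all the earlier regex and fuzzy searches apply again.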

Use keywords in a smart way!

As you might have noticed, there is a rich plethora of possibilities when it comes to hunting for malicious sites. The techniques described in this blog merely scratch the surface of what can be done. Mixing relevant terms with keywords such as “payment”, “verification”, “security”, etc. can lead to very good results. A great place to start is here. Kibana query:

NOT signer.CN.keyword:"DigiCert SHA2 High Assurance Server CA" 
AND CN:/(.*\.)?[0-9]*twitter[0-9]*(\..*)?/
AND CN:(login log-in sign-in signin account verification verify webscr password credential support activity security update authentication authenticate authorize wallet alert purchase transaction recover unlock confirm live office service manage portal invoice secure customer client bill online safe form)
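The keyword idea is easy to prototype outside Kibana as well: flag any domain that combines a brand with a high-risk word. A Python sketch (the word list is a subset of the one above, and the sample domains are invented):

```python
import re

# Subset of the high-risk keyword list used in the Kibana query above.
KEYWORDS = {"login", "signin", "verify", "account", "secure",
            "password", "support", "update", "wallet"}

def suspicious_keywords(domain):
    # Split the FQDN on dots and dashes and report any high-risk tokens.
    tokens = re.split(r"[.-]", domain.lower())
    return sorted(KEYWORDS & set(tokens))

print(suspicious_keywords("twitter-login-verify.example.ga"))  # ['login', 'verify']
print(suspicious_keywords("twitter.com"))                      # []
```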

Conclusion

As we have seen, Certificate Transparency is a great resource for data mining. In this blog we explored how to hunt for malicious sites using native features of the Elastic Stack plus some enrichments. We also analysed some of the most common techniques malicious actors use to disguise themselves as legitimate websites, and how to focus our searches on them. However, Certificate Transparency can be used for quite a bit more (subdomain enumeration, getting the customer names of an MSP, etc.). Moreover, do not think that it is a resource just for blue teams; malicious actors are also aware of the richness of this data. So be aware that all your certificates, even those issued for a very experimental system, are most likely public ;)

Disclaimer

Before concluding this blog I would like to make it clear that not all domains shown here are malicious. Hunting for malicious sites does not just require thinking of nice and creative queries but also verifying that a website is indeed malicious. Looking at whois data, general OSINT resources, web archives, and the website itself helps to make a better decision. Lastly, it is always important to remember that we are dealing with potentially malicious actors, so taking OPSEC measures is particularly important for this type of research.

Global References

https://www.certificate-transparency.org/
https://github.com/x0rz/phishing_catcher
https://github.com/CaliDog/certstream-server
https://github.com/wesleyraptor/streamingphish

Acknowledgments!

I want to warmly thank the KPN Security Research team: @sndrptrs, @rikvduijn, @wesleyneelen, @Jeroen4Clover, and @JCMarques15!
