ElasticPhish: Using CertStream and the Elastic Stack for Phishing Intelligence

Published in security analytics, Apr 30, 2019

In my previous post, A Phishing Guide: Lessons Learned on the Journey to Detecting Phishing Domains, I laid out my experience building phishing detection algorithms and the associated challenges. In this post, I want to dig deeper into a specific use case where open source technology can be a force multiplier for any security team. I provide all the requisite information to get started on GitHub if you want to get straight to it, but I will take a deeper dive here to explain the project and the impetus for it.

As we all know, security breaches are becoming all too common, affecting everyone from political parties to major health insurers and large technology companies. Industry research suggests that somewhere around 90% of breaches begin with some form of phishing. As I noted in my previous article, phishing encompasses a very broad set of tactics, techniques, and procedures (TTPs). This creates a need for a broad set of solutions to protect the enterprise in an environment that is increasingly decentralized with the explosion of cloud, IoT, and personal devices. Any single point solution, whether it be an email or web gateway, will only see a snapshot of the attack surface. And any one organization will only see the attacks that have been directed at it, whether purposeful or opportunistic.

So how do organizations protect themselves when no single product will detect or block every attack? I am arguing for a crowdsourced effort at threat intelligence sharing and detection. You may say, “but there are already platforms for intelligence sharing,” and you would be correct. I’m arguing for something a little more nuanced: the commoditization of data sources and detection capabilities. We have reached a stage where dozens of vendors are building redundant detection capabilities and differentiation is incredibly difficult. What if those detection algorithms were not only used to share findings (URL, MD5, malicious IP, etc.), but also lived in a shared repository where they could be run over data in every organization? Not only would this allow us to keep the best of breed and make it available to any security team, but intelligence sharing would also benefit, as equal capability could mean equal visibility regardless of vendor. This may be a dream, but I don’t see the benefit of having smart security experts building tools in silos rather than collaborating and growing the discipline faster.

Let’s now get to the point of this article: how might we take open source technology and improve our threat intelligence feed? I have laid out a simple proof-of-concept solution that can jump-start any security team and hopefully spark creativity to build on it. Here is a breakdown of ElasticPhish.

ElasticPhish Architecture: CertStream, Python, Elastic (Elastic logos are trademarks of Elasticsearch BV)

Data Feed

The first step in building any threat intelligence pipeline is the data feed. For the specific application of phishing domain detection, I have chosen CertStream. CertStream is a tool that provides real-time updates from the Certificate Transparency log network. Through simple APIs, you can interact with certificates as they are being issued. CertStream aggregates a number of Certificate Transparency logs, which record certificates issued by participating Certificate Authorities, and I would recommend reading the documentation to better understand the capability and how to use it. We will be using the Python API for this use case.

One important note: there is a URL for a CertStream server hosted by CaliDog included in most examples. I would recommend using this ONLY for testing purposes to make sure your code runs properly. Once you plan to deploy for real detection, your best option is to run your own CertStream server. It is quite simple and provides a more robust data pipeline for real-time intelligence feeds.
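To make the data feed concrete, here is a minimal sketch of consuming the stream with the `certstream` Python package. It assumes the message layout described in the CertStream documentation (`message_type`, `data.leaf_cert.all_domains`). The helper names `extract_domains` and `listen` are mine for illustration and are not taken from the ElasticPhish repository, and the default URL is the public test server mentioned above, so swap in your own server for real use.

```python
from typing import Any, Callable, Dict, List

# Public server used in most CertStream examples; testing only.
CERTSTREAM_URL = "wss://certstream.calidog.io/"


def extract_domains(message: Dict[str, Any]) -> List[str]:
    """Return the domains named in a certificate_update message, else []."""
    if message.get("message_type") != "certificate_update":
        return []  # heartbeats and other messages carry no certificate
    leaf = message.get("data", {}).get("leaf_cert", {})
    cleaned: List[str] = []
    for domain in leaf.get("all_domains") or []:
        domain = domain.lower()
        if domain.startswith("*."):
            domain = domain[2:]  # drop the wildcard prefix
        if domain not in cleaned:
            cleaned.append(domain)  # de-duplicate, keep first-seen order
    return cleaned


def listen(handle_domain: Callable[[str], None], url: str = CERTSTREAM_URL) -> None:
    """Block forever, passing each newly seen domain to handle_domain."""
    import certstream  # third-party package: pip install certstream

    def on_message(message, context):
        for domain in extract_domains(message):
            handle_domain(domain)

    certstream.listen_for_events(on_message, url=url)
```

Calling `listen(print)` would print domains as certificates are logged. Keeping the parsing in `extract_domains` separate from the network loop makes it easy to test against recorded messages.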

Analytic

The simplicity of the CertStream API allows for robust analytic capabilities to consume live certificate data and provide real-time intelligence feeds. In this proof of concept, I have included a simple heuristic model that has proven to be a rather effective solution for surfacing phishing domains (it detects roughly 1,000 new domains per day). The analytic scores domains based on the presence of several factors, including well-known keywords, brand spoofing, and TLD usage. The brands and keywords found in the scoring.yml file were derived from large-scale analysis of open source intelligence feeds like PhishTank, OpenPhish, VirusTotal, and Twitter. I developed a script (not included) that ingests each of these data sources and extracts words and brand names.
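The scoring idea can be sketched in a few lines. This is an illustration of additive keyword, brand, and TLD scoring, not the project's actual model: the word lists, weights, and threshold below are hypothetical stand-ins for whatever scoring.yml contains.

```python
from typing import Dict

# Hypothetical values for illustration; the real lists live in scoring.yml.
KEYWORDS: Dict[str, int] = {"login": 25, "verify": 20, "secure": 15, "account": 15}
BRANDS: Dict[str, int] = {"paypal": 50, "apple": 40}
SUSPICIOUS_TLDS: Dict[str, int] = {"tk": 20, "ml": 20, "top": 15}
THRESHOLD = 65  # hypothetical cut-off for logging a domain


def score_domain(
    domain: str,
    keywords: Dict[str, int] = KEYWORDS,
    brands: Dict[str, int] = BRANDS,
    tlds: Dict[str, int] = SUSPICIOUS_TLDS,
) -> int:
    """Sum the weights of every suspicious TLD, keyword, and brand present."""
    labels = domain.lower().rstrip(".").split(".")
    score = tlds.get(labels[-1], 0)
    body = ".".join(labels[:-1])  # everything to the left of the TLD
    for word, weight in keywords.items():
        if word in body:
            score += weight
    for brand, weight in brands.items():
        if brand in body:
            score += weight
    return score
```

With these made-up weights, `score_domain("paypal-login.verify.tk")` exceeds the threshold while `score_domain("example.com")` scores zero. Note that a brand's own domain still earns the brand weight under this sketch, which is one reason filtering known-good domains before scoring matters.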

To reduce false positives in obvious situations, the Umbrella 1M list is used to filter domains before they are scored. If you use this tool, or some incarnation of it, in production, I would recommend also using a WHOIS filter to further reduce false positives and to add valuable data enrichment. However, these services are numerous and quite expensive, so I have not built one into this prototype.
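A pre-scoring filter of this kind might look like the sketch below. It assumes the popularity list is a CSV of `rank,domain` rows, and it treats a domain as known-good if it or any parent domain appears in the list. The parent-matching rule is my assumption, not necessarily how ElasticPhish does it, and the function names are illustrative.

```python
import csv
from typing import Iterable, Set


def load_top_domains(lines: Iterable[str]) -> Set[str]:
    """Read 'rank,domain' rows (an open file or any iterable of lines) into a set."""
    return {row[1].strip().lower() for row in csv.reader(lines) if len(row) >= 2}


def is_allowlisted(domain: str, top: Set[str]) -> bool:
    """True if the domain, or any parent above the bare TLD, is in the list."""
    labels = domain.lower().rstrip(".").split(".")
    # For a.b.example.com, check a.b.example.com, b.example.com, example.com.
    # The bare TLD ("com") is never checked.
    for i in range(len(labels) - 1):
        if ".".join(labels[i:]) in top:
            return True
    return False
```

Matching from the right-hand side is what keeps a lookalike such as `google.com.login-verify.tk` from slipping through just because it contains a popular name. The trade-off is that anything hosted under a popular shared-hosting domain is skipped as well.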

It’s important to note that the analytic writes its results to a file for consumption by Filebeat in order to highlight the breadth of the Elastic stack. This output could be modified to produce a hosts file format similar to hpHosts, or piped directly to a proxy, NGFW, or DNS server for real-time blocking. Your imagination is the limit.
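As a rough sketch of the two output styles mentioned here, the functions below emit either a JSON line that a log shipper can tail or a hosts-file entry that sinks the domain. The field names (`timestamp`, `domain`, `score`) and the `0.0.0.0` sink address are assumptions for illustration; the project's real log format may differ.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def to_log_line(domain: str, score: int, now: Optional[datetime] = None) -> str:
    """Serialize one detection as a single JSON line."""
    now = now or datetime.now(timezone.utc)
    record = {"timestamp": now.isoformat(), "domain": domain, "score": score}
    return json.dumps(record, sort_keys=True)


def to_hosts_line(domain: str, sink_ip: str = "0.0.0.0") -> str:
    """Format one detection as a hosts-file entry pointing at a sink address."""
    return f"{sink_ip} {domain}"


def append_line(path: str, line: str) -> None:
    """Append one line to the output file that the shipper or blocker reads."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

One record per line keeps the file easy to tail, and switching between the two formats only changes which formatter feeds `append_line`.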

Elastic Stack

To store and visualize the results of the analytic, I have chosen the Elastic stack (Beats, Logstash, Elasticsearch, Kibana). I chose Elastic for a number of reasons, but most importantly because of the ease of setup and the built-in visualization capabilities of Kibana. Although it’s a bit of overkill, I use both Filebeat and Logstash in this application to showcase all aspects of the technology. I could have easily enriched the data in Python and used the Elasticsearch API for direct data loading, but the focus of this project is on available capabilities to enhance workflow.

As the domains are processed by the analytic, those exceeding the specified threshold are logged to a file. Filebeat reads from this file and ships the data directly to Logstash through the Beats connector. Logstash is tasked with enriching the data by performing a DNS lookup on each domain and then using its GeoIP functionality, backed by MaxMind data, to gather geolocation information about the resolved IP. This information not only helps visualize where the domains are being hosted but also allows aggregations to be performed to discover enclaves where large-scale phishing operations may be taking place.
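For readers who want to prototype this enrichment outside Logstash, the DNS step and a simple hosting aggregation can be approximated in Python. This is an analogue of the idea and not the project's Logstash configuration. Geolocation is left out because it needs a MaxMind database, and the record fields and function names are hypothetical.

```python
import socket
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def resolve(domain: str) -> Optional[str]:
    """Return one IPv4 address for the domain, or None if it does not resolve."""
    try:
        return socket.gethostbyname(domain)
    except (OSError, UnicodeError):
        return None


def enrich(
    records: Iterable[Dict],
    resolver: Callable[[str], Optional[str]] = resolve,
) -> List[Dict]:
    """Return copies of the records with an added 'ip' field."""
    return [{**rec, "ip": resolver(rec["domain"])} for rec in records]


def top_hosting_ips(records: Iterable[Dict], n: int = 5) -> List[Tuple[str, int]]:
    """Count flagged domains per resolved IP, most common first."""
    counts = Counter(rec["ip"] for rec in records if rec.get("ip"))
    return counts.most_common(n)
```

Passing the resolver in as a parameter lets you swap in a cache or a stub when replaying old detections. An address that hosts many flagged domains is the kind of cluster the aggregations described above are meant to surface.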

The enriched data is then shipped into Elasticsearch where it is indexed and available for search in near real-time. Kibana is used to not only perform the searches but also to build dashboards for real-time situational awareness around emerging phishing threats.

Conclusion

My goal with this project was to show how two really cool open source technologies, CertStream and Elastic, can be used to enhance any security operations team today. This simple approach can help you block phishing campaigns before they are ever deployed. But as I said in the introduction, there is no one-size-fits-all solution. This project is really my call to the community to open the doors and begin sharing detection algorithms so that everyone is protected. We’ve all worked on the same problems, but in silos. Detection capabilities are no longer a differentiator, so we must work together to commoditize them. Malicious URL detection using Certificate Transparency logs may be a great place to start. Leave me a response or a note with ideas you may have. Thanks for reading, and I look forward to your ideas.

Written by Jonathan Ticknor