Connecting all the domains in the world and propagating labels across the graph to detect new malicious domains!
Why detect malicious domains?
As you probably know, many infections on the Internet happen through access to malicious domains. Take phishing, for example: a web server (e.g., apple-id-phishing.com) serves real-looking fake pages of a reputable website such as PayPal or Apple. You go ahead and type your PayPal or Apple username and password into that fake web page. Instead of the request going to the real PayPal or Apple servers, your credentials end up in the hands of attackers.
Further, malicious domains are the key infrastructure used by Internet miscreants to launch sophisticated attacks. Take APTs (advanced persistent threats), for example. After the initial penetration, APTs communicate with C2 (command and control) servers (e.g., bot-controller.com). The Mirai botnet [1,2], which turned IoT devices such as remote cameras and home routers into bots, used at least 67 C2 domains as it spread widely in 2016.
Many malicious domains are created each day. To evade detection, Internet miscreants take advantage of DNS infrastructure to create disposable domains. While it is cheap to create disposable domains, it is expensive to own other Internet resources such as IPs. By changing their network associations, these miscreants stay under the radar of the detection systems in place and also resist takedown efforts.
Sometimes, the lifetime of a domain is only a few hours. Attackers are clever enough to change their domains (and IP associations) in order to hide their traces.
To give you a sense of how bad the situation is, let me paraphrase what PayPal’s head of intelligence, Brad Wardman, said recently: the average lifetime of a phishing domain is about 2 hours, while it currently takes about 8 hours to detect such a domain and take it down. By then, the damage is already done. Our research tries to resolve this issue, at least partially.
How do we find these malicious domains at present?
Currently, we rely on blacklists and reputation systems, for example VirusTotal, McAfee SiteAdvisor, Google Safe Browsing, PhishTank, OpenPhish and Spamhaus. While they are good at listing some malicious domains, they simply cannot keep up with the number of malicious domains that appear each day.
These blacklists are too slow to react. We found that they blacklist some of the domains only after several weeks. By that time, the damage is already done. Can we do better?
Can we detect malicious domains while Internet miscreants are in the act?
In our work, we expand the set of known malicious domains by connecting the domains that appear on the Internet and designing inference-based techniques. The work is published in ACM CODASPY 2018, titled “A domain is only as good as its buddies: Detecting Stealthy Malicious Domains via Graph Inference”. You can find an extended version of it here. This paper was adjudged the best paper of the conference.
First, I would like to share some observations that led to our work of connecting all domains in the Internet.
Internet miscreants increasingly utilize DNS infrastructure to change resources and avoid detection. For example, they jump from one hosting provider to another and change domain names often.
Benign domain owners exhibit certain domain-IP dynamics as well. Take shared hosting and load balancing, for example.
In shared hosting, like in Amazon AWS, multiple unrelated domains (i.e. domains belonging to different organizations) are hosted at the same IP and IPs are rotated during regular maintenance windows.
With load balancing, the same domain resolves to multiple IPs. Take google.com, for example. Depending on when and where you access it, it may resolve to different IP addresses, as Google serves its search results through many blocks of IPs in order to meet the demand efficiently.
The above malicious and benign domain-IP dynamics leave traces in DNS logs that allow us to connect them together. In other words, there are malicious and benign domain patterns buried inside DNS query logs.
So, how do we use these patterns?
The following diagram shows the high-level idea of our approach.
We use DNS records that show the IPs that domains resolve to. It is essentially a bipartite graph of domains and the IPs they resolve to.
We connect as many domains as possible using some magic (described later). The connections are made such that the connected domains are likely to be controlled by, or to belong to, the same entity.
The connected domains are highly related to one another. Once this process is done, we end up with a domain graph.
Then, using the blacklists I mentioned at the beginning, we identify a seed set of malicious domains (and also benign domains which is not shown in the picture). We inject these seed domains into the domain graph.
Finally, we use a label propagation or inference algorithm, such as belief propagation, to identify other possible malicious domains.
It follows a simple intuition: domains that are connected to one or more malicious seed domains are highly likely to be malicious, while domains that are not connected to malicious seed domains, or are connected only after many hops, are less likely to be malicious.
Similar propagation ideas are used in many other areas, including Google’s PageRank algorithm (Random Walks with Restart, to be precise).
How can we use these patterns to identify domains controlled by or belonging to the same entity (i.e. some person or organization)? In other words, how do we build the domain graph?
First, we need to have a global view of Domain-IP resolutions. There are two great sources you can utilize:
- Passive DNS Records (from Farsight Security Inc., paid access)
- Active DNS Records (from Georgia Tech, free access)
Passive DNS records are collected from 600+ sensors placed in 50+ countries around the world. These sensors collect aggregate DNS query records (i.e. no user information is recorded) and store them in a centralized location. The project used to be called DNSDB and provided free access. However, a company was formed around it, and now you need to pay to gain access to these records. Our paid data feed shows that we get millions of new domain-IP mappings every day.
Active DNS records, as the name suggests, are collected by actively querying a seed set of domains. Georgia Tech has set up a system to query a large set of domains (over 100 million) every day to gather their resolutions. You can read about their system in RAID 2016. While most of the domains in this dataset tend to be 2LDs (second-level domains) rather than FQDNs (fully qualified domain names), it is still a useful source of information to find domain-IP resolutions at a global level.
Now that we have global domain-IP resolutions at hand, how can we use them to connect domains?
We do it based on two intuitions:
- It is quite rare for domains to jump across different hosting providers over time, so domains that do so together are likely to belong to the same organization or attacker.
- Domains that resolve to the same dedicated IPs are likely to belong to the same organization. We will talk about dedicated IPs in detail later.
We call these intuitions association rules. Based on these association rules, we connect domains together. That is,
If two domains share IPs from more than one organization, we add an edge between them.
Also, if two domains share one or more dedicated IPs, we add a link between them.
You may ask how these association rules work and what the intuition behind them is.
Let’s first look at the first association rule — domains jumping across different hosting organizations over time:
How do we identify the organization that owns a given IP address?
IP → ASN → Organization
Each IP belongs to what is called an AS (Autonomous System). Each AS has a unique identifier called an ASN (Autonomous System Number). For example, 8.8.8.8 belongs to AS15169. Each ASN is owned by an organization. In this example, AS15169 is in fact owned by Google. Some organizations, especially big companies like Google and Amazon, own many ASNs.
As a start, you may use online ASN lookup tools such as this, the ipwhois Python package, or the whois IP lookup utility. While they do work, they are rate limited; in other words, they are not suitable for bulk lookups. How do we do bulk ASN lookups then? There is at least one great source. MaxMind releases an IP-ASN database, called GeoLite2, that they maintain daily. GeoLite2 is available free of charge. You can download the database and do the lookups yourself. This is exactly what we do.
With the above, we build our own IP-to-organization database that we use in this case.
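To make the IP → ASN → Organization chain concrete, here is a minimal sketch using only the Python standard library. The prefix table, the ASNs 64500 and 64501 (reserved for documentation) and the organization names "ExampleHost A/B" are made-up toy data, not the contents of GeoLite2; a real table would be loaded from a downloaded IP-ASN database.

```python
import ipaddress

# Toy data for illustration only. A real deployment would load these
# mappings from a downloaded IP-ASN database rather than hard-code them.
PREFIX_TO_ASN = {
    ipaddress.ip_network("8.8.8.0/24"): 15169,
    ipaddress.ip_network("192.0.2.0/24"): 64500,     # documentation prefix
    ipaddress.ip_network("198.51.100.0/24"): 64501,  # documentation prefix
}
ASN_TO_ORG = {
    15169: "Google",
    64500: "ExampleHost A",  # hypothetical organization
    64501: "ExampleHost B",  # hypothetical organization
}


def ip_to_org(ip):
    """Return the organization owning `ip`, or None if no prefix matches."""
    addr = ipaddress.ip_address(ip)
    best = None  # (network, asn) of the longest matching prefix so far
    for net, asn in PREFIX_TO_ASN.items():
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    if best is None:
        return None
    return ASN_TO_ORG.get(best[1])


print(ip_to_org("8.8.8.8"))
print(ip_to_org("203.0.113.5"))
```

The linear scan over prefixes is only for readability; for bulk lookups a radix tree or a sorted interval index would be the more natural choice.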
Using aggregate DNS data mentioned earlier, we build the Domain-IP bipartite graph. This graph shows the IPs each domain resolves to in a given dataset.
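As a rough illustration, the bipartite graph can be held as two dictionaries, one per direction. This is a minimal in-memory sketch with made-up records; the function name `build_bipartite` and the `(domain, ip)` tuple format are assumptions for the example, not the format of any real DNS feed.

```python
from collections import defaultdict


def build_bipartite(resolutions):
    """Build both directions of the domain-IP graph from (domain, ip) pairs."""
    domain_to_ips = defaultdict(set)
    ip_to_domains = defaultdict(set)
    for domain, ip in resolutions:
        domain_to_ips[domain].add(ip)
        ip_to_domains[ip].add(domain)
    return domain_to_ips, ip_to_domains


# Made-up resolutions using documentation-only names and addresses.
records = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("a.example", "198.51.100.7"),
]
domain_to_ips, ip_to_domains = build_bipartite(records)
print(sorted(domain_to_ips["a.example"]))
print(sorted(ip_to_domains["192.0.2.1"]))
```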
An interesting question to think about is how to select domain-IP resolutions from the aggregate DNS dataset. We select all domain-IP resolutions first seen during a given week.
Why first seen? We observe that most malicious domains are short-lived. By restricting ourselves to first-seen domains, we are likely to encounter many malicious domains while still being able to process the data with a reasonable amount of hardware resources.
Why one week? The length is debatable. We believe more research is needed to identify the optimal window. One thing we observe, though, is that using a larger window does not improve the result. We selected one week mainly because processing it is manageable with the hardware resources we have. It is likely that a window shorter than 7 days could produce similar results.
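The selection step can be sketched as a simple date filter. This assumes each record carries a first-seen date as a third field, which is an assumption about the data layout made for this example; the function name and sample records are hypothetical.

```python
from datetime import date, timedelta


def first_seen_in_window(records, window_start, days=7):
    """Keep (domain, ip) pairs whose first-seen date falls in the window.

    The window is half-open: [window_start, window_start + days).
    """
    window_end = window_start + timedelta(days=days)
    return [
        (domain, ip)
        for domain, ip, first_seen in records
        if window_start <= first_seen < window_end
    ]


# Made-up records: (domain, ip, first_seen).
records = [
    ("old.example", "192.0.2.1", date(2018, 1, 1)),
    ("new.example", "192.0.2.2", date(2018, 3, 5)),
    ("newer.example", "198.51.100.3", date(2018, 3, 11)),
    ("late.example", "198.51.100.4", date(2018, 3, 12)),
]
print(first_seen_in_window(records, date(2018, 3, 5)))
```

Making `days` a parameter keeps it easy to experiment with shorter windows.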
Now we are ready to execute the first association rule: identifying domains that resolve to IPs belonging to multiple organizations. For each pair of domains in this bipartite graph, we identify the common IPs they share. If those IPs belong to at least two organizations, we draw an edge between the two domains. This is a perfectly parallelizable workload, so we use map-reduce jobs over a Hadoop cluster to do this quickly.
How does the second association rule work?
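On a single machine and a small dataset, the first rule can be sketched as below. This is an illustrative in-memory version, not the map-reduce implementation; the function name `rule1_edges`, the input dictionaries and the organization names are assumptions for the example.

```python
from itertools import combinations


def rule1_edges(domain_to_ips, ip_to_org):
    """Connect two domains if their shared IPs span at least two organizations."""
    edges = set()
    for d1, d2 in combinations(sorted(domain_to_ips), 2):
        shared_ips = domain_to_ips[d1] & domain_to_ips[d2]
        # IPs with no known organization are ignored in this sketch.
        orgs = {ip_to_org[ip] for ip in shared_ips if ip in ip_to_org}
        if len(orgs) >= 2:
            edges.add((d1, d2))
    return edges


# Made-up data: a.example and b.example share IPs at two different hosts.
domain_to_ips = {
    "a.example": {"192.0.2.1", "198.51.100.1"},
    "b.example": {"192.0.2.1", "198.51.100.1"},
    "c.example": {"192.0.2.1"},
}
ip_to_org = {"192.0.2.1": "HostA", "198.51.100.1": "HostB"}
print(rule1_edges(domain_to_ips, ip_to_org))
```

Comparing every pair is quadratic in the number of domains, which is why a distributed job makes sense at Internet scale.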
Now to the second association rule: identifying domains resolving to dedicated IPs.
First of all, what the heck are dedicated IPs?
We classify each IP in the dataset as shared or dedicated.
We mark an IP as dedicated if it hosts domains only belonging to one organization (e.g. Google Inc.). If an IP hosts domains belonging to many organizations (e.g. a public hosting IP from Amazon AWS may host domains from many organizations), we mark it as shared.
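If the owner of every domain were known, this definition would translate directly into code. The sketch below assumes exactly that: a hypothetical `domain_owner` mapping with made-up organizations. It only illustrates the definition, not how IPs are labeled at scale.

```python
def label_ips(ip_to_domains, domain_owner):
    """Label each IP 'dedicated' (one owning organization) or 'shared'."""
    labels = {}
    for ip, domains in ip_to_domains.items():
        owners = {domain_owner[d] for d in domains}
        labels[ip] = "dedicated" if len(owners) == 1 else "shared"
    return labels


# Made-up hosting data and ownership.
ip_to_domains = {
    "192.0.2.10": {"shop.example", "blog.example"},
    "192.0.2.20": {"shop.example", "other.example"},
}
domain_owner = {
    "shop.example": "Org1",
    "blog.example": "Org1",
    "other.example": "Org2",
}
print(label_ips(ip_to_domains, domain_owner))
```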
One thing I would like to point out is that not all public hosting cloud IPs from Amazon, Google, Rackspace, etc. are shared IPs. There are dedicated IPs in these IP pools that are exclusively rented out to certain organizations.
How can we label these IPs?
A naive approach is to identify all domains associated with each IP and find out whether those domains belong to one organization or many organizations. In order to do this, we need to identify the domain owner. It is not straightforward to identify the organization that owns a given domain. One way to identify the domain owner is to check thick WHOIS records. It is extremely difficult to get thick WHOIS records for all the domains in the dataset. So, what can we do?
We manually label a few thousand IPs as dedicated or shared. Then we train a random forest classifier (based on a set of features we identified that discriminate between shared and dedicated IPs). Using this trained classifier, we label all the other IPs in the dataset.
Let me show a dedicated IP identified by our classifier that actually belongs to the AWS public IP pool.
Our classifier marks this as dedicated. Let’s dig into it. What are the domains that resolve to this IP during the period of study?
Looking at the domain names, it is not immediately apparent that they belong to the same organization. However, when you look at the thick WHOIS record for each of these domains, you do find that all these domains belong to the same organization, “Thomas Spooner Group”. This increases the confidence we have in our classifier.
I listed this example to show that the classifier can detect dedicated IPs from public hosting services.
Now let’s build more connections in the domain graph:
Using these labeled IPs, we make more connections in the domain graph.
Using the bipartite graph mentioned above, we again scan each pair of domains. (We do these two scans at the same time in our implementation. I separated them out here for easier comprehension.) If they resolve to at least one common dedicated IP, we add an edge between them. What is the intuition here? By definition, domains resolving to dedicated IPs are likely to belong to the same organization. By adding an edge, we connect domains belonging to the same organization.
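The second rule can be sketched by walking the dedicated IPs and connecting every pair of domains hosted on each one. This is a minimal illustration; the function name `rule2_edges` and the label dictionary format are assumptions, and the labels here are hand-written rather than produced by a classifier.

```python
from itertools import combinations


def rule2_edges(ip_to_domains, ip_labels):
    """Connect every pair of domains that share at least one dedicated IP."""
    edges = set()
    for ip, domains in ip_to_domains.items():
        if ip_labels.get(ip) != "dedicated":
            continue  # shared or unlabeled IPs contribute no edges
        for d1, d2 in combinations(sorted(domains), 2):
            edges.add((d1, d2))
    return edges


# Made-up hosting data and labels.
ip_to_domains = {
    "192.0.2.10": {"shop.example", "blog.example", "mail.example"},
    "192.0.2.20": {"shop.example", "other.example"},
}
ip_labels = {"192.0.2.10": "dedicated", "192.0.2.20": "shared"}
print(sorted(rule2_edges(ip_to_domains, ip_labels)))
```

Iterating per IP rather than per domain pair gives the same edges while only touching pairs that actually co-occur on a dedicated IP.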
With the domain graph, how do you detect new malicious domains?
We inject known malicious domains and known benign domains into this domain graph. Malicious domains are collected from VirusTotal and SiteAdvisor. Benign domains are collected from the Alexa Top 1M domains.
We propagate the labels to unlabeled domains in the domain graph using a graph inference algorithm such as belief propagation.
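To give a feel for how labels spread over the graph, here is a deliberately simplified sketch. It is not belief propagation: it swaps in a plain iterative neighbor-averaging scheme in which seed scores are clamped (malicious = 1.0, benign = 0.0) and every other domain repeatedly takes the mean score of its neighbors. The function name, the 0.5 starting score and the iteration count are all assumptions for the example.

```python
from collections import defaultdict


def propagate(edges, malicious_seeds, benign_seeds, iterations=20):
    """Spread seed labels over an undirected graph by neighbor averaging."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    scores = {node: 0.5 for node in neighbors}  # unknown domains start neutral
    for seed in malicious_seeds:
        scores[seed] = 1.0
    for seed in benign_seeds:
        scores[seed] = 0.0
    seeds = set(malicious_seeds) | set(benign_seeds)

    for _ in range(iterations):
        updated = dict(scores)
        for node, nbrs in neighbors.items():
            if node in seeds:
                continue  # seed labels stay fixed
            updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
        scores = updated
    return scores


# Made-up chain: bad - x - y - good.
edges = [
    ("bad.example", "x.example"),
    ("x.example", "y.example"),
    ("y.example", "good.example"),
]
scores = propagate(edges, {"bad.example"}, {"good.example"})
print(scores["x.example"] > scores["y.example"])
```

In this toy chain the domain next to the malicious seed ends up with a higher score than the one next to the benign seed, which matches the intuition that proximity to bad seeds raises suspicion.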
Based on 10-fold cross-validation, we decide on an appropriate threshold above which domains are marked as malicious.
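One way to pick such a threshold is to score held-out seed domains and keep the candidate that performs best on them. The sketch below uses F1 as the selection criterion, which is purely an assumption for illustration (the criterion is not stated here), and it shows a single held-out set rather than all ten folds. The function name and sample scores are made up.

```python
def best_threshold(scored, candidates):
    """Return the candidate threshold with the highest F1 on held-out data.

    `scored` is a list of (score, is_malicious) pairs for held-out seeds.
    """

    def f1(threshold):
        tp = sum(1 for s, m in scored if s >= threshold and m)
        fp = sum(1 for s, m in scored if s >= threshold and not m)
        fn = sum(1 for s, m in scored if s < threshold and m)
        return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

    return max(candidates, key=f1)


# Made-up held-out scores: (propagated score, true label).
held_out = [
    (0.9, True), (0.8, True), (0.6, False),
    (0.4, True), (0.2, False), (0.1, False),
]
print(best_threshold(held_out, [0.3, 0.5, 0.7]))
```

In a full cross-validation setup this would be repeated per fold, hiding a different tenth of the seeds each time, and the per-fold results combined.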
Our overall system:
We have in fact built a 28-node Hadoop-based system to produce a daily malicious domain list. After making it production quality, we plan to make it available for the general public to consume.
We continuously collect the data mentioned here:
- Active DNS data is fetched daily from the Georgia Tech Active DNS data repository.
- ASN information is pulled daily from the MaxMind database mentioned above.
- We have about 80 proxies running to pull thick WHOIS records daily in order to find the ground truth for the IP classifier.
- We also collect ground truth daily from VirusTotal, SiteAdvisor and Alexa Top 1m domains.
We normalize these data and store them in a NoSQL database.
We then extract data for a one-week window and build the domain resolution graph.
We label a small set of IPs manually using the collected thick WHOIS records. Then we train an IP classifier.
We run all the IPs that appear in the dataset through our IP classifier to classify them as dedicated or shared.
We then build the domain graph I explained earlier using the two association rules.
Finally, we run the inference algorithm to detect more malicious domains based on a small seed set of malicious and benign domains.
Based on an appropriate threshold derived from 10-fold cross-validation testing, we generate a daily list of malicious domains. We are able to identify on the order of 40K new malicious domains based on a few hundred malicious seeds identified from VirusTotal and SiteAdvisor.
What are some use cases of the malicious list produced daily?
We support two immediate use cases of the intelligence we produce:
- Daily update to DNSBL — DNS blacklist that can be integrated with local DNS resolver
- Domain threat intelligence visualization — a search engine to search domain intelligence
The Way Forward
Our goal is to make this system near-real-time. We are currently working on improving the performance and algorithms of the system to achieve this goal. Stay tuned for more exciting updates.
Please feel free to contact me if you need further details.
Also, we would be glad to share our datasets and code with anyone who is using them for research or academic purposes.
We, at Qatar Computing Research Institute, carry out a lot of exciting and impactful research, especially on data-analytics-based cyber security intelligence. We are always on the lookout for passionate and smart people who can help shape the research and development in this area. If you fit the bill, please feel free to contact me or any of the team members.