Update 13/12/2019: Lately there has been a lot of discussion about requests for domains submitted to NIC.cl by private parties. We have not made, and will not make, any request of that kind. As the last section of this article explains, from the beginning the information was only available to carry out anonymized statistical research at the national level.
At alertot we have a dream: assess the security status of websites in Chile. That’s why we started our “Chile Security Survey”, and the plan comprises the following steps:
- Get a list of Chilean websites.
- Run detectem against each site to find out what software it uses.
- Know how many vulnerabilities each website has. Our vulnerability database is built with information from several sources, so it’s not a mere copy of the CVE database.
This post is about the first step, getting a list of Chilean websites.
Getting a source
When you think about scanning the websites in a country, the first idea is network blocks. Here you can get the list of network blocks for Chile. However, using them has four drawbacks:
- The list contains residential network blocks, which are useless for our purposes.
- It’s not easy to tell whether an IP hosts a first-level domain.
- An IP could host multiple `.cl` domains; at best you only get one of them.
- It’s 2018; many companies use the cloud or foreign hosting providers.
So we discarded this idea. We started looking for a list of Chilean domains and discovered “Top 10 million websites”, which contains around 13k Chilean domains. Not bad, but according to nic.cl there are more than 561k domains.
In security there’s a whole field dedicated to asset discovery. There are enumeration tools that try to find subdomains through DNS queries using word lists, and I thought about:
- Using the same tools but for top-level domains and Spanish word lists.
- Applying alterations on found domains to discover more.
I was going to set up this approach when I reviewed the nic.cl website and found an interesting query page. I’ve worked 5 years at Scrapinghub doing web scraping daily, and the site looked scrapable to me. Here we go!
Scraping the constrained site
If you enter the character `a` and select “domain starts with a”, you won’t get the full list of domains starting with `a`. It’s the kind of site that says “I have thousands of entries, but I only show 100”.
In the past I’ve faced this kind of site. The important thing is to be able to split the search to get all the data. Am I able to do that here? Yes, but how?
I’m going to define a seed as the starting string of characters. We begin with the seed `a`, which has length 1 and returns more than 100 results. Then I have to grow the seed by one character, but which one? The last domain in the listing is `a1producciones.cl`, so the seed `a1` is a good fit.

The next request uses the seed `a1`. It returns 16 domains, so we’re done with this seed; it needs no further splitting. The next seed is `a2`, and we continue until we have completely covered all the domains starting with `a`.
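The splitting logic above can be sketched as a small recursive function. This is a simplified illustration, not the actual spider code: `search` stands in for the query page, the result cap mimics the site’s 100-entry limit, and the alphabet is an assumption about which characters domain labels may contain.

```python
# Assumed label alphabet, in the same order the site sorts results.
ALPHABET = "-0123456789abcdefghijklmnopqrstuvwxyz"


def collect(seed, search, limit=100):
    """Return every domain starting with `seed`.

    `search(seed)` returns the (sorted) matches, truncated at `limit`.
    If the listing is truncated, the last visible domain tells us which
    child seed to resume from: everything sorting before it is already
    in the current page, so we skip those child seeds entirely.
    """
    matches = search(seed)
    found = set(matches)
    if len(matches) < limit:
        return found  # complete listing, no further split needed

    last = matches[-1]
    next_char = last[len(seed)] if len(last) > len(seed) else ALPHABET[0]
    for c in ALPHABET[ALPHABET.index(next_char):]:
        found |= collect(seed + c, search, limit)
    return found
```

With `seed="a"` and a truncated page ending in `a1producciones`, the loop resumes at the child seed `a1` instead of re-querying `a-` and `a0`, which is where the savings over brute-force enumeration come from.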
No brute-force my friend
Seed generation is a thing. Let’s check the table to see how many requests are needed to cover all the seeds of a given length.
| Seed length | Number of requests |
|-------------|--------------------|
| 2           | 1,849              |
| 3           | 79,507             |
| 4           | 3,418,801          |
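The pattern in those numbers points to an alphabet of 43 characters (my inference: a–z, 0–9, the hyphen, and a few accented characters allowed in `.cl` names), since covering every seed of length n exhaustively costs 43^n requests:

```python
# The alphabet size is inferred from the table above: 43**2 == 1849.
ALPHABET_SIZE = 43

for length in (2, 3, 4):
    print(f"seed length {length}: {ALPHABET_SIZE ** length:,} requests")
```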
It’s necessary to build a strategy to scrape the data efficiently. In our case, it took around 68,000 requests to get the full domain list.
I think the main risk with scraping is causing a denial of service. For me, the challenge is to get the data I want with the minimum number of requests, in a non-abusive manner. What does non-abusive mean? Leave a few seconds between requests and adapt your crawl to the site’s resources.
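In Scrapy terms, “leave a few seconds between requests and adapt to the site” maps onto a handful of settings. These are real Scrapy settings, but the values below are illustrative, not the exact configuration we used:

```python
# settings.py — hypothetical politeness settings for this kind of crawl.
DOWNLOAD_DELAY = 2                     # base delay, in seconds, between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1     # never hit the site in parallel
AUTOTHROTTLE_ENABLED = True            # adjust the delay from observed latencies
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one in-flight request on average
ROBOTSTXT_OBEY = True
```

AutoThrottle is the “adapt to site resources” part: it raises the delay automatically when the server slows down.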
Make it look pro
The best solution for scraping a website is Scrapy. No discussion.
Imagine you’re in the middle of the process and the spider fails and exits. How do you recover from that situation without restarting from the beginning? That also deserves some strategy.
In the end I used three different spiders: a seed discoverer, a seed consumer, and a spider for special cases. This is a common approach for large crawls that can be segmented: if a spider fails, I restart only that seed instead of the whole process.
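One way to implement that hand-off between the discoverer and the consumer is a small file-backed queue, where a seed is removed only after its domains have been stored, so a crash re-runs at most one seed. This is a hypothetical sketch, not our actual code; note that Scrapy also ships built-in pause/resume for a single crawl via its `JOBDIR` setting.

```python
import json
import os


class SeedQueue:
    """File-backed FIFO of seeds shared between spiders.

    A seed is removed only when `complete()` is called, so a crashed
    consumer re-processes at most the seed it was working on.
    """

    def __init__(self, path):
        self.path = path
        try:
            with open(path) as f:
                self.pending = json.load(f)
        except FileNotFoundError:
            self.pending = []

    def push(self, seed):
        self.pending.append(seed)
        self._flush()

    def peek(self):
        return self.pending[0] if self.pending else None

    def complete(self, seed):
        self.pending.remove(seed)
        self._flush()

    def _flush(self):
        with open(self.path, "w") as f:
            json.dump(self.pending, f)
```

Restarting is then just re-reading the file: any seed still pending is crawled again, and everything already completed is skipped.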
I also implemented some mitigations against network blocking to make the process safer for us.
As a result, we got a list of 559,568 domains using around 68,000 requests: almost 8 domains per request. Compared with the official stats, we collected ~1.5k fewer domains. I’m not quite sure about the reason for that inconsistency, but I think we could get the full list in the long term using additional techniques.
First step done. Our next post will be about the improvements and advances in detectem, our open source website scanner.
Research as a service
If you have scraping needs or want to conduct a survey of Chilean websites, contact us and we can do it for you. We have the experience and skills to carry out projects at scale.