Scan 10M websites for the X-Recruiting header using Go on an AWS Free Tier instance

What do you do when you're looking for a new job? Right, you contact HR or search on websites like LinkedIn, Glassdoor, Stack Overflow, etc.

Let’s try another approach :) Have you heard of the ‘X-Recruiting’ header? For example, if you look at the response headers from PayPal.com, you can see this ‘strange’ header.

It's interesting: how many companies use this smart way to find suitable candidates?

We will try to answer this question using Go and an AWS Free Tier instance. You can run the app on your own machine if you have a good, stable internet connection. For me personally, it didn't work very well because my router froze after an hour and needed to be restarted (too many UDP requests).

Requirements and constraints:

  1. Scanning should be done using workers (we have a huge list of domains to scan)
  2. Random DNS servers should be used (otherwise we will be banned by a DNS server, because we'll be doing so many lookups)
  3. Low memory usage. We want the app to use a small amount of memory, so it fits on a Free Tier instance that has only 1 GB of RAM

Firstly, I downloaded domain datasets from multiple sources:

wget https://www.domcop.com/files/top/top10milliondomains.csv.zip
unzip top10milliondomains.csv.zip
wget http://downloads.majestic.com/majestic_million.csv
wget http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip
unzip top-1m.csv.zip

Then I needed to clean, merge, filter, and save the unique domains to a separate file that will be used for scanning.

awk -F "\"*,\"*" '{print $2}' top-1m.csv > umbrella-1m-domains.txt
awk -F "\"*,\"*" 'NR>1 {print $3}' majestic_million.csv > majestic-1m-domains.txt
awk -F "\"*,\"*" 'NR>1 {print $2}' top10milliondomains.csv > domcop-10m-domains.txt
cat domcop-10m-domains.txt majestic-1m-domains.txt umbrella-1m-domains.txt | sort -u > uniq-domains.txt

Secondly, I needed to prepare a list of DNS servers to be used for domain IP lookups.

wget https://public-dns.info/nameservers.csv
awk -F "\"*,\"*" 'NR>1 && NF>0 {print $1}' nameservers.csv > dns-servers.txt

Finally, I have two files that will be used by the app:

dns-servers.txt
uniq-domains.txt

About the code.

The DNS lookup is implemented using the popular http://github.com/miekg/dns package. An interface is defined so the IP-lookup implementation can be swapped if needed. The Resolver loads the file's content into memory and then uses a random server for each lookup. The Resolve method returns the list of IPs found, if any.
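A rough sketch of the idea (the type and method names here are my own; the original project implements the lookup itself with github.com/miekg/dns, while this sketch uses the standard library's net.Resolver with a custom Dial to reach the randomly chosen server):

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// Resolver picks a random DNS server for each lookup, so no single
// server sees all of our traffic.
type Resolver struct {
	servers []string // "ip:53" entries, loaded from dns-servers.txt
	rnd     *rand.Rand
}

func NewResolver(servers []string) *Resolver {
	return &Resolver{
		servers: servers,
		rnd:     rand.New(rand.NewSource(time.Now().UnixNano())),
	}
}

// randomServer returns one of the configured servers at random.
func (r *Resolver) randomServer() string {
	return r.servers[r.rnd.Intn(len(r.servers))]
}

// Resolve looks up the A records of a domain through a random server.
func (r *Resolver) Resolve(domain string) ([]net.IP, error) {
	server := r.randomServer()
	res := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, server)
		},
	}
	return res.LookupIP(context.Background(), "ip4", domain)
}

func main() {
	r := NewResolver([]string{"8.8.8.8:53", "1.1.1.1:53"})
	fmt.Println("next lookup will use:", r.randomServer())
}
```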

The app is implemented using Go's version of the Producer-Consumer pattern, built on channels. The Worker is our consumer: it receives jobs (domains) to be scanned from the JobQueue channel and exits when the channel is closed. (The producer is responsible for closing the channel when no domains remain to be scanned.)

Because most of our work is IO-bound, we can create thousands of goroutines (workers) and benefit from them even on a 1-CPU machine.

We also use a WaitGroup to wait until all workers have finished their jobs.

We also use a buffered channel when reading domains from the file. This lets the app use a small amount of memory. We could simply load the whole file into memory (and that's actually what I did on my first try), but then we would need a server with more memory, or a swap file, if we still want to run on an EC2 Free Tier instance.
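The pattern described above can be sketched as follows (this is a simplified stand-in, not the project's actual code: `scan` fabricates a result instead of doing a real lookup, and the domain list is inlined rather than streamed from uniq-domains.txt):

```go
package main

import (
	"fmt"
	"sync"
)

// scan stands in for the real job: resolve the domain and check its
// response headers. Here it just fabricates a result string.
func scan(domain string) string {
	return "scanned " + domain
}

// runPool feeds domains through a bounded job queue to numWorkers
// consumers and returns how many jobs completed.
func runPool(domains []string, numWorkers int) int {
	// A small buffer keeps memory bounded: the producer blocks once the
	// channel is full instead of loading all 10M domains into RAM.
	jobs := make(chan string, 100)
	results := make(chan string, 100)

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker exits when the jobs channel is closed and drained.
			for domain := range jobs {
				results <- scan(domain)
			}
		}()
	}

	// Producer: in the real app this reads uniq-domains.txt line by line.
	go func() {
		for _, d := range domains {
			jobs <- d
		}
		close(jobs) // no domains remain; workers drain the queue and exit
	}()

	// Close results once every worker is done, so the range below ends.
	go func() {
		wg.Wait()
		close(results)
	}()

	count := 0
	for range results {
		count++
	}
	return count
}

func main() {
	n := runPool([]string{"example.com", "example.org", "example.net"}, 4)
	fmt.Println("completed jobs:", n) // prints: completed jobs: 3
}
```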

The App object is our producer. It's responsible for creating the workers, synchronizing them, and receiving completed jobs from the workers.

You can see the full code at https://github.com/spaiz/hrscanner

Build and run the app locally

To run it locally, clone the project, install its dependencies, unzip the data files, install it, and run it.

# clone the project
mkdir -p ${GOPATH}/src/github.com/spaiz/
cd ${GOPATH}/src/github.com/spaiz/
git clone git@github.com:spaiz/hrscanner.git .
cd hrscanner
# install the dependency manager used in the project
go get -u github.com/kardianos/govendor
# install project's dependencies
govendor sync
# install the app
go install
# unzip domains and DNS servers files
cd ${GOPATH}/src/github.com/spaiz/hrscanner/data/
unzip dns-servers.txt.zip && rm dns-servers.txt.zip
unzip uniq-domains.txt.zip && rm uniq-domains.txt.zip
cd ${GOPATH}/src/github.com/spaiz/hrscanner/
# run the app with default settings
hrscanner

Run the app on the server

I use a MacBook, so to build a binary that runs on Linux, I compile it statically inside Docker. I prepared a tiny script for this.

./bin/build.sh

It will create a binary file and put it inside the artifacts directory. Now just upload the binary and the data files to the server and run it. I use the scp tool for this (you need to set up SSH access to your server using keys).

scp -r -C -i ~/.ssh/mykey.pem ${GOPATH}/src/github.com/spaiz/hrscanner/data/ ec2-user@remote_host:~/data/
scp -i ~/.ssh/mykey.pem ${GOPATH}/src/github.com/spaiz/hrscanner/artifacts/hrscanner ec2-user@remote_host:~/

Before running the app on the server, we should increase the max-open-files limit, otherwise we will get the well-known error:

too many open files

There are multiple ways to achieve this. I used a method that persists the settings even after a server restart. All actions were performed on a Free Tier Amazon instance launched from an AMI image.

Create a new file:

sudo touch /etc/security/limits.d/custom.conf

And put:

* soft nofile 1000000
* hard nofile 1000000

Then edit /etc/sysctl.conf

sudo nano /etc/sysctl.conf

And add to the end of the file:

fs.file-max = 1000000
fs.nr_open = 1000000
net.ipv4.netfilter.ip_conntrack_max = 1048576
net.nf_conntrack_max = 1048576

You must reconnect to the server. To check that the new settings have been applied, run:

ulimit -n

I use tmux to run the app on the server. It lets me close the terminal while the app keeps running. Just install it on the server and start a session.

sudo apt install tmux
tmux
# run the app with a custom number of workers
./hrscanner -workers=500 > logs.txt &
# exit the session
tmux detach
# now you can close the terminal

Next time you connect to the server, you can reopen the previous session by typing:

tmux attach

All domains with an X-Recruiting header will be saved to the file results.txt (this can be changed via flags).

In my case, the app started at ~700 RPS and after some time settled to a stable ~250 RPS (~900,000 requests per hour).


Source code:

https://github.com/spaiz/hrscanner

P.S.

The solution isn’t ideal. There are no retries… there is no guarantee that all the DNS servers work well… I don’t try the other DNS A records if the first HTTP request fails… but still, it’s good enough for me :)

Tip.

You can create your own domain list, for example by scraping angel.co or crunchbase.com and selecting only the relevant high-tech companies ;)

P.P.S.

The app is still running. I’ll update the results when it finishes scanning all 10M websites :)

Update.
After more than 24 hours of running, the app found 1873 domains with the X-Recruiting header. You can see the report here.

Written by Alexander Ravikovich

In GO we trust. Software Engineer.
