Scraping from 0 to hero (Part 3/5)
Medium scraping
If you want to go further, scrape in larger quantities and work more professionally, you need to start using professional, scalable tools.
Proxies
One of the biggest fears in scraping is being detected as a robot. As we've explained previously, you can be detected in different ways, but the easiest way (by far) is by your IP. To avoid this, you can rotate proxies.
A proxy is a computer, server, mobile device or any other system connected to the internet that hides your IP behind its own. Using proxies simplifies your work, because they give you benefits such as:
- Being able to control the country, or even the city, you connect from when accessing a URL.
- Increasing the security of your connection (to avoid being identified).
Just remember that there are a few considerations when choosing a proxy:
- Type of proxy
- Type of content per proxy
Type of proxies
In order to know what kind of proxy you need, you first have to know what kind of website you are going to scrape. As a reminder, scraping should be the last resort for retrieving data from a site. But if there is no other option, you need to analyze a few things:
- Search engine: sometimes you need to scrape (or initialize the scraping) from a search engine. Not all proxies are allowed to be used with the main search engines. In fact, Google puts a lot of effort into this, and that is why there are proxies that specifically work for scraping Google's products.
- Protocol to use: every website uses a different protocol or level of encryption (see the sketch after this list):
- HTTP
- HTTPS
- SOCKS4/5: based on the SOCKS protocol; versions 4 and 5 are the most recent ones
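To show the difference in practice, here is a minimal sketch using the requests library. The proxy addresses are placeholders, and the SOCKS example needs the optional `requests[socks]` dependency:

```python
import requests

# Placeholder proxy addresses: replace with real ones.
http_proxy = "http://203.0.113.10:8080"      # plain HTTP(S) proxy
socks_proxy = "socks5://203.0.113.11:1080"   # SOCKS5 proxy (pip install requests[socks])

# The same request routed through an HTTP proxy...
response_http = requests.get(
    "https://example.com",
    proxies={"http": http_proxy, "https": http_proxy},
    timeout=10,
)

# ...and through a SOCKS5 proxy.
response_socks = requests.get(
    "https://example.com",
    proxies={"http": socks_proxy, "https": socks_proxy},
    timeout=10,
)
print(response_http.status_code, response_socks.status_code)
```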
Type of content per proxy
Now, let's explore what you can scrape depending on the type of proxy:
- No anonymity: if being discovered is not an issue for you, there are proxies that simply forward your requests without hiding your IP. This is pretty useless, but they exist.
- Anonymous or even high-anonymity proxies: they pass your request through the proxy system and hide your IP behind theirs. Normally this kind of proxy runs on cloud servers (AWS, GCloud, botnets or even dedicated servers).
- Residential proxies: computers of regular people who offer you the possibility to use their IP for a few cents. This is the best option, because they are the closest thing to a real user. I have to warn you that in some cases residential proxies are controlled by hackers who have installed malware on these PCs and use them without the owner's permission.
- Mobile proxies: as you can imagine, just as residential proxies rely on real computers, mobile proxies use mobile devices to access the websites you want to visit/crawl.
There is no single best option among these types. Some are cheaper and others are more reliable; depending on the case, I would recommend one over another.
Rotating Proxy Services
There are multiple ways to rotate proxies (some of them are free, and others are not):
- Pay for a service that scrapes for you using its proxies: you send them the URL and they return its content. An example is Crawlera, but it is an extremely expensive option and I don't recommend it if you are going to scrape massive amounts of URLs. Most of these services offer an easy way to connect with Scrapy to facilitate their usage, because, as we saw at the beginning of the article, Scrapy is the easiest way to scrape both small and big websites.
Other services similar to Crawlera are:
- ScraperApi: a good alternative, but it is more focused on being used as an API (from whatever language you want)
- SmartProxies: the biggest provider of residential proxies. A good option, but not cheap at all
- ProxyRack: pretty new, pretty reliable, and it gives you different proxy options. A good one!
- Luminati.io: the best alternative to Crawlera. They give you a reliable service at low cost
- Retrieving proxies on demand through an API: you can configure any script to scrape websites with them. This does not necessarily mean Scrapy; since it is an API, you can use the proxies wherever you want (if you pay for them, of course!)
- Using TOR is one of the free alternatives. This option is highly anonymous but also very slow, because to anonymize your IP the request has to pass through several nodes on the internet before reaching the targeted website, and the response comes back to you through the same nodes
- Getting proxies from the internet for free and using them is the last alternative (the best way is using this application or these examples: example 1, example 2, example 3, example 4 and the last example). But it is by far the most time-consuming option, as it requires you to build a system that:
- Scrapes the IP addresses of free proxies
- Stores them in a DB
- Verifies that they are active and can be used properly
- Saves in the DB only the active, working ones
- Uses the proxies on demand, based on the Scrapy project's needs (a minimal middleware sketch follows this list)
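As an illustration of that last step, here is a minimal sketch of a Scrapy downloader middleware that attaches a random proxy to each outgoing request. The `ROTATING_PROXIES` setting name and the module path are assumptions for this sketch; Scrapy's built-in HttpProxyMiddleware then honours `request.meta["proxy"]`.

```python
import random


class RandomProxyMiddleware:
    """Attach a random proxy from a configured pool to every request.

    The pool is read from a custom ROTATING_PROXIES setting (an assumption for
    this sketch); in a fuller setup it would come from the DB of validated proxies.
    """

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("ROTATING_PROXIES"))

    def process_request(self, request, spider):
        if self.proxies and "proxy" not in request.meta:
            # Scrapy's built-in HttpProxyMiddleware picks up request.meta["proxy"].
            request.meta["proxy"] = random.choice(self.proxies)
```

To enable it, register the class in settings.py with a priority lower than HttpProxyMiddleware, for example `DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomProxyMiddleware": 350}` (the module path is just an example), and define `ROTATING_PROXIES` as a list of proxy URLs.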
In the advanced scraping section, we will look at a clone of Crawlera's algorithm, to understand what Crawlera does behind the scenes and to create your own professional proxy broker.
Headers
Completely apart from the proxies, every request carries an identifier that servers check all the time. This identifier indicates which browser or application you are using to try to get the website's information.
The headers are a collection of key-value pairs like:
- Accept: “*/*”
- Accept-Encoding: “gzip, deflate, br”
- Accept-Language: “en-GB,en-US;q=0.9,en;q=0.8”
- Connection: “keep-alive”
- Host: “www.wikipedia.com”
- Referer: “https://www.google.com”
- User-Agent: “Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36”
- X-Requested-With: “XMLHttpRequest”
These values can be set manually, but if you always use the same fixed values, your requests will be identified as coming from the same source.
For instance:
- User-Agent: a fixed value will make it look like you are always browsing from the same browser
- Referer: it tells the target website that another site (in the example: https://google.com) redirected the request to the page being scraped
- Accept-Language: if you want to simulate requests coming from different countries, you should change the browser language accordingly (a short Scrapy sketch follows this list)
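As a minimal sketch (not the article's own code), this is how per-request headers can be set in a Scrapy spider; the User-Agent strings and URLs are just illustrative values:

```python
import random

import scrapy

# A small pool of User-Agent strings; in practice you would load a much larger,
# regularly refreshed list. These values are only illustrative.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
]


class HeadersSpider(scrapy.Spider):
    name = "headers_example"
    start_urls = ["https://www.wikipedia.org"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers={
                    "User-Agent": random.choice(USER_AGENTS),
                    "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
                    "Referer": "https://www.google.com",
                },
                callback=self.parse,
            )

    def parse(self, response):
        # Headers set on the Request override Scrapy's defaults for that request.
        self.logger.info("Fetched %s with status %s", response.url, response.status)
```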
The best way to simulate real traffic is to coordinate the headers with the proxy that you are using. This keeps the proxy location consistent with the right headers and avoids creating inconsistencies.
This works especially well if you centralize these operations in one place acting as a proxy gateway: every request then looks more plausible from the point of view of the website we are going to scrape.
This leads me to explain exactly what Crawlera is doing behind the scenes.
Proxy and Headers manager
Can you imagine centralizing all the requests in one scalable system, able to handle every possible request without you having to think about errors, headers or even proxies?
Hell yeah! There is Crawlera for that!
Ok, but… did you check the price? Oh wait!
This is not an option.
Why don't you build your own Crawlera? Mmm, ok, let's try!
Making your own Crawlera
Ingredients
- A list of proxies to use (internal or external, explained in the Rotating Proxy Services section)
- A list of headers to use (can be a static list, or a dynamic one based on a website like this)
- A logic system able to select, discard and renew proxies on the go (using a powerful DB)
Preparation
The whole system is made of different pieces:
- Create a microservice able to receive requests and use the internal proxies and headers (a minimal sketch follows this list)
- Prepare the proxies (if you are using internal proxies and not external ones): get proxies + renew proxies
- Prepare the headers: get headers + prepare combinations of headers and proxies
- Historical data and validations
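A minimal sketch of the first piece, the receiving microservice, might look like this. Flask is my choice here, and the hard-coded proxy and header values are placeholders; a real broker would read them from the DB described below:

```python
import random

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholders: in a real broker these come from the proxy/header DB described below.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:3128"]
HEADER_SETS = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept-Language": "en-GB,en;q=0.9",
    },
]


@app.route("/fetch")
def fetch():
    """Receive a target URL, pick a proxy/header pair and return the page content."""
    url = request.args.get("url")
    if not url:
        return jsonify({"error": "missing url parameter"}), 400
    proxy = random.choice(PROXIES)
    headers = random.choice(HEADER_SETS)
    try:
        upstream = requests.get(
            url,
            headers=headers,
            proxies={"http": proxy, "https": proxy},
            timeout=15,
        )
        return jsonify({"status": upstream.status_code, "body": upstream.text})
    except requests.RequestException as exc:
        # A smarter broker would also demote the failing proxy here.
        return jsonify({"error": str(exc), "proxy": proxy}), 502


if __name__ == "__main__":
    app.run(port=8000)
```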
Prepare the proxies (only for internal proxies)
First, we have to get all the proxies from different sources: the internet, providers, or the companies mentioned above. As we saw, you can use this application (Proxy Manager) to automatically find proxies out there. The data for these proxies can be stored in PostgreSQL.
Once you have stored the proxies (using the application or getting them directly from the free websites), you need to validate them. The Proxy Manager application does this automatically for us; however, a proxy may no longer be healthy after a few hours. That's why I highly recommend a cron process (every 8 hours, for instance) that goes through all the valid proxies in the DB and tests each one. This test can be as simple as trying to reach an existing website through the proxy (if possible, create a simple website for this test, to avoid inconveniencing any third party). In the same process, you can find out where the proxy is located and store that in the DB. A minimal sketch of such a validation job is shown below.
If you are using an external service, you can skip this step and only store in the DB the proxies provided by the service, but be careful with the cost of each one (most of them offer free credits under certain limitations, so you can use them for free or at low cost if you manage them properly).
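A minimal sketch of that periodic validation job, assuming a PostgreSQL `proxies` table with `address`, `active` and `last_checked` columns; the table layout, connection string and test URL are all assumptions:

```python
import psycopg2
import requests

TEST_URL = "https://example.com"   # ideally a tiny site you control


def check_proxy(address: str) -> bool:
    """Return True if the proxy can fetch the test URL within a short timeout."""
    try:
        response = requests.get(
            TEST_URL,
            proxies={"http": address, "https": address},
            timeout=10,
        )
        return response.ok
    except requests.RequestException:
        return False


def revalidate_proxies() -> None:
    # Connection string and table/column names are assumptions for this sketch.
    conn = psycopg2.connect("dbname=scraping user=scraper")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT address FROM proxies WHERE active = TRUE")
        for (address,) in cur.fetchall():
            cur.execute(
                "UPDATE proxies SET active = %s, last_checked = NOW() WHERE address = %s",
                (check_proxy(address), address),
            )
    conn.close()


if __name__ == "__main__":
    # Run this script from cron, e.g. every 8 hours:
    #   0 */8 * * * /usr/bin/python3 /opt/scraping/revalidate_proxies.py
    revalidate_proxies()
```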
Prepare the headers (only for internal proxies)
Once you have the whole universe of proxies, you need to match them with the proper headers, taking into account the language, the referrer and also the type of device you pretend to be browsing the website with.
The first step is retrieving a list of all possible headers. There are a few pages where you can find such lists. After storing them in a DB, you have to group them by:
- Language
- Referrer
- Type of device
The second step is to match the headers and the proxies by checking the country of the proxy against the language of the header; this can be stored as a relationship in the DB.
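Here is a minimal sketch of this matching step; the country-to-language mapping and the data structures are assumptions, and in practice the result would be persisted as the DB relationship mentioned above:

```python
# Rough mapping from a proxy's country to a plausible Accept-Language value.
COUNTRY_TO_LANGUAGE = {
    "US": "en-US,en;q=0.9",
    "GB": "en-GB,en;q=0.9",
    "FR": "fr-FR,fr;q=0.9,en;q=0.6",
    "DE": "de-DE,de;q=0.9,en;q=0.6",
}


def match_header_to_proxy(proxy, header_sets):
    """Pick a header set whose Accept-Language fits the proxy's country."""
    wanted = COUNTRY_TO_LANGUAGE.get(proxy["country"])
    candidates = [h for h in header_sets if h.get("Accept-Language") == wanted]
    # Fall back to any header set if no language-matching one is available.
    return candidates[0] if candidates else header_sets[0]


proxy = {"address": "http://203.0.113.10:8080", "country": "FR"}
header_sets = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
     "Accept-Language": "fr-FR,fr;q=0.9,en;q=0.6"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
     "Accept-Language": "en-GB,en;q=0.9"},
]
print(match_header_to_proxy(proxy, header_sets))
```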
Whenever you use this proxy gateway (given that you have enough proxies with user agents), you should be able to get the data; if you get an error while accessing the website, you just move on to the next proxy on the list.
Historical data and validations
When you are making hundreds of requests to different websites, you will also need to have hundreds of proxies. To avoid detection you should:
- Rotate proxies: don't wait until a proxy gets banned to change it. Every proxy that has processed a request can cool off for a few minutes.
- Avoid following the order of the lists: if you have to scrape thousands of pages in a list (like a paginated list), it's better to go through it in a random order. Otherwise the website can detect that it's the same bot asking from different proxies.
- Proxy proximity: if you detect that the website to scrape is from Belgium, try to use proxies near that country and avoid relying too much on, for example, Chinese proxies.
- Use the proxies smartly: if webpage A refuses a proxy, it does not mean that webpage B is going to refuse it too. Manage the proxies in a smart way and keep a log of where each proxy has failed.
- Use a cache for data: if you are providing services to different people/companies, use a Redis DB as a cache; this way you avoid duplicating requests to the same page. You can store the data with a retention time of one day, for instance (a minimal sketch follows this list).
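A minimal sketch of such a cache, assuming a local Redis instance; the key scheme and the one-day TTL are illustrative choices:

```python
import hashlib

import redis
import requests

# Assumes a local Redis instance; adjust host/port as needed.
cache = redis.Redis(host="localhost", port=6379)
ONE_DAY = 60 * 60 * 24


def fetch_with_cache(url: str) -> str:
    """Return the page body, serving repeated requests from Redis for one day."""
    key = "page:" + hashlib.sha1(url.encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached.decode()
    body = requests.get(url, timeout=15).text
    cache.setex(key, ONE_DAY, body)  # expire the cached copy after one day
    return body
```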
Massive scraping (URL management)
If you are a company that needs to scrape massively, you will probably have to scrape with performance in mind. In that case you can scale the URL-consuming servers horizontally, and it's better to make sure that each URL is processed only once (avoiding multiple servers re-processing the same URL). This way you optimize both speed and server usage.
I recommend the usage of the following structure:
- Multiple ScrapyD servers to manage all the scrapers. Every ScrapyD server can run more than one instance of your scraper
- A unique DB to store the data
- A queue (RabbitMQ) to manage the URLs. You can also use Redis as a queue, but the performance is not going to be as good as with RabbitMQ
To use these queues, you can use these libraries:
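As one possibility (not necessarily one of the libraries linked above), a minimal producer/consumer sketch with pika, a widely used Python client for RabbitMQ, could look like this; the queue name, connection parameters and URLs are assumptions:

```python
import pika

# Connection parameters and queue name are assumptions for this sketch.
QUEUE = "urls_to_scrape"
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)

# Producer side: push the URLs that need to be processed.
for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=url,
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )


# Consumer side: each scraping server pulls one URL at a time and acknowledges it,
# so the same URL is never processed by two servers.
def handle_url(ch, method, properties, body):
    url = body.decode()
    print("Scraping", url)  # here you would schedule the spider (e.g. via ScrapyD)
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue=QUEUE, on_message_callback=handle_url)
channel.start_consuming()
```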
To be continued…
This article is part of the list of articles about Scraping from 0 to hero:
- Link to Part 1
- Link to Part 2
- Link to Part 3 (Current)
- Link to Part 4
- Link to Part 5
If you have any question, you can comment below or reach me on Twitter or LinkedIn.