In this story, I will share 5 tips that I have collected while working on my freelance scraping projects.
- Depend on stable elements
This one looks like an obvious tip, but you can still find expressions like //div/div/div/p/span in commercial scrapers.
This is OK for a one-time scraper, but it is terrible for a long-term one, because it is not stable. Imagine that a developer changes the structure of the website and there is now one more div in that already long expression. Your scraper stops working entirely, which is not what you expect.
Instead, use selectors based on the id / name / property / class attributes.
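To make the difference concrete, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree, which only understands a small XPath subset and needs well-formed markup, so treat it as an illustration rather than a real scraper. The two pages and the price class are made up for the example.

```python
import xml.etree.ElementTree as ET

# Two versions of the same hypothetical page; the second one has an extra wrapper <div>.
OLD_PAGE = (
    "<html><body><div><div><div>"
    "<p><span class='price'>19.99</span></p>"
    "</div></div></div></body></html>"
)
NEW_PAGE = (
    "<html><body><div><div><div><div>"
    "<p><span class='price'>19.99</span></p>"
    "</div></div></div></div></body></html>"
)

BRITTLE = "./body/div/div/div/p/span"   # depends on the exact nesting of the page
STABLE = ".//span[@class='price']"      # depends only on the class attribute


def extract(page, path):
    """Return the text of the first element matching path, or None if nothing matches."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
        print(name, "brittle:", extract(page, BRITTLE), "stable:", extract(page, STABLE))
```

The positional path finds the price only on the old page, while the attribute-based one keeps working after the layout change.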
- Use a ready-made proxy management solution
If you need to crawl a lot, consider using a ready-made proxy management solution. There are lots of them available on the internet, so just choose the one that fits you best.
In my experience, building something like this from scratch is not worth your time and effort.
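On the scraper side, plugging in such a service is often just a matter of pointing your HTTP client at it. Here is a minimal sketch with the standard library, assuming the provider exposes a single gateway endpoint (check your provider's documentation, as this varies). The gateway address and credentials below are placeholders, not a real service.

```python
import urllib.request

# Hypothetical gateway address; a real provider gives you its own host, port and credentials.
GATEWAY = "http://USERNAME:PASSWORD@proxy.example.com:8000"


def build_proxy_opener(gateway=GATEWAY):
    """Return an opener that sends HTTP and HTTPS requests through the gateway."""
    handler = urllib.request.ProxyHandler({"http": gateway, "https": gateway})
    return urllib.request.build_opener(handler)


# Usage (performs a real network request, so it is left commented out):
# opener = build_proxy_opener()
# html = opener.open("https://example.com/", timeout=10).read()
```

Rotation, retries and banned-IP handling then stay on the provider's side, which is exactly the part that is expensive to build yourself.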
- Use monitoring
Of course, this does not apply to a one-time scraping project.
If you are using custom-made scrapers, you will most likely need to write a custom monitoring system yourself. But it does not have to be as complicated as you may think.
The simplest metric to monitor is, for example, the number of scraped items. With the help of this one, you can easily identify issues with your scraper.
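A check on the number of scraped items can be as small as the sketch below. The function name, the 50% threshold and the idea of comparing against the average of previous runs are my assumptions for the example, so tune them to your own data.

```python
from statistics import mean


def check_item_count(current, history, max_drop=0.5):
    """Return a warning message if this run looks unhealthy, otherwise None.

    current  -- number of items scraped in this run
    history  -- item counts from previous healthy runs (may be empty)
    max_drop -- allowed relative drop against the historical average
    """
    if current == 0:
        return "no items scraped"
    if history:
        baseline = mean(history)
        if current < baseline * (1 - max_drop):
            return f"item count dropped: {current} vs. average {baseline:.0f}"
    return None


if __name__ == "__main__":
    # Hypothetical counts from the last three runs and from the current one.
    warning = check_item_count(40, [100, 100, 100])
    if warning:
        print("ALERT:", warning)
```

Run it at the end of every crawl and send the message wherever you already get notifications.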
- Use an API instead of HTML page scraping
This option is rarely available on the websites you may want to scrape, but it is still worth checking. It is really easy to do using Chrome Developer Tools. Just go to the Network tab and observe which requests the page makes.
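If the Network tab does show a JSON request behind the page, the scraper can become a few lines of code. A minimal sketch, assuming a hypothetical endpoint that returns an object with an items list, where each item has name and price fields (the real URL and field names are whatever you see in the Network tab):

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Download a JSON document from url and return it as Python objects."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_items(payload):
    """Pull (name, price) pairs out of the assumed {"items": [...]} structure."""
    return [(item["name"], item["price"]) for item in payload.get("items", [])]


# Usage (hypothetical URL, performs a real network request):
# payload = fetch_json("https://example.com/api/products?page=1")
# for name, price in extract_items(payload):
#     print(name, price)
```

There is no HTML to parse and no selectors to break, which also helps with the first tip.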
- Buy data instead of scraping
In a few cases, you can buy data from the website owners. One example I saw was a website with data about companies that operate in different spheres, essentially just a website with a dictionary. Of course, it is not a big deal to scrape all the information you need from such a site, but consider the option to buy it as well.
You can even go further and make an agreement with the website owners to provide you with the API you need. But this is an extremely rare case, and it has to be worth it.