Scraping is a great way to collect valuable data, but a lot can go wrong. Here’s Ruxandra Burtica, co-founder and CTO of CloudHero, sharing the lessons she learned in her session at Source Summit AI:
“If you do scraping, you’ll definitely hit the rate limit and they’re gonna block you… Each of these platforms has a different policy. Facebook will ban you if you do more than 60 requests per second, for example. So we were scraping comments from YouTube; they have [a] more lenient policy, meaning you could scrape a lot more. We didn’t use an API, we were just parsing the HTML… But [they would block our IP].”
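One common way to stay under a limit like the ones Ruxandra describes is to throttle requests on the client side. The sketch below is only an illustration and not CloudHero's code: the `Throttle` class, its parameters, and the rate of 50 requests per second in the usage note are assumptions made for this example.

```python
import time


class Throttle:
    """Client-side limiter: space calls so at most `max_per_second` happen per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed and return the seconds waited."""
        now = self.clock()
        if self.next_allowed is None:
            self.next_allowed = now
        delay = max(0.0, self.next_allowed - now)
        if delay > 0:
            self.sleep(delay)
        # Reserve the following slot one interval after this one.
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay
```

A scraper would create something like `throttle = Throttle(max_per_second=50)` once and call `throttle.wait()` before every request, picking a rate safely below whatever the target site tolerates.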
One of the other participants in her session explained how he overcame the problem of blocked IP addresses:
“With Amazon Web Services [you can] just run your scraping so whenever it detects that a server has been blocked, you just kill it and add a new one. They’re like $0.01 per hour so you can automate that really quickly.”
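The kill-and-replace loop he describes could look roughly like the sketch below. It deliberately avoids any real cloud API: `is_blocked`, `terminate`, and `launch` are hypothetical callables you would wire up to your own provider, and treating HTTP 403 or 429 responses as a sign of blocking is an assumption, since each site signals a block differently.

```python
def looks_blocked(status_code):
    """Assumed heuristic: 403 (Forbidden) or 429 (Too Many Requests) means blocked."""
    return status_code in (403, 429)


def rotate_blocked(servers, is_blocked, terminate, launch):
    """Return a new server list in which every blocked server has been replaced."""
    healthy = []
    for server in servers:
        if is_blocked(server):
            terminate(server)         # kill the blocked machine
            healthy.append(launch())  # bring up a fresh one with a new IP
        else:
            healthy.append(server)
    return healthy
```

Running `rotate_blocked` on a schedule, or whenever a scraper reports a blocked response, keeps the pool at a constant size without manual intervention.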
He then explained that if the servers at the data center are all blocked, you get a new data center. But Ruxandra also had other challenges:
“If you have 10 keywords you’re looking for, that’s manageable. But if you have 10k-30k, you have to have a lot of instances running only those jobs that would get the data. So, we split those jobs into separate controllers [which] would run on certain machines, and we were passing data on queues to the next controller. So we had a pipeline of data ingesting controllers that were doing the data acquisition. Then data would go through some queues. We used Kestrel; it was built by Twitter.”
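To make the controller-and-queue idea concrete, here is a small single-machine sketch. It swaps Kestrel for Python's in-process `queue.Queue` and uses threads instead of separate machines, so it only shows the shape of the pipeline; the `controller` and `run_pipeline` names and the two-stage fetch/parse split are assumptions for illustration, and error handling is left out.

```python
import queue
import threading

STOP = object()  # sentinel telling a controller to shut down


def controller(inbox, outbox, handle):
    """Read items from `inbox`, process each one, and pass the result to `outbox`."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(handle(item))


def run_pipeline(keywords, fetch, parse):
    """Push keywords through a fetch stage and a parse stage; return the results."""
    to_fetch, to_parse, done = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=controller, args=(to_fetch, to_parse, fetch)),
        threading.Thread(target=controller, args=(to_parse, done, parse)),
    ]
    for stage in stages:
        stage.start()
    for keyword in keywords:
        to_fetch.put(keyword)
    to_fetch.put(STOP)
    for stage in stages:
        stage.join()
    results = []
    while True:
        item = done.get()
        if item is STOP:
            return results
        results.append(item)
```

Because each stage only talks to its neighbors through a queue, a slow stage can be scaled by starting more workers on the same queue, which is the property that makes the design practical for tens of thousands of keywords.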
By working around the scraping policies of the websites involved, creating an automatic system to switch between servers in case one is blocked, and using separate controllers for data ingestion, Ruxandra’s company was able to successfully scrape large amounts of useful data to feed into AI systems.