Five Mistakes To Avoid When Scraping a Website

Don’t fall into these common mistakes

Alessandro Artoni
Geek Culture

Image by Masterfer

Web scraping is a lot of fun, and it is very helpful when you need to kickstart a data project or extract data from newspapers or social media. In fact, one widely cited estimate put the data created in 2020 at about 1.7 MB every second for each person: data you can use to train your own machine learning model or run some analysis!

I have scraped a lot of websites using Python requests, Selenium and BeautifulSoup.

Here are 5 tips I would like to share regarding how to design and develop a scraping pipeline.

1. Always save the raw data

Once data is consumed, it is gone. Forever.

So, when you’re scraping a website, the first thing you should always do is save the webpage exactly as you downloaded it, and only then parse it.

Imagine you later realize you want to run an analysis on a piece of information you didn’t collect. You would need to download all the data again, and some pages may have changed or disappeared in the meantime.

Instead, if you saved the original data, you can simply extract the new information from the same saved pages, so it stays consistent with everything you already collected.
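To make the idea concrete, here is a minimal sketch of a "save first, parse later" step. It is only an illustration under a few assumptions: it uses the Python standard library (urllib and html.parser) so it runs without extra packages, but the same pattern applies with requests and BeautifulSoup. The function names, the hashed file naming scheme and the choice of the page title as the extracted field are all hypothetical.

```python
import hashlib
import urllib.request
from html.parser import HTMLParser
from pathlib import Path


def raw_path_for(url: str, raw_dir: Path) -> Path:
    """Map a URL to a stable file name (hypothetical naming scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return raw_dir / f"{digest}.html"


def save_raw(url: str, body: bytes, raw_dir: Path) -> Path:
    """Write the untouched response body to disk before any parsing."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    path = raw_path_for(url, raw_dir)
    path.write_bytes(body)
    return path


def fetch_and_save(url: str, raw_dir: Path) -> Path:
    """Download a page once; reuse the saved copy if it already exists."""
    path = raw_path_for(url, raw_dir)
    if path.exists():
        return path
    with urllib.request.urlopen(url, timeout=30) as response:
        return save_raw(url, response.read(), raw_dir)


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_title(path: Path) -> str:
    """Parse a saved page; this can be rerun later with new extraction logic."""
    parser = TitleParser()
    parser.feed(path.read_bytes().decode("utf-8", errors="replace"))
    return parser.title.strip()
```

With this split, adding a new field later means writing another parser and running it over the files in the raw directory, without touching the website again. Storing the bytes unmodified, rather than an already cleaned version, is what keeps that option open.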

Alessandro Artoni

MLOps Engineer and Data Engineer. Consultant with experience in Manufacturing, Consumer Packaged Goods and Healthcare. I Co-founded artivon.com.