The Most Overlooked Part of Web Scraping

Learning web scraping is hard, and it's all too easy to overlook the things that happen outside your code.

Samu Kaarlela
CodeX
3 min read · Jun 11, 2021


(Photo: @kellysikkema, Unsplash)

With data science being such a technical field, it's easy to forget to look at the bigger picture. Especially when you learn mostly through online resources, it's tempting to focus on finding the perfect algorithm, the fastest way to run your code, and other fine-tuning.

Now one might think that the one place where that fine-tuning truly matters is web scraping. Since you're making hundreds or thousands of requests to a server, every bit of saved computing power and every byte of data you avoid storing unnecessarily can have a significant impact.

And while this is true to a degree, we quite often end up spending too much time on the fine details of projects that don't require it.


All data scientists learn to clean and transform data into a usable format, and for many of them web scraping isn't a primary field, just a side tool. In those cases it's usually better to scrape the raw data and use Pandas or NumPy to clean it after it's been exported.
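Here's a minimal sketch of what that post-export cleaning can look like. The file name and column names are made up for illustration; the point is that the scraper just dumps raw rows and Pandas tidies them up afterwards:

```python
import pandas as pd

# Hypothetical file and columns: the scraper dumped everything raw to CSV
df = pd.read_csv("scraped_listings.csv")

# Clean after the fact instead of inside the scraper
df["price"] = (
    df["price"]
    .str.replace(r"[^\d.]", "", regex=True)  # strip currency symbols, commas
    .astype(float)
)
df["title"] = df["title"].str.strip()
df = df.drop_duplicates().dropna(subset=["title", "price"])
```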

This approach also helps make sure you don't accidentally miss any data: since you're cleaning the data after exporting it, you can be confident you have everything you need. It also ensures that your data enters your model or visualization tool in the exact same format every time.

So, what are the disadvantages? Obviously, as you scale up and start pulling huge amounts of data, efficiency becomes a top priority to save resources. You can also limit the amount of data being extracted and optimize the methods used to improve runtime significantly. That means less waiting around and more time doing actual work.
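If the scrape itself is the bottleneck, one common optimization is to extract only the fields you need while making the request, instead of saving whole pages and sifting through them later. A rough sketch, assuming a hypothetical listing page and made-up CSS selectors:

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> list[dict]:
    """Pull only the needed fields instead of storing the full HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # ".listing", "h2" and ".price" are hypothetical selectors
    return [
        {
            "title": item.select_one("h2").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        }
        for item in soup.select(".listing")
    ]
```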


Like with all methods, you need to be able to look at the bigger picture. Is it better to get the raw data out quickly so you can start working on it sooner? Or is it more important that the scrape itself is quick and efficient? You should also consider how many times that scrape will be run in the future. There are things to consider no matter which route you choose.

Even when getting data out quickly, you still need to put thought into the format the data gets exported in. That means the file type, but for tabular data it also means deciding what the columns and rows are. It might also mean combining or splitting some of the extracted fields, as in the sketch below.
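For example, a scraper often returns one combined string per item, and you have to decide how to split it into columns before writing the file. A small sketch with made-up field names:

```python
import pandas as pd

# Made-up scraped records with a combined field
rows = [{"name": "Widget A", "price_and_currency": "4.99 EUR"}]
df = pd.DataFrame(rows)

# Decide the tabular shape before export: split one scraped field
# into two proper columns, then write the file downstream tools expect
df[["price", "currency"]] = df["price_and_currency"].str.split(" ", expand=True)
df = df.drop(columns=["price_and_currency"])
df.to_csv("products.csv", index=False)
```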

Similarly, even if you spend a long time during scraping ensuring that the data is gathered cleanly and efficiently, you still need a layer of data cleaning to make sure the data enters the model in a usable form. This is especially important for models that need to run continuously and cannot afford to break because of the wrong type of input data.
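That cleaning layer can be as simple as a validation step that coerces incoming data to the types the model expects and drops whatever can't be coerced, rather than letting one bad row crash the pipeline. A minimal sketch, assuming a hypothetical schema:

```python
import pandas as pd

# Hypothetical schema for the model's input
EXPECTED_DTYPES = {"price": "float64", "quantity": "int64"}

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce scraped data to the expected schema; drop rows that
    can't be coerced instead of crashing the running model."""
    for column in EXPECTED_DTYPES:
        if column not in df.columns:
            raise ValueError(f"missing required column: {column}")
        df[column] = pd.to_numeric(df[column], errors="coerce")
    return df.dropna(subset=list(EXPECTED_DTYPES)).astype(EXPECTED_DTYPES)
```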


So to conclude, always remember to consider your project's implementation from start to finish, and don't be afraid to spend time not coding. It's fine to take a whole day just to plan how you will execute each step of the project. Doing so can save you a lot of debugging, and therefore time, down the road.

Follow me on Medium to see all my posts.
You can also follow me on other social media at:
Twitter
LinkedIn
