When should you use data scraping in your program or project?

Samuel Cornet
Aug 15 · 2 min read

Much like retrieving data from an API, often in JSON format, so that it can be used in other applications or projects, data scraping in Ruby or any other language lets us do much the same work, but in a different way.

What is data scraping, put as clearly as possible? It is a programmatic method for extracting human-readable data from a chosen website or application so that it can be used in another program or application. We often use data scraping to stand in where an API falls short. In other words, not all programs, websites, or applications provide an API, so to retrieve, extract, or export the information we need, we use data scraping.

To use data scraping, you first need to know your CSS selectors. You have to know exactly which CSS selectors to target on the chosen website in order to pull out the exact data you want to export. Ruby provides a quick way to scrape data through a gem called Nokogiri. This gem makes it easy to parse an HTML document once you have fetched it and to reach elements through their CSS selectors.

Let’s imagine you are dealing with a website that contains a ton of information, most of which you may never need. More specifically, let’s say you’re dealing with the magazine of a news website. You are building a sports web application and you notice that this magazine periodically publishes sports information that you would like to use in your project. In this case, your focus is not on the other sections of the magazine or news website but on the “sport” section. So you will need to decide which part of the sport news you really want and point your CSS selectors at it in order to extract that data.

One of the greatest benefits of data scraping is that the extracted information is dynamic: your application reflects whatever the source website shows at the moment you scrape it. For example, if the magazine shows “Real Madrid manager: Zinedine Zidane” today, the next day or the next hour it may show “Real Madrid manager: Eden Hazard”. The next time your application scrapes the page, it picks up that change. In this way, data scraping keeps the information in your application in sync with the magazine's sport section.

I have built a CLI project that uses data scraping, and I really think it is a great tool. The project is available on GitHub (https://github.com/SamuelC28/city_and_capital) and on YouTube (https://youtu.be/z5xHjsdqHw8). It lists some countries together with their cities, as provided by a website. If that website ever alters its list of countries and cities, all the changes will be reflected the next time I run my CLI project.