Create Your Dataset by Web Scraping Using Python BeautifulSoup
People who are new to the field of data science usually start their first projects with prepared datasets (generally .csv or .xlsx files). But in real-life problems, things don’t always work that way. Sometimes we need to pull data directly from a database, and sometimes we need to create our own dataset. To obtain the data we want, we may need to conduct surveys, or we may need to organize data already published on the Internet into a form we can use ourselves.
In this article, we will examine step by step in detail how to collect data from the Internet and how to create a dataframe using the data we collected through a basic case study.
1. Determining the Web Page to Be Used for Scraping
First of all, we have to find a web page containing the information we want to obtain. In this article, we will use a real-estate website called Zingat, filtered for the Kadıköy district of Istanbul.
2. Getting All the House Links We Will Use
After determining the web page, we need to write a code block that lets us reach the information behind the links on the page one by one. For this purpose, we should start by importing the Python libraries we will use in the project. We are going to use:
- BeautifulSoup Library for data extraction from the web.
- Pandas Library for exporting our dataframe.
If you have not installed these Python libraries yet, please install them first. If you already have them, let’s start by importing them.
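The imports could look like the sketch below. Note that an HTTP client is also needed to send the requests described in the next step; the `requests` library is assumed here, although it is not listed among the two libraries above.

```python
# If not installed yet:  pip install beautifulsoup4 requests pandas
import requests                # HTTP client, assumed for sending the requests
from bs4 import BeautifulSoup  # HTML parsing and data extraction
import pandas as pd            # dataframe creation and export
```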
After the importing step, to access the information on this web page we need to send a request to it. To do this, we will write a short function that fetches the page we want.
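A minimal sketch of such a function, assuming the `requests` library for the HTTP call (the browser-like User-Agent header is also an assumption; some sites reject requests without one):

```python
import requests
from bs4 import BeautifulSoup

def get_page(url):
    """Request a web page and return its HTML as a parsed BeautifulSoup object."""
    # A browser-like User-Agent is assumed; some sites block the default one.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()  # fail loudly on a bad status code
    return BeautifulSoup(response.text, "html.parser")
```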
The listings we want to collect span 42 pages, and to access the information on them we need to send a request to each page one by one. To do this, we will write another function that visits all of the pages.
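One way to sketch this is to generate the 42 page URLs first and then fetch them in turn. The `?page=N` query parameter and the listing URL shown here are assumptions about how the site paginates, so check the real URL pattern in the browser before using them.

```python
def build_page_urls(base_url, page_count=42):
    """Build the URL of every listing page, assuming a ?page=N parameter."""
    return [f"{base_url}?page={n}" for n in range(1, page_count + 1)]

# Hypothetical listing URL; each of these pages would then be fetched and parsed.
page_urls = build_page_urls("https://www.zingat.com/kadikoy-satilik-daire")
```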
As a final step, we should get every house link so we can access the information inside each ad. Before doing this, we need to find the text we are looking for in the HTML of the page. If we right-click a house ad, we will see the Inspect option, and clicking it opens a panel like the one below. The gray line in the figure below contains the link we want to access.
After reaching the page that contains the ads, we can build a Python list from the links of the house ads we want. To access all the links, we will use the findAll() method, which returns every HTML tag matching our query. For this example, our main tag is “a” with the class “zl-card-inner”, and the link itself is stored in the “href” attribute. To collect the link from all house ads, we write the code below and put it in a for loop.
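Based on the tag and class named above, the link collection could be sketched as:

```python
from bs4 import BeautifulSoup

def extract_house_links(soup):
    """Collect the href of every ad card: an <a> tag with class "zl-card-inner"."""
    links = []
    for a in soup.findAll("a", {"class": "zl-card-inner"}):
        href = a.get("href")  # the link lives in the href attribute
        if href:
            links.append(href)
    return links
```

Running this on every listing page and concatenating the results gives the full list of ad links.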
3. Determining the Features We Want to Scrape
Now that we have the links of the house ads, we can decide which house features to collect. In this example, we will get the features inside the red circles shown below. These features are:
- Ad’s Name
- Neighborhood’s Name
- Net Square Meters
- Gross Square Meters
- Number of Rooms and Halls
- Number of Bathrooms
- Number of Photos in the Ad
After determining the features we want to scrape, we can start extracting their values into our dataframe. To do this, we need to locate each feature in the HTML code of the ad page. Let’s go through some examples.
For example, to get the ad’s name, we will right-click it and click the Inspect button. We will see a screen like the one below.
In this example, our main tag is “div” with the class “col-md-9”, and our text is located in the “h1” tag. We use “.text” to get the text inside an HTML tag. So, to extract this sentence from the HTML code, we write the code below. It is an easy one.
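As a sketch, assuming `soup` is the parsed ad page:

```python
from bs4 import BeautifulSoup

def get_ad_name(soup):
    """Read the headline: the <h1> inside the <div class="col-md-9"> container."""
    return soup.find("div", {"class": "col-md-9"}).find("h1").text.strip()
```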
As another example, let’s get the ad’s location. To do this, we will find the location in the HTML code.
In this example, our main tag is “div” with the class “detail-location-path”, and our text is located in the “h2” tag. As in the example above, we get the text by writing “.text”. But here that alone will not give the result we expect, because the text also includes the city and district names. To solve this, we will use the “replace” function, which substitutes one substring with another. For this example, we replace “İstanbul” with “ ”, “Kadıköy” with “ ”, and “,” with “ ”. In addition, we use the “strip” function to remove the surrounding spaces. Awesome!
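The cleanup described above could be sketched like this (the exact format of the h2 text is taken from the description, so treat it as an assumption):

```python
from bs4 import BeautifulSoup

def get_neighborhood(soup):
    """Take the <h2> inside <div class="detail-location-path"> and strip the
    city name, district name, and commas, leaving only the neighborhood."""
    text = soup.find("div", {"class": "detail-location-path"}).find("h2").text
    text = text.replace("İstanbul", " ").replace("Kadıköy", " ").replace(",", " ")
    return text.strip()
```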
In our last example, we want to get the number of rooms in the ad. To do this, we will again locate the number of rooms and halls in the HTML code.
In this example, our main tag is “ul” with the class “row attribute-detail-list”; inside it, each entry is an “li” tag with the class “col-md-6”, and our text is located in the “span” tag. However, there is one more issue: the text consists of both the room number and the hall number. This example seems more complicated than the others, but it is easy too. To get the room information, we first select the “ul” tag and then its “li” tags. As seen in the figure above, there is more than one “col-md-6” row, so we need to pick the right one by adding the index of the rooms-and-halls row, which is “”, to our code. As the last step, we keep only the number of rooms: similar to the previous line, we add the index that holds the room count, “”, to our code. Nice work!
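Under the assumptions above, a sketch might look like the following. The row index and the way the room count is separated from the hall count are placeholders (`row_index=0` and splitting on “+” are guesses, since the original indices are not shown):

```python
from bs4 import BeautifulSoup

def get_room_count(soup, row_index=0):
    """Pick the rooms-and-halls row out of the attribute list and keep only
    the room count. row_index and the "rooms + halls" text format are
    assumptions; check them against the real page."""
    rows = (soup.find("ul", {"class": "row attribute-detail-list"})
                .findAll("li", {"class": "col-md-6"}))
    text = rows[row_index].find("span").text.strip()  # e.g. "3 + 1"
    return text.split("+")[0].strip()                 # number of rooms only
```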
4. Creating Our Dataframe
We can get all the features mentioned at the beginning of Chapter 3 by writing code similar to the examples above. If we put all the features we want to scrape in a for loop, we obtain a Python list of feature values for each ad, and from those lists we can create a dataframe. The figure below shows how all the features we mentioned are collected.
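A condensed sketch of the whole loop, run here on two stand-in ad snippets instead of the live pages; only the ad name and neighborhood are shown, and the other features follow the same pattern:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in ad pages; in the real project each soup is fetched from a house link.
sample_pages = [
    '<div class="col-md-9"><h1>Flat A</h1></div>'
    '<div class="detail-location-path"><h2>İstanbul, Kadıköy, Moda</h2></div>',
    '<div class="col-md-9"><h1>Flat B</h1></div>'
    '<div class="detail-location-path"><h2>İstanbul, Kadıköy, Fikirtepe</h2></div>',
]

houses = []
for html in sample_pages:
    soup = BeautifulSoup(html, "html.parser")
    houses.append({
        "ad_name": soup.find("div", {"class": "col-md-9"}).find("h1").text.strip(),
        "neighborhood": (soup.find("div", {"class": "detail-location-path"})
                             .find("h2").text
                             .replace("İstanbul", " ").replace("Kadıköy", " ")
                             .replace(",", " ").strip()),
    })

df = pd.DataFrame(houses)  # one row per ad, one column per feature
```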
After all these processes, the dataframe we created becomes as seen below.
As a final step, if we wish, we can export the dataframe to a .csv or .xlsx file for use in our next studies by writing this short piece of code.
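For example (the file name is arbitrary, and `df` is sketched here with a one-row stand-in for the dataframe built above):

```python
import pandas as pd

df = pd.DataFrame({"ad_name": ["Flat A"], "neighborhood": ["Moda"]})  # stand-in
df.to_csv("kadikoy_houses.csv", index=False)       # .csv export
# df.to_excel("kadikoy_houses.xlsx", index=False)  # .xlsx export (needs openpyxl)
```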
For more detailed information and questions about the article, you can contact me via my LinkedIn account. Thank you for giving your time. See you soon in my next posts :)