Web Scraping with Python Part 2: The Training Wheels Come Off

Chris Johnson
Oct 14, 2020 · 7 min read

As promised, here is part 2 of my web scraping story. Up to this point, my experience with web scraping had been with using APIs and wrappers to collect what I wanted from different websites; on this adventure, however, I found that those tools were not available. As I approached the end of my Data Science cohort, I began work on my capstone project. I settled on creating a hotel recommendation system that would generate recommendations from user-generated text input and compare it against the text descriptions on hotel websites. For my project I scraped over 21,000 hotel websites from some of the largest hotel companies in the world. In this article I will describe my experience with one of those companies, using the Chrome web browser and its developer tools, and BeautifulSoup with requests. I had originally planned to cover my use of Selenium here as well, but given the length of this article, I will save that for a bonus piece.

Trial and Error

The biggest takeaway from this project is that web scraping code is unique to the project it is written for. It takes trial and error, time, and energy to do all of the inspection needed to get the code right. Still, the time put in is nothing compared to the time it would have taken me to manually collect the data for 5,200 hotel properties.

[Image: Imports used for this project]
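
In rough form, the core of that stack looks like this (a minimal sketch; the exact imports and aliases in the screenshot may differ):

```python
import requests                   # fetch each page's raw HTML
from bs4 import BeautifulSoup     # parse the HTML into a searchable tree
import pandas as pd               # build DataFrames from the scraped records
```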

At the outset of my project I found that there were no APIs, tools, or datasets readily available; nor was I able to scrape aggregator websites like TripAdvisor, where all of the hotels are listed together. So I would begin from scratch. Our lessons on web scraping consisted of pulling data from a static website with very little formatting; my first real-world target, however, was a hotel company whose site did not follow that pattern. I played with my first scraping code for days, attempting several different ways of grabbing the right tags. It was a frustrating experience, to say the least, but I stuck with it. Ultimately I was able to write code that takes in a CSV file listing the cities in a state. Iterating through the list, each city is concatenated with a base URL to build the URL for that city's page on the hotel website. Once the page is requested, BeautifulSoup parses all of the information on it. The next step is to create an object from the soup so that I can collect the name and URL of every hotel in that city. Using the href attribute, I build a dictionary for each hotel, append it to a list of hotels, and ultimately create a DataFrame of hotel names and URLs. TA-DA! Now I have successfully collected all the information I need… to begin collecting all the other information I really need.

[Image: Collecting my Hotel Names and URLs]
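
Reconstructed as a sketch, that city-to-hotel-list step looks roughly like this. The CSV column name, URL pattern, and the "hotel-link" selector are illustrative stand-ins; the real values came from inspecting the hotel company's pages:

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def collect_hotels(city_csv, base_url):
    """Return a DataFrame of hotel names and URLs for every city in a CSV."""
    cities = pd.read_csv(city_csv)["city"]   # column name is an assumption
    hotels = []
    for city in cities:
        # Concatenate the city with the base URL to reach its listing page
        url = f"{base_url}/{city.lower().replace(' ', '-')}"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # Each hotel link holds the name as text and the URL in its href
        for link in soup.find_all("a", class_="hotel-link"):  # selector is an assumption
            hotels.append({"name": link.get_text(strip=True), "url": link["href"]})
    return pd.DataFrame(hotels)
```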

Now that I have a DataFrame of hotel names and, more importantly, their URLs, I can iterate through the URLs and use BeautifulSoup on each hotel's page to collect the information I need. I decided to collect the text description, the address, and the geo-coordinates for each hotel, so I create an empty list for each of these. I also create an empty list to collect the hotels my code fails on, so that I can retry them later.

[Image: Setting up my empty lists to collect the information, and iterating through the hotel list]
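
In code, that setup is just four accumulators (the names mirror how I refer to them below):

```python
# One empty list per feature, plus one for hotels the code fails on
descriptions = []
addresses = []
geotags = []
missing_hotels = []
```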

A preface for the remainder of the article: after getting my code exactly how I wanted it and running it on the entire state, I found that the page I had used as a template was only one version of the website used by this particular hotel company, so the code failed to collect many hotels. With this information, I rewrote my code around a set of try/except statements, going through the trial-and-error phase again. The first website template is handled in the try block, and the second template in the except block, with a new try/except nested inside it. If scraping against the second template also fails, I collect the hotel name and URL in a list of missing hotels.
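
Schematically, the loop has this shape. This is a minimal sketch; scrape_first_template and scrape_second_template are hypothetical stand-ins for the parsing shown in the sections below:

```python
for _, row in hotel_df.iterrows():   # hotel_df: the names-and-URLs DataFrame from earlier
    hotel_name, hotel_url = row["name"], row["url"]
    try:
        # First website template (hypothetical helper for the code below)
        scrape_first_template(hotel_name, hotel_url)
    except Exception:
        try:
            # Fall back to the second website template (also hypothetical)
            scrape_second_template(hotel_name, hotel_url)
        except Exception:
            # Neither template matched: save this hotel to retry later
            missing_hotels.append({"name": hotel_name, "url": hotel_url})
```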

I iterate through each row of the DataFrame and use the try/except statements to collect the features listed above. As before, BeautifulSoup parses each page, and I then create objects from the soup to gather the specific information I want. This requires studying the website's source and finding the elements that contain the attributes I am looking for. In this instance, I create a BeautifulSoup object called "hotel_soup"; then, to collect the description, I create an object called "item" from the "main" element, iterate through it looking for paragraph ('p') tags, and collect the text of the second one. I create an empty dictionary called "items" to save the description, and include a key/value pair for the hotel name as well, to be used when merging DataFrames later. Below is a snippet of the code to collect the description; I follow the same steps for the address and geocode portions of the code.

[Image: Code snippet collecting the description on the first template]
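
A rough sketch of that snippet (hotel_name and hotel_url come from the loop above; the tags follow the description in the text rather than the screenshot verbatim):

```python
hotel_soup = BeautifulSoup(requests.get(hotel_url).text, "html.parser")

# On this template the description is the second paragraph inside <main>
item = hotel_soup.find("main")
paragraphs = [p.get_text(strip=True) for p in item.find_all("p")]
description = paragraphs[1]    # second instance of 'p'

# Keep the hotel name alongside the text so the DataFrames can be merged later
items = {"name": hotel_name, "description": description}
descriptions.append(items)
```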

As mentioned above, I created a second set of try/except statements inside the except portion of the first. The code for this portion is arguably much more complex, as you will see in the screenshot below. As with my first attempt, I create a BeautifulSoup object called "hotel_soup", and an "item2" from the soup at the specific point where the pieces of the description live. Again I create an empty dictionary called "items"; however, this particular website template has an extensive amount of descriptive text, so I iterate through the object I created and append each instance to an empty list called "descriptions1". I then iterate through this list and remove the items that are fewer than three characters long. Lastly, I clean the text of any HTML remnants and append the results to a new list called "final_description". I create key/value pairs for the description and the hotel name in the "items" dictionary and append the dictionary to the descriptions list before moving on to the address and geocode portions. I follow the same steps for the address, and I create an empty dictionary for the geocode, as it is not available on this website template. As promised, the code is below; compared to the first template it is much more complex, and it also cleans the strings up while scraping. To finish, I close out the second set of try/except statements with a dictionary of just the hotel name and URL, appended to the list of missing hotels.

[Image: Code for the second website template]
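
A simplified sketch of that logic; the "property-description" selector is an illustrative stand-in for the element I actually targeted:

```python
hotel_soup = BeautifulSoup(requests.get(hotel_url).text, "html.parser")

# The description on this template is scattered across many small elements
item2 = hotel_soup.find("div", class_="property-description")  # selector is an assumption
descriptions1 = [tag.get_text() for tag in item2.find_all("p")]

final_description = []
for text in descriptions1:
    if len(text) < 3:                      # drop stray fragments
        continue
    # Strip HTML remnants and leftover whitespace from the markup
    final_description.append(text.replace("\xa0", " ").strip())

items = {"name": hotel_name, "description": " ".join(final_description)}
descriptions.append(items)

# No coordinates are published on this template, so the geotag entry
# carries only the hotel name
geotags.append({"name": hotel_name})
```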

Now that I have collected all the pieces of information, the last step is to convert the lists of dictionaries into DataFrames I can use for my project. In this case, I have four lists: descriptions, addresses, geotags, and missing hotels. I pass each of the first three lists into its own pandas DataFrame, merge the first two, and then merge the third into the result to make a final DataFrame called "hotels_df", merging on the "name" column each time. Lastly, I perform a drop_duplicates just in case. I follow the same steps for the missing hotels, and then I return both DataFrames. One important thing I learned the hard way: when a function returns two items, I need to set two variables equal to the function call, or it will not behave as expected, and you can lose hours of time and work before realizing the mistake.

[Image: Creating DataFrames from all of the data collected]
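
As a sketch, that assembly step looks something like this (build_dataframes is a hypothetical name for my function):

```python
def build_dataframes(descriptions, addresses, geotags, missing_hotels):
    # Each list of dictionaries becomes its own DataFrame
    desc_df = pd.DataFrame(descriptions)
    addr_df = pd.DataFrame(addresses)
    geo_df = pd.DataFrame(geotags)

    # Merge two at a time on the "name" column, then drop duplicates just in case
    hotels_df = desc_df.merge(addr_df, on="name").merge(geo_df, on="name")
    hotels_df = hotels_df.drop_duplicates(subset="name")

    missing_df = pd.DataFrame(missing_hotels).drop_duplicates()
    return hotels_df, missing_df
```
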
[Image: Since the function returns two items, I need to set two items equal to the function]
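
In practice, that means unpacking the call into two variables, like so:

```python
# Assign both return values; a single variable would silently hold a tuple
hotels_df, missing_df = build_dataframes(descriptions, addresses,
                                         geotags, missing_hotels)
```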

I know this may have been difficult to follow; that, however, is indicative of the experience. As mentioned near the beginning, this process is unique to each website and each project. Best of luck to anyone working on a web scraping project for the first time. It will be challenging, but I hope you will find it as rewarding as I did. I will also be writing a bonus article about my use of Selenium on a different hotel company's website.
