As promised, here is part 2 of my web scraping story. Up to this point, my experience with web scraping had been limited to using APIs and wrappers to collect what I wanted from different websites. On this adventure, however, I found that these tools were not available. As I approached the end of my Data Science cohort, I began to work on my capstone project. I settled on creating a hotel recommendation system that would take user-generated text input and compare it against the text descriptions on hotel websites to generate recommendations. For my project I scraped over 21,000 hotel websites from some of the largest hotel companies in the world. In this article I will describe my experience with one of those companies, using the Chrome web browser and its developer tools, and BeautifulSoup with requests. I had originally planned to cover my use of Selenium in this article as well; however, due to the length of this one, I will cover that in a bonus article.
The biggest takeaway from this project is that web scraping code is unique to the project it is written for. It takes trial and error, time, and energy to inspect the pages and get the code right. However, the time I put into it is nothing compared to the time it would have taken me to manually collect the data for 5,200 hotel properties.
At the outset of my project I found that there were no APIs, tools, or datasets readily available; nor was I able to scrape aggregate websites like TripAdvisor, where all the hotels are listed together. So I would begin from scratch. Our lessons on writing web scraping code consisted of scraping data from a static website with not much formatting; however, my first attempt at a real application was a hotel company whose site did not follow this format. I played with my first web scraping code for days, trying several different ways of grabbing the right tags. It was a frustrating experience, to say the least, but I stuck with it. Ultimately I was able to write code that takes in a CSV file listing the cities in a state. Iterating through the list, each city is concatenated with a base URL to create the URL of that city's page on the hotel website, which is then requested. Once on this page, BeautifulSoup parses all of the HTML on the page. The next step is to create an object out of the soup so that I can collect the name and URL for every hotel in that city. Using the href attribute of each link, I can create a dictionary for each hotel, append the dictionary to a list of hotels, and ultimately create a DataFrame of hotel names and URLs. TA-DA! Now I have successfully collected all the information I need... to begin collecting all the other information I really need.
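To make the city-page step concrete, here is a minimal sketch of how it might look. It is not the code I actually ran: the base URL, the URL slug format, and the "hotel-link" class are made-up placeholders, and the page is an inline HTML string so the example runs without a network connection. In a real run the HTML would come from something like requests.get(page).text.

```python
from urllib.parse import quote, urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL; the real hotel site is not named in this article.
BASE_URL = "https://www.example-hotels.com/locations/"


def city_url(city, state):
    """Build a city page URL, assuming a 'san-diego-ca' style slug."""
    slug = f"{city.strip().lower().replace(' ', '-')}-{state.lower()}"
    return urljoin(BASE_URL, quote(slug))


def hotel_links(html, page_url):
    """Return one {'name', 'url'} dictionary per hotel link on a city page."""
    soup = BeautifulSoup(html, "html.parser")
    hotels = []
    # 'hotel-link' is an assumed class name; inspect the real page to find yours.
    for anchor in soup.find_all("a", class_="hotel-link", href=True):
        hotels.append({
            "name": anchor.get_text(strip=True),
            # Resolve relative hrefs against the page they were found on.
            "url": urljoin(page_url, anchor["href"]),
        })
    return hotels


# Stand-in for the HTML of a city page.
sample_html = """
<ul>
  <li><a class="hotel-link" href="/hotels/harbor-inn">Harbor Inn</a></li>
  <li><a class="hotel-link" href="/hotels/city-suites">City Suites</a></li>
  <li><a href="/about">About us</a></li>
</ul>
"""

page = city_url("San Diego", "CA")
hotels = hotel_links(sample_html, page)
```

The resulting list of dictionaries can be passed straight to pandas.DataFrame to get the table of hotel names and URLs.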
Now that I have a DataFrame of hotel names and, more importantly, their URLs, I can iterate through the URLs and use BeautifulSoup on each hotel website to collect the information I need. I decided to collect the text description, the address, and the geo-coordinates for each hotel, so I create an empty list for each of these. I also create an empty list to hold the hotels that my code does not work on, so that I can try them again later.
I will preface the remainder of the article with this: after getting my code exactly how I wanted it and running it on the entire state, I found that the page I had used as a template for my code was only one version of the website used by this particular hotel company, and my code therefore failed to collect many hotels. With this information I rewrote my code to use a set of try/except statements, going through the trial and error phase again. The first website template is handled in the try portion, and the second website template is handled in the except portion, with a new try/except set nested inside that except statement. Ultimately, if the scraping for the second website template is not successful either, I collect the hotel name and URL in a list of missing hotels.
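The nested try/except structure described above can be sketched as follows. This is only an illustration of the control flow under my own assumptions: the two parser functions are hypothetical, and plain dictionaries stand in for parsed pages so the example stays short and runnable.

```python
def parse_template_a(page):
    """Hypothetical parser for the first template: second paragraph of 'main'."""
    return {"description": page["main"]["p"][1]}


def parse_template_b(page):
    """Hypothetical parser for the second template: join all text sections."""
    return {"description": " ".join(page["sections"])}


def scrape_hotel(name, url, page, missing):
    """Try template A, fall back to template B, else record the hotel as missing."""
    try:
        record = parse_template_a(page)
    except (KeyError, IndexError):
        try:
            record = parse_template_b(page)
        except (KeyError, IndexError):
            # Neither template matched: keep the hotel so it can be retried later.
            missing.append({"name": name, "url": url})
            return None
    record["name"] = name
    return record


missing_hotels = []
page_a = {"main": {"p": ["Book now", "A quiet inn by the harbor."]}}
page_b = {"sections": ["Rooms with a view.", "Free breakfast."]}
page_c = {}  # matches neither template

result_a = scrape_hotel("Harbor Inn", "https://example.com/a", page_a, missing_hotels)
result_b = scrape_hotel("City Suites", "https://example.com/b", page_b, missing_hotels)
result_c = scrape_hotel("Mystery Lodge", "https://example.com/c", page_c, missing_hotels)
```

Catching only the specific exceptions a failed lookup raises, rather than using a bare except, keeps genuine bugs from being silently filed under "missing hotels".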
I iterate through each row of the DataFrame and use a set of try/except statements to collect the features listed above. As with the previous step, BeautifulSoup parses each website and then I instruct it to create objects that gather the specific information I want. This requires studying the HTML of the website and finding the elements that contain the content I am looking for. In this instance, I create a BeautifulSoup object called “hotel_soup”. Then, to collect the description, I create an object called “item” from the “main” element, search it for the second paragraph element, indicated by ‘p’, and collect its text. I create an empty dictionary called “items” to save the description, and include a key/value pair for the hotel name as well, to be used when merging DataFrames later. Below is a snippet of the code to collect the description. I follow the same steps for the address and geocodes portions of the code.
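For readers who want something they can run, here is a small sketch of the description step for the first template. The names “hotel_soup”, “item”, and “items” follow the text above; the sample HTML and the function name are my own assumptions for illustration.

```python
from bs4 import BeautifulSoup


def first_template_description(html, hotel_name):
    """Collect the second paragraph inside <main> as the hotel description."""
    hotel_soup = BeautifulSoup(html, "html.parser")
    item = hotel_soup.find("main")
    # Raises AttributeError if <main> is absent and IndexError if there are
    # fewer than two paragraphs; the surrounding try/except relies on that.
    paragraphs = item.find_all("p")
    items = {
        "name": hotel_name,  # kept so the DataFrames can be merged on it later
        "description": paragraphs[1].get_text(strip=True),
    }
    return items


# Stand-in for a hotel page built on the first template.
sample_html = """
<main>
  <p>Book now</p>
  <p> A quiet inn by the harbor. </p>
</main>
"""

description = first_template_description(sample_html, "Harbor Inn")
```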
As mentioned above, I created a second set of try/except statements inside the except portion of the first set. The code for this portion is considerably more complex, as you will see in the screenshot below. As with my first attempt, I create a BeautifulSoup object called “hotel_soup”, and an “item2” from the soup at the specific point where I want to collect the pieces of the description. Again I create an empty dictionary called “items”; however, this particular website template has an extensive amount of text description, so I iterate through the object I created and append each piece to an empty list called “descriptions1”. I then iterate through this list and remove the items that are less than 3 characters long. Lastly I clean the text of any HTML remnants and append it to a new list called “final_description”. I create key/value pairs for the description and the hotel name in the “items” dictionary and append the dictionary to the descriptions list before moving on to the address and geocodes portion. I follow the same steps for the address, and create an empty dictionary for the geocode, as it is not available on this website template. As promised, below is the code. As you can see, compared to the first website template this code is much more complex, and it also cleans up the strings while scraping. To finish, I close out the second set of try/except statements with a dictionary of just the hotel name and hotel URL, to be appended to a list of missing hotels.
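Here is a hedged sketch of the second-template steps: collect every fragment, drop the very short ones, then clean what is left. The list names follow the text above, but the “overview” class, the choice of ‘p’ and ‘li’ elements, and the whitespace cleanup are assumptions of mine rather than the exact code I used.

```python
import re

from bs4 import BeautifulSoup


def second_template_description(html, hotel_name):
    """Join the cleaned text fragments of a longer, multi-part description."""
    hotel_soup = BeautifulSoup(html, "html.parser")
    # 'overview' is an assumed class name for the description container.
    item2 = hotel_soup.find("div", class_="overview")

    # Collect the text of every paragraph and list item.
    descriptions1 = [
        element.get_text(" ", strip=True)
        for element in item2.find_all(["p", "li"])
    ]

    final_description = []
    for text in descriptions1:
        if len(text) < 3:
            continue  # drop empty strings and stray separators
        # Collapse runs of whitespace; \s also matches non-breaking spaces.
        final_description.append(re.sub(r"\s+", " ", text).strip())

    return {"name": hotel_name, "description": " ".join(final_description)}


# Stand-in for a hotel page built on the second template.
sample_html = """
<div class="overview">
  <p>Rooms with&nbsp;a view.</p>
  <p>-</p>
  <ul><li>Free   breakfast</li><li></li></ul>
</div>
"""

description = second_template_description(sample_html, "City Suites")
```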
Now that I have collected all the pieces of information, the last step is to convert the lists of dictionaries into DataFrames that I can use for my project. In this case, I have four lists: Descriptions, Addresses, Geotags, and Missing Hotels. I pass each of the first three lists to its own pandas DataFrame, merge the first two, and then merge the third into the result to make a final DataFrame called “hotels_df”. I merge on the column “name”. Lastly I perform a drop_duplicates just in case. I follow the same steps for the missing hotels, and then I return both DataFrames. One important note that I learned the hard way: when a function returns two items, I need to assign the function call to two variables. Otherwise both DataFrames come back packed together in a single tuple, and you could lose hours of time and work before realizing the mistake.
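The merge and the two-value return can be sketched like this. The sample rows are invented, every dictionary is assumed to carry a “name” key, and how="left" is my assumption here so that a hotel without geotags is kept rather than dropped; pandas defaults to an inner merge if you leave it out.

```python
import pandas as pd


def build_frames(descriptions, addresses, geotags, missing):
    """Merge the three feature lists on 'name' and return two DataFrames."""
    hotels_df = (
        pd.DataFrame(descriptions)
        .merge(pd.DataFrame(addresses), on="name", how="left")
        .merge(pd.DataFrame(geotags), on="name", how="left")
        .drop_duplicates()
        .reset_index(drop=True)
    )
    missing_df = pd.DataFrame(missing).drop_duplicates().reset_index(drop=True)
    return hotels_df, missing_df


# Invented sample data; Harbor Inn appears twice to show drop_duplicates.
descriptions = [
    {"name": "Harbor Inn", "description": "A quiet inn by the harbor."},
    {"name": "City Suites", "description": "Rooms with a view."},
    {"name": "Harbor Inn", "description": "A quiet inn by the harbor."},
]
addresses = [
    {"name": "Harbor Inn", "address": "1 Dock St"},
    {"name": "City Suites", "address": "200 Main St"},
]
geotags = [{"name": "Harbor Inn", "lat": 32.7, "lon": -117.2}]
missing = [{"name": "Mystery Lodge", "url": "https://example.com/c"}]

# Correct: one variable per returned DataFrame.
hotels_df, missing_df = build_frames(descriptions, addresses, geotags, missing)

# The pitfall: a single variable receives a tuple holding both DataFrames.
both = build_frames(descriptions, addresses, geotags, missing)
```

With the single-variable form, calls like both.head() fail because a tuple has no such method, which is the quickest way to spot the mistake.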
I know this may have been difficult to follow; however, that is indicative of the experience. As mentioned near the beginning, this is a process that is unique to each website and each project. Best of luck to anyone who is working on a web scraping project for the first time. It will be challenging, but I hope you will find it as rewarding as I did upon completion. I will also be writing a bonus article about my use of Selenium on a different hotel company's website.