Web Scraping Using Beautiful Soup and Selenium for Dynamic Pages

rahul nayak
Feb 16 · 6 min read

Web Scraping

Web scraping can be defined as:

“the construction of an agent to download, parse, and organize data from the web in an automated manner.”

Or in other words: instead of a human end-user clicking away in their web browser and copy-pasting interesting parts into, say, a spreadsheet, web scraping offloads this task to a computer program which can execute it much faster, and more correctly, than a human can.

Web scraping is an essential skill in the data science field, where much of the raw data has to be collected from the web first.

Why is Python a suitable language to use for Web Scraping?

Python has one of the most mature and well-supported ecosystems for web scraping. Many languages have libraries that help with scraping, but Python's libraries cover the whole workflow, from fetching pages to parsing them and automating a browser.

Some Python libraries for web scraping:

  • Beautiful Soup
  • Scrapy
  • Requests
  • lxml
  • Selenium

In this guide, we will use Beautiful Soup and Selenium to scrape one of the review pages of Trip Advisor.

Why Selenium? Isn’t Beautiful Soup enough?

Web scraping with Python often requires nothing more than Beautiful Soup. Beautiful Soup is a powerful library that makes it easier to scrape a page by traversing its DOM (document object model). But it only does static scraping. Static scraping ignores JavaScript: it fetches web pages from the server without the help of a browser. You get exactly what you see in “view page source”, and then you slice and dice it. If the data you are looking for is available in “view page source”, you don’t need to go any further.

But if you need data in components that are rendered only after clicking JavaScript links, dynamic scraping comes to the rescue. The combination of Beautiful Soup and Selenium does the job. Selenium automates web browser interaction from Python. The data rendered by JavaScript links can therefore be made available by automating the button clicks with Selenium, and then extracted with Beautiful Soup.
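The static limitation can be seen in a small sketch. The HTML snippet below is made up for illustration; it only assumes a page that ships a truncated review and relies on JavaScript to fill in the rest.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the full review text would only be injected
# by the script after a click, so it is not present in the static markup.
STATIC_HTML = """
<div class="review">
  <p class="partial">Great flight, the crew was...</p>
  <span class="more">More</span>
</div>
<script>
  // In a real browser this would run on click and add the full text.
  function expand() { /* fetch and insert the rest of the review */ }
</script>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# Beautiful Soup parses the markup but never executes the script,
# so only the truncated text is available.
partial = soup.find("p", class_="partial").get_text(strip=True)
full = soup.find("p", class_="full")  # hypothetical element added by JavaScript

print(partial)
print(full)
```

The second lookup finds nothing, because the element it asks for would only exist after a browser had run the script.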


Installation

pip install beautifulsoup4 selenium

Selenium for JavaScript link buttons

First, we will use Selenium to automate the button clicks required to render hidden but useful data. On the Trip Advisor review page, longer reviews are only partially available in the initial DOM. They become fully available only after clicking the “More” button. So, we will automate the clicking of all “More” buttons with Selenium.

For Selenium to work, it must access the browser driver.

In this guide, Selenium accesses the Chrome browser driver in incognito mode and without actually opening a browser window (the headless argument).

Get Trip Advisor review page and click relevant buttons

The Selenium web driver traverses the DOM of the Trip Advisor review page and finds all “More” buttons. It then iterates through them and clicks each one. Once the “More” buttons have been clicked, the reviews that were only partially available before become fully available.

After this, Selenium hands off the manipulated page source to Beautiful Soup.

Beautiful Soup for extracting data

The page source received from Selenium now contains full reviews.

Beautiful Soup loads the page source and extracts the review texts by iterating through all review divs. The extraction logic is specific to the review page of Trip Advisor and will vary with the HTML structure of the page you scrape. For future use, you can write the extracted reviews to a file.
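A hedged sketch of this extraction step is shown below. The class names are placeholders rather than Trip Advisor's actual markup, and the helper names are my own; inspect the page you are scraping to find the real selectors.

```python
from typing import List

from bs4 import BeautifulSoup

# Hypothetical class names; the real ones depend on the page's HTML.
REVIEW_CONTAINER_CLASS = "review-container"
REVIEW_TEXT_CLASS = "partial_entry"


def extract_reviews(page_source: str) -> List[str]:
    """Return the text of every review found in the page source."""
    soup = BeautifulSoup(page_source, "html.parser")
    reviews = []
    for container in soup.find_all("div", class_=REVIEW_CONTAINER_CLASS):
        text_node = container.find("p", class_=REVIEW_TEXT_CLASS)
        if text_node is not None:
            reviews.append(text_node.get_text(strip=True))
    return reviews


def save_reviews(reviews: List[str], path: str) -> None:
    """Write one review per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for review in reviews:
            handle.write(review + "\n")
```

Keeping the selectors in constants at the top makes them easy to update when the site's markup changes.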

Practical

I scraped one page of Trip Advisor reviews, extracted the reviews and wrote them to a file.

The following are the reviews I extracted from one of the Trip Advisor pages.

JOKE of an airline. You act like you have such low fares, then turn around and charge people for EVERYTHING you could possibly think of. $65 for carry on, a joke. No seating assignments without an upcharge for newlyweds, a joke. Charge a veteran for a carry on, a f***ing joke. Personally, I will never fly spirit again, and I’ll gladly tell everyone I know the kind of company this airline is. No room, no amenities, nothing. A bunch of penny pinchers, who could give two sh**s about the customers. Take my flight miles and shove them, I won’t be using them with this pathetic a** airline again.
My first travel experience with NK. Checked in on the mobile app and printed the boarding pass at the airport kiosk. My fare was $30.29 for a confirmed ticket. I declined all the extras as I would when renting a car. No, no, no and no. My small backpack passed the free item test as a personal item. I was a bit thirsty so I purchased a cold bottle of water in flight for $3.00 but I brought my own snacks. The plane pushed off the gate in Las Vegas on time and arrived in Dallas early. Overall an excellent flight.
Original flight was at 3:53pm and now the most recent time in 9:28pm. Have waisted an entire day on the airport. Worst airline. I have had the same thing happen in the past were it feels like the are trying to combine two flights to make more money. If I would have know it would have taken this long I would have booked a different airline without a doubt.
Made a bad weather flight great. Bumpy weather but they got the beverage and snack service done in style
Flew Spirit January 23rd and January 26th (flights 1672 from MCO to CMH and 1673 CMH to MCO). IF you plan accordingly you will have a good flight. We made sure our bag was correct, and checked in online. I do think the fees are ridiculous and aren't needed. $10 to check in at the terminal? Really.. That's dumb in my opinion. Frontier does not do that, and they are a no frill airline (pay for extras). I will say the crew members were very nice, and there was decent leg room. We had the Airbus A320. Not sure if I'd fly again because I prefer Frontier Airlines, but Spirit wasn't bad for a quick flight. If you get the right price on it, I would recommend it... just prepare accordingly, and get your bags early. Print your boarding pass at home!
worst flight i have ever been on. the rear cabin flight attendents were the worst i have sever seen. rude, no help. the seats are the most cramped i have every seen. i looked up the seat pitch is the smallest in the airline industry. 28" delta and most other arilines are 32" plus. maybe ok for a short hop but not for a 3 or 4 hour flight no free water or anything. a manwas trying to get settle in with his kids and asked the male flight attendent for some help with luggage in the overhead andthe male flight attendent just said put your bags in the bin and offered no assitance. my son got up and help the manget the kidscarryons put away
I was told incorrect information by the flight counter representative which costed me over $450 i did not have. I spoke with numerous customer service reps who were all very rude and unhelpful. It is not fair for the customer to have to pay the price for being told incorrect information.
We got a great price on this flight. Unfortunately, we were going on a cruise and had to take luggage. By the time we added our luggage and seats the price more than doubled.
Fun crew. Very friendly and happy--from the tag your bag kiosk to the ticket desk to the flight crew--everyone was exceptionally happy to help and friendly. We find this to be true of the many Spirit flights we've taken.
Not impressed with the Spirit check-in staff at either airport. Very rude and just not inviting. The seats were very comfortable and roomy on my first flight in the exit row. On the way back there was very little cushion and narrow seats. The flight attendants and pilots were respectful, direct, and welcoming. Overall would fly Spirit again, but please improve airport staff at check-in.

Conclusion

Beautiful Soup is a very powerful tool for web scraping. But when JavaScript kicks in and hides content, Selenium combined with Beautiful Soup gets the job done. Selenium can also be used to navigate to the next page. You can use Scrapy or another scraping tool instead of Beautiful Soup. Finally, once the data is collected, you can feed it into your data science work.

Y Media Labs Innovation

Engineering blog showcasing some innovation and creativity

Written by rahul nayak

nayakrahul.github.io
