Better web scraping in Python with Selenium, Beautiful Soup, and pandas

Dave Gray
Apr 16, 2018

Web Scraping

Using the Python programming language, it is possible to “scrape” data from the web quickly and efficiently.

Web scraping is defined as:

a tool for turning the unstructured data on the web into machine readable, structured data which is ready for analysis. (source)

Web scraping is a valuable tool in the data scientist’s skill set.

Now, what to scrape?

“Search drill down options” == Keep clicking until you find what you want.

Publicly Available Data

The KanView website’s slogan is “Transparency in Government”, and it lives up to it: the site provides payroll data for the State of Kansas. And that’s great!

Yet, like many government websites, it buries the data in drill-down links and tables. Finding the specific data you are looking for often requires “best guess” navigation. For a research project, I wanted to use the public payroll data provided for the universities within Kansas. Scraping the data with Python and saving it as JSON was the first step.

JavaScript links increase the complexity

Web scraping with Python often requires nothing more than the Beautiful Soup module. Beautiful Soup is a popular Python library that makes traversing the DOM (document object model) and extracting the data you need much easier.

However, the KanView website uses JavaScript links. Examples using only Python and Beautiful Soup will therefore not work here without something extra.

https://pypi.python.org/pypi/selenium

Selenium to the rescue

The Selenium package is used to automate web browser interaction from Python. With a script driving the browser, those pesky JavaScript links are no longer an issue.

Web scraping the KanView data requires Selenium, Beautiful Soup, regular expressions (the re module), pandas, and the os module.
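A minimal version of those imports, using the current Selenium API (method names in 2018-era Selenium differ slightly):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup
    import re
    import pandas as pd
    import os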

Selenium will now start a browser session. For Selenium to work, it must access the browser driver. By default, it looks in the same directory as the Python script. Links to the Chrome, Firefox, Edge, and Safari drivers are available on the Selenium PyPI page linked above. The example code below uses Firefox; note that the start URL and element id are shown for illustration only and must be read from the actual page:
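    # create a new Firefox session; Selenium looks for the browser
    # driver in the script's directory (or on the PATH)
    driver = webdriver.Firefox()
    driver.implicitly_wait(30)

    # navigate to the agency drill-down page
    # (the URL here is illustrative -- start from the KanView site)
    driver.get('http://kanview.ks.gov/PayRates/PayRates_Agency.aspx')

    # find one agency's JavaScript link and click it
    # (the element id is a placeholder -- inspect the page source for the real one)
    python_button = driver.find_element(By.ID, 'MainContent_uxLevel1_Agencies_uxAgencyBtn_0')
    python_button.click()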

This example uses the Firefox browser driver.

The python_button.click() above tells Selenium to click the JavaScript link on the page. After arriving at the Job Titles page, Selenium hands the page source off to Beautiful Soup.

https://www.crummy.com/software/BeautifulSoup/

Transitioning to Beautiful Soup

Beautiful Soup remains the best way to traverse the DOM and scrape the data. After defining an empty list and a counter variable, it is time to ask Beautiful Soup to grab all the links on the page that match a regular expression.
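A sketch of that step, continuing from the session above (the id pattern in the regular expression is a placeholder for whatever the real job title links use):

    # Selenium hands the page source to Beautiful Soup
    soup_level1 = BeautifulSoup(driver.page_source, 'lxml')

    datalist = []  # will hold one DataFrame per job title
    x = 0          # counter used to rebuild each link's element id

    # grab every job title link whose id matches the pattern
    job_links = soup_level1.find_all(
        'a', id=re.compile('^MainContent_uxLevel2_JobTitles_uxJobTitleBtn_'))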

You can see from the example above that Beautiful Soup will retrieve a JavaScript link for each job title at the state agency. Now in the code block of the for / in loop, Selenium will click each JavaScript link. Beautiful Soup will then retrieve the table from each page.
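A sketch of that part of the loop body (the element id pattern is still a placeholder):

    for link in job_links:
        # rebuild the link's id from the counter and have Selenium click it;
        # the Beautiful Soup tag itself is not clickable
        python_button = driver.find_element(
            By.ID, 'MainContent_uxLevel2_JobTitles_uxJobTitleBtn_' + str(x))
        python_button.click()

        # hand the job title page to Beautiful Soup
        soup_level2 = BeautifulSoup(driver.page_source, 'lxml')

        # grab the first table on the page
        table = soup_level2.find_all('table')[0]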

https://pandas.pydata.org/

pandas: Python Data Analysis Library

Beautiful Soup passes the findings to pandas. pandas uses its read_html function to read the HTML table data into a DataFrame (read_html returns a list of DataFrames, one per table it finds). Each DataFrame is appended to the previously defined empty list.
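Continuing inside the loop body, that step looks like this:

        # read_html returns a list of DataFrames, one per table in the
        # HTML it is given; keep the first (and only) one
        df = pd.read_html(str(table), header=0)
        datalist.append(df[0])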

Before the loop body is complete, Selenium needs to click the back button in the browser, so that the next link is available to click on the job listing page.
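Still inside the loop body:

        # go back to the job listing page and bump the counter so the
        # next link's element id can be rebuilt
        driver.back()
        x += 1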

When the for / in loop has completed, Selenium has visited every job title link. Beautiful Soup has retrieved the table from each page. pandas has stored the data from each table in a DataFrame, and each DataFrame is an item in datalist. The individual table DataFrames must now be merged into one large DataFrame. The data will then be converted to JSON format with pandas.DataFrame.to_json.
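A sketch of that final step (the output filename is a placeholder):

    # after the loop: merge the per-page DataFrames into one
    result = pd.concat(datalist, ignore_index=True)

    # convert the merged DataFrame to JSON and write it next to the script
    json_records = result.to_json(orient='records')
    with open(os.path.join(os.getcwd(), 'payroll_data.json'), 'w') as f:
        f.write(json_records)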

Now Python creates the JSON data file. It is ready for use!

The automated process is fast

The automated web scraping process described above completes quickly. Selenium opens a browser window you can watch working, which let me capture a screen recording of how fast the process is. You can see how fast the script follows a link, grabs the data, goes back, and clicks the next link. Retrieving the data from hundreds of links becomes a matter of single-digit minutes.

The full Python code

Here is the full Python code: it is simply the snippets above, assembled in order. I have also included an import for tabulate. It requires one extra line of code that uses tabulate to pretty print the data to your command line interface.
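A sketch of that addition (headers='keys' and the psql table format are one choice among several that tabulate supports):

    from tabulate import tabulate

    # pretty print the merged DataFrame to the terminal
    print(tabulate(result, headers='keys', tablefmt='psql'))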

Photo by Artem Sapegin on Unsplash

Conclusion

Web scraping with Python and Beautiful Soup is an excellent tool to have within your skill set. Use web scraping when the data you need to work with is available to the public, but not necessarily conveniently available. When JavaScript provides or “hides” content, browser automation with Selenium will ensure your code “sees” what you (as a user) should see. And finally, when you are scraping tables full of data, pandas is the Python data analysis library that will handle it all.

Reference:

The following article was a helpful reference for this project:

https://pythonprogramminglanguage.com/web-scraping-with-pandas-and-beautifulsoup/
