Top Five Companies That Make You Rich

Kexin Zhai
6 min read · Feb 11, 2019


A tutorial for web scraping

The main purpose of scraping data from a website is to convert data from HTML into a tabular dataset (columns and rows). This tutorial uses Python as the coding language and the User Experience Researcher salary page on indeed.com as the website to scrape. The code is written in a Jupyter notebook.

Import Libraries

First, we need to import the libraries we will use in this notebook.

Importing the libraries we need

Write your own parser

Before writing a parser, we should create a model of what we want our data to look like. The final dataset may look different, depending on the analysis and our needs.

Model of dataset
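As a rough sketch of such a model, the record below lists the four fields this tutorial goes on to collect (company, salary, unit, state). The class name `SalaryRecord` and the field names are assumptions chosen for illustration, not the names used in the original notebook.

```python
from dataclasses import dataclass, asdict


@dataclass
class SalaryRecord:
    """One row of the target dataset (hypothetical field names)."""
    company: str   # company name, without the job title
    salary: float  # numeric amount, without the '$' symbol
    unit: str      # 'per hour', 'per day' or 'per year'
    state: str     # state the salary page belongs to


# Placeholder values, only to show the shape of one row.
example = SalaryRecord(company="Example Co", salary=50.0,
                       unit="per hour", state="California")
columns = list(asdict(example).keys())
print(columns)
```

Writing the model down first makes it easier to check later that every scraped row has the same columns.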

We will start by writing a parser for one state, then scrape the data for the other states. Finally, we will combine all the data into one big dataset.

Let's start with a parser for a single state. In this case, we will pick California.

Start by using requests to get the page, then use BeautifulSoup to turn it into a soup we can parse.
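A minimal sketch of this step might look like the following. The helper names `fetch_html` and `make_soup` are hypothetical, and the parser is demonstrated on a tiny inline HTML string so that nothing is requested from a real server.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def make_soup(html):
    """Turn raw HTML into a BeautifulSoup object we can search."""
    return BeautifulSoup(html, "html.parser")


# Demonstrated on an inline string instead of a live request.
sample_html = "<html><body><h1>Salaries</h1></body></html>"
soup = make_soup(sample_html)
print(soup.h1.get_text())
```

In a notebook, `make_soup(fetch_html(url))` would combine the two steps for a real salary page URL.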

Go to the website and right-click. You will see the Inspect tool (Figure 1). Click on it, and the panel in Figure 2 will open on the right side of the screen. As you select elements, the corresponding data on the page is highlighted.

Figure 1
Figure 2

Using the Inspect tool, we find that <table id="cmp-sal-company-table"> is the tag that holds the data we want. Use

find_all('table', {'id': 'cmp-sal-company-table', 'class': 'cmp-TitleSalary-table'})

to get all of these salary groups.

Figure 3

Next, <tr class="cmp-salary-aggregate-table-entry" data-tn-component="salary-entry[]"> is the tag that holds each individual company's salary information.

Figure 4

Using len(), we can see that we grabbed 10 salary groups, which matches the 10 companies listed on that page. Inspect the first and last ones to make sure they look consistent.
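The lookup can be sketched on a small hand-written HTML fragment. The fragment below is an assumption that reuses the tag names, id and classes quoted above; the real page markup, including what each <span> contains, may differ, and the company names are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the tags quoted in the text.
sample_html = """
<table id="cmp-sal-company-table" class="cmp-TitleSalary-table">
  <tr class="cmp-salary-aggregate-table-entry" data-tn-component="salary-entry[]">
    <td><span>Company A User Experience Researcher</span></td>
    <td><span class="cmp-salary-amount">$50.00 per hour</span></td>
  </tr>
  <tr class="cmp-salary-aggregate-table-entry" data-tn-component="salary-entry[]">
    <td><span>Company B User Experience Researcher</span></td>
    <td><span class="cmp-salary-amount">$120,000 per year</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Find the salary table by id, then every salary row inside it.
tables = soup.find_all("table", {"id": "cmp-sal-company-table"})
rows = tables[0].find_all("tr", {"class": "cmp-salary-aggregate-table-entry"})
print(len(rows))

# Look at the first and last rows to check they have the same structure.
first, last = rows[0], rows[-1]
print(first.find("span").get_text())
print(last.find("span").get_text())
```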

Inspect the first and last ones
First company on the page
Last company on the page

Navigating the HTML tree to find more specific parent elements

We will zoom in to find more detailed information. The <span> tag is the most promising one, since it can give us the company name and the salary amount.

We notice that each group's title consists of a company name followed by the position name (User Experience Researcher). Since the last three words are the same for every group, we only need to remove them to get the company name.

Figure 5
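Dropping the last three words can be sketched with plain string operations. The function name `company_from_title` and the sample title are illustrative assumptions.

```python
def company_from_title(title, position_words=3):
    """Return the title without its trailing job-title words.

    Assumes the title ends with exactly `position_words` words of job
    title, e.g. 'User Experience Researcher'.
    """
    words = title.split()
    return " ".join(words[:-position_words])


name = company_from_title("Example Company User Experience Researcher")
print(name)
```

This relies on the job title always being three words long, so it would need adjusting for a different position name.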

We grab the salary data the same way. We know the currency is the dollar and we only need the number, so we can remove the '$' symbol.

Figure 6

We also need to grab the unit as part of our dataset, because not every salary is 'per hour'; some are 'per year' or 'per day'. Find all <span class="cmp-salary-amount"> tags to get the salary unit.
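One way to split a salary string into a number and a unit is a small regular expression. This sketch assumes the text looks like '$45.50 per hour' or '$120,000 per year'; the exact text inside the real tags is not shown in this tutorial, so the pattern and the name `parse_salary` are assumptions.

```python
import re

# Optional '$', a number with optional thousands separators and
# decimals, then a unit such as 'per hour'.
SALARY_PATTERN = re.compile(r"\$?\s*([\d,]+(?:\.\d+)?)\s*(per\s+\w+)")


def parse_salary(text):
    """Return (amount, unit) parsed from a salary string."""
    match = SALARY_PATTERN.search(text)
    if match is None:
        raise ValueError(f"Unrecognized salary text: {text!r}")
    amount = float(match.group(1).replace(",", ""))
    unit = " ".join(match.group(2).split())  # normalize whitespace
    return amount, unit


print(parse_salary("$45.50 per hour"))
print(parse_salary("$120,000 per year"))
```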

Write a functional parser

We will append all the data to info_CA and turn it into a DataFrame.

Everything looks good. Now we can write a function that can be applied to all of the states in the next step.

Here, we will include a sleep between requests so that we do not overwhelm the Indeed server. We can use the sleep function from the time module. To apply our function to each state, we need the URLs for the different states. This website only has User Experience Researcher salary information for 14 states, so we will create a list of all the available state names.
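A sketch of this step with placeholder rows is shown below. The list name `info_CA` comes from the text, while the column names and the values in `scraped` are made up for illustration.

```python
import pandas as pd

info_CA = []

# Placeholder values standing in for what the parser extracts per row.
scraped = [
    ("Company A", 50.0, "per hour"),
    ("Company B", 120000.0, "per year"),
]

for company, salary, unit in scraped:
    info_CA.append({
        "company": company,
        "salary": salary,
        "unit": unit,
        "state": "California",
    })

# A list of dictionaries becomes one DataFrame row per dictionary.
df_CA = pd.DataFrame(info_CA)
print(df_CA.shape)
```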

We have two lists of state names for our parser function: one for building the URLs, the other for the value entered in the dataset.

Different lists of state names
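The loop over states, with a pause between requests, can be sketched as follows. `URL_TEMPLATE` is a placeholder and not Indeed's real URL scheme, the two short lists stand in for the 14 states, and `fake_parse_state` is a stub that replaces the real parser so that no request is sent.

```python
import time

# Placeholder URL pattern; the real salary-page URLs are an assumption here.
URL_TEMPLATE = "https://example.com/salaries/{slug}"


def scrape_states(url_names, display_names, parse_state, delay_seconds=1.0):
    """Run `parse_state(url, state)` per state, sleeping between calls."""
    all_rows = []
    for slug, state in zip(url_names, display_names):
        url = URL_TEMPLATE.format(slug=slug)
        all_rows.extend(parse_state(url, state))
        time.sleep(delay_seconds)  # be polite to the server
    return all_rows


def fake_parse_state(url, state):
    """Stub parser returning one made-up row per state."""
    return [{"company": "Company A", "salary": 50.0,
             "unit": "per hour", "state": state, "url": url}]


url_names = ["California", "New-York-State"]   # used to build URLs
display_names = ["California", "New York"]     # stored in the dataset
rows = scrape_states(url_names, display_names, fake_parse_state,
                     delay_seconds=0)
print(len(rows))
```

Passing the parser in as an argument keeps the throttled loop easy to try out without touching the network.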

Using the EDA method

We can see there are three different units for the salary.

We will convert the DataFrame to a CSV file and use Excel's calculation functions to convert every salary to a ‘per year’ unit. Here, we assume “a typical 40 hours work week by 52 weeks,” which “makes 2080 hours in a typical work year.”

Convert DataFrame to CSV file.
Data after converting the unit
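The same conversion could also be done in Python instead of Excel. The sketch below uses the 2080-hour year quoted above; the 260-day year (5 days by 52 weeks) for ‘per day’ salaries is an added assumption that follows from the same 40-hour week, and the name `to_annual` is hypothetical.

```python
HOURS_PER_YEAR = 40 * 52  # 2080 hours, as stated in the text
DAYS_PER_YEAR = 5 * 52    # 260 days; assumption, not from the text

MULTIPLIERS = {
    "per hour": HOURS_PER_YEAR,
    "per day": DAYS_PER_YEAR,
    "per year": 1,
}


def to_annual(salary, unit):
    """Convert a salary in the given unit to a per-year figure."""
    return salary * MULTIPLIERS[unit]


print(to_annual(50.0, "per hour"))
print(to_annual(400.0, "per day"))
print(to_annual(120000.0, "per year"))
```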

Applying EDA method again

Shape of dataset, counts for unit and state

We want to see which state has the highest average salary.

The “User Experience Researcher” job has the highest average salaries in New Jersey, California, and Connecticut (the top three states).
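Ranking states by average salary can be sketched with a pandas groupby. The DataFrame below contains made-up placeholder values with generic state labels, not the scraped results, and the column names are assumptions.

```python
import pandas as pd

# Made-up annual salaries, only to illustrate the calculation.
df = pd.DataFrame({
    "state": ["State A", "State A", "State B"],
    "salary": [100000.0, 120000.0, 90000.0],
})

# Average salary per state, highest first.
state_means = (
    df.groupby("state")["salary"]
      .mean()
      .sort_values(ascending=False)
)
print(state_means)
```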

Zoom in on the salaries in New Jersey. We can see that one company has a much higher salary than the rest.

Last, our main question for this tutorial: the top five companies that will make you rich working as a user experience researcher.
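Picking the five highest-paying companies can be sketched by sorting on the annual salary column. The company names and salaries below are placeholders, not the tutorial's results.

```python
import pandas as pd

# Made-up annual salaries for six placeholder companies.
df = pd.DataFrame({
    "company": ["Company A", "Company B", "Company C",
                "Company D", "Company E", "Company F"],
    "salary": [150000.0, 90000.0, 130000.0,
               110000.0, 170000.0, 95000.0],
})

# Sort from highest to lowest salary and keep the first five rows.
top_five = df.sort_values("salary", ascending=False).head(5)
print(top_five)
```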

Bias

The main purpose of this blog post is to serve as a tutorial, and the analysis in it is biased. First, the website we used does not have salary information for all of the states. Second, for the ‘California’ and ‘New York State’ salary pages, this tutorial only retrieves the first page of data for each state. Third, in the figure ‘Shape of dataset, counts for unit and state’, we can see that the amount of salary information varies across states.

Copyright

Brian C. Keegan, Ph.D.
Assistant Professor, Department of Information Science
University of Colorado Boulder
