Top Five Companies That Make You Rich
A tutorial for web scraping
The main purpose of scraping data from a website is to convert it from HTML into a tabular dataset (columns and rows). This tutorial uses Python as the coding language and the User Experience Researcher salary page on indeed.com as the website we scrape. The coding will be done in a Jupyter notebook.
Import Libraries
First, we need to import the libraries we will use in this notebook.
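A typical import cell for this kind of scraping notebook might look like the following (the exact set of libraries is an assumption based on the steps below):

```python
import requests                    # fetch the raw HTML pages
from bs4 import BeautifulSoup      # parse the HTML tree
import pandas as pd                # build the final table dataset
from time import sleep             # pause between requests later on
```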
Write your own parser
Before we write a parser, the first thing we should do is create a model of what we want our data to look like. It may end up different from our final data, depending on the analysis and our needs.
We will start by writing a parser for one state, then scrape the data for the other states. Finally, we will combine all the data into one big dataset.
Let's start by writing a parser for a single state; in this case, we will pick California.
Start by using requests to get the page, then use BeautifulSoup to turn it into soup we can parse through.
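That step can be sketched as a small helper; the URL itself should be copied from the browser's address bar, so it is left as a comment here:

```python
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """Download a page and return it as parseable soup."""
    page = requests.get(url)                        # fetch the raw HTML
    return BeautifulSoup(page.text, "html.parser")  # parse it into a tree

# In the notebook: soup = get_soup(<the California salary page URL>)
```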
Go to the website and right-click. You will see the Inspect tool (Figure 1). Click on it, and Figure 2 will pop up on the right side of the screen. As you select elements in the inspector, the corresponding data on the website will be highlighted.
Using the inspect tool, we can see that <table id="cmp-sal-company-table"> is the tag for us to grab. Use
find_all('table', {'id': 'cmp-sal-company-table', 'class': 'cmp-TitleSalary-table'})
to get all of these salary groups. (Note that find_all takes a single attrs dictionary, so the id and class filters go in one dict together.)
Next, <tr class="cmp-salary-aggregate-table-entry" data-tn-component="salary-entry[]"> is the tag for getting each individual company's salary information.
Using len(), we can see that we grabbed 10 salary groups, matching the 10 companies' salaries listed on that page. Inspect the first and last ones to make sure they look consistent.
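A minimal sketch of those two find_all calls, run against a tiny inline stand-in for the page (one made-up row instead of the live ten):

```python
from bs4 import BeautifulSoup

# Stand-in snippet using the tag names identified with the inspect tool;
# the company name and amount are hypothetical.
html = """
<table id="cmp-sal-company-table" class="cmp-TitleSalary-table">
  <tr class="cmp-salary-aggregate-table-entry" data-tn-component="salary-entry[]">
    <span class="cmp-salary-title">Google User Experience Researcher</span>
    <span class="cmp-salary-amount">$65.50 per hour</span>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# One attrs dict can filter on id and class at the same time.
tables = soup.find_all("table", {"id": "cmp-sal-company-table",
                                 "class": "cmp-TitleSalary-table"})
rows = soup.find_all("tr", {"class": "cmp-salary-aggregate-table-entry"})
print(len(rows))   # 1 for this snippet; 10 on the live page
```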
Navigating the HTML tree to find more specific parent elements
We will zoom in to find more detailed information. The <span> tag is the most promising one: it contains both the company name and the salary amount.
We notice that each group's title consists of a company name followed by the position name (user experience researcher). Since the last three words are the same for every single group, our code just needs to remove the last three words to get the company name.
The same goes for the salary data. We know the currency is the dollar, and we only need the number, so we can remove the '$' symbol.
In this case, we still need to grab the unit as part of our dataset: not every salary is 'per hour'; some are 'per year' or 'per day'. Find all <span class="cmp-salary-amount"> tags to get the salary unit.
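The three cleaning steps above can be sketched with plain string operations; the title and amount strings here are hypothetical examples of what the <span> tags contain:

```python
# Hypothetical span contents standing in for the scraped text.
title = "Google User Experience Researcher"
amount = "$65.50 per hour"

# The job title is always the last three words, so drop them for the company.
company = " ".join(title.split()[:-3])

# Strip the '$' (and any thousands commas) to get the number;
# whatever follows the number is the unit.
parts = amount.split()
salary = float(parts[0].replace("$", "").replace(",", ""))
unit = " ".join(parts[1:])      # 'per hour', 'per day', or 'per year'

print(company, salary, unit)    # Google 65.5 per hour
```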
Write a functional parser
We will append all of the data to info_CA and turn it into a DataFrame.
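That step might look like the following, with two hypothetical rows standing in for the output of the parsing loop:

```python
import pandas as pd

info_CA = []   # collects one dict per company row

# In the notebook these appends happen inside the parsing loop;
# the values here are made up for illustration.
info_CA.append({"Company": "Google", "Salary": 65.5,
                "Unit": "per hour", "State": "CA"})
info_CA.append({"Company": "Facebook", "Salary": 145000.0,
                "Unit": "per year", "State": "CA"})

df_CA = pd.DataFrame(info_CA)
print(df_CA.shape)   # (2, 4)
```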
Everything looks good. Now we can write a function that we can apply to all of the states in the next step.
Here, we will include a sleep between requests so that the Indeed server is not overwhelmed; we can use the sleep function from the time module. To apply our function to each state, we need a URL for each state. This website only has user experience researcher salary information for 14 states, so we will create a list of all the available state names.
In this case, we have two lists of state names for our parser function: one for building the URLs, and another for the value entered in the dataset.
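A skeleton of that setup, with shortened hypothetical state lists (the real notebook has 14 entries) and a stub body standing in for the fetch-and-parse code:

```python
from time import sleep

# One list feeds the URL, the other fills the State column (both hypothetical).
state_urls = ["CA", "NY", "NJ"]
state_names = ["California", "New York", "New Jersey"]

def parse_state(url_part, state_name):
    # In the notebook this fetches and parses the state's salary page;
    # here a single placeholder row stands in for the parsed results.
    return [{"Company": "Example Co", "Salary": 100000.0,
             "Unit": "per year", "State": state_name}]

all_rows = []
for url_part, name in zip(state_urls, state_names):
    all_rows.extend(parse_state(url_part, name))
    sleep(1)   # pause so we do not overwhelm the Indeed server
```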
Using the EDA method
We can see there are three different units for the salary.
In this case, we will convert the DataFrame to a CSV file and use Excel's calculation functions to convert every salary to a 'per year' unit. Here, we will assume a typical 40-hour work week over 52 weeks, which makes 2,080 hours in a typical work year.
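The tutorial does this conversion in Excel, but the same arithmetic can be sketched directly in pandas, assuming 40 hours x 52 weeks = 2,080 hours per year and 5 days x 52 weeks = 260 working days (the sample salaries below are made up):

```python
import pandas as pd

df = pd.DataFrame({"Salary": [65.5, 600.0, 145000.0],
                   "Unit": ["per hour", "per day", "per year"]})

# Multiplier that turns each unit into a yearly figure.
factor = {"per hour": 2080, "per day": 260, "per year": 1}
df["Yearly"] = df["Salary"] * df["Unit"].map(factor)
print(df["Yearly"].tolist())   # [136240.0, 156000.0, 145000.0]
```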
Applying the EDA method again
We want to see which state has the highest average salary.
The "User Experience Researcher" job has a higher salary in New Jersey, California, and Connecticut (the top 3 states).
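The per-state comparison is a one-line groupby; the yearly salaries here are hypothetical stand-ins for the scraped dataset:

```python
import pandas as pd

# Made-up yearly salaries illustrating the aggregation step.
df = pd.DataFrame({"State": ["NJ", "NJ", "CA", "CT", "CA"],
                   "Yearly": [150000, 170000, 140000, 135000, 150000]})

avg = df.groupby("State")["Yearly"].mean().sort_values(ascending=False)
print(avg.index[0])   # NJ -- the state with the highest average in this sample
```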
Zooming in on the salaries in New Jersey, we can see there is one company with a much higher salary than the rest.
Finally, our main question for this tutorial: the top five companies that will make you rich working as a user experience researcher.
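With the combined dataset, the headline answer is a five-row sort; the company names and salaries below are placeholders for the real scraped values:

```python
import pandas as pd

# Hypothetical company-level yearly salaries.
df = pd.DataFrame({"Company": ["A", "B", "C", "D", "E", "F"],
                   "Yearly": [200000, 180000, 175000, 160000, 150000, 140000]})

# nlargest keeps the five highest-paying companies.
top5 = df.nlargest(5, "Yearly")["Company"].tolist()
print(top5)   # ['A', 'B', 'C', 'D', 'E']
```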
Bias
The main purpose of this blog is to serve as a tutorial, and the analysis in it includes bias. First, the website we used does not have salary information for every state. Second, for the 'California' and 'New York State' salary pages, this tutorial is only able to get the first page of data for each state. Third, in the figure 'Shape of dataset, counts for unit and state', we can see that the amount of salary information varies considerably between states.
Copyright
Brian C. Keegan, Ph.D.
Assistant Professor, Department of Information Science
University of Colorado Boulder