Data Cleaning Web Scraped Job Data

Michael Salmon
7 min read · May 1, 2017


Hi all,

In a previous post, I reviewed a process for scraping job posting data from the website indeed.com. I did this with the intention of collecting a variety of job postings from different cities, and then building a model to predict their relative salaries.

Before I could build my model, however, I needed to perform some initial analysis and cleaning of the scraped data. This post will briefly review some of the techniques I utilized in my cleaning process.

For this task, I used Python along with several of its libraries, including pandas and numpy.
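All of the snippets below assume the usual imports:

import pandas as pd
import numpy as np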

When we last left off, I had successfully scraped and saved a multitude of job postings from different US cities into a single pandas data frame, which looks something like the following (you’ll recall, the majority of job postings did not include salary information):
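As a rough, purely illustrative sketch of that structure (the example values below are invented, and the real frame may contain additional columns such as the company name):

#invented rows, just to show the general shape of the scraped data
example_rows = pd.DataFrame({
    "job_title": ["Data Scientist", "Research Analyst"],
    "location": ["New York, NY 10011", "Austin, TX"],
    "city": ["New York", "Austin"],
    "summary": ["Build predictive models...", "Support ad hoc analyses..."],
    "salary": ["$85,000 - $110,000 a year\n", "Nothing_found"],
})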

In approaching the task of cleaning my data, I had a few objectives:

  • eliminate all duplicate job postings from my data
  • format all available salary data so that salaries were presented in terms of a single figure detailing expected annual compensation
  • conduct any initial cleaning of string data that will make future natural language processing easier when building my predictive models

Eliminating Duplicate Job Postings

I began by reading in my scraped data from my saved csv, and checking its shape. I then eliminated all duplicate rows, and assessed how many truly unique postings I gathered in my scraping.

#reading csv with index_col = 0, otherwise I get an additional
#unnamed column of separate index values
scrape_data = pd.read_csv("scrape_data.csv", index_col=0)
scrape_data.shape
#dropping any duplicate rows:
scrape_data = scrape_data.drop_duplicates()
scrape_data.reset_index(drop=True, inplace=True)
scrape_data.shape

From this process alone, my data frame was reduced from 128,289 rows down to 6,399. While not complex in terms of coding, this step mattered: I only wanted to work with unique job postings in my later analysis.

Formatting Salary Data

To prepare the salary data, I first wanted to look at the sorts of values I was dealing with:

scrape_data["salary"].value_counts()

As you can see, there are several aspects of this data that could make further calculations difficult. Python can't convert strings that contain commas and dollar signs into numbers, so these characters, along with the stray "\n" strings, need to be removed before we can go much further.

On top of this, you'll notice that salaries can be presented as single figures or as ranges, and that they can appear on different pay scales: per year, per month, per week, per day, or per hour. I needed to take care of all of these items in the course of preparing my salary data for further analysis.
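Concretely, the raw strings look something like the following (these particular figures are invented for illustration; only the formats matter):

$85,000 - $110,000 a year
$4,500 a month
$22.50 an hour
Nothing_found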

I began by stripping out some of the extraneous characters that I knew I needed to eliminate:

#some targeted cleaning of salary information to make parsing easier
#--> remove "\n", "$", and ","
#(regex=False so "$" is stripped literally rather than read as a regex anchor)
scrape_data["salary"] = scrape_data["salary"].str.replace("\n", "", regex=False)
scrape_data["salary"] = scrape_data["salary"].str.replace(",", "", regex=False)
scrape_data["salary"] = scrape_data["salary"].str.replace("$", "", regex=False)

While I knew I would need to remove string data on the scale of pay (e.g., “a year,” “a month”) from the salary data before I could work with it, I did want to preserve information about these original pay scales for possible future use. As such, I created a new column of data to capture this. I called this new column “og_salary_period”:

scrape_data["og_salary_period"] = np.nan

#if the salary contains information on time period, save that time
#period string in the og_salary_period column
scrape_data.loc[scrape_data["salary"].str.contains("year"), "og_salary_period"] = "year"
scrape_data.loc[scrape_data["salary"].str.contains("month"), "og_salary_period"] = "month"
scrape_data.loc[scrape_data["salary"].str.contains("week"), "og_salary_period"] = "week"
scrape_data.loc[scrape_data["salary"].str.contains("day"), "og_salary_period"] = "day"
scrape_data.loc[scrape_data["salary"].str.contains("hour"), "og_salary_period"] = "hour"

I then isolated all jobs with salary information into a separate data frame that I could work with to put them all on the same scale. Notice that I also removed these salaried rows entirely from the original scrape_data table. I did this knowing that I would re-append the salaried rows once I had standardized their appearance.

#filtering out the results with salary data != "Nothing_found"
salary_data = scrape_data[scrape_data["salary"] != "Nothing_found"]

#removing all rows in salary data from scrape data, and converting
#all "Nothing_found" values to NaN, so that float salary values can
#be easily reintegrated later
scrape_data = scrape_data[~scrape_data.isin(salary_data)].dropna(how='all')
scrape_data["salary"].replace("Nothing_found", np.nan, inplace=True)
scrape_data["salary"] = scrape_data["salary"].astype('float')

Here’s a quick look at the salary data frame. You’ll notice that the og_salary_period column captures information about the original scale of the salary information.
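Something as simple as the following shows the idea (using the column names created above):

salary_data[["salary", "og_salary_period"]].head()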

From here, I separated job postings with salary information according to the scale in which original salary data was presented. I also removed the string data associated with the pay scale.

#separating out data based on whether salary contains hour, day,
#week, month, or year
#(.copy() avoids SettingWithCopyWarning when these slices are modified below)
year_salaries = salary_data[salary_data["salary"].str.contains("year")].copy()
month_salaries = salary_data[salary_data["salary"].str.contains("month")].copy()
week_salaries = salary_data[salary_data["salary"].str.contains("week")].copy()
day_salaries = salary_data[salary_data["salary"].str.contains("day")].copy()
hour_salaries = salary_data[salary_data["salary"].str.contains("hour")].copy()

#removing string values (" a year", " a week", etc.) from salary dfs
year_salaries["salary"] = year_salaries["salary"].str.replace(" a year", "")
month_salaries["salary"] = month_salaries["salary"].str.replace(" a month", "")
week_salaries["salary"] = week_salaries["salary"].str.replace(" a week", "")
day_salaries["salary"] = day_salaries["salary"].str.replace(" a day", "")
hour_salaries["salary"] = hour_salaries["salary"].str.replace(" an hour", "")

Next, I defined a function that would detect when salary information was provided in a range (by looking for a hyphen in the data), and return an average of the two values. If there is no hyphen, it simply returns the single value as a float.

def split_sal(i):
    try:
        #salary ranges appear as "low - high"; average the two ends
        splt = i.split(' - ', 1)
        first = float(splt[0])
        second = float(splt[1])
        return (first + second) / 2
    except (IndexError, ValueError):
        #single figure: just convert it to a float
        return float(i)
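For example, with invented values:

split_sal("85000 - 110000")   #returns 97500.0
split_sal("60000")            #returns 60000.0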

With my function in place, I could then clean the salary data and transform any salary data that wasn’t on an annual scale to approximate yearly earnings.

(While there may be discussion to be had regarding whether to keep jobs with hourly and weekly salaries, I decided to preserve these jobs for the sake of having more data available, and because I felt that the pay grade could still approximate the true value of the work on offer.)

year_salaries["salary"] = year_salaries["salary"].apply(split_sal)
month_salaries["salary"] = month_salaries["salary"].apply(split_sal)
month_salaries["salary"] = month_salaries["salary"] * 12      #12 months per year
week_salaries["salary"] = week_salaries["salary"].apply(split_sal)
week_salaries["salary"] = week_salaries["salary"] * 52        #52 weeks per year
day_salaries["salary"] = day_salaries["salary"].apply(split_sal)
day_salaries["salary"] = day_salaries["salary"] * 260         #260 weekdays per year (5 x 52)
hour_salaries["salary"] = hour_salaries["salary"].apply(split_sal)
hour_salaries["salary"] = hour_salaries["salary"] * 2080      #2080 working hours per year (40 x 52)

Finally, I appended the various salaried data frames together, and rejoined them to the original scrape_data. While I did this in two steps, it could be consolidated into a single step.

#rejoining salary data into main scrape_data df
combined_salaries = pd.concat([year_salaries, month_salaries, week_salaries, day_salaries, hour_salaries], axis=0)
scrape_data = pd.concat([scrape_data, combined_salaries], axis=0)

Huzzah! Salary data ready!
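If you want a quick sanity check at this point, something like the following confirms that the salary column is now numeric and expressed in annual terms:

scrape_data["salary"].describe()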

Additional Assorted Data Cleaning

Before I felt ready to begin my modeling, I wanted to accomplish a few more cleaning tasks to prepare data for natural language processing.

Before removing special characters from all data, I recognized that there were several instances of "R&D" within the data. To prevent simply stripping the ampersand and leaving two separate strings of "r" and "d", I wanted to replace this particular string first, before doing a further blanket removal of special characters:

#targeted replace of "R&D" in job titles and summaries with
#"research development", as I want to remove "&" from listings
#as part of general clean-up
scrape_data["job_title"] = scrape_data["job_title"].str.replace("R&D", "research development")
scrape_data["summary"] = scrape_data["summary"].str.replace("R&D", "research development")

Next, I defined a function to scan a column and remove any of a list of special characters. I then applied this function to all columns except for "salary" (a column of float data) and "og_salary_period" (as I specifically created this column without any of these characters).

#creating data character cleaning function, and applying to all
#columns, also lowercasing all string data for ease of later nlp
def data_clean(df, column):
    cleaning_list = ["+", "$", "/", ",", "?", ".", ";", "-", "@", "!", "&", "%", "^", "*", ")", "(", "\n"]
    for item in cleaning_list:
        #regex=False treats characters like "+", "$", and "(" as literals
        df[column] = df[column].str.replace(item, " ", regex=False)
    df[column] = df[column].str.lower()

#can't clean the salary column due to float values, and don't need
#to clean og_salary_period, so keeping both out of the loop
for column in scrape_data.columns[0:len(scrape_data.columns)-2]:
    data_clean(scrape_data, column)

Finally, I cleaned up the location data a bit. I noticed that certain job postings contained location data that simply said "united states". Because this was completely unhelpful, where this was the case I filled in these values with the city names I had collected during my scrape.

Also, to simplify location data, I kept only the city and state names from each posting, splitting the location string off where a zip code begins (for example, "new york ny 10011" would become "new york ny"). I realized I was losing some information this way, but I felt it would make examining groups of locations easier, leaving fewer unique representations of the same places that wouldn't add much to the natural language processing analysis.

#where location == "united states" I filled this in with the city
#location name, otherwise it'll be useless if more than one city has
#location data of "united states"
#using numpy is faster than using .replace
scrape_data['location'] = np.where(scrape_data['location'] == "united states", scrape_data["city"], scrape_data['location'])

#I'm also preemptively simplifying location data to include only city
#and state, cutting data off at the zip code. Note: r" \d" splits the
#string at the first numeric digit, and [0] keeps only the text before it
scrape_data["location"] = scrape_data["location"].str.split(r' \d', n=1, expand=True)[0]
scrape_data.tail()

And that’s it! The final step was to save my data as a cleaned csv to make it easier to load and begin my modeling.

scrape_data.to_csv("scraped_clean.csv")

Thanks for reading, stay tuned for more!
