Scraping and Classifying Indeed Job Postings for Data Occupations, Part 1

Compiling a dataset of job postings, structuring the text, and then classifying the postings by income level and occupation type requires basic web scraping, data munging, and machine learning tools. That makes the exercise good practice for a data scientist developing their skills, which is how I came to undertake it, as well as a demonstration of the power of open-source technologies.

In Part 1, I’ll describe how I compiled and structured the dataset. (In Part 2, I’ll detail how I used machine learning algorithms to classify the postings, as well as what I gleaned.) To begin, I wrote a function that, given a seed url I generated by specifying search parameters, scrapes the job aggregator Indeed for postings. I used the following keywords to get a mix of data science and related job postings in Boston: ‘data science,’ ‘data scientist,’ ‘data analyst,’ ‘analytics,’ and ‘business intelligence.’ I also defined three salary tiers: $50,000–$75,000, $85,000–$110,000, and $115,000+. To populate my dataset with enough data science roles (interestingly, the keyword “data science” turned out to be a bad proxy) as well as higher-income jobs, I expanded my search for “data scientists” from Boston to the US as a whole. A seed url represents one combination of these parameters: for example, postings that include the words “Data Science,” pay $75,000 per year, and are located in Boston.
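To make the idea of a seed url concrete, here is a minimal sketch that builds one url per combination of keyword, salary, and location. The endpoint, the `q` and `l` parameter names, and the way the salary is appended to the query text are all assumptions for illustration, as is the helper name `build_seed_url`; check them against the urls Indeed actually produces when you run a search by hand.

```python
from itertools import product
from urllib.parse import urlencode

# Assumed search endpoint and parameter names; verify against a live search.
BASE_URL = "https://www.indeed.com/jobs"

KEYWORDS = [
    "data science",
    "data scientist",
    "data analyst",
    "analytics",
    "business intelligence",
]
# Hypothetical encoding of the three salary tiers as salary floors.
SALARY_FLOORS = ["$50,000", "$85,000", "$115,000"]
LOCATIONS = ["Boston, MA"]


def build_seed_url(keyword, salary, location):
    """Return a search url for one keyword/salary/location combination."""
    # Assumption: the salary filter travels inside the query text.
    params = {"q": f"{keyword} {salary}", "l": location}
    return f"{BASE_URL}?{urlencode(params)}"


# One seed url per combination of parameters.
seed_urls = [
    build_seed_url(keyword, salary, location)
    for keyword, salary, location in product(KEYWORDS, SALARY_FLOORS, LOCATIONS)
]
```

With five keywords, three salary values, and one location, this yields 15 seed urls. Widening the search to the whole US would just mean adding another entry to `LOCATIONS`.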

This graph was *cough* lifted from Google Images, and I can’t speak for its accuracy. But it seems accurate?

The function crawls the pages of listings matching the parameters in the seed url, up to 1,000 postings in total, though many of these are duplicates. It parses the HTML on each page of results to extract the urls of job postings that contain ‘company,’ a proxy for a posting being hosted on Indeed. Most, though not all, postings without ‘company’ in their ‘…’ url redirect to external pages. The formatting of those external postings differs widely, so for this exercise I ignored them. Postings hosted on Indeed, however, I was able to parse with a one-size-fits-all approach.
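The crawl over result pages can be sketched as a generator that turns one seed url into a series of page urls. This assumes the site pages through results with a `start` offset parameter in steps of 10, which is my guess rather than something confirmed here; `page_urls` and the example seed url are hypothetical.

```python
def page_urls(seed_url, max_postings=1000, per_page=10):
    """Yield one url per results page, covering up to max_postings postings.

    Assumption: results are paged with a `start` offset appended to a seed
    url that already carries a query string.
    """
    for offset in range(0, max_postings, per_page):
        yield f"{seed_url}&start={offset}"


# Placeholder seed url, for illustration only.
pages = list(page_urls("https://example.com/jobs?q=data+scientist"))
```

Each page url would then be fetched and parsed in turn, stopping early once a page returns no postings.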

In the function that extracts the urls, I used a module called requests to fetch the HTML as a string. Following that, I used XPath, a query language for selecting nodes with path expressions, to query data from the parsed HTML object.

I identify a relative path for urls of job postings
The path returns urls. All of these are preceded by

Identifying the path that matches the element or attribute I’m interested in is very much a guess-and-check process: I write a query and view the result (aided here by the free XPath Helper tool). Once I have the path, I can extract the data.
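Here is a small sketch of that query step, using lxml as the XPath engine (an assumption on my part, since requests alone only fetches the page). The sample markup, the `//a/@href` query, and the helper name `extract_hrefs` are invented for illustration; a real results page will need whatever path your own guess-and-check turns up.

```python
from lxml import html

# Invented stand-in for a page of search results. In practice the string
# would come from something like requests.get(page_url).text.
SAMPLE_PAGE = """
<html><body>
  <a href="/company/acme/jobs/data-scientist-1">Data Scientist</a>
  <a href="/external/redirect?id=2">Data Analyst</a>
  <span>not a link</span>
</body></html>
"""


def extract_hrefs(page_source, query="//a/@href"):
    """Parse an HTML string and return every value matched by the XPath query."""
    tree = html.fromstring(page_source)
    # Convert lxml's string results to plain str for easier handling later.
    return [str(href) for href in tree.xpath(query)]


hrefs = extract_hrefs(SAMPLE_PAGE)
```

Because the query is just an argument, the same helper can be reused to try out candidate paths one after another.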

With parsing tools and a query in place, I retained the urls that contain ‘company.’ The function returned a list of de-duplicated urls. Let’s call this step 1.
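The filtering and de-duplication at the end of step 1 might look like the following. The function name `keep_hosted_urls` and the example urls are hypothetical; the only thing taken from the description above is the rule itself, which is to keep urls containing ‘company’ and drop repeats.

```python
def keep_hosted_urls(urls, marker="company"):
    """Keep urls containing the marker, dropping duplicates but preserving order."""
    seen = set()
    kept = []
    for url in urls:
        if marker in url and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


# Invented example urls: one duplicate and one external redirect.
scraped = [
    "/company/acme/jobs/data-scientist-1",
    "/external/redirect?id=2",
    "/company/acme/jobs/data-scientist-1",
    "/company/globex/jobs/data-analyst-3",
]
hosted = keep_hosted_urls(scraped)
```

Preserving order is not strictly necessary, but it makes the output easier to compare against the pages it came from.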

In step 2, I loaded the urls into a Pandas DataFrame, fetched the HTML behind each url (with requests, as before), and then parsed the job title and all of the written content in the job summary using XPath queries as above. I stored this information alongside the corresponding url in the DataFrame.

The last step was to tag each posting with an income level and with whether or not it is a data science job, to support the classification problems I’ll discuss in Part 2. Whereas Indeed defines the income level for us through the results it returns for a given salary range, tagging a posting as a data science job is subjective. I lowercased the job titles (so that I wouldn’t miss keywords because of capitalization) and then scanned them, masking for keywords that resonated: “data scientist,” “machine learning,” and “big data,” to name a few. Taking the index of the job titles that met these conditions, I iterated through the DataFrame to mark the relevant postings as data science jobs. At this point, I had a dataset!
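The tagging step can be sketched in a few lines of pandas. The sample rows, the column names, and the short keyword list are made up for illustration, and the sketch assigns the boolean mask directly as a new column rather than iterating over the matching index as described above; the result should be the same.

```python
import re

import pandas as pd

# A few of the title keywords mentioned above; the full list was longer.
DS_KEYWORDS = ["data scientist", "machine learning", "big data"]

# Invented sample rows standing in for the scraped postings.
postings = pd.DataFrame(
    {
        "url": ["/company/a/1", "/company/b/2", "/company/c/3", "/company/d/4"],
        "title": [
            "Senior Data Scientist",
            "Business Intelligence Analyst",
            "Machine Learning Engineer",
            "BIG DATA Developer",
        ],
    }
)

# Build one regex that matches any keyword, escaping each keyword literally.
pattern = "|".join(re.escape(keyword) for keyword in DS_KEYWORDS)

# Lowercase the titles first so capitalization cannot hide a match.
postings["is_data_science"] = postings["title"].str.lower().str.contains(pattern)
```

Here the first, third, and fourth rows are tagged as data science jobs, while the business intelligence title is not.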

If you decide to replicate this work, be advised that compiling this dataset is time-consuming. For this reason, and also because urls of job postings frequently disappear, I recommend saving your data. In the next post, I’ll discuss the results of the classification problems and my takeaways about what distinguishes jobs in data at different salary levels, and what distinguishes “Data Science” from other data-related occupations.