What 20 skills do you need to become a data scientist?

Using Python, APIs and NLP keyword analysis to understand the skills listed in data science job descriptions

Christina Stejskalova
Geek Culture
6 min read, Jun 16, 2021


whiteboard with circle and 3 arrows of various skills pointing at the data science circle
Image by author

We have all heard that data is the new oil. But, with so many definitions of what a data scientist is, how can you prepare yourself for a career in data science?

One approach is to study the skills listed in data science job descriptions. That’s exactly what I did. Using the Google Jobs API available through SerpAPI, I pulled the job descriptions of over 100 data science roles. Then I used a combination of spaCy, NLTK and Gensim to clean the data and extract keywords, identifying the top hard and soft skills listed in the job descriptions.

Step 1: Getting the data

Getting job descriptions from LinkedIn and Glassdoor is complex. The good news is that I found another solution! Using SerpAPI, I was able to access Google Jobs job descriptions and take advantage of the free trial!

PRO TIP: Use the interactive browser to create a tailored request. I wanted to limit the search results to the last week, so I went to the Google search page, extracted the date_posted value and passed it in the chips parameter.

By default the API returns only 10 results per call, scarcely enough for a meaningful analysis. To get around this, I created a list called ‘start’ and a function that loops through the items in that list, calling the API once for each page. As a result, I was able to pull 100 results per job title:
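
The original snippet is not reproduced here, so the following is only a minimal sketch of such a paging loop. Several things in it are assumptions to verify against SerpAPI’s current documentation: the endpoint URL, the parameter names (engine, q, start, chips, api_key), the date_posted:week value and the jobs_results response key. The helper names fetch_page and collect_jobs are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; check SerpAPI's documentation before relying on it.
SERPAPI_URL = "https://serpapi.com/search.json"


def fetch_page(query, start, api_key, chips="date_posted:week"):
    """Fetch one page of Google Jobs results (parameter names are assumptions)."""
    params = {
        "engine": "google_jobs",
        "q": query,
        "start": start,    # offset of the first result on this page
        "chips": chips,    # filter, here: posted within the last week
        "api_key": api_key,
    }
    url = SERPAPI_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp).get("jobs_results", [])


def collect_jobs(query, fetch, page_size=10, pages=10):
    """Call fetch(query, offset) once per page and merge the results."""
    start = [page_size * i for i in range(pages)]  # 0, 10, ..., 90
    jobs = []
    for offset in start:
        page = fetch(query, offset)
        if not page:  # stop early once the API runs out of results
            break
        jobs.extend(page)
    return jobs
```

With a real key, a call such as collect_jobs("data scientist", lambda q, s: fetch_page(q, s, api_key="YOUR_KEY")) would request up to ten pages for that job title. Passing the fetch function in as an argument keeps the paging logic testable without network access.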

Step 2: Splitting the text to identify skills

Most job descriptions are roughly split into 3 categories:

  1. Company intro
  2. Responsibilities of the role
  3. List of specific skills

A bit like this:

Sample text from job description

Most of the skills information sits in the bulleted sections. For the sake of simplicity, the easiest way to get the data we want is to extract all the bullets from every job description. The good news is that this removes the company info that we don’t need, but leaves us with the responsibilities (soft skills) and qualifications:
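
One way to pull out the bullets is a line-by-line regular expression. This is a sketch rather than the code used for the article: it assumes the descriptions arrive as plain text with one bullet per line, and the set of bullet characters and the helper name extract_bullets are illustrative choices.

```python
import re

# A bullet line starts with a bullet-like symbol or "1." / "1)" followed by text.
BULLET = re.compile(r"^\s*(?:[•·▪◦●*\-–]|\d+[.)])\s+(.*\S)\s*$")


def extract_bullets(description):
    """Return the text of every bulleted line in a job description."""
    bullets = []
    for line in description.splitlines():
        match = BULLET.match(line)
        if match:
            bullets.append(match.group(1))
    return bullets
```

A description with no bulleted lines yields an empty list, which makes it easy to drop such postings before the analysis.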

Step 3: Cleaning the data

I decided to use the spaCy library to tokenize my job descriptions. spaCy tags each token with its part of speech, which lets you pick specific types of words. As I was interested in skills, I chose nouns (which would likely give me hard skills like Python) and verbs (which capture soft skills like ‘Communicating’). I then lemmatized the words, put them all into lower case, and the fun could begin!
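
A minimal sketch of this cleaning step, assuming spaCy and its small English model en_core_web_sm are installed. The helper names are hypothetical. Keeping PROPN alongside NOUN and dropping stop words are added assumptions, not something the article states: tool names such as Python may be tagged as proper nouns rather than common nouns.

```python
# Parts of speech to keep; PROPN is an assumption added for tool names.
KEEP_POS = {"NOUN", "PROPN", "VERB"}


def keep_skill_terms(doc, keep_pos=KEEP_POS):
    """Return lower-cased lemmas of the kept parts of speech, minus stop words."""
    return [
        token.lemma_.lower()
        for token in doc
        if token.pos_ in keep_pos and not token.is_stop
    ]


def clean_bullets(bullets, model="en_core_web_sm"):
    """Run spaCy over each bullet and keep only the skill-like terms."""
    import spacy  # imported here so keep_skill_terms works without spaCy installed

    nlp = spacy.load(model)
    return [keep_skill_terms(doc) for doc in nlp.pipe(bullets)]
```

For a bullet like "Communicating results to stakeholders", the filter is expected to keep the verb and noun lemmas and drop the preposition, though the exact tags depend on the model version.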

Step 4: Term Frequency Analysis

With the data cleaned, I started with a classic term frequency (TF) analysis*. Viewed this way, these are the top skills listed across 89 data science job descriptions (some were dropped because they didn’t have bulleted qualifications):

bar chart showing frequency of various nouns and verbs across the 89 analyzed job descriptions
Image by author

*I ran an LDA analysis as well, but the results I got were uninformative. I found this simple analysis much more helpful.
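
The frequency count itself can be sketched with the standard library alone. This sketch assumes the percentage reported for a term is the share of job descriptions that mention it at least once, so each term is counted once per description. The helper name top_terms is hypothetical.

```python
from collections import Counter


def top_terms(docs, n=20):
    """Rank terms by the share of descriptions that contain them.

    docs is a list of token lists, one list per job description.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per description
    total = len(docs)
    return [(term, count / total) for term, count in doc_freq.most_common(n)]
```

A term that appears in 3 of 4 descriptions gets a share of 0.75, however many times it is repeated within any one of them.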

Results: How can you apply this?

Python is the second most common noun, listed in 68% of job descriptions. Perhaps that’s not surprising, but what I do find surprising is that SQL is the next tool in the top 20 list. To me, this implies a desire for data science generalists who are not focused on any specific tool (there ain’t no PyTorch or NLTK listed here) but rather have the general ability to query data.

This is also reflected in some of the other skills. ‘Business’ features highly, reflecting the critical need for data scientists to always tie their work to business outcomes; a great example is showing the product changes that could be made based on the data insights.

But does a job description reflect what’s needed to be an actual data scientist?

As always, the answer is: it depends. If you are looking for a job and want to maximize your chance of passing the initial screen, then this is the analysis for you. Typically, around 80% to 90% [2] of resumes get filtered out at the screening stage:

Flow chart of job funnel process from application, screening to offer showing how much drop-off occurs at each stage
The candidate hiring funnel (Source) [3] | Image by author

This stage consists of recruiter screening, ATS screening or both, and keywords are critical to passing. A great way to get more keywords into your resume is to use the job description. So, the more you can pepper your resume with the above words, the more likely you are to pass the majority of screens with one resume.

That said, you could also use the opposite tactic. Given that most job descriptions don’t list specific skills, you could focus exclusively on those that do. If, for example, you are a PyTorch expert and that’s what the job spec requires, you can easily distinguish yourself from other candidates.

Based on your own experience, how do your impressions of data science skills differ from what we saw here? What other job descriptions would you want to see analyzed?

[1] William Scott, TF-IDF from scratch in python on real world dataset, (2018), 2019, Towards Data Science

[2] Jobvite, The Recruiting Funnel, Deconstructed, (2015)

[3] Diamond lister product Demo, Diamond listers: Hire quickly using the power of NLP to screen candidates using their voice, (2020)


My articles vary in topic but focus on how you can build products that have impact with the power of psychology and data