What 20 skills do you need to become a data scientist?
Using Python, APIs, and NLP keyword analysis to understand the skills listed in data science job descriptions
We have all heard that data is the new oil. But, with so many definitions of what a data scientist is, how can you prepare yourself for a career in data science?
One approach is to understand the skills needed from data science job descriptions. That’s exactly what I did. Using the Google Jobs API available through SerpAPI, I pulled the job descriptions of over 100 data science roles. Then, I used a combination of Spacy, NLTK and Gensim to clean the data and extract keywords, identifying the top hard and soft skills listed in job descriptions.
Step 1: Getting the data
Getting job descriptions from LinkedIn and Glassdoor is complex. The good news is that I found another solution! Using SerpAPI, I was able to access Google Jobs job descriptions and take advantage of the free trial!
# Install Google search results package
!pip install google-search-results
PRO TIP: Use the interactive browser to create a tailored request. I wanted to restrict the search results to the last week, so I went to the Google search page, extracted the type parameter for date_posted, and passed it in via the chips parameter.
By default, the API returned only 10 results, scarcely enough for a meaningful analysis. To get around this, I created a list called ‘start’ and a function that loops through the items in the start list, calling the API for each page. As a result, I was able to pull 100 results per job title:
# Call API
from serpapi import GoogleSearch
import pandas as pd

# Define the pagination
start = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

# I used 1 job title as a test
# job_title = 'analytics manager'

# Then I pulled a list of job titles I was interested in searching
job_titles = ['data scientist', 'sales', 'software engineer', 'UX researcher',
              'product designer', 'analytics manager', 'product manager',
              'product marketing manager', 'data engineer', 'business development']

def get_jobs(start, job_titles):
    # Create an empty dataframe
    final = pd.DataFrame()
    for j in job_titles:
        for i in start:
            params = {
                "engine": "google_jobs",
                "google_domain": "google.com",
                "q": f"{j}",
                "gl": "us",
                "hl": "en",
                "chips": "date_posted;week",
                "location": "New York, New York, United States",
                "api_key": "<YOUR SECRET>",
                "start": f"{i}",
            }
            search = GoogleSearch(params)
            results = search.get_dict()
            # Put results into a dataframe
            jobs = pd.DataFrame.from_dict(results['jobs_results'])
            # Append the page of results to the final dataframe
            final = pd.concat([final, jobs], ignore_index=True)
    return final
Step 2: Splitting the text to identify skills
Most job descriptions are roughly split into 3 categories:
- Company intro
- Responsibilities of the role and
- List of specific skills
A bit like this:
Most of the skills information is listed in the bulleted categories. For the sake of simplicity, the easiest way to get the data we want is to extract all the bullets from every job description. The good news is that this will remove the company info that we don’t need, but leave us with the responsibilities (soft skills) and qualifications:
import re

ds["all_bullets_string"] = [' '.join(map(str, re.findall('•(.+)', ds.description[i])))
                            for i, _ in enumerate(ds.description)]
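To see what that regex actually does, here is a minimal sketch on a made-up job description snippet (the sample text is invented for illustration; the real analysis runs over the ds dataframe):

```python
import re

# A made-up snippet of a job description, just to illustrate the pattern
description = (
    "About us: we build widgets.\n"
    "• 3+ years of experience with Python\n"
    "• Strong SQL and communication skills"
)

# '•(.+)' captures everything after each bullet character up to the line end
bullets = re.findall('•(.+)', description)

# Join the captured bullets into one string, as in the list comprehension above
joined = ' '.join(map(str, bullets))
```

Note that the company intro line contains no bullet character, so it is dropped, which is exactly the filtering behavior the article relies on.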
Step 3: Cleaning the data
I decided to use the Spacy platform to tokenize my job descriptions. Spacy’s tokenizer allows you to pick specific types of words. As I am interested in skills, I chose nouns (which would likely give me the hard skills, like Python) and verbs (which capture the soft skills, like ‘communicating’). I then lemmatized the words and converted them to lower case, and the fun could begin!
# Start comparing the topics together
# Import our language packages
import gensim
import nltk
import spacy

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *

# Spacy lemmatizer
# Load core model
sp = spacy.load('en_core_web_sm')

# Remove stopwords
all_stopwords = sp.Defaults.stop_words

def lemmatize_words(job_description):
    result = []
    # Tokenize each job description
    document = sp(job_description)
    # Remove punctuation and stop words; also remove high-frequency
    # but uninformative words like 'data'
    for word in document:
        if (not word.is_punct and not word.like_num and not word.is_stop
                and not word.is_space and word.pos_ in ('NOUN', 'PROPN', 'VERB')
                and word.lemma_ != 'datum'):
            result.append(word.lemma_.lower())
    result = [word for word in result if word not in all_stopwords]
    return result

# Other tokenizer
def noun_chunks(job_description):
    result = []
    # Tokenize the job description text
    document = sp(job_description)
    # Would be interesting to see if it makes any difference if we
    # look at sentences first
    # list(document.sents)[0]
    for chunk in document.noun_chunks:
        result.append(chunk)
    return result
Step 4: Term Frequency Analysis
After cleaning the data, I started with a classic TF analysis*. Looking at the data this way, these are the top skills listed across 89 data science job descriptions (some rows were dropped because they had no bulleted qualifications):
lemmatize_docs = []
# Lists for the other tokenizers (not used in the final analysis)
noun_docs = []
processed_docs = []

for doc in ds.all_bullets_string:
    processed_docs.append(simple_preprocess(doc))
    lemmatize_docs.append(lemmatize_words(doc))
    noun_docs.append(noun_chunks(doc))

# This creates a dictionary that counts the number of times a word occurs
dictionary2 = gensim.corpora.Dictionary(lemmatize_docs)

# Do a simple plot of word frequencies
# Sourced from William Scott [1]
df = {}
for i in range(len(lemmatize_docs)):
    tokens = lemmatize_docs[i]
    for w in tokens:
        try:
            df[w].add(i)
        except KeyError:
            df[w] = {i}

# Replace each set of document ids with its length, i.e. the document frequency
for k, v in df.items():
    df[k] = len(v)

# Sort the dictionary
import operator
sorted_d = dict(sorted(df.items(), key=operator.itemgetter(1), reverse=True))
print(sorted_d)

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# Set up the plot
a4_dims = (20, 9)
fig, ax = plt.subplots(figsize=a4_dims)
sns.set_theme(style="whitegrid")
sns.set(font_scale=1.4)

# Create values
keys = list(sorted_d.keys())[:20]
# Get values in the same order as keys
vals = list(sorted_d.values())[:20]
# Convert counts to the percentage of the 89 job descriptions
perc = [(i / 89) * 100 for i in vals]
pal = sns.color_palette("Spectral", len(vals))

# Draw the bar plot first, then format the axes
ax = sns.barplot(x=keys, y=perc, palette=pal)

# Format the y axis as percentages
fmt = '%.0f%%'  # Format you want the ticks, e.g. '40%'
yticks = mtick.FormatStrFormatter(fmt)
ax.yaxis.set_major_formatter(yticks)
ax.set(xlabel='Skill', ylabel='% of Job Descriptions')
ax.set_xticklabels(keys, rotation=30)

# Add title
plt.title('The most frequently mentioned nouns in data science job descriptions')
plt.show()
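The try/except pattern above for counting document frequency can also be sketched with collections.defaultdict. This is a minimal illustration using a tiny hand-made token list standing in for the real lemmatize_docs:

```python
from collections import defaultdict

# Hypothetical tokenized documents standing in for lemmatize_docs
docs = [['python', 'sql', 'python'],
        ['sql', 'communication'],
        ['python']]

# Map each token to the set of document ids it appears in;
# using a set means repeated tokens in one document count only once
doc_ids = defaultdict(set)
for i, tokens in enumerate(docs):
    for w in tokens:
        doc_ids[w].add(i)

# Document frequency: in how many documents each token appears
df = {w: len(ids) for w, ids in doc_ids.items()}
sorted_d = dict(sorted(df.items(), key=lambda kv: kv[1], reverse=True))
```

This is why ‘python’ scores 2 here despite appearing three times: document frequency counts documents, not occurrences, which is exactly what the percentage chart needs.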
*I ran an LDA analysis as well, but the results I got were uninformative. I found this simpler analysis much more helpful.
Results: How can you apply this?
Python comes in as the second most common noun, listed in 68% of job descriptions. Perhaps that’s not surprising, but what I do find surprising is that SQL is the only other tool in the top 20 list. To me, this implies a desire for data science generalists, not candidates focused on any specific library (there’s no PyTorch or NLTK listed here), but rather people with the general ability to query data.
This is also reflected in some of the other skills. Business features highly, reflecting the critical need for data scientists to always tie their work to business outcomes; a great example is actually showing the product changes that could be made based on the data insights.
But, does a job description reflect what’s needed to be an actual data scientist?
As always, the answer is: it depends. If you are looking for a job and want to maximize your chance of passing the initial screen, then this is the analysis for you. Typically, around 80%–90% [2] of resumes get filtered out at the screening stage:
This stage consists of a combination of recruiter and/or ATS screening, and keywords are critical to passing it. A great way to get more keywords into your resume is to use the job description itself. The more you pepper your resume with the words above, the more likely you are to pass the majority of screens with a single resume.
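As a rough sanity check, you can count how many of the top keywords already appear in your resume. This is a minimal sketch with made-up keywords and resume text, not part of the original analysis:

```python
# Hypothetical top keywords taken from a frequency analysis like the one above
top_keywords = {'python', 'sql', 'business', 'model', 'communication'}

resume_text = "Built Python models and SQL pipelines to support business decisions."

# Normalize the resume into a set of lower-cased words, stripping punctuation
resume_words = {w.strip('.,') for w in resume_text.lower().split()}

# Which of the top keywords does the resume already cover?
covered = top_keywords & resume_words
missing = top_keywords - resume_words
coverage = len(covered) / len(top_keywords)
```

Note that ‘models’ does not match ‘model’ here; lemmatizing both sides first (as in Step 3) would close that gap.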
That said, you could also use the opposite tactic. Given that most job descriptions don’t list specific skills, you could focus uniquely on those that do. If, for example, you are a PyTorch expert and that’s what the job spec requires, you can easily distinguish yourself from other candidates.
Based on your own experience, how do your impressions of data science skills differ from what we saw here? What other job descriptions would you like to see analyzed?
[1] William Scott, TF-IDF from scratch in python on real world dataset (2019), Towards Data Science
[2] Jobvite, The Recruiting Funnel, Deconstructed (2015)
[3] Diamond Lister product demo, Diamond Listers: Hire quickly using the power of NLP to screen candidates using their voice (2020)