Web Scrape Indeed For Popular Programming Languages Of Data Scientist Jobs

Published in Analytics Vidhya · 3 min read · Nov 20, 2019

Introduction

As I think about the next step in my career, I keep asking myself which programming language I should focus on most. With more than a hundred programming languages in the world, choosing which one to learn and specialize in can be difficult and overwhelming. In this article, I present what I found by web scraping Indeed for the top 100 data scientist job postings, and I walk through the steps I took.

The full code can be found on my GitHub: https://github.com/ailing123/Indeed-Web-Scraping

Tools

Python with BeautifulSoup

Methodology

1. Decide which programming languages and cities to compare for data scientist jobs.

I chose 13 languages and tools: C, C++, Java, JavaScript, Python, R, SQL, Hadoop, Hive, Pig, Spark, AWS, and Tableau.

I chose 8 locations: San Francisco, Los Angeles, New York, Boston, Chicago, Austin, DC, and Orange County.

2. Scrape the Indeed job pages.

3. Use regular expressions to find those programming languages in each job description.

4. Wrap the scraping code in functions so the links for each city can be generated easily (see the sketch after this list).

5. Visualize the results.
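To avoid repeating the scraping code for each city (step 4), a small helper can build the mobile search URL for any location. Here is a rough sketch of how such a helper could look; the build_search_url name and the city strings are illustrative, not taken from the original code:

from urllib.parse import quote_plus

CITIES = ["san francisco", "los angeles", "new york", "boston",
          "chicago", "austin", "washington dc", "orange county"]

def build_search_url(city, start=0):
    # Indeed's mobile search endpoint, with the query fixed to "data scientist"
    query = quote_plus('"data scientist"')
    return f"https://www.indeed.com/m/jobs?q={query}&l={quote_plus(city)}&start={start}"

print(build_search_url("los angeles"))
# https://www.indeed.com/m/jobs?q=%22data+scientist%22&l=los+angeles&start=0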

Findings

1. The top 3 most in-demand languages for data scientist jobs are Python, R, and SQL.

2. In most cities, Python is the most popular. Nevertheless, in Orange County, SQL edges out Python.

3. Tableau is also in high demand. It is an important visualization tool that is definitely worth learning.

Step 1: Load necessary packages

from bs4 import BeautifulSoup           # parse the HTML of each page
from urllib.request import urlopen      # open the search result pages
import requests                         # fetch individual job description pages
import re                               # regular expressions for keyword matching
import pandas as pd                     # store the counts in a dataframe
import seaborn as sns                   # bar plots
import matplotlib.pyplot as plt         # figure grid and layout

Step 2: Web Scrape Indeed for data scientist jobs

Here, I am scraping Indeed's mobile page since it has a simpler page layout.

By following each job link, we can extract the job description behind it. The example below fetches data scientist jobs in Los Angeles.

url = "https://www.indeed.com/m/jobs?q=%22data+scientist%22&l=los+angeles&start=0"
page = urlopen(url)
soup = BeautifulSoup(page, 'lxml')

# Every job posting on the mobile results page is linked with rel="nofollow"
all_matches = soup.findAll(attrs={'rel': ['nofollow']})
for i in all_matches:
    # Follow each job link and parse the description block
    jd_url = 'http://www.indeed.com/m/' + i['href']
    response = requests.get(jd_url)
    jd_page = response.text
    jd_soup = BeautifulSoup(jd_page, 'lxml')
    jd_desc = jd_soup.findAll('div', attrs={'id': ['desc']})
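Since the goal is the top 100 postings, this loop has to be repeated across result pages. Below is a rough sketch that assumes the mobile site pages in steps of 10 (start=0, 10, ..., 90); the descriptions list is an addition for collecting the text of every posting:

descriptions = []
for start in range(0, 100, 10):
    url = f"https://www.indeed.com/m/jobs?q=%22data+scientist%22&l=los+angeles&start={start}"
    soup = BeautifulSoup(urlopen(url), 'lxml')
    for link in soup.findAll(attrs={'rel': ['nofollow']}):
        # Follow each posting and keep only the text of the description block
        jd_page = requests.get('http://www.indeed.com/m/' + link['href']).text
        jd_soup = BeautifulSoup(jd_page, 'lxml')
        desc = jd_soup.find('div', attrs={'id': 'desc'})
        if desc is not None:
            descriptions.append(desc.get_text(' ', strip=True))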

Step 3: Find the programming languages with regular expressions and count them

Initialize the count of each language to zero. Then use a regular expression to look for each programming language in the job description; every time a language is mentioned, add it to that language's count. Here is how I got the count for Python.

sum_py = 0
python = re.findall(r'[\/\b]?[Pp]ython[\s\/,]?', str(jd_desc))
sum_py = sum_py + len(python)
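The same three lines repeat for the other 12 languages. One way to keep this compact is a dictionary that maps each language to its pattern; the patterns below are approximations, not the exact expressions used for every language:

# Illustrative patterns; short names like C and R need word boundaries in practice
patterns = {
    'Python': r'[Pp]ython',
    'SQL': r'SQL',
    'Tableau': r'[Tt]ableau',
    'Spark': r'[Ss]park',
    # ...the remaining languages follow the same form
}

counts = {lang: 0 for lang in patterns}
for lang, pattern in patterns.items():
    counts[lang] += len(re.findall(pattern, str(jd_desc)))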

Step 4: Put the count into a dataframe

df = pd.DataFrame({
    'language': ["C", "C++", "Java", "Javascript", "Python", "R", "SQL",
                 "Hadoop", "Hive", "Pig", "Spark", "AWS", "Tableau"],
    'count': [sum_C, sum_Cplus, sum_java, sum_javascript, sum_py, sum_r, sum_sql,
              sum_hadoop, sum_hive, sum_pig, sum_spark, sum_aws, sum_tableau]
})
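Before plotting, a quick sort is a handy sanity check on the ranking (this line is just for inspection):

print(df.sort_values('count', ascending=False).head())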

Step 5: Visualize the result

I am using seaborn to visualize the dataframe.

f, axes = plt.subplots(4, 2, figsize=(20, 20))
sns.despine(left=True)


sns.barplot(x='language', y='count', data=df_losangeles,ax=axes[0, 0]).set_title('Los Angeles',fontsize=20)
sns.barplot(x='language', y='count', data=df_sanfrancisco,ax=axes[0, 1]).set_title('San Francisco',fontsize=20)
sns.barplot(x='language', y='count', data=df_newyork,ax=axes[1, 0]).set_title('New York',fontsize=20)
sns.barplot(x='language', y='count', data=df_boston,ax=axes[1, 1]).set_title('Boston',fontsize=20)
sns.barplot(x='language', y='count', data=df_chicago,ax=axes[2, 0]).set_title('Chicago',fontsize=20)
sns.barplot(x='language', y='count', data=df_austin,ax=axes[2, 1]).set_title('Austin',fontsize=20)
sns.barplot(x='language', y='count', data=df_dc,ax=axes[3, 0]).set_title('DC',fontsize=20)
sns.barplot(x='language', y='count', data=df_orange_county,ax=axes[3, 1]).set_title('Orange County',fontsize=20)
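Depending on the environment, the 4x2 grid may need a final layout pass and an explicit render or save, along these lines (the filename is just an example):

plt.tight_layout()
plt.savefig('language_counts_by_city.png', dpi=150)  # example filename
plt.show()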

Conclusion

To sum up, Python and SQL seem to be widely used by data scientists nowadays. The relative demand for different programming languages varies by location, so knowing where you would like to work and focusing on the top languages there will give you a better chance of landing a job.
