Data Scientist vs Data Analyst vs Data Engineer using Word Cloud

The terms Data Scientist, Data Analyst and Data Engineer are often used interchangeably. Although all three are data-focused roles, subtle differences separate them. Since even hiring companies use the job titles interchangeably, let’s try to understand the titles ourselves using... DATA!!

First Love... Let’s Ask Google

Data Scientist

Data scientists are big data wranglers. They take an enormous mass of messy data points (unstructured and structured) and use their formidable skills in math, statistics and programming to clean, massage and organize them. Then they apply all their analytic powers — industry knowledge, contextual understanding, skepticism of existing assumptions — to uncover hidden solutions to business challenges.

Data Analyst

Data analysts collect, process and perform statistical analyses of data. Their skills may not be as advanced as data scientists (e.g. they may not be able to create new algorithms), but their goals are the same — to discover how data can be used to answer questions and solve problems.

Data Engineer

Data engineers build massive reservoirs for big data. They develop, construct, test and maintain architectures such as databases and large-scale data processing systems. Once continuous pipelines are installed to — and from — these huge “pools” of filtered information, data scientists can pull relevant data sets for their analyses.

The above definitions are a little vague and do not clearly explain what skill set a company expects from a candidate for each role.

Our approach to understanding the differences between the job titles

Word Cloud

A word cloud is an image composed of the words used in a particular text or subject, in which the size of each word indicates how frequently it appears in the documents.
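
To make the idea concrete, the counting behind a word cloud can be sketched with the Python standard library. This is only an illustration: the sample text and the `word_frequencies` helper are made up for this example, and the wordcloud library used later does its own tokenizing and scaling internally.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each word appears in text (hypothetical helper)."""
    # Lowercase the text and treat runs of letters, digits and apostrophes as words
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


# Made-up sample text, not taken from the collected job descriptions
sample = "Python and SQL. SQL and statistics. SQL dashboards."
freqs = word_frequencies(sample)

# The most frequent words are the ones a word cloud would draw largest
for word, count in freqs.most_common(3):
    print(word, count)
```

Here "sql" appears three times, so it would be the largest word in the image. Common filler words such as "and" also score high, which is why stopwords are removed before the cloud is drawn.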

Data from LinkedIn, Kaggle and Glassdoor

We collected around 20 ‘Job description and responsibilities’ entries for each of the roles from LinkedIn, Kaggle and Glassdoor, posted by multiple companies. Generating word clouds from this data might help us distinguish the roles more clearly. However, as with many data science analyses, take this one with a grain of salt until we build a much larger data set of job descriptions and responsibilities, preferably from 100 companies for each job title :)

Generating Word Cloud — Python Code

The extracted data is saved in text files and used to generate the word clouds. This uses the word_cloud library, which can be installed with ‘pip install wordcloud’.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt


def read_text(path):
    """ Read a whole text file into one string """
    with open(path, 'r') as f:
        return f.read()


## Data analyst responsibilities
data_analyst_resp = read_text('data/Data_analyst_responsibility.txt')

##### Data analyst skills
data_analyst_skill = read_text('data/Data_analyst_skill.txt')

##### Data scientist responsibilities
data_scientist_responsibility = read_text('data/data_scientist_responsibility.txt')

##### Data scientist skills
data_scientist_skills = read_text('data/data_scientist_skills.txt')


def word_cloud_job_title(data, font_size=40, title=''):
    """ Function to plot Word cloud """
    # Filler words from job postings that say little about the role itself
    extra_stopwords = {'etc', 'years', 'degree', 'skill', 'using', 'preferred', 'field',
                       'based', 'related', 'including', 'ability', 'experience'}
    # Let WordCloud drop them as whole words; removing them with str.replace
    # would also cut them out of longer words ('probability' -> 'prob')
    stopwords = STOPWORDS.union(extra_stopwords)

    # Generate a word cloud image (font_size caps the size of the largest word)
    wordcloud = WordCloud(stopwords=stopwords, max_font_size=font_size).generate(data.lower())

    # Display the generated image the matplotlib way
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    fig = plt.gcf()
    fig.set_size_inches(15, 10)
    plt.title(title, fontsize=24)
    plt.show()

### Data_analyst responsibility
word_cloud_job_title(data_analyst_resp, title = 'data_analyst_responsibility')

### Data_analyst skill
word_cloud_job_title(data_analyst_skill, title = 'data_analyst_skill')

### Data scientist responsibility
word_cloud_job_title(data_scientist_responsibility, title = 'data_scientist_responsibility')

### Data scientist skills
word_cloud_job_title(data_scientist_skills, title='data_scientist_skills')

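### Data engineer responsibility
### (file name is an assumption, following the pattern of the files above)
with open('data/data_engineer_responsibility.txt', 'r') as f:
    data_engineer_responsibility = f.read()
word_cloud_job_title(data_engineer_responsibility, title = 'data_engineer_responsibility')

### Data engineer skills
### (file name is an assumption, following the pattern of the files above)
with open('data/data_engineer_skills.txt', 'r') as f:
    data_engineer_skills = f.read()
word_cloud_job_title(data_engineer_skills, title='data_engineer_skills')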
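
A note on the stopword step: stopwords are best removed as whole words, because a plain `str.replace` also cuts them out of longer words. The sketch below shows the difference using only the standard library. The `remove_stopwords` helper and the sample sentence are hypothetical and are here purely for illustration.

```python
import re


def remove_stopwords(text, stopwords):
    """Remove stopwords only where they appear as whole words (hypothetical helper)."""
    # \b marks a word boundary, so 'ability' does not match inside 'probability'
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in stopwords) + r")\b"
    cleaned = re.sub(pattern, " ", text.lower())
    # Collapse the extra whitespace left behind by the removed words
    return " ".join(cleaned.split())


sample = "Ability to apply probability and statistics"  # made-up sentence
whole_word = remove_stopwords(sample, ["ability", "experience"])
substring = sample.lower().replace("ability", "")

print(whole_word)  # 'probability' is kept intact
print(substring)   # 'probability' is damaged and becomes 'prob'
```

With whole-word matching the term "probability" survives, while the substring approach leaves the fragment "prob" behind, which could then show up in the word cloud.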

Exported Matplotlib Images

Conclusion

Any company that processes large amounts of data will likely have employees in all three roles working in tandem. In the Data Engineer skills word cloud, we notice many keywords such as SQL, Spark and Hadoop that are predominantly used for data processing. Data engineers process big data with these tools and make it easier for Data Scientists and Analysts to work with the collected data.

While both data scientists and analysts work closely with the business team to advise on decisions based on their findings, data scientists also develop prediction models, so stronger qualifications in programming, statistics and quantitative aptitude are expected of them. This again shows in the word cloud keywords for data scientist skills (python, statistics, machine learning).
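
The comparison made by eye above can also be sketched in code: count the words for each role and keep those that rank among the top words of one role but not the other. The two strings below are made-up stand-ins for the collected skills text, and `top_words` and `distinctive_words` are hypothetical helpers, not part of the original analysis.

```python
import re
from collections import Counter


def top_words(text, n=5):
    """Return the set of the n most frequent words in text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {word for word, _ in counts.most_common(n)}


def distinctive_words(text_a, text_b, n=5):
    """Words among the top n of text_a that are not among the top n of text_b."""
    return top_words(text_a, n) - top_words(text_b, n)


# Made-up stand-ins for the skills text of two roles
engineer = "sql spark hadoop sql spark pipelines sql"
scientist = "python statistics sql python machine learning statistics"

engineer_only = distinctive_words(engineer, scientist)
scientist_only = distinctive_words(scientist, engineer)

print(sorted(engineer_only))
print(sorted(scientist_only))
```

In this toy input, "sql" is shared by both roles and drops out, leaving the processing tools on the engineer side and the modelling terms on the scientist side. On the real data the cutoff `n` would need tuning, and with only about 20 postings per role the result should be read as loosely as the word clouds themselves.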

Variants:

Data Analyst: Product/Marketing/Risk Analyst

Data Scientist: Associate/Senior/Lead/Product Data Scientist

Data Engineer: Machine Learning Engineer/Big Data Engineer

This article is co-authored by Ashish khan, who is also a freelancer in machine learning, Android apps, web design and data science. Check out his website here for fun and exciting things one could do with DATA. You can find my work on GitHub here.

References:

  1. Google!
  2. LinkedIn, Kaggle and Glassdoor job descriptions and responsibilities for data analyst, data scientist and data engineer
  3. Springboard career material. (I am currently part of the data science career track program.)