The Confused Candidate

Data Science vs Data Engineering vs Machine Learning Engineering

Ambarish Chatterjee
Analytics Vidhya

I see a lot of people, especially students and newcomers to the Data Science/Data Engineering field, confused by these different job titles, mainly because the job descriptions (JDs) and specifications posted for these positions are generally very similar. With experience in the Analytics/Data Science world, I have understood that it’s not only students and young job-seekers who are confused: companies, and in some cases their leaders, are also unsure about the actual role they need. They consequently post a fuzzy JD on different job portals, further confusing the young crowd.

Although there are articles on the internet where the differences have been discussed, I am trying to write this in a very crisp manner to minimize the confusion. By the end of the article, I expect the following:
i) A student or Data Science enthusiast should be able to choose which career path he/she actually wants, or which role fits him/her.
ii) A hiring manager or leader should be able to decide which role he/she actually needs right now in the team or organization, and how to build a team with a mix of these different roles and skill sets.

The Role of a Data Engineer :

Usually in any organization, the data are maintained in different places and in different databases: one system may use an Oracle database, another Teradata, and another MS SQL. Some systems may take input or give output in JSON or XML. Some may use NoSQL databases like Cassandra or MongoDB. Some data may also sit in a distributed computing framework like Hadoop. A lot of data also remain in Excel files with different teams inside the organization.

The role of a Data Engineer is to systematically plan, create, and maintain the data architectures and database structures that store data from all these different types of structured (all types of RDBMS), semi-structured (JSON, XML etc.) and unstructured (text, image, speech etc.) data sources, in a way that Data Scientists or Analysts can use for their purposes (like getting insights, developing models etc.). They need to know what each field means, and which data and which fields are needed when.
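To make the idea of bringing different sources together concrete, here is a minimal sketch using only the Python standard library. The sample records, the `orders` table, and the `load_all` helper are all hypothetical, and an in-memory SQLite database stands in for whatever warehouse an organization actually uses.

```python
import json
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample inputs standing in for three different source systems.
JSON_SOURCE = '[{"customer_id": 1, "amount": 120.5}, {"customer_id": 2, "amount": 80.0}]'
XML_SOURCE = "<orders><order customer_id='3' amount='42.25'/></orders>"
RDBMS_ROWS = [(4, 310.0)]  # e.g. rows already fetched from an Oracle or MS SQL table


def load_all(conn):
    """Load records from every source into one common table; return the row count."""
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, source TEXT)")
    rows = []
    # Semi-structured JSON: each object becomes one row.
    for rec in json.loads(JSON_SOURCE):
        rows.append((rec["customer_id"], rec["amount"], "json"))
    # Semi-structured XML: attributes arrive as strings, so cast them.
    for node in ET.fromstring(XML_SOURCE).findall("order"):
        rows.append((int(node.get("customer_id")), float(node.get("amount")), "xml"))
    # Structured rows from a relational source need no parsing.
    for customer_id, amount in RDBMS_ROWS:
        rows.append((customer_id, amount, "rdbms"))
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_all(conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(loaded, total)
```

Keeping a `source` column is one possible design choice: it lets an analyst trace every row back to the system it came from.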

Data Engineers also create procedures to automate some of the manual database creation and maintenance tasks. In many companies, they are also expected to derive some summaries or insights from the data. They also need to use their skills to develop routine reports that can be generated on fixed schedules for different stakeholders.

Skills needed: So what are the things one needs to learn to perform the above-mentioned tasks?

  1. SQL and PL/SQL at an advanced level: use of window functions, efficient query building (memory and processing optimization), stored procedures, functions, triggers etc.
  2. Database architecture: knowledge of the architectures of different databases like Oracle, Microsoft SQL Server, DB2, Cassandra, MongoDB etc.
  3. Big Data: in many companies Big Data Engineers are hired separately, and in others Data Engineers work on Big Data as well.
  4. Programming languages: Python, VBA, and an object-oriented language like Java or C++.
  5. Unstructured data management: NLP, image processing etc.
  6. Basic statistics and some idea of Machine Learning.
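As an illustration of the window functions mentioned in the first skill, the sketch below computes a running total and a rank within each region. The `sales` table and its values are made up, and SQLite (through Python's `sqlite3` module) is used only because it needs no setup; it assumes the bundled SQLite is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", 1, 100.0), ("East", 2, 150.0), ("East", 3, 50.0),
        ("West", 1, 200.0), ("West", 2, 100.0),
    ],
)

# PARTITION BY restarts the calculation for each region;
# ORDER BY inside OVER defines the order used within that region.
QUERY = """
SELECT region, month, revenue,
       SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total,
       RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
FROM sales
ORDER BY region, month
"""
report = conn.execute(QUERY).fetchall()
for row in report:
    print(row)
```

Unlike a `GROUP BY`, the window version keeps every original row and simply adds the computed columns next to it, which is why it is so useful for routine reports.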

The Role of a Data Scientist :

Data Scientists are looked upon as the modern-day astrologers for businesses! They are expected to mine any data available in an organization, get valuable insights, validate different hypotheses for business strategy to determine which strategy will benefit the company more, build predictive models, make forecasts etc.

A Data Scientist is expected to be able to extract the required data from different databases using connectors through R, Python, Java or any other language he/she uses for work. They should be able to manipulate the data into a form that can be fed to different statistical or Machine Learning algorithms. The data prepared by Data Engineers also need to be transformed to create the different variables that a Data Scientist will use for building models.
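Here is a small sketch of that extract-and-transform step, under stated assumptions: an in-memory SQLite database plays the role of the source database, and the `transactions` table, the `build_features` helper, and the three derived variables are hypothetical choices for illustration only.

```python
import sqlite3
from collections import defaultdict
from statistics import mean

# Stand-in for a table that Data Engineers have already prepared.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 20.0), (1, 40.0), (2, 5.0), (2, 15.0), (2, 10.0)],
)


def build_features(conn):
    """Extract raw rows and derive one set of model variables per customer."""
    by_customer = defaultdict(list)
    query = "SELECT customer_id, amount FROM transactions"
    for customer_id, amount in conn.execute(query):
        by_customer[customer_id].append(amount)
    return {
        customer_id: {
            "n_txn": len(amounts),          # how often the customer transacts
            "avg_amount": mean(amounts),    # typical transaction size
            "max_amount": max(amounts),     # largest single transaction
        }
        for customer_id, amounts in by_customer.items()
    }


features = build_features(conn)
print(features)
```

The raw table has one row per transaction, while the model needs one row per customer; that change of grain is the kind of transformation described above.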

A Data Scientist needs to understand the business very well, to pin down the business problem for which the analysis or model is required, which data points should be used for it, how many years’ records should be collected etc. For this, they need to talk with business stakeholders, who may be from Marketing, Sales, Underwriting or any other department and who know the background of the problem very well.

They are expected to build predictive models and then keep improving how well those models predict. They also need to present the models and their outcomes well to the management.

Skills needed:

  1. Statistics and Machine Learning: they need to be really good at Statistics and should have the mathematical intuition to understand the data effectively and build predictive models (Linear and Logistic Regression, Decision Tree, Random Forest etc.) or forecasting models (ARIMA, ARIMAX, Prophet etc.). They need to understand each parameter of a model/algorithm and the pros and cons of each model. They also need to know some of the Deep Learning frameworks like TensorFlow, PyTorch, Caffe etc.
  2. Business understanding: as already mentioned, they must gain domain knowledge to understand what data they can use for their work, which models will benefit the business more etc.
  3. People skills: since a large part of a Data Scientist’s job involves understanding the business from different teams, they should have effective people skills.
  4. Knowledge of different tools and platforms: knowledge of multiple ML platforms and technology stacks is necessary to become a good Data Scientist. This involves different operating systems, different Big Data tools and ecosystems like Kafka, Spark, Pig, Hive etc., and cloud platforms like AWS, GCP, Azure etc.
  5. Database skills: they should be comfortable working with SQL.
  6. Programming skills: a Data Scientist should be very hands-on with one or more of the languages used for Machine Learning, e.g. R, Python, SAS, Julia, and should understand OOP.
  7. Handling semi-structured and unstructured data: since a lot of data in any company are semi-structured (JSON, XML etc.) or unstructured (images, PDFs, audio recordings etc.), a Data Scientist should be able to handle them as well. NLP, Computer Vision and OCR also need to be learned for relevant business use cases.
  8. Presentation skills: an important part of a Data Scientist’s job is presenting the output to the stakeholders and showing the management the benefit of using Data Science. So effective presentation skills are also required.

The Role of a Machine Learning Engineer :

Data Scientists build the models that best solve the business problem in terms of different metrics such as accuracy, precision etc. But in production the model may be required to return its prediction within a very stringent time limit, or it may need to be deployed in a specific tool that incorporates other rules defined by the business. Machine Learning Engineers are the computer/software engineers who help optimize the ML models for deployment in production, ensuring that the models can give predictions on terabytes of data or more.

ML engineers create APIs for the ML models. They also help create the data pipelines that transform the raw data into the form the model API accepts, so that the API can return the prediction.
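The shape of such an API can be sketched without any web framework. In the example below, `transform`, `predict` and `handle_request` are hypothetical names, and the weights are invented numbers standing in for a model a Data Scientist has trained; in a real deployment a framework such as Flask would wrap `handle_request` in an HTTP route.

```python
import json
import math

# Hypothetical coefficients; in practice these come from the trained model.
WEIGHTS = {"age": 0.02, "income_thousands": 0.01}
BIAS = -1.5


def transform(raw):
    """Pipeline step: turn a raw record into the numeric features the model expects."""
    return {
        "age": float(raw["age"]),
        "income_thousands": float(raw["income"]) / 1000.0,
    }


def predict(features):
    """Score with a logistic model: sigmoid of the weighted sum of features."""
    score = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))


def handle_request(body):
    """Take a JSON request body and return (status code, JSON response body)."""
    try:
        features = transform(json.loads(body))
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, missing fields, or non-numeric values.
        return 400, json.dumps({"error": f"bad input: {exc}"})
    return 200, json.dumps({"probability": round(predict(features), 4)})


status, payload = handle_request('{"age": "40", "income": 50000}')
print(status, payload)
```

Note that the caller sends raw values (age as a string, income in full currency units) and never needs to know how the model wants its inputs; hiding that detail is the job of the pipeline.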

ML engineers must have enough in-depth understanding and visibility to determine which of the multiple models a Data Scientist creates is the best one to put into production.
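One simple way to think about that choice is "the most accurate model that still meets the production time limit". The sketch below follows that rule; the candidate names, accuracy figures and latencies are made-up numbers, and `measure_latency_ms` and `pick_model` are hypothetical helpers, not part of any library.

```python
import time


def measure_latency_ms(predict_fn, sample, repeats=100):
    """Average wall-clock time of one prediction call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(sample)
    return (time.perf_counter() - start) * 1000.0 / repeats


def pick_model(candidates, latency_budget_ms):
    """Return the most accurate candidate within the latency budget, or None."""
    eligible = [c for c in candidates if c["latency_ms"] <= latency_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["accuracy"])


# Illustrative numbers only; real values would come from validation runs
# and from measure_latency_ms on production-like hardware.
candidates = [
    {"name": "deep_net", "accuracy": 0.93, "latency_ms": 120.0},
    {"name": "random_forest", "accuracy": 0.91, "latency_ms": 35.0},
    {"name": "logistic_regression", "accuracy": 0.88, "latency_ms": 2.0},
]

chosen = pick_model(candidates, latency_budget_ms=50.0)
print(chosen["name"])
```

With a 50 ms budget the most accurate model is ruled out and the second best is selected, which is the kind of trade-off an ML engineer has to explain back to the Data Scientist and the business.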

Skills needed:

  1. Machine Learning and Statistics: they need to understand each model very well, so both ML and statistics are a must for this role.
  2. Programming skills: they have to be very good programmers and should know multiple languages like Python, Java, C++ and R, along with the libraries in these languages that help with ML and with creating APIs.
  3. Different tools and platforms: ML engineers need to know all the tools and platforms a Data Scientist knows (as described in the Data Scientist section). In addition, they should learn tools like Flask, Docker, Kubernetes etc. that help deploy models in production on different platforms.
  4. Data handling: the same level of expertise in data handling as a Data Scientist, or more.
  5. Domain knowledge: domain knowledge helps an ML engineer understand which technique or approach works best in which situation.

I hope you now have a fair idea of what these different profiles or job titles actually mean. A job candidate or aspirant must understand that many job posters don’t understand these differences, and sometimes companies don’t have the budget to hire people for each of these specific profiles. So from the JD, it looks like they want everything in one person. In such cases, candidates must clarify the expected responsibilities and team structure during the interview, and accordingly decide which role it exactly is and whether they are fit for it. Likewise, hiring managers should not go only by resume headlines, but should understand what exactly the candidate’s skills are and where he/she fits in the team.

If a student is reading this, I want to assure you that it’s not necessary to know everything written here to get into a Data Science, Data Engineering or ML Engineering career. If your fundamentals are strong and you have an analytical mind that is open to learning new things, that is enough to make a start. Then, while working, you need to keep upskilling yourself, as we all do, and grow more and more in whatever career path you choose. Thank you for reading till this point!

Ambarish Chatterjee
Analytics Vidhya

A Lead Data Scientist and Senior AI Architect, who always derives joy in gaining and sharing knowledge to make an impact on people's lives and in workplaces.