In Data Science They Do This, Right?

Malvin Khoe
Published in tiket.com
6 min read · Oct 31, 2022

Data Science, Artificial Intelligence, and Machine Learning are terms that we hear quite often these days. “Data Scientist is The Hottest Job in the 21st Century”, but what does it really mean? Unfortunately, there are many misconceptions about what data science really is. A lot of people use the term “data scientist” very loosely. This happens not only at the personal level: some companies also use the term “data science” without knowing what it really is. The phenomenon is not limited to startups and small companies; it can be found in big companies as well.

As someone who was interested in data science long before I knew what it really was, I came across several misconceptions on my journey from computer science student to full-time data scientist at a big tech company in Indonesia. These are some of them.

MODELLING IS NOT THE ONLY PART OF THE JOB

“Data Science is all about creating the best machine learning model.”

This sentence is probably one of the biggest misconceptions about data science. While it is true to some extent, people often misunderstand what the “best model” means. The best machine learning model is not always the one with the most complicated or most advanced algorithm. Putting a deep learning model into a project that does not really need one (e.g. basic regression or classification) will not make it better; in fact, it often makes things worse. The best machine learning model varies from project to project, and there is no one-stop solution for all of them. There are times when deep learning is needed, and there are times when a simple classification algorithm will suffice.

Prepare the Data First

In order to create the best machine learning model, we need to focus more on the data that we are working with than on the model itself. This is something that people often overlook, even though it is the foundation of every machine learning model. So before we start creating the model, we first need to analyze the data and gain a certain degree of understanding of what kind of data we are dealing with. Several questions will come up, including:

  • Is it time series data?
  • What types of fields are we dealing with? Are they categorical, numerical, or a mixture of both?
  • What does each value represent? Is it the price of a product, or the number of products that have been sold?

After we understand the data, we need to explore it further before we can feed the data into our model.

The things that we need to explore further are the missing values and the distribution of the data. We can also look at the correlation between our features and the desired outcome. In some cases, we can create new features from the existing data, which is called feature engineering. After all of these steps in our analysis, we can create the final dataset that we will use to train and evaluate our model.
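To make these exploration steps concrete, here is a minimal sketch in plain Python. The rows, the field names (`price`, `units_sold`, `revenue`) and the helper functions are all hypothetical and only for illustration; in a real project we would more likely use a data analysis library. It counts missing values, measures the correlation between one feature and the outcome (treating `units_sold` as the outcome here), and derives a new feature.

```python
import math

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"price": 10.0, "units_sold": 3},
    {"price": 12.0, "units_sold": None},
    {"price": 8.0, "units_sold": 5},
    {"price": 15.0, "units_sold": 2},
]


def count_missing(rows):
    """Count None values per field across all rows."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def add_revenue(rows):
    """Feature engineering: revenue = price * units_sold (None if either is missing)."""
    engineered = []
    for row in rows:
        price, units = row["price"], row["units_sold"]
        revenue = None if price is None or units is None else price * units
        engineered.append({**row, "revenue": revenue})
    return engineered


missing = count_missing(rows)
# Drop incomplete rows before computing the correlation.
complete = [r for r in rows if None not in r.values()]
corr = pearson(
    [r["price"] for r in complete],
    [r["units_sold"] for r in complete],
)
engineered = add_revenue(rows)
print(missing, round(corr, 3), engineered[0])
```

Dropping incomplete rows is only one way to handle missing values; whether to drop or fill them depends on the data and the project.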

Know How to Evaluate The Model

Another important thing to look at is how we evaluate the model. This, arguably, is the most important part of creating a machine learning model. How can you know how good your machine learning model is if not from the evaluation?

There are many metrics to evaluate a model, so how do we choose which ones to use? By knowing what our goals are and choosing the evaluation metrics that reflect those goals, we get more meaningful experiments on our machine learning models. First, we have to clearly define the nature of the problem. For example, do we see it as a classification or a regression problem? If we want to create a model for classification, we can use accuracy. However, if we have an imbalanced dataset, we cannot simply choose accuracy, because it will not reflect the real performance of our model: a model that always predicts the majority class can score a high accuracy without ever detecting the minority class. Hence, we may choose to use precision and recall instead, depending on the business scenario. Knowing what kind of evaluation metrics to use will help us create a better model.
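The imbalanced case is easy to see with a small worked example. The sketch below uses made-up labels (5 positives out of 100) and a hypothetical "model" that always predicts the negative class; the function names are our own. Returning 0.0 when precision or recall is undefined is a convention chosen for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # Undefined when nothing is predicted positive; 0.0 by convention here.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    # Undefined when there are no actual positives; 0.0 by convention here.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up imbalanced data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A "model" that always predicts the majority (negative) class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))   # high, yet the model is useless
print(precision(y_true, y_pred))
print(recall(y_true, y_pred))
```

The accuracy looks good only because negatives dominate the dataset; recall shows that not a single positive case was found.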

As we can see, there are many things to do before we create a model and sometimes these things are much more important than the modelling itself.

DON’T FORGET YOUR OOP (Object Oriented Programming)

“Data Scientists don’t need OOP because they code in Python”

This misconception comes from the many data science courses out there, which often focus on teaching us to work only in Jupyter Notebooks. Because of the nature of Jupyter Notebooks and Python, we often forget to apply any OOP concepts.

So, where do data scientists create their models?

  • We still use Jupyter Notebooks, but only for data exploration and experimentation.

In order to deploy our model to production, we need to create an API, and a Jupyter Notebook is not a practical place to serve an API from. We need to move our modelling code from the notebook into an executable Python script. Since we are creating an API, we should apply OOP and clean code practices so that the code is easier to maintain and debug.
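As a rough sketch of what that move can look like, the modelling code becomes a class and the API logic becomes a second class that wraps it. Everything here is hypothetical: `ThresholdClassifier` is a toy stand-in for a real trained model, and `PredictionService` is a framework-agnostic handler that a web framework of your choice would call. It is not how any particular team structures its services.

```python
class ThresholdClassifier:
    """Toy stand-in for a trained model: flags values above a learned threshold."""

    def __init__(self):
        self.threshold = None

    def fit(self, values, labels):
        # Threshold = midpoint between the mean of each class.
        positives = [v for v, label in zip(values, labels) if label == 1]
        negatives = [v for v, label in zip(values, labels) if label == 0]
        mean_pos = sum(positives) / len(positives)
        mean_neg = sum(negatives) / len(negatives)
        self.threshold = (mean_pos + mean_neg) / 2
        return self

    def predict(self, values):
        if self.threshold is None:
            raise RuntimeError("Model must be fitted before calling predict().")
        return [1 if v > self.threshold else 0 for v in values]


class PredictionService:
    """Validates a request payload and delegates to the wrapped model."""

    def __init__(self, model):
        self.model = model

    def handle(self, payload):
        values = payload.get("values")
        if not isinstance(values, list) or not values:
            return {"status": "error", "message": "'values' must be a non-empty list"}
        return {"status": "ok", "predictions": self.model.predict(values)}


# Train once at start-up, then reuse the same service for every request.
model = ThresholdClassifier().fit([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
service = PredictionService(model)
print(service.handle({"values": [3.0, 7.0]}))
print(service.handle({}))
```

Keeping the model and the request handling in separate classes means each can be tested and replaced on its own, which is much harder when everything lives in notebook cells.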

So, don’t forget your Python OOP, because you are going to need it.

FOCUS ON THE BASICS

The last one is not a misconception about data science itself, but about how to get an entry-level data science job. Many fresh graduates believe that they can get an entry-level job because they have built a deep learning model or some other advanced model, but they often forget about the basics.

Please focus on the basics first, and master the fundamentals of data science. Don’t forget how to calculate precision and recall, build a confusion matrix, set up cross-validation, and understand simple machine learning algorithms like Linear Regression, Naive Bayes, etc. Being able to explain the pros and cons of these algorithms for a given real-world use case shows that you really understand them, both technically and theoretically. The creativity to connect these algorithms to potential computer science problems is also much appreciated in a data scientist. A data scientist has to know in detail not only the nature of the data, but also the models, and even how to do “story-telling” based on cross-validated experiments. A solid grasp of statistics and probability theory is the minimum requirement for entering the world of practical data science.

Be passionate and excited when dealing with data, and welcome the challenge when you encounter extremely difficult problems. And do not be afraid to ask your peers or leads any questions. Even as a competent data scientist, do not assume your question is a stupid question (read this article for your inspiration: https://towardsdatascience.com/why-you-should-ask-questions-as-a-data-scientist-46fb9f2bf893). Be hungry about consuming the data.
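Of the basics mentioned above, cross-validation is a good one to try writing by hand at least once. Below is a minimal sketch of k-fold cross-validation in plain Python. The function names and the toy targets are our own, the "model" is just a baseline that predicts the training mean, and the folds are not shuffled (in practice, data is usually shuffled first unless it is a time series).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is in exactly one test fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate_mean_baseline(targets, k):
    """Mean absolute error per fold for a baseline that predicts the training mean."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(targets), k):
        train_mean = sum(targets[i] for i in train_idx) / len(train_idx)
        mae = sum(abs(targets[i] - train_mean) for i in test_idx) / len(test_idx)
        scores.append(mae)
    return scores


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)

# Toy targets, made up for this example.
scores = cross_validate_mean_baseline([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3)
print(scores, sum(scores) / len(scores))
```

Reporting the score of every fold, and not just one lucky split, is what makes the story told from an experiment trustworthy.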

CONCLUSION

Data science is a relatively new field in computer science, especially in Indonesia, so there will be misconceptions here and there. I hope this article helps you identify these misconceptions and continue your journey toward becoming a great data scientist.
