6 key mistakes junior (and not so junior) Data Scientists make

Save time and effort by not reproducing those errors 😲

William Palmer
Margo Analytics
12 min read · Mar 8, 2022


What to avoid as a Data Scientist

When I started my Data Science journey, with my Machine Learning 101 course, my first Kaggle project, my first internship and my first job, I was a rookie: eager to learn, but also a little cocky, thinking I knew it all.

Today, it seems surreal to look back and realize how many mistakes I ended up making. As a small anecdote, I remember getting an R² score of 0.98 with a linear regression and thinking “Damn, Machine Learning is easy”, with no notion whatsoever of cross-validation, model generalization or feature selection.

“A smart person makes a mistake, learns from it, and never makes that mistake again.”– Roy H. Williams (1958-)

At the end of the day, making mistakes is an essential part of the journey of a Data Scientist! Because we solve complex problems with data, we are bound to make many of them. What is more worrying is not detecting a mistake, or not breaking the bad habits that can drastically hinder your growth and learning curve as a Data Scientist.

This is why I identified some of the key mistakes that I’ve seen fellow Data Scientists, candidates and myself make in the past, so that you can avoid them. We will also use a running case example along the way to illustrate.

#1 Starting on the wrong foot

I delivered a project that didn’t meet the client’s expectations because the scope wasn’t properly set on my end.

I thought my Deep Learning model for time series prediction was performing really well until I realized that using the previous value as a predictor gave me better results …

Subconsciously, when starting a project, we tend to be eager to hit the ground running and rush to get fast results. But time and again, those results are not what was expected, because a wrong turn was made from the start.

To avoid a false start, I highly encourage you to follow those 3 essential steps that will set you up for steady success:

Frame the problem

Take the time to define exactly the problem you are trying to tackle. Word it in simple terms, in plain English, and double-check that objective with the stakeholders of the project.

Don’t hesitate to ask questions and challenge the assumptions made subconsciously by the client, a colleague or yourself. By taking your time at the beginning, you make sure you are heading in the right direction with no potholes in sight.

Let’s use a real example that I will refer to throughout the article. One of my recent projects for a client was to build a time series forecasting pipeline for 12 weekly periods (t+1, t+2…). After a few framing meetings, we realized that there was actually no need for weekly predictions, since only monthly predictions would be used for logistical purposes afterwards. Hence, we jointly decided that monthly predictions would be optimal. This may seem like an easy thing to discover, but it was not: it was the culmination of many discussions to understand the business value of the project.

Get the right metric

The objective set beforehand should dictate which performance metric you choose.

90% of the time, the metric you use to quantify performance will be R² for regression and accuracy for classification. Those metrics will suffice more often than not, but when they don’t match the problem at hand, the whole project is at risk.

Let me give you simple case scenarios to show you the importance of choosing the right metric:

  • Inhomogeneous data or potential outliers in your data set? A good R² score is no feat, because your model may be focusing on predicting those extreme values. Use MAE or a Huber loss instead.
  • An imbalanced data set? A 90% accuracy is no achievement if 90% of your data points have the same label. Use Precision, Recall or the F1-score instead (or re-balance your data set).
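To make the imbalance scenario concrete, here is a minimal sketch in pure Python, with made-up labels, of a majority-class “model” on a 90/10 data set:

```python
# 90 negatives, 10 positives -- and a "model" that always predicts 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

# Accuracy: fraction of correct predictions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion-matrix counts for the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy)    # 0.9 -- looks great on paper
print(recall, f1)  # 0.0 0.0 -- the model never finds a single positive
```

Accuracy alone would have sold this useless model; recall and F1 expose it immediately.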

If those scenarios seem too obvious to you, let’s get back to our example. In this specific case of sales forecasting, the MSE was not adequate because the time series were not comparable, neither in the scale of the target nor in behavior. Thus, we decided to “explode” our dataset into smaller data sets with common characteristics and inspect the MSE for each of them. Also, during the framing meetings, we understood that confidence in our predictions from month to month was extremely important. Hence, we tracked the month-to-month variance of our MSE to make sure our error was constant over time: too much inter-month variance would result in the model not being accepted by the users.
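The per-segment idea can be sketched in a few lines. The segment names, records and per-month MSE logic below are invented for illustration, not the client’s actual pipeline:

```python
from statistics import mean, pvariance

# (segment, month, actual, predicted) -- made-up records for illustration.
records = [
    ("fast_movers", 1, 100.0, 95.0), ("fast_movers", 2, 110.0, 108.0),
    ("fast_movers", 3, 90.0, 96.0),
    ("intermittent", 1, 0.0, 3.0), ("intermittent", 2, 12.0, 5.0),
    ("intermittent", 3, 0.0, 1.0),
]

def mse(rows):
    """Mean squared error over (segment, month, actual, predicted) rows."""
    return mean((a - p) ** 2 for _, _, a, p in rows)

# One MSE per segment, plus its variance across months: a stable error
# from month to month is what builds user trust in the forecasts.
for seg in sorted({s for s, _, _, _ in records}):
    rows = [r for r in records if r[0] == seg]
    months = sorted({r[1] for r in rows})
    monthly = [mse([r for r in rows if r[1] == m]) for m in months]
    print(seg, round(mse(rows), 1), "monthly variance:", round(pvariance(monthly), 1))
```

One global MSE would have hidden the fact that the intermittent series behave completely differently from the fast movers.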

Start with a Baseline

We’ve all been there: thinking we had a great model, which turned out not to be the case.

Baseline = Benchmark

This is so simple it may seem anecdotal to many, but it is easily overlooked: every time you start a project, you should start with a baseline. There are two ways to do that:

  • If there is a preexisting model (coded or subjective), recompute its metrics. This will also serve as proof of the superiority of your model when convincing the different stakeholders.
  • If not, find a really simple/stupid model, e.g. moving averages for time series prediction, or always predicting the majority class for classification.

This has to become a no-brainer, so that you always have a point of comparison for all future models and can quickly recover from dead ends when performance is not up to par.

For our particular example, we simply used the 12-month lag values as a first baseline. Really simple, right? That’s the point 😅
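Such a seasonal-naive baseline takes only a few lines. The sales series below is made up for illustration (a linear trend plus a 12-month seasonal pattern):

```python
# Three years of made-up monthly sales: trend + 12-month seasonality.
sales = [100 + 10 * (m % 12) + m for m in range(36)]

# "Stupid" baseline: predict each month with the value 12 months earlier.
forecast = [sales[m - 12] for m in range(12, 36)]
actual = sales[12:]

# Score the baseline with MAE over the last two years.
mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
print(mae)  # 12.0 -- any real model now has a concrete number to beat
```

On this toy series the lag baseline misses only by the trend (12 units per year), which is exactly the kind of bar a fancier model has to clear.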

#2 No insightful EDA

I realized, a week in, that my data had many anomalies, which is why I couldn’t get my model to converge.

More often than not, junior Data Scientists find Exploratory Data Analysis (EDA) time-consuming and inefficient. In my opinion this is far from the truth, as EDA lets you:

  • Gain knowledge of the predictors: understand what types of features you have, whether they are correlated with each other or with the target, and what distributions your features follow
  • Gain quality: detect errors in the data, such as outliers, duplicates or missing values
  • Deconstruct your preconceived notions: visualizing the relationships between your features can validate or contradict your suppositions or biases about the problem

This last point is, to me, extremely important. EDA is the perfect opportunity to analyse the problem and to let unseen issues surface for the different stakeholders.

To reinforce that last point: in our time series example, it was while doing an in-depth EDA that we realized we had drastically different behaviors in our data set. Some time series were smooth, while others showed signs of intermittent demand (with months at 0). Detecting this led us to question the stakeholders about the cause of those different patterns and, most importantly, guided us in the resolution of the problem.

Researching this subject, I discovered a great library, pandas-profiling, that generates profile reports with descriptive statistics, correlation plots, missing values… This is a great starting point for EDA, so no excuses now!
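Even before reaching for a profiling library, the core checks (missing values, duplicates, correlation with the target) can be sketched by hand. The tiny price/quantity data set below is invented for illustration:

```python
from statistics import mean, pstdev

# Made-up rows: a feature (price) and a target (qty), with planted issues.
rows = [
    {"price": 10.0, "qty": 100.0},
    {"price": 12.0, "qty": 90.0},
    {"price": 15.0, "qty": 70.0},
    {"price": 15.0, "qty": 70.0},   # exact duplicate
    {"price": None, "qty": 60.0},   # missing value
]

# Count rows with at least one missing field.
missing = sum(1 for r in rows if any(v is None for v in r.values()))

# Count exact duplicate rows.
seen, duplicates = set(), 0
for r in rows:
    key = tuple(r.items())
    duplicates += key in seen
    seen.add(key)

# Pearson correlation between price and qty, on complete rows only.
pairs = [(r["price"], r["qty"]) for r in rows if r["price"] is not None]
xs, ys = zip(*pairs)
mx, my = mean(xs), mean(ys)
corr = mean((x - mx) * (y - my) for x, y in pairs) / (pstdev(xs) * pstdev(ys))

print(missing, duplicates, round(corr, 2))
```

Ten minutes of this kind of inspection is usually what surfaces the anomalies that otherwise cost you a week of debugging a non-converging model.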

#3 Failing to manage your time

I feel overloaded with so many tasks that I said I would do, but I can’t find the time to do them.

Short iterations vs R&D

Data science can be tiresome at times: so many features added at the last minute, so many requests and reports outside the ones that were already planned…

Here are the ways I found most effective to manage my time:

  • Prioritization: every new project, feature or request needs to be evaluated in terms of priority and ranked against the others. How? Without a proper evaluation metric, prioritization is a hard task. One simple framework is the Action Priority matrix: “Quick wins” go at the top of your to-do list, along with “Major projects”. I encourage you to read a thorough explanation of the matrix here.
  • Focus on one task at a time: it has been shown that multitasking hinders performance. We simply can’t do two things at once; we actually switch our attention back and forth between the two tasks. So stop multitasking :) One way to do that is to divide your day into blocks (for example, mornings for meetings, afternoons for your main project).
  • Get organized at the beginning: when starting a new project, you need to voice important questions such as: What is the business impact? What is the created value? But the question of how to manage tasks and responsibilities is equally important. One simple and effective way to handle this is via Trello.

#4 Not setting strong foundations

Using a decision tree, my R² on both my train and test sets was 0.4. I then switched to a Random Forest and didn’t understand why my scores didn’t improve.

I’ve used k-means clustering on the results of UMAP and I can’t figure out why the results are so strange.

The loss of my neural network stagnates after two epochs, failing to converge.

Examples are galore. This is, to me, a huge mistake that many Data Scientists make: falling into the trap of using algorithms we know nothing about. We’ve all done it, and we still do it from time to time: finding a new paper or library on the web, going to the documentation page and copy/pasting the example notebook onto our dataset, failing at the first attempt 80% of the time.

Without strong foundations, you won’t go far

Machine Learning has become immensely popular, attracting many professionals from technical and non-technical backgrounds. I can’t stress this enough for newcomers to the field: setting strong mathematical foundations before the practical use of machine learning models is a must. Otherwise, being a Data Scientist will simply not be fun: you will randomly try different models, different parameters, different losses, and most importantly get stuck on issues with no notion of how to solve them.

To help you, here’s a list of concepts you need to be familiar with as a Data Scientist:

  • Statistics: correlation vs causality, statistical independence, p-values, t-test, bootstrap, oversampling.
  • Probabilities: log loss, entropy, information gain, different types of distributions (Gaussian, Binomial, Beta, Poisson…), maximum likelihood, Bayes’ theorem.
  • Other: cross-validation, bias-variance trade-off, data leakage, stationarity of a time series, discretization, gradient descent and stochastic gradient descent.

Read about those concepts and be sure to understand them fully. Once you do, you will be able to:

  • have a thorough understanding of the popular machine learning models, and at least an intuition for the more advanced papers. For example, you will fully understand how a decision tree works having learned about information gain and entropy, and you will get an idea of why LightGBM is so fast now that you know about discretization and sampling.
  • save time and effort by detecting errors in advance and finding fixes
  • analyze the weaknesses of your current model and, even more importantly, know which direction to take to improve it. This is a skill that, in my mind, separates the good from the great Data Scientists.
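To make one of those concepts concrete, here is entropy and information gain, the two quantities behind decision-tree splits, computed by hand on a tiny invented label set:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

# A parent node with 3 positives and 5 negatives, and a candidate split.
parent = [1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 0]

# Information gain = parent entropy minus the weighted child entropies.
gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
print(round(entropy(parent), 3), round(gain, 3))  # 0.954 0.549
```

A decision tree simply picks, at each node, the split with the highest such gain; once you can compute this by hand, tree-based models stop being black boxes.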

#5 No Feature Engineering

I lost so much time trying to improve my model instead of improving my feature engineering. The features I had, in their initial representation, were not helpful.

This mistake is closely related to the previous one. Complex problems need tailored solutions. There are mainly two ways to get unstuck in Machine Learning: transforming your data or transforming your model. We talked above about the right way to test models, but sometimes, no matter how many parameters you grid-search or how many models you try, you simply cannot find an acceptable solution without proper feature engineering. The data at hand needs to be manipulated in order to extract the most insightful information (for example, non-linear relationships or recurring patterns).

Going back to our case study: no matter how many models we tried and how many combinations of hyper-parameters we tested, we could not improve the model. We had hit a plateau. Hence, we turned to feature engineering and changed the way we represented our time series. Without going into too much detail, extracting more useful information from that representation was a game changer, and we improved our performance by 10%.

There are many resources online that can teach you how to do feature engineering. FeatureTools can be an interesting library to speed up that process, but it needs to be used carefully. Why is that, you may ask? As you know by now, fully automating feature engineering is risky: it should be specific to your project. I would rather you really think about how to transform your data in ways that help find a solution. Many ideas will arise when you try to answer questions such as: How do I handle categorical variables? Are there groupings of the data that could be interesting for my problem? Do I need feature augmentation or feature reduction? Should I apply a transformation (e.g. a log transform) to my features or my target?
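As a small illustration of two of those ideas, lag features and a log transform of the target, here is a sketch on a made-up spiky sales series. The helper `make_features` is mine, not from any library:

```python
from math import log1p, expm1

# Spiky, made-up sales data: a few large values skew the target.
sales = [3.0, 5.0, 4.0, 120.0, 6.0, 7.0, 5.0, 150.0]

def make_features(series, lags=(1, 2)):
    """Turn a raw series into (features, target) rows usable by any model."""
    rows = []
    for t in range(max(lags), len(series)):
        feats = [series[t - k] for k in lags]   # lagged values
        feats.append(sum(series[:t]) / t)       # expanding mean so far
        rows.append((feats, log1p(series[t])))  # log1p compresses the spikes
    return rows

rows = make_features(sales)
print(len(rows), rows[0])
# A prediction made in log space is mapped back with expm1(pred).
```

The model never changes here; only the representation does, which is exactly the lever that got our case-study pipeline off its plateau.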

#6 Not being curious

I still use the same models that I used last year.

Data Science is an ever-changing discipline where the best models of today will be almost archaic in 3–4 years. You need to stay curious and passionate about Data Science so that you keep learning new things day after day. Losing this eagerness to learn is a trap because, as in other fields, the objective is to keep learning to be the best in your domain.

To do so, here are 4 tips:

  • Keep asking dumb questions: the simplest questions can lead to the most innovative answers to different problems: What if? What if we did that? Or this?

“He who can no longer pause to wonder and stand rapt in awe, is as good as dead; his eyes are closed.” – Albert Einstein

  • Keep a Data Science journal: keep a record of your best practices, your tips, the problems you faced and how you resolved them. That way, you can draw on your own experience when faced with a similar problem 2–3 years from now and avoid the mistakes you wrote down. Also, keep it separate from your personal diary if you have one: let’s not mix reflections on model architecture and love letters, please 😝
  • Keep going on Kaggle: Kaggle is a great place to learn the current techniques of the best Data Scientists in different fields. Either enter competitions or read the discussions and the models used to solve the challenges.
  • Keep reading the newest papers: be a proactive reader and read about any particular field of machine learning. It needs to become a habit. If you don’t have this reflex yet, I highly encourage you to book 30–60 minute time slots dedicated to reading ML blogs, posts and research papers. It could be on the train to work, or after lunch while you digest :) Here are my go-to resources:

For various ML: Journal of Machine Learning Research; 2021bestpapers

For Deep Learning: DeepLearningResearch; RoadmapDeepLearning

For techniques: TowardsDataScience Newsletter; The Batch Newsletter

I hope I shed some light, in some way, on a few mistakes that you have been unconsciously making or came close to making. The idea of this post was to share the mistakes that so often prompt failures and hinder our growth as Data Scientists.

To cap things off, I want you to remember (at least) these 3 golden rules:

  1. Your first model should be a stupid baseline
  2. Never apply a model or concept without being able to explain it to a fellow colleague
  3. Keep a Data Science journal with you at all times :)

Feel free to tell me which mistake I missed in the comment section and get in touch via LinkedIn!
