5 mistakes you can avoid in your first steps as a Data Scientist

Published in Predict, Sep 27, 2018 (4 min read)

Recently, driven largely by advances in AI, Data Science has grown into a very promising career option. It is well paid, and the assignments are almost always fascinating. What should newcomers do to succeed in this field? There are a few things in this job to pay attention to and some common mistakes to avoid.

1. Do not study without practice

Many people who start their career in this field make the same mistake: they take a lot of online courses and learn many concepts but never try to put them into practice. Understanding only part of the material is not enough for this job. When you learn an algorithm, try to find out its pros and cons, its limitations, and how it works in real applications. There is a tricky part: when you learn advanced libraries such as R’s ggplot2, for example, you rarely understand what is going on behind the scenes. It is better to apply what you have learned in an experiment and get a deeper understanding of the process.
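As one possible exercise of this kind, here is a minimal sketch of fitting a straight line by ordinary least squares without any library. The function name `fit_line` and the toy data are assumptions made for illustration; the point is only to see what a library call would normally hide.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the points."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means all x values are identical.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, slope is undefined")
    # Co-variation of x and y around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data that lies exactly on y = 2x + 1 (made up for this example).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Changing the toy data (adding noise, an outlier, or identical x values) is a quick way to see the limitations of the method first-hand.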

But make sure you carry on studying even after you get a job. Your learning should be continuous and professional; it is very important to keep a finger on the pulse of changes. Do not be afraid of difficult topics, and do not give up midway. You can always ask more experienced Data Scientists for help. At Ralabs, for example, we have a mentor system to support newcomers. You can also discuss your questions on online forums such as Stack Exchange, Stack Overflow, or even GitHub.

2. Learn math

Algebra, Statistics, Probability, and Calculus: you need these four subjects to dive into the deeper areas of Data Science. It is a big mistake to code algorithms from scratch without learning the prerequisites. A lack of this knowledge will lead to practical problems. While you are making your first steps in Data Science, you do not really need to create every algorithm from scratch. But if you have to build a totally new algorithm, focus on learning the prerequisites first.

As you go deeper into Data Science, make sure you fill the gaps in your knowledge of the basic mathematical concepts. If you are not confident in your skills, refresh this material. Online courses can help, for example Introduction to Data or Statistical Thinking for Data Science.

3. Validate and re-validate your models

If you think you have built a perfect machine learning model, that is the first sign you need to check it again. Even if the predictive power of your model is very high, you are only halfway to success. Your next task is to make sure its performance metrics do not degrade. Does the model fit the observed data perfectly? Great, but you still need to re-validate it at set intervals. Modeled relationships may change continuously, and the predictive power of a model can collapse as a result. This problem is easy to avoid: check the model against fresh data regularly, as often as the relationships in the model tend to change.

The predictive power of models is influenced by many factors, and in some situations Data Scientists have to rebuild their models. Still, do not panic: our main goal is not the model itself but its results, which must not drop below an acceptable level. It is good practice to build a few models and examine the distributions of the variables.
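Re-validation at set intervals can be sketched roughly as follows. This is a minimal illustration, not a prescribed workflow: the function names, the monthly batches, the labels, and the 0.8 threshold are all assumptions invented for the example, and accuracy stands in for whatever metric your project actually tracks.

```python
from typing import List, Sequence, Tuple

# One batch = (period label, observed outcomes, model predictions).
Batch = Tuple[str, Sequence[int], Sequence[int]]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that match the observed labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def periods_needing_review(
    batches: Sequence[Batch], min_accuracy: float = 0.8
) -> List[Tuple[str, float]]:
    """Return the periods whose accuracy fell below the acceptable level."""
    flagged = []
    for period, y_true, y_pred in batches:
        score = accuracy(y_true, y_pred)
        if score < min_accuracy:
            flagged.append((period, score))
    return flagged


# Made-up monthly batches: the same model scored on fresh data each month.
batches = [
    ("2018-07", [1, 0, 1, 1, 0], [1, 0, 1, 1, 0]),
    ("2018-08", [1, 0, 1, 1, 0], [1, 0, 1, 0, 0]),
    ("2018-09", [1, 0, 1, 1, 0], [0, 0, 1, 0, 1]),
]
flagged = periods_needing_review(batches, min_accuracy=0.8)
for period, score in flagged:
    print(f"{period}: accuracy {score:.2f} is below the acceptable level")
```

A flagged period is a prompt to investigate what changed in the data, and possibly to retrain or rebuild the model, rather than proof that the model is broken.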

4. Watch the difference between correlation and causation

Even some experienced Data Scientists make this mistake: they misunderstand the difference between correlation and causation. Correlation is when two factors are observed together, while causation is when the first factor leads to the second. Data Scientists often ignore this difference, which leads to huge mistakes. For example, a similar gap in an analysis in Illinois, US, led the authorities to send books to every student in the state. The statistics showed that having books at home had a strong influence on grades. But further research found that homes where parents are interested in books and buy them from time to time are simply a better learning environment.

As we can see, correlation does not necessarily imply causation. Big Data is often used to explain the correlation between variables. But in practice, if two things are somehow related to each other, it does not mean that one causes the other. So if you make a decision based on correlation without understanding the cause, be ready to get faulty results.
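A small simulation can make this concrete. The sketch below uses purely synthetic numbers, not data from the Illinois case, and assumes a deliberately simplified world: a hidden factor (parental interest in reading) drives both the number of books at home and the grades, while books themselves have no effect on grades. All variable names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Hidden common cause (assumed): how much the parents care about reading.
parental_interest = [rng.gauss(0, 1) for _ in range(n)]

# Observed world: interest drives both books at home and grades.
books_observed = [p + rng.gauss(0, 1) for p in parental_interest]
grades = [p + rng.gauss(0, 1) for p in parental_interest]

# Intervention: books are handed out regardless of parental interest.
books_assigned = [rng.gauss(0, 1) for _ in range(n)]

corr_observed = pearson(books_observed, grades)
corr_assigned = pearson(books_assigned, grades)
print(f"books vs grades, observed:   {corr_observed:.2f}")
print(f"books vs grades, handed out: {corr_assigned:.2f}")
```

In this toy setup the observed correlation is clearly positive, yet it largely disappears once books are distributed independently of the hidden factor, which is the pattern you would expect when the correlation comes from a common cause rather than from the books.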

5. Formulate clear questions

The basic scientific standard is to formulate a clear question and design experiments around that question. Without the right question, you can’t collect the right datasets. Data Science requires structure and well-defined questions, too. It is quite a common mistake to focus entirely on the data without understanding the question that the analysis needs to answer.

A huge number of Data Science projects answer “what” questions, which yields just numbers without explanations. This happens when scientists lose sight of their main goal. Our task is to answer “why” questions, to understand something that was not clear before. Also, do not forget your question when you choose visualization techniques to present the results. Sometimes this choice is guided by aesthetic taste instead of the characteristics of the dataset. So a well-defined goal for your model is a big part of success.

Do you want to join a groundbreaking Data Science project? Go Ralabs!