4 Life Saving Tips for Data Science Beginners

Yousuf Mehmood
The Modern Scientist
5 min read · May 27, 2022
Photo by Glenn Carstens-Peters on Unsplash

Let me preface this by saying that the only things you need to learn data science are a basic understanding of mathematics, statistics, and logic.

That’s it.

If you have that down, you can use any of the plethora of data science tools out there to achieve your goals.

However, there are so many tools, languages, libraries, and methods out there that it’s easy to become overwhelmed.

Whether you’re taking a university or college course in data science, pursuing a full degree, or just picking up skills through Coursera or Udemy, you will need to focus on certain essentials.

So, here are 4 Life-Saving Tips for Data Science Beginners.

1. Understanding the Data is Priority #1

Stop me if you’ve heard this one before: “Data is the New Oil.”

It means data is being used everywhere, just like oil was in the 20th century. As a data scientist, you’ll probably end up working on data from all sorts of industries and fields.

Without understanding what that data means and how to approach it, all the data science methods will be useless.

Having a basic understanding of the data features and how they connect to the end goal will help you create a pipeline for data wrangling, analysis, and prediction.

So, ask as many questions as you need to understand when you’re handed a project. Don’t leave ambiguities hanging until the deadline. One answer to the right question could make or break your model.

2. Basic EDA Will Save You a Lot of Trouble

Exploratory Data Analysis (EDA) is a primary step in data analysis. Using EDA will help you build a roadmap to your goal.

It comprises visualization techniques, data exploration and description, and drawing connections between data features.

These techniques include:

- Missing Value Analysis (MVA)

This technique will help you figure out which parts of your data are missing. More than that, it will tell you if there is a pattern in their absence.

Figure: a basic missing-value matrix showing gaps in the data. Missing-value analysis can identify which features may be closely related, and thus usable to fill gaps.

Perhaps you’ll notice synchronous gaps in data, or perhaps you’ll notice data features you don’t need at all.

This basic insight into the data will help you decide on how to proceed.

Maybe you’ll decide to compensate for the absent data using any number of techniques, like interpolation or imputation. Or maybe you’ll decide to trim your data through outlier analysis.

You may also choose to extract the missing data from other sources to complete your data set.

This will prepare your data for analysis down the line.
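
As a minimal sketch, here’s what that first missingness check might look like with pandas and the missingno library (the file name and column names here are hypothetical):

```python
import pandas as pd
import missingno as msno  # third-party library for visualizing missingness

# Hypothetical dataset; swap in your own file.
df = pd.read_csv("housing.csv")

# Count missing values per column.
print(df.isna().sum())

# Matrix plot: gaps show up as white bands, so synchronous
# gaps across several columns are easy to spot.
msno.matrix(df)

# Nullity correlation: values near +1 or -1 flag features whose
# gaps appear (or disappear) together -- candidates for filling.
msno.heatmap(df)

# Two simple ways to compensate for gaps (hypothetical columns):
df["price"] = df["price"].interpolate()
df["rooms"] = df["rooms"].fillna(df["rooms"].median())
```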

- Clustering

There are so many clustering methods out there that a simple list could make your head spin.

Figure: agglomerative clustering, which hierarchically divides a data set into clusters, showing a hierarchy that depends on the size of the inherent classes within the data.

K-Means Clustering, Affinity Propagation, DBSCAN, Hierarchical Clustering, Spectral Clustering, and more all serve one purpose: to group your data into rough stacks or piles.

Figure: DBSCAN producing a suboptimal clustering of a data set (too many clusters).

With a basic clustering algorithm, you can find out how many classes or groups your data is divided into.

This is useful for a number of reasons. Say you’re trying to find an ideal demographic to price a product. Clustering methods will identify broad groups for you to break down in your analysis, all without any classification method.

This is a very useful method that will help you build a roadmap towards your goal.
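
As an illustrative sketch, here’s how a basic K-Means run with scikit-learn might look for that pricing example (the toy demographic data is made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical demographic features: [age, annual spend].
X = np.array([[23, 1200], [25, 1400], [41, 5200],
              [43, 4900], [62, 800], [65, 950]], dtype=float)

# Scale the features so neither one dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Ask K-Means for three rough groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print(labels)  # e.g. [0 0 1 1 2 2] -- three broad demographic groups
```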

- Heatmaps

If you’re looking to find important links between different features in your data set, look no further than heatmaps.

Figure: a basic heatmap showing close relationships (positive or negative) between data set features.

They’ll give you a clearer picture of the connections between the various features in your data than any other diagram.
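
A minimal sketch of a correlation heatmap with seaborn (the data file here is hypothetical):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical numeric data set.
df = pd.read_csv("housing.csv")

# Pairwise correlations between the numeric features.
corr = df.corr(numeric_only=True)

# annot=True prints each coefficient in its cell; values near
# +1 or -1 flag strongly related feature pairs.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```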

3. Always Save Your Models

Saving your models will save you from the headache of repeatedly training your models whenever you’re given new data.

Saved models are also a great boon if your machine ever crashes. You don’t have to go through the agony of waiting for them to train over and over again; you can simply load them from your directory.

I made the rookie mistake of not saving my models when I did my first machine learning project. I cycled through at least 20 of them with varying parameters, and had to retrain them if my kernel ever crashed.

Three great Python tools for saving and loading machine learning/deep learning models are pickle, joblib, and the model.save method built into Keras/TensorFlow.

Figure: a sample model.save command used to save a basic bidirectional LSTM model. Saving your machine learning models can save time when you’re trying out different methods.
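
As a quick sketch, here’s roughly how each of these looks in practice (the toy model and file names are just for illustration):

```python
import pickle
import joblib
from sklearn.linear_model import LogisticRegression

# A tiny toy model so the sketch is self-contained.
X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Option 1 -- pickle, Python's built-in serializer.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

# Option 2 -- joblib, efficient for models holding large NumPy arrays.
joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")

# Option 3 -- for Keras/TensorFlow models (like a bidirectional LSTM),
# model.save stores the architecture, weights, and optimizer state:
#   keras_model.save("lstm_model.keras")
#   restored = tf.keras.models.load_model("lstm_model.keras")
```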

4. Don’t Be Afraid to Experiment With Different Libraries

Different tools are good for different applications. You don’t have to stick to what you’re taught for every data problem.

That’s why it’s important to experiment with as many Python libraries as you can.

To give you an example: my preferred library for performing deep learning tasks is TensorFlow. However, there are several other libraries out there for deep learning, including PyTorch, Caffe, and H2O.

Another example I can quote from experience is using the scikit-learn (sklearn) library to train machine learning models. It’s a very comprehensive library for carrying out all sorts of data analysis tasks, including feature selection, unsupervised learning, and supervised learning.

However, when I discovered PyCaret, there was just no going back.

Figure: PyCaret running through several machine learning methods to pick out the best model for a data set.

PyCaret cycles through all the basic supervised machine learning methods using whatever parameters you set with just one line of code. It also picks out the best model generated based on your preferred metric and allows you to tune it.

Figure: PyCaret’s feature importance plot, highlighting the most important features in the model.
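
As a rough sketch of that workflow using PyCaret’s classification module (the data file and target column here are hypothetical):

```python
import pandas as pd
from pycaret.classification import setup, compare_models, tune_model, plot_model

# Hypothetical labeled data set with a binary "churned" target column.
df = pd.read_csv("customers.csv")

# One setup call handles the preprocessing: encoding, train/test split, etc.
setup(data=df, target="churned", session_id=42)

# Train and cross-validate the standard classifiers,
# returning the best one by the chosen metric.
best = compare_models(sort="AUC")

# Optionally tune the winning model's hyperparameters.
tuned = tune_model(best)

# Feature importance plot (for estimators that support it).
plot_model(tuned, plot="feature")
```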

So, let that be a lesson to budding data scientists out there. Don’t be afraid to experiment with new libraries. It may seem daunting, but you may end up learning a new method that is more optimal for your needs.

There is a lot more to learn in your data science career than these tips, of course, but they’ll make it just a little easier to solve even the most gargantuan problems you’ll face.
