The Life Cycle of A Data Science Project

Nirmalya Misra
Machine Learning India
Jul 10, 2021

Projects are always a great way of learning in any field. This article will ease you through this process by presenting the steps involved in a Data Science project. If you’re new to Data Science, you can use this article as a roadmap when working on projects. Data Science is a discipline in which many micro-decisions must be made. These small decisions have a big impact on the consistency and quality of the finished product.

SETTING THE OBJECTIVE

Before you begin with any Data Science project or any project in general, you must know why it is being built. What is the problem it will solve? Are there existing solutions to the problem?

For a Data Science project, you might be asked to do one or both of these tasks:

  1. Predict something (a value or a category).
  2. Explore the dataset and find the reasons behind something.

One needs to be very clear about the task at hand. Being unclear about it can lead to wasted effort and a solution that answers the wrong question. Once you answer these questions, you are ready to move on to the next step.

ACQUIRING DATA

Data is the most important cog of your Data Science Project wheel. The quality of whatever you are aiming to build depends on the data you have. So, once you get the raw dataset, you have to explore it and get a top-level view of it. If it is a prediction task, then what is to be predicted? Is it a continuous variable (a regression problem) or a categorical variable (a classification problem)?

After answering these questions, if it is a classification problem, we have to make sure that there are approximately equal amounts of data for each class.

Example: Say we are making a cat-dog classifier with 100 images in the dataset. There are 80 cat images and 20 dog images. If we use this dataset, then the model will obviously be biased towards cats.

In such cases, we need to balance out the dataset. Imbalanced datasets can lead to very biased models, which is why they should be dealt with first.
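
As a minimal sketch, here is one common fix for the cat-dog example above: random oversampling of the minority class with pandas. The `label` column name is purely illustrative, and oversampling is only one of several balancing strategies:

```python
import pandas as pd

# Hypothetical dataset mirroring the example: 80 cats, 20 dogs
df = pd.DataFrame({"label": ["cat"] * 80 + ["dog"] * 20})

# Inspect the class distribution
print(df["label"].value_counts())  # cat: 80, dog: 20

# Naive fix: randomly oversample the minority class until the counts match
majority = df[df["label"] == "cat"]
minority = df[df["label"] == "dog"]
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle the balanced dataset
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced["label"].value_counts())  # cat: 80, dog: 80
```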

DATA CLEANING

With a balanced dataset at hand, we need to clean the data. Cleaning the data is important because most Machine Learning models cannot handle missing values, text categories, or wildly different feature scales out of the box.

What does Data Cleaning include?

  1. Handling missing values or NaN values
  2. Converting categories into numbers
  3. Dealing with outliers in the dataset
  4. Scaling the features in the dataset

Well, Data Cleaning is a vast subject, and it can include many more tasks depending on the dataset; these are just the most common ones. The sketch below shows what the four steps can look like in practice.
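
Here is a minimal sketch of those four steps using pandas and Scikit-Learn. The column names and the specific choices (median imputation, one-hot encoding, percentile clipping, standard scaling) are illustrative assumptions, not the only options:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a missing value, a text category, and an outlier
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
    "income": [40_000, 52_000, 47_000, 1_000_000, 45_000],
})

# 1. Handle missing values: fill numeric NaNs with the column median
df["age"] = df["age"].fillna(df["age"].median())

# 2. Convert categories into numbers: one-hot encode the "city" column
df = pd.get_dummies(df, columns=["city"])

# 3. Deal with outliers: clip income to the 1st-99th percentile range
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

# 4. Scale the features so they share a comparable range
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])

print(df)
```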

EXPLORATORY DATA ANALYSIS

This is the final step for Data Analysis tasks. In this step, data scientists use numbers, tables, and visualizations to uncover relationships between variables. This is done for a few reasons:

  1. To find the cause of something happening
  2. To find features that affect each other the most
  3. To create new features which affect the target variable even more than the existing features

Exploratory Data Analysis helps Data Scientists understand which features they should consider for modeling. Although it can be a long, open-ended process, the core idea is simple: examine the data from many angles before committing to a model.
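
As a small illustration, a correlation matrix is one of the quickest EDA tools for the three goals above. The toy housing columns here are hypothetical:

```python
import pandas as pd

# Hypothetical dataset: which features move together with the target?
df = pd.DataFrame({
    "rooms": [2, 3, 3, 4, 5, 5],
    "area":  [60, 85, 90, 120, 150, 160],
    "price": [30, 45, 48, 65, 80, 85],
})

# Pairwise correlations give a quick first look at relationships
print(df.corr())

# Rank features by how strongly they correlate with the target ("price")
print(df.corr()["price"].drop("price").sort_values(ascending=False))

# Feature engineering: a derived feature may track the target even better
df["area_per_room"] = df["area"] / df["rooms"]
```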

CREATING MODELS

In Machine Learning and Data Science terminology, a model is an algorithm fit to training data; we have to get it performing well before using it for predictions.

There are many Machine Learning libraries, such as Scikit-Learn, PyTorch, TensorFlow, and Keras, that provide helper functions to create models. There is also a set of parameters called hyperparameters: parameters of the model that aren't trainable (they don't depend on the training data) and must be set before training.

Here is a simple template you can follow while creating models-

  1. Based on the data and the task, choose an algorithm that suits the problem
  2. Create a model based on that algorithm
  3. Train the model on the training set
  4. Check the performance of the model on the test data using a suitable metric

These four steps can be iterated till we get a good enough model.
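
Here is a minimal sketch of the four steps with Scikit-Learn, using the built-in Iris dataset and a Random Forest purely as an example algorithm:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a toy dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 1-2. Choose an algorithm suited to the task and create a model from it
model = RandomForestClassifier(random_state=42)

# 3. Train the model on the training set
model.fit(X_train, y_train)

# 4. Check performance on the held-out test data with a suitable metric
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```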

HYPERPARAMETER TUNING

Hyperparameter tuning involves changing the untrainable parameters of the model to see whether the model’s performance can be improved further. It is usually done by declaring a set of candidate values for each hyperparameter and then checking the model’s performance for each combination of those values. This search helps us find the best value of each hyperparameter from the sets we declared earlier.
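
A minimal sketch of this search using Scikit-Learn's GridSearchCV; the hyperparameter names and candidate values in the grid are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Declare a set of candidate values for each hyperparameter
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
}

# Grid search tries every combination and cross-validates each one
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated score:", search.best_score_)
```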

DEPLOYMENT

Finally, we have deployment. If one has completed all the above steps, they have a model that performs well on some metric, but it is not yet usable: it only exists as code and saved weights. Some sort of interface has to be created for users to give inputs and get outputs. Mobile applications, web applications, and APIs are the usual ways deployment is done. First, the best model has to be saved. Then, an interface has to be created for the user to provide input. We then take the inputs, call the saved model to predict the outcome, and display the result back on the interface. Several frameworks have emerged for deployment, a few of them being Flask, Django, and FastAPI.
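
As a minimal sketch, here is what a Flask prediction API can look like. The file name `model.pkl` and the JSON input format are assumptions for illustration:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the saved model once at startup; "model.pkl" is a hypothetical file
# produced earlier, e.g. with pickle.dump(model, open("model.pkl", "wb"))
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run()
```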

CONCLUSION

I hope this roadmap will help you understand the life cycle of a Data Science project. Hope you learned something new.
