Data Science Life Cycle

Pooja Umathe
7 min read · Jun 30, 2020


Hello!!

Do you know how the Data Science life cycle works?

Let’s see how it really works under the hood…

Data scientists follow a certain process to understand business problems and the world around them. That process is called the Data Science Life Cycle.

What is Data Science?

There are many definitions of Data Science out there, but I really like this one:

“The application of data centric computational and inferential thinking to understand the world and solve problems.” — Joseph Gonzalez, U.C. Berkeley.

The figure above shows each stage of the data science life cycle. Let’s look at the role of each stage.

Stage 1: Business Problem Understanding

Business problem understanding is the first stage of the Data Science life cycle. To solve any problem, we first need to understand it, and that is what this stage is about. Asking questions at this stage is an art, and some questions are relevant to any project. For example: What is the goal of the project? What application are we building? What results do we expect? Which tasks do we need to perform? What approach can we use?

In this stage we identify the problem by defining the objectives of the project along with the features and target variables. Formulating the questions that define the business goals the data science techniques can target is also important before starting any data science project.

Stage 2: Data Collection

Data collection is the second stage of the Data Science life cycle. Once the problem and the goal of the project are defined, data collection comes into the picture. In this step, data is gathered from one or more sources. Gathering meaningful data plays an important role here, since it enables one to answer relevant questions and evaluate the outcomes of the project.

Data can be structured or unstructured, depending on the source: if you gather information from a social media platform, for instance, your data might be unstructured. The data should meet the project’s requirements so that meaningful analysis and accurate predictions are possible. For example, if the project involves medical or healthcare data, it should contain variables such as patient ID, age, and gender.
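As a minimal sketch of this stage with pandas (in practice the CSV would come from a file, database, or API; the column names here are the hypothetical healthcare variables mentioned above):

```python
import io

import pandas as pd

# Stand-in for a real source: a small inline CSV of patient records.
raw_csv = io.StringIO(
    "patient_id,age,gender\n"
    "101,34,F\n"
    "102,51,M\n"
    "103,29,F\n"
)

# Load the collected data into a tabular structure for the next stages.
df = pd.read_csv(raw_csv)
```

For a real project, `pd.read_csv` would point at a file path or URL, and other readers (`read_sql`, `read_json`, and so on) cover other source types.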

Stage 3: Data Cleaning and Preparation

Data cleaning and preparation is the third stage of the data science life cycle. It is the process of cleaning and transforming raw data prior to analysis and modeling. Once the data is gathered, it needs to be cleaned in order to build good models. As mentioned in stage 2, data can be structured or unstructured, so it is important to enrich the data for better analysis and modeling. Incorrect or inconsistent data leads to false conclusions, so the quality of the data really matters.

In other words, it is the process of detecting and correcting (or removing) inaccurate entries from the data set. It involves removing duplicate entries, handling missing values, fixing typos, checking spelling, normalizing upper- and lower-case letters, and cleaning up punctuation, formatting, and blank spaces.
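A few of these cleaning steps can be sketched with pandas on a deliberately messy, hypothetical data set (duplicates, a missing value, inconsistent case, stray whitespace):

```python
import pandas as pd

# Hypothetical messy records.
df = pd.DataFrame({
    "name": [" Alice", "BOB", "bob", None],
    "city": ["NYC ", "LA", "LA", "NYC"],
})

# Normalize whitespace and letter case so duplicates become visible.
df["name"] = df["name"].str.strip().str.lower()
df["city"] = df["city"].str.strip()

# Remove exact duplicate rows, then rows missing a required field.
df = df.drop_duplicates()
df = df.dropna(subset=["name"])
```

After these steps only two clean rows remain: one for "alice" and one for "bob".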

Stage 4: Exploratory Data Analysis

Exploratory data analysis (EDA) is the stage after data cleaning, where we explore the variables in the data set with the help of visualization, statistics, and other techniques. It is an approach to analyzing data that summarizes its main characteristics by plotting variables and their statistical properties. It is a data exploration technique for understanding the various aspects of the data, and it includes several techniques applied in sequence.

The objective of EDA is to understand the data by exploring its various features. A few things to keep in mind while exploring: make sure the data is clean and has no redundancies, missing values, or nulls. We also have to identify the important variables in the data set and remove unnecessary noise that may hinder the accuracy of our conclusions when we work on model building. With the help of EDA we can understand the relationships between variables, derive conclusions, and gather insights about the data, so that we can move on to the more complex stages of the data science life cycle.
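A minimal EDA pass with pandas might look like this (the data set here is randomly generated purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data set, generated for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=100),
    "income": rng.normal(50_000, 10_000, size=100),
})

summary = df.describe()   # count, mean, std, min, quartiles, max per column
missing = df.isna().sum() # missing values per column (should be 0 after cleaning)
corr = df.corr()          # pairwise correlations between variables
```

`describe`, `isna`, and `corr` cover the first questions of any EDA: what do the distributions look like, is anything missing, and which variables move together. Plots (histograms, scatter plots) would usually follow.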

Stage 5: Feature Engineering

Feature engineering is the fifth stage of the data science life cycle. It refers to processing and transforming variables when creating a predictive model using machine learning or statistical modeling. The process involves a combination of data analysis, rules of thumb, and judgement. Machine learning algorithms learn from training data, and feature engineering is a critical task that data scientists must perform before training AI/ML models: it is the step where we identify and choose important features for the model to train on.

This is the process of highlighting the important information in the data, removing or isolating unnecessary information such as outliers, and applying your own expertise and domain knowledge to transform the data. Feature engineering is the art of introducing new, informative features that did not exist before. It is often said that data scientists spend about 80% of their time on data preparation and feature engineering; the remaining 20% covers training models and optimizing hyper-parameters. Performing feature engineering well is crucial to improving AI/ML model performance.
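The three moves described above — deriving a new feature, isolating an outlier, and encoding a variable for the model — can be sketched with pandas on a hypothetical data set:

```python
import pandas as pd

# Hypothetical records; the 250 kg weight is an implausible outlier.
df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.80, 1.70],
    "weight_kg": [60, 80, 250, 70],
    "gender": ["F", "M", "M", "F"],
})

# Domain knowledge creates a feature that did not exist before (BMI).
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# A simple rule of thumb isolates the obvious outlier.
df = df[df["weight_kg"] < 200]

# Encode the categorical variable as model-ready indicator columns.
df = pd.get_dummies(df, columns=["gender"])
```

The threshold and the BMI formula stand in for whatever domain rules a real project would apply.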

Stage 6: Machine Learning

Machine learning, the sixth stage, comes after feature engineering. Once we have all the important features from feature engineering, we can train the machine on them. Machine learning is the ability of a computer system to learn from its environment and improve itself from experience without the need for any explicit programming. Some applications of machine learning are fraud and risk detection, healthcare, internet search, website recommendations, image recognition, and speech recognition.

The world is filled with data, a lot of data: pictures, music, words, spreadsheets, videos, and it does not look like it is going to slow down anytime soon. Machine learning brings the promise of deriving meaning from all that data, and we see it all around us in the products we use today. In simple terms, machine learning is using data to answer questions. Training refers to using our data to inform the creation and fine-tuning of a predictive model. That model can then serve predictions on previously unseen data and answer those questions. As more data is gathered, the model can be improved over time and new predictive models deployed.
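In code, this training step can be as small as fitting a scikit-learn model — shown here as a sketch with logistic regression on a built-in example data set; any model and any prepared feature matrix could stand in:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Features and target, as they would arrive from the feature-engineering stage.
X, y = load_iris(return_X_y=True)

# "Training" in one line: the model learns a mapping from features to target.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# The trained model can now serve predictions for new feature rows.
preds = model.predict(X[:5])
```

`max_iter=1000` just gives the solver room to converge; the real choices (which model, which hyper-parameters) depend on the problem.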

Data is the key to unlocking machine learning, just as much as machine learning is the key to unlocking that hidden insight in data.

Stage 7: Model Evaluation

Model evaluation is the step after training the model and is an integral part of the model development process. It helps us find the model that best represents our data and estimate how well the chosen model will work in the future. Evaluating model performance on the same data used for training is not acceptable in data science, because it can easily produce overoptimistic, over-fitted models.

There are two common methods of evaluating models in data science: hold-out and cross-validation. The purpose of hold-out evaluation is to test a model on different data than it was trained on, which provides an unbiased estimate of learning performance. Cross-validation partitions the original data set into a training set, used to train the model, and an independent set, used to evaluate it, repeating the split several times. To avoid over-fitting, both methods use a test set (not seen by the model) to evaluate model performance. An evaluation metric quantifies the performance of a machine learning model: it is a number that tells you how good the model is, and you can use it to compare different models.
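Both methods are one-liners in scikit-learn; this sketch reuses the same toy data set and model as above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold-out: keep 25% of the data unseen during training, score on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
holdout_accuracy = model.score(X_test, y_test)  # the evaluation metric: accuracy

# Cross-validation: repeat the train/evaluate split 5 times, average the scores.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```

Accuracy is used here only because it is scikit-learn's default for classifiers; the right metric (precision, recall, RMSE, …) depends on the problem.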

Stage 8: Data Visualization

Data visualization is the last stage of the data science life cycle. It is the process of transforming data into graphs, charts, images, and even videos that explain the numbers and allow us to gain insights from them. It changes the way we make sense of information, helping us discover new patterns and spot trends, and it provides clear, actionable insights for business decisions. Visualizing information is a great way to see the patterns and trends in the data. Telling a story with data is an art, and today there are many applications for creating dashboards, charts, and graphs that let us focus only on the information that matters.

So, this was all about the data science life cycle and how it works. If you are passionate about data science, stay tuned for my next articles.

Thank you for reading and for more learning stay tuned!!

Pooja Umathe

Follow me on:

https://medium.com/@pumathe

https://www.linkedin.com/in/poojaumathe/
