An Introduction to Data Science Life-cycle

Sahil Mankad · Published in Analytics Vidhya · Oct 10, 2019

The process of data science is much more than just predictive modelling, data cleaning, and data visualization. The ultimate goal of data science is to generate value for an organization and for society in general. This starts with defining the goal, i.e. what we are supposed to gain from the data science process, then generating hypotheses to dig into the various factors affecting our outcome, followed by data extraction, data pre-processing, and modelling.

Below is a simple flow chart to summarize the journey. Let’s get into the details of each stage.

The Data Science Life-cycle

Problem Definition:

Problem definition is like goal setting. As in life, you cannot succeed in a data science project without defining what success, or the end goal, looks like. Sometimes the same parameter can have different interpretations depending on the business outcome.

For example, if users spend more time on Netflix’s application, it means that people are getting hooked. However, the same metric for Amazon could mean that pages take too long to load, users cannot find the products they need, or the payment gateway is slow, any of which can lead to customer attrition. The same parameter can have a widely different impact on different businesses.

It’s important to remember that identifying the problem is itself a critical step in solving it; defining the problem incorrectly leads to wasted effort. The job of a data scientist is to get into the details with clients, identify the business problem, and convert it into a data problem, generally in the form of an equation.

For example, let us say that our client is a bank and the core problem is to increase the deposit amount.

Deposit = Σ customer_balance × (1 + roi) − (cost_marketing + Σ employee_salary)

Where,

  • roi: rate of interest as a fraction (e.g., 8% = 0.08)
  • employee_salary: salary for each employee
  • customer_balance: balance for each customer
  • cost_marketing: marketing cost for gaining new customers or retaining existing ones.
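To make the equation concrete, here is a minimal Python sketch of the same calculation. All the numbers are hypothetical, purely for illustration.

```python
# A minimal sketch of the deposit equation above.
# All figures are hypothetical, for illustration only.
customer_balances = [12_000.0, 5_500.0, 30_000.0]  # balance for each customer
employee_salaries = [4_000.0, 6_500.0]             # salary for each employee
cost_marketing = 2_000.0                           # marketing spend
roi = 0.08                                         # 8% rate of interest

deposit = sum(b * (1 + roi) for b in customer_balances) - (
    cost_marketing + sum(employee_salaries)
)
print(f"Deposit: {deposit:,.2f}")  # Deposit: 38,800.00
```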

Hypothesis Generation:

If the problem definition phase is about determining what issue is to be addressed, the hypothesis generation phase is about finding out what could be done to solve it, or to pinpoint the root cause of the problem. There can be hundreds of hypotheses for a single problem; in fact, this is quite common. No question is silly at this stage of the data science life-cycle.

Hypotheses are divided into various sections depending on the problem. For instance, sales of a product in a retail chain can be attributed to demographics, seasonal trends, competitors, and even psychological factors. This step is done before looking at the data, in order to avoid human bias.

There is a null hypothesis and an alternative hypothesis; data and statistics are used to either reject the null hypothesis or fail to reject it.

Wikipedia defines the null hypothesis as

“In inferential statistics, the null hypothesis is a general statement or default position that there is nothing new happening, like there is no association among groups, or no relationship between two measured phenomena.”

Wikipedia defines the alternative hypothesis as

“In statistical hypothesis testing, the alternative hypothesis is a position that states something is happening, a new theory is true instead of an old one.”

Validating a hypothesis can sometimes be as simple as looking at a visualization, as shown in the two examples below.

Figure(1): Box plots validating the null hypothesis
Figure(2): Histogram and bar chart rejecting the null hypothesis

Image credits for figure(1) and figure(2) : Analytics Vidhya
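Beyond visual checks, a hypothesis can also be tested numerically. Below is a minimal sketch using SciPy’s two-sample t-test on synthetic data; the group means and significance level are illustrative assumptions, not values from the figures above.

```python
# A minimal sketch of a two-sample t-test on synthetic data.
# Null hypothesis: the two groups have the same mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50.0, scale=5.0, size=200)  # e.g., sales in region A
group_b = rng.normal(loc=52.0, scale=5.0, size=200)  # e.g., sales in region B

t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05  # conventional significance level

if p_value < alpha:
    print(f"p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: fail to reject the null hypothesis")
```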

Data Extraction:

The next step is data extraction. We select a dataset only if it checks all of the boxes below.

  1. Cleanliness of data: While we do perform data cleaning before modelling, we should minimize that effort by selecting the dataset carefully (a few basic checks are scripted in the sketch after this list).
  2. Availability of historic data: The data for the required timeframe should be available.
  3. Structure compatibility: The analysis to be performed should be compatible with the data available. For example, it is not worth performing text analysis on a few social media comments if the majority of the available data is a structured table of mostly numerical values.
  4. Expense: The money and time needed to procure the data should not outweigh the benefits the organization expects to gain from it.
  5. Dependency: The data source should be reliable. We check the effectiveness of our model on available data, but the model is deployed on real-world data, so the model should not be trained on data dissimilar to what we expect to see in the real world.
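Here is a minimal pandas sketch of the first few checks; the file name and column names are hypothetical, for illustration only.

```python
# A minimal sketch of basic data-selection checks with pandas.
# "transactions.csv" and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv", parse_dates=["date"])

# 1. Cleanliness: share of missing values per column.
print(df.isna().mean().sort_values(ascending=False).head())

# 2. Availability of historic data: does the timeframe cover what we need?
print("Date range:", df["date"].min(), "to", df["date"].max())

# 3. Structure compatibility: column types at a glance.
print(df.dtypes)
```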

Data Modelling:

The first step of any modelling process is deciding the target variable. If the target variable is real-valued (continuous) we use regression techniques, and if the target variable is a discrete class we use classification techniques. For both regression and classification, we measure effectiveness using an evaluation metric. You can refer to this article for more information.

The next step in modelling is sampling the available data into train and test datasets. We then train the model and measure its effectiveness on the test set. The model is then put into production to gather insights from data or to provide a feature, such as recommendations for users. Below is a flowchart of the entire process, followed by a short code sketch.

The process of Data Modelling
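As a concrete illustration of that flow, here is a minimal scikit-learn sketch on a built-in regression dataset; the model and evaluation metric are illustrative choices, not prescribed by this article.

```python
# A minimal sketch of the train/test/evaluate flow with scikit-learn.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# The target is real-valued, so this is a regression problem.
X, y = load_diabetes(return_X_y=True)

# Sample the available data into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)          # train on the training set

predictions = model.predict(X_test)  # measure effectiveness on the test set
print(f"Test MSE: {mean_squared_error(y_test, predictions):.2f}")
```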

Conclusion:

In this article we have seen the process of data science and the steps involved at each stage. We did not cover data extraction and modelling in much detail; these are vast topics, each deserving an article of its own. I have attempted to explain the complete life-cycle of a data science project, and I hope it has been helpful for you.
