Data Science Project Life-cycle: A Primer

Bharadwaj Venkat
Published in Analytics Vidhya
Jun 19, 2020 · 6 min read

Data Science projects are exciting because they involve teaching computers (models) to solve a business problem. However, jumping straight into building those models does not achieve the desired result. A data science project has multiple stages, with multiple steps in each stage, and the stages are iterated for better results.

Let us first try to understand the stages:

1. Business Understanding: This is the business stage of the project. The stakeholders decide the project objective and scope. The domain experts, business analysts, data engineers, data scientists and BI experts all reach a common understanding about the project objectives. Review stages, deadlines and acceptance criteria are agreed upon together and documented.

2. Data Acquisition & Understanding: We collect the data from different sources and try to make sense of the variables, their relationship to the business problem and their relationships to each other. This is also called Exploratory Data Analysis. Here, we can also build the pipeline from the data sources to the machine learning environment.

Collecting data from different sources

3. Feature Engineering: In this stage we make the data suitable for building models.

4. Model Building: This is the phase where we build a statistical or machine learning model that the business can use to take decisions.

5. Model Deployment: We deploy the best model to production so that it is available to make decisions on new data.

image source: https://pypi.org/project/pymltoolkit/

Let's see what generally happens in each stage.

1. Business Understanding: This stage is about understanding the domain and the business problem that can be solved, or the process that can be improved, for better business results. It involves multiple stakeholders such as domain experts, business analysts, data engineers, data scientists and software engineers. Each has an important role in defining the requirements and contributes at every stage of the project.

2. Data Acquisition & Understanding: This is where the data science work begins. In this step, we collect all the available data from different sources and start analyzing it.

This is the longest and most important stage of the data science project life cycle, and it involves multiple steps.

a) Data Collection: We collect data from the databases, understand the variables, and define the predictor and target variables as per the business problem.
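
A minimal sketch of this step with pandas; the file name "customers.csv" and the target column "churn" are hypothetical placeholders, not from the article:

```python
import pandas as pd

# Load the raw data (file name and column names are hypothetical)
df = pd.read_csv("customers.csv")

# Inspect the variables before defining predictors and target
print(df.info())
print(df.describe())

# Define the target variable as per the business problem,
# and treat the remaining columns as predictors
target = "churn"
y = df[target]
X = df.drop(columns=[target])
```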

b) Exploratory Data Analysis: The collected data is analysed, descriptive and inferential analyses are carried out, and insights are generated. The steps below are part of exploratory data analysis.

“EDA is where a lot of insights are generated”

Data Cleaning: Data has to be cleaned to handle missing values and outliers, as these skew the data and reduce the statistical power of the models.

Missing Value Imputation: Missing values in the data need to be handled. A common solution is to impute missing fields with the mean or median for continuous variables, and with the mode for categorical variables.
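
A minimal sketch of this imputation rule with pandas, using hypothetical columns "age" (continuous) and "city" (categorical):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 35, None],         # continuous
    "city": ["NY", "LA", None, "NY", "LA"],  # categorical
})

# Continuous: impute with the mean (the median is a robust alternative)
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical: impute with the mode (the most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])
```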

Outlier Treatment: Outliers need to be identified and treated to avoid skewness in the data. They can be identified visually with box plots and scatter plots, or statistically using percentiles, the Inter-Quartile Range (IQR), or the Z-score (values beyond three standard deviations).
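
A minimal sketch of the IQR rule with pandas; the 1.5x multiplier is the conventional choice, and capping is just one of several treatments:

```python
import pandas as pd

s = pd.Series([12, 14, 15, 13, 16, 14, 120])  # 120 is an obvious outlier

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)  # flags 120

# One common treatment: cap (winsorize) values at the bounds
s_capped = s.clip(lower=lower, upper=upper)
```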

Uni-variate Analysis: The analysis of one variable at a time. Depending on the data type, this is done with basic bar charts, histograms, pie charts, box plots, counts, count percentages, etc.

Bi-variate and Multi-variate Analysis: The analysis of two or more variables together.

Continuous-Continuous: When both variables are continuous, the analysis is done with a scatter plot to understand their relationship. To quantify the strength of the relationship, we can also compute the correlation coefficient: 1 indicates perfect positive correlation, -1 perfect negative correlation, and 0 no linear relationship. NumPy's corrcoef function computes this, as sketched below.
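
A minimal sketch with NumPy and matplotlib on synthetic data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)  # positively related to x

# Scatter plot to eyeball the relationship
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Pearson correlation: np.corrcoef returns a 2x2 matrix
r = np.corrcoef(x, y)[0, 1]
print(f"correlation: {r:.2f}")  # close to +1 for a strong positive relationship
```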

Categorical-Categorical: Two-way tables, stacked column charts and the chi-square test.
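
A minimal sketch of a two-way table and chi-square test with pandas and SciPy, on made-up data:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F", "M", "F"],
    "bought": ["yes", "yes", "no", "yes", "no", "yes", "no", "no"],
})

# Two-way (contingency) table of the two categorical variables
table = pd.crosstab(df["gender"], df["bought"])
print(table)

# Chi-square test of independence
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p-value={p_value:.3f}")
```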

Categorical-Continuous: We can draw box plots and establish statistical significance by performing a Z-test, T-test or ANOVA.
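
A minimal sketch with SciPy on synthetic groups; scipy.stats provides ttest_ind for two categories and f_oneway for three or more:

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=55, scale=5, size=30)
group_c = rng.normal(loc=52, scale=5, size=30)

# Two categories: independent-samples t-test
t_stat, p_val = ttest_ind(group_a, group_b)
print(f"t-test p-value: {p_val:.3f}")

# Three or more categories: one-way ANOVA
f_stat, p_val = f_oneway(group_a, group_b, group_c)
print(f"ANOVA p-value: {p_val:.3f}")
```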

3. Feature Engineering: This is the step where we create and add new values or features to our dataset to improve how well the model can comprehend the data.

Feature engineering is a critical step for machine learning model performance.

Feature Scaling: This is a step that affects a model's performance, though not all algorithms require scaling. Scaling is required for distance-based algorithms and for algorithms that use gradient descent to reach the global minimum, for example KNN, K-Means, linear models and neural networks. Models that do not require scaling include tree-based, bagging and boosting algorithms.

There are two common ways this is done: normalization and standardization. Each has its own purpose.

Normalization: Also called min-max normalization, this technique scales feature values to the range 0 to 1. The sklearn library provides the MinMaxScaler class for this.

Standardization: Also called Z-score normalization, this technique rescales feature values to have a mean of 0 and a standard deviation of 1. The sklearn library provides the StandardScaler class for this.
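
A minimal sketch of both techniques with sklearn on a toy feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])  # one toy feature

# Normalization: rescale the feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X).ravel())

# Standardization: rescale to mean 0, standard deviation 1
print(StandardScaler().fit_transform(X).ravel())
```

In practice the scaler should be fit on the training data only and then used to transform the test data, to avoid leaking information from the test set.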

Feature Encoding: Since machine learning models are mathematical models, they cannot comprehend categorical variables directly; encoding solves this problem.

There are three types of categorical variables:

a) Binary categorical variables, e.g. Yes/No, True/False

b) Nominal categorical variables, e.g. names of things

c) Ordinal categorical variables, e.g. cold/warm/hot or small/medium/large

For the binary category, encoding is done with the binary encoding technique, where one of the two values of the variable is replaced with numeric 0 and the other with numeric 1.

For the nominal category, encoding is done with the one-hot encoding technique, where a new dummy variable is created for each value of the variable: it is set to 1 when that value is present and 0 when it is not.

For the ordinal category, encoding is done with label encoding, which preserves the order of the values by mapping each value of the variable to a numerical rank.
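
A minimal sketch of all three encodings with pandas; the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "subscribed": ["yes", "no", "yes"],    # binary
    "city": ["NY", "LA", "SF"],            # nominal
    "size": ["small", "large", "medium"],  # ordinal
})

# Binary: map the two values to 0 and 1
df["subscribed"] = df["subscribed"].map({"no": 0, "yes": 1})

# Nominal: one-hot encoding creates a dummy column per value
df = pd.get_dummies(df, columns=["city"])

# Ordinal: label-encode with an explicit order-preserving mapping
df["size"] = df["size"].map({"small": 0, "medium": 1, "large": 2})
print(df)
```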

4. Model Building: This is by far the most interesting and most hyped stage of the machine learning project.

In this stage, we apply algorithms to our data to build a model that produces the desired output. The choice of algorithm depends on the business problem and the data type of the target variable. The given business problem can be supervised, unsupervised or reinforcement learning in nature, and there are multiple learning tasks such as regression, classification, clustering and finding associations. There are specific algorithms for each of these learning tasks.


Please refer to my other article for an explanation of the learning algorithms, the scenarios in which to use them, and their evaluation metrics.

After applying a suitable algorithm to the data to build the model, there is the important step of model evaluation. There are multiple methods to evaluate a model. Model performance is evaluated against the metrics accepted and agreed upon during the business understanding or data understanding stages. If the model does not produce the desired results, we need to go back to the previous stages, hence the term 'project life cycle'. A minimal sketch of this build-and-evaluate loop follows.
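
The sketch below uses a synthetic classification dataset, a random forest, and accuracy; all three are illustrative choices, not recommendations from the article:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hold out a test set so evaluation reflects performance on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Evaluate against the metric agreed in the business understanding stage
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
# If this falls short of the agreed acceptance criteria, loop back to
# earlier stages (more data, better features, other algorithms).
```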

5. Model Deployment: Our champion model needs to be deployed to start making decisions for the business.

This is the software engineering stage of the machine learning project life-cycle. We can use multiple techniques and technologies such as pickle, Flask, Docker, Kubernetes and the cloud. The serialized models can be deployed on the cloud.
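
A minimal sketch of the pickle + Flask route mentioned above; the file name "model.pkl", the /predict endpoint and the JSON input format are all hypothetical:

```python
import pickle

from flask import Flask, jsonify, request

# Load the champion model, previously serialized with pickle.dump(model, f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[0.1, 0.2, ...]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

In production this service would typically be containerized with Docker and orchestrated with Kubernetes, as the article notes.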

There will be another article with a detailed explanation of model building, model evaluation and model deployment techniques.

Cheers!!!
