Data Science Life Cycle in 5 Steps!

Vanshika Goel
7 min read · Jul 4, 2020


Have you ever felt stuck while building a data science project? There can be many reasons for it, but the most common is not following the steps involved in the lifecycle. People often fail to carry out the initial steps properly and, as a result, fail to complete the project. To avoid that pain, you need to follow the life cycle of a data science project carefully. Admittedly, data science projects do not have a clean lifecycle with well-defined steps like the software development lifecycle, so we cannot come up with a single compact structure. Still, the one tested and used by many data scientists is explained below.

Step: 1
DATA COLLECTION

Too many data projects fail at this very first step. Too many companies collect incomplete, unreliable data, and everything they do after that is compromised. So as I always say, the first step is the toughest and most important in the process, and doing it correctly is extremely essential.
In this step, you will need to query databases, using technical skills like SQL (for example, MySQL) to extract the data. You may also receive data in file formats like Microsoft Excel. Python and R have specific packages that can read data from these sources directly into your data science programs. For data on the web, some great scraping tools are BeautifulSoup, Scrapy, etc.
Another popular option to gather data is connecting to Web APIs. Websites such as Facebook and Twitter allow users to connect to their web servers and access their data. All you need to do is to use their Web API to crawl their data.
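Web APIs typically return their data as JSON. As a minimal sketch of what working with such a response looks like (the payload below is a made-up example, not a real Facebook or Twitter response):

```python
import json

# A made-up example of the kind of JSON payload a Web API might return.
response_text = '{"posts": [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]}'

# Parse the raw JSON string into Python dictionaries and lists.
data = json.loads(response_text)

# Extract just the text of each post for downstream analysis.
texts = [post["text"] for post in data["posts"]]
print(texts)  # ['hello', 'world']
```

In practice, you would fetch `response_text` over HTTP with a library such as requests, authenticating with whatever API key or token the website issues you.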

And of course, the most traditional way of obtaining data is directly from files, such as downloading a dataset from Kaggle or using existing corporate data stored in CSV (Comma-Separated Values) or TSV (Tab-Separated Values) format. These files are flat text files, so you will need a parser to read them, for example Python's built-in csv module or a library such as pandas.
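For example, a CSV file can be parsed with Python's built-in csv module (the file contents here are invented for illustration):

```python
import csv
import io

# Invented sample data standing in for a downloaded CSV file.
raw = "name,age,city\nAda,36,London\nGrace,45,Arlington\n"

# csv.DictReader maps each data row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["name"], rows[0]["age"])  # Ada 36
```

Note that the csv module reads every field as a string; converting ages to integers, parsing dates, and so on is part of the preparation step that follows.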
All these ways can help you collect all the possible data you may require.
Now that we know how to collect the data, the question arises is-
What to collect and what not to!
The general idea is to collect everything you can — because data storage is relatively cheap nowadays.
The work of data collection and data storage lies with data engineers. To know more, refer to my previous blog "Top 4 Career Options in Data Science".

Step: 2
DATA PREPARATION

Once you know you have the data you need, you move into data preparation. Depending on how sophisticated your business is, this could be a very big step or a very small one. In the most ideal situation, you are just taking a couple of different tables, joining them together, and organizing them in the way that the data scientist would like. This phase is often referred to as data cleaning or data wrangling. Data scientists often complain that it is the most boring and time-consuming task, as it involves identifying various data quality issues: data acquired in the first step of a project is usually not in a usable format to run the required analysis and might contain missing entries, duplicates, inconsistencies, and semantic errors.
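A minimal sketch of this cleaning step with pandas, assuming the raw table below (an invented example with two typical issues: a missing age and an exact duplicate row):

```python
import pandas as pd

# Invented raw data with typical quality issues.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, None, None, 29],
    "country": ["IN", "US", "US", "IN"],
})

cleaned = (
    raw.drop_duplicates()                     # remove exact duplicate rows
       .fillna({"age": raw["age"].median()})  # impute missing ages with the median
       .reset_index(drop=True)
)
print(cleaned)
```

Real projects involve far more than this, such as joining tables, standardizing categories, and fixing semantic errors, but the pattern of chaining small, well-named cleaning operations is the same.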

Data preparation work is done by information technology (IT) and business intelligence (BI) teams as they integrate data sets to load into a data warehouse, NoSQL database, or Hadoop data lake repository. Also, data analysts can use self-service data preparation tools to collect and prepare data for analysis when using data visualization tools such as Tableau. One of the biggest benefits of instituting a formal data preparation process is that users can spend less time finding and structuring their data.
Hence this work is generally done by data analysts and sometimes by data scientists.

Step: 3
EXPLORATORY DATA ANALYSIS

Data analysis is defined as a process of cleaning, transforming, and modeling data to discover useful information for business decision-making. The purpose of data analysis is to extract useful information from data and make decisions based on it. Exploratory analysis is often described as a philosophy, and there are no fixed rules for how you approach it. There are no shortcuts for data exploration. Remember, the quality of your inputs decides the quality of your output. Therefore, once you have got your business hypothesis ready, it makes sense to spend a lot of time and effort here.

Below are some of the standard practices involved to understand, clean, and prepare your data for building your predictive model:

-Variable Identification
-Univariate Analysis
-Bi-variate Analysis
-Missing values treatment
-Outlier treatment
-Variable transformation
-Variable creation
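To make the univariate-analysis, missing-values, and outlier items above concrete, here is a small sketch using only Python's standard library (the column of values is invented, with None standing in for missing entries):

```python
import statistics

# An invented numeric column; None marks missing entries.
values = [12.0, 15.5, None, 14.0, 98.0, None, 13.5]

observed = [v for v in values if v is not None]
missing = len(values) - len(observed)

# Univariate summary of the observed values.
summary = {
    "count": len(observed),
    "missing": missing,
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "stdev": statistics.stdev(observed),
}
print(summary)

# A crude outlier check: flag values more than 2 standard deviations from the median.
median = statistics.median(observed)
stdev = statistics.stdev(observed)
outliers = [v for v in observed if abs(v - median) > 2 * stdev]
print(outliers)  # [98.0]
```

Note how the single extreme value (98.0) drags the mean far above the median; comparing the two is a quick first hint that outlier treatment is needed.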

Finally, we will need to iterate over the exploration, modeling, and evaluation steps multiple times before we come up with our refined model.

Step: 4
MODEL BUILDING

Modeling is the stage in the data science methodology where the data scientist has the chance to sample the sauce and determine whether it is banging or in need of more seasoning. Once again, before reaching this stage, bear in mind that the scrubbing and exploring stages are equally crucial to building useful models, so take your time on them instead of jumping straight to this process. Model building is the core activity of a data science project. It is carried out either with statistical techniques (statistical analytics) or with machine learning techniques.

In the machine learning world, modeling is divided into 3 distinct stages — training, validation, and testing. These stages change if the mode of learning is unsupervised.
In any case, once we have modeled the data we can derive insights from it. This is the stage where we can finally start evaluating our complete data science system.
The end of modeling is characterized by model evaluation, where you measure:
-Accuracy: how well the model performs, i.e., does it describe the data accurately?
-Relevance: does it answer the original question that you set out to answer?
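A minimal sketch of the training/testing split and accuracy measurement with scikit-learn (the tiny synthetic dataset is invented for illustration; a real project would use real features, and a validation set would be carved out of the training portion):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Invented synthetic data: one feature, label is 1 when the feature is non-negative.
X = [[i] for i in range(-20, 20)]
y = [1 if i >= 0 else 0 for i in range(-20, 20)]

# Hold out a quarter of the data as the unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train on the training set only; evaluate on the held-out test set.
model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

The point of the split is that accuracy is measured on data the model never saw during training, which is what the evaluation stage described above is really asking about.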

Step: 5
MODEL DEPLOYMENT

Finally, all data science projects must be deployed in the real world. The deployment could be through an Android or an iOS app. Machine learning models might have to be recoded before deployment, because data scientists might favor the Python programming language while the production environment supports only Java. After this, the machine learning models are first deployed in a pre-production or test environment before actually deploying them into production.
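Before a model reaches any serving environment, it is typically serialized so that environment can load it without retraining. A minimal sketch with Python's built-in pickle (the "model" here is just an invented dictionary of coefficients standing in for a trained model):

```python
import os
import pickle
import tempfile

# An invented stand-in for a trained model: a dict of learned coefficients.
model = {"weights": [0.4, -1.2], "bias": 0.1}

# Serialize the model to disk, as a training job would at the end of a run.
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, the serving process loads the artifact back and uses it for predictions.
with open(path, "rb") as f:
    loaded = pickle.load(f)

def predict(features, m):
    """Linear score: dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(m["weights"], features)) + m["bias"]

print(predict([1.0, 1.0], loaded))  # 0.4 - 1.2 + 0.1 = -0.7
```

Production systems often prefer formats such as joblib or ONNX over raw pickle, especially when the serving language differs from the training language, but the save-then-load pattern is the same.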

Whatever the shape or form in which your data model is deployed, it must be exposed to the real world. Once real humans use it, you are bound to get feedback. Capturing this feedback can make or break a project: the more accurately you capture it, the more effective the changes you make to your model will be, and the more accurate your final results. At this point, typical organizations document this flow and hire engineers to keep iterating on it.

CONCLUSION

What I have presented here are the steps that data scientists follow chronologically in a typical data science project. If it is a brand new project, people usually spend about 60–70% of their time just on gathering and cleaning the data.
Also, it is an iterative process: you keep repeating the various steps until you have fine-tuned the methodology to your specific case, so you will often have several of the above steps going on in parallel. Python and R are the most used languages for data science; to read about them, you can refer to my blog post "Programming Languages for Data Science". Furthermore, if you face any problems at any time, you can always reach out to me on LinkedIn. Good luck with your next project! :)

Originally published at https://www.datasciencenow.tech by Vanshika Goel.
