Part-3 Data Science Methodology From Understanding to Preparation

Ashish Patel
ML Research Lab
Aug 10, 2019 · 6 min read

From Understanding to Preparation

Source: Coursera.org

Welcome to the next step of the Data Science Methodology. In this article we move from data understanding to data preparation. Once you have collected the data, you need to understand every variable and its characteristics using Exploratory Data Analysis and descriptive statistics. Often you will also have to perform a number of pre-processing operations on that data. We discuss all of this in detail with a case study in this article. If you have missed the earlier articles, the links are below; read them first and come back here so you can follow along more easily.

Article Series :

  1. Overview of Data Science Methodology
  2. Part-1 Data Science Methodology- From Problem to Approach
  3. Part-2 Data Science Methodology From Requirement to Collection
  4. Part-3 Data Science Methodology From Understanding to Preparation
  5. Part-4 Data Science Methodology From Modelling to Evaluation
  6. Part-5 Data Science Methodology From Deployment to Feedback

#1) Data Understanding

  • Understanding the data includes all activities related to constructing the dataset. The Data Understanding stage of the Data Science Methodology answers the question:
  • Is the data you collected representative of the problem you are trying to solve?

Let us apply data understanding to the case study. To understand the data related to congestive heart failure admissions, descriptive statistics had to be run against the data columns that would become variables in the model.

  • First, these statistics included univariate statistics for each variable, such as the mean, median, minimum, maximum, and standard deviation.
  • Second, pairwise correlations were used to see how closely the variables were related and which ones, if any, were highly correlated, meaning they would be essentially redundant, so that only one of them is relevant for modeling.
  • Third, the histograms of the variables were examined to understand their distributions. Histograms are a good way to understand how the values of a variable are distributed and what kind of data preparation may be needed to make the variable more useful in a model. For example, if a categorical variable contains too many distinct values to be meaningful in a model, the histogram can help decide how to consolidate those values.
  • Univariate statistics and histograms are also used to assess the quality of the data; a short sketch of these checks follows this list.
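
To make this concrete, here is a minimal sketch of these checks using pandas. The file name heart_failure_admissions.csv is a hypothetical placeholder, since the case-study data itself is not shown in the article; the same calls apply to any tabular dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name; the actual case-study data is not publicly available.
df = pd.read_csv("heart_failure_admissions.csv")

# 1) Univariate descriptive statistics: mean, std, min, quartiles (median), max.
print(df.describe())

# 2) Pairwise correlations between numeric variables; values close to +/-1
#    flag pairs that are essentially redundant for modeling.
corr = df.select_dtypes(include="number").corr()
print(corr)

# 3) Histograms to inspect each variable's distribution and spot the need
#    for re-coding, binning, or consolidating rare categories.
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()
```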

Based on what the data show, certain values can be re-coded or even dropped if necessary, for example when a particular variable has a lot of missing values.

  • The question then becomes whether "missing" actually means something. Sometimes a missing value means "no" or "0" (zero), and sometimes it simply means "we do not know".
  • A variable may also contain invalid or misleading values. For example, a numeric variable called "age" that contains values from 0 to 100 plus 999, where "triple-9" actually means "missing", will be treated as a valid value unless it is corrected (see the sketch after this list).
  • Initially, the meaning of a congestive heart failure admission was decided on the basis of a primary diagnosis of heart failure. However, the data understanding stage revealed that, based on clinical experience, the initial definition did not cover all of the expected heart failure admissions.
  • This meant going back to the data collection stage, adding secondary and tertiary diagnoses, and building a more comprehensive definition of a heart failure admission.
  • This is just one example of the iterative processes in the methodology. The more you work with the problem and the data, the more you learn and the more the model can be adjusted, which ultimately leads to a better solution to the problem.
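
As a rough illustration of handling such sentinel and missing values, the sketch below re-codes 999 in the "age" column mentioned above as missing and inspects missing-value counts with pandas. The file name and the 80% threshold are hypothetical choices for illustration, not part of the case study.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("heart_failure_admissions.csv")  # hypothetical file name

# Treat the sentinel value 999 in "age" as missing rather than a valid age.
df["age"] = df["age"].replace(999, np.nan)

# How many values are actually missing in each column?
print(df.isna().sum())

# A column that is mostly missing may add more noise than signal; here such
# columns are dropped, but remember that "missing" can also mean "no" or 0
# depending on how the variable was recorded.
mostly_missing = df.columns[df.isna().mean() > 0.8]
df = df.drop(columns=mostly_missing)
```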

#2) Data Preparation

In a way, data preparation is like washing freshly picked vegetables: it removes unwanted elements such as dirt or insects.

  • Together with data collection and data understanding, data preparation is the most time-consuming phase of a data science project, typically taking 70% to 90% of the overall project time. Automating some of the data collection and preparation processes in the database can cut this to as little as 50%.
  • That time saving means data scientists can spend more of their effort on building models.

To continue the cooking metaphor, we know that chopping an onion into finer pieces allows its flavor to spread through a sauce more easily than if we dropped the whole onion into the pot.

  • Similarly, transforming data in the data preparation phase means getting the data into a state where it is easier to work with.
  • More specifically, the data preparation phase of the methodology answers the question: in what ways is the data prepared? To work effectively with the data, missing or invalid values must be addressed and duplicates removed, so that all of the data is properly formatted (a pandas sketch of these steps follows this list).
  • Feature engineering is also part of data preparation. It uses domain knowledge of the data to create features that make machine learning algorithms work. A feature is a property that can be useful for solving a problem. The features in the data are important to the predictive models and influence the results you want to achieve.
  • Feature engineering is critical when machine learning tools are applied to the data. When working with text, text analysis steps are needed to code the data so that it can be manipulated.
  • The data scientist must know what he or she is looking for in the dataset in order to address the question. Text analysis is essential to ensure that the proper groupings are defined and that the programming does not overlook what is hidden within.
  • The data preparation phase sets the stage for the next steps in addressing the question. Although this phase may take a while, if done right the results will support the project; if it is skipped, the outcome will not be up to par and you may find yourself back at the drawing board.
  • Make sure to spend time on this phase, and use the available tools to automate the common steps so that data preparation is accelerated; pay attention to the details here. After all, it only takes one bad ingredient to ruin a great meal.
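
Below is a minimal pandas sketch of the kind of preparation described above: removing duplicates, re-coding invalid values, filling gaps, fixing formats, and deriving simple engineered features. The file name and every column name other than "age" are hypothetical placeholders, not the actual case-study schema, and the fill rules are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("heart_failure_admissions.csv")  # hypothetical file name

# Remove exact duplicate records.
df = df.drop_duplicates()

# Re-code the invalid sentinel value and fill remaining gaps with a simple
# rule (median for numeric columns); a real project would decide per variable.
df["age"] = df["age"].replace(999, np.nan)
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Ensure consistent formatting, e.g. dates parsed as datetimes.
df["admission_date"] = pd.to_datetime(df["admission_date"])  # hypothetical column

# Feature engineering: derive new variables from domain knowledge.
df["age_bucket"] = pd.cut(
    df["age"],
    bins=[0, 40, 65, 80, 120],
    labels=["<40", "40-64", "65-79", "80+"],
)
df["readmitted_within_30d"] = (
    df["days_to_readmission"] <= 30  # hypothetical column
).astype(int)
```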

Case Study of Data Preparation:

Thanks for Reading…!!! Happy Learning…!!!

References :

  1. https://www.coursera.org/learn/data-science-methodology
