Data Preparation in Data Science

Swapnil Bandgar
Published in Analytics Vidhya
5 min read · May 28, 2021

Data preparation:

Data preparation is the process of collecting, cleaning, and transforming raw data prior to processing and analysis. It is an essential first step and often involves reformatting data, correcting errors, and combining data sets to enrich the data.

Data preparation is often a lengthy undertaking for data professionals and business users, but it is a prerequisite for putting data in context, turning it into insights, and eliminating the bias that results from poor data quality.

The process usually includes standardizing data formats, enriching source data, and removing outliers, with the goal of consolidating data into a single file or data table ready for analysis.

Why Prepare Data?

There are several reasons why we need to prepare the data.

· Prepared data lets the mining process produce better models, faster.

· Good data is essential for producing efficient models of any type.

· Data must be formatted to suit the requirements of the software tool being used.

· Data must be made suitable for the chosen analysis method.

· Data in the real world is dirty.

Incomplete: Some data lack attribute values or attributes of interest, or contain only aggregate data.

For example, First name = “” or Last name = “”

Noisy: Some data contains errors.

For example, Age = -10

Inconsistent: Some data contain discrepancies in codes and names.

For example, Age = 56, Birthdate = ’04–05–1995’
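The three kinds of dirty data above can be detected programmatically. The sketch below uses pandas on a small hypothetical table (the column names and the plausible-age range are illustrative assumptions, not from the article):

```python
import pandas as pd

# Hypothetical patient records illustrating the three kinds of dirty data.
df = pd.DataFrame({
    "first_name": ["Ana", "", "Raj"],
    "age": [31, -10, 56],
    "birthdate": ["1990-01-12", "2005-06-30", "1995-05-04"],
})

# Incomplete: empty strings standing in for missing names.
incomplete = int(df["first_name"].eq("").sum())

# Noisy: ages outside any plausible range.
noisy = df[~df["age"].between(0, 120)]

# Inconsistent: a plausible age that disagrees with the birth year
# (checked against 2021, when this article was published).
birth_year = pd.to_datetime(df["birthdate"]).dt.year
inconsistent = df[df["age"].between(0, 120)
                  & ((2021 - birth_year - df["age"]).abs() > 1)]

print(incomplete, len(noisy), len(inconsistent))  # 1 1 1
```

Each check flags one row: the empty name, the age of -10, and the age of 56 paired with a 1995 birthdate.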

Benefits of Data Preparation:

Data preparation helps:

· Fix errors quickly — Data preparation helps catch errors before processing. After data has been removed from its original source, these errors become more difficult to understand and correct.

· Produce top-quality data — Cleaning and reformatting datasets ensures that all data used in analysis will be high quality.

· Make better business decisions — higher quality data that can be processed and analyzed more quickly and efficiently leads to more timely, efficient and high-quality business decisions.

Data preparation steps:

The specifics of the data preparation process vary by industry, organization and need, but the framework remains largely the same.

The major tasks in data preparation are as follows:

1) Data discretization

2) Data cleaning

3) Data integration

4) Data transformation

5) Data reduction

Let’s understand each step in detail.

1) Data discretization:

Discretization is part of data reduction and is particularly important for numerical data: continuous values are mapped to a reduced set of intervals or labels.

2) Data cleaning:

Manual data prep is error-prone, time-consuming and costly. Business decisions rely on analytics, but if the data is inaccurate or incomplete, the analytics inform wrong business decisions. Bad analytics means poor business decisions. Tools such as Altair Monarch ship with over 80 pre-built data preparation functions to speed up arduous data cleansing projects.
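Monarch is a commercial product, but the same cleansing operations can be sketched with pandas. The table, its column names, and the imputation strategy below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# A hypothetical messy table: inconsistent casing, stray whitespace,
# a missing name, a missing age, and an impossible age.
df = pd.DataFrame({
    "name": [" alice ", "BOB", "carol", None],
    "age":  [29.0, np.nan, 41.0, -5.0],
})

cleaned = df.assign(
    name=df["name"].str.strip().str.title(),         # normalize text
    age=df["age"].where(df["age"].between(0, 120)),  # invalid ages -> NaN
).dropna(subset=["name"])                            # drop rows with no name

# Impute remaining missing ages with the median of the valid ones.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
print(cleaned)
```

The result is three rows named Alice, Bob and Carol, with the missing age filled in by the median (35.0) of the two valid ages.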

3) Data integration:

Data integration means accessing data from any source, whatever its origin or format, and combining it into one data set. Monarch excels at intelligently and automatically extracting data from complex unstructured and semi-structured sources, such as PDFs. Increased access to data means less manual work, faster insights, and faster time to value for your organization.

4) Data transformation:

Being able to quickly change the way data is summarized and presented enables business analysts and executives to quickly consider different perspectives and views of the data. Monarch makes it easy to package your clean and blended data for insightful reporting you can confidently share with the rest of your organization.
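Re-summarizing the same records from a different perspective is a one-line pivot in pandas (the sales table below is a hypothetical example):

```python
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "product": ["A", "B", "A", "B"],
    "units":   [10, 4, 7, 9],
})

# Transform the long table into a different view: one row per month,
# one column per product.
view = sales.pivot_table(index="month", columns="product",
                         values="units", aggfunc="sum")
print(view)
```

The same four rows now read as a month-by-product grid, ready for a report.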

5) Data reduction:

Data reduction obtains a representation that is much smaller in volume yet produces the same or similar analytical results.
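One simple reduction technique is random sampling: a small sample can stand in for the full data while preserving its summary statistics. A NumPy sketch on synthetic data (the distribution and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
full = rng.normal(loc=50.0, scale=5.0, size=100_000)

# Reduced representation: a 1% random sample of the full data.
sample = rng.choice(full, size=1_000, replace=False)

# The sample mean tracks the full mean closely despite 100x less data.
print(full.mean(), sample.mean())
```

Other reduction strategies include aggregation, dimensionality reduction, and the discretization/binning discussed below.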

Data preparation example:

Consider a dataset with a column recording a virus name. Multiple values are commonly used to represent the same virus: COVID-19 could appear as 'SARS-CoV-2', 'Corona', 'Covid' or 'Covid-19', to name a few.

A data preparation tool could be used in this scenario to flag an unexpectedly high number of unique values: in the virus column, a unique count larger than the handful of known aliases would raise a flag. These values would then be standardized so that every row uses a single canonical form, either only an abbreviation or only the full spelling.
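This flag-then-standardize workflow can be sketched with a pandas mapping (the alias table and canonical label below are illustrative assumptions):

```python
import pandas as pd

records = pd.Series(["Covid-19", "Corona", "SARS-CoV-2", "Covid", "covid-19"])

# Flag: far more distinct spellings than actual diseases.
distinct_before = records.str.lower().nunique()
print(distinct_before)  # 4

# Standardize: map every known alias to one canonical name.
canonical = {
    "corona": "COVID-19", "covid": "COVID-19",
    "covid-19": "COVID-19", "sars-cov-2": "COVID-19",
}
standardized = records.str.lower().map(canonical)
print(standardized.nunique())  # 1
```

In practice the alias map would be maintained as reference data, and any value that maps to nothing (NaN) would be routed to a human for review.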

Data binning:

Data binning, also called bucketing, is a data pre-processing method used to minimize the effects of small observation errors. The original data values are divided into small intervals known as bins, and each value is then replaced by a general value calculated for its bin. This has a smoothing effect on the input data and may also reduce the risk of overfitting on small datasets.

There are 2 methods of dividing data into bins.

Equal Frequency Binning: bins have equal frequency.

Equal Width Binning: bins have equal width, with boundaries defined as [min, min + w), [min + w, min + 2w), ..., [min + (n - 1)w, max], where w = (max - min) / n and n is the number of bins.
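Both methods are available directly in pandas: `pd.cut` performs equal-width binning and `pd.qcut` performs equal-frequency binning. A small sketch on hypothetical ages:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 27, 35, 40, 52, 61])

# Equal-width: 4 bins of width w = (61 - 18) / 4 = 10.75 each.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency: 4 bins holding (roughly) the same number of values.
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts(sort=False).tolist())  # [4, 1, 1, 2]
print(equal_freq.value_counts(sort=False).tolist())   # [2, 2, 2, 2]
```

Note how equal-width bins can end up unevenly populated (four values fall in the first bin) while equal-frequency bins balance the counts at the cost of uneven widths.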

Importance of Data Binning: -

· Binning is used for reducing the cardinality of continuous and discrete data.

· Binning groups related values together in bins to reduce the number of distinct values.

· Binning can improve resource utilization and model build response time dramatically without significant loss in model quality.

· Binning can improve model quality by strengthening the relationship between attributes.

· Supervised binning is a form of intelligent binning in which important characteristics of the data are used to determine the bin boundaries.

· In supervised binning, the bin boundaries are identified by a single-predictor decision tree that takes into account the joint distribution with the target. Supervised binning can be used for both numerical and categorical attributes.
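A single-predictor decision tree of this kind can be sketched with scikit-learn; the synthetic data, the step at age 45, and the use of `DecisionTreeClassifier` here are illustrative assumptions rather than a specific product's implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: a binary target that flips at age 45, the boundary
# a supervised binner should recover from the joint distribution.
rng = np.random.default_rng(seed=1)
age = rng.uniform(18, 80, size=500)
target = (age > 45).astype(int)

# Single-predictor tree; its split thresholds are the bin boundaries.
tree = DecisionTreeClassifier(max_leaf_nodes=2, random_state=0)
tree.fit(age.reshape(-1, 1), target)

boundaries = sorted(float(t) for t in tree.tree_.threshold if t != -2.0)
print(boundaries)  # one boundary close to 45
```

Unlike equal-width or equal-frequency binning, the boundary here is chosen because it separates the target classes, not because of the shape of the age distribution alone.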

Advantages (Pros) of data smoothing

· Data smoothing makes important hidden patterns in the data set easier to see.

· Data smoothing can be used to help predict trends, which supports making the right decisions at the right time.

· Data smoothing helps in getting accurate results from the data.

Cons of data smoothing

· Data smoothing doesn’t always provide a clear explanation of the patterns among the data.

· Some data points may be ignored in favor of others, so fine-grained detail can be lost.
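The classic binning-based smoother, smoothing by bin means, illustrates both sides of this trade-off. A minimal NumPy sketch, assuming the input values are already sorted:

```python
import numpy as np

# Smoothing by bin means: replace each value with the mean of its bin.
values = np.array([4.0, 8.0, 9.0, 15.0, 21.0, 21.0, 24.0, 25.0])

n_bins = 2  # two equal-frequency bins of 4 sorted values each
bins = values.reshape(n_bins, -1)
smoothed = np.repeat(bins.mean(axis=1), bins.shape[1])
print(smoothed)  # [ 9.    9.    9.    9.   22.75 22.75 22.75 22.75]
```

The two clusters in the data survive clearly (around 9 and around 23), but the within-bin variation, such as the gap between 4 and 15, is discarded, which is exactly the loss of detail the cons above describe.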



Code is like humor. When you have to explain it, it’s bad. Connect with me on LinkedIn : https://www.linkedin.com/in/imswapnilb