A simple introduction to Data Pre-processing

Dumindu Kavinda
Published in Analytics Vidhya
4 min read · Apr 4, 2020

Hello World !!! This is the second article in my Data Science Tutorial series, and I am happy to welcome you here. Throughout this tutorial, I’m going to explain the core concepts in data science. I hope reading this article also motivates you to learn more about Data Science.

Introduction

Data pre-processing is an important step in the data mining process. Real-world datasets are often incomplete, inconsistent, and lacking in certain behaviours or trends. Left untreated, this kind of data leads to misleading results.

For example, these are some of the issues that occur at the data collection step:

  • Out of range values (Age: -12)
  • Impossible data combinations (Gender: Male, Yes, Short)
  • Missing values
  • Data that is not in a proper structure
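As a minimal sketch of how such issues might be detected, the snippet below scans a few records for missing and out-of-range ages. The records, the `find_issues` helper, and the 0 to 120 age limits are all assumptions made for illustration; it uses plain Python so that it runs without extra libraries.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"name": "Alice", "age": 34},
    {"name": "Bob", "age": -12},     # out-of-range value
    {"name": "Carol", "age": None},  # missing value
]


def find_issues(rows, low=0, high=120):
    """Return (row index, problem) pairs for missing or out-of-range ages."""
    issues = []
    for i, row in enumerate(rows):
        age = row.get("age")
        if age is None:
            issues.append((i, "missing age"))
        elif not low <= age <= high:
            issues.append((i, "age out of range"))
    return issues


issues = find_issues(records)
print(issues)
```

A report like this only flags the problems; deciding how to fix them belongs to the data cleaning step.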

Why Data Preprocessing?

If the data contain much irrelevant and redundant information, or are noisy and unreliable, then knowledge discovery during the training phase becomes more difficult. Data preparation and filtering steps can take a considerable amount of processing time, but we need to get rid of these issues to perform better in any kind of analysis. Data preprocessing is a proven method of resolving such issues.

Major Steps in Data pre-processing

  1. Data Cleaning
  2. Data Integration
  3. Data Reduction
  4. Data Transformation

1. Data Cleaning

The data can have many irrelevant and missing parts. To handle this, we have to identify the issues in the dataset and choose suitable techniques to resolve them.

Here are some common issues with datasets.

  1. Missing Values
  2. Duplicate rows
  3. Outliers
  4. Data type issues
  5. Typos
  6. Irrelevant data

If you would like to read more about data cleaning, I highly recommend my article on the topic: Data Cleaning in Nutshell
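To make a few of these issues concrete, here is a small sketch that removes duplicate rows, fixes a data type issue, and fills a missing value. The rows and the `clean` helper are hypothetical, and filling with the mean is just one possible strategy, not the only correct one.

```python
from statistics import mean

# Hypothetical raw rows with three of the issues listed above.
raw_rows = [
    {"id": 1, "age": "25"},  # age stored as text: a data type issue
    {"id": 2, "age": None},  # missing value
    {"id": 1, "age": "25"},  # duplicate of the first row
    {"id": 3, "age": "35"},
]


def clean(rows):
    # 1. Drop exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))
    # 2. Convert age from text to an integer where a value is present.
    for row in unique:
        if row["age"] is not None:
            row["age"] = int(row["age"])
    # 3. Fill missing ages with the mean of the known ages (an assumption).
    known = [row["age"] for row in unique if row["age"] is not None]
    fill_value = mean(known)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill_value
    return unique


cleaned = clean(raw_rows)
print(cleaned)
```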

2. Data Integration

Data integration is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process and includes steps such as cleansing, ETL mapping, and transformation. Data integration ultimately enables analytics tools to produce effective, actionable insights.

Some advantages of Data Integration:

  • Saves time and boosts efficiency
  • Reduces errors (and rework)
  • Delivers more valuable data
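A unified view can be sketched as a join between two sources on a shared key. The two sources, the column names, and the `integrate` helper below are assumptions for illustration only; real integration pipelines also involve the ingestion and cleansing steps mentioned above.

```python
# Two hypothetical sources describing the same customers.
crm = [
    {"customer_id": 1, "name": "Alice"},
    {"customer_id": 2, "name": "Bob"},
]
billing = [
    {"customer_id": 1, "total_spent": 120.0},
    {"customer_id": 3, "total_spent": 40.0},
]


def integrate(left, right, key):
    """Left-join two lists of records on a shared key."""
    lookup = {row[key]: row for row in right}
    extra_columns = sorted({col for row in right for col in row} - {key})
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        combined = dict(row)
        for col in extra_columns:
            combined[col] = match.get(col)  # None when there is no match
        merged.append(combined)
    return merged


unified = integrate(crm, billing, "customer_id")
print(unified)
```

Bob has no billing record, so his `total_spent` stays `None`, which is exactly the kind of missing value the cleaning step then has to deal with.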

3. Data Reduction

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.

Data Reduction Strategies:

1. Data Cube Aggregation

Aggregation operations are applied to the data in the construction of a data cube.
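As a rough illustration, rolling quarterly figures up to yearly totals per region is one such aggregation. The sales figures and the `roll_up` helper are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical quarterly sales: (year, quarter, region, amount).
quarterly_sales = [
    ("2019", "Q1", "North", 100),
    ("2019", "Q2", "North", 150),
    ("2019", "Q1", "South", 80),
    ("2020", "Q1", "North", 120),
    ("2020", "Q2", "South", 90),
]


def roll_up(rows):
    """Aggregate away the quarter dimension, keeping year and region."""
    totals = defaultdict(int)
    for year, _quarter, region, amount in rows:
        totals[(year, region)] += amount
    return dict(totals)


annual_sales = roll_up(quarterly_sales)
print(annual_sales)
```

Five rows shrink to four here; on real data with many quarters, the reduction is much larger.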

2. Dimensionality Reduction

In dimensionality reduction, redundant attributes are detected and removed, which reduces the data set size.
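One simple way to detect a redundant attribute is to check whether it is almost perfectly correlated with an attribute that is already kept. The sketch below assumes that approach; the columns, the `drop_redundant` helper, and the 0.95 threshold are illustrative choices, and other techniques (such as PCA) are also common.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not strongly correlated with a kept one."""
    kept = {}
    for name, values in columns.items():
        redundant = any(
            abs(pearson(values, other)) >= threshold for other in kept.values()
        )
        if not redundant:
            kept[name] = values
    return kept


# Hypothetical attributes; height_in repeats height_cm in other units.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "weight_kg": [70.0, 52.0, 81.0, 64.0],
}
reduced = drop_redundant(columns)
print(list(reduced))
```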

3. Data Compression

Encoding mechanisms are used to reduce the data set size.
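For instance, Python's built-in `zlib` module applies a lossless encoding, so the original data can be restored exactly. The repeated sensor readings below are made-up sample data.

```python
import zlib

# Hypothetical, highly repetitive sensor readings serialized as text.
readings = ",".join(["23.5"] * 1000).encode("utf-8")

compressed = zlib.compress(readings)
restored = zlib.decompress(compressed)

print(len(readings), "bytes before,", len(compressed), "bytes after")
print("restored exactly:", restored == readings)
```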

4. Numerosity Reduction

In numerosity reduction, the data are replaced or estimated by alternative, smaller representations, such as the parameters of a model, histograms, clusters, or samples.
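As a small sketch of the idea, a least-squares line can stand in for a set of points, so only two numbers need to be stored instead of the whole data set. The data below are deliberately made perfectly linear, and `fit_line` is a hypothetical helper written for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that happens to follow y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]

slope, intercept = fit_line(xs, ys)

# The ten points can now be estimated from just two stored parameters.
estimated = [slope * x + intercept for x in xs]
print(slope, intercept)
```

Real data rarely fit a line this well, so the estimates would then only approximate the original values.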

5. Discretization and concept hierarchy generation

Raw data values for attributes are replaced by ranges or higher conceptual levels.
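A quick sketch: a raw age can be replaced by a range, and the range by a higher-level concept. The cut-off points and labels below are assumptions chosen for the example, not standard definitions.

```python
def discretize_age(age):
    """Map a raw age to a (range, concept) pair using assumed cut-offs."""
    if age < 13:
        return "0-12", "child"
    if age < 20:
        return "13-19", "teenager"
    if age < 65:
        return "20-64", "adult"
    return "65+", "senior"


ages = [7, 15, 34, 70]
labels = [discretize_age(age) for age in ages]
print(labels)
```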

4. Data Transformation

In the data transformation process, data are transformed from one format to another that is more appropriate for data mining.

Some Data Transformation Strategies:

1. Smoothing

Smoothing is a process of removing noise from the data.
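One common smoothing technique is a moving average, sketched below; binning and regression are other options. The readings and the window size of 3 are arbitrary choices for illustration.

```python
def moving_average(values, window=3):
    """Replace each value with the mean of itself and its neighbours."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        neighbours = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


# Hypothetical noisy readings.
noisy = [10, 14, 9, 15, 11, 13]
smoothed = moving_average(noisy)
print(smoothed)
```

The smoothed series swings less than the original, at the cost of blurring any genuine sharp changes.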

2. Aggregation

Aggregation is a process where summary operations are applied to the data. For example, daily sales figures can be aggregated into monthly or annual totals.

3. Generalization

In generalization, low-level data are replaced with higher-level concepts by climbing concept hierarchies. For example, a city can be generalized to its country.
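A concept hierarchy can be sketched as a simple lookup from the low-level value to the higher-level one. The city-to-country mapping and the `generalize` helper below are hypothetical.

```python
# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Colombo": "Sri Lanka",
    "Kandy": "Sri Lanka",
    "Chennai": "India",
}


def generalize(cities, hierarchy):
    """Replace each city with its higher-level concept (country)."""
    return [hierarchy.get(city, "Unknown") for city in cities]


countries = generalize(["Kandy", "Chennai", "Colombo", "Atlantis"], CITY_TO_COUNTRY)
print(countries)
```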

4. Normalization

Normalization scales attribute data to fall within a small specified range, such as 0.0 to 1.0.
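Min-max normalization is one common way to do this (z-score normalization is another). The sketch below assumes min-max scaling; the income values are made up.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are equal, so there is no spread to rescale.
        return [new_min for _ in values]
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


incomes = [20000, 35000, 50000]
normalized = min_max_normalize(incomes)
print(normalized)  # → [0.0, 0.5, 1.0]
```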

5. Attribute Construction

In Attribute construction, new attributes are constructed from the given set of attributes.
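For example, an area attribute can be constructed from existing width and height attributes. The plot records and the `add_area` helper are illustrative assumptions.

```python
# Hypothetical land plots described by width and height.
plots = [
    {"width": 10.0, "height": 5.0},
    {"width": 4.0, "height": 2.5},
]


def add_area(rows):
    """Return copies of the rows with a constructed 'area' attribute."""
    return [{**row, "area": row["width"] * row["height"]} for row in rows]


enriched = add_area(plots)
print(enriched)
```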

Final Thoughts !!!

Data pre-processing is an important step in preparing raw data for statistical analysis. Several distinct steps are involved in pre-processing raw data, as described in this article: cleaning, integration, reduction, and transformation. Throughout the process, it is important to understand the choices made in the pre-processing steps and how different methods can impact the validity and applicability of study results. So use these little techniques to sharpen your data science project and get better results.

See you next time.

Until then enjoy ‘DATA PRE PROCESSING’!
