A simple introduction to Data Pre-processing
Hello World !!! This is the second article in the Data Science Tutorial series, and I am happy to welcome you here. The main intention of this article is to explain core concepts in data science, starting with data pre-processing. I hope reading it will also motivate you to learn more about data science.
Introduction :
Data pre-processing is an important step in the data mining process. Real-world datasets are often incomplete, inconsistent, and lacking in certain behaviours or trends. As a result, this kind of data can lead to misleading results.
For example,
- Out of range values (Age: -12)
- Impossible data combinations (Gender: Male, Pregnant: Yes)
- Missing Values
- Not in a proper structure
are some of the issues that occur at the data collection step.
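To make these issues concrete, here is a minimal sketch that flags them in a few records. The records, the field names (including `pregnant`) and the 0 to 120 age range are assumptions made up for illustration, not part of any real dataset.

```python
# Hypothetical records; field names and the valid age range are assumptions.
records = [
    {"name": "Ann", "age": 34, "gender": "Female", "pregnant": "No"},
    {"name": "Bob", "age": -12, "gender": "Male", "pregnant": "No"},
    {"name": "Cid", "age": 41, "gender": "Male", "pregnant": "Yes"},
    {"name": "Dee", "age": None, "gender": "Female", "pregnant": "Yes"},
]


def find_issues(record):
    """Return a list of human-readable problems found in one record."""
    issues = []
    # Missing values: any field that is None or an empty string.
    for field, value in record.items():
        if value is None or value == "":
            issues.append(f"missing value: {field}")
    # Out-of-range values: age must fall inside an assumed plausible range.
    age = record.get("age")
    if isinstance(age, (int, float)) and not 0 <= age <= 120:
        issues.append(f"out-of-range age: {age}")
    # Impossible combinations: two fields that contradict each other.
    if record.get("gender") == "Male" and record.get("pregnant") == "Yes":
        issues.append("impossible combination: gender=Male, pregnant=Yes")
    return issues


report = {r["name"]: find_issues(r) for r in records}
for name, issues in report.items():
    print(name, issues)
```

A check like this only finds the problems. Deciding what to do with them is the job of the steps described in the rest of this article.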
Why Data Preprocessing?
If the data contains a lot of irrelevant and redundant information, or is noisy and unreliable, then knowledge discovery during the training phase becomes more difficult. Data preparation and filtering steps can take a considerable amount of processing time, but we need to get rid of these issues to perform well in any kind of analysis. Data preprocessing is a proven method of resolving such issues.
Major Steps in Data pre-processing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation
1.Data Cleaning
The data can have many irrelevant and missing parts. To handle this, we have to identify the issues in the dataset and decide which techniques to use to resolve them.
Here are some common issues with datasets.
- Missing Values
- Duplicate rows
- Outliers
- Data type issues
- Typos
- Irrelevant data
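Here is a small sketch of how some of the issues above might be handled in plain Python. The rows, the column names, the `max_age` cutoff and the choice of the median as a fill value are all assumptions for illustration. A real project would pick these rules based on the data.

```python
from statistics import median

# Hypothetical raw rows; column names and values are made up.
raw_rows = [
    {"id": "1", "city": "Colombo", "age": "29"},
    {"id": "2", "city": "colombo ", "age": ""},
    {"id": "2", "city": "colombo ", "age": ""},   # exact duplicate row
    {"id": "3", "city": "Kandy", "age": "31"},
    {"id": "4", "city": "Kandy", "age": "250"},   # outlier
]


def clean(rows, max_age=120):
    # 1. Drop exact duplicate rows while keeping the original order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # 2. Fix data types and tidy text (trim spaces, consistent case).
    typed = []
    for row in unique:
        typed.append({
            "id": int(row["id"]),
            "city": row["city"].strip().title(),
            "age": int(row["age"]) if row["age"] else None,
        })

    # 3. Treat ages above max_age as invalid (a very simple outlier rule).
    for row in typed:
        if row["age"] is not None and row["age"] > max_age:
            row["age"] = None

    # 4. Fill missing ages with the median of the known ages.
    known = [row["age"] for row in typed if row["age"] is not None]
    fill = median(known)
    for row in typed:
        if row["age"] is None:
            row["age"] = fill
    return typed


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

The order of the steps matters here: the outlier is removed before the median is computed, so it does not distort the value used to fill the gaps.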
If you would like to read more about data cleaning, I highly recommend my article on the topic: Data Cleaning in Nutshell
2.Data Integration
Data integration is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process and includes steps such as cleansing, ETL mapping, and transformation. Data integration ultimately enables analytics tools to produce effective, actionable insights.
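As a rough illustration of combining two sources into a single view, the sketch below joins a hypothetical CRM export with a hypothetical billing export. The source names, the field names and the mismatched key (`customer_id` vs `cust`) are assumptions chosen to show the mapping step.

```python
# Two hypothetical sources describing the same customers.
crm_rows = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
    {"customer_id": 3, "name": "Cid"},
]
billing_rows = [
    {"cust": 1, "total_spent": 120.0},
    {"cust": 3, "total_spent": 75.5},
]


def integrate(crm, billing):
    """Left-join billing data onto CRM data using the customer id."""
    # Map the billing source's key name ("cust") onto the shared key.
    spent_by_id = {row["cust"]: row["total_spent"] for row in billing}
    unified = []
    for row in crm:
        unified.append({
            "customer_id": row["customer_id"],
            "name": row["name"],
            # Customers with no billing record get None rather than a guess.
            "total_spent": spent_by_id.get(row["customer_id"]),
        })
    return unified


unified_view = integrate(crm_rows, billing_rows)
for row in unified_view:
    print(row)
```

Keeping unmatched customers with a `None` value, instead of dropping them, leaves the decision about missing data to the cleaning step.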
Some advantages of data integration:
- Saves time and boosts efficiency
- Reduces errors (and rework)
- Delivers more valuable data
3.Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Data Reduction Strategies:-
1. Data Cube Aggregation
Aggregation operations are applied to the data in the construction of a data cube.
2. Dimensionality Reduction
In dimensionality reduction, irrelevant or redundant attributes are detected and removed, which reduces the data set size.
3. Data Compression
Encoding mechanisms are used to reduce the data set size.
4. Numerosity Reduction
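In numerosity reduction, the data are replaced or estimated by alternative, smaller data representations, such as parametric models or non-parametric methods like clustering, sampling, and histograms.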
5. Discretization and Concept Hierarchy Generation
Raw data values for attributes are replaced by ranges or higher conceptual levels.
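The five strategies above can each be sketched in a few lines of plain Python. Everything below is a simplified illustration: the data, the 0.95 correlation threshold, the age ranges and the use of `zlib` and random sampling are assumptions, and each strategy has many other possible implementations.

```python
import random
import zlib
from collections import defaultdict


def aggregate_by_year(rows):
    """Data cube aggregation: roll quarterly amounts up to yearly totals."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def drop_redundant(columns, threshold=0.95):
    """Dimensionality reduction: keep a column only if it is not highly
    correlated with a column that has already been kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


def sample_rows(rows, k, seed=0):
    """Numerosity reduction: represent the data by a random sample of k rows."""
    return random.Random(seed).sample(rows, k)


def discretize_age(age):
    """Discretization: replace a raw age with an (assumed) age range."""
    if age < 18:
        return "0-17"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65+"


# Hypothetical quarterly sales: (year, quarter, amount).
quarterly_sales = [
    (2022, "Q1", 100), (2022, "Q2", 150), (2022, "Q3", 120), (2022, "Q4", 130),
    (2023, "Q1", 110), (2023, "Q2", 160), (2023, "Q3", 140), (2023, "Q4", 190),
]
yearly = aggregate_by_year(quarterly_sales)

# height_m carries the same information as height_cm, so it is redundant.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_m": [1.5, 1.6, 1.7, 1.8],
    "weight_kg": [70, 52, 88, 64],
}
reduced = drop_redundant(columns)

# Data compression: lossless encoding of repetitive text with zlib.
raw = ("red,green,blue\n" * 200).encode("utf-8")
compressed = zlib.compress(raw)

sample = sample_rows(quarterly_sales, k=3)
age_ranges = [discretize_age(a) for a in [5, 18, 39, 40, 70]]

print(yearly)
print(list(reduced))
print(len(raw), "->", len(compressed), "bytes")
print(age_ranges)
```

In each case the reduced form is smaller than the original, and the goal is that an analysis run on it gives the same, or almost the same, answer.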
4.Data Transformation
In the data transformation process, data are transformed from one format into another that is more appropriate for data mining.
Some Data Transformation Strategies:-
1. Smoothing
Smoothing is a process of removing noise from the data.
2. Aggregation
Aggregation is a process where summary or aggregation operations are applied to the data.
3. Generalization
In generalization, low-level data are replaced with higher-level concepts by climbing concept hierarchies.
4. Normalization
Normalization scales attribute data to fall within a small specified range, such as 0.0 to 1.0.
5. Attribute Construction
In attribute construction, new attributes are constructed from the given set of attributes.
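The transformation strategies above can be sketched in the same way. This is only one possible take on each: a moving average stands in for smoothing, min-max scaling stands in for normalization, and the city-to-country hierarchy, the field names and the sample values are all hypothetical.

```python
from collections import defaultdict


def moving_average(values, window=3):
    """Smoothing: average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def monthly_totals(daily):
    """Aggregation: sum daily (date_string, amount) pairs per month."""
    totals = defaultdict(float)
    for date, amount in daily:
        totals[date[:7]] += amount  # "YYYY-MM-DD" -> "YYYY-MM"
    return dict(totals)


# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Colombo": "Sri Lanka", "Kandy": "Sri Lanka", "Chennai": "India"}


def generalize(city):
    """Generalization: climb one level up the concept hierarchy."""
    return CITY_TO_COUNTRY.get(city, "Unknown")


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so map them all to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def add_area(rows):
    """Attribute construction: derive a new 'area' attribute."""
    return [{**row, "area": row["width"] * row["height"]} for row in rows]


smoothed = moving_average([10, 12, 50, 11, 13])
totals = monthly_totals([("2024-01-05", 10.0), ("2024-01-20", 5.0), ("2024-02-01", 7.5)])
countries = [generalize(c) for c in ["Kandy", "Chennai", "Paris"]]
normalized = min_max_normalize([10, 20, 30])
with_area = add_area([{"width": 2, "height": 3}])

print(smoothed)
print(totals)
print(countries)
print(normalized)
print(with_area)
```

Min-max scaling is sensitive to outliers, because a single extreme value stretches the range, so it is usually applied after the cleaning step.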
Final Thoughts !!!
Data pre-processing is an important step in preparing raw data for statistical analysis. Several distinct steps are involved in pre-processing raw data, as described in this article: cleaning, integration, reduction, and transformation. Throughout the process, it is important to understand the choices made in each pre-processing step and how different methods can affect the validity and applicability of the results. So use these techniques to sharpen your data science project and get better results.
See you next time.
Until then enjoy ‘DATA PRE PROCESSING’!