Basics of Data Preprocessing
Basic Understandings and Techniques of Data Preprocessing
What is Data Preprocessing?
According to Techopedia, Data Preprocessing is a Data Mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviours or trends and is likely to contain many errors. Data Preprocessing is a proven method of resolving such issues.
Suad A. Alasadi and Wesam S. Bhaya state in their journal that Data Preprocessing is one of the most important Data Mining steps, dealing with the preparation and transformation of the data set while seeking at the same time to make knowledge discovery more efficient.
In other words, Data Preprocessing is a step in Data Mining which provides techniques that help us both to understand the data and to make knowledge discovery from it more efficient.
Why Do We Need Data Preprocessing?
Mirela Danubianu states in her journal that real-world data tend to be incomplete, noisy, and inconsistent. This leads to poor quality in the collected data and, in turn, to low-quality models built on such data. To address these issues, Data Preprocessing provides operations which organise the data into a proper form for better understanding in the data mining process.
The image above shows an example of raw data: a sample of the Iris data set. In this form, we cannot easily understand the behaviours or trends in the data. Hence, we need to transform and organise it into a proper format by using Data Preprocessing.
What are the Techniques Provided in Data Preprocessing?
There are four methods of Data Preprocessing which are explained by A. Sivakumar and R. Gunasundari in their journal. They are Data Cleaning/Cleansing, Data Integration, Data Transformation, and Data Reduction.
1. Data Cleaning/Cleansing
Real-world data tend to be incomplete, noisy, and inconsistent. Data Cleaning/Cleansing routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Data can be noisy, containing incorrect attribute values. There are several possible causes: the data collection instruments used may be faulty, human or computer errors may have occurred at data entry, or errors may have occurred during data transmission.
“Dirty” data can cause confusion for the mining procedure. Although most mining routines have some procedures for dealing with incomplete or noisy data, these are not always robust. Therefore, a useful Data Preprocessing step is to run the data through some Data Cleaning/Cleansing routines.
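As an illustrative sketch, the cleaning routines described above can be expressed in plain Python on a toy temperature series. The readings, the plausible range, and the choice of mean-filling for missing values are all assumptions made for this example:

```python
# A minimal sketch of data cleaning/cleansing on a hypothetical temperature
# series: None marks a missing value, 500.0 is an obviously noisy reading.
readings = [21.5, None, 22.0, 21.8, 500.0, None, 22.3]

# 1. Correct inconsistencies: discard values outside a plausible range
#    (0 to 50 degrees C here, an assumed domain constraint).
plausible = [v for v in readings if v is None or 0 <= v <= 50]

# 2. Fill in missing values with the mean of the remaining observed values.
observed = [v for v in plausible if v is not None]
mean = round(sum(observed) / len(observed), 2)
cleaned = [v if v is not None else mean for v in plausible]

print(cleaned)  # the noisy reading is gone; gaps are filled with the mean
```

In practice the fill strategy (mean, median, a learned estimate) and the outlier rule are chosen per attribute; this sketch only shows the overall shape of a cleaning routine.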
2. Data Integration
Data Integration is the data analysis task of combining data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. A tricky issue to be considered in Data Integration is schema integration.
How can real-world entities from multiple data sources be ‘matched up’? This is referred to as the entity identification problem. For example, how can a data analyst be sure that customer_id in one database and cust_number in another refer to the same entity? The answer is metadata, which databases and data warehouses typically have. Simply put, metadata is data about data.
Metadata is used to help avoid errors in schema integration. Another important issue is redundancy. An attribute may be redundant if it can be derived from another table. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
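To make the entity identification problem concrete, here is a hedged sketch in plain Python. The two record sets, their field names, and the fact that metadata has told us customer_id and cust_number identify the same entity are all hypothetical:

```python
# Hypothetical source 1: orders keyed by customer_id.
orders = [
    {"customer_id": 1, "order_total": 250.0},
    {"customer_id": 2, "order_total": 99.0},
]
# Hypothetical source 2: customer profiles keyed by cust_number.
profiles = [
    {"cust_number": 1, "city": "Oslo"},
    {"cust_number": 2, "city": "Bergen"},
]

# Schema integration: metadata tells us cust_number == customer_id,
# so we can build a lookup on that key and join the two sources.
city_by_id = {p["cust_number"]: p["city"] for p in profiles}
integrated = [{**o, "city": city_by_id[o["customer_id"]]} for o in orders]

print(integrated)
```

The join itself is trivial once the keys are matched; the hard part, as the text notes, is knowing (via metadata) that the two differently named attributes refer to the same real-world entity.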
3. Data Transformation
Data are transformed into forms appropriate for mining. Data Transformation involves the following:
- In Normalisation, the attribute data are scaled to fall within a small specified range, such as -1.0 to 1.0, or 0 to 1.0.
- Smoothing works to remove the noise from the data. Such techniques include binning, clustering, and regression.
- In Aggregation, summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
- In Generalisation of the Data, low-level or primitive/raw data are replaced by higher-level concepts through the use of concept hierarchies. For example, a categorical attribute such as street can be generalised to a higher-level concept such as city or country. Similarly, the values of a numeric attribute such as age may be mapped to higher-level concepts like young, middle-aged, or senior.
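Three of the transformations above can be sketched in a few lines of Python. All the values here, the min-max scaling choice for normalisation, and the age cut-offs for generalisation are assumptions made for illustration:

```python
# Normalisation: min-max scaling of attribute values into the range 0 to 1.
ages = [20, 35, 50, 65]
lo, hi = min(ages), max(ages)
normalised = [round((a - lo) / (hi - lo), 2) for a in ages]

# Aggregation: hypothetical daily sales summed into monthly totals.
daily_sales = [("2024-01-03", 120.0), ("2024-01-17", 80.0), ("2024-02-05", 60.0)]
monthly = {}
for date, amount in daily_sales:
    month = date[:7]  # "YYYY-MM"
    monthly[month] = monthly.get(month, 0.0) + amount

# Generalisation: map raw ages onto higher-level concepts via an assumed
# concept hierarchy (the cut-offs 30 and 60 are arbitrary here).
def age_group(age):
    return "young" if age < 30 else "middle-aged" if age < 60 else "senior"

groups = [age_group(a) for a in ages]
print(normalised, monthly, groups)
```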
4. Data Reduction
Complex data analysis and mining on huge amounts of data can take a very long time, making such analysis impractical or infeasible. Data Reduction techniques help by producing a reduced representation of the data set that can be analysed without compromising the integrity of the original data, while still yielding the same qualitative knowledge. Strategies for data reduction include the following:
- In Data Cube Aggregation, aggregation operations are applied to the data in the construction of a data cube.
- In Dimension Reduction, irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
- In Data Compression, encoding mechanisms are used to reduce the data set size. Methods used for Data Compression include the Wavelet Transform and Principal Component Analysis.
- In Numerosity Reduction, data is replaced or estimated by alternative and smaller data representations such as parametric models (which store only the model parameters instead of the actual data, e.g. Regression and Log-Linear Models) or non-parametric methods (e.g. Clustering, Sampling, and the use of histograms).
- In Discretisation and Concept Hierarchy Generation, raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction and are powerful tools for data mining.
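Two of the reduction strategies above, numerosity reduction by sampling and discretisation into ranges, can be sketched in plain Python. The record count, sampling fraction, and bin edges are all assumptions for the example:

```python
import random

# Numerosity reduction by sampling: keep a random 10% of the records.
random.seed(0)  # seeded only so the sketch is repeatable
records = list(range(1000))  # stand-in for 1000 hypothetical records
sample = random.sample(records, k=len(records) // 10)

# Discretisation: replace raw age values with ranges (bin edges assumed).
def discretise(age):
    return "0-29" if age < 30 else "30-59" if age < 60 else "60+"

ages = [12, 45, 71, 33]
binned = [discretise(a) for a in ages]
print(len(sample), binned)
```

The sample is one-tenth the size of the original yet can support approximate answers to many queries, and the binned ranges are exactly the kind of higher conceptual level that a concept hierarchy provides.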