Data Mining Introduction — Data Preprocessing

Saishruthi Swaminathan
Jan 3, 2018 · 3 min read


Source: http://data-mining.philippe-fournier-viger.com/introduction-data-mining/

Data preprocessing is crucial in any data mining process, as it directly impacts the success of the project. It reduces the complexity of the data under analysis, because real-world data is rarely clean.

Data is said to be unclean if it is missing attributes or attribute values, or if it contains noise, outliers, duplicates, or wrong values. The presence of any of these will degrade the quality of the results.

Here are a few important data preprocessing techniques that can be applied before getting into algorithm selection.

1. Aggregation

Aggregation combines two or more attributes into a single attribute. The purpose of aggregation can be to:

  • Reduce the number of attributes
  • Change the scale of the data, for example aggregating cities into regions
  • Reduce the variability in the data
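
As a minimal sketch, here is how the cities-to-regions example might look in pandas; the sales table and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical daily sales recorded at the city level
sales = pd.DataFrame({
    "city":   ["Chennai", "Mumbai", "Delhi", "Chennai", "Mumbai"],
    "region": ["South", "West", "North", "South", "West"],
    "sales":  [120, 340, 210, 150, 300],
})

# Aggregate cities up to regions: fewer rows, a coarser scale,
# and less variability than the raw city-level figures
regional = sales.groupby("region", as_index=False)["sales"].sum()
print(regional)
```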

2. Sampling

  • This technique helps with data reduction during initial and final data analysis.
  • It can be extremely useful when handling the entire data set is expensive or time consuming.
  • It works when the sample is representative of the entire data set.
  • Sampling can be done in two main ways.

i) Simple random sampling

The probability of selecting any item is the same. The sampling itself can be done in two ways: with replacement or without replacement, as sketched below.
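
A minimal sketch of both variants using pandas' DataFrame.sample; the 100-row frame is a stand-in data set:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})  # stand-in data set

# Without replacement: each row can be drawn at most once
no_repl = df.sample(n=10, replace=False, random_state=42)

# With replacement: the same row may be drawn more than once
with_repl = df.sample(n=10, replace=True, random_state=42)
```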

ii) Stratified sampling

Data is split into several partitions, and random samples are then drawn from each partition.
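
A sketch with pandas (GroupBy.sample requires pandas 1.1 or later; the imbalanced label column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["a"] * 90 + ["b"] * 10,  # two imbalanced partitions
    "value": range(100),
})

# Draw 10% from each partition, so the rare stratum "b"
# is still represented in the sample
stratified = df.groupby("label").sample(frac=0.1, random_state=0)
```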

3. Dimensionality reduction

Data sparsity increases as dimensionality increases, which makes operations like clustering and outlier detection less meaningful, since they depend heavily on density and on the distances between points.

The purpose of dimensionality reduction is to:

  • Avoid the curse of dimensionality
  • Reduce the time required by algorithms
  • Greatly reduce memory consumption
  • Ease visualization of the data
  • Eliminate irrelevant features
  • Examples: PCA, SVD, LLE, random projections, LDA, etc.
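
As a minimal sketch of one of these, here is PCA with scikit-learn; the random 200 × 50 matrix stands in for real data, and keeping 95% of the variance is just an illustrative setting:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```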

4. Feature Subset Selection

  • This is another way to reduce the dimensionality of the data
  • It removes duplicate or irrelevant features
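
As an illustrative sketch, duplicate columns and highly correlated pairs can be pruned with pandas; the toy frame and the 0.95 correlation threshold are assumptions, not fixed rules:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, 2, 3, 4],   # exact duplicate of "a"
    "c": [2, 4, 6, 8],   # perfectly correlated with "a"
    "d": [5, 1, 7, 3],
})

# Drop exact duplicate columns
df = df.T.drop_duplicates().T

# Drop one feature from every highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
```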

5. Feature Creation

New features can be created that capture the important and relevant information more effectively than the provided attributes. Three general feature creation methodologies are:

  • Feature extraction: e.g., extracting edges from images
  • Feature construction: e.g., obtaining density from given mass and volume
  • Mapping data into a new space
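
The density example is easy to make concrete; here is a one-line construction in pandas, with a hypothetical materials table:

```python
import pandas as pd

materials = pd.DataFrame({
    "mass":   [10.0, 4.0, 7.9],  # grams
    "volume": [2.0, 4.0, 1.0],   # cm^3
})

# Construct a new feature that is more informative than either input
materials["density"] = materials["mass"] / materials["volume"]
```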

6. Discretization

  • It is most commonly used for classification
  • It works by converting a continuous attribute into an ordinal attribute
  • It can be done in two ways: supervised and unsupervised discretization
  • Supervised methods use the class labels to find the break points
  • Unsupervised methods find breaks in the data values without using class labels, for example by visualizing the distribution or by the methods below
  • Discretization methods (for each of the methods below, first decide the number of buckets the data should be divided into):

i) Equal interval width: bucket width = (max_value - min_value) / number of buckets

ii) Equal frequency: each bucket holds roughly (number of values / number of buckets) items

iii) Clustering
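
A sketch of all three methods with pandas and scikit-learn, assuming three buckets and a made-up series of values:

```python
import pandas as pd
from sklearn.cluster import KMeans

values = pd.Series([1, 2, 3, 8, 9, 10, 20, 21, 22])

# i) Equal interval width: each bucket spans (max - min) / 3
equal_width = pd.cut(values, bins=3)

# ii) Equal frequency: each bucket gets roughly the same count
equal_freq = pd.qcut(values, q=3)

# iii) Clustering: let k-means discover the natural breaks
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    values.to_frame()
)
```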

7. Binarization

  • Maps a continuous or categorical attribute to one or more binary variables
  • Mainly used in association analysis.
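
A minimal sketch of both cases, one-hot encoding a categorical attribute with pandas and thresholding a continuous one with scikit-learn's Binarizer; the data and the threshold of 3.0 are made up:

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

# Categorical attribute -> one 0/1 column per category
colors = pd.DataFrame({"color": ["red", "green", "red", "blue"]})
one_hot = pd.get_dummies(colors["color"])

# Continuous attribute -> 0/1 around a threshold
ratings = [[2.5], [4.0], [1.0]]
binary = Binarizer(threshold=3.0).fit_transform(ratings)
```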

8. Attribute Transformation

  • Map attribute values to a completely new set of values using a simple function such as x^k, log(x), or |x|
  • Normalization: adjusts for differences among attributes in terms of mean, range, variance, and frequency of occurrence.
  • Standardization: subtracts the mean and divides by the standard deviation.
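
A quick sketch of standardization, both by hand and with scikit-learn's StandardScaler (the tiny array is only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# By hand: subtract the mean, divide by the standard deviation
z_manual = (X - X.mean()) / X.std()

# The same result via scikit-learn
z_sklearn = StandardScaler().fit_transform(X)
```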

… To be continued

(Inspired by my data mining course)
