Data Mining Introduction — Data Preprocessing

Saishruthi Swaminathan
Jan 3, 2018 · 3 min read


Source: http://data-mining.philippe-fournier-viger.com/introduction-data-mining/

Data preprocessing is crucial in any data mining process, as it directly impacts the success of the project. It reduces the complexity of the data under analysis, because real-world data is rarely clean.

Data is said to be unclean if it is missing attributes or attribute values, or if it contains noise, outliers, duplicates, or wrong values. The presence of any of these will degrade the quality of the results.

Here are a few important data preprocessing techniques that can be applied before getting into algorithm selection.

1. Aggregation

Aggregation combines two or more attributes into a single attribute. The purpose of aggregation can be to:

  • Reduce the number of attributes
  • Change the scale of the data, for example aggregating cities into regions
  • Reduce the variability in the data
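
As a minimal sketch, here is how the cities-to-regions example might look in pandas; the sales table and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical daily sales recorded at the city level
sales = pd.DataFrame({
    "city":   ["Chennai", "Mumbai", "Delhi", "Chennai", "Mumbai"],
    "region": ["South", "West", "North", "South", "West"],
    "sales":  [120, 340, 210, 150, 300],
})

# Aggregate cities up to regions: fewer rows, a coarser scale,
# and less variability than the raw city-level figures
regional = sales.groupby("region", as_index=False)["sales"].sum()
print(regional)
```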

2. Sampling

  • This technique helps with data reduction during initial and final data analysis.
  • It can be extremely useful when handling the entire data set is expensive or time consuming.
  • It works when the sample is representative of the entire data set.
  • Sampling can be done in two main ways.

i) Simple random sampling

The probability of selecting any item is the same. The sampling itself can be done in two ways: with replacement or without replacement, as sketched below.
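
A minimal sketch of both variants using pandas' DataFrame.sample; the 100-row frame is a stand-in data set:

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})  # stand-in data set

# Without replacement: each row can be drawn at most once
no_repl = df.sample(n=10, replace=False, random_state=42)

# With replacement: the same row may be drawn more than once
with_repl = df.sample(n=10, replace=True, random_state=42)
```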

ii) Stratified sampling

Data is split into several partitions, and random samples are then drawn from each partition.
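
A sketch with pandas (GroupBy.sample requires pandas 1.1 or later; the imbalanced label column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["a"] * 90 + ["b"] * 10,  # two imbalanced partitions
    "value": range(100),
})

# Draw 10% from each partition, so the rare stratum "b"
# is still represented in the sample
stratified = df.groupby("label").sample(frac=0.1, random_state=0)
```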

3. Dimensionality reduction

Data sparsity increases as dimensionality increases, which makes operations like clustering and outlier detection less meaningful, since they depend heavily on density and on the distances between points.

The purpose of dimensionality reduction is to:

  • Avoid the curse of dimensionality
  • Reduce the time required by algorithms
  • Greatly reduce memory consumption
  • Ease visualization of the data
  • Eliminate irrelevant features
  • Examples: PCA, SVD, LLE, random projections, LDA, etc.
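
As a minimal sketch of one of these, here is PCA with scikit-learn; the random 200 × 50 matrix stands in for real data, and keeping 95% of the variance is just an illustrative setting:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```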

4. Feature Subset Selection

  • This is another way to reduce the dimensionality of the data
  • It removes duplicate or irrelevant features
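
As an illustrative sketch, duplicate columns and highly correlated pairs can be pruned with pandas; the toy frame and the 0.95 correlation threshold are assumptions, not fixed rules:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, 2, 3, 4],   # exact duplicate of "a"
    "c": [2, 4, 6, 8],   # perfectly correlated with "a"
    "d": [5, 1, 7, 3],
})

# Drop exact duplicate columns
df = df.T.drop_duplicates().T

# Drop one feature from every highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
```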

5. Feature Creation

New features can be created that capture the important and relevant information more effectively than the provided attributes. Three general feature creation methodologies are:

  • Feature extraction: e.g., extracting edges from images
  • Feature construction: e.g., obtaining density from given mass and volume
  • Mapping data into a new space
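
The density example is easy to make concrete; here is a one-line construction in pandas, with a hypothetical materials table:

```python
import pandas as pd

materials = pd.DataFrame({
    "mass":   [10.0, 4.0, 7.9],  # grams
    "volume": [2.0, 4.0, 1.0],   # cm^3
})

# Construct a new feature that is more informative than either input
materials["density"] = materials["mass"] / materials["volume"]
```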

6. Discretization

  • It is most commonly used for classification
  • It works by converting a continuous attribute into an ordinal attribute
  • It can be done in two ways: supervised and unsupervised discretization
  • Supervised methods use the class labels to find the break points
  • Unsupervised methods find breaks in the data values without using class labels, for example by visualizing the distribution or by the methods below
  • Discretization methods (for each of the methods below, first decide the number of buckets the data should be divided into):

i) Equal interval width: bucket width = (max_value - min_value) / number of buckets

ii) Equal frequency: each bucket holds roughly (number of values / number of buckets) items

iii) Clustering
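
A sketch of all three methods with pandas and scikit-learn, assuming three buckets and a made-up series of values:

```python
import pandas as pd
from sklearn.cluster import KMeans

values = pd.Series([1, 2, 3, 8, 9, 10, 20, 21, 22])

# i) Equal interval width: each bucket spans (max - min) / 3
equal_width = pd.cut(values, bins=3)

# ii) Equal frequency: each bucket gets roughly the same count
equal_freq = pd.qcut(values, q=3)

# iii) Clustering: let k-means discover the natural breaks
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    values.to_frame()
)
```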

7. Binarization

  • Maps a continuous or categorical attribute to one or more binary variables
  • Mainly used in association analysis.
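
A minimal sketch of both cases, one-hot encoding a categorical attribute with pandas and thresholding a continuous one with scikit-learn's Binarizer; the data and the threshold of 3.0 are made up:

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

# Categorical attribute -> one 0/1 column per category
colors = pd.DataFrame({"color": ["red", "green", "red", "blue"]})
one_hot = pd.get_dummies(colors["color"])

# Continuous attribute -> 0/1 around a threshold
ratings = [[2.5], [4.0], [1.0]]
binary = Binarizer(threshold=3.0).fit_transform(ratings)
```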

8. Attribute Transformation

  • Map attribute values to a completely new set of values using a simple function such as x^k, log(x), or |x|
  • Normalization: adjusts for differences among attributes in terms of mean, range, variance, and frequency of occurrence.
  • Standardization: subtracts the mean and divides by the standard deviation.
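
A quick sketch of standardization, both by hand and with scikit-learn's StandardScaler (the tiny array is only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

# By hand: subtract the mean, divide by the standard deviation
z_manual = (X - X.mean()) / X.std()

# The same result via scikit-learn
z_sklearn = StandardScaler().fit_transform(X)
```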

… To be continued

(Inspired by my data mining course)
