Data Transformation

Types of Data Transformation

Devanshi Patel
CodeX
5 min read · Sep 8, 2022


Photo by Luke Chesser on Unsplash

According to Wikipedia, data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and management tasks such as data wrangling, data warehousing, and application integration.

Types of Data Transformation

  1. Aggregation
  2. Sampling
  3. Dimensionality Reduction
  4. Feature subset selection
  5. Feature creation
  6. Variable Transformation

Aggregation

Aggregation is the process of combining two or more objects or attributes into a single object or attribute. For example, daily sales figures can be merged to obtain monthly sales figures, cities can be aggregated into regions, states, and countries, and days can be aggregated into weeks, months, and years.

Need for Aggregation:

  • Data Reduction: Aggregation results in smaller data sets that require less memory and processing time, which may permit the use of more expensive data mining algorithms.
  • Change of Scale: Aggregation can act as a change of scope or scale by providing a high-level view of the data instead of a low-level view.
  • More “Stable” Data: Aggregated Data tends to have less variability.

When data is gathered from multiple sources, aggregation also helps integrate those sources into a single data set suitable for analysis.
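As a small illustration, the sketch below (assuming a hypothetical pandas DataFrame with date, store, and sales columns) aggregates daily sales figures into monthly totals:

```python
import pandas as pd

# Hypothetical daily sales data: one row per store per day
daily = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-02-01", "2022-02-15"]),
    "store": ["A", "A", "A", "B"],
    "sales": [120.0, 95.5, 210.0, 80.0],
})

# Aggregate daily figures into monthly totals per store
monthly = (
    daily
    .groupby([daily["date"].dt.to_period("M"), "store"])["sales"]
    .sum()
    .reset_index()
)
print(monthly)
```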

Sampling

Sampling is a commonly used approach for selecting a subset of the data objects to be analyzed. It is used both for the preliminary investigation of the data's characteristics and for building models on a subset of the original data. It is used in data mining because processing the entire set of data of interest is often too expensive or time-consuming.

Need for Sampling:

  • The key aspect of sampling is to use a representative sample. A sample is representative if it has approximately the same property (of interest) as the original set of data.
  • Obtaining the entire set of data may not be necessary (and would hence be a waste of resources).
  • Using a sample will work almost as well as using the entire dataset if the sample is representative.

It is important to choose a proper sample size. With a smaller sample size, patterns may be missed or erroneous patterns might be detected.

Types of Sampling

1. Simple Random Sampling: There is an equal probability of selecting any particular item.

  • Sampling without replacement: Once an item is selected, it is removed from the population for obtaining future samples.
  • Sampling with replacement: Selected item is not removed from the population for obtaining future samples.
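A minimal sketch of both variants, using pandas' sample method on a hypothetical population:

```python
import pandas as pd

# Hypothetical population of 1,000 data objects
population = pd.DataFrame({"value": range(1000)})

# Sampling without replacement: each object can appear at most once
without_repl = population.sample(n=100, replace=False, random_state=42)

# Sampling with replacement: the same object may be drawn more than once
with_repl = population.sample(n=100, replace=True, random_state=42)

print(without_repl["value"].is_unique)   # True
print(with_repl["value"].is_unique)      # usually False
```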

2. Stratified Sampling: The data is split into several partitions and then random samples are drawn from each partition. Stratification is the process of grouping members of the population into relatively homogeneous subgroups before sampling. The strata should be mutually exclusive: every element in the population must be assigned to only one stratum. The strata should also be collectively exhaustive: no population element can be excluded.
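A minimal sketch of stratified sampling, assuming a hypothetical label column that defines the strata:

```python
import pandas as pd

# Hypothetical data with a class label defining the strata
df = pd.DataFrame({
    "value": range(12),
    "label": ["a"] * 6 + ["b"] * 4 + ["c"] * 2,
})

# Draw the same fraction from every stratum so class proportions are preserved
stratified = (
    df.groupby("label", group_keys=False)
      .sample(frac=0.5, random_state=0)
)
print(stratified["label"].value_counts())
```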

3. Progressive Sampling: The proper sample size can be difficult to determine, so adaptive or progressive sampling schemes are sometimes used. These approaches start with a small sample and then increase the sample size until a sample of sufficient size has been obtained.
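A rough sketch of a progressive scheme, using a hypothetical stopping criterion (the estimate of the mean stabilizing) to decide when the sample is large enough:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=5.0, scale=2.0, size=100_000)  # hypothetical data

# Start small and double the sample size until the estimate of interest
# (here, simply the mean) changes by less than a small tolerance.
size, previous = 100, None
while True:
    sample = rng.choice(population, size=size, replace=False)
    estimate = sample.mean()
    if (previous is not None and abs(estimate - previous) < 0.01) or size * 2 > len(population):
        break
    previous, size = estimate, size * 2

print(f"stopped at a sample of {size} objects, estimated mean {estimate:.3f}")
```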

Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. When dimensionality increases, data becomes increasingly sparse in the space that it occupies, and definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful. This is the curse of dimensionality: data analysis becomes significantly harder as the dimensionality of the data increases.

Need for Dimensionality Reduction:

  • Many data mining algorithms work better if the dimensionality of data (i.e. number of attributes) is lower.
  • Allows the data to be more easily visualized.
  • If dimensionality reduction eliminates irrelevant features or reduces noise, then the quality of results may improve. Redundant features duplicate much or all of the information contained in one or more attributes. Eg: The purchase price of the product and the sales tax paid contain the same information. Irrelevant features contain no information that is useful for the data mining task at hand. Eg: Student ID numbers would be irrelevant to the task of predicting their GPA.
  • Can lead to a more understandable model.
  • Reduce the amount of time and memory required by data mining algorithms.

It can be done with:

  • Principal Component Analysis (PCA): the goal is to find a projection that captures the largest amount of variation in the data.
  • Singular Value Decomposition (SVD)
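A minimal PCA sketch with scikit-learn, applied to hypothetical correlated data (a similar reduction could also be obtained from an SVD of the centered data matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 200 objects described by 10 correlated attributes
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Project onto the directions that capture the most variation
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (200, 3)
print(pca.explained_variance_ratio_.sum())  # close to 1 for this data
```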

Feature subset selection

Another way to reduce dimensionality is to use only a subset of the features. There are three standard approaches to feature selection:

  • Embedded approaches: Feature selection occurs naturally as part of the data mining algorithm
  • Filter approaches: Features are selected before the data mining algorithm is run, using some approach that is independent of the data mining task.
  • Wrapper approaches: These methods use the target data mining algorithm as a black box to find the best subset of attributes, similar in spirit to an exhaustive search over subsets, but typically without enumerating all possible subsets.
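A small sketch contrasting a filter approach (univariate scoring with SelectKBest) and a wrapper-style approach (recursive feature elimination around a model), using scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, RFE, f_classif
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 20 features, only 5 of which are informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Filter approach: score features independently of any model, keep the top k
X_filtered = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper-style approach: use the target model itself to search for a good subset
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)  # (300, 5) (300, 5)
```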

Feature Creation

It involves the creation of new attributes that can capture the important information in a data set much more efficiently than the original attributes. The three general methodologies are:

Feature extraction:

The creation of a new set of features from the original raw data is known as feature extraction. Consider a set of photographs, where each photograph is to be classified according to whether or not it contains a human face. The raw data is a set of pixels, and as such, is not suitable for many types of classification algorithms. However, if the data is processed to provide higher-level features, such as the presence or absence of certain types of edges and areas that are highly correlated with the presence of human faces, then a much broader set of classification techniques can be applied to this problem. This method is highly domain-specific.
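A toy, hypothetical sketch of the idea: summarizing a raw pixel array with a few edge-based statistics (the features used in real face detection would be far more sophisticated and domain-specific):

```python
import numpy as np

def edge_features(image: np.ndarray) -> np.ndarray:
    """Illustrative extraction of edge-based features from a 2-D grayscale pixel array."""
    gy, gx = np.gradient(image.astype(float))   # intensity gradients
    magnitude = np.hypot(gx, gy)                 # edge strength per pixel
    # Summarize the raw pixels as a handful of higher-level numbers
    return np.array([magnitude.mean(), magnitude.std(), (magnitude > 0.5).mean()])

# A random "photograph" stands in for real image data here
image = np.random.default_rng(2).random((64, 64))
print(edge_features(image))
```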

Feature construction:

Sometimes the features in the original data sets have the necessary information, but it is not in a form suitable for the data mining algorithm. In this situation, one or more new features constructed out of the original features can be more useful than the original features. Example: dividing mass by volume to get the density.
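A minimal sketch of this mass/volume example, assuming hypothetical mass and volume columns:

```python
import pandas as pd

# Hypothetical measurements of physical objects
objects = pd.DataFrame({
    "mass_kg": [2.0, 7.8, 0.9],
    "volume_m3": [0.001, 0.001, 0.0005],
})

# Construct a new, more informative feature from the originals
objects["density_kg_per_m3"] = objects["mass_kg"] / objects["volume_m3"]
print(objects)
```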

Mapping data to new space:

A different view of the data can reveal important and interesting features. Consider, for example, time series data, which often contains periodic patterns. If there is only a single periodic pattern and not much noise then the pattern is easily detected. If, on the other hand, there are many periodic patterns and a significant amount of noise is present, then these patterns are hard to detect. Such patterns can, nonetheless, often be detected by applying a Fourier transform to the time series to change to a representation in which frequency information is explicit.
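A short sketch of this idea with NumPy: two periodic patterns buried in noise still show up as clear peaks once the series is mapped into the frequency domain:

```python
import numpy as np

# Hypothetical noisy time series containing two periodic patterns
t = np.linspace(0.0, 1.0, 500, endpoint=False)
signal = np.sin(2 * np.pi * 7 * t) + 0.5 * np.sin(2 * np.pi * 31 * t)
noisy = signal + 0.8 * np.random.default_rng(3).normal(size=t.size)

# The Fourier transform makes the frequency content explicit:
# the dominant peaks appear at 7 Hz and 31 Hz despite the noise.
spectrum = np.abs(np.fft.rfft(noisy))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
top = freqs[np.argsort(spectrum)[-2:]]
print(sorted(top))   # approximately [7.0, 31.0]
```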

Variable Transformation

An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values. It can be done using:

  • Simple functions: x^k, log(x), e^x, |x|
  • Normalization: refers to various techniques for adjusting for differences among attributes in terms of frequency of occurrence, mean, variance, and range.
  • Standardization: in statistics, refers to subtracting the mean and dividing by the standard deviation.
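A minimal sketch of these three kinds of transforms on a single hypothetical attribute, using NumPy and scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical attribute with a large range and non-zero mean
x = np.array([[2.0], [10.0], [55.0], [120.0]])

# Simple function transform: compress the range with a logarithm
x_log = np.log(x)

# Normalization: rescale the attribute to the range [0, 1]
x_minmax = MinMaxScaler().fit_transform(x)

# Standardization: subtract the mean and divide by the standard deviation
x_std = StandardScaler().fit_transform(x)

print(x_log.ravel(), x_minmax.ravel(), x_std.ravel(), sep="\n")
```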

Thank you for reading!
