Transforming big data into smart data: feature extraction and sampling techniques.

These days, working with big data has become routine. A dataset might have thousands of features or a huge number of observations about a given target. If the number of features outweighs the number of observations, the model can overfit; if we work with imbalanced data, the model may disregard useful information and produce biased results. Handled incorrectly, big data can work against us, giving us biased results and false insights, which would defeat the whole purpose of any data science project. To minimise this risk, various feature extraction and sampling techniques are used for better data visualisation and correct insights. In this blog, we will look at techniques for reducing dimensionality and for reducing imbalance in a dataset.

Feature extraction

Feature extraction, briefly, can be termed a dimensionality reduction process. It removes or combines features in the dataset so that the model performs better, and because the number of features is reduced, it also helps us visualise the data and draw insights from it.

When it comes to machine learning in data science, various techniques such as PCA, LDA, and LLE can be used. We will discuss them in detail below.


Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Locally Linear Embedding (LLE) are three popular techniques used for dimensionality reduction and feature extraction in machine learning and data science. In this blog, we will explore each of these techniques in detail and discuss their strengths, weaknesses, and use cases.

Principal Component Analysis (PCA):

PCA is a widely used linear transformation technique for extracting important features from high-dimensional datasets. It works by identifying the underlying structure of the data and reducing the dimensionality while retaining as much of the variance as possible. PCA is particularly useful when the features are highly correlated and the number of features is much larger than the number of samples.

PCA transforms the original features of the data into a new set of orthogonal features called principal components. The principal components are sorted in order of decreasing variance, where the first principal component explains the largest amount of variance in the data, and each subsequent principal component explains a smaller amount of variance. The user can then select a subset of principal components that explain the most variance in the data and use those as input features for a downstream task.

One of the main strengths of PCA is that it is computationally efficient and can handle large datasets with many features. However, one of the main weaknesses of PCA is that it is a linear transformation and may not capture nonlinear relationships between features.
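
To make this concrete, here is a minimal PCA sketch using scikit-learn. The iris dataset, the choice of two components, and the variable names are illustrative assumptions rather than part of any particular project.

```python
# Minimal PCA sketch with scikit-learn (illustrative dataset and parameters).
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)       # 150 samples, 4 features

# PCA is sensitive to feature scale, so standardise first.
X_scaled = StandardScaler().fit_transform(X)

# Keep the first two principal components (an arbitrary example choice).
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)    # variance explained by each component
```

The explained_variance_ratio_ attribute is what lets you decide how many components to keep: keep adding components until the cumulative ratio reaches a level you are comfortable with.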

Linear Discriminant Analysis (LDA):

LDA is a supervised linear transformation technique that is used to extract features that are most relevant to the class labels of the data. Unlike PCA, LDA takes into account the class labels of the data and finds a linear transformation that maximizes the separation between classes while minimizing the variance within classes. LDA is particularly useful when the goal is to classify the data into different classes.

LDA transforms the original features of the data into a new set of linearly independent features called discriminant functions. The number of discriminant functions is at most the number of classes minus one. The user can then select a subset of discriminant functions that provide the best separation between classes and use those as input features for a downstream task.

One of the main strengths of LDA is that it takes into account the class labels of the data and can improve classification accuracy. However, one of the main weaknesses of LDA is that it assumes that the data is normally distributed and that the covariance matrix is the same for each class.
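
As a rough illustration, here is a minimal LDA sketch with scikit-learn, again using the iris dataset purely as an example; since iris has three classes, at most two discriminant components are available.

```python
# Minimal LDA sketch with scikit-learn (illustrative dataset and parameters).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 3 classes -> at most 2 discriminants

# Unlike PCA, LDA uses the class labels y to find the projection.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                       # (150, 2)
print(lda.explained_variance_ratio_)     # between-class variance per discriminant
```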

Locally Linear Embedding (LLE):

LLE is a nonlinear dimensionality reduction technique that is used to extract low-dimensional embeddings from high-dimensional datasets. LLE works by finding a low-dimensional representation of the data that preserves the local structure of the data. LLE is particularly useful when the data is highly nonlinear and the goal is to visualize the data in a lower-dimensional space.

LLE works by first constructing a neighbourhood graph of the data, expressing each point as a linear combination of its nearest neighbours, and then finding a low-dimensional representation that preserves those local reconstruction weights. This allows LLE to capture the local structure of the data and represent it faithfully in a lower-dimensional space.

One of the main strengths of LLE is that, unlike linear methods such as PCA, it can capture nonlinear structure in the data. However, one of its main weaknesses is that it can be computationally expensive, especially for large datasets, and its results are sensitive to the choice of the number of neighbours.
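
Here is a minimal LLE sketch with scikit-learn; the swiss-roll dataset and the choice of 10 neighbours are illustrative assumptions.

```python
# Minimal LLE sketch with scikit-learn (illustrative dataset and parameters).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A synthetic nonlinear manifold: 3-D points lying on a "swiss roll".
X, _ = make_swiss_roll(n_samples=1000, random_state=42)

# Unroll it into 2 dimensions, preserving each point's relation to its 10 neighbours.
lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42)
X_lle = lle.fit_transform(X)

print(X_lle.shape)                # (1000, 2)
print(lle.reconstruction_error_)  # how well the local relationships were preserved
```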

Sampling Techniques

As discussed earlier, dealing with imbalanced datasets is a common challenge in data science. An imbalanced dataset is one in which the number of instances of one class is much higher than that of the other class. This can result in a biased model, where the classifier performs poorly on the class with fewer observations. To address this issue, we can use sampling techniques to balance the dataset.

In this blog, we will discuss the different sampling techniques used to remove imbalance from a dataset in a data science project.

What is Imbalanced Data?

Imbalanced data is a situation where the number of instances of one class is much higher than that of the other class. For example, a dataset of rare disease infections is likely to have many more negatives (around 90%) than positives (around 10%). Similarly, if we have a dataset of credit card transactions in which 98% of the transactions are legitimate and 2% are fraudulent, that is an imbalanced dataset.

The problem with imbalanced data is that it can result in a biased model. In the case of rare disease infection, a classifier trained on this dataset may simply predict that no one is infected, since that is the most common class. This means that the classifier may miss infected patients, which can have serious consequences.

Sampling Techniques for Imbalanced Data

Sampling techniques can be broadly classified into oversampling and undersampling. Either we add observations to the minority class (the class with fewer instances) or we remove some observations from the majority class (the class with more instances).

Oversampling

Oversampling involves adding more instances of the minority class to the dataset. This can be done in different ways:

Random Oversampling

This involves randomly duplicating instances of the minority class until it is the same size as the majority class.
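
A minimal sketch of random oversampling, using the third-party imbalanced-learn (imblearn) package; the synthetic 90/10 dataset is an illustrative assumption.

```python
# Random oversampling sketch with imbalanced-learn (pip install imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# A synthetic imbalanced dataset: roughly 90% majority, 10% minority.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))                  # roughly 900 vs 100

# Duplicate randomly chosen minority instances until the classes are balanced.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))              # both classes now the same size
```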

Synthetic Minority Oversampling Technique (SMOTE)

This technique involves creating new synthetic instances of the minority class, based on the existing instances. SMOTE creates new instances by selecting a random instance from the minority class and then selecting one of its neighbours. A new instance is then created at a point along the line between the two instances.
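
A minimal SMOTE sketch with imbalanced-learn; the dataset and parameters are illustrative.

```python
# SMOTE sketch with imbalanced-learn (illustrative dataset and parameters).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))                  # heavily imbalanced

# Each synthetic sample is interpolated between a minority point and one of
# its k nearest minority-class neighbours (k_neighbors=5 is the default).
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))              # classes balanced with synthetic minority points
```

Unlike random oversampling, SMOTE does not duplicate rows, which makes the resampled minority class a little less prone to simple memorisation by the model.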

Undersampling

Undersampling involves removing instances of the majority class to balance the dataset. This can be done in various ways:

Random Undersampling

This involves randomly removing instances of the majority class until it is the same size as the minority class.
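
A minimal random undersampling sketch with imbalanced-learn; the dataset is again an illustrative assumption.

```python
# Random undersampling sketch with imbalanced-learn (illustrative parameters).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))                  # roughly 900 vs 100

# Discard randomly chosen majority instances until the classes are balanced.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))              # both classes now the size of the minority class
```

The obvious drawback is that potentially useful majority-class observations are thrown away.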

Tomek Links

This technique identifies pairs of instances, one from the minority class and one from the majority class, that are each other's nearest neighbours. These pairs are called Tomek links, and removing the majority-class member of each pair (or, in some variants, both instances) cleans up the class boundary and reduces the imbalance.
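
A minimal Tomek links sketch with imbalanced-learn; the dataset is illustrative, and note that by default only the majority-class member of each link is dropped, so the classes end up cleaner but not perfectly balanced.

```python
# Tomek links sketch with imbalanced-learn (illustrative parameters).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# Remove the majority-class member of every Tomek link (the default behaviour).
tl = TomekLinks()
X_res, y_res = tl.fit_resample(X, y)
print(Counter(y_res))              # slightly fewer majority instances, cleaner boundary
```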

NearMiss

NearMiss is an undersampling technique that selects instances from the majority class based on their distance from the minority class. The idea is to select the instances that are closest to the minority class, as these are likely to be the most informative. NearMiss comes in three variants:

NearMiss-1: This variant selects the majority-class instances whose average distance to their three nearest minority-class neighbours is the smallest.

NearMiss-2: This variant selects the majority-class instances whose average distance to the three farthest minority-class instances is the smallest.

NearMiss-3: This variant works in two steps: for each minority-class instance, a fixed number of its nearest majority-class neighbours are kept, and from these the majority-class instances with the largest average distance to their nearest minority-class neighbours are selected, so that every minority instance retains some majority neighbours.
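
Here is a minimal NearMiss sketch with imbalanced-learn; the dataset, the variant, and the number of neighbours are illustrative choices.

```python
# NearMiss sketch with imbalanced-learn (illustrative dataset and parameters).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# version selects the variant (1, 2, or 3); n_neighbors is the k used when
# averaging distances to minority-class samples.
nm = NearMiss(version=1, n_neighbors=3)
X_res, y_res = nm.fit_resample(X, y)
print(Counter(y_res))              # majority class reduced to the minority-class size
```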

Conclusion

In summary, PCA, LDA, and LLE are three popular techniques used for dimensionality reduction and feature extraction in machine learning and data science. PCA is a linear transformation technique used to extract important features from high-dimensional datasets, LDA is a supervised linear transformation technique that extracts the features most relevant to the class labels, and LLE is a nonlinear technique that preserves the local structure of the data while reducing its dimensionality.

Dealing with imbalanced datasets is an important task in data science projects. By using sampling techniques, we can balance the dataset and improve the performance of our classifiers. Oversampling and undersampling are two common techniques used to balance imbalanced datasets. Each technique has its own advantages and disadvantages, and the choice of technique depends on the nature of the data and the research question.
