This is the first post in a series on the whats and hows of getting your DATA ready to roll.

Grooming The Booming Data

ADITYA RAJ
tech@iiit-gwalior
Dec 15, 2021


Look Woody, that’s gonna decide our future!!

Firstly, kudos to technology for completely changing the world.
Today, the amount of data generated every second has grown enormously, thanks to data sources that simply didn’t exist before. Almost everything we do generates data.

Sources of data

Human-Generated: Data that humans create and share.
Social media has been a leading force in the propagation of human-generated data. You generate data every time you post a message, update your online status, upload an image, or like and forward a comment.

Machine-Generated: Data generated from machines that don’t rely on active human intervention.
A fitness tracker, for example, records your activity on its own; it knows what to do because it is programmed to do so.

Organization-Generated: Data generated as organizations run their businesses.
A record is generated every time you purchase something over the internet or at a shop.

Techniques to work with DATA

After organizations collect and process data using stream or batch processing, they usually use a variety of techniques to extract insights from it. After all, that’s the whole point of working with data — to be able to extract insights that can help drive business decisions.

Popular techniques for working with data

Congratulations! You’ve reached the end of the first section. By now, you should be familiar, at a high level, with how data is increasing exponentially and the techniques used to deal with it.

Now let’s see how to groom your data. Grooming means getting it ready so that you can feed it to your computers and get the most accurate results. This step has a technical name: Data Preprocessing.

Data Preprocessing is the modification of raw data into a form more suitable for modeling. It is used in predictive modeling projects, such as classification or regression.
Machine-learning algorithms operate on numeric data: they take numbers as input and predict a number as output, so data is seen as vectors and matrices.
You must transform the data prior to training, evaluating, and using machine learning models to meet the requirements of each individual algorithm.
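
To make that concrete, here is a toy illustration (all numbers made up) of a dataset as a feature matrix and a target vector, using NumPy:

```python
import numpy as np

# A toy dataset: each row is one example, each column one numeric feature.
X = np.array([[25, 50000.0],
              [32, 64000.0],
              [47, 81000.0]])   # feature matrix: 3 examples, 2 features

y = np.array([0, 1, 1])         # target vector: one number per example

print(X.shape, y.shape)         # (3, 2) (3,)
```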

Tasks in Data Preprocessing

Data Cleaning: Identifying and correcting errors in the data.

Feature Selection: Identifying the input features most relevant to predicting the target variable.

Data Transforms: Changing the distribution of variables.

Feature Engineering: Formulating a dataset with meaningful, appropriate features.

Dimensionality Reduction: Finding patterns in data and using these patterns to re-express it in compressed forms.

DATA CLEANING

It is performed before any of the other tasks mentioned above. Getting perfect data is unrealistic, so you will be dealing with messy data all the time. Messy data has statistical noise, errors, missing values, and conflicting examples.

It involves using statistics to characterize normal data and identify the extreme values that fall far outside it, imputing missing values, and removing duplicate rows and columns.
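
As a rough sketch of what these steps look like in practice, here is a small pandas example (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small messy dataset (hypothetical values).
df = pd.DataFrame({
    "age":    [25, 25, np.nan, 47, 200],     # 200 is an implausible extreme
    "income": [50000, 50000, 64000, 81000, 75000],
})

# 1. Remove duplicate rows.
df = df.drop_duplicates()

# 2. Impute missing values (here, with the column median).
df["age"] = df["age"].fillna(df["age"].median())

# 3. Flag extreme values using simple statistics (the 1.5 * IQR rule).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)
df = df[~outliers]

print(df)
```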

Outline of Data Cleaning

FEATURE SELECTION

Irrelevant features may mislead the learning algorithm and, in turn, decrease predictive performance. The motto, therefore, is to use only the features that are needed to make predictions.

Intrinsic: machine learning models that automatically select features as part of fitting the model.
Wrapper: methods that explicitly search for the subset of features that results in the best performing model.
Filter: methods that score each input feature and allow a subset to be selected.
Different statistical methods are used depending on the types of the feature and target variables; correlation is the primary method for scoring feature variables.
RFE (Recursive Feature Elimination) is a popular wrapper method: it removes features one by one until the desired number of features is left.
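
Here is a minimal sketch of the filter and wrapper styles using scikit-learn (the dataset is synthetic and the parameter choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 4 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=42)

# Filter: score each feature (ANOVA F-test) and keep the best 4.
filt = SelectKBest(score_func=f_classif, k=4).fit(X, y)
print("filter keeps:", filt.get_support(indices=True))

# Wrapper: RFE refits a model repeatedly, dropping the weakest feature each round.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=4).fit(X, y)
print("RFE keeps:  ", rfe.get_support(indices=True))
```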

Outline of Feature Selection

DATA TRANSFORMATION

There are two different categories of data:
Numerical: having numeric values (1, 5, 0.8, 4.5).
Categorical: having labels (yes/no, red, green).

Features on a far larger scale than other features can unduly influence the model (KNN, for instance, uses distances explicitly when making predictions), so we want all features to be on a similar scale.

Normalization Transform: Scaling a variable to the range 0 to 1.
Standardization Transform: Scaling a variable to a standard Gaussian.
Discretization Transform: Encoding a numeric variable as an ordinal variable.
Power Transform: Changing the distribution of a variable to be more Gaussian.
Quantile Transform: Imposing a probability distribution, such as uniform or Gaussian.
Ordinal Transform: Encoding a categorical variable into an integer variable.
One Hot Transform: Encoding a categorical variable into binary variables.
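
Here is a minimal sketch of a few of these transforms using scikit-learn (the input values are toy examples):

```python
import numpy as np
from sklearn.preprocessing import (KBinsDiscretizer, MinMaxScaler,
                                   OneHotEncoder, StandardScaler)

num = np.array([[1.0], [5.0], [0.8], [4.5]])   # a numerical feature
cat = np.array([["red"], ["green"], ["red"]])  # a categorical feature

# Normalization: rescale values to the range 0 to 1.
print(MinMaxScaler().fit_transform(num).ravel())

# Standardization: rescale to zero mean and unit variance.
print(StandardScaler().fit_transform(num).ravel())

# Discretization: bucket the numeric values into ordered bins.
bins = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
print(bins.fit_transform(num).ravel())

# One hot: one binary column per category.
print(OneHotEncoder().fit_transform(cat).toarray())
```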

Outline of Data Transform

FEATURE ENGINEERING

Feature engineering takes raw data and extracts features from it that are suitable for tasks like machine learning. You can think of the data transforms above as a form of feature engineering.
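
For instance, one common kind of feature engineering is deriving model-friendly features from a raw timestamp. A minimal pandas sketch (the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"order_time": pd.to_datetime([
    "2021-12-15 09:30", "2021-12-18 22:10", "2021-12-19 14:05"])})

# Derive features a model can actually use from the raw timestamp.
df["hour"] = df["order_time"].dt.hour
df["dayofweek"] = df["order_time"].dt.dayofweek   # Monday = 0
df["is_weekend"] = df["dayofweek"] >= 5

print(df)
```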

DIMENSIONALITY REDUCTION

You may consider the number of input features of a dataset as the dimensionality of the data.
Dimensionality reduction means removing less informative “noise” features, which cause problems in prediction tasks. The more input variables there are, the more likely the dataset represents a very sparse, and possibly unrepresentative, sampling of that space.

The most common approaches to dimensionality reduction are:
PCA (Principal Component Analysis)
SVD (Singular Value Decomposition)
These techniques remove the linear dependency between correlated variables.
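
A minimal PCA sketch with scikit-learn (synthetic data; five components is an illustrative choice, not a rule):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with 20 input features.
X, _ = make_classification(n_samples=200, n_features=20, random_state=0)

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Re-express the 20 features as 5 uncorrelated components.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                       # (200, 5)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```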

Other techniques are also used, such as:
SOM (Self-Organizing Map)
t-SNE (t-Distributed Stochastic Neighbor Embedding)
LDA (Linear Discriminant Analysis)

Different Dimensionality Reduction techniques

This blog was a brief overview of the different tasks; I will cover each of them in detail in upcoming posts.

Any queries or suggestions?
Reach out to me — Aditya Raj
