Day 8: Data transformation — Skewness, normalization and much more

SaiGayatri Vadali
5 min read · Dec 29, 2017


This article is the eighth one in the series “Getting started with data science in 30 days using R programming”. All the previous articles are available here.

We have seen how to import data, clean it and make it tidy. But cleaning alone does not make the data ready; it needs to undergo a few more steps before it is ready for actual analysis. Today, let's talk about the next step, which is data transformation.

Why should we transform data when it is already clean?

Different features in a data set may have values in very different ranges. For example, in an employee data set, the salary feature may range from thousands to lakhs, while the age feature lies roughly between 20 and 60. That means one column carries far more weight than another. The following code snippet should make this clearer. You can download the data set here.

To know more about 'range()', refer to this article.

> airline_data <- read.csv("airline.csv")
> range(airline_data$Y)
[1] 1.214 23.619
> range(airline_data$W)
[1] 0.243 1.779
> range(airline_data$K)
[1] 0.559 9.508
> range(airline_data$YEAR)
[1] 1948 1979

You can see how different columns have different ranges of values.

In such a situation, applying statistical measures across this data set may not give the desired results. Data transformation comes to our aid in such situations.

Data transformation

Data transformation predominantly deals with normalizing (also known as scaling) the data, handling skewness, and aggregating attributes.

Normalization

Normalization or scaling refers to bringing all the columns into the same range. We will discuss the two most common normalization techniques.

  1. Min-Max
  2. Z score

Min-Max normalization:

It is a simple way of scaling the values in a column so that they all fall in the range [0, 1]. Here is the formula:

x_scaled = (x - min(x)) / (max(x) - min(x))

Converting it into R is pretty simple:

> z <- (value - min(column)) / (max(column) - min(column))  # 'value' and 'column' are placeholders

Let's apply this normalization technique to the YEAR attribute of our data set. To know more about exploring a data set, read this article and perform the operations mentioned in it. Note that R works on whole vectors at once, which is why the code below is so direct.

> z <- (airline_data$YEAR-min(airline_data$YEAR))/(max(airline_data$YEAR)-min(airline_data$YEAR))
> z
[1] 0.00000000 0.03225807 0.06451588 0.09677395
[5] 0.12903202 0.16129010 0.19354817 0.22580624
[9] 0.25806432 0.29033414 0.32261167 0.35486954
[13] 0.38709661 0.41935468 0.45161275 0.48387083
[17] 0.51612890 0.54838697 0.58067171 0.61290367
[21] 0.64519223 0.67742689 0.70967861 0.74196642
[25] 0.77419765 0.80645194 0.83874064 0.87096771
[29] 0.90322578 0.93548385 0.96774193 1.00000000
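Instead of normalizing one column at a time, we can wrap the formula in a helper function and apply it to every numeric column at once. Here is a minimal sketch assuming the airline_data frame loaded above; the name min_max is my own choice for illustration.

> min_max <- function(x) (x - min(x)) / (max(x) - min(x))   # min-max scaling helper
> airline_scaled <- as.data.frame(lapply(Filter(is.numeric, airline_data), min_max))
> sapply(airline_scaled, range)   # every scaled column now lies between 0 and 1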

Z score normalization:

Now, let us see what Z score normalization is. In Z score normalization, we perform the following mathematical transformation:

z = (x - mean(x)) / sd(x)

That is, we subtract the column's mean from every value and divide by the column's standard deviation, which gives us the Z score normalized values.

Min-Max squeezes every value into a fixed range, so a single extreme value pushes all the other values towards one end of that range. When there are outliers in the data which are important and we don't want to lose their impact, we go with Z score normalization.
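To see the difference, here is a small sketch with a made-up vector containing one large outlier (the numbers are purely illustrative):

> vals <- c(10, 12, 11, 13, 12, 100)              # 100 is the outlier
> (vals - min(vals)) / (max(vals) - min(vals))    # min-max: the regular values get squeezed near 0
> (vals - mean(vals)) / sd(vals)                  # Z score: relative distances, including the outlier's, are kept in standard-deviation units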

scale() function in R:

In R, there is a function called scale() which does Z score normalization for us. Let us calculate the normalized values manually as well as with the scale() function.

>  z_manual<- (airline_data$YEAR- mean(airline_data$YEAR))/sd(airline_data$YEAR)
> z_scale <- scale(airline_data$YEAR,center= TRUE, scale=TRUE)
> z_manual
[1] -1.6523175 -1.5457177 -1.4391188 -1.3325190
[5] -1.2259192 -1.1193194 -1.0127196 -0.9061198
[9] -0.7995200 -0.6928814 -0.5862173 -0.4796181
[13] -0.3731208 -0.2665210 -0.1599212 -0.0533214
[17] 0.0532784 0.1598782 0.2665661 0.3730796
[21] 0.4797802 0.5863026 0.6928814 0.7995795
[25] 0.9060906 1.0126778 1.1193789 1.2258762
[29] 1.3324760 1.4390758 1.5456756 1.6522754
> z_scale[1:6,] # I am choosing only first 6 values
[,1]
[1,] -1.6523175
[2,] -1.5457177
[3,] -1.4391188
[4,] -1.3325190
[5,] -1.2259192
[6,] -1.1193194

attr(,"scaled:center")
[1] 1963.5
attr(,"scaled:scale")
[1] 9.380881
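As a quick sanity check, the manually computed values and the scale() output should match exactly:

> all.equal(z_manual, as.vector(z_scale))   # should print TRUE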

Skewness of data:

According to Wikipedia, "In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean."

skewness() :

Skewness describes how asymmetric the distribution of values is compared to a symmetric, bell-shaped curve. The skewness coefficient can be obtained with the skewness() function from the e1071 package in R.

> library(e1071)
> engine.displ <- skewness(mtcars$disp)
> engine.displ
[1] 0.381657

If the skewness value lies above +1 or below -1, the data is highly skewed. If it lies between +0.5 and +1 (or between -1 and -0.5), it is moderately skewed. If the value lies between -0.5 and +0.5, the data is fairly symmetric, and a value of exactly 0 means a perfectly symmetric distribution.
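To see where these thresholds fall in practice, we can compute the skewness coefficient of every column of the built-in mtcars data set in one go (a small sketch):

> library(e1071)
> round(sapply(mtcars, skewness), 2)   # positive values lean right, negative values lean left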

Once we know the magnitude of the skewness, we should check whether the data is positively or negatively skewed.

Positively skewed data:

If the tail of the distribution is on the right, it is right skewed data. It is also called positively skewed data.

Common transformations of this data include square root, cube root, and log.

Cube root transformation:

The cube root transformation involves converting x to x^(1/3). This is a fairly strong transformation with a substantial effect on distribution shape, but it is weaker than the logarithm. It can be applied to negative and zero values too.

Square root transformation:

The square root, x to x^(1/2), is a milder transformation and can be applied to non-negative values only. Hence, check the values of the column (for example with range()) before applying it.

Logarithm transformation:

The logarithm, x to log base 10 of x, x to log base e of x (ln x), or x to log base 2 of x, is a strong transformation and can be used to reduce right skewness. Like the square root, it can be applied to strictly positive values only.
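Here is a small sketch that puts the three transformations side by side. It simulates a right skewed, strictly positive variable with rexp() (an arbitrary choice for illustration) and compares the skewness coefficient before and after each transformation:

> library(e1071)
> set.seed(123)
> right_skewed <- rexp(1000)        # right skewed, all values > 0
> skewness(right_skewed)            # strongly positive
> skewness(sqrt(right_skewed))      # square root: skewness reduced
> skewness(right_skewed^(1/3))      # cube root: reduced further
> skewness(log(right_skewed))       # logarithm: strongest effect; here it even overshoots into negative skew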

Negatively skewed data:

If the tail of the distribution is on the left, it is called left skewed data. It is also called negatively skewed data.

The most common transformation here is the square (and higher powers). The cube root and logarithm discussed above reduce right skewness, so to use them on left skewed data the values must first be reflected.

We will discuss the square transformation, as the others have already been covered.

Square transformation:

The square, x to x², has a moderate effect on distribution shape and it could be used to reduce left skewness.
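Here is a small sketch of the square transformation in action. It simulates a left skewed variable with rbeta() (the Beta(5, 1) distribution is an arbitrary choice for illustration) and compares the skewness before and after squaring:

> library(e1071)
> set.seed(42)
> left_skewed <- rbeta(1000, shape1 = 5, shape2 = 1)   # values in (0, 1) with a long left tail
> skewness(left_skewed)     # clearly negative
> skewness(left_skewed^2)   # squaring stretches the lower values and pulls the skewness towards 0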

Another method of handling skewness is finding outliers and possibly removing them.

Resolving outliers:

Outliers can be found using the outlier() function from the outliers package. This function returns the observation that lies farthest from the mean. Once such values are found, we can handle them (for example, remove or cap them) to reduce the skewness.

> set.seed(468) # code reproducibility
> x <- rnorm(100) # This gives 100 random normal distributed values
> install.packages("outliers")
> library(outliers) # load the library
> outlier(x)
[1] 2.3448
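A simple way to handle the detected value is to drop it and check how the distribution changes. A minimal sketch, reusing the x and outlier() call from above:

> x_clean <- x[x != outlier(x)]   # keep everything except the single most extreme value
> library(e1071)
> skewness(x)                     # skewness with the outlier
> skewness(x_clean)               # skewness after removing it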

Aggregation of data

Aggregation of data refers to making subsets of the data using various combinations of attributes, applying statistical measures to them, and reporting the results. To know more about aggregation, refer to this article, as it is a little beyond the scope of our series. Generally, we apply the first two transformation techniques more often than aggregation.
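As a tiny illustration using the built-in mtcars data set, the base R aggregate() function groups the rows by one attribute and applies a statistical measure to each group:

> aggregate(mpg ~ cyl, data = mtcars, FUN = mean)   # mean mpg for each number of cylinders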

Till now, we have dealt with data cleaning and transformation. Transformation of data doesn't end here; we also apply feature reduction methods, called dimensionality reduction techniques, which we will see in the coming articles. In the next article, we will get to know one of the most popular data manipulation packages, 'dplyr'.

I hope you enjoyed reading this article. Happy Learning !!!
