Log transform for positivity.

Ravi Chandra
Published in Analytics Vidhya
5 min read · Aug 1, 2020

In simple terms, the log transform compresses the range of large numbers and expands the range of small numbers. The larger x gets, the more slowly log(x) increases.

Log transform on range(1, 1000): the x axis is the real value and the y axis is the log-transformed value.

Look closely at the plot above, which shows the log transformation applied to values ranging from 1 to 1000. As we can see, the log maps values from [1, 1000] into roughly the [0, 7] range.

Note how the x values from 200 to 1000 get compressed into roughly the interval between 5 and 7. So the larger x gets, the more slowly log(x) increases.
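
To make this concrete, here is a minimal sketch of the idea behind the plot, assuming the natural log (which is what numpy's np.log computes):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 1001)   # real values 1..1000
y = np.log(x)            # natural log transform

plt.plot(x, y)
plt.xlabel("real value (x)")
plt.ylabel("log(x)")
plt.title("Log transform on range(1, 1000)")
plt.show()

print(np.log(200), np.log(1000))  # ~5.3 and ~6.9: the whole 200..1000 range squeezes into ~1.6 log units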

Log is only defined for x > 0. Log 0 is undefined; it is not a real number. Suppose log (base 10) of 0 were some x, so that 10^x = 0: if you try to solve this, you will see there is no real x for which 10 raised to the power x equals zero (10⁰, for instance, is 1).
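
A quick numpy illustration of this (a sketch; np.log1p(x) = log(1 + x) is a common workaround when the data contains zeros):

import numpy as np

print(np.log(0))    # -inf, with a runtime warning: log(0) is undefined
print(np.log(-1))   # nan: log is only defined for x > 0
print(np.log1p(0))  # 0.0: log1p(x) = log(1 + x) handles zero-valued data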

The log transform is also known as a variance-stabilizing transform, which is useful when dealing with heavy-tailed distributions. It reduces or removes skewness: a highly skewed distribution becomes much less skewed and closer to normal.
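
As a rough check, here is a small sketch on a synthetic heavy-tailed (log-normal) sample; the exact numbers will vary with the random seed:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # right-skewed, heavy-tailed sample

print("skew before log:", skew(data))          # large positive skew
print("skew after log :", skew(np.log(data)))  # close to 0, i.e. roughly normal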

Using the log transform as a feature engineering technique:

To reduce or remove skewness in our data distribution and make it closer to normal (a.k.a. the Gaussian distribution), we can apply a log transformation to our input features (X).

We usually see heavy-tailed distributions in real-world data, where values are right-skewed (a long tail of unusually large values) or left-skewed (a long tail of unusually small values). Algorithms can be sensitive to such distributions and can underperform if the range is not properly normalized.

Skewed distribution vs. log-transformed distribution.

It is common practice to apply a logarithmic transformation on the data so that the very large and very small values do not negatively affect the performance of a learning algorithm. Log transform reduces the range of values caused by outliers.
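
One way to fold this into a preprocessing step is sketched below with scikit-learn; np.log1p is used instead of np.log so zero-valued features don't break, and X_train / y_train are hypothetical names for a non-negative feature matrix and its target:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression

log_model = make_pipeline(
    FunctionTransformer(np.log1p, inverse_func=np.expm1),  # log-transform the input features
    LinearRegression(),
)
# log_model.fit(X_train, y_train)  # X_train is assumed to contain only non-negative values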

However, it is important to remember that once the log transform is done, the data no longer carries its original meaning: you are now looking at log units rather than raw values.

The next question is: when we run a linear regression, how do we interpret the coefficient (feature importance) of a log-transformed independent variable (X)?

For a log-transformed independent variable (X), divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by roughly (coefficient/100) units.

Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002.
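
Here is a hedged sketch of that rule using statsmodels on made-up data; the 0.198 coefficient is baked into the simulation purely to mirror the example above:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=500)
y = 2.0 + 0.198 * np.log(x) + rng.normal(0, 0.05, size=500)  # simulated data with a known coefficient

model = sm.OLS(y, sm.add_constant(np.log(x))).fit()  # regress y on log(x)
coef = model.params[1]
print(coef, coef / 100)  # ~0.198 and ~0.00198: a 1% rise in x adds ~0.002 to y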

Note: I’m also attaching a link below which dives deep into interpreting log transformed features.

Using the log transform on the target variable:

For example, consider a machine learning problem where you want to predict the price of a house from input features such as area, number of bedrooms, and so on.

If you fit a linear regression model of prices (y) on X (area, number of bedrooms, ...) and optimize it with gradient descent, the few extreme prices (high-valued properties) in the dataset will dominate: gradient descent focuses on reducing their large errors and ends up producing a poor model. So performing a log transform on the target variable makes sense when you are running linear regression.

More importantly, linear regression can predict any real number, including negative values. If your model is far off, it can produce negative predictions, especially for some of the cheaper houses. Real-world quantities like price, income, and stock price are positive, so it is good to log transform the target before fitting linear regression; otherwise the model may output negative predictions, which don't make sense.

Example: Predicting house prices

In the example above, if you choose RMSE as the cost function, the model focuses mostly on the high-valued properties and performs badly. If you instead use the error on the log scale, log(actual) - log(predicted), the optimization treats cheap and expensive houses more evenly and produces a better model.

Without the log, the model is under pressure to correct the large errors coming from high-valued properties, so using the log here makes sense.
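
A small sketch of the effect: compare how squared error on the raw scale and on the log scale weight the same 10% mistake on a cheap versus an expensive (hypothetical) house:

import numpy as np

actual    = np.array([100_000, 2_000_000])
predicted = np.array([ 90_000, 1_800_000])  # both predictions are off by 10%

squared_err = (actual - predicted) ** 2
log_err     = (np.log(actual) - np.log(predicted)) ** 2

print(squared_err)  # [1.0e+08 4.0e+10] -> the expensive house dominates the loss
print(log_err)      # [~0.011 ~0.011]   -> both houses contribute about equally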

Converting log predictions back to actual values.

Converting to actual predictions using np.exp: you need actual prices, not the log of the predictions, so you can always convert back by exponentiating the predicted value, i.e. applying np.exp to the predicted log(price).
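
For example (a sketch with hypothetical predicted values; if the target was np.log1p(price) rather than np.log(price), invert with np.expm1 instead):

import numpy as np

log_preds = np.array([11.5, 12.8, 14.0])  # hypothetical predictions on the log(price) scale
price_preds = np.exp(log_preds)           # back to actual prices
print(price_preds)                        # ~[98716, 362217, 1202604]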

Log loss to improve models

Logarithmic loss (related to cross-entropy) measures the performance of a classification model where the prediction input is a probability value between 0 and 1. The goal of our machine learning models is to minimize this value. A perfect model would have a log loss of 0. Log loss increases as the predicted probability diverges from the actual label. So predicting a probability of .012 when the actual observation label is 1 would be bad and result in a high log loss.

Log loss in binary classification setting

If you look at the above example when true value is 1 and predicted probability is 0.1, the log loss is high. Whereas when true value is 1 and predicted probability is 0.9, log loss is low.
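
Those two cases can be checked directly. In the sketch below, the per-sample binary log loss is -(y*log(p) + (1-y)*log(1-p)), and sklearn's log_loss averages it over a set of predictions:

import numpy as np
from sklearn.metrics import log_loss

print(-np.log(0.1))  # ~2.30: true label 1, predicted probability 0.1 -> high loss
print(-np.log(0.9))  # ~0.105: true label 1, predicted probability 0.9 -> low loss

y_true = [1, 1, 0, 0]
y_prob = [0.1, 0.9, 0.2, 0.8]
print(log_loss(y_true, y_prob))  # mean log loss over the four predictions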

Log transformation in Text Classification (Natural language processing)

We use the tf-idf method to encode text data before fitting machine learning models. Tf-idf applies a log transform to the inverse document frequency, so a word that appears in every single document is effectively zeroed out, while a word that appears in very few documents gets a larger weight than its raw count.

TF-IDF
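
A quick sketch with scikit-learn's TfidfVectorizer (note that sklearn smooths the idf, so a word in every document gets the minimum weight of 1 rather than exactly 0, but the effect is the same: common words are down-weighted, rare words are up-weighted):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the house price is high",
    "the house is small",
    "the garden is large",
]
vec = TfidfVectorizer().fit(docs)

for word, idx in sorted(vec.vocabulary_.items()):
    print(word, round(vec.idf_[idx], 3))
# "the" and "is" (in every document) get the lowest idf; "price", "garden", ... get the highest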

Ravi Chandra, Analytics Vidhya
Self-taught data scientist who is passionate about how machine learning and AI can change the world.