# Why you should probably not transform your data

Often, statisticians and data scientists have to deal with skewed data, that is, data whose distribution is not symmetric. In the picture above, we have an example of right-skewed data. If the tail were on the low end, it would be left-skewed. There is often an urge to transform such data into something resembling a normal distribution. This urge should, in general, be resisted.

One reason people rush to transform data is to meet the assumptions of some statistical method, such as regression. There are two reasons this is not a good justification. First, even OLS regression does not assume anything about the shape of the distribution of the *data* (only that the dependent variable is continuous or nearly so). For the usual tests and confidence intervals, it assumes that the *errors* are normally distributed, and the errors can be estimated by the residuals. It is not uncommon for very non-normal data to give normal residuals once the appropriate independent variables are in the model. Second, OLS is not the only tool. Quantile regression makes no assumptions about the distribution of the residuals. Robust regression is resistant to the problems that most violations cause. Regression trees do not require normal data, normal residuals, or anything of the sort. These methods are not as well known as OLS regression, partly because Ronald Fisher was born before Alan Turing and statistics was developed without computers. OLS regression is much more mathematically tractable than the methods I mention here.
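To make the first point concrete, here is a minimal sketch on simulated data. Everything in it is an assumption chosen for illustration (the exponential predictor, the coefficients, the sample size, and the hand-written `skewness` helper). It shows that the outcome can be strongly right-skewed while the residuals from a plain OLS fit are close to symmetric.

```python
import numpy as np


def skewness(values):
    """Sample skewness: the mean cubed z-score (about 0 for symmetric data)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


# Simulated data (illustrative assumption, not from any real study).
rng = np.random.default_rng(0)
n = 5000
x = rng.exponential(scale=2.0, size=n)          # right-skewed predictor
noise = rng.normal(loc=0.0, scale=1.0, size=n)  # normal errors
y = 1.0 + 3.0 * x + noise                       # outcome inherits the skew of x

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

skew_y = skewness(y)
skew_resid = skewness(residuals)
print(f"skewness of y:         {skew_y:.2f}")
print(f"skewness of residuals: {skew_resid:.2f}")
```

The outcome `y` is skewed only because `x` is skewed. Once `x` is in the model, there is nothing left to "fix" with a transformation.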

Another reason people transform data is to reduce the influence of outliers. This is a bit more sensible, but the same methods that deal well with non-normal residuals tend to deal well with outliers.
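As a rough illustration of how a robust method copes with outliers without any transformation, the sketch below hand-rolls a Huber M-estimator using iteratively reweighted least squares. This is an assumption-laden toy, not how any particular library implements robust regression: the function name `huber_irls`, the simulated data, and the contamination scheme are all hypothetical. The threshold 1.345 and the 0.6745 rescaling of the median absolute deviation are conventional choices.

```python
import numpy as np


def huber_irls(design, y, delta=1.345, n_iter=50):
    """Huber M-estimate of regression coefficients via iteratively
    reweighted least squares (illustrative sketch)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # start from OLS
    for _ in range(n_iter):
        resid = y - design @ coef
        # Robust scale: median absolute deviation, rescaled for normal errors.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        scale = max(scale, 1e-12)
        u = np.abs(resid) / scale
        # Huber weights: 1 for small residuals, delta / u for large ones.
        weights = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(weights)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef


# Simulated data with a true slope of 0.5 (illustrative assumption).
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.5, size=n)

# Contaminate the 10 observations with the largest x by shifting them upward.
outlier_idx = np.argsort(x)[-10:]
y[outlier_idx] += 30.0

design = np.column_stack([np.ones(n), x])
ols_coef, *_ = np.linalg.lstsq(design, y, rcond=None)
huber_coef = huber_irls(design, y)

print(f"OLS slope:   {ols_coef[1]:.3f}")
print(f"Huber slope: {huber_coef[1]:.3f}")
```

In this setup the outliers drag the OLS slope well away from 0.5, while the Huber fit stays close to it, and the data themselves are left on their original scale.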

We should transform data when there are *substantive* reasons for doing so. For instance, when dealing with income or costs or other money-related variables, it is often good to take logs because we tend to think of this sort of variable in multiplicative rather than additive terms. If you make $20,000 per year and get a $5,000 raise, that is a 25% raise, which is huge. But if you make $200,000 a year, the same $5,000 is only 2.5%, which isn't so big.
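The salary example can be worked through in a few lines. This is only a sketch of the arithmetic, and the helper name `log_change` is made up for illustration. The same additive raise corresponds to very different changes on the log scale, which is what "multiplicative terms" means here.

```python
import math


def log_change(old, new):
    """Difference in natural logs, i.e. the log of the ratio new / old."""
    return math.log(new) - math.log(old)


raise_amount = 5_000
low_salary, high_salary = 20_000, 200_000

# Same additive change, very different relative change.
low_pct = raise_amount / low_salary     # fraction of the lower salary
high_pct = raise_amount / high_salary   # fraction of the higher salary

low_log = log_change(low_salary, low_salary + raise_amount)
high_log = log_change(high_salary, high_salary + raise_amount)

print(f"raise as a share of salary: {low_pct:.1%} vs {high_pct:.1%}")
print(f"change in log salary:       {low_log:.4f} vs {high_log:.4f}")
```

On the raw scale both raises are identical. On the log scale the first is roughly nine times the second, which matches how people actually experience them.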

When data (or residuals) are not normal, the non-normality is often an important part of the research. If you transform it away, you are throwing away that critical piece.