How do I prove a machine learning problem is not directly solvable linearly?

Laurae: This post is about how to explain a problem is not solvable using linear techniques (understand: non-multiple layered linear combinations). It includes two posts, originally at Kaggle topic and kernel. The introduction was created, followed by the kernel content, and ending with the topic content.

This is what can happens when you work on a kernel at Kaggle… and how to look for non-linearity proofs VISUALLY

# This R environment comes with all of CRAN preinstalled, as well as many other helpful packages
# The environment is defined by the kaggle/rstats docker image: https://github.com/kaggle/docker-rstats
# For example, here's several helpful packages to load in

library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(caret)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

system("ls ../input")
train <- read_csv("../input/train.csv")

# Any results you write to the current directory are saved as output.

#prepare data
train <- train[, !names(train) %in% c("ID", "v3", "v22", "v24", "v30", "v31", "v47", "v52",
"v56", "v66", "v71", "v74", "v75", "v79", "v91", "v107", "v110", "v112", "v113", "v125")] #removes categorical variables
train <- na.omit(train) #cannot compute Mahalanobis distances with NAs
#don't do this in prediction production
#comboInfo <- findLinearCombos(train[, -1]) #if you want to get the perfect linear combination
train <- train[, !names(train) %in% c("v129")] #removes a part of a perfect linear combination to avoid ill-conditioned matrix / dividing by 0 when inverting matrix for Mahalonobis distance computations
train$target <- as.factor(train$target) #factorize target
levels(train$target) <- c("No", "Yes") #turning labels into No/Yes, not required


#calculate Mahalonobis distances using caret package
train_centroids <- classDist(train[, -1], train[, 1]) #get centroids' formulas
train_distances <- predict(train_centroids, train[, -1]) #parse centroid distances on current dataset
train_distances <- as.data.frame(train_distances) #turn data into data frame

#data plotting
splom(train_distances, groups = train[, 1], auto.key = list(columns = 2)) #one way to see data
xyplot(dist.No ~ dist.Yes, data = train_distances, groups = train[, 1], auto.key = list(columns = 2), type = c("p", "smooth")) #second way... overplots
ggplot(train_distances, aes(x = dist.No, y = dist.Yes)) + stat_density2d(aes(color = train$target, fill = ..level..), geom = "polygon") #third way... not this good
ggplot(train_distances, aes(x = dist.No, y = dist.Yes)) + geom_point(aes(color = train$target, fill = train$target), alpha = 1/10) + geom_density2d(aes(color = train$target, fill = train$target)) #overplots severely

Rplot001.png

Rplot002.png

Rplot003.png

Rplot004.png

@Alpha: description is in the comments of the code. It computes Mahalonobis distances and plots four times using different visuailzation settings.

  • If there is no clear visible separation between labels -> sorting labels out is non linear
  • If there is a clear visible separation between labels -> sorting labels out is approximately linear

However, there is a lot of human interpretation to perform as what might be separated for someone might not be for someone else. In BNP Paribas Cardif data set case, it is clearly inseparable linearly.

Morgane
Does it mean that, thanks to this analysis, we can guess that a GLM will not be efficient ?

@Morgane: Yes and No. In this case, the answer is “Yes, but”.

Yes: if you are using inputs in GLM which cannot be used to sort out the non-linearity (such as using raw data without doing much engineering job).

No: if you are using inputs in GLM which can be used to sort out the non-linearity (such as crafting key features).

It means if the engineered inputs are able to sort out visually the labels through Mahalanobis distance, you have crafted features which makes the machine learning task less non-linear (thus more linear, and easier to understand in most of the cases by most algorithms). GLM would benefit widely from this.

Morgane
Thank you Laurae :)
I'm trying to understand what your are doing here (sorry I have only a little experience in machine learning ...). So, to see the relation between target labels, you get the formula to calculate the Malhanobis distance between numerical predictors and labels of target (for example distance between v1 and target=0 and the distance between v1 and target=1) with ClassDist, and then you calculate it by the function "predict". Now you have a matrix of Malhanobis distances and you plot the distance between target=0 and numerical predictor against the distance between target=1 and numerical predictors. How do you link these distances and the relation between target labels ? I read "If there is no clear visible separation between labels -> sorting labels out is non linear" but I did not understand why and why we used Malhanobis distance to conclude to this non linearity. Hope you understood my issue... Thanks in advance :)

@Morgane This is not ML but pure statistics for the proof. Here is a non-mathematical proof: Mahalanobis distance is linked with the leverage statistic (more exactly a leverage measure), which is directly linked to linear regression (the same mathematical formula behind but presented differently through a ratio relation). Thus, if labelling Mahalanobis distances does not yield a separation, it is mostly not useful to try a linear method as you would have noticed that:

=> Computed Mahalanobis distance

=> Plotted the distances and labelled them

=> Noticed there is no clear separable window between the labels

=> Know Mahalanobis distance is linked to leverage statistic, used by linear regression but in another form as a ratio

=> Conclude with a very high human confidence you cannot separate the labels using a linear model

Technically it’s provable statistically (and also get the p-values with confidence intervals), just use the appropriate statistical distribution test that fits your distance distribution between each label. It’s also provable mathematically (getting from the leverage statistics formula in linear regression to/from Mahalanobis distance formula). An example of mathematical proof here: http://stats.stackexchange.com/questions/199686/prove-the-relation-between-mahalanobis-distance-and-leverage

Look at here also for a discussion about linearity and non-linearity: https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/forums/t/20146/linear-or-non-linear-problem — specificallyhttps://www.kaggle.com/c/bnp-paribas-cardif-claims-management/forums/t/20146/linear-or-non-linear-problem/115027#post115027

Alpha wrote:
Hi all, Is there any statistical test or method which tells(with certain confidence) whether a certain binary classification can be solved using linear methods or nonlinear methods….
Alpha wrote:
EDIT: statistical testing*

If you need to generalize to more dimensions and without inspecting visually all dimensions:

  • Compute Mahalonobis distance
  • Compute with the statistical test of your choice whether the distribution of the two classes (from the Mahalonobis distance you computed) are similar or not

You will probably have to use non-parametric statistical tests as the distributions are probably not gaussian (use tests like Shapiro-Wilk, Anderson-Darling, Lilliefors, Jarque-Bera to prove non-normality… then you just have to use the right statistical test to infer whether the two distributions are identical or not).

In case you are wondering why, squared Mahalonobis distance is linked to the leverage statistic, the latter being exactly identical (mathematically in a different form but you can go from one to another) by those computed by linear regressions.

It is the reason computing Mahalonobis distance and eyeballing a plotted chart colored by classes allow to infer immediately whether the data set you have is solvable linearly or non linearly.

morpheushoc wrote:
Empirically, You can try running an SVM with two different kernels: Linear and Gaussian.
If the SVM with Linear kernel is better than the Gaussian one, it means that the best way to separate your two classes is by a “straight line”.
Using non linear models applied to a linear classification problem is like trying to approximate a straight line with multiple curved lines.

Unfortunately this is not the case. The more empirical explanation would be:

  • When a linear separation is existing, the derivative (when n->infinite fitted lines) from a gaussian SVM is equal to the derivative of the linear SVM {1}, one can infer that a gaussian SVM is a linear SVM {2}
  • Therefore from {2}, one can infer that if a linear SVM matches a gaussian SVM performance, then labels should be separable linearly {3}.
  • As {2} is forcibly true when n->infinite, one can also infer that if a linear SVM cannot match a gaussian SVM performance, then labels should not be separable linearly {4} because we have to reject {1}.

Using SVM is probably overkill though.

Explanation: It gives identical (perfect) performance separation for both linear and gaussian SVMs when a linear separation exists.

Unlike the following:

Base 2-D data set:

Gaussian SVM doing very good:

Linear SVM struggling very hard as expected:

Alpha wrote:
Thanks Laurae…. An excellent write up…. Do u have any paper which explains more about linkage between mahalonibis distance and finding proper test to identify linearities or non linearities

Those four links should be enough:

Mahalonobis distance leverage formula (Wikipedia)

Statistical Leverage (linear regression) (Wikipedia)

Prove the relation between Mahalanobis distance and Leverage? (stackexchange)

Bottom to top explanation of the Mahalanobis distance? (stackexchange)

As for papers… I am not sure there are any that can be easily found (I know there are, but I do not seem to find any using scholar.google — probably I’m using the wrong keywords).

Andrey Vykhodtsev wrote:
Hi Laurae, do you mind sharing code for your wonderful visualisations?

I use mldemos. Mostly used for research/teaching because it’s near useless if you want to visualize data sets such as BNP Paribas Cardif’s train set. It’s an extremely good toy software when you want to introduce machine learning to students.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.