My first approach toward understanding machine learning was to take the online course on Coursera by Professor Andrew Ng. It is a very effective course and I highly recommend it. The next step was to look for real data and start applying several machine learning models (logistic regression, Naive Bayes, SVM, etc.) to make predictions. Kaggle has a few competitions meant for learning machine learning tricks. After doing one or two of the educational examples, I started doing some of the real challenges, such as the ones launched by Amazon, Walmart, and others.
In this blog post I will talk about general machine learning guidelines. Given a data set:
- How do we deal with feature engineering? What do we do when we have 10 features? What do we do when we have 1,000 features?
- How do you decide between logistic regression and Random Forests, i.e., which modeling technique is applicable?
Feature Engineering and Data Exploration.
A correct choice or combination of features leads to better solutions in both the test and the training sample. There is no definite path for feature engineering. It is very much guided by the available data. Hence data exploration is very important before feature selection. It is important to figure out what percentage of the data is present (not null) for each available field, what the correlation among the fields is, and how each field correlates with the predicted variable. Another useful and simple visualization tool is the histogram of the features themselves. What values do they take? Is the distribution skewed? Does it have a long tail? Is the long tail (outliers) important for making the prediction? What happens when you remove the long tail? Do the predictions get better?
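As a rough sketch of these first checks, the snippet below computes the fraction of non-null values, the correlation with the target, and the skewness for each feature. The `explore` helper, the feature names, and the synthetic data are all made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def explore(X, y, names):
    """Return basic per-feature diagnostics for a 2-D array X and target y."""
    report = {}
    for j, name in enumerate(names):
        col = X[:, j]
        present = ~np.isnan(col)
        values = col[present]
        # Fraction of rows where the field is present (not null).
        coverage = present.mean()
        # Pearson correlation with the target, using only the non-null rows.
        corr = np.corrcoef(values, y[present])[0, 1]
        # Sample skewness: a large positive value suggests a long right tail.
        centered = values - values.mean()
        skew = (centered ** 3).mean() / (centered ** 2).mean() ** 1.5
        report[name] = {"coverage": coverage, "corr_with_target": corr, "skew": skew}
    return report

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
n = 1000
age = rng.normal(40, 10, n)
income = rng.lognormal(mean=10, sigma=1, size=n)  # long right tail
income[rng.random(n) < 0.2] = np.nan              # roughly 20% missing
X = np.column_stack([age, income])
y = (age + rng.normal(0, 5, n) > 40).astype(float)

report = explore(X, y, ["age", "income"])
for name, stats in report.items():
    print(name, {k: round(float(v), 3) for k, v in stats.items()})
```

A histogram of each column (for example with `numpy.histogram`) is the natural next step once a feature shows up as heavily skewed.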
One should also look at box plots for each individual feature. The box plot displays the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.
We can look for outliers that lie more than 3*IQR (interquartile range) beyond the first or third quartile. Bootstrapping the data set also provides some insight into outliers.
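Here is a minimal sketch of the five-number summary and the IQR rule using only the Python standard library. The function names and the toy data are my own. The multiplier `k=3.0` follows the rule above; a multiplier of 1.5 is the more common convention for flagging milder outliers.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def iqr_outliers(values, k=3.0):
    """Return the values lying more than k*IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(five_number_summary(data))  # (1, 3.25, 5.5, 7.75, 100)
print(iqr_outliers(data))         # [100]
```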
If the data volume is too large, doing evaluations and predictions with a Data Sidekick, a term coined by Abe Gong, is also useful. The idea of a Data Sidekick is to use a small part of your data to figure out what insights can be drawn from that data. For instance, you may have a huge corpus of text; you can use a small portion of it to test various sentiment analysis models and choose the one that gives the best results and is scalable.
Some of the other aspects of feature engineering are:
- Converting continuous variables into categorical variables
- Certain combinations of features result in better predictor variables
- Considering the square or cube of the features (or using non-linear models) can also provide better insights
- Forward selection: start with the strongest feature and keep adding more features. This is computationally expensive.
- Backward selection: start with all the features and remove the weakest features. This is computationally expensive.
- When the number of features becomes very large, using Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) to find the right combination of features is useful.
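To make the last item concrete, here is a small sketch of PCA computed through the SVD with NumPy. The `pca` helper and the synthetic data (10 observed features driven by 2 hidden factors) are assumptions for illustration, not a recipe for any particular data set.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components using the SVD."""
    # PCA is defined on mean-centered data.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Share of the total variance captured by each direction.
    explained = (S ** 2) / (S ** 2).sum()
    return Xc @ components.T, components, explained[:n_components]

# Synthetic example: 10 features that are noisy mixtures of 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

Z, components, explained = pca(X, n_components=2)
print(Z.shape, round(float(explained.sum()), 4))
```

In this toy case two components capture almost all of the variance, so the 10 original features can be replaced by 2 without losing much information.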
Which model should I choose?
Part of this decision is dictated by data. Here I will list the most commonly used machine learning techniques:
- If the outcome is binary (flipping a coin, detecting if a tumor is benign or malignant, etc.) with a finite number of features and a large training set, Logistic Regression is one of the better choices. One of the advantages of Logistic Regression is that if you have additional training data, including it in the model is trivial.
- For a multi-category output with independent features and a finite number of training samples (example: document classification given some categories), Naive Bayes is a good choice. It is embarrassingly simple to incorporate. Both Logistic Regression and Naive Bayes give a probabilistic interpretation to the output.
- Decision Trees are easy to understand and non-parametric, which makes them less sensitive to outliers, and we do not have to worry about linear or non-linear classification boundaries. They have a tendency to overfit, but that's where ensemble methods such as Random Forests come into play.
- Random Forests use an ensemble of decision trees and bootstrap the training sample to deal with the overfitting problem. A more detailed explanation is given here. This technique also trains faster and scales easily.
- Support Vector Machines (SVM) have high accuracy and theoretical guarantees against overfitting, and with an appropriate kernel they work even when the features are not linearly separable. However, they can be memory intensive and do not scale very well.
- Neural Networks require less formal statistical training. This modeling technique can detect complex nonlinear relationships between dependent and independent variables and possible interactions between predictor variables. It encompasses multiple training algorithms. The main disadvantages are its “black box” nature, greater computational burden, proneness to overfitting, and the empirical nature of model development.
The most important aspect of choosing a model is its performance on data, i.e., run diagnostics on the predictions. The most popular diagnostic is to look at the bias-variance trade-off. Run the top 3 most applicable machine learning models from above. Do an 80–20 split or k-fold (maybe 10-fold) cross-validation and tune parameters to get the lowest error on the test set. Look at the error as a function of the modeling parameters for both the test and training samples. There will be an optimal solution that balances bias and variance. This solution will minimize the error on the test sample. The figure illustrates the sweet spot in parameter space with the optimal solution.
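One way to run such a comparison is sketched below with scikit-learn, which I am assuming is installed. The three models, their settings, and the synthetic data set from `make_classification` are illustrative choices, not a recommendation for any specific problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # 10-fold cross-validation: each fold is held out once as the test set.
    fold_scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    scores[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")

best = max(scores, key=scores.get)
print("best:", best)
```

The spread across folds matters as much as the mean: if two models differ by less than the fold-to-fold standard deviation, the simpler one is usually the safer pick.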
If we have high bias, both training and test error will be high; if we have high variance, training error will be low and test error will be high. We should choose the technique that optimizes for both bias and variance. To fix high variance we can collect more training data or do feature engineering to use fewer features. To reduce bias, using a larger set of features or a different set of features might help. Tuning the regularization parameter, iterating through a higher number of steps, or using simulated annealing for stepping are also some ways to get an optimal solution.
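The two regimes can be seen in a toy experiment. The sketch below fits polynomials of increasing degree to noisy samples of a sine curve with NumPy and compares training and test error. The target function, noise level, sample sizes, and degrees are all arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(3x) on [-1, 1]."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.1, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit; a higher degree means a more flexible model.
    coefs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_err[degree] = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, round(float(train_err[degree]), 4), round(float(test_err[degree]), 4))
```

The straight line (degree 1) is too rigid, so both errors stay high. The degree 9 fit drives the training error down, but its test error stays above its training error, which is the gap to watch for when diagnosing high variance.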
Another important aspect is comparing false negatives with false positives. Depending on the problem, reducing one or the other matters more. For instance, if you are doing disease prediction, getting false positives might be better than getting false negatives. Hence it is important to look at the confusion matrix. A confusion matrix is a way of visualizing the predictions made by a classifier: it is just a table showing how the predictions are distributed across the true classes. The x-axis indicates the true class of each observation while the y-axis corresponds to the class predicted by the model.
Two important quantities here are precision and recall. Of all the observations that are actually positive, how many does my classifier predict as positive? This is recall. When my classifier predicts a positive outcome, how often is it actually true? This is precision. I will write a different blog post and go into more detail about the different aspects of the confusion matrix.
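As a small worked example, the sketch below counts the four cells of a binary confusion matrix and derives precision and recall from them. The function names and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    # Precision: of everything predicted positive, how much really is positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything really positive, how much was predicted positive.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 5)
print(precision_recall(y_true, y_pred))  # precision 2/3, recall 0.5
```

In the disease example, the two missed positives (false negatives) are what pull recall down to 0.5, so that is the number to watch when missing a case is the costly mistake.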
Originally published at deblivingdata.net.