A beginner's guide to 8 common machine learning algorithms

Overview

There are many machine learning algorithms for classification, regression, clustering, recommendation, image recognition, and so on, and it is not easy to find a suitable one, so in practice we generally experiment in a heuristic way. Usually we start with the commonly accepted algorithms, such as SVM, GBDT, and Adaboost; deep learning is very hot now, and neural networks are also a good choice. If you care about accuracy, the best way is to test each algorithm one by one via cross-validation, compare the results, and tune the parameters until each performs at its best, then pick the winner. But if you are just looking for a "good enough" algorithm to solve your problem, the tips below may help: the following analyzes the advantages and disadvantages of each method, which makes it easier for us to choose among them.

Bias and variance

In statistics, a model is judged by its bias and its variance, so let us first go over bias and variance in plain terms:

Bias: describes the gap between the expected value E of the predictions (the estimates) and the true value y. The larger the bias, the further the model deviates from the real data.

Variance: describes the spread of the predicted values P, i.e., how dispersed they are around their expected value E. The larger the variance, the more scattered the distribution of the predictions.

The model's true error is the sum of the two:

Error = Bias + Variance

On a small training set, a high-bias/low-variance classifier (such as naive Bayes, NB) has an advantage over a low-bias/high-variance classifier (such as KNN), because the latter will overfit. However, as your training set grows, the model gets better at predicting the underlying data, bias decreases, and the low-bias/high-variance classifier gradually gains the advantage (because it has lower asymptotic error); by then the high-bias classifier is no longer sufficient to provide an accurate model.

Of course, you can also think of this as the difference between a generative model (NB) and a discriminative model (KNN).

Why does naive Bayes have high bias and low variance?

Consider the following:

First, assume you have a training set and a test set. Simply put, we learn a model on the training set and then apply it to the test set, and how good the result is gets measured by the test-set error rate. Most of the time, however, we can only assume that the test set and training set follow the same data distribution, without access to the real test data. So how do we estimate the test error rate when we can only observe the training error rate?

Since the training samples are few (or at least not enough), a model obtained from the training set is never truly correct. (Even 100% accuracy on the training set does not mean it captures the true data distribution, and capturing the true data distribution is our goal, not merely describing a limited set of data points.) Moreover, in practice the training samples often contain noise, so if you pursue perfection on the training set with an overly complex model, the model will treat the errors inside the training set as properties of the true data distribution, arriving at a wrong estimate of that distribution. In that case it falls apart on the real test set (this phenomenon is called overfitting). But the model should not be too simple either, or it will be unable to describe the data distribution when that distribution is complex (reflected in a high error rate even on the training set; this is underfitting). Overfitting means the model is more complex than the true data distribution; underfitting means the model is simpler than the true data distribution.

In the framework of statistical learning, when we describe model complexity, there is a view that Error = Bias + Variance. Error here can be understood as the model's prediction error rate, made up of two parts: one part caused by the model being too simple, giving inaccurate estimates (Bias); the other caused by the model being too complex, bringing greater sensitivity to variation and more uncertainty (Variance).

With this, naive Bayes is easy to analyze. It simply assumes that the features are conditionally independent, which is a heavily simplified model. For such a simple model, in most situations the Bias portion is greater than the Variance portion, that is: high bias and low variance.

In practice, to make the Error as small as possible, we need to balance the proportions of Bias and Variance when selecting a model, i.e., balance underfitting against overfitting.

The relationship between bias, variance, and model complexity is shown more clearly in the figure below:

As model complexity rises, bias becomes smaller and variance gradually becomes larger.
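To make the tradeoff concrete, here is a small self-contained simulation (a hypothetical toy example, not from the article). It repeatedly draws tiny noisy training sets from a known linear function and compares two extreme models at one query point: a "predict the training mean" model (high bias, low variance) and a 1-nearest-neighbor model (low bias, high variance):

```python
import random
import statistics

random.seed(0)

def true_f(x):
    return 2.0 * x  # the "true" relationship we are trying to learn

def sample_train_set(n=5):
    """Draw a small noisy training set from the true function."""
    xs = [random.random() for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, 0.5) for x in xs]
    return xs, ys

x0 = 0.9  # query point; the true value is true_f(0.9) = 1.8

const_preds, nn_preds = [], []
for _ in range(2000):
    xs, ys = sample_train_set()
    # High-bias / low-variance model: always predict the training mean.
    const_preds.append(statistics.mean(ys))
    # Low-bias / high-variance model: predict the label of the nearest x.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    nn_preds.append(ys[nearest])

def bias_and_variance(preds, truth):
    bias = abs(statistics.mean(preds) - truth)
    variance = statistics.pvariance(preds)
    return bias, variance

bias_const, var_const = bias_and_variance(const_preds, true_f(x0))
bias_nn, var_nn = bias_and_variance(nn_preds, true_f(x0))
```

As expected, the constant model shows much larger bias, while the nearest-neighbor model shows much larger variance.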

Common algorithms

1. Naive Bayes

Naive Bayes is a generative model (generative vs. discriminative models differ mainly in whether the joint distribution is required), and it is very simple: you just do a bunch of counting. Under the conditional independence assumption (a rather strict condition), a naive Bayes classifier converges faster than a discriminative model such as logistic regression, so you need less training data. Even when the NB conditional independence assumption fails, the NB classifier continues to perform very well in practice. Its main drawback is that it cannot learn interactions between features (in mRMR terms, the feature redundancy R). A classic example from the references: even though you like both Brad Pitt's and Tom Cruise's movies, it cannot learn that you do not like movies in which they appear together.

Advantages:

The naive Bayes model originates in classical mathematical theory; it has a solid mathematical foundation and stable classification efficiency.

Performs well on small-scale data, can handle multi-class tasks, and is suited to incremental training;

Not very sensitive to missing data, and the algorithm is fairly simple; it is commonly used in text categorization.

Disadvantages:

Needs to calculate the prior probabilities;

There is an error rate in its classification decisions;

Very sensitive to the representation form of the input data.
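The "just a bunch of counting" idea can be sketched in a few lines of plain Python. The spam/ham documents below are made up for illustration; the classifier is a minimal multinomial naive Bayes with Laplace smoothing:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, label). Just count everything."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # label -> word -> count
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict_nb(words, label_counts, word_counts, vocab):
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        # log prior + sum of per-word log likelihoods (Laplace smoothing)
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus
docs = [
    ("free prize money now".split(), "spam"),
    ("meeting schedule for monday".split(), "ham"),
    ("win free money".split(), "spam"),
    ("project report attached".split(), "ham"),
]
model = train_nb(docs)
label = predict_nb("free money".split(), *model)  # -> "spam"
```

Training really is nothing but counting, which is why NB trains so fast and needs so little data.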

2. Logistic regression

It is a discriminative model, with many regularization options (L0, L1, L2, etc.), and unlike with naive Bayes you don't have to worry about whether your features are correlated. Compared with decision trees and SVMs, you also get a good probabilistic interpretation, and you can easily update the model with new data (using the online gradient descent algorithm). Use it if you need a probabilistic framework (for example, to simply adjust classification thresholds, to indicate uncertainty, or to obtain confidence intervals), or if you expect to quickly fold more training data into the model in the future.

Sigmoid function: g(z) = 1 / (1 + e^(-z))

Advantages:

Simple to implement, and widely used in industrial problems;

Very little computation at classification time, fast, and low storage requirements;

Convenient for observing the probability scores of samples;

For logistic regression, multicollinearity is not a fatal problem; it can be combined with L2 regularization to deal with it;

Disadvantages:

When the feature space is very large, logistic regression's performance is not very good;

Prone to underfitting; accuracy is generally not very high;

Cannot handle a large number of multi-class features or variables well;

Handles only binary classification (softmax, derived from it, can be used for multi-class classification), and requires the data to be linearly separable;

Nonlinear features require a transformation first.
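The online gradient descent update mentioned above is short enough to write out in full. This is a minimal sketch on a made-up, linearly separable toy dataset (the data, learning rate, and epoch count are all illustrative choices, not anything from the article):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(data, lr=0.5, epochs=200):
    """data: list of (features, label) with label in {0, 1}.
    Online (stochastic) gradient descent on the log-loss."""
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Hypothetical toy data: class 1 roughly when x0 + x1 > 1
data = [([0.0, 0.0], 0), ([0.2, 0.1], 0), ([1.0, 1.0], 1), ([0.9, 0.8], 1)]
w, b = train_logreg(data)
p_pos = sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0, 1.0])) + b)
p_neg = sigmoid(sum(wi * xi for wi, xi in zip(w, [0.0, 0.0])) + b)
```

Because each update touches only one sample, new data can be folded in without retraining from scratch, which is exactly the "easily update the model" property noted above.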

3. Linear regression

Linear regression is used for regression, unlike logistic regression, which is used for classification. Its basic idea is to optimize the least-squares error function with gradient descent; of course, you can also obtain the parameters directly with the normal equation, with the result:

w = (X^T X)^(-1) X^T y

In LWLR (locally weighted linear regression), the expression for the parameters is:

w = (X^T W X)^(-1) X^T W y

where W is a diagonal matrix of weights for the training samples around the query point.

Thus LWLR differs from LR: LWLR is a non-parametric model, because each prediction must traverse the training samples at least once.

Advantages: simple to implement and simple to compute;
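For a single feature plus an intercept, the normal equation above collapses to the familiar closed form slope = cov(x, y) / var(x). A minimal sketch on made-up data:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w*x + b, solved in closed form
    (the 1-D specialization of the normal equation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

# Hypothetical data lying roughly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.0, 6.9, 9.0]
w, b = fit_line(xs, ys)   # w ≈ 1.96, b ≈ 1.1
```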

4. Nearest neighbor algorithm: KNN

The main steps of the KNN (k-nearest neighbors) algorithm are:

1. Compute the distance from the test sample to every training sample (common distance metrics include Euclidean distance, Mahalanobis distance, and so on);
2. Sort all of the distance values above;
3. Select the k samples with the smallest distances;
4. Vote on the labels of those k samples to get the final classification.
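The four steps above map almost one-to-one onto code. A minimal sketch with Euclidean distance on a made-up 2-D dataset (requires Python 3.8+ for `math.dist`):

```python
import math
from collections import Counter

def knn_classify(query, train, k=3):
    """train: list of (point, label). Follows the four steps above."""
    # 1. distance from the query to every training point
    dists = [(math.dist(query, point), label) for point, label in train]
    # 2-3. sort by distance and keep the k smallest
    k_nearest = sorted(dists)[:k]
    # 4. majority vote over the k labels
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated clusters
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
label = knn_classify((1.1, 1.0), train, k=3)  # -> "a"
```

Note that all the work happens at prediction time; "training" is just storing the samples, which is why KNN's training complexity is O(n) but prediction is expensive.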

How to choose the best value of k depends on the data. Generally, a larger k reduces the impact of noise during classification, but it also blurs the boundaries between categories. A good k can be obtained through various heuristic techniques, such as cross-validation. Noise and irrelevant feature vectors will reduce the accuracy of the k-nearest neighbor algorithm.

The nearest neighbor algorithm has strong consistency results. As the amount of data tends to infinity, the algorithm's error rate is guaranteed to be no more than twice the Bayes error rate. For some good values of k, k-nearest neighbors guarantees an error rate that does not exceed the Bayes theoretical error rate.

Advantages:

Mature theory and a simple idea; can be used both for classification and for regression;

Can be used for nonlinear classification;

Training time complexity is O(n);

Makes no assumptions about the data; high accuracy; not sensitive to outliers;

Disadvantages:

Large amount of computation;

Sample imbalance problem (i.e., some categories have many samples while others have very few);

Requires a lot of memory;

5. Decision tree

Easy to interpret. It handles interactions between features without pressure and is non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (for example, a decision tree can easily handle class A at the low end of feature dimension x, class B in the middle, and class A appearing again at the high end of x). One of its disadvantages is that it does not support online learning: after new samples arrive, the decision tree must be rebuilt from scratch. Another disadvantage is that it is prone to overfitting, but that is exactly the starting point for ensemble methods such as random forests (RF) or boosted trees. Moreover, random forests are often the winner in many classification problems (usually slightly better than SVMs); they train fast and are easy to tune, and you don't have to worry about adjusting a lot of parameters as with SVMs, so they have been very popular.

When building a decision tree, selecting which attribute to branch on is crucial, so pay attention to the information gain formula and understand it well.

Information entropy is calculated as follows:

H = -∑ p_i log₂(p_i), summed over i = 1 … n

where n represents the number of categories (for example, for a two-class problem, n = 2). Compute the probabilities p1 and p2 of these 2 classes in the full sample; then you can compute the information entropy before any branching attribute is selected.

Now select an attribute x_i for branching. The branching rule is: if x_i = v, the sample goes to one branch of the tree; otherwise it goes to the other branch. Clearly, the samples in each branch will likely contain both classes. Compute the entropies H1 and H2 of the two branches, then the total entropy after branching, H' = p1·H1 + p2·H2; the information gain is ΔH = H − H'. Following the information-gain principle, test all attributes and select the one with the maximum gain as the branching attribute.
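The entropy and gain formulas above can be computed directly. A minimal sketch for a single boolean split on a made-up one-feature dataset (the data and the split test are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(samples, labels, test):
    """ΔH = H - H', where H' = p1*H1 + p2*H2 over the two branches
    produced by the boolean `test` applied to each sample."""
    left = [y for x, y in zip(samples, labels) if test(x)]
    right = [y for x, y in zip(samples, labels) if not test(x)]
    h_before = entropy(labels)
    n = len(labels)
    h_after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return h_before - h_after

# Hypothetical toy data: one numeric feature, two classes
samples = [1, 2, 3, 10, 11, 12]
labels = ["small", "small", "small", "big", "big", "big"]
gain = information_gain(samples, labels, test=lambda x: x < 5)  # -> 1.0
```

Here the split x < 5 produces two pure branches, so H' = 0 and the gain equals the full entropy of 1.0 bit; a tree builder would loop this over all candidate attributes and thresholds and pick the maximum.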

Advantages:

Simple computation, easy to understand, strongly interpretable;

Well suited to handling samples with missing attributes;

Able to deal with irrelevant features;

Can produce feasible and satisfactory results on large data sources in a relatively short time.

Disadvantages:

Prone to overfitting (random forests can greatly reduce overfitting);

Ignores the correlations between the data;

For data where the classes have inconsistent sample counts, the information gain in a decision tree is biased toward features with more distinct values (anything that uses information gain has this drawback, such as RF).

Adaboost

Adaboost is an additive model: each model is built based on the error rate of the previous model, paying more attention to the misclassified samples and less attention to the correctly classified ones; over successive iterations, you obtain a better and better model. It is a typical boosting algorithm. Below is a summary of its strengths and weaknesses.

Advantages:

Adaboost is a high-accuracy classifier.

Weak classifiers can be built with a variety of methods; the Adaboost algorithm provides the framework.

When simple classifiers are used, the computed results are understandable, and the construction of the weak classifiers is extremely simple.

Simple, with no feature selection required.

Not prone to overfitting.
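The "focus more on the wrong samples" idea is captured by AdaBoost's reweighting step. A minimal sketch of one round, assuming a weak learner has already been run and we know which samples it got right (the specific weights and outcomes are a made-up example):

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost reweighting step.
    weights: current sample weights (summing to 1).
    correct: booleans, True where the weak learner classified correctly.
    Returns the learner's vote weight alpha and the new sample weights."""
    # weighted error of the weak learner (assumed 0 < eps < 1)
    eps = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1 - eps) / eps)  # learner's say in the final vote
    # misclassified samples are up-weighted, correct ones down-weighted
    new_w = [
        w * math.exp(alpha if not ok else -alpha)
        for w, ok in zip(weights, correct)
    ]
    z = sum(new_w)  # renormalize so the weights sum to 1 again
    return alpha, [w / z for w in new_w]

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # one sample misclassified
alpha, new_weights = adaboost_round(weights, correct)
```

After one round the single misclassified sample carries half of the total weight, forcing the next weak learner to concentrate on it.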

For combination algorithms such as random forests and GBDT, refer to this article: Machine learning — a summary of combination algorithms.

6. SVM (support vector machine)

High accuracy; it provides good theoretical guarantees against overfitting, and even if the data is linearly inseparable in the original feature space, it can still run very well given a suitable kernel function. It is especially popular for very high-dimensional classification problems such as text. Unfortunately, it is memory-intensive, hard to interpret, and somewhat annoying to run and tune, whereas random forests happen to avoid exactly these shortcomings and are more practical.

Advantages:

Can solve high-dimensional problems, i.e., large feature spaces;

Capable of handling nonlinear feature interactions;

Does not rely on the entire dataset;

Good generalization ability;

Disadvantages:

When there are many observations, efficiency is not very high;

There is no universal solution for nonlinear problems; sometimes it is hard to find a suitable kernel function;

Sensitive to missing data;

Choosing a kernel is also a skill (libsvm comes with four kernel functions: the linear kernel, polynomial kernel, RBF kernel, and sigmoid kernel):

First, if the number of samples is smaller than the number of features, there is no need to choose a nonlinear kernel; a simple linear kernel will do;

Second, if the number of samples is greater than the number of features, a nonlinear kernel can be used to map the samples to a higher dimension, which generally gets better results;

Third, if the number of samples equals the number of features, a nonlinear kernel can be used in this situation as well, on the same principle as the second case.

For the first situation, you can also reduce the dimensionality of the data first and then use a nonlinear kernel, which is another approach.
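For intuition about what these kernels compute, here is a sketch of three of them as plain functions (parameter defaults such as gamma, degree, and coef0 are illustrative choices, not libsvm's defaults):

```python
import math

def linear_kernel(x, y):
    """K(x, y) = <x, y>"""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def poly_kernel(x, y, degree=3, coef0=1.0):
    """K(x, y) = (<x, y> + coef0)^degree"""
    return (linear_kernel(x, y) + coef0) ** degree

a, b = [1.0, 2.0], [2.0, 1.0]
k_lin = linear_kernel(a, b)   # 1*2 + 2*1 = 4.0
k_rbf = rbf_kernel(a, b)      # exp(-2), since ||a - b||^2 = 2
```

A nonlinear kernel like the RBF corresponds to an implicit mapping into a much higher-dimensional space, which is why it helps when samples outnumber features but hurts nothing to skip when they don't.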

7. Artificial neural network (ANN)

The advantages of artificial neural networks:

High classification accuracy;

Strong parallel and distributed processing capability, distributed storage, and learning ability;

Strong robustness and fault tolerance to noisy data; able to fully approximate complex nonlinear relationships;

Has the function of associative memory.

Disadvantages of artificial neural networks:

Neural networks require a large number of parameters, such as the network topology and the initial values of the weights and thresholds;

The learning process cannot be observed, and the output is difficult to interpret, which affects the credibility and acceptability of the results;

Learning time can be too long, and it may even fail to achieve the learning objective.

8. k-means clustering

I wrote an article about k-means clustering before; here is the link: Machine learning algorithms — K-means clustering. The derivation of K-means contains a very strong EM idea.

Advantages:

Simple and easy to implement;

Scalable and efficient for dealing with large data sets, because its complexity is approximately O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations; usually k << n. The algorithm usually converges locally.

The algorithm tries to find the k-partition that minimizes the squared error function. When the clusters are dense, spherical, or clumped, and the clusters are clearly distinct from one another, the clustering results are better.

Disadvantages:

High demands on the data type; suitable for numerical data;

May converge to a local minimum, and convergence is slow on large-scale data;

The value of k is difficult to select;

Sensitive to the initial cluster centers: different initial values may lead to different clustering results;

Not suitable for clusters of non-convex shape or of widely varying size;

Sensitive to "noise" and outlier data: a small amount of such data can have a great impact on the mean.
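The alternating assign/update structure (which is where the EM flavor comes from) is easy to see in a minimal sketch of Lloyd's algorithm on 1-D points (the data and iteration count are made-up illustrative choices):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D points (a minimal sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    for _ in range(iters):
        # assignment step (E-like): attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # update step (M-like): move each center to the mean of its cluster
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centers)

# Hypothetical data: two obvious clusters around 1.0 and 10.0
points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
centers = kmeans(points, k=2)  # -> roughly [1.0, 10.0]
```

Each iteration only decreases the squared error, which is why the algorithm converges, but only to a local minimum; hence the sensitivity to initial centers noted above.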

Notes on algorithm selection

A foreign article I translated before gives some simple tips for algorithm selection:

1. First choose logistic regression; even if it doesn't work well, you can use its results as a benchmark for comparison with other algorithms;

2. Then try decision trees (random forests) and see whether they significantly improve your model's performance. Even if you don't end up using them as the final model, you can use random forests to remove noisy variables and do feature selection;

3. If the number of features and the number of samples are both particularly large, and resources and time are sufficient (this condition is important), using an SVM is an option.

A typical rule of thumb: GBDT >= SVM >= RF >= Adaboost >= Other... Deep learning is very popular now and used in many fields; it is based on neural networks. I am still learning it myself, and my theoretical grounding is not deep enough, so it is not covered here.

Algorithms are important, but good data beats a good algorithm, and designing good features helps greatly. If you have a very large dataset, then whichever algorithm you use may not affect classification performance much (in that case, choose based on speed and ease of use).