Explaining the basic concepts of machine learning with the right graphs

Jahid Hasan
Published in The Startup
5 min read · Jul 16, 2019

While explaining the basic concepts of machine learning, I found myself always returning to a small number of diagrams. Below is a list of the ones I find most illuminating.

Test and training error

Why a low training error is not always a good thing: the figure above shows test and training error as a function of model complexity.
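As a rough, hedged sketch of that U-shaped test-error curve (my own toy data and choice of polynomial degrees, not the figure's), fitting models of increasing complexity makes the training error fall steadily while the test error eventually climbs again:

```python
# Sketch: training error keeps falling with model complexity,
# while test error eventually rises again (overfitting).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
x_train, y_train = x[:30, None], y[:30]
x_test, y_test = x[30:, None], y[30:]

for degree in [1, 3, 9, 15]:                      # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree:2d}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```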

Under- and overfitting

Examples of underfitting and overfitting: the figure shows polynomials of various orders M, drawn as red curves, fitted to a data set that was generated from the green curve.
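A minimal sketch of the same behaviour, assuming (as in the classic version of this figure) that the data come from a smooth sine curve plus noise: a low order M underfits, while M = 9 interpolates the ten noisy points exactly, with wildly large coefficients.

```python
# Sketch: polynomials of different order M fitted to 10 noisy samples
# of a smooth curve -- low M underfits, M = 9 interpolates the noise.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # assumed target curve

for M in [0, 1, 3, 9]:
    coeffs = np.polyfit(x, t, deg=M)              # least-squares polynomial fit
    fit = np.polyval(coeffs, x)
    rms = np.sqrt(np.mean((fit - t) ** 2))
    print(f"M={M}: training RMS error {rms:.3f}, "
          f"largest |coefficient| {np.max(np.abs(coeffs)):.1f}")
```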

Occam’s razor

The figure above shows why Bayesian reasoning embodies the Occam's razor principle, and it gives a basic, intuitive explanation of why overly complex models turn out to be less probable. The horizontal axis represents the space of possible data sets D. Bayes' theorem rewards models in proportion to how strongly they predicted the data that actually occurred. These predictions are quantified by a normalized probability distribution over the data D. The probability of the data given model Hi, P(D|Hi), is called the evidence for Hi. A simple model H1 makes only a limited range of predictions, shown as P(D|H1); a more powerful model H2, which has, say, more free parameters than H1, can predict a greater variety of data sets. This means, however, that H2 does not predict the data sets in region C1 as strongly as H1 does. Assuming equal prior probabilities are assigned to the two models, then if the data set falls in region C1, the less powerful model H1 will be the more probable model.
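Here is a tiny numerical sketch of that evidence argument, using my own toy example rather than anything from the figure: a constrained model H1 that fixes a coin's bias at 0.5 is compared with a flexible model H2 that spreads its probability over all possible biases. For data that look roughly balanced, the simpler H1 earns the higher evidence P(D|Hi).

```python
# Sketch of Bayesian Occam's razor with two models of a coin:
#   H1: bias fixed at 0.5         (simple, predicts few data sets, but strongly)
#   H2: bias uniform on [0, 1]    (flexible, spreads its probability thinly)
from math import comb

n, k = 20, 11                       # 20 flips, 11 heads (roughly balanced data)

# Evidence under H1: binomial likelihood with p = 0.5.
evidence_h1 = comb(n, k) * 0.5**k * 0.5**(n - k)

# Evidence under H2: the binomial likelihood integrated over a uniform
# prior on p, which works out to 1 / (n + 1) for any k.
evidence_h2 = 1 / (n + 1)

print(f"P(D|H1) = {evidence_h1:.4f}")
print(f"P(D|H2) = {evidence_h2:.4f}")
# With equal priors on the models, the simpler H1 is favoured here,
# just as in the C1 region of the figure.
```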

Feature combinations

(1) Why features that are individually irrelevant can become relevant when combined, and (2) why linear methods may fail in such cases. See the slides from Isabelle Guyon's feature-extraction tutorial.
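A minimal sketch of this effect, built on my own XOR-style data set (not the one in the slides): each feature alone carries no information about the class, a linear model does no better than chance, but a method that can use the two features jointly separates the classes easily.

```python
# Sketch: two features that are useless on their own but decisive together
# (XOR pattern), where a linear classifier fails.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # class = XOR of the signs

linear = LogisticRegression().fit(X, y)
kernel = SVC(kernel="rbf").fit(X, y)              # can combine the features

print("linear accuracy :", linear.score(X, y))    # ~0.5, chance level
print("kernel accuracy :", kernel.score(X, y))    # close to 1.0
```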

Irrelevant features

Why irrelevant features can hurt KNN, clustering, and other methods that group similar points. The left panel shows two classes that are well separated along the vertical axis. The right panel adds an irrelevant horizontal axis that destroys the grouping and makes many points nearest neighbours of points from the opposite class.
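A hedged sketch of the same effect in code, on toy data of my own rather than the figure's: one informative feature separates the classes, and adding a high-variance irrelevant feature noticeably hurts a nearest-neighbour classifier.

```python
# Sketch: an irrelevant, noisy feature degrades k-nearest-neighbour accuracy.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
informative = y + rng.normal(scale=0.3, size=n)   # separates the two classes
irrelevant = rng.normal(scale=5.0, size=n)        # pure noise on a large scale

X_good = informative[:, None]
X_bad = np.column_stack([informative, irrelevant])

knn = KNeighborsClassifier(n_neighbors=5)
print("1 relevant feature     :", cross_val_score(knn, X_good, y).mean())
print("+ 1 irrelevant feature :", cross_val_score(knn, X_bad, y).mean())
```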

Basis functions

How nonlinear basis functions turn a classification problem with a low-dimensional nonlinear boundary into one with a high-dimensional linear boundary. The slide from Andrew Moore's Support Vector Machine (SVM) tutorial shows a one-dimensional problem with input x that is not linearly separable being transformed into a linearly separable two-dimensional problem z = (x, x²).
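That transformation is easy to reproduce; here is a sketch on my own data, assuming the same z = (x, x²) mapping as the slide:

```python
# Sketch: a 1-D problem that is not linearly separable in x becomes
# linearly separable after the basis expansion z = (x, x^2).
import numpy as np
from sklearn.svm import LinearSVC

x = np.linspace(-3, 3, 100)
y = (np.abs(x) > 1.5).astype(int)        # inner points vs outer points

z = np.column_stack([x, x**2])           # lift to 2 dimensions

print("in x alone :", LinearSVC().fit(x[:, None], y).score(x[:, None], y))
print("in (x, x^2):", LinearSVC().fit(z, y).score(z, y))   # separable now
```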

Discriminative vs. Generative

Why discriminative learning can be simpler than generative learning: the example above shows the class-conditional densities for two classes with a single input variable x (left), together with the corresponding posterior probabilities (right). Note that the left-hand mode of the class-conditional density p(x|C1), shown in blue in the left plot, has no effect on the posterior probabilities. The vertical green line in the right plot shows the decision boundary in x that gives the minimum misclassification rate.
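To make the point concrete, here is a sketch with class-conditional densities of my own choosing (the parameters are assumptions, not the figure's): p(x|C1) is bimodal, yet its far-left bump contributes almost nothing to the posterior near the decision boundary, which is all a discriminative model needs to learn.

```python
# Sketch: a bimodal class-conditional density p(x|C1) whose left-hand mode
# has essentially no effect on the posterior p(C1|x) near the boundary.
import numpy as np
from scipy.stats import norm

x = np.linspace(-5, 5, 7)   # a few probe points

# Assumed densities: C1 is a two-component mixture, C2 a single Gaussian;
# the two classes have equal prior probability.
def p_x_given_c1(x):
    return 0.3 * norm.pdf(x, loc=-3, scale=0.5) + 0.7 * norm.pdf(x, loc=1, scale=0.8)

def p_x_given_c2(x):
    return norm.pdf(x, loc=2.5, scale=0.8)

posterior_c1 = p_x_given_c1(x) / (p_x_given_c1(x) + p_x_given_c2(x))
for xi, pi in zip(x, posterior_c1):
    print(f"x = {xi:+.2f}  p(C1|x) = {pi:.3f}")
# Near the decision boundary between the two classes, the posterior is driven
# entirely by the right-hand structure of the densities; the extra left-hand
# mode of p(x|C1) is irrelevant to the classification decision.
```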

Loss functions

Learning algorithms can be viewed as optimizing different loss functions: the figure above shows the "hinge" error function used in support vector machines, drawn in blue, together with the error function of logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), drawn in red. The black line shows the misclassification error and the green line the squared error.
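These four error functions are simple to write down; the sketch below (my own code, using the rescaling described above) evaluates them as functions of the margin z = y·f(x):

```python
# Sketch: the four loss functions from the figure, as functions of the
# margin z = y * f(x) with labels y in {-1, +1}.
import numpy as np

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])     # margin values y * f(x)

hinge = np.maximum(0.0, 1.0 - z)              # SVM hinge loss (blue)
logistic = np.log1p(np.exp(-z)) / np.log(2)   # logistic loss / ln 2, passes (0, 1) (red)
zero_one = (z <= 0).astype(float)             # misclassification error (black)
squared = (1.0 - z) ** 2                      # squared error (green)

for row in zip(z, hinge, logistic, zero_one, squared):
    print("z=%+.0f  hinge=%.2f  logistic=%.2f  0/1=%.0f  squared=%.2f" % row)
```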

The geometry of least squares

The figure above shows the N-dimensional geometry of least-squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of least-squares predictions.
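A quick numerical check of that geometric picture, on toy vectors of my own: project y onto the span of x1 and x2 and verify that the residual y − ŷ is orthogonal to both inputs.

```python
# Sketch: least squares as an orthogonal projection of y onto span{x1, x2}.
import numpy as np

rng = np.random.default_rng(0)
N = 6                                   # y lives in N-dimensional space
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
y = rng.normal(size=N)

X = np.column_stack([x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
y_hat = X @ beta                               # projection of y onto the plane

residual = y - y_hat
print("residual . x1 =", np.round(residual @ x1, 10))   # ~0
print("residual . x2 =", np.round(residual @ x2, 10))   # ~0
```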

Sparsity

Why the lasso (L1 regularization, or a Laplacian prior) gives sparse solutions (i.e., weight vectors with more zero entries): the figure shows the estimation picture for the lasso (left) and for ridge regression (right), with the contours of the error function and the constraint regions. The red ellipses are contours of the least-squares error function, while the solid blue areas are the constraint regions |β1| + |β2| ≤ t for the lasso and β1² + β2² ≤ t² for ridge regression.
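The sparsity claim is easy to see numerically. Here is a hedged sketch on arbitrary synthetic data of my own (not the figure's): fit lasso and ridge with comparable regularization strength and count the exactly-zero coefficients.

```python
# Sketch: the lasso zeroes out coefficients, ridge regression only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_beta = np.zeros(20)
true_beta[:3] = [3.0, -2.0, 1.5]                 # only 3 of 20 features matter
y = X @ true_beta + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso zero coefficients:", np.sum(lasso.coef_ == 0), "of 20")
print("ridge zero coefficients:", np.sum(ridge.coef_ == 0), "of 20")
```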
