The Four Key Parts of Machine Learning

Qiang Chen
Machine Learning and Math
6 min read · Aug 26, 2018

Foreword

I have already published three stories about how to solve problems in life using machine learning technology.

  1. Scalars, Vectors, and Matrice points out that everything in life can be represented by numbers. A well-defined machine learning problem includes 1) the input part: a matrix X of shape m×k which contains m records, each represented by k scalars (we also need the meaning of the k dimensions); and 2) the target part: a matrix Y of shape m×1, which is our regression target.
  2. Regression, Mapping, Matrices Multiplication discusses the scalar prediction problem, which is also named the regression problem. This story indicates that matrix multiplication can be used to map the input to our target.
  3. Classification, Sigmoid function analyzes the class prediction problem, which we also call classification. This story talked about how to represent a class with a scalar, and why and how we transfer a number in (-∞, +∞) to a number in (0, 1), then solve the problem by matrix multiplication.

Based on these three stories, this story summarizes how to analyze a machine learning problem in a general way that is suitable for various machine learning problems and algorithms. In a machine learning problem, we always have four key parts.

  1. Input representation: we need a representation of what is observed that will help us predict our target. For example, we represent the house information in order to predict the house price, and the email content needs to be represented as scalars, vectors or matrices so that it is easy to recognize whether the email is spam or not.
  2. Output representation: the prediction target needs to be numeric so that it can be handled by the computer system. If there are only two classes, 0 can represent one class and 1 can represent the other.
  3. The mapping from input to output: the simplest mapping is matrix multiplication with a matrix of shape k×1, which maps any k-dimensional input to 1 dimension.
  4. The mapping performance criterion: in order to pick the best matrix of shape k×1, a criterion is needed. The simplest criterion is to compare the difference between the output of our mapping and the real target: the smaller the difference, the better the mapping. (A minimal sketch of the four parts follows this list.)
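
To make the four parts concrete, here is a minimal NumPy sketch; the shapes, the random numbers, and the squared-difference criterion are illustrative assumptions, not anything fixed by the stories above.

```python
import numpy as np

m, k = 5, 3                        # m records, each described by k scalars
X = np.random.randn(m, k)          # 1) input representation: matrix of shape m x k
Y = np.random.randn(m, 1)          # 2) output representation: regression target of shape m x 1

W = np.random.randn(k, 1)          # 3) mapping: matrix multiplication with a k x 1 matrix
Y_pred = X @ W                     #    maps each k-dimensional record to 1 dimension

cost = np.mean((Y_pred - Y) ** 2)  # 4) criterion: a smaller difference means a better mapping
print(cost)
```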

Input representation

  1. General fields: vector representation (each dimension has its own meaning; recall our representation of the email content).
  2. Computer Vision: an image can be represented in the RGB format in most cases. The RGB format means that a matrix of shape 3×width×height is used to represent an image, in which width means the number of pixels in the width direction and height means the number of pixels in the height direction. If depth information is accessible we can add one more channel, and the shape becomes 4×width×height.
  3. Natural Language Processing: an article or sentence contains many words. In order to represent it, a limited word list of N words can be constructed, and the article or sentence can then be represented by an N-dimensional vector, where the value at index i is the number of times the i-th word appears in the article or sentence. Since the development of deep learning, the word2vec technique is commonly used to generate a vector for each word, and the representation of the article or sentence can be generated from those word vectors.

Some machine learning problems do have multiple inputs. I have introduced some basic ideas of input representation; you can find or even invent more efficient representations in the future. The sketch below illustrates two of the representations above.
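
A small NumPy sketch of a bag-of-words vector and an RGB image array; the word list, the sentence, and the image size are made-up examples.

```python
import numpy as np

# Bag-of-words sketch: count how often each word from a fixed word list appears.
word_list = ["free", "money", "meeting", "tomorrow", "prize"]
sentence = "free money free prize"
bow = np.array([sentence.split().count(w) for w in word_list])
print(bow)            # [2 1 0 0 1] -- index i counts how often word i appears

# Image sketch: an RGB image as a 3 x width x height array of pixel values.
width, height = 32, 32
image = np.zeros((3, width, height), dtype=np.uint8)
print(image.shape)    # (3, 32, 32)
```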

Output representation

  1. Regression problem: the target can be represented as a scalar; you should carefully choose the unit of the number, which will help the machine learn more easily.
  2. Binary classification problem: 0 and 1 are commonly used to represent the two classes; -1 and 1 can also be used.
  3. Multi-classification problem: one-hot encoding is a popular representation. An N-dimensional vector is used for N classes: if a record belongs to class i, the value at index i is 1 and the others are 0.

As the problems become more complex, we can use other representation methods, and we can also combine these methods to describe our target. The sketch below shows the one-hot and binary encodings.
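
A short NumPy sketch of these output representations; the number of classes and the class index are arbitrary examples.

```python
import numpy as np

# One-hot encoding sketch for N = 4 classes.
N = 4
class_index = 2
one_hot = np.zeros(N)
one_hot[class_index] = 1.0
print(one_hot)        # [0. 0. 1. 0.] -- 1 at index i, 0 everywhere else

# Binary classification targets can simply be 0/1 (or -1/+1) scalars.
binary_targets = np.array([0, 1, 1, 0])
print(binary_targets)
```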

The mapping from input to output

  1. Linear regression: it maps the N-dimensional input to 1 dimension or more than 1 dimension, and is often used to solve the regression problem.
  2. Logistic regression: it is similar to linear regression except that each value of the output lies in the range (0, 1).
  3. Support vector machine: it maps the N-dimensional input to 1 dimension and is commonly used to solve the binary classification problem.
  4. Multilayer perceptron: it is similar to linear regression but has a higher capability to simulate non-linear mapping processes. Different activation functions can be used to change the output range.
  5. Convolutional neural network: it is designed for image input, whose input is a matrix and whose output can be 1 dimension or more than 1 dimension.

We have listed some basic mapping methods; the first two are sketched below. Researchers in academia are developing and studying novel methods to solve machine learning problems better, while engineers are trying to use machine learning to improve their recommendation results and search functions, testing different methods to choose the best one for their business.
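
A rough NumPy sketch of the linear regression and logistic regression mappings; the shapes and random values are assumptions for demonstration only.

```python
import numpy as np

def linear_mapping(X, W):
    """Linear regression mapping: N-dimensional input -> 1 (or more) output dimensions."""
    return X @ W

def logistic_mapping(X, W):
    """Logistic regression mapping: squash each output into the range (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(X @ W)))

# Toy example with made-up shapes: 5 records, 3 input dimensions, 1 output dimension.
X = np.random.randn(5, 3)
W = np.random.randn(3, 1)
print(linear_mapping(X, W).shape)     # (5, 1), values in (-inf, +inf)
print(logistic_mapping(X, W).min())   # every value lies in (0, 1)
```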

The mapping performance criterion

  1. Binary Cross Entropy Criterion: this is for binary class problems where the output has two dimensions, y’ = [y₀’, y₁’] with y₀’, y₁’ ∈ (0, 1) and y₀’ + y₁’ = 1. cost(y’, y) = -(y₀×ln(y₀’) + y₁×ln(y₁’)).
  2. Margin Criterion: this is also for binary classes but the target output is the scalar 1 or -1. cost(y’, y) = max(0, margin - y×y’), where the margin is configurable (it is usually set to 1), y’ is the prediction result, and y is the target.
  3. Cross-Entropy Criterion: this is for the multi-classification problem; the output y’ is an N-dimensional vector and the target y is one-hot. cost(y’, y) = -Σᵢ yᵢ×ln(yᵢ’).
  4. Abs Criterion: this is suitable for regression where the output is N-dimensional. cost(y’, y) = (1/N)×Σᵢ |yᵢ’ - yᵢ|.

I have listed some basic criteria, sketched in NumPy below; I hope you can find or even invent more useful criteria to measure model performance and help optimize the model better.
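
A rough NumPy sketch of the four criteria above; the function names and toy inputs are my own illustrative choices, not an official API.

```python
import numpy as np

def binary_cross_entropy(y_pred, y):
    """Binary cross entropy: y_pred = [y0', y1'] in (0, 1) summing to 1, y = one-hot target.
    This is the two-class special case of the cross-entropy criterion below."""
    return -np.sum(y * np.log(y_pred))

def margin_criterion(y_pred, y, margin=1.0):
    """Margin (hinge) criterion: y is -1 or +1, y_pred is a scalar prediction."""
    return max(0.0, margin - y * y_pred)

def cross_entropy(y_pred, y):
    """Cross entropy for N classes: y is a one-hot vector, y_pred a predicted distribution."""
    return -np.sum(y * np.log(y_pred))

def abs_criterion(y_pred, y):
    """Abs criterion: mean absolute difference between prediction and target."""
    return np.mean(np.abs(y_pred - y))

print(cross_entropy(np.array([0.7, 0.2, 0.1]), np.array([1.0, 0.0, 0.0])))  # -ln(0.7)
print(margin_criterion(0.4, 1))                                             # 0.6
```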

Summary

For competitions on machine learning competition platforms, such as Kaggle, a popular one, the competition already defines the input representation, the output representation, and the criterion for the prediction result produced by the crafted model. The participants only need to focus on the mapping part and build the best model that maps the input to the output; the one who achieves the best criterion result wins. In academia, many researchers are also working on inventing more powerful models.

An experienced machine learning engineer customizes the four parts for a given machine-learning-related problem to solve it with the best performance.

For most machine learning problems, as long as you define the four key parts: the input representation, the output representation, the mapping from input to output, and the mapping performance criterion, the problem is mostly solved. Gradient descent optimization can be used to get the best model, and many other optimization methods can speed up the optimization process, most of them based on gradient descent. The premise of gradient descent is that the weights in the mapping and the criterion are differentiable. A minimal gradient descent loop is sketched below.
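
A minimal gradient descent loop for the simplest setting in this story, a linear mapping with a squared-difference criterion; the data, learning rate, and step count are assumptions chosen for the example.

```python
import numpy as np

np.random.seed(0)
m, k = 100, 3
X = np.random.randn(m, k)
W_true = np.array([[2.0], [-1.0], [0.5]])
Y = X @ W_true                                 # synthetic target generated by a known mapping

W = np.zeros((k, 1))
learning_rate = 0.1
for step in range(200):
    Y_pred = X @ W
    cost = np.mean((Y_pred - Y) ** 2)          # the criterion to minimize
    grad = 2.0 / m * X.T @ (Y_pred - Y)        # derivative of the criterion w.r.t. W
    W -= learning_rate * grad                  # move W against the gradient

print(W.ravel())                               # approaches [2.0, -1.0, 0.5]
```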

References

  1. Criterions in torch7
  2. Cross entropy — Wikipedia
