ML04: From ML to DL to NLP

A concise concept map

Yu-Cheng (Morton) Kuo
Analytics Vidhya
8 min read · Oct 29, 2020


This article is a concise concept map from ML to ANN to NLP. I won't pay much attention to the complicated math behind ML, DL, and NLP. Instead, I run quickly through all the concepts and leave the details to the readers.

Outline
(1) Machine Learning Basics
1–1 Supervised learning
1–2 Unsupervised learning
1–3 Reinforcement learning
1–4 Model Evaluation
1–5 Data splitting & cross-validation
1–6 Data preprocessing and feature engineering
1–7 Overfitting and underfitting
1–8 Workflow of a ML project
(2) Neural Network Basics
2–1 Visualizations of NN
2–2 Activation functions
2–3 Loss functions
2–4 Optimizers
2–5 Batch learning
2–6 Batch normalization
2–7 Dropout
2–8 Hyper-parameter
2–9 Data splitting & cross-validation
(3) Neural Network Models
3–1 Perceptron
3–2 FNN
3–3 MLP
3–4 CNN
(4) Neural Network in NLP
4–1 Data Pre-processing
4–2 BOW approaches
4–3 CNN
4–4 RNN
4–5 LSTM
(5) References

(1) Machine Learning Basics

1–1 Supervised learning

— Regression problems
— Classification problems
— Image segmentation
— Speech segmentation
— Language segmentation

1–2 Unsupervised learning

— Clustering
— Dimensionality reduction (e.g. SVD, PCA)

1–3 Reinforcement learning

1–4 Model evaluation

For numeric targets, we have:
— MSE
— RMSE
— MAPE
For categorical targets, we have:
— Accuracy
— Precision
— Recall
— F1-score

1–5 Data splitting & cross-validation

— Three-way data split: Splitting the dataset into three parts: training, validation, and test sets. It is stricter than, and gives a more trustworthy performance estimate than, the two-way split into only training and test sets. Strangely, this practice has no single unified name in the literature.

— K-fold cross-validation: Gives a more stable performance estimate and helps guard against overfitting to one particular split (a short scikit-learn sketch follows the figures below).

Figure 1: Three-way data splits [2]
Figure 2: 4-fold cross validation & three-way data splits [2]

1–6 Data preprocessing and feature engineering

— Vectorization: A must-do step for data formats such as text, sound, image, and video.
— Handling missing values: Delete or impute them. I wrote a very detailed article on missing value imputation on my Medium blog a month ago [1], concluding that:

1. In general, the complex ways of missing value imputation (random forest, Bayesian linear regression, and so on) do not perform worse than simple ways such as imputing the mean or median, contrary to what some famous and popular ML books claim.
2. Theoretically, random forest boasts better speed than kNN with similar accuracy, again contrary to some famous and popular ML books.
3. Bayesian linear regression (BayesianRidge in scikit-learn) and random forest (ExtraTreesRegressor in scikit-learn) probably deliver the best accuracy among the models compared.

1–7 Overfitting and underfitting

Common remedies for overfitting:
— Getting more data
— Reducing the size of the network (i.e. reducing the complexity of the model)
— Applying weight regularization
— Dropout (only for ANN models; not suitable for SVM, RF, and so forth)

Underfitting is the opposite problem; it is usually tackled by increasing model capacity, improving features, or training longer.

1–8 Workflow of a ML project

— Problem definition and dataset creation
— Measures of success
— Evaluation protocol
— Data preparation
— Baseline model
— A model large enough to overfit
— Applying regularization
— Learning rate picking strategies

(2) Neural Network Basics

2–1 Visualizations of NN

Figure 3: Visualization of a perceptron [3]
Figure 4: Visualization of a neural network [3]
Figure 5: Low-level operations and DL algorithm [2]

As we can see, a NN involves a few main concepts: weights, activation functions (within a perceptron), a loss function, an optimizer, and weight updates. Let's probe into these concepts.

2–2 Activation functions

— Sigmoid
— Tanh
— ReLU
— PReLU (leaky ReLU is a special case of PReLU): Eliminates the "dying ReLU" problem of ReLU.
— Softmax: Useful for classification.

Figure 6: Relation between PReLU & leaky ReLU [4]
Figure 7: Plots of common activation functions [5]
Figure 8: Saturated & non-saturated activator [6]

2–3 Loss functions

— L1 loss
— MSE loss
— Cross-entropy loss: for classification
— NLL loss
— NLL loss2d

2–4 Optimizers

— SGD: Stochastic gradient descent
— Momentum
— AdaGrad
— RMSprop (≈ AdaGrad with an exponentially decaying average of squared gradients)
— Adam (≈ RMSprop + Momentum)

Figure 9: Optimizers comparison — SGD, Momentum, AdaGrad, Adam [7]
Figure 10: Optimizers comparison on MNIST: SGD, Momentum, AdaGrad, Adam [7]

In general, Adam > AdaGrad > Momentum > SGD (where ">" means "better than"), but in the MNIST case above the ordering is AdaGrad > Adam > Momentum > SGD. For most use cases, Adam or RMSprop works well.

2–5 Batch learning

— Mini-batch: Training on a small random subset of the data at each step; loosely similar to the idea of bootstrap sampling.

2–6 Batch normalization

Batch normalization normalizes the activations of a layer over each mini-batch; it is an essential procedure for stabilizing and speeding up NN training.

2–7 Dropout

Dropout randomly zeroes a fraction of activations during training; it is significant for avoiding overfitting.

2–8 Hyper-parameters

Hyper-parameters to tune include:
— Number of neurons in each layer
— Batch size
— Learning rate
— Weight decay

2–9 Data splitting & cross-validation

It is better to adopt a three-way data split together with k-fold cross-validation.

(3) Neural Network Models

3–1 Perceptron

A neuron is the minimal unit of a neural network. A perceptron is a single-layer neural network.

3–2 FNN

A feedforward neural network (FNN) is an artificial neural network in which the connections between nodes do not form a cycle.

3–3 MLP

A multilayer perceptron (MLP) is a class of feedforward ANN.

3–4 CNN

A convolutional neural network (CNN) is a kind of FNN. A fully connected (linear) layer needs too many parameters and loses all spatial information, whereas a CNN avoids these issues by leveraging convolution layers and pooling layers, yielding outstanding real-world results in computer vision. [2]

CNNs have two major merits in computer vision: [8]
— Translation invariance
— Spatial hierarchies of patterns

Popular CNN network architectures: [7][9]
— LeNet
— AlexNet
— ResNet
— GoogLeNet
— VGGNet
— ImageNet (strictly speaking, the benchmark dataset on which these architectures compete, not an architecture itself)

Moreover, other major concepts of CNNs include: [6]
— Conv2d (Conv2D)
— Pooling (MaxPooling2D)
— Nonlinear activator — ReLU
— Transfer learning
— Pre-convoluted features

Figure 11: Fully connected layer [2]
Figure 12: A simplified version of CNN [2]
Figure 13: How convolution works [8]


(4) Neural Network in NLP

NLP (natural language processing) developed long before ANNs (artificial neural networks) became practical, but it did not truly prosper until ANN methods were brought into it. The classic NLP book "Natural Language Processing with Python" [10], published in 2009, only covers statistical language modeling without mentioning any ANN methods.

4–1 Data Pre-processing

Converting text into a matrix before feeding it into a NN:
— Expanding contractions with a contraction dictionary
— Tokenization
— Removing stopwords
— Stemming

4–2 BOW approaches

Then we can treat the text as a bag-of-words (BOW) and vectorize it, using either one-hot encoding or word embeddings.

— One-hot encoding: A traditional NLP approach, usually combined with TF-IDF and often with an n-gram model. The resulting data is extremely sparse and suffers from the curse of dimensionality, so it is rarely used with deep learning.
— Word embedding: Converts the data into a dense matrix. Word2vec is a popular method.

However, BOW approaches lose the sequential nature of text, so we turn to RNNs to make good use of that sequential structure. [2]

4–3 CNN

CNNs solve problems in computer vision by learning features from images, convolving across the height and width dimensions. In the same way, time can be treated as a dimension to convolve over: 1-D CNNs on text sometimes perform better than RNNs and are computationally cheaper. Another common use of CNNs in NLP is text classification. [2]

4–4 RNN

A recurrent neural network (RNN), which is not a FNN, is designed for sequential data. RNNs can tackle problems such as natural language understanding, document classification, and sentiment classification. An RNN is trained with backpropagation through time (BPTT) rather than plain backpropagation (BP). [11]

Figure 14: A simple RNN [8]

In practice, the simple version of an RNN finds it difficult to remember context from the earlier parts of a sequence. LSTMs and other RNN variants solve this problem with small internal networks (gates) that decide how much, and what, information to remember. [2]

4–5 LSTM

A long short-term memory network (LSTM) is a kind of RNN capable of learning long-term dependencies. A simple RNN suffers from vanishing and exploding gradients on long sequences. LSTMs are designed to avoid the long-term dependency problem: their architecture makes it natural to retain information over long periods of time. [2]

An LSTM has five parts: cell state, hidden state, input gate, forget gate, and output gate. [12]

Figure 15: Anatomy of an LSTM [8]

(5) References

[1] Kuo, M. (2020). ML02: A First Look at Missing Value Imputation (in Chinese). Retrieved from https://merscliche.medium.com/ml02-na-f2072615158e
[2] Subramanian, V. (2018). Deep Learning with PyTorch. Birmingham, UK: Packt Publishing.
[3] Bre, F. et al. (2020). An efficient metamodel-based method to carry out multi-objective building performance optimizations. Energy and Buildings, 206, (unknown).
[4] Guo, H. (2017). How do I implement the PReLU on Tensorflow?. Retrieved from https://www.quora.com/How-do-I-implement-the-PReLU-on-Tensorflow
[5] Endicott, S. (2017). Game Applications of Deep Neural Networks. Retrieved from https://bit.ly/2G8nUIQ
[6] Dutta-Roy, T. (2017). Medical Image Analysis with Deep Learning — II. Retrieved from https://medium.com/@taposhdr/medical-image-analysis-with-deep-learning-ii-166532e964e6
[7] Saito, K. (2016). Deep Learning from Scratch: The Theory and Implementation of Deep Learning with Python (in Japanese). Tokyo, Japan: O'Reilly Japan.
[8] Chollet, F. (2018). Deep Learning with Python. New York, NY: Manning Publications.
[9] Xing, M. et al. (2018). Deep Learning and Natural Language Processing with PyTorch (in Chinese). New Taipei City, Taiwan: DrMaster Press.
[10] Bird, S. et al. (2009). Natural Language Processing with Python. Sebastopol, CA: O'Reilly Media.
[11] Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch. Sebastopol, CA: O'Reilly Media.
[12] Ganegedara, T. (2018). Natural Language Processing with TensorFlow. Birmingham, UK: Packt Publishing.

