ML04: From ML to DL to NLP
A concise concept map
Read time: 20 min
This article is part of my mid-term report for the course PyTorch and Machine Learning at NCCU. Original report: https://bit.ly/2UZftXq
This article is a concise concept map from ML to ANN to NLP. I won’t pay much attention to the complicated math behind ML, DL and NLP. Instead, I just run through all the concepts and leave the details to the readers.
Outline
(1) Machine Learning Basics
1–1 Supervised learning
1–2 Unsupervised learning
1–3 Reinforcement learning
1–4 Model Evaluation
1–5 Data splitting & cross-validation
1–6 Data preprocessing and feature engineering
1–7 Overfitting and underfitting
1–8 Workflow of a ML project
(2) Neural Network Basics
2–1 Visualizations of NN
2–2 Activation functions
2–3 Loss functions
2–4 Optimizers
2–5 Batch learning
2–6 Batch normalization
2–7 Dropout
2–8 Hyper-parameter
2–9 Data splitting & cross-validation
(3) Neural Network Models
3–1 Perceptron
3–2 FNN
3–3 MLP
3–4 CNN
(4) Neural Network in NLP
4–1 Data Pre-processing
4–2 BOW approaches
4–3 CNN
4–4 RNN
4–5 LSTM
(5) References
(1) Machine Learning Basics
1–1 Supervised learning
Regression problems
Classification problems
Image segmentation
Speech segmentation
Language segmentation
1–2 Unsupervised learning
Clustering
Dimensionality reduction (e.g. SVD, PCA)
1–3 Reinforcement learning
1–4 Model evaluation
For numeric targets, we have:
MSE
RMSE
MAPE
For categorical targets, we have:
Accuracy
Precision
Recall
F1-score
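The metrics above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are my own, MAPE is reported in percent and assumes no true value is zero, and the classification metrics assume a binary problem with a chosen positive label.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (assumes no true value is 0)."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, `classification_metrics([1, 1, 0, 0], [1, 0, 1, 0])` has one true positive, one false positive and one false negative, so all four metrics come out as 0.5.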
1–5 Data splitting & cross-validation
Three-way data splits: Splitting the dataset into three parts: training, validation and test sets. This is stricter than the two-way split (training and test sets only) and gives a more trustworthy performance estimate, because the model is tuned on the validation set while the test set stays untouched until the final evaluation. Strangely, this practice has no unified name; it also goes by descriptions such as “splitting data machine learning validation”.
K-fold cross-validation: Rotating the validation role over k folds of the data, which helps detect overfitting and makes the performance estimate more stable.
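Both ideas can be sketched with index lists only. This is a minimal sketch under my own assumptions: the function names, the 60/20/20 default ratios and the fixed seed are illustrative choices, not something the original report prescribes.

```python
import random

def three_way_split(n_samples, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle sample indices and cut them into train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_ratio)
    n_val = int(n_samples * val_ratio)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

def k_fold(indices, k=5):
    """Yield (train, validation) index lists; every sample is validated once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val
```

A common way to combine the two is to set the test indices aside first and then run `k_fold` on the remaining indices for model selection.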
1–6 Data preprocessing and feature engineering
Vectorization: A must-do process for data in formats such as text, sound, image and video.
Handling missing values: Deleting or imputing them. I wrote a very detailed article on missing value imputation a month ago on my Medium blog [1], concluding that:
1. In general, the complex ways of missing value imputation (random forest, Bayesian linear regression and so on) won’t perform worse than simple ways such as imputing the mean or median, contradicting some famous and popular ML books.
2. Theoretically, random forest is faster than kNN with similar accuracy, again contradicting some famous and popular ML books.
3. Bayesian linear regression (BayesianRidge in Python) and the random forest model (ExtraTreesRegressor in Python) probably achieve the best accuracy among the models compared.
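For reference, the simple baseline that the model-based imputers are compared against can be sketched as follows. This only illustrates mean and median imputation on one column; the `impute` helper and the use of `None` as the missing-value marker are assumptions made for the example.

```python
from statistics import mean, median

def impute(column, strategy="mean"):
    """Fill missing entries (None) with the mean or median of the observed ones."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in column]
```

With the column `[1.0, None, 3.0, 8.0]`, the mean strategy fills in 4.0 and the median strategy fills in 3.0, which shows how an outlier pulls the two apart.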
1–7 Overfitting and underfitting
Overfitting can be mitigated by:
Getting more data
Reducing the size of the network (i.e. reducing the complexity of ML models)
Applying weight regularization
Dropout (only for ANN models, not suitable for SVM, RF and so forth)
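Weight regularization from the list above can be sketched as an extra penalty on large weights. This is a minimal sketch assuming an L2 penalty; `lam` (the penalty strength) and both function names are hypothetical, and `grads` stands for the gradients of the unregularized loss.

```python
def l2_penalty(weights, lam):
    """Penalty term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def sgd_step_with_l2(weights, grads, lr=0.1, lam=0.01):
    """One gradient step on loss + L2 penalty.

    The penalty contributes 2 * lam * w to each gradient, which pulls
    every weight toward zero on every update.
    """
    return [w - lr * (g + 2 * lam * w) for w, g in zip(weights, grads)]
```

Even with a zero data gradient, a step shrinks the weights, which is why this kind of penalty is often called weight decay.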
Underfitting
1–8 Workflow of a ML project
Problem definition and dataset creation
Measures of success
Evaluation protocol
Data preparation
Baseline model
Large enough to overfit
Apply regularization
Learning rate picking strategies
(2) Neural Network Basics
2–1 Visualizations of NN
An NN has a few main concepts: weights, activation functions (in a perceptron), a loss function, an optimizer and weight updates. So, let’s probe into these concepts.
2–2 Activation functions
Sigmoid
Tanh
ReLU
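PReLU (leaky ReLU is a kind of PReLU with a fixed slope): Mitigates the “dying ReLU” problem of ReLU by keeping a small slope for negative inputs.
Softmax: Useful for classification, since it turns raw scores into probabilities that sum to 1.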
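The activation functions above can be written in a few lines each. This is a sketch on scalars (and a list for softmax). In a real PReLU the negative slope `a` is a learned parameter, whereas here it is a plain argument, and the default values 0.25 and 0.01 are only illustrative.

```python
import math

def sigmoid(x):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squashes any real number into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Passes positive inputs through, zeroes out negative ones."""
    return max(0.0, x)

def prelu(x, a=0.25):
    """Like ReLU, but negative inputs keep a slope `a` instead of dying at 0."""
    return x if x > 0 else a * x

def leaky_relu(x):
    """PReLU with a small fixed slope."""
    return prelu(x, a=0.01)

def softmax(xs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```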
2–3 Loss functions
L1 loss
MSE loss
Cross-entropy loss: for classification
NLL loss
NLL loss2d
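A minimal sketch of these losses for a single example or a small list of targets. The function names are my own; the cross-entropy here takes raw scores (logits) and is computed as log-softmax followed by NLL, which is how the two losses relate.

```python
import math

def l1_loss(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse_loss(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class, given log-probabilities."""
    return -log_probs[target]

def cross_entropy_loss(logits, target):
    """Log-softmax over the raw scores, followed by NLL."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return nll_loss(log_probs, target)
```

With two equal logits the model is undecided between two classes, so `cross_entropy_loss([0.0, 0.0], 0)` equals log 2.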
2–4 Optimizers
SGD: Stochastic gradient descent
Momentum
AdaGrad
RMSprop (= AdaGrad with an exponentially decaying average of squared gradients)
Adam (roughly RMSprop + Momentum, with bias correction)
As a rough rule of thumb, Adam > AdaGrad > Momentum > SGD (> represents “better than”), but the ranking depends on the task: in the MNIST case, AdaGrad > Adam > Momentum > SGD. For most use cases, Adam or RMSprop works well.
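To make the update rules concrete, here is a sketch that minimizes the toy function f(w) = w² (gradient 2w) with each optimizer. The toy problem, the function names and all hyper-parameter values are assumptions chosen for illustration; they say nothing about which optimizer is best on a real task.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = w ** 2."""
    return 2 * w

def run_sgd(w, lr=0.1, steps=100):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_momentum(w, lr=0.1, beta=0.9, steps=100):
    v = 0.0  # velocity: decaying sum of past gradient steps
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w += v
    return w

def run_adagrad(w, lr=0.5, eps=1e-8, steps=100):
    h = 0.0  # running SUM of squared gradients (never shrinks)
    for _ in range(steps):
        g = grad(w)
        h += g * g
        w -= lr * g / (math.sqrt(h) + eps)
    return w

def run_rmsprop(w, lr=0.1, rho=0.9, eps=1e-8, steps=100):
    h = 0.0  # exponentially decaying AVERAGE of squared gradients
    for _ in range(steps):
        g = grad(w)
        h = rho * h + (1 - rho) * g * g
        w -= lr * g / (math.sqrt(h) + eps)
    return w

def run_adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = v = 0.0  # first moment (momentum-like) and second moment (RMSprop-like)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w
```

Starting from, say, `w = 5.0`, every variant moves `w` toward the minimum at 0; they differ in how the step size adapts along the way.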
2–5 Batch learning
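Mini-batch: Each weight update uses a small random subset of the training data. This is loosely close to the concept of bootstrap, although mini-batches are usually drawn without replacement within an epoch.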
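A minimal sketch of mini-batch iteration over one epoch. The helper name and the fixed seed are assumptions for the example; the data is shuffled once and every sample appears in exactly one batch.

```python
import random

def iterate_minibatches(data, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover the data once (one epoch)."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]
```

With 10 samples and `batch_size=4`, an epoch consists of batches of sizes 4, 4 and 2.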
2–6 Batch normalization
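Normalization is an essential procedure for NN. Batch normalization normalizes the inputs of a layer over each mini-batch, which usually makes training faster and more stable.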
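For a single feature, batch normalization in training mode can be sketched as below. `gamma` and `beta` are the learnable scale and shift, fixed here to 1 and 0 for simplicity. The running statistics that are used at inference time are left out of this sketch.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch to zero mean and unit variance."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```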
2–7 Dropout
Randomly zeroing a fraction of the activations during training. Significant for avoiding overfitting.
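A sketch of "inverted" dropout, one common formulation: activations are dropped with probability `p` during training and the survivors are scaled by 1 / (1 - p), so nothing has to change at inference time. The function signature is my own, and `p` is assumed to be below 1.

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Zero each activation with probability p (training only), rescale the rest."""
    if not training or p == 0.0:
        return list(activations)  # inference: pass everything through unchanged
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    return [a * scale if rng.random() >= p else 0.0 for a in activations]
```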
2–8 Hyper-parameter
Tuning parameters like:
Number of neurons in each layer
Batch size
Learning rate
Weight decay
2–9 Data splitting & cross-validation
It is better to adopt three-way data splits together with k-fold cross-validation.
(3) Neural Network Models
3–1 Perceptron
A neuron is the smallest unit of a neural network. A perceptron is a single-layer neural network.
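As a small illustration, the classic perceptron learning rule can learn the AND gate. This is a sketch under my own assumptions (step activation, zero initialization, the AND data set); it is not taken from the original report.

```python
def predict(weights, bias, x):
    """Step activation on the weighted sum: fire (1) only if the sum is positive."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, lr=1.0, epochs=20):
    """Perceptron rule: nudge the weights by the error on each sample."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Truth table of the AND gate: ((x1, x2), target)
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
```

After `weights, bias = train_perceptron(AND_GATE)`, `predict` reproduces the AND truth table. A single perceptron can only do this for linearly separable data, which is why XOR needs more than one layer.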
3–2 FNN
A feedforward neural network (FNN) is an artificial neural network in which the connections between the nodes do not form a cycle.
3–3 MLP
A multilayer perceptron (MLP) is a class of feedforward ANN.
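A tiny MLP forward pass shows what the extra layer buys: with one hidden layer of two ReLU units, the network can compute XOR, which a single perceptron cannot. The weights below are hand-picked for illustration rather than learned, and the `dense` helper is a name of my own.

```python
def relu(x):
    return max(0.0, x)

def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: one weighted sum (plus bias) per output neuron."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs

def xor_mlp(x1, x2):
    """2 inputs -> 2 hidden ReLU units -> 1 linear output (hand-set weights)."""
    hidden = dense([x1, x2],
                   weights=[[1.0, 1.0], [1.0, 1.0]],
                   biases=[0.0, -1.0],
                   activation=relu)
    (out,) = dense(hidden, weights=[[1.0, -2.0]], biases=[0.0])
    return out
```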
3–4 CNN
A CNN (convolutional neural network) is a kind of FNN. A fully connected layer (or linear layer) is too complex and loses all spatial information, whereas a CNN avoids these issues and leverages convolution layers and pooling layers to yield outstanding real-world results in computer vision. [2]
CNNs have two major merits in computer vision: [8]
Translation invariance
Spatial hierarchies of patterns
Popular CNN architectures: [7][9]
LeNet
AlexNet
ResNet
GoogLeNet
VGGNet
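ImageNet (strictly speaking a benchmark dataset rather than an architecture; many of the networks above became known through its competition)
Moreover, other major concepts of CNN: [6]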
Conv2d (Conv2D)
Pooling (MaxPooling2D)
Nonlinear activator — ReLU
Transfer learning
Pre-convoluted features
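The two core operations, convolution and max pooling, can be sketched on nested lists. This is a minimal single-channel version with no padding and stride 1 for the convolution. As in deep learning libraries, the "convolution" is computed as a cross-correlation (the kernel is not flipped). The function names mirror the layer names above but are my own plain-Python helpers.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2d(fmap, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]
```

A 3×3 kernel on a 4×4 image yields a 2×2 feature map, and 2×2 max pooling on a 4×4 map also yields 2×2. The same small kernel is reused at every position, which is where the translation invariance comes from.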
(4) Neural Network in NLP
NLP (natural language processing) developed before ANNs (artificial neural networks) became feasible, but it did not prosper until ANNs were brought into it. The classic NLP book “Natural Language Processing with Python” [10], published in 2009, covers only statistical language modeling and does not mention any ANN methods.
4–1 Data Pre-processing
Converting text into a matrix before feeding it into an NN:
Expanding contractions with a contraction dictionary
Tokenization
Deleting stopwords
Stemming
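The four steps above can be sketched end to end. Everything here is a toy assumption: the contraction dictionary and stopword list are tiny hand-made samples, and the stemmer is a naive suffix stripper. A real project would rely on an established NLP library for these steps.

```python
import re

# Tiny hypothetical dictionaries, for illustration only
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"a", "an", "the", "is", "it", "to", "and", "of"}

def expand_contractions(text):
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Very naive stemming: strip one common suffix from longer words."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(expand_contractions(text.lower()))
    return [stem(t) for t in tokens if t not in STOPWORDS]
```

For instance, `preprocess("It's raining and the cats don't like walking")` gives `['rain', 'cat', 'do', 'not', 'like', 'walk']`. Negations such as "not" are deliberately kept out of the stopword list here, since they matter for tasks like sentiment classification.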
4–2 BOW approaches
Then, we can treat the text as a bag of words (BOW) and vectorize it with either one-hot encoding or word embedding.
One-hot encoding: A traditional NLP approach, usually used with TF-IDF. The resulting vectors are very sparse and suffer from the curse of dimensionality, so it is rarely used with deep learning. It also often comes with an n-gram model.
Word embedding: Converting the data into a dense matrix. Word2vec is a popular method.
However, BOW approaches lose the sequential nature of text, so we then turn to RNNs, which make good use of it. [2]
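A sketch of the sparse representations discussed above: one-hot vectors, bag-of-words counts and TF-IDF weights. The helper names are mine, and the TF-IDF formula (term frequency times log of inverse document frequency, without smoothing) is one of several common variants.

```python
import math

def build_vocab(docs):
    """Map every distinct token to a column index."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def one_hot(token, vocab):
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs; word order is discarded."""
    vec = [0] * len(vocab)
    for tok in doc:
        vec[vocab[tok]] += 1
    return vec

def tf_idf(doc, docs, vocab):
    """tf = count / document length, idf = log(N / document frequency)."""
    vec = [0.0] * len(vocab)
    for tok in set(doc):
        tf = doc.count(tok) / len(doc)
        df = sum(tok in d for d in docs)
        vec[vocab[tok]] = tf * math.log(len(docs) / df)
    return vec
```

The loss of word order is easy to see: "dog bites man" and "man bites dog" map to exactly the same bag-of-words vector.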
4–3 CNN
CNNs solve problems in computer vision by learning features from images. On images, CNNs work by convolving across height and width. In the same way, time can be treated as the dimension to convolve over. 1-D CNNs sometimes perform better than RNNs and are computationally cheaper. Another use of CNNs in NLP is text classification. [2]
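Convolving over time can be sketched with a 1-D kernel sliding along a sequence. This is a single-channel toy version with no padding, and the two function names are assumptions for the example; in NLP the sequence elements would normally be word vectors rather than single numbers.

```python
def conv1d(sequence, kernel):
    """Slide the kernel along the time axis (no padding, stride 1)."""
    k = len(kernel)
    return [sum(sequence[t + i] * kernel[i] for i in range(k))
            for t in range(len(sequence) - k + 1)]

def global_max_pool(features):
    """Keep the strongest response, wherever in the sequence it occurred."""
    return max(features)
```

For example, `conv1d([1, 2, 3, 4, 5], [1, 0, -1])` returns `[-2, -2, -2]`: the kernel measures the local trend and responds the same way at every position.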
4–4 RNN
A recurrent neural network (RNN), which is not an FNN, is designed for sequential data. RNNs can solve problems such as natural language understanding, document classification and sentiment classification. An RNN is trained with backpropagation through time (BPTT), that is, backpropagation applied to the network unrolled over the time steps, instead of plain backpropagation (BP). [11]
In practice, the simple version of RNN finds it difficult to remember context from the earlier parts of a sequence. LSTMs and other RNN variants address this problem by adding small neural networks (gates) inside the cell, which decide how much, and which, data to remember. [2]
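The recurrence of a simple RNN can be sketched as below: the same weights are applied at every time step, and the hidden state carries information forward. The weight names `w_xh`, `w_hh`, `b_h` and the tanh activation follow a common textbook formulation and are assumptions, not something fixed by the references.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) for one time step."""
    return [math.tanh(sum(w * xi for w, xi in zip(w_xh[j], x))
                      + sum(w * hi for w, hi in zip(w_hh[j], h_prev))
                      + b_h[j])
            for j in range(len(b_h))]

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run the same cell over the whole sequence, starting from a zero state."""
    h = [0.0] * len(b_h)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states
```

Unlike a bag-of-words vector, the final hidden state depends on the order of the inputs: feeding `[[1.0], [0.0]]` and `[[0.0], [1.0]]` through the same weights ends in different states.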
4–5 LSTM
A long short-term memory (LSTM) network is a kind of RNN capable of learning long-term dependencies. The simple RNN has problems like vanishing and exploding gradients when handling long sequences. LSTMs are designed to avoid the long-term dependency problem: remembering information for a long period of time is natural to their design. [2]
An LSTM has 5 parts: cell state, hidden state, input gate, forget gate and output gate. [12]
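How these parts interact in one time step can be sketched with a scalar LSTM cell. This follows the standard textbook formulation; the `params` dictionary layout and the name "candidate" (for the proposed new cell content) are my own labels for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps a name to (w_x, w_h, b)."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)         # how much new content to write
    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # new cell state (long-term memory)
    h = o * math.tanh(c)               # new hidden state (short-term output)
    return h, c
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is copied through almost unchanged, which is how an LSTM can carry information across many time steps.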
(5) References
[1] Kuo, M. (2020). ML02: 初探遺失值(missing value)處理 [ML02: A first look at handling missing values]. Retrieved from https://merscliche.medium.com/ml02-na-f2072615158e
[2] Subramanian, V. (2018). Deep Learning with PyTorch. Birmingham, UK: Packt Publishing.
[3] Bre, F. et al. (2020). An efficient metamodel-based method to carry out multi-objective building performance optimizations. Energy and Buildings, 206, (unknown).
[4] Guo, H. (2017). How do I implement the PReLU on Tensorflow?. Retrieved from https://www.quora.com/How-do-I-implement-the-PReLU-on-Tensorflow
[5] Endicott, S. (2017). Game Applications of Deep Neural Networks. Retrieved from https://bit.ly/2G8nUIQ
[6] Dutta-Roy, T. (2017). Medical Image Analysis with Deep Learning — II. Retrieved from https://medium.com/@taposhdr/medical-image-analysis-with-deep-learning-ii-166532e964e6
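[7] Saito, K. (斎藤康毅) (2016). ゼロから作るDeep Learning ―Pythonで学ぶディープラーニングの理論と実装 [Deep Learning from Scratch: Theory and Implementation of Deep Learning with Python] (Chinese translation: Deep Learning:用Python進行深度學習的基礎理論實作). Japan: O’Reilly Japan.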
[8] Chollet, F. (2018). Deep learning with Python. New York, NY: Manning Publications.
[9] Xing, M. (邢夢來) et al. (2018). PyTorch 深度學習與自然語言處理 [PyTorch Deep Learning and Natural Language Processing]. New Taipei City, Taiwan: 博碩文化.
[10] Bird, S. et al. (2009). Natural Language Processing with Python. California, CA: O’Reilly Media.
[11] Rao, D., & McMahan, B. (2019). Natural Language Processing with PyTorch. California, CA: O’Reilly Media.
[12] Ganegedara, T. (2018). Natural Language Processing with TensorFlow. Birmingham, UK: Packt Publishing.