Classifying Text Data into Multiple Categories
Single-Label Multi-class Classification
Overview
Basics:
- Goal
- Input & output
- Training
- Regularization
- Predicting
Additional experiments:
- Deeper layers
- Wider layers
- Narrower layers
Part I: The Basics
Goal
Our goal is to classify a text into one of 46 different classes. Since this is not binary classification, the last layer is a densely connected layer of 46 nodes with a softmax activation function.
Input & Output
Our input is a text stored as a string, which we will encode into a one-hot vector. Our output will be a vector of 46 values, where each value represents a probability. Naturally, they all add up to 1, and the highest probability is the network's guess.
We will import the Reuters newswire data set that is part of the Keras datasets.
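Loading it is a one-liner. A minimal sketch, where the num_words=10000 cutoff (keep only the 10,000 most frequent words) is an assumption rather than the only sensible choice:

```python
from keras.datasets import reuters

# Keep only the 10,000 most frequently occurring words; rarer words are discarded.
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(
    num_words=10000)

print(len(train_data))  # 8,982 training newswires
print(len(test_data))   # 2,246 test newswires
```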
Encoding & Decoding
We encode the training data, in other words our input strings, into arrays of numbers. Each number signifies the rank of a word when all words are ranked by frequency of occurrence in the entire data set.
We will also encode our training labels into one-hot vectors (see the encoding sketch after this list). While the Keras data set doesn't provide any label strings, a very similar data set of Reuters newswires has labels such as:
wheat
corn
coffee
nat-gas
etc...
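A sketch of both encodings, assuming the 10,000-word cutoff from above; the helper name vectorize_sequences is just for illustration:

```python
import numpy as np
from keras.utils import to_categorical

def vectorize_sequences(sequences, dimension=10000):
    # Turn each list of word indices into a 10,000-dimensional vector of 0s and 1s.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# One-hot encode the integer labels into vectors of length 46.
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)
```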
Decoding the result is simple: the index with the highest probability in the 46-element output vector is the network's best guess.
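As a small illustration, assuming a trained model (defined in the next section) and the x_test array from above:

```python
predictions = model.predict(x_test)

print(predictions[0].shape)        # (46,): one probability per class
print(predictions[0].sum())        # ~1.0, thanks to the softmax
print(np.argmax(predictions[0]))   # index of the most likely class, i.e. the guess
```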
Architecture
We will use a 3-layer, densely connected neural network with no skip connections. The last layer, the output layer, has 46 nodes. This brings up an important point about neural network architectures: the hidden layers leading up to the output layer generally need at least as many nodes as the output layer, otherwise they become an information bottleneck.
Our optimizer is rmsprop. Our loss function is categorical_crossentropy, which means our neural network is always trying to minimize the cross-entropy between the actual label data and the network's current best guess.
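A minimal sketch of this architecture and compilation step in Keras, matching the 64, 64, 46 baseline used below and the 10,000-dimensional inputs from above:

```python
from keras import models, layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))  # one probability per class

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```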
Regularization
Our primary method of regularization so far is early stopping. We do that by plotting the accuracy and loss on the training and validation sets and simply looking at where the validation loss is lowest.
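A sketch of that inspection, assuming a history object returned by model.fit(...) with validation data (as in the next section):

```python
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```

The epoch where the validation curve bottoms out is where we stop training.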
The Code
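The full training loop is roughly the following sketch; the 1,000-sample validation split, 20 epochs, and batch size of 512 are assumptions, not prescriptions:

```python
# Hold out 1,000 samples as a validation set.
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

# After picking the best epoch from the curves above, retrain to that epoch and evaluate.
results = model.evaluate(x_test, one_hot_test_labels)
print(results)  # [test loss, test accuracy]
```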
Part II: Additional Experiments
Our baseline architecture used 3 layers with 64, 64, and 46 nodes and got 82.90% accuracy on the validation set. Let's see if more layers improve this accuracy.
Deeper Layers
While keeping everything else constant, we vary only the number of hidden layers (a model-builder sketch follows the list):
- At 5 layers, we get a best validation accuracy of 80.70%.
- At 4 layers, we get a best validation accuracy of 80.70%.
- At 3 layers, we got a best validation accuracy of 82.90%.
- At 2 layers, we get a best validation accuracy of 82.60%. One thing to note here is that we reach this peak accuracy much earlier, and with each epoch there is less over-fitting than in the deeper networks.
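A hypothetical helper for these depth experiments; the name build_model and its parameters are illustrative, and the layer count includes the output layer so that total_layers=3 reproduces the 64, 64, 46 baseline:

```python
def build_model(total_layers=3, width=64):
    # total_layers counts the hidden layers plus the 46-node output layer.
    model = models.Sequential()
    model.add(layers.Dense(width, activation='relu', input_shape=(10000,)))
    for _ in range(total_layers - 2):
        model.add(layers.Dense(width, activation='relu'))
    model.add(layers.Dense(46, activation='softmax'))
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```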
Wider or Narrower Layers
While keeping the depth at 3, we vary the number of nodes in the hidden layers (the sweep is sketched after the list):
- At 92 elements, we get peak accuracy of 81.90%.
- At 70 elements, we get peak accuracy of 82.10%.
- At 65 elements, we get peak accuracy of 82.20%.
- At 64 elements, we got peak accuracy of 82.90%.
- At 63 elements, we get peak accuracy of 81.80%.
- At 58 elements, we get peak accuracy of 81.70%.
- At 30 elements, we get peak accuracy of 81.90%.
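Using the same hypothetical build_model helper, the width sweep is just a loop:

```python
for width in [92, 70, 65, 64, 63, 58, 30]:
    model = build_model(total_layers=3, width=width)
    history = model.fit(partial_x_train, partial_y_train,
                        epochs=20, batch_size=512,
                        validation_data=(x_val, y_val))
    # The key may be 'val_acc' on older Keras versions.
    print(width, max(history.history['val_accuracy']))
```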
Other Articles
This post is part of a series of stories that explores the fundamentals of deep learning:
1. Linear Algebra Data Structures and Operations: Objects and Operations
2. Computationally Efficient Matrices and Matrix Decompositions: Inverses, Linear Dependence, Eigen-decompositions, SVD
3. Probability Theory Ideas and Concepts: Definitions, Expectation, Variance
4. Useful Probability Distributions and Structured Probabilistic Models: Activation Functions, Measure and Information Theory
5. Numerical Method Considerations for Machine Learning: Overflow, Underflow, Gradients and Gradient Based Optimizations
6. Gradient Based Optimizations: Taylor Series, Constrained Optimization, Linear Least Squares
7. Machine Learning Background Necessary for Deep Learning I: Generalization, MLE, Kullback-Leibler Divergence
8. Machine Learning Background Necessary for Deep Learning II: Regularization, Capacity, Parameters, Hyper-parameters
9. Principal Component Analysis Breakdown: Motivation, Derivation
10. Feed-forward Neural Networks: Layers, Definitions, Kernel Trick
11. Gradient Based Optimizations Under The Deep Learning Lens: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
12. Output Units For Deep Learning: Stochastic Gradient Descent, Cost Function, Maximum Likelihood
13. Hidden Units For Deep Learning: Activation Functions, Performance, Architecture
14. The Common Approach to Binary Classification: The most generic way to set up your deep learning models to categorize movie reviews
15. General Architectural Design Considerations for Neural Networks: Universal Approximation Theorem, Depth, Connections
16. Classifying Text Data into Multiple Classes: Single-Label Multi-class Classification
Up Next…
Coming up next is the comparison of S4TF and NumPy for neural networks. If you would like me to write another article explaining a topic in-depth, please leave a comment.
For the table of contents and more content click here.