ADVANCEMENTS IN MACHINE LEARNING in 2021
Top 4 Important Machine Learning and Deep Learning Papers You Should Read in 2021
These papers help us to keep up to date with the latest advancements in the world of AI.
Machine Learning has become one of the most important domains of Computer Science, touching just about everything related to Artificial Intelligence. Companies across industries are applying Machine Learning and developing products that take advantage of this domain to solve their problems more efficiently.
“Key research papers in natural language processing, conversational AI, computer vision, reinforcement learning, and AI ethics are published yearly”
The area of Machine Learning has far-reaching applications, from text, audio, image, and video processing to supervised and reinforcement learning. In this article, I have listed four novel Machine Learning papers that made a breakthrough in this field.
Single Headed Attention RNN: Stop Thinking With Your Head
In this paper, the Harvard graduate Stephen Merity, an independent researcher focused primarily on Machine Learning, NLP, and Deep Learning, introduces a state-of-the-art NLP model called the Single Headed Attention RNN, or SHA-RNN. The author demonstrates that a simple LSTM model equipped with single-headed attention can achieve state-of-the-art byte-level language modelling results on enwik8.
The author’s primary goal is to show that the entire field might have evolved in a different direction if we had instead been obsessed with a slightly different acronym and somewhat different results.
For full details, see the paper Single Headed Attention RNN: Stop Thinking With Your Head.
The central concept of the model architecture proposed by Stephen Merity is an LSTM combined with a single-headed attention block built on the usual three projections: queries (Q), keys (K), and values (V).
Each SHA-RNN layer contains only a single head of attention, which keeps the memory consumption of the model to a minimum by eliminating the need to update and maintain multiple attention matrices.
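To make the single-head idea concrete, here is a minimal sketch of scaled dot-product attention with exactly one head, using plain NumPy. This is illustrative only, not the author's actual code; the function and weight names (`Wq`, `Wk`, `Wv`) are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, Wq, Wk, Wv):
    """One attention head: project x to queries, keys, and values,
    then mix the values by softmax-scaled query-key similarity."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (seq, seq) similarity matrix
    return softmax(scores) @ V      # weighted sum of values
```

A multi-head layer would repeat this with several weight triples and concatenate the results; SHA-RNN keeps just the one triple, which is where the memory saving comes from.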
The Boom layer is strongly related to the large feed-forward layer found in Transformers and other architectures. Compared to a traditional down-projection layer, this block removes an entire matrix of parameters: it up-projects the input once, applies a Gaussian Error Linear Unit (GeLU), and then breaks the result into chunks and sums them back down, minimizing computation.
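A sketch of the chunk-and-sum idea as I read it from the paper (names and the tanh GeLU approximation are my own choices, not Merity's implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def boom(v, W_up, n_chunks):
    """Up-project with a single matrix to n_chunks * dim, apply GeLU,
    then sum the chunks back to dim -- no down-projection matrix."""
    u = gelu(v @ W_up)                      # (dim,) -> (n_chunks * dim,)
    return u.reshape(n_chunks, -1).sum(0)   # chunk-and-sum back to (dim,)
```

The standard Transformer feed-forward block would instead multiply `u` by a second, down-projection matrix; dropping it is the parameter saving the paper describes.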
Let’s look at an actual comparison. In 2016, Surprisal-Driven Zoneout, a regularization method for RNNs, achieved an outstanding compression score of 1.313 bpc on the Hutter Prize dataset, enwik8, which is a one-hundred-megabyte file of Wikipedia pages.
The SHA-RNN managed to achieve an even lower bits-per-character (bpc) score than the 2016 model. That is impressive. Bits per character is not a model but a metric, used notably in Alex Graves’s work on sequence generation: it measures how many bits a model needs on average to encode the next character given the past characters, so lower is better.
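The metric itself is simple to compute: average the negative log2 probability the model assigned to each observed character. A minimal sketch:

```python
import math

def bits_per_character(char_probs):
    """Average number of bits needed to encode each observed character,
    given the probability the model assigned to it; lower is better."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

# A model that assigns probability 0.5 to every observed character
# needs exactly 1 bit per character.
print(bits_per_character([0.5, 0.5, 0.5, 0.5]))  # 1.0
```

So a score of 1.313 bpc means the model compresses enwik8 to roughly 1.3 bits per byte of text, and SHA-RNN pushes below that.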
Further on, the Single Headed Attention RNN (SHA-RNN) managed to achieve strong state-of-the-art results with next to no hyper-parameter tuning, living entirely on a commodity desktop machine with a single Titan V GPU, one that made the author’s small studio apartment a bit too warm for his liking. Now that’s passion for Machine Learning.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
In this paper, the authors systematically study model scaling and find that carefully balancing network depth, width, and resolution leads to better performance. They demonstrate a new scaling method that uniformly scales all three dimensions using a simple yet highly effective compound coefficient.
The paper proposes a simple yet effective compound scaling method, described below:
Scaling any single dimension of a network (width, depth, or resolution) improves accuracy, but the caveat is that the gain diminishes as the model grows larger. Hence, it is critical to balance all three dimensions of a network (width, depth, and resolution) during CNN scaling to get improved accuracy and efficiency.
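The compound method ties all three dimensions to a single coefficient φ: depth scales as α^φ, width as β^φ, and resolution as γ^φ, with the base constants chosen so that α·β²·γ² ≈ 2, meaning each increment of φ roughly doubles the FLOPS. A sketch using the constants reported in the paper (α = 1.2, β = 1.1, γ = 1.15):

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet-style compound scaling: one coefficient phi scales
    depth, width, and resolution together. The constants satisfy
    alpha * beta**2 * gamma**2 ~ 2, so each +1 in phi ~doubles FLOPS."""
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    resolution_mult = gamma ** phi
    return depth_mult, width_mult, resolution_mult
```

At φ = 0 all multipliers are 1 (the baseline EfficientNet-B0); larger φ values produce the B1-B7 family.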
The compound scaling method consistently improves model accuracy and efficiency when scaling up existing models such as MobileNet (+1.4% ImageNet accuracy) and ResNet (+0.7%), compared to conventional scaling methods.
Scaling doesn’t change the layer operations; instead, the authors obtained their base network with a Neural Architecture Search (NAS) that optimizes for both accuracy and FLOPS. The scaled EfficientNet models consistently reduce parameters and FLOPS by an order of magnitude (up to 8.4x fewer parameters and up to 16x fewer FLOPS) compared to existing ConvNets such as ResNet-50 and DenseNet-169.
EfficientNets also achieved state-of-the-art accuracy on 5 of 8 transfer-learning datasets, such as CIFAR-100 (91.7%) and Flowers (98.8%), with an order of magnitude fewer parameters (up to 21x parameter reduction), suggesting that EfficientNets also transfer well.
GPT-3
GPT-3, short for Generative Pre-trained Transformer 3, is an autoregressive language model developed by OpenAI with 175 billion parameters. In the paper linked above, the researchers at OpenAI used the same model and architecture as the previous state-of-the-art language model, GPT-2, including modified initialisation, pre-normalisation, and reversible tokenisation, along with alternating dense and locally banded sparse attention patterns in the layers of the transformer.
GPT-3 can produce fluent, coherent continuations of a sentence, making it seem as though a human wrote them. These days, language models can perform various functions, but perhaps the most popular is the generation of novel text. This is where GPT-3 comes as a breakthrough.
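A headline capability of the GPT-3 paper is few-shot, in-context learning: you prepend a handful of solved examples to the prompt, and the frozen model continues the pattern. A minimal sketch of assembling such a prompt as a plain string (no particular API assumed; the English-to-French examples mirror the ones shown in the paper):

```python
def few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: a task description, then a handful
    of solved examples, then the new query for the model to complete."""
    lines = [task]
    for inp, out in examples:
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
```

The model receives this whole string and simply predicts what comes after the final "Output:", with no gradient updates, which is what makes the approach so striking.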
Generative Pretraining from Pixels, or Image GPT for short, extends the capabilities of GPT to images. Also developed by OpenAI, the paper introduces itself as follows:
We find that, just as a large transformer model trained on language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting.
In the domain of language, unsupervised learning algorithms that rely on word prediction (like GPT-2 and BERT) have been extremely successful, and the results have been proven. One possible reason text works so well is that downstream language tasks appear naturally in text: questions, for example, are often followed by answers.
In contrast, sequences of pixels do not come with labels for the images they belong to. However, large models in the style of GPT-2 can still learn useful image representations by performing next-pixel prediction, which only becomes practical with large transformers.
This is where unsupervised learning comes into play. Generative sequence modelling is a universal unsupervised learning algorithm that allows a transformer to be directly applied to any data type without additional engineering.
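The "no additional engineering" claim amounts to this: flatten each image into a 1-D sequence and train on next-token prediction exactly as for text. A toy sketch of that data preparation (illustrative only; OpenAI's actual pipeline additionally clusters colors into a 512-token palette before flattening):

```python
import numpy as np

def to_sequence(image):
    """Flatten an HxW image into a 1-D token sequence in raster order."""
    return image.reshape(-1)

def next_token_pairs(seq):
    """Autoregressive training pairs: predict token t from tokens < t."""
    return seq[:-1], seq[1:]

image = np.arange(9).reshape(3, 3)   # toy 3x3 "image"
inputs, targets = next_token_pairs(to_sequence(image))
```

Once the pixels are tokens, the transformer decoder that models text can model images with no architectural change, which is exactly the universality being claimed.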
With the help of unsupervised learning, Image GPT is capable of learning powerful image features. Image GPT is similar to GPT-2 in that it is made up of transformer decoder blocks: the decoder takes an input sequence of discrete tokens and outputs a d-dimensional embedding for each position. When it comes to evaluation, the results are promising. After fine-tuning, iGPT-L achieves 99.0% accuracy on CIFAR-10 and 88.5% on CIFAR-100.
In the end
More and more papers are published as the Machine Learning community grows every year. Our part is to read up on the notable new articles to equip ourselves with the latest state-of-the-art breakthroughs in the community. Keep reading, fellow enthusiasts!
If I have managed to retain your attention to this point, please leave a comment if you have any advice for this series, as it would significantly increase my knowledge and improve my writing. Prem Kumar is a selfless learner who is passionate about the everyday data that revolves around us. Please connect with me on LinkedIn, mentioning this story, if you want to talk about this and the future developments that await.