Akira’s Machine Learning news — #issue 31

Akihiro FUJII
Published in Analytics Vidhya · 6 min read · Oct 11, 2021

Featured Paper/News in This Week.

  • A published study reports a sudden improvement in generalization performance starting from random-level predictions: overfitting sets in at about 10² steps, while the jump in generalization is reported at about 10⁶ steps. Weight decay appears to be the key to this generalization. Yannic Kilcher proposed the hypothesis that “weight decay may enable the models to draw a smooth line for generalization while suppressing abrupt changes,” which I found very interesting.
  • Researchers proposed that the strong results of transformer models on image tasks, such as ViT, may be due to patching rather than to the transformer itself. Reaching 96% accuracy on CIFAR-10 trained from scratch is an outstanding achievement for a ViT-style architecture, which normally requires a large amount of data.

— — — — — — — — — — — — — — — — — — –

In the following sections, I will introduce various articles and papers, covering not only the items above but also the following five topics.

  1. Featured Paper/News in This Week
  2. Machine Learning Use Case
  3. Papers
  4. Articles related to machine learning technology
  5. Other Topics

— — — — — — — — — — — — — — — — — — –

1. Featured Paper/News in This Week

After some time, the neural net suddenly generalizes. mathai-iclr.github.io

[GROKKING: GENERALIZATION BEYOND OVERFITTING ON SMALL ALGORITHMIC DATASETS]
They found that the smaller the dataset, the longer it takes the neural net to generalize. While overfitting occurs in about 10² steps, generalization to the validation set requires about 10⁵ steps, at which point accuracy suddenly jumps from random-level results. Weight decay was essential for this generalization.
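As a rough illustration of the role weight decay plays here, a single SGD update with decoupled weight decay shrinks every weight slightly toward zero on top of the gradient step (a minimal NumPy sketch, not the paper's actual training code; the function name and hyperparameters are my own):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, wd=1e-2):
    """One SGD step with decoupled weight decay: besides following the
    gradient, every weight is shrunk slightly toward zero each step."""
    return w - lr * grad - lr * wd * w

w = np.array([1.0, -2.0])
grad = np.zeros(2)
# with a zero gradient, weight decay alone shrinks the weights
w_next = sgd_step_with_weight_decay(w, grad)
print(w_next)  # [ 0.999 -1.998]
```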

Is patching more critical than the transformer? openreview.net

[Patches Are All You Need? | OpenReview]
This is a study of a Transformer-Encoder-like mechanism built from convolutions; it can be implemented in six lines of PyTorch, is more efficient than ViT or MLP-Mixer, and achieves 96% accuracy even on small datasets such as CIFAR-10. From this result, the authors suggest that patching the image is more important than the transformer itself.
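Since the paper's central claim is about patching, here is a minimal NumPy sketch of the non-overlapping "patchify" step shared by ViT-style models (my own illustration, not the paper's six-line PyTorch implementation):

```python
import numpy as np

def extract_patches(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector, as in ViT/ConvMixer-style patch embedding."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)        # (num_patches, patch_dim)

img = np.random.rand(32, 32, 3)                  # CIFAR-sized image
patches = extract_patches(img, 4)
print(patches.shape)  # (64, 48)
```

In the actual models, each flattened patch is then mapped to an embedding by a single linear layer (equivalently, a strided convolution).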

— — — — — — — — — — — — — — — — — — –

2. Machine Learning Use Case

China will continue to be the “world’s factory” with AI technology. kaifulee.medium.com

This article discusses China’s application of AI technology. China is often called the world’s factory, and the author argues this will remain true into the 2020s: as labor costs rise with slowing population growth, China is using AI technology to innovate in manufacturing and other areas.

— — — — — — — — — — — — — — — — — — –

3. Papers

Finding candidates for gravitational lensing with self-supervised learning. arxiv.org

[2110.00023] Mining for strong gravitational lenses with self-supervised learning
This research finds candidate gravitational-lens images via self-supervised learning. First, a model pre-trained with self-supervised learning is used to find candidates by similarity to known lens images. A classification model is then built on top using linear regression and other methods. The authors state that this approach could significantly lower the barrier to entry when dealing with survey data and open up many avenues for collaboration.
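The candidate-mining step can be pictured as ranking survey images by embedding similarity to known lenses. A minimal NumPy sketch, where the function name, shapes, and the use of cosine similarity are my own assumptions:

```python
import numpy as np

def rank_by_similarity(known_embs, candidate_embs, k=3):
    """Rank candidate images by cosine similarity of their (self-supervised)
    embeddings to the mean embedding of known lens examples."""
    query = known_embs.mean(axis=0)
    query /= np.linalg.norm(query)
    cand = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = cand @ query
    top = np.argsort(-sims)[:k]       # indices of the k most similar candidates
    return top, sims[top]

rng = np.random.default_rng(0)
known = rng.normal(size=(5, 16))      # embeddings of known lenses
candidates = rng.normal(size=(100, 16))
idx, scores = rank_by_similarity(known, candidates)
print(idx.shape)  # (3,)
```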

Transformer optimized with an evolutionary algorithm. arxiv.org

[2109.08668] Primer: Searching for Efficient Transformers for Language Modeling
This is a study applying NAS with an evolutionary algorithm to the Transformer for language modeling. The search found MDHA, which applies convolution across attention heads, and Squared ReLU, which squares the ReLU output; Primer, equipped with both, reduces training time to 1/3 to 1/4 of the original.
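Of the two discovered components, Squared ReLU is simple enough to sketch directly (a minimal NumPy version; in the paper it replaces the activation inside the Transformer's feed-forward blocks):

```python
import numpy as np

def squared_relu(x):
    """Primer's Squared ReLU activation: relu(x) squared."""
    return np.maximum(x, 0.0) ** 2

out = squared_relu(np.array([-1.0, 0.5, 2.0]))
print(out)  # negative inputs -> 0, positive inputs -> x**2
```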

Re-evaluating ResNet and re-establishing the baseline training procedure. arxiv.org

[2110.00476] ResNet strikes back: An improved training procedure in timm
This study re-evaluates ResNet using the latest regularization and data augmentation techniques, improving Top-1 accuracy from 75.3% to 80.4%. ResNet had been scored differently by different papers; the authors have made their training procedure public through timm and shared it as a new baseline.

ViT can learn semantic segmentation information through self-supervised learning. arxiv.org

[2104.14294] Emerging Properties in Self-Supervised Vision Transformers
This is a study of self-supervised learning for ViTs. The authors propose DINO, which trains ViTs with a distillation-like mechanism so that their output distributions are consistent across multiple cropped views of the same image.
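The distillation-like objective can be sketched as a cross-entropy between a centered, sharpened teacher distribution and the student distribution for another crop. A minimal NumPy sketch; the temperatures, centering term, and names follow my reading of the paper and are illustrative only:

```python
import numpy as np

def softmax(z, temp):
    z = z / temp
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between a sharpened, centered teacher distribution and
    the student distribution for a different crop of the same image."""
    p_teacher = softmax(teacher_logits - center, t_teacher)  # centering + sharpening
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

rng = np.random.default_rng(0)
s = rng.normal(size=(4, 8))   # student outputs for 4 crops
t = rng.normal(size=(4, 8))   # teacher outputs for matching crops
loss = dino_loss(s, t, center=np.zeros(8))
print(loss > 0)  # True
```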

Achieving high performance with both less and more data. arxiv.org

[2106.04803] CoAtNet: Marrying Convolution and Attention for All Data Sizes
This is research on combining Transformer and CNN. Self-attention is first equipped with relative positional encoding; then, at each stage, either a CNN or a Transformer layer is chosen, and the stages are stacked. The resulting model achieves SotA performance on ImageNet and performs well with both small and large amounts of data.
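The relative-positional-encoding ingredient can be sketched as an additive bias on the attention logits (a minimal NumPy illustration; in CoAtNet the bias is a learned, translation-invariant term rather than the random values used here):

```python
import numpy as np

def attention_with_relative_bias(q, k, v, rel_bias):
    """Scaled dot-product attention with an additive relative-position bias,
    giving attention a convolution-like, position-aware ingredient."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + rel_bias   # (n, n) logits
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
bias = rng.normal(size=(n, n))   # stands in for a learned relative-position bias
out = attention_with_relative_bias(q, k, v, bias)
print(out.shape)  # (4, 8)
```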

— — — — — — — — — — — — — — — — — — –

4. Articles related to machine learning technology

Background and Foreground Separation. ai.googleblog.com

This is a blog post by Google on “Omnimatte: Associating Objects and Their Effects in Video” (CVPR 2021). By letting CNNs learn correlations such as those between people and their shadows, the method can separate even weakly correlated parts, splitting a video into background and foreground layers.

Differences between ViTs and CNNs. syncedreview.com

A commentary article on [Do Vision Transformers See Like Convolutional Neural Networks?], which discusses the differences between ViT and CNNs. It states that ViT’s skip connections are more influential for representation propagation than ResNet’s and may substantially impact performance and representation similarity.
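The underlying paper compares layer representations with centered kernel alignment (CKA). A minimal linear-CKA sketch in NumPy (my own illustration of the measure, not the paper's code):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA similarity between two representations of the same inputs,
    each shaped (n_samples, n_features); 1.0 means identical up to rotation/scale."""
    X = X - X.mean(axis=0)   # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return float(num / den)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
print(round(linear_cka(X, X), 3))  # 1.0: a representation is identical to itself
```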

— — — — — — — — — — — — — — — — — — –

5. Other Topics

TensorFlow Similarity. blog.tensorflow.org

[Introducing TensorFlow Similarity — The TensorFlow Blog]
An introduction to TensorFlow Similarity, which supports nearest-neighbor search over learned embeddings and can be used in about 20 lines of code.

— — — — — — — — — — — — — — — — — — –

Other blogs

About Me

Manufacturing Engineer/Machine Learning Engineer/Data Scientist / Master of Science in Physics / http://github.com/AkiraTOSEI/

On Twitter, I post one-sentence paper commentaries.
