Generalization
Improving the model's generalization capacity is something we have long pursued. There are two main ways of improvement: …
Aug 9, 2020
Do we really have to decrease the training-set loss to zero?
Obviously not. Generally speaking, we use the training set to train the model, but performance is evaluated on the validation set. When…
Aug 8, 2020
The multi-head attention-based transformer models are quite dominant nowadays.
Bottleneck of attention
Aug 2, 2020
It’s never easy to explain a deep learning algorithm.
The time and space complexity of the transformer is O(n^2), which is obviously not optimal. Many methods have aimed to improve this point…
Jul 30, 2020
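Where that O(n^2) comes from: the attention score matrix Q·Kᵀ has one entry per pair of tokens, so a sequence of length n produces an n × n matrix. A minimal NumPy sketch (the function name and dimensions are illustrative, not from the post):

```python
import numpy as np

def attention_score_size(n, d=64):
    """Return the number of entries in the attention score matrix
    for a sequence of n tokens with d-dimensional keys/queries."""
    q = np.random.randn(n, d)
    k = np.random.randn(n, d)
    scores = q @ k.T          # shape (n, n): one score per token pair
    return scores.size        # n * n entries -> O(n^2) time and memory

for n in (128, 256, 512):
    print(n, attention_score_size(n))
```

Doubling the sequence length quadruples the score matrix, which is exactly the bottleneck the efficient-attention methods mentioned above try to remove.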
Self attention (BERT)
Self-attention means K = V = Q. If the input is a sentence, then every word needs its attention calculated with all the words in…
Jul 28, 2020
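The K = V = Q idea can be sketched in a few lines of NumPy: the same input matrix plays all three roles, and each word's output is a weighted average over every word in the sentence. This is a simplified illustration without the learned projection matrices a real transformer layer uses:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Self-attention with Q = K = V = x (no learned projections)."""
    q, k, v = x, x, x                      # the same input in all three roles
    d = x.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (n, n): every word vs. every word
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v                     # (n, d): weighted mix of all words

x = np.random.randn(5, 8)                  # a "sentence" of 5 words, 8-dim each
out = self_attention(x)
print(out.shape)                           # (5, 8)
```

Note that `scores` is n × n — the pairwise computation described above, and the source of the quadratic cost discussed in the previous post.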
BERT attention main ideas in 2 mins
It should be good enough to understand the three main ideas of BERT at the beginning.
Jul 25, 2020