Essential for fine-tuning T5 v1.1 and mT5 models

Originally published at https://blog.ceshine.net. Some of the mathematical expressions and explanations are removed in the Medium version. Please check the link for the complete article.

Motivation

The Adafactor optimizer, in my experience, can provide much better convergence when fine-tuning the T5 v1.1 and mT5[1] pre-trained models. However, I encountered problems when using a custom learning rate scheduler with the Adafactor implementation in the huggingface/transformers library. I combed through the paper and the source code to find and fix the cause of the problem, which turned into a tiny contribution to the library.
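For context: when no external learning rate is supplied, Adafactor falls back to a relative-step schedule of roughly the form rho_t = min(1e-2, 1/sqrt(t)), with a slow 1e-6 * t ramp in the warmup_init variant. A minimal sketch of that schedule (the function name is mine, not the library's):

```python
import math

def relative_step_size(step: int, warmup_init: bool = False) -> float:
    """Relative step size Adafactor uses when no external LR is given.

    Sketch of the schedule from the Adafactor paper:
    rho_t = min(min_step, 1/sqrt(t)), where min_step is 1e-2
    (or 1e-6 * t when warmup_init is enabled).
    """
    min_step = 1e-6 * step if warmup_init else 1e-2
    return min(min_step, 1.0 / math.sqrt(step))

# The internal schedule is capped early on, then decays like 1/sqrt(t):
print(relative_step_size(1))        # 0.01
print(relative_step_size(40_000))   # 0.005
```

If you want to drive the learning rate with your own scheduler instead, this internal schedule has to be switched off (e.g., by disabling the relative-step behavior), which is exactly the area where the subtleties I ran into live.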

To further squeeze value from the time I’ve invested…


Building competitive self-explaining NLP models

Introduction

Model interpretability is crucial if we want to use AI models to make high-stakes decisions (e.g., making medical diagnoses, preventing suicides, etc.). In NLP, one common way to get interpretability is to extract information from the trained models. For example, some use gradient-based input attribution techniques, some perturb the input to get explanations, and some use influence functions to find the training examples most influential for a given input sequence. Another way is to make the model intrinsically explainable (e.g., a decision tree).

Selective rationalization creates self-explaining models without requiring specialized designs or architectural choices for the base model. This paper…


Useful for fine-tuning on a subset of available languages

Motivation

Q: Why and when would we want to trim down the vocabulary size of a pretrained model?

A: When a large portion of the vocabulary isn’t used in your downstream task, it makes sense to get rid of the redundant part of the vocabulary to increase the model’s speed.

For example, Google’s multilingual version of T5, mT5, was pretrained on 101 languages. Imagine if we only use English, Japanese, and Chinese in our downstream text generation task. …
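The core operation can be sketched as follows. The function name is mine, and this is only the embedding-matrix half of the job; a real implementation would also rebuild the SentencePiece vocabulary and any tied output projection:

```python
import numpy as np

def trim_embeddings(embeddings: np.ndarray, keep_ids: list):
    """Keep only the embedding rows for the tokens we actually use.

    Returns the smaller matrix and a mapping from old token ids to new
    ones, so the tokenizer and model can be re-wired consistently.
    """
    new_embeddings = embeddings[keep_ids]
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    return new_embeddings, old_to_new

# Toy example: a 6-token vocabulary trimmed down to 3 kept tokens.
emb = np.arange(12, dtype=np.float32).reshape(6, 2)
small, mapping = trim_embeddings(emb, keep_ids=[0, 2, 5])
print(small.shape)   # (3, 2)
print(mapping[5])    # 2
```

For a model like mT5, where the vocabulary embedding is a large share of total parameters, shrinking this matrix is where most of the size savings come from.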


The global step is not what you think it is

PyTorch Lightning reached 1.0.0 in October 2020. I hadn’t been fully satisfied with the flexibility of its API, so I had continued to use my own pytorch-helper-bot. That changed with the 1.0.0 release: now I use PyTorch Lightning to develop training code that supports both single- and multi-GPU training.

However, one thing that bugged me is that logging doesn’t work as expected when I set the number of gradient accumulation batches to larger than one. The steps recorded in the training loop are still the raw step numbers, but those recorded in validation are divided by the number of gradient accumulation…
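The mismatch can be illustrated with a toy counter, assuming the optimizer steps once every `accumulate_grad_batches` batches (the names here are mine for illustration, not Lightning’s API):

```python
def count_steps(num_batches: int, accumulate_grad_batches: int):
    """Simulate a training loop with gradient accumulation.

    Returns (raw_batch_steps, optimizer_steps). With accumulation, the
    optimizer only steps once every `accumulate_grad_batches` batches,
    so the two counters drift apart, which is the mismatch described
    above: training logs use one counter, validation logs the other.
    """
    raw_batch_steps = 0
    optimizer_steps = 0
    for batch_idx in range(num_batches):
        raw_batch_steps += 1           # what the training loop logs
        if (batch_idx + 1) % accumulate_grad_batches == 0:
            optimizer_steps += 1       # the "effective" global step
    return raw_batch_steps, optimizer_steps

print(count_steps(100, 4))  # (100, 25)
```

With accumulation of 4, a point logged at training step 100 and one logged at validation step 25 refer to the same moment in training, which makes the resulting curves hard to line up.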


A case study: detecting credit fraud

Introduction

Recently I came across the article “How to Generate Synthetic Data? — A synthetic data generation dedicated repository”. The post introduces Wasserstein GAN[1] and demonstrates how to use it to generate synthetic (fake) data that looks very “real” (i.e., has similar statistical properties to the real data). This topic interests me, as I’ve been wondering whether we can reliably generate augmented data for tabular datasets. …


A Worrying Analysis of Recent Neural Recommendation Approaches

Introduction

Today we’re examining this very interesting and alarming paper in the field of recommender systems — Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. It also has an extended version still under review — A Troubling Analysis of Reproducibility and Progress in Recommender Systems Research.

The first author of the papers also gave an overview of the first paper and answered some questions in this YouTube video (he also mentioned some of the content in the extended version, e.g., the information leakage problem):

Key Points

  1. Reproducibility: less than half of the top papers (7⁄18 in…


A great tool for eliminating pipeline debts

(This post is also published on my personal blog.)

Introduction

If you are familiar with software engineering, you’d know that automated testing and continuous integration can save you a lot of debugging time when a project is complex enough and/or involves collaboration between contributors. They help you make sure that new code doesn’t break anything it’s not supposed to, and they quickly narrow down the places that could go wrong when failures inevitably happen.

For data scientists, we have to test not only against code but also against data to make sure our data pipelines are working correctly. Just…
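As a library-free illustration of “testing against data”, here is a toy expectation checker (the rules, field names, and function name are invented for this example; a dedicated validation tool would let you express the same checks declaratively):

```python
def validate_rows(rows):
    """Run simple data expectations against a batch of records.

    Returns a list of human-readable failures; an empty list means
    every check passed. This mimics, in miniature, what a data
    validation step in a pipeline does.
    """
    failures = []
    for i, row in enumerate(rows):
        if row.get("amount") is None:
            failures.append(f"row {i}: 'amount' must not be null")
        elif not (0 <= row["amount"] <= 10_000):
            failures.append(f"row {i}: 'amount' out of expected range")
        if row.get("currency") not in {"USD", "EUR"}:
            failures.append(f"row {i}: unexpected 'currency' value")
    return failures

good = [{"amount": 42.0, "currency": "USD"}]
bad = [{"amount": None, "currency": "BTC"}]
print(validate_rows(good))       # []
print(len(validate_rows(bad)))   # 2
```

Run in CI against a fresh sample of production data, checks like these catch upstream schema and distribution changes before they silently corrupt downstream models.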


If your dataset is small enough

(This post is also published on my personal blog.)

Introduction

Recently I was asked this question (paraphrasing):

I have a small image dataset that I want to train on Google Colab and its free TPU. Is there a way to do that without having to upload the dataset as TFRecord files to Cloud Storage?

First of all, if your dataset is small, I’d say training on a GPU wouldn’t be much slower than on a TPU. But they were adamant that they wanted to see how fast training on a TPU can be. That’s fine, and the answer is yes. …


Permutation importance can be very misleading

(This post is also published on my personal blog. It’s recommended to read there since the math notations are not as readable here due to the limitation of Medium.)

This post summarizes the findings and suggestions from the paper “Please Stop Permuting Features ‒ An Explanation and Alternatives” by Giles Hooker and Lucas Mentch.

(Note: Permutation importance is covered in one of my previous posts: Feature Importance Measures for Tree Models — Part I.)

Permutation importance (permuting features without retraining) is biased toward correlated features. Avoid using it; use one of the following alternatives instead:

  1. Conditional variable importance
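For reference, the permute-without-retraining scheme being critiqued can be sketched as follows (the helper function and the stand-in model are mine; real usage would wrap a fitted estimator):

```python
import numpy as np

def permutation_importance(model_fn, X, y, feature_idx, rng):
    """Rise in MSE after shuffling a single feature column.

    This is the permute-without-retraining scheme the paper warns
    about: shuffling a feature that is correlated with others creates
    unrealistic input combinations, forcing the model to extrapolate
    and inflating the importance estimate.
    """
    base_error = np.mean((model_fn(X) - y) ** 2)
    X_perm = X.copy()
    rng.shuffle(X_perm[:, feature_idx])  # break the feature-target link
    perm_error = np.mean((model_fn(X_perm) - y) ** 2)
    return perm_error - base_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X[:, 0] * 2.0                          # only feature 0 matters
model_fn = lambda X: X[:, 0] * 2.0         # a "perfect" stand-in model
print(permutation_importance(model_fn, X, y, 0, rng) > 0)  # True
print(permutation_importance(model_fn, X, y, 1, rng))      # 0.0
```

With independent features, as in this toy example, the scores behave sensibly; the paper’s point is that the same procedure misbehaves once features are correlated.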


Improving accuracy for low-resource languages

(This post is also published on my personal blog.)

The Google AI Blog post

This post on the Google AI Blog explains the premise, background, and related works of this paper pretty well. I’m not going to repeat them in this post. Instead, I’ll try to fill in some of the gaps I see as someone who is familiar with this topic but doesn’t follow the latest developments very closely.

Firstly, I want to point out something in the Google AI post that confuses me. In the first paragraph, the authors stated:

While these existing multilingual approaches yield good overall performance across a number…

Ceshine Lee

Data Geek. Maker. Researcher. Twitter: @ceshine_en
