A brief analysis of huggingface’s implementation


Gradient checkpointing is a technique that reduces the memory footprint during model training (from O(n) to O(sqrt(n)) in the OpenAI example, where n is the number of layers). The price is some computing overhead (multiple forward passes on the same input).
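To see where the square root comes from, here is a rough back-of-the-envelope sketch. It is not the huggingface or OpenAI implementation. It assumes every layer's activation has the same size, that the network is split into equal segments with one checkpoint kept per segment, and that one segment at a time is recomputed during the backward pass. The function names are hypothetical.

```python
import math


def peak_stored_activations(n_layers: int, segment_len: int) -> int:
    """Approximate number of activations held in memory at once.

    Assumption: one checkpoint is stored per segment, plus the activations
    of the single segment currently being recomputed for the backward pass.
    """
    n_checkpoints = math.ceil(n_layers / segment_len)
    return n_checkpoints + segment_len


def best_segment_len(n_layers: int) -> int:
    """Brute-force search for the segment length with the lowest peak."""
    return min(
        range(1, n_layers + 1),
        key=lambda k: peak_stored_activations(n_layers, k),
    )


n = 100
k = best_segment_len(n)
peak = peak_stored_activations(n, k)
print(f"layers={n}, best segment length={k}, peak activations={peak}")
```

Under these assumptions the peak is roughly n/k + k, which is smallest when k is close to sqrt(n). Without checkpointing, all n activations would be kept.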

Essential for fine-tuning T5 v1.1 and mT5 models

Originally published at . Some of the mathematical expressions and explanations are removed in the Medium version. Please check the link for the complete article.


The , in my experience, can provide much better convergence than fine-tuning the and [1] pre-trained models. However, I encountered problems when…

Building competitive self-explaining NLP models


Model interpretability is crucial if we want to use AI models to make high-stakes decisions (e.g., making medical diagnoses, preventing suicides, etc.). In NLP, one common way to get interpretability is to extract information from the trained models. For example, some use gradient-based input attribution techniques, some perturb the input…

Useful for fine-tuning on a subset of available languages


Q: Why and when would we want to trim down the vocabulary size of a pretrained model?

A: When a large portion of the vocabulary isn’t used in your downstream task, it makes sense to remove the unused part of the vocabulary to speed up the model.
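As a minimal sketch of the idea (not the method from the full post), trimming comes down to keeping only the embedding rows of the tokens that actually appear in the downstream data and remapping the old token ids to new, smaller ones. The plain-list embedding matrix and the `trim_vocabulary` helper below are hypothetical stand-ins for illustration.

```python
def trim_vocabulary(embeddings, used_token_ids):
    """Keep only the embedding rows for tokens that are actually used.

    Returns the trimmed matrix and a mapping from old token ids to new ones.
    """
    keep = sorted(set(used_token_ids))
    old_to_new = {old: new for new, old in enumerate(keep)}
    trimmed = [embeddings[old] for old in keep]
    return trimmed, old_to_new


# Toy embedding matrix: 6 tokens, 2 dimensions each.
embeddings = [[float(i), float(i) * 2] for i in range(6)]

# Suppose the downstream corpus only ever uses tokens 0, 3 and 5.
trimmed, old_to_new = trim_vocabulary(embeddings, [0, 3, 5, 3])

# Inputs tokenized with the old vocabulary must be remapped before use.
remapped = [old_to_new[t] for t in [3, 5, 0]]
print(len(embeddings), "->", len(trimmed), "rows; remapped ids:", remapped)
```

In a real model the tokenizer and any output layer tied to the vocabulary would need the same remapping. That part is left out here.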

The global step is not what you think it is

reached 1.0.0 in October 2020. I wasn’t fully satisfied with the flexibility of its API, so I continued to use my . This has changed since the 1.0.0 release. Now I use PyTorch Lightning to develop training code that supports both single and multi-GPU training.

However, one thing…

A case study: detecting credit fraud


Recently I came across the article . The post introduces Wasserstein GAN [1] and demonstrates how to use it to generate synthetic (fake) data that looks very “real” (i.e., has statistical properties similar to those of the real data). This topic interests me…

A Worrying Analysis of Recent Neural Recommendation Approaches


Today we’re examining this very interesting and alarming paper in the field of recommender systems — . …

A great tool for eliminating pipeline debts

(This post is also .)


If you are familiar with software engineering, you’d know that and can save you a lot of debugging time when a project is complex enough and/or involves collaboration between contributors. …

If your dataset is small enough

(This blog is also )


Recently I was asked this question (paraphrasing):

I have a small image dataset that I want to train on Google Colab and its free TPU. …

Permutation importance can be very misleading

(This post is also . It’s recommended to read there since the math notations are not as readable here due to the limitation of Medium.)

This post summarizes the findings and suggestions from the paper

Ceshine Lee

Data Geek. Maker. Researcher. Twitter: @ceshine_en
