Chronological language model logic

A group of researchers recently published TimeLMs, a paper and a series of models trained on Tweets from each quarter of 2020 and 2021. The paper is compelling, and one beautifully gradient-shaded table in it illustrates how language models age after training.

‘Up-to-date language models’ is a topic I wrote about in 2020, with similar covid-era examples, and failed to get into BigScience in 2021, though my idea back then was vaguely to ‘patch’ models. This paper is among the best in this space (also notable: Mind the Gap: Assessing Temporal Generalization in Neural Language Models and Temporal Effects on Pre-trained Models for Language Processing Tasks).

How do TimeLMs keep improving?

The idea behind TimeLMs is to capture how language evolves over time (a non-covid example shown in the paper is ‘Squid Game’ overtaking other ‘_ game’ phrases). Is this the reason that the newest (12/2021) TimeLM model outperforms the 03/2020 model on more recent Tweets?

The authors show that the newest model also performs slightly better on the 03/2020 validation set than the model trained on other 03/2020 Tweets. They attribute this to the fact that “newer models are also trained on more data for more time periods”. This quirk in the experiment design makes it difficult to prove that temporal change alone explains the gains.

I forked the TimeLM code to use the non-academic Twitter API, downloaded a small set of 02/2022 Tweets, and tested each chronological model against it. Even with that small sample, I was happy to measure an improvement (a lower score) in pseudo-perplexity (PPPL) with each model closer to the present.
Next I ran each TimeLM model against the dataset of AOC replies that I had scraped from Twitter back in spring 2019. I still see an improvement with each newer model: about half as strong an effect, but it’s there.
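For readers unfamiliar with the metric, here is a minimal sketch of how PPPL is usually defined for masked language models: mask one token at a time, add up the log-probability the model gives the original token, then exponentiate the negative per-token average over the whole corpus. This is not the TimeLM evaluation code. `MASK`, `log_prob_fn` and `uniform_scorer` are hypothetical names standing in for a real tokenizer and model.

```python
import math
from typing import Callable, List, Sequence

MASK = "<mask>"  # assumed mask symbol; a real tokenizer defines its own

# Hypothetical scorer: given the masked token list, the masked position and
# the original token, return log P(original token | rest of the sentence).
LogProbFn = Callable[[Sequence[str], int, str], float]


def pseudo_log_likelihood(tokens: Sequence[str], log_prob_fn: LogProbFn) -> float:
    """Sum of log-probabilities, masking one position at a time."""
    total = 0.0
    for i, target in enumerate(tokens):
        masked: List[str] = list(tokens)
        masked[i] = MASK
        total += log_prob_fn(masked, i, target)
    return total


def pseudo_perplexity(corpus: Sequence[Sequence[str]], log_prob_fn: LogProbFn) -> float:
    """exp of the negative pseudo-log-likelihood per token over the corpus."""
    n_tokens = sum(len(sentence) for sentence in corpus)
    if n_tokens == 0:
        raise ValueError("corpus has no tokens")
    total_pll = sum(pseudo_log_likelihood(s, log_prob_fn) for s in corpus)
    return math.exp(-total_pll / n_tokens)


def uniform_scorer(vocab_size: int) -> LogProbFn:
    """Toy stand-in for a model: every token is equally likely."""
    def score(masked: Sequence[str], position: int, target: str) -> float:
        return -math.log(vocab_size)
    return score
```

A scorer that guesses uniformly over a vocabulary of V tokens gets a PPPL of V, which is a handy sanity check before swapping in a real masked language model.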

In addition to the changing-language and training-data factors mentioned in the paper, I worry that Twitter decays. When the researchers used an API to download ‘old’ Tweets, they got only the Tweets that are still visible today. The quarterly benchmarks are insulated from this problem because the training and validation sets would decay equally. But if this theory is correct, both the old and new streams of Tweets would be markedly harder for ‘old’ TimeLM models when we get them unfiltered, as in the 2019 or present-day examples.

If you could see me: vanished Tweets

I decided to measure whether hidden Tweets (deleted, suspended, made private, etc.) show a stronger change across TimeLMs than average. Luckily the AOC reply dataset has plenty of toxic and since-hidden Tweets and accounts.

I used a script to divide the 110k Tweets into batches of 95 and send them to the Twitter /statuses/lookup.json endpoint. When IDs are missing from the API response, we can put those in the ‘hidden’ category. About 40% of these Tweets are no longer accessible, which is such a large share that I worried the hidden set wouldn’t be distinguishable from the full dataset. Twitter rate-limited me, which left me with ~10k hidden and ~16k visible Tweets after the TimeLM deduplication and other preprocessing.
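The batching and ‘missing ID means hidden’ logic can be sketched as follows. This is an illustration rather than my actual script: `lookup_fn` is a hypothetical callable that wraps the /statuses/lookup.json request and returns the IDs present in the response, so authentication and rate-limit handling are left out.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical wrapper around the lookup request: takes one batch of Tweet IDs
# and returns the IDs that the API actually sent back.
LookupFn = Callable[[Sequence[int]], Sequence[int]]


def batched(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def split_visible_hidden(
    tweet_ids: Sequence[int],
    lookup_fn: LookupFn,
    batch_size: int = 95,
) -> Tuple[List[int], List[int]]:
    """Classify each ID as visible (returned) or hidden (missing from the response)."""
    visible: List[int] = []
    hidden: List[int] = []
    for batch in batched(tweet_ids, batch_size):
        returned = set(lookup_fn(batch))
        for tweet_id in batch:
            if tweet_id in returned:
                visible.append(tweet_id)
            else:
                hidden.append(tweet_id)
    return visible, hidden
```

Passing the lookup in as a function keeps the classification testable offline with a fake response, and a rate-limited run can resume by calling it again on the IDs not yet checked.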

Figure: the AOC datasets have a higher PPPL and a less steep drop; since-deleted Tweets track the norm.
  • The hidden Tweets start out with a PPPL slightly higher than the full dataset. The change in prediction quality from visible to hidden Tweets is similar to switching to a model from 2–3 quarters earlier.
  • Hidden Tweets are less predictable, but this gap is consistent, and the newest (12/2021) model did not show any extra advantage on them.
  • Deleted or toxic Tweets may be under-represented in training and under-predicted by TimeLM models, but not in a way that suggests deletions cause the temporal change in these models.
  • I didn’t investigate this much, but if we use the AOC dataset as a baseline, it’s evidence that only 1/3 to 1/2 of the improvement of recent quarterly models comes from training over additional data or time.
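One way to make the comparison in the last bullet concrete is to divide the relative PPPL drop on the fixed 2019 baseline by the relative drop on recent Tweets. This is only a sketch of that arithmetic, assuming that gains on old Tweets cannot come from newer language; the function names are hypothetical, and no measured values are included here.

```python
def relative_drop(pppl_old_model: float, pppl_new_model: float) -> float:
    """Fractional PPPL improvement when moving from an older to a newer model."""
    if pppl_old_model <= 0:
        raise ValueError("PPPL must be positive")
    return (pppl_old_model - pppl_new_model) / pppl_old_model


def share_from_extra_training(
    baseline_old: float,
    baseline_new: float,
    recent_old: float,
    recent_new: float,
) -> float:
    """Share of the recent-data improvement that also shows up on the old baseline.

    Assumption: improvement on a dataset that predates every model reflects
    more training data and time rather than newer language.
    """
    recent = relative_drop(recent_old, recent_new)
    if recent <= 0:
        raise ValueError("no improvement on recent data to compare against")
    return relative_drop(baseline_old, baseline_new) / recent
```

Each argument is a PPPL score: the oldest and newest model on the 2019 baseline, then the same two models on the recent sample. A result near 0.5 would correspond to the ‘about half as strong’ effect described earlier.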

Nick Doiron

Web->ML developer and mapmaker.
