Fine-tuning DistilBERT on senator tweets

Mary Newhauser · Published in NLPlanet · 10 min read · Mar 18, 2022

A guide to fine-tuning DistilBERT on the tweets of American Senators with snscrape, SQLite, and Transformers (PyTorch) on Google Colab.

Photo by Ian Hutchinson on Unsplash

Introduction

Tweets are short bits of text that can (sometimes) be packed with valuable data. In the case of United States Senators, official Twitter accounts contain their opinions on a wide range of political issues and are (generally) in good grammatical form. Kaggle’s Toxic Comment Classification Challenge famously demonstrated the power of transformer models for classifying short, informal text. But can a transformer perform as well on a binary classification task when trained on a much smaller dataset of more diplomatic, veiled language?

To find out, I fine-tuned the DistilBERT transformer model on a custom dataset of all 2021 tweets from US Senators. The result is a text classification model that can determine a senator’s political party from a single tweet with 90.8% accuracy.

In this article, I take you through the data procurement and fine-tuning process, demonstrating how I:

  • Scrape the 2021 tweets from the official Twitter accounts of all US Senators with snscrape, clean them with the Preprocessor library, and store them in a local SQLite database.
  • Transform the tweets into a Hugging Face Dataset and fine-tune the DistilBERT base model with PyTorch.
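The first step can be sketched roughly as follows. This is a minimal sketch, not the article's exact code: the `clean_tweet` helper is a simple regex stand-in for the Preprocessor library, the senator handle and query string are illustrative examples, and the snscrape call requires network access (and reflects the library's API at the time of writing).

```python
import re
import sqlite3


def clean_tweet(text: str) -> str:
    """Simple regex stand-in for the Preprocessor library's clean():
    strips URLs and @mentions, and drops '#' while keeping the hashtag word."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove mentions
    text = text.replace("#", "")              # keep hashtag words, drop '#'
    return re.sub(r"\s+", " ", text).strip()


def store_tweets(db_path: str, rows) -> None:
    """Store (handle, party, text) rows in a local SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS tweets (handle TEXT, party TEXT, text TEXT)"
    )
    con.executemany("INSERT INTO tweets VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()


if __name__ == "__main__":
    # Scraping with snscrape (needs network access; handle is a hypothetical example).
    import snscrape.modules.twitter as sntwitter

    query = "from:SenatorHandle since:2021-01-01 until:2022-01-01"
    rows = [
        ("SenatorHandle", "D", clean_tweet(t.content))
        for t in sntwitter.TwitterSearchScraper(query).get_items()
    ]
    store_tweets("tweets.db", rows)
```

In practice, you would loop the scrape over every senator's handle and tag each row with the senator's party affiliation, which later becomes the classification label.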
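The second step can be sketched with the Transformers `Trainer` API. This is a minimal sketch under assumptions, not the article's exact code: the label encoding (Democrat → 0, Republican → 1), the placeholder texts, and the training hyperparameters are all illustrative.

```python
def party_to_label(party: str) -> int:
    """Map party affiliation to an integer class label (assumed encoding;
    Independents would need a choice, e.g. mapping to the party they caucus with)."""
    return {"D": 0, "R": 1}[party]


if __name__ == "__main__":
    from datasets import Dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # Build a Hugging Face Dataset from (text, label) pairs, e.g. read from SQLite.
    ds = Dataset.from_dict({
        "text": ["placeholder tweet 1", "placeholder tweet 2",
                 "placeholder tweet 3", "placeholder tweet 4"],
        "label": [party_to_label(p) for p in ("D", "R", "D", "R")],
    }).train_test_split(test_size=0.25)

    # Tokenize the tweets for DistilBERT.
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    ds = ds.map(lambda b: tok(b["text"], truncation=True), batched=True)

    # Fine-tune the base model with a 2-class classification head.
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2
    )
    args = TrainingArguments(
        output_dir="out",
        num_train_epochs=3,            # illustrative hyperparameters
        per_device_train_batch_size=16,
    )
    Trainer(
        model=model,
        args=args,
        train_dataset=ds["train"],
        eval_dataset=ds["test"],
    ).train()
```

The classification head on top of DistilBERT is initialized randomly and learned during fine-tuning, while the pretrained transformer layers are only lightly adjusted.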
