Writing Buzzfeed Articles with Fine-Tuned GPT-2 Transformer Models
Unfulfilled Dreams & NLP
In college, one of the ways I earned money was by writing 'top ten' lists for a now defunct click-bait farm. The process was long and the potential reward for an accepted article was lacking. It wasn't an efficient use of time, but maybe it could have been had I known about Natural Language Processing. Natural Language Processing (NLP) is a subfield of machine learning focused on building models capable of "understanding" language data. NLP tasks include language translation and sentiment analysis. However, for the aspiring click-bait auteur, the most useful application of NLP is natural language generation (NLG). NLG aims to build models capable of producing text that is both syntactically correct and grounded in reality. While it's too late for this old dog to learn new tricks, writing articles should be a relatively easy task for a neural network to accomplish. However, my models can't work for just any outlet. The only outlet worthy of the skills of my strong & intelligent robot spawn is the New York Times of click-bait: Buzzfeed.
The Sauce
In order to train a neural network for the purposes of natural language generation, a lot of text data is required. Since I want my models to write Buzzfeed-style articles, lots of Buzzfeed text data is required. Unfortunately, there is no magical 'download data' button to press this time around, so a little extra work was required. Using NewsAPI, I was able to collect enough articles for training.
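If you want to gather a similar corpus, the collection step roughly boils down to paging through NewsAPI's /v2/everything endpoint filtered to the buzzfeed.com domain and saving the results. The snippet below is a minimal sketch of that idea, not my exact collection script; you'd need your own API key, and the query parameters are only illustrative.

import requests

API_KEY = "YOUR_NEWSAPI_KEY"  # free keys are available at newsapi.org
URL = "https://newsapi.org/v2/everything"

articles = []
for page in range(1, 6):  # results are paged; the free tier limits how far back you can go
    params = {
        "domains": "buzzfeed.com",  # restrict results to Buzzfeed
        "language": "en",
        "pageSize": 100,
        "page": page,
        "apiKey": API_KEY,
    }
    resp = requests.get(URL, params=params).json()
    if resp.get("status") != "ok":
        break
    # keep just the fields that are useful for building a training corpus
    for a in resp["articles"]:
        articles.append({"title": a["title"], "description": a["description"], "url": a["url"]})

print(f"Collected {len(articles)} article records")

However you fetch them, the full article texts still need cleaning before they're useful for training.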
Once data was collected and cleaned, I could get to training some models. My strategy for writing believable Buzzfeed-style posts was to use the acquired article data to fine-tune a sophisticated, pre-trained model (looking at you, GPT-2 [1]). The idea is that a model like GPT-2 already has the 'understanding' required to write cogent sentences, but could use some pointers on how to write in the style of a Buzzfeed contributor. I drafted a diagram illustrating how the model pipeline should work (shown in Figure 1).
To mold a Buzzfeed-influenced GPT-2 model, I used the HuggingFace transformers and PyTorch Python libraries. The HuggingFace library acts as an API for using hundreds of transformer models on NLP tasks. It's definitely worth playing around with! In any case, the first part of every article is a title. To this end, I first trained a GPT-2 model to generate titles by using the collected article titles as a corpus. After a 12 hour training session (seriously!), BuzzTitle-GPT-2 (BTG) was born. How well does it perform? After generating some sample titles and making sure that the model hadn't just copied existing titles from the training data, I'm glad to say it worked pretty well! I've shown some examples of generated titles in Figure 2.
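For the curious, the fine-tuning itself doesn't require much code with the transformers library. What follows is a minimal sketch of that step, assuming the collected titles sit one per line in a titles.txt file; the hyperparameters are illustrative rather than the exact settings behind BTG.

from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# start from the pre-trained GPT-2 weights
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# titles.txt holds one collected Buzzfeed title per line
train_dataset = TextDataset(tokenizer=tokenizer, file_path="titles.txt", block_size=64)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

training_args = TrainingArguments(
    output_dir="./buzz-title-gpt-2",
    num_train_epochs=3,               # more epochs means more of those marathon sessions
    per_device_train_batch_size=4,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()

trainer.save_model("./buzz-title-gpt-2")
tokenizer.save_pretrained("./buzz-title-gpt-2")

Roughly the same recipe carries over to the article model described next; only the training corpus changes.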
Once it was apparent that BTG can generate believable Buzzfeed-esque titles, I could get to the real meat and potatoes of this project and train a GPT-2 model to write whole articles. The structure of a full Buzzfeed article usually includes a title, a tagline or subtitle, and the body of the article. The individual training articles in my corpus reflect this structure as well. BuzzArticle-GPT-2 (BAG) entered the world after several 16 hour training periods (mistakes were made). Using titles generated from BTG as input text, I generated some sample articles to make sure BAG wasn't outright plagiarizing anyone's articles. While I believe it was a success, judge for yourself.
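To make that title / tagline / body structure concrete, here's a sketch of how one training example could be flattened into plain text. The separator choices and the example article below are hypothetical; they show the shape of the corpus rather than BAG's exact preprocessing.

def format_example(title, tagline, body):
    """Flatten a scraped article into a single training string:
    title, then tagline, then the body (with [Image] placeholders
    standing in wherever the original media appeared)."""
    return f"{title}\n{tagline}\n{body}\n"

example = format_example(
    "21 Snacks That Deserve Their Own Fan Club",          # hypothetical title
    "Honestly, they carried us through the year.",        # hypothetical tagline
    '1. Popcorn [Image] "Never lets me down." 2. Pretzels [Image] "Salty perfection."',
)

# append each flattened article to the file the fine-tuning script reads
with open("articles.txt", "a", encoding="utf-8") as f:
    f.write(example + "\n")

Here's an excerpt from a sample article.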
“18 Times Teen TV Shows Were Just Really, Really Good
that we don’t need to see. We recently asked the BuzzFeed Community to tell us their favorite teen TV shows for their teens. Here are the brilliant results. 1. The L Word [Image] “It was a really cute show about a group of friends with a very low social status. It had lots of great plot lines and characters. I loved it.” 2. The O.C. [Image] “The O.C. had a good storyline that was super fun. It’s good to see a group of teens hanging out and being together. It was a good show that I think would stand the test of time, and I loved it.” 3. The Office [Image] “It was a great show about a bunch of young adults who were working together to solve problems. It had lots of great characters, and I loved each episode.” 4. The Office [Image] “The Office is one of my all-time favorite shows. It was a really cool show with a lot of great character development and a lot of great storylines. I didn’t know anything about it until recently, but I loved it.” 5. The Suite Life of Zack and Cody [Image] “It was one of my all-time favorites. It was so good and had a good premise and overall pretty good storylines. I loved the scripts and was super excited to see the series develop…”
There's a lot to unpack here, so let's start with the negative aspects. The model randomly puts the phrase 'that we don't need to see' at the start of the article. This doesn't make sense syntactically and it doesn't fit the context of the article. The article also misses the point of its title; the goal was to write an article about times shows were great, not simply about great shows. This is a subtle point that the model missed when generating the article. Finally, the text can be fairly repetitive as well. Luckily, there are more pros than cons in my opinion.
For starters, most of the article is grammatically correct. The first words of sentences are capitalized, punctuation is properly placed, and proper nouns are capitalized. The model understands the basic structure of a Buzzfeed list-style article: it numbers each entry, names the subject, inserts some sort of media, and provides a quote or caption to go after the media. Probably the most important and exciting aspect of this generated article is that the model was able to pick up on the context of the title and provide several real shows that teens might actually enjoy. Not bad, eh? I'll put a link to the full text at the end of the article [2].
Once I was able to generate articles, it was time to put BAG to the test and make our first Buzzfeed post. All I had to do was copy and paste the generated article text into the relevant boxes on the Buzzfeed article editor. The most work involved in this was finding relevant images to add for the [Image] tags in the generated text, which was actually pretty annoying. So I clipped the article to ’10 Times’ instead of 18 and published it [3].
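If you try this yourself, one small time-saver is to split the generated text on its media placeholders before pasting, so you can see exactly how many images you're on the hook for. A tiny sketch follows; the [Image] and [gif] tags are just what my generations happened to contain, so adjust the pattern to yours.

import re

with open("generated_article.txt", encoding="utf-8") as f:
    article = f.read()

# split the article body into text chunks around the media placeholders
chunks = re.split(r"\[(?:Image|gif)\]", article)

print(f"Images/gifs to find: {len(chunks) - 1}")
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk.strip()[:200])  # preview; paste each chunk into the editor, media goes between chunks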
Sauce to Go
In the end, fine-tuning a GPT-2 model to write articles actually worked pretty well. However, at BAG's current level it probably isn't ready to write on its own and definitely needs a human editor to look over its work. Ways to make these models more effective are:
- Training with more articles.
- Re-checking the quality of the training data and fixing any issues therein.
- Adding an image selecting model to the pipeline so that relevant media could be automatically assigned to the [Image] and [gif] tags the model outputs.
I'm looking forward to improving this pipeline and excited for future results. If you're looking to try these models out for yourself, there are a few ways to do it.
- First, you can fine-tune GPT-2 yourself. Depending on your machine's specs, this might be prohibitive.
- Second, I’ll be uploading the models I trained to the HuggingFace model repo. You can use the transformers library to call the models from their website. Use this code snippet:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# load the fine-tuned article model from the HuggingFace hub and the stock GPT-2 tokenizer
model = GPT2LMHeadModel.from_pretrained('jordan-m-young/buzz-article-gpt-2')
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# encode a Buzzfeed-like title as the prompt
input_ids = tokenizer.encode("18 Times Teen TV Shows Were Just Really, Really Good\n", return_tensors='pt')

# Seed for result reproducibility, change to change outputs
torch.manual_seed(10)

sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=1000,
    top_k=0,
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=False))
- Finally, you can play around with the models (without any coding) here: jordan-m-young/buzz-article-gpt-2. Just type in a Buzzfeed-like title.
All the materials I used for this project will be available in my GitHub repo [4]. Thanks for reading this article and stay tuned!
-Jordan
References
[1] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language Models are Unsupervised Multitask Learners," OpenAI, 2019.
[2] https://github.com/Jordan-M-Young/Buzzfeed_NLP/blob/main/18Times.txt
[3] https://www.buzzfeed.com/jroscoe/18-times-teen-tv-shows-were-just-really-really-go-7vpf4ase5j
[4] https://github.com/Jordan-M-Young/Buzzfeed_NLP