A summary of Text Generation using Deep Learning

Motivation

I want to use AI to create content in the form of text, inspired by OpenAI’s post. They have published the technique behind an incredibly good language model called GPT-2. But what is a language model? In the field of Natural Language Processing (NLP), a Universal Language Model (ULM) is a Neural Network trained to predict the next word in a sentence. Since predicting a word requires the network to take the order of the previous words into account, most ULMs have an architecture based on either some sort of Recurrent Neural Network (RNN) or a Transformer.

Currently, the most popular application of Language Models is as a pre-trained model that is fine-tuned for another task, such as Sentiment Analysis or Text Classification. This is generally called Transfer Learning. It was originally a technique used in Computer Vision and was brought to NLP thanks to great work by Jeremy Howard and Sebastian Ruder at the beginning of 2018. The technique was called ULMFiT, and it used a ULM pre-trained on the WikiText-103 dataset from Salesforce. Since then, there have been major improvements in Transfer Learning for NLP, including ELMo and Google’s BERT, as well as efforts to make ULMFiT multilingual. There are many exciting things coming to Transfer Learning in NLP!

However, for text generation (unless we want to generate domain-specific text, more on that later) a Language Model is enough. For Language Models, the most important decision is which unit of information is going to be predicted. In other words, do we treat the text as a sequence of words, as a sequence of characters, or as a mix of both? These units are called tokens, and the process of converting the text into these tokens is called tokenization.

Tokenization

The tokenization process takes the corpus (the whole text dataset) and splits it into tokens, which are stored in an encoded form (as integer IDs) so they can be processed as fast as possible. Because of this, we need to decide beforehand how many distinct tokens will be used in training. The list of tokens and their respective text is called the vocabulary, and the number of tokens in it is called the vocabulary size. It is important to choose a good vocabulary size, because tokens not in the vocabulary are typically replaced by a generic “unknown” token, so whatever they meant is lost to the network.
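To make the idea of a fixed vocabulary size concrete, here is a minimal sketch in plain Python. It is not how fastai or spaCy do it internally; the names `build_vocab`, `encode` and the `<unk>` marker are hypothetical and chosen only for illustration. It keeps the most frequent tokens and maps everything else to a single unknown token.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical marker for out-of-vocabulary tokens


def build_vocab(tokens, vocab_size):
    """Keep the most frequent tokens, reserving index 0 for UNK."""
    counts = Counter(tokens)
    most_common = [tok for tok, _ in counts.most_common(vocab_size - 1)]
    itos = [UNK] + most_common                      # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}   # token -> index
    return itos, stoi


def encode(tokens, stoi):
    """Replace each token by its ID; unseen tokens collapse to UNK."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


# Toy corpus, split on whitespace (a very crude word tokenizer).
corpus = "el gato come y el perro come el pez".split()
itos, stoi = build_vocab(corpus, vocab_size=3)
ids = encode(corpus, stoi)
print(itos)  # ['<unk>', 'el', 'come']
print(ids)   # [1, 0, 2, 0, 1, 0, 2, 1, 0]
```

With a vocabulary of only 3 entries, "gato", "perro" and "pez" all become the same ID, which is exactly the information loss described above.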

Traditionally, RNNs were used to make predictions in a character-by-character fashion. Examples of this are shown in the great articles by Gilbert Tanner and Donald Dong. But if we train a Neural Network on words, the generated text tends to make more sense. Nevertheless, the best approaches so far use a combination of both (called subword tokenization), and this is what OpenAI’s GPT-2 does with something called Byte Pair Encoding. A great and popular subword tokenizer is SentencePiece by Google. For now, the tokenizer I am using is spaCy (which as of March 2019 is fastai’s current default). spaCy is a word tokenizer. Why a word tokenizer? Because Spanish does not have so many distinct words that the Language Model gets confused. I will explain this in more detail.




Language Models start getting worse at predictions when their vocabulary is way too big (over 60,000 tokens, although this can vary). This makes sense: there are too many possibilities from which to predict the next word. But we need to cover most words in our token list, otherwise we are missing a lot of information. This is where subword tokenizers come into play. In languages with rich morphology, many words are composed of common prefixes, suffixes, declensions, etc. If we split words into these components, we can cover the whole text with fewer tokens. Subword tokenizers have proved to be very useful for languages with a very big vocabulary, like Russian or Polish, and for handling rare words.
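To illustrate how a subword tokenizer finds those shared components, here is a toy sketch of the merge-learning step of Byte Pair Encoding in plain Python. It is a simplified illustration, not GPT-2’s or SentencePiece’s actual implementation; the function names `learn_bpe`, `merge_pair` and `segment` are hypothetical. It starts from characters and repeatedly merges the most frequent adjacent pair of symbols.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Join every adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


words = ["cantando", "bailando", "saltando", "cantar", "bailar"]
merges = learn_bpe(words, num_merges=6)
print(segment("cantando", merges))
print(segment("bailaremos", merges))  # unseen word, still covered by subwords
```

Because every word can always fall back to single characters, no word is ever out of vocabulary; frequent endings such as "ando" simply end up as larger tokens after enough merges.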

Dataset

As mentioned in the intro, the ULMFiT pre-trained models available in the fastai library are trained on the WikiText-103 dataset. So I wanted to train the Language Model in Spanish using the Spanish Wikipedia. A great collaboration group in the fastai community has already set up some ready-to-use scrapers, which download all of Wikipedia for you. I used Andreas Daiminger’s open source code, which uses WikiExtractor. I extracted all the Spanish articles from Wikipedia with at least 1000 words, which ends up being about 3GB of text. It is worth mentioning that the dataset GPT-2 uses is much bigger. It is called WebText and it is said to be 40GB of Internet text! It is made of the pages linked from Reddit posts with at least 3 karma. This means my generator will not be able to generalize and write text as good as GPT-2’s. For now. Some people have already built scrapers to recreate this data, like Joshua Peterson’s OpenWebText or eukaryote31’s.

Network Architecture

The architecture I used is Transformer-XL from Google, which combines ideas from RNNs and Transformers by adding a recurrence mechanism to the Transformer. Rani Horev has written a great description comparing the three architectures. Transformer-XL is great at capturing the relationship between words which are very far away from each other in a text. My network has 80 million parameters. In contrast, GPT-2 is based on the original Transformer architecture, but it is huge. It has around 1.5 billion parameters! This allows it to generalize better and to make use of much bigger training corpora.

Prediction

The most basic way of generating text is to pass an initial text through the network and make it predict the next word, then append that word to the text and repeat. There is also a technique called beam search, which looks not for the single most likely next word, but for the most likely sequence of words. Andrew Ng himself gives a great description of beam search in his Deep Learning course. However, I am not using it, because the implementation I tried is not giving good results. I will try to find out why soon; it’s on my to-do list for now.
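The difference between picking the most likely next word and searching for the most likely group of words can be sketched with a toy example. The code below is an illustration only, not the fastai implementation I tried: `TOY_MODEL` is a made-up probability table standing in for the network, and `greedy_decode` and `beam_search` are hypothetical helper names.

```python
import math

# Made-up next-token probabilities, conditioned only on the previous token.
# A real Language Model would condition on the whole sequence.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
}


def next_probs(sequence):
    """Stand-in for a forward pass of the network."""
    return TOY_MODEL[sequence[-1]]


def greedy_decode(start, steps):
    """Always append the single most likely next token."""
    seq = [start]
    for _ in range(steps):
        probs = next_probs(seq)
        seq.append(max(probs, key=probs.get))
    return seq


def beam_search(start, steps, beam_width):
    """Keep the `beam_width` most likely partial sequences at each step."""
    beams = [(0.0, [start])]  # (log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]


# The toy table only covers two steps after "<s>".
print(greedy_decode("<s>", steps=2))               # ['<s>', 'the', 'cat'], p = 0.33
print(beam_search("<s>", steps=2, beam_width=2))   # ['<s>', 'a', 'cat'],   p = 0.36
```

Greedy decoding commits to "the" because it is the best first word, while beam search keeps "a" alive long enough to discover that "a cat" is the more likely sequence overall.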

Results and Further Ideas

I have set up an online demo on AWS using Starlette, a great production-ready framework for building your server in Python. It supports asynchronous requests. I have used the pre-trained English Language Model from the fastai library and my own Language Model trained in Spanish. The text generators are available at http://textgen.cristianduguet.com.

They are not as good as OpenAI’s generated text (for all the differences mentioned in this article), but they at least generate text that makes sense most of the time in short sentences and is mostly grammatically correct. If we fine-tune this LM with a domain-specific corpus, like legal text, we can obtain much better results. And that is exactly what I intend to do. I want to fine-tune this on a fashion corpus and use it together with a fashion image classifier, so I can automatically write text descriptions for fashion items.

Why? Because many people close to me work in retail shops. They are not tech-friendly, and they struggle to create content for every item they sell. E-commerce is growing fast in 2019, and retailers cannot afford not to have an online presence, or they will miss a lot of customers. They need to go omnichannel if they want to compete with the Amazons and the Zalandos. I think, in fact, that AI will radically improve e-commerce and marketing automation this year. I want to pull this off with a small startup called cax.ai.

Soon I will be posting a more technical article about it and about my next ideas. But if you cannot wait, you can visit my repositories for the Language Model and the Web App right away. Stay tuned!


Originally published at HELLO, I AM CRISTIAN DUGUET.