Using FastAI to Analyze Yelp Reviews and Predict User Ratings (Polarity)

A Practical Example of Applying the Power of Transfer Learning to Natural Language Processing

In this post, I’ll be showing you how you can apply the power of transfer learning to a classification task.

The goal is to see how well we can classify a new Yelp review by training an algorithm on past Yelp reviews. FastAI makes this a more approachable problem than it would otherwise be.

You can follow along with this GitHub repo code.

Why FastAI and What is It?

FastAI is a library built by Jeremy Howard, Rachel Thomas and the rest of the good folks at fast.ai. Its current version is built on top of PyTorch, a fast-rising deep learning framework open sourced by Facebook.

It’s very much in the mold of Keras or TensorFlow, but in my opinion, packs a little more punch with it.

FastAI abstracts away a lot of the lower-level detail and control that TensorFlow requires you to fiddle with. And, unlike Keras, it lets you focus on the task at hand rather than mess with so many parameters.

This way you focus more on the Science than the actual Art of deep learning.

FastAI also allows you to leverage a lot of cutting edge ML techniques adopted from new research. This includes applying the learning-rate finder and leveraging Transfer Learning. What is that by the way?

Transfer Learning

Photo Credit: HOerwin56

Ever ride a bicycle?

Learning to balance was quite a task, but once you nailed that down, learning to balance on a scooter came much easier.

What about a motorcycle?

Though they are different in many ways, if you have ever ridden a scooter, your learning curve on a motorcycle is less steep. You basically transfer some previous learnings to that new experience.

Similarly, Transfer Learning allows you to leverage a pre-trained model or deep learning architecture to speed up learning within a specific problem domain.

Typically, a deep learning architecture starts with randomly chosen weights and biases (its parameters), so its first predictions are essentially guesses. Over time it gets better at guessing as it adjusts those parameters to minimize its mistakes (or loss).

It’s kinda like trying to hit the bull’s eye on a dartboard while adjusting how you throw the dart each time.

In this case, using Transfer Learning is like having a coach who gives you tips before you start throwing.
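To make the dartboard analogy a little more concrete, here is a minimal sketch of my own (not FastAI code). It minimizes a toy one-parameter loss with gradient descent and counts how many steps it takes from an uninformed starting weight versus a starting weight that is already close to the target. The `train` function, the toy loss and the two starting values are all assumptions made up for illustration.

```python
def train(start_weight, target=3.0, lr=0.1, tolerance=1e-3, max_steps=1000):
    """Minimize the toy loss (w - target)^2 and return the number of steps used."""
    w = start_weight
    for step in range(max_steps):
        loss = (w - target) ** 2
        if loss < tolerance:
            return step              # close enough to the bull's eye
        grad = 2 * (w - target)      # derivative of the loss with respect to w
        w -= lr * grad               # adjust the throw a little
    return max_steps


# An uninformed starting point, standing in for random initialization.
steps_from_scratch = train(start_weight=-10.0)

# A starting point already near the target, standing in for pre-trained weights.
steps_pretrained = train(start_weight=2.5)

print("steps from scratch:", steps_from_scratch)
print("steps with a head start:", steps_pretrained)
```

The head start needs fewer steps to reach the same loss, which is the whole appeal of starting from a pre-trained model.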

In practice, if your task is to distinguish between images of a dog and a cat, you can apply an architecture like ResNet-50 (50 layers deep and trained on a million images from the ImageNet database).

Having been trained on so many pictures, ResNet-50 makes the early layers of your neural network pretty good at detecting basic features like edges and shapes, while your final layers can focus on your key domain of distinguishing a dog from a cat.

Similarly, in classifying reviews, it helps to have a pre-trained model which understands some of the semantics of language beyond our training data to speed up training and increase accuracy.

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia … Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. (Stephen Merity)

Just like ResNet, WikiText gives us the transfer learning edge we need for our NLP challenge.

Mo’ data, Mo’ better.

Source: Stephen Merity

Our Problem Domain

So, what does our data look like and what exactly are we looking to accomplish?

Our dataset consists of Yelp reviews divided up into negative and positive polarities. Per the readme.txt:

The Yelp reviews polarity dataset is constructed by considering stars 1 and 2 negative, and 3 and 4 positive. For each polarity 280,000 training samples and 19,000 testing samples are take randomly. In total there are 560,000 trainig samples and 38,000 testing samples. Negative polarity is class 1, and positive class 2.

As a supervised learning task, the reviews will serve as the input, while the polarities we’re looking to predict will serve as the outcome.
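The labeling rule quoted from the readme can be written down in a few lines. This is only a sketch of the rule as quoted above; `stars_to_polarity` is a hypothetical helper of mine, and raising an error for any other star value is my own choice since the quoted rule does not mention them.

```python
def stars_to_polarity(stars):
    """Map a star rating to a polarity class, following the quoted readme rule."""
    if stars in (1, 2):
        return 1   # negative polarity is class 1
    if stars in (3, 4):
        return 2   # positive polarity is class 2
    # The quoted rule does not cover other values, so this sketch refuses them.
    raise ValueError(f"no polarity defined for {stars} stars")


for stars in (1, 2, 3, 4):
    print(stars, "->", stars_to_polarity(stars))
```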

During training, the polarities will be our labeled data (the bull’s eye in this case), which we use to build a fine-tuned model that we can then apply to a brand new user review.

As you can see below, FastAI has a collection of datasets including the Yelp reviews for NLP related problems.

You can get a copy of this data by going here.

If using this dataset for research, it is important to cite the authors of the original paper. Much thanks to them for providing easy access to such a useful dataset.

Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

Once you download this dataset, you have a variety of options for which tool you use to perform this analysis. You can use a Python notebook like Jupyter or a development editor like PyCharm or Microsoft Visual Studio Code.

Due to the size of the dataset, I chose to run this on GCP using JupyterLab. I’m running this on a Compute Engine VM with a FastAI image.

To learn how to set up a FastAI image VM you can check here. The rest of the instructions assume this setup, though you don’t need that to follow along.

Getting and Previewing the Data

Below is what JupyterLab looks like. You can run similar commands in your Jupyter Notebook or command line.

I ran the command below to fetch the Yelp data.


The ‘!’ before wget allows me to run the operation as I would on the command line.

Now, I unpack the .tgz file to take a peek inside the data. I run this command to do that

!tar -xvzf yelp_review_polarity_csv.tgz

And, this is what we get. As you can see there’s training data and test data.

Next, we import the fastAI libraries and dependencies and set a path to the folder holding our files. The path makes for easier reference down the line.

from fastai import * 
from fastai.text import *
from fastai.core import *
path = Path('yelp_review_polarity_csv')

Now, we can use the Python package Pandas to examine the dataset. Below it’s given the alias pd.

train_csv = path/'train.csv'
train = pd.read_csv(train_csv, header=None)

We use the read_csv function to create a DataFrame, which we call train. Calling train.head() gives a preview of the first five records in the DataFrame.

The first column (0) shows us the polarity. The 2nd column is the actual review.

This is neat, but let’s take a closer look at one of the reviews. Let’s pull the second record.


Oops! Two stars for Dr. Goldberg.

You can try following similar steps for the test data.

valid_csv = path/'test.csv'
valid = pd.read_csv(valid_csv, header=None)

We can also confirm how many classes we have for the labels using valid[0].unique(). We expect just two polarity scores: 1 (for -ve) and 2 (for +ve).
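If you don’t have pandas at hand, the same sanity check can be sketched with the standard library. The three rows below are made-up stand-ins for the real file (label in the first column, review text in the second, no header row), so treat this as an illustration of the check rather than output from the actual dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same layout as the dataset: label, review text, no header.
sample_csv = io.StringIO(
    '"1","The food was cold and the service was slow."\n'
    '"2","Great tacos, friendly staff."\n'
    '"2","Loved it, will come back."\n'
)

rows = list(csv.reader(sample_csv))

# Count how often each label appears in column 0.
# Note: the csv module reads labels as strings, while pandas would parse them as integers.
label_counts = Counter(row[0] for row in rows)

print("classes found:", sorted(label_counts))
print("counts per class:", dict(label_counts))
```

Counting the labels, rather than just listing the unique values, also tells you whether the two classes are balanced.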

Alright, we’ve got our training and test datasets loaded. As far as we know, the data is clean and each review has a polarity and vice versa.

Starting with a DataBunch

We need a way to pass our dataset into our neural network. We also want to load it efficiently, ideally in batches. That’s where a DataBunch object comes in handy.

Neural network computations involve a lot of number crunching. Yet, our data is mainly text. So, we need a way to numericalize the words, that is, map each one to a number.

Before that, we need a way to break down the body of text into individual words (tokens), as these are the smallest units of meaningful information. This process is called tokenization.

Lastly, to speed up our learning, we also want to prioritize which words are the most useful, typically by keeping only the most frequent ones in the vocabulary.

I did say that fastAI packs a punch right?

A DataBunch object allows us to accomplish all this in one shot.

data_lm = TextLMDataBunch.from_csv(path, 'test.csv')
data_clas = TextClasDataBunch.from_csv(path, 'test.csv', vocab=data_lm.train_ds.vocab)

The output of the above two steps is two DataBunch objects: one for the language model and one for the classifier.

Now we can take a peek into what the output looks like


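Tokens starting with ‘xx’ are special tokens. For instance, xxunk stands in for words that appear rarely in the corpus, so you can think of it as marking unknown words, while others such as xxbos and xxmaj mark the beginning of a text and a capitalized word.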
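Here is a rough sketch of what tokenizing and numericalizing mean, stripped down to plain Python. It is not how FastAI does it internally (its tokenizer is far more sophisticated); the whitespace tokenizer, the `min_freq` threshold and the helper names are assumptions of mine, with only the xxunk token name borrowed from the output above.

```python
from collections import Counter

UNK = "xxunk"  # stand-in token for rare or unseen words


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocab(texts, min_freq=2):
    """Keep only tokens seen at least min_freq times; everything else maps to UNK."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    itos = [UNK] + sorted(tok for tok, count in counts.items() if count >= min_freq)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(text, stoi):
    """Turn a text into a list of integer ids, using the UNK id for unknown tokens."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokenize(text)]


reviews = ["the food was great", "the service was slow", "great food"]
itos, stoi = build_vocab(reviews)
ids = numericalize("the pizza was great", stoi)

print("vocabulary:", itos)
print("ids:", ids)
print("decoded:", [itos[i] for i in ids])
```

Decoding the ids shows "pizza" coming back as xxunk, because it never made it into the vocabulary.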

We can also view this for the classifier.

We can actually take a closer look at the tokens.


Wiki Data for Transfer Learning

Now is where we start to get down to business. But, before we create a language model, we need to pull in the pre-trained Wiki data.

We create a folder to store the model, then we download and store the models.

model_path = path/'models'
model_path.mkdir(exist_ok=True)  # create the folder for the pre-trained weights
url = ''
download_url(f'{url}lstm_wt103.pth', model_path/'lstm_wt103.pth')
download_url(f'{url}itos_wt103.pkl', model_path/'itos_wt103.pkl')

Creating a Language Model

Now we need to create a language model.

Why do we need this and what exactly is a language model?

With a language model, we start getting into the meaning of a text. By learning to predict which word comes next, the model picks up on how words are structured and organized. Using the wiki data allows us to speed up this process.

learn = language_model_learner(data_lm, AWD_LSTM, pretrained_fnames=['lstm_wt103', 'itos_wt103'], drop_mult=0.5)

Proof that we have a good language model comes in being able to predict the next sequence of words based on a given set of words.

The command below provides 5 words and tries to predict the next 50.

learn.predict('This was such a great ', 50, temperature=1.1, min_p=0.001)

Here’s what the model gives without any tuning.

This is not a coherent sentence, but it’s pretty amazing that we see the use of commas, periods and some reasonable sentence structures.
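The temperature and min_p arguments passed to predict control how adventurous the sampling is. The sketch below shows one common way such parameters work: probabilities below a floor are dropped and the rest are reshaped by the temperature before a word is drawn. I’m not claiming this mirrors FastAI’s internals line for line; the toy distribution and the `sample_next_word` helper are made up for illustration.

```python
import random


def sample_next_word(probs, temperature=1.0, min_p=None, rng=random):
    """Draw one word from a dict of word -> probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    if min_p is not None:
        # Ignore very unlikely words entirely.
        weights = [p if p >= min_p else 0.0 for p in weights]
    if temperature != 1.0:
        # Temperatures above 1 flatten the distribution, below 1 sharpen it.
        weights = [p ** (1.0 / temperature) for p in weights]
    # random.choices accepts unnormalized weights.
    return rng.choices(words, weights=weights, k=1)[0]


# Toy next-word distribution after "This was such a great".
toy_probs = {"meal": 0.6, "place": 0.3, "experience": 0.0995, "zebra": 0.0005}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [
    sample_next_word(toy_probs, temperature=1.1, min_p=0.001, rng=rng)
    for _ in range(200)
]
print({word: samples.count(word) for word in toy_probs})
```

With the floor at 0.001, the unlikely word never shows up, while a temperature slightly above 1 gives the less likely candidates a bit more of a chance.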

Fine Tuning the Model

We now need to fine-tune this model and this involves some training. However, historically picking a learning rate has been sort of an art.

FastAI makes this very easy with its learning-rate finder, which builds on the concept of cyclical learning rates specified in this 2015 paper. In this approach, the learning rate is gradually increased over a short training run until the loss stops decreasing.

So, we run these two commands.


Here’s what we get with the loss plotted against the learning rate.

We can see that the loss stops decreasing around 1e-1, so we’ll start one step before then.
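The idea behind the learning-rate finder can be sketched on a toy problem. Below, the learning rate grows by a constant factor on every step while a one-parameter loss is minimized, and the run stops once the loss blows up. The toy loss, the stopping rule and the "one tenth of the rate at the minimum" heuristic are my own assumptions for illustration, not FastAI’s implementation.

```python
def lr_range_test(start_lr=1e-4, end_lr=10.0, num_steps=50, w0=5.0):
    """Increase the learning rate exponentially and record the loss at each step."""
    lrs, losses = [], []
    w = w0
    best = float("inf")
    ratio = (end_lr / start_lr) ** (1 / (num_steps - 1))  # constant growth factor
    lr = start_lr
    for _ in range(num_steps):
        loss = w * w                  # toy loss with its minimum at w = 0
        lrs.append(lr)
        losses.append(loss)
        if loss > 4 * best:           # the loss has clearly started to diverge
            break
        best = min(best, loss)
        w -= lr * 2 * w               # one gradient descent step
        lr *= ratio
    return lrs, losses


lrs, losses = lr_range_test()
lr_at_min = lrs[losses.index(min(losses))]
suggested_lr = lr_at_min / 10         # back off by an order of magnitude

print(f"loss bottomed out near lr={lr_at_min:.3g}; a safer pick is {suggested_lr:.3g}")
```

Picking a rate somewhat below the point where the loss bottoms out is the same "one step before" reasoning applied to the plot.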

Our first result is not all that great and needs some fine-tuning. An accuracy of 28.6% means it correctly guesses the next word in a sequence more than 1 in 4 times.

The next step is to unfreeze and retrain with a lower learning rate. During the initial training, the earlier layers were ‘frozen’, meaning they were not being trained or updated.

The unfreeze() command unfreezes all layers, helping us further fine-tune.

learn.fit_one_cycle(10, 1e-3,moms=(0.8,0.7))

That’s a slight improvement. It’s only about a third of the way through training and it’s guessing correctly a third of the time.

After 10 epochs, it hovers close to 37% accuracy.

We can save this fine-tuned model. This is good enough for us to proceed. We don’t need it to be super duper at predicting the next word.


For the fun of it let’s see how well the model predicts the next 50 words, given just 3 words.

How about predicting the next 25 words given the first 4 words?


Remember it’s making up the rest of the sentence by being provided some seed words to start with.

Getting into Training and Classification

Remember our aim here is to be able to classify a review with a +ve or -ve polarity. So we build a classifier using our trained model.

First, instead of language model learner, we instantiate the text classifier learner.

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)

You can read up more on AWD_LSTM. It essentially is an architecture that helps with regularizing and optimizing language models.

Then, we load the trained model and train the classifier.

learn.fit_one_cycle(1, 1e-2)

After one run and under 4 mins of training, we are seeing an accuracy close to 92% in predicting the polarity of a review. Nice, but far from great.

The state of the art at the time of the 2015 paper resulted in an error rate of 4.36%. This means an accuracy of 95.64%. Can we beat that?

Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015)

We can try to freeze all the layers of the model except for the last 2. Let’s try that.

learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

94.7% is getting closer to the state of the art, all in less than 40 mins. Also, the validation loss is less than the training loss, which suggests we are not overfitting.

You may have noticed the learning rate divided by 2.6 to the fourth power. Without boring you with details, all that has to do with discriminative learning rates.

Essentially, it controls by how much we decrease the learning rate as we move from the last layers back toward the earliest ones. Jeremy Howard figured out that for NLP RNNs, 2.6 is the magical number.
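To see where the numbers in that slice come from, here is a small sketch that spreads learning rates across layer groups, dividing by 2.6 for each step back from the last group. I’m assuming five layer groups here, which is what makes the smallest rate come out as 1e-2/(2.6**4); the `discriminative_lrs` helper is hypothetical and only illustrates the arithmetic.

```python
def discriminative_lrs(max_lr=1e-2, factor=2.6, n_groups=5):
    """Return one learning rate per layer group, earliest group first."""
    return [max_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


lrs = discriminative_lrs()

for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr = {lr:.6f}")
```

The earliest group gets the smallest rate because it holds the most general features, which we want to disturb the least.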

If you really want to dig deeper I suggest you take the Practical Deep Learning for Coders course FastAI offers.

In our case, you can keep going if you want, unfreezing one layer at a time. My notebook started having some memory issues, so I’ll stop here.
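For the curious, unfreezing one layer at a time can be pictured with a few toy objects. The layer group names and the `LayerGroup` class below are hypothetical, and the `freeze_to` function is my own simplified stand-in for the idea, not the library’s code. A real run would train for an epoch or so after each step.

```python
class LayerGroup:
    """A toy layer group that only remembers whether it is trainable."""

    def __init__(self, name):
        self.name = name
        self.trainable = False


def freeze_to(groups, n):
    """Freeze every group before index n and make the rest trainable (n may be negative)."""
    start = n if n >= 0 else len(groups) + n
    for i, group in enumerate(groups):
        group.trainable = i >= start


# Hypothetical layer groups, earliest first.
groups = [LayerGroup(name) for name in
          ["embedding", "rnn_1", "rnn_2", "rnn_3", "classifier_head"]]

schedule = []
for n in (-1, -2, -3):
    freeze_to(groups, n)
    schedule.append([g.name for g in groups if g.trainable])
    # ...train for an epoch here before unfreezing the next group...

freeze_to(groups, 0)  # finally, make everything trainable

for stage, names in enumerate(schedule, start=1):
    print(f"stage {stage}: training {names}")
```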

Save the model (here under the name 'second') so you don’t have to retrain again, and load back the tuned model to proceed.

Prediction Time — Testing Our Model

Our text_classifier_learner has a predict function that now allows us to pass in a review for classification. According to the documentation, the output generates a tuple.

The first two elements of the tuple are, respectively, the predicted class and label. Label here is essentially an internal representation of each class, since class name is a string and cannot be used in computation.

You’ll see what this looks like in a second.

We can check what each label corresponds to, and as you can see there are two classes.

So, at index zero is 1 which implies -ve polarity label. At index one is 2 which implies +ve polarity label.

Now, let’s see if we can predict the outcome of a made up review. Let’s start with something really simple and obvious. Remember, an outcome of label 1 (index 0) is -ve polarity and 2 (index 1) is +ve polarity.

The second element in the tuple is tensor(0). This is the class index, and index 0 corresponds to class 1, meaning it classified this review as having -ve polarity.

The documentation clarified that

The last element in the tuple is the predicted probabilities.

So, there’s a 56.7% chance it’s -ve polarity and 43.3% it’s +ve. The former wins out.
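Reading the tuple gets easier with a tiny helper. This sketch takes the class index and the probabilities as plain Python values (instead of the tensors the learner returns) and maps them back to the class and polarity described above. The `interpret_prediction` function is hypothetical, and the numbers are the ones from the example.

```python
def interpret_prediction(class_index, probabilities, classes=(1, 2)):
    """Translate a class index and probabilities into a readable result."""
    polarity_names = {1: "negative", 2: "positive"}
    label = classes[class_index]             # index 0 -> class 1, index 1 -> class 2
    confidence = probabilities[class_index]  # probability of the predicted class
    return label, polarity_names[label], confidence


label, polarity, confidence = interpret_prediction(0, [0.567, 0.433])
print(f"class {label} ({polarity} polarity) with probability {confidence:.1%}")
```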

Let’s try something more upbeat.

In this case, the second element in the tuple is tensor(1). This is the class index, and index 1 corresponds to class 2, meaning it classified this review as having +ve polarity. 99.8% probability feels pretty clear.

Right on the money!

What about a real Yelp review?

tensor(0), negative polarity. Got it right with a 62.8% probability.

Another negative polarity well in line with a rating of 1. Probability, in this case, is really high at 96.8%.

That’s 2 for 2.

Let’s look for something which should have a positive polarity.

3 for 3.

99.8% probability this review has a +ve polarity. Aligns well with a rating of 4 stars.


We covered a lot of ground in this post.

I started off by describing what transfer learning is and how it helps tremendously especially around Natural Language Processing tasks. For our use case, WikiText gave us the boost we needed.

You should walk away with a good understanding of how to fetch this data, build a language model, train a classifier and predict polarity on any given yelp review.

You can take this further by

  • Trying this on a completely different corpus, beyond just restaurant reviews
  • Exploring how you can predict exact ratings and not just polarity
  • Even better, deploying the model as an application to predict a set of reviews passed in as input

Props to Wang Shuyi whose post on a similar topic got me inspired.

There’s a lot to learn in this space and I hope this has similarly inspired you.

If you want to learn more about how FastAI can help you solve real problems check out their Machine Learning courses.

Best wishes!