Training a TensorFlow model to recognize emotions

Jose Flores
May 24, 2018 · 10 min read

We are going to write a Python script that trains a custom supervised machine learning model using TensorFlow and Keras to recognize the emotion shown on a face.

I decided not to go with a retrain script provided for popular (and powerful) models like InceptionV3. Going with Inception would almost certainly be faster to train, and the model would probably be more accurate. But having done some of the basic TensorFlow tutorials out there, I wanted to challenge myself and learn the process end to end: training a model, combining pre-trained weights with newly trained weights, saving it ready for production, and loading it in a JVM server-side project and in GCP.

Full disclosure: I’m not a data scientist and I’m also not a fan of Python or dynamically typed languages (don’t hate me, data scientists), but that may be because of my unfamiliarity with them. There will undoubtedly be things that I don’t know, ranging from writing idiomatic Python to some machine learning concepts, but I am willing and eager to learn.

Getting the data

So we need to get labeled data (Supervised Learning!) of faces with the emotions shown. This means we need a lot of images of faces that have been labeled with an emotion (“Angry”, “Sad”) or, better, with the range of emotions being felt and their values (“Angry 10%”, “Hungry 90%”).

We could try to create our own data by building an application that takes a picture and asks the user what emotions they feel, but this would be a long process that still doesn’t guarantee that the data is ‘clean’ or that no incorrect data was fed in. We could also use something like Amazon’s Mechanical Turk, which lets you pay for real human engagement (an on-demand workforce) and is a good option if you have the money. Another option, and the one that we are going with, is to use Kaggle, an online repository of high-quality public datasets ranging from Pokemon stats to baseball stats. It wasn’t hard to find some data with human faces and labeled emotions.


From the data description:

“A model is only as good as the data that it receives”

“Garbage in, Garbage out”

In other words there are always limitations imposed by choosing a specific data set. In this case some of them are:

  • Only 7 emotions are recognized
  • Images are small (48x48), so the model input is small too: high-res photos have to be resized to low res, which loses detail (a slight smile won’t register)
  • In the training images the face is “centered and occupies the same amount of space in each image”. That is hard to replicate without building something to crop real-world images so they better match what the model was trained on, so accuracy will probably be bad if we don’t do this (we won’t)
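
To make the resizing limitation concrete, here is a minimal sketch of squeezing a larger grayscale photo down to the 48x48 input. The function name `center_crop_and_resize` and the nearest-neighbour sampling are my own assumptions for illustration, not part of the training script. A real pipeline would more likely detect the face first and use an image library for resampling.

```python
import numpy as np

def center_crop_and_resize(gray, size=48):
    """Center-crop a 2-D grayscale image to a square, then shrink it to size x size.

    Hypothetical helper: uses plain nearest-neighbour sampling, so fine detail
    (for example a slight smile) is simply thrown away.
    """
    height, width = gray.shape
    side = min(height, width)
    top = (height - side) // 2
    left = (width - side) // 2
    square = gray[top:top + side, left:left + side]

    # Pick `size` evenly spaced row/column indices from the square crop.
    idx = (np.arange(size) * side / size).astype(int)
    small = square[np.ix_(idx, idx)]

    # Scale 0-255 pixel values to 0-1.
    return small.astype("float32") / 255.0

# Stand-in for a 480x640 grayscale photo (random noise, not a real face)
photo = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
face_input = center_crop_and_resize(photo)
print(face_input.shape)  # (48, 48)
```

A 480x480 crop loses 99% of its pixels on the way to 48x48, which is why subtle expressions are unlikely to survive.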

Exploring the data

It’s always a good idea to start exploring the data as soon as possible so we have an idea of what we’re working with. A good way to explore the data is a Jupyter Notebook, previously known as IPython Notebook, which lets you write code and notes side by side and run the code one section at a time.

Let’s check it out.

import pandas as pd

raw_data_csv_file_name = 'data/fer2013.csv'
raw_data = pd.read_csv(raw_data_csv_file_name)

pd.read_csv(…) is a function provided by the Pandas library. It returns a DataFrame object that holds all the data and exposes a helpful, easy-to-use API.

First, let’s get a quick description of the data.


We can see there are 35887 entries (rows) of data and the names of the columns the data holds (emotion, pixels, Usage). It looks like emotion has the type int64 and the other columns are object, which means they can hold any type of Python object. Since we just read from a CSV file we can assume the data is a string. Let’s check out the first 5 entries.


OK, the emotion data is an int and matches the description (7 emotions, 0–6), the pixels seem to be a string of space-separated ints, and Usage is a string that has “Training” repeated, so it’s probably a categorical attribute that would also have something like “testing”. The pixels are ints and we know from the description that it’s a grayscale image, so they’re probably in the 0–255 range.

Let’s get more data on Usage since I’ve made a few assumptions about it so far.


Nice, it looks like the data is already split for testing.
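
Once we know what values Usage holds, selecting the rows for each set takes one line per set. Here is a minimal sketch on a tiny made-up frame. The values “PublicTest” and “PrivateTest” are assumptions about what the column contains besides “Training”, and the pixel strings are shortened fakes, so check them against your own `value_counts()` output.

```python
import io
import pandas as pd

# Tiny stand-in for fer2013.csv (same columns, fake shortened pixel strings)
sample_csv = io.StringIO(
    "emotion,pixels,Usage\n"
    "0,70 80 82,Training\n"
    "3,151 150 147,Training\n"
    "5,231 212 156,PublicTest\n"
    "6,24 32 36,PrivateTest\n"
)
sample = pd.read_csv(sample_csv)

# How many rows belong to each split
print(sample["Usage"].value_counts())

# Boolean masks select the rows for each set
train_df = sample[sample["Usage"] == "Training"]
test_df = sample[sample["Usage"] != "Training"]
print(len(train_df), len(test_df))  # 2 2
```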

Let’s transform the input pixels to see how the first image looks.

import numpy as np
import matplotlib.pyplot as plt

# only if running in a Jupyter notebook (the magic must sit on its own line)
%matplotlib inline

img = raw_data["pixels"][0]  # first image
val = img.split(" ")
x_pixels = np.array(val, 'float32')
x_pixels /= 255  # scale 0-255 to 0-1
x_reshaped = x_pixels.reshape(48, 48)

plt.imshow(x_reshaped, cmap="gray", interpolation="nearest")
plt.axis("off")

Figure: Pixels to face

We now have a pretty good idea of what the data is and how it’s structured, so let’s look at our architecture.

Model Architecture

Before this project I only had experience with the low-level TensorFlow API, so I wanted to try something at a higher level to be able to quickly write and iterate over different architectures (and because my low-level TensorFlow ones weren’t performing well).

I decided to try out Keras, which I had heard a lot of positive things about at forums and meetups. Keras is a high-level neural network API that can run on top of TensorFlow and gives access to the underlying TensorFlow graph, which was important for me because I wanted to use the standard TensorFlow saved model approach after training.

Also to circumvent some of the issues I was originally seeing in my first prototyping models I decided to use the help of a previously trained model named VGG16.

Using previously trained models

The only experience I had using a previously trained model was through some retrain scripts like the one for Inception models that would have you drop some images in a folder and run a python script.

“Transfer learning and domain adaptation refer to the situation where what has been learned in one setting … is exploited to improve generalization in another setting” — Deep Learning, 2016

This is more generally known as transfer learning, and the idea is that you use a previously trained model to help you accomplish a task. There are a lot of variations of this technique, for example using a retrain script, using a known model architecture and/or using weights learned from training.

Our architecture

We are going to use part of a model graph named VGG16 and then add some of our own layers on top. There are countless ways of accomplishing something like this, and a wrong assumption here could slow down our progress, so it is worth taking some time to understand.

VGG16 is a popular neural network architecture and Keras makes it easy to get a model. You can also see how the model was implemented, and how straightforward it is in Keras, by looking through the library.

from keras.applications.vgg16 import VGG16
vgg16 = VGG16(include_top=False, input_shape=(48, 48, 3), weights='imagenet')
  • include_top=False: Don’t include the 3 fully connected (dense) layers at the top of the network
  • input_shape=(48, 48, 3): The input has to have 3 channels because of how the VGG16 model was built (we could re-implement it without this, but we’ll leave it for now). 48 x 48 is the size of our training images. We will “copy/paste” the 48x48 values into three channels: (48,48) => (48,48,3)
  • weights='imagenet': Use the learned/pre-trained weights

Model Input

  1. We’ve transformed the pixels column (space-separated ints) to float values from 0–1 by dividing by 255.
  2. Reshaped the arrays of floats to a 48x48 matrix.
  3. Duplicated the 48x48 matrix to create a 48x48x3 matrix (for VGG16 input).
  4. We’re going to feed the inputs to VGG16 and get 512 feature values for each input.
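
Steps 1 to 3 above can be sketched as a single helper. The name `pixels_to_vgg_input` is mine, and the fake pixel string stands in for one row of the pixels column.

```python
import numpy as np

def pixels_to_vgg_input(pixel_string, size=48):
    """Turn a space-separated pixel string into a (size, size, 3) float array."""
    # 1. Parse the ints and scale 0-255 to 0-1
    pixels = np.array(pixel_string.split(" "), dtype="float32") / 255.0
    # 2. Reshape the flat array into a size x size matrix
    gray = pixels.reshape(size, size)
    # 3. Copy the grayscale values into three identical channels
    return np.stack([gray, gray, gray], axis=-1)

# Fake row: 48 * 48 pixel values cycling through 0-255
fake_pixels = " ".join(str(i % 256) for i in range(48 * 48))
x = pixels_to_vgg_input(fake_pixels)
print(x.shape)  # (48, 48, 3)
```

The three channels carry no new information. They only exist so the array matches the input shape VGG16 expects.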

So far we haven’t gone over any part of the model that will actually be trained. The model we are going to train is going to take in the 512 values that the VGG16 model outputs for the original input.

We’re using part of the VGG16 model with the previously learned weights to get more data from our inputs. It’s important that the weights are previously trained on the “imagenet” dataset because it means that the layers have basically been “tuned” to recognize and output meaningful features.

The idea of using a model, or some of its layers, to get more ‘data’ from an input is generally known as feature extraction. It is usually helpful when you have a small input that you want to get more meaningful data from.
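
A quick way to see where the 512 comes from: the convolutional part of VGG16 has five 2x2 max-pooling stages, each halving height and width, and its last block has 512 filters. The sketch below does that arithmetic and then flattens a stand-in feature map. The random array replaces a real `vgg16.predict(...)` call, and the flattening step is an assumption about how the features end up with shape (n_inputs, 512).

```python
import numpy as np

def vgg16_feature_shape(height, width, n_pools=5, channels=512):
    """Output shape of VGG16 without its top layers for one input image."""
    for _ in range(n_pools):
        height //= 2  # each 2x2 max pool halves the spatial size
        width //= 2
    return (height, width, channels)

print(vgg16_feature_shape(48, 48))    # (1, 1, 512)
print(vgg16_feature_shape(224, 224))  # (7, 7, 512)

# Stand-in for the conv-base output on 10 images (random values, not real features)
n_inputs = 10
feature_maps = np.random.rand(n_inputs, *vgg16_feature_shape(48, 48))

# Flatten (n_inputs, 1, 1, 512) to (n_inputs, 512) for the dense layers
x_train_feature_map = feature_maps.reshape(n_inputs, -1)
print(x_train_feature_map.shape)  # (10, 512)
```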

Model to be trained

The actual model that we are going to train is a small stack of 3 fully connected layers (plus an output layer) that takes in the 512 float values. Using the high-level Keras API makes this incredibly easy.

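from keras.models import Sequential
from keras.layers import Dense

NUM_CLASSES = 7  # Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral

topLayerModel = Sequential()
# only the first layer needs input_shape; later layers infer it
topLayerModel.add(Dense(256, input_shape=(512,), activation='relu'))
topLayerModel.add(Dense(256, activation='relu'))
topLayerModel.add(Dense(128, activation='relu'))
topLayerModel.add(Dense(NUM_CLASSES, activation='softmax'))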
  1. 512 Input => 256
  2. 256 => 256
  3. 256 => 128
  4. 128 => 7 classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral)
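
To make the shapes concrete, here is a plain NumPy sketch of the forward pass through that stack, with random weights standing in for trained ones. It only illustrates what each Dense layer computes, not how Keras implements it, and the helper names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [512, 256, 256, 128, 7]  # input, three hidden layers, 7 classes

# Random stand-ins for the weights and biases that .fit() would learn
weights = [rng.normal(0, 0.05, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(x):
    """Run a batch of 512-value feature vectors through the dense stack."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ w + b)
    return softmax(x @ weights[-1] + biases[-1])

batch = rng.random((4, 512))
probs = forward(batch)
print(probs.shape)  # (4, 7)

# Trainable parameters: one weight per connection plus one bias per unit
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
print(n_params)  # 230919
```

Each row of `probs` sums to 1 because of the softmax, so it can be read as a probability per emotion.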

Now all we need is to train (.fit()) the model.

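from keras.optimizers import Adamax

adamax = Adamax()

# Assumption: the loss is not given in the post; categorical cross-entropy is
# the usual pairing for a softmax output with one-hot labels.
topLayerModel.compile(loss='categorical_crossentropy',
                      optimizer=adamax, metrics=['accuracy'])
topLayerModel.fit(x_train_feature_map, y_train,
                  validation_data=(x_train_feature_map, y_train),  # as written, validates on the training data
                  epochs=FLAGS.n_epochs, batch_size=FLAGS.batch_size)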
  • x_train_feature_map: array of shape (n_inputs, 512)
  • y_train: one-hot vector of the correct emotion (ex. 0 = Angry = [1, 0, 0, 0, 0, 0, 0]; 5 = Surprise = [0, 0, 0, 0, 0, 1, 0])
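
Building those one-hot vectors from the integer emotion column is short enough to sketch in NumPy. The helper name `to_one_hot` is mine.

```python
import numpy as np

NUM_CLASSES = 7

def to_one_hot(labels, num_classes=NUM_CLASSES):
    """Map integer labels (0-6) to one-hot rows."""
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

y = to_one_hot([0, 5])
print(y)
```

Keras also ships `keras.utils.to_categorical`, which does the same job.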

Merging the models

Now that we’re done with training we want to merge the models so that when we input new images they will go through the same VGG16 “preprocess” that our training data went through (take in 48,48,3 and output 512).

When we were training the previous model (by using .fit()), it’s important to note that we were only updating the weights of the 3 small fully connected layers and not any part of the VGG16 model (by design, for faster training).

We’re now going to merge the part of the VGG16 model that has the imagenet weights with our model and the weights it learned from training (.fit()). Keras makes this task straightforward too.

from keras.layers import Input
from keras.models import Model

inputs = Input(shape=(48, 48, 3))
vg_output = vgg16(inputs)
model_predictions = topLayerModel(vg_output)
final_model = Model(inputs=inputs, outputs=model_predictions)

final_model:

  • Input: 48x48x3 matrix
  • VGG16(48x48x3) => 512
  • topLayerModel(512) => 7-dimension array
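
Reading the 7-dimension output is then a matter of finding the largest score. A minimal sketch, assuming the class order listed earlier (0 = Angry through 6 = Neutral); `decode_prediction` and the example scores are made up for illustration.

```python
import numpy as np

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

def decode_prediction(scores):
    """Return (label, score) for the highest of the 7 softmax scores."""
    scores = np.asarray(scores)
    best = int(np.argmax(scores))
    return EMOTIONS[best], float(scores[best])

# Made-up output row, shaped like final_model.predict(...)[0]
example_scores = [0.02, 0.01, 0.05, 0.80, 0.04, 0.03, 0.05]
print(decode_prediction(example_scores))  # ('Happy', 0.8)
```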

Save for TensorFlow Serving

We now have a trained model and we want to save it so that we can throw it on a TensorFlow Serving server or GCP, or load it from Java, Python or JavaScript. Saving a model in TensorFlow used to be painful because there were a couple of ways to save models, but I’ve found this format to be the most compatible across languages and frameworks.

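from keras import backend as K
from keras.models import Model
from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import tag_constants, signature_constants
from tensorflow.python.saved_model.signature_def_utils_impl import build_signature_def, predict_signature_def

config = final_model.get_config()
weights = final_model.get_weights()
# don't have to create a new one
model_to_save = Model.from_config(config)
model_to_save.set_weights(weights)

export_path = 'export_path'

builder = saved_model_builder.SavedModelBuilder(export_path)

# 'images' is a placeholder input key; the post does not show the original name
signature = predict_signature_def(
    inputs={'images': model_to_save.input},
    outputs={'scores': model_to_save.output})

with K.get_session() as sess:
    builder.add_meta_graph_and_variables(
        sess=sess,
        tags=[tag_constants.SERVING],
        signature_def_map={'predict': signature})
    builder.save()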


I trained this model for 10,000 epochs with a batch size of 50.

You may notice that once the model predicts a certain emotion on an image, all the other ones have similarly low values. This is happening because of the way we trained it and the dataset we used. Each image only had one associated label so during training it was basically taught that given an image it should only have one emotion (one hot vector & cost function).
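
Here is a small sketch of why that happens, assuming the cost function is categorical cross-entropy (the usual choice for one-hot labels, but an assumption on my part here). With a one-hot target only the probability of the labeled class enters the loss, so the cheapest thing the model can do is push that one value toward 1 and everything else toward 0.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred)))

target = np.array([0, 0, 0, 1, 0, 0, 0], dtype="float32")  # labeled "Happy"

# Two made-up predictions that both rank "Happy" first
peaked = np.array([0.02, 0.02, 0.02, 0.88, 0.02, 0.02, 0.02])
mixed = np.array([0.05, 0.05, 0.30, 0.40, 0.10, 0.05, 0.05])

peaked_loss = categorical_cross_entropy(target, peaked)
mixed_loss = categorical_cross_entropy(target, mixed)
print(peaked_loss)  # about 0.13
print(mixed_loss)   # about 0.92
```

Both predictions pick the right emotion, but the mixed one is penalized about seven times harder, so training steadily favors the peaked kind of output.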


  • Training Set (Model has seen these): 99.8%
  • Test/Validation Set (New inputs to model): 43.7%

The difference in accuracy is somewhat expected but it can also be pointing to a problem with overfitting. We also should not expect this model to be 43% accurate in real world instances unless we do some preprocessing on new images to more closely match the images fed in during training/testing.

After training you should have a folder with the artifacts from saving and they should look like:

├── saved_model.pb
└── variables
    └── variables.index

In the next section we’ll load this model in GCP and in a Java project to perform inference on new images. The code and python notebook for this post can be found here.
