Bayesian CNN model on MNIST data using Tensorflow-probability (compared to CNN)

LU ZOU
Published in Python experiments
Jan 29, 2019 · 6 min read

Motivation

I've recently been reading about Bayesian neural networks (BNN), where traditional backpropagation is replaced by Bayes by Backprop. This was introduced by Blundell et al (2015) and has since been adopted by many researchers. Instead of a point estimate for each weight, a BNN approximates the distribution of each weight, commonly a Gaussian/normal distribution with two parameters (mean and standard deviation), based on prior information and the data. Predictions use the posterior distribution of the weights, and backpropagation updates the parameters of these weight distributions rather than the weights themselves. This way, the model can provide uncertainty estimates for both the weights and the predictions.
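To make the idea concrete, here is a minimal sketch of a single Bayesian weight handled with the reparameterization trick (the names mu, rho and the softplus parameterization follow Blundell et al, but this is an illustration, not a library API):

import numpy as np

rng = np.random.default_rng(42)

mu, rho = 0.0, -3.0                  # variational parameters (rho encodes the std)
sigma = np.log1p(np.exp(rho))        # softplus keeps the standard deviation positive

# Reparameterization trick: draw a concrete weight for one forward pass.
eps = rng.standard_normal()
w = mu + sigma * eps                 # w ~ N(mu, sigma^2)

# Backpropagation then updates mu and rho (not w itself), so the whole
# distribution of the weight shifts as training proceeds.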

A BNN can be integrated into any neural network model, but here I'm interested in its application to convolutional neural networks (CNN).

So far, there are several existing Python packages that implement Bayesian CNNs. For example, Shridhar et al 2018 used PyTorch (also see their blogs), Thomas Wiecki 2017 used PyMC3, and Tran et al 2016 introduced the Edward package, which was later merged into TensorFlow Probability (Tran et al 2018).

This blog uses TensorFlow Probability to implement a Bayesian CNN and compares it to a regular CNN, using the famous MNIST data. Human accuracy on MNIST is about 97.5%–98%. A minimal single-layer neural network is used as the baseline model.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import warnings
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'    # silence TensorFlow's C++ logging

import matplotlib
matplotlib.use("Agg")                       # select the backend before pyplot is imported
import matplotlib.pyplot as plt
from matplotlib import figure
from matplotlib.backends import backend_agg
import seaborn as sns

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow.examples.tutorials.mnist import input_data

tf.logging.set_verbosity(tf.logging.ERROR)  # silence TensorFlow's Python logging
%matplotlib inline

Import data

TensorFlow's built-in MNIST API saves you a lot of effort in manipulating the MNIST data, so you can focus on model development. It lets you import the data in different shapes, with or without one-hot encoded labels, and draw mini-batches easily; the images also come already normalized.

The MNIST data are imported here in three versions:

  1. images reshaped to 784 (28×28) vectors, labels one-hot encoded
  2. images not reshaped (28×28×1 arrays), labels not one-hot encoded (integers 0–9)
  3. images not reshaped, labels one-hot encoded.
data_dir = 'MNIST_data/'   # assumed local directory holding the MNIST files
mnist_onehot = input_data.read_data_sets(data_dir, one_hot=True)
mnist_conv = input_data.read_data_sets(data_dir, reshape=False, one_hot=False)
mnist_conv_onehot = input_data.read_data_sets(data_dir, reshape=False, one_hot=True)
# display an image
img_no = 485
one_image = mnist_conv_onehot.train.images[img_no].reshape(28,28)
plt.imshow(one_image, cmap='gist_gray')
print('Image label: {}'.format(np.argmax(mnist_conv_onehot.train.labels[img_no])))

Image label: 6
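As a quick sanity check that the three variants differ only in shape and label encoding, the array shapes can be printed (expected output shown in comments, assuming the default 55,000-image training split):

print(mnist_onehot.train.images.shape)        # (55000, 784)        flattened images
print(mnist_onehot.train.labels.shape)        # (55000, 10)         one-hot labels
print(mnist_conv.train.images.shape)          # (55000, 28, 28, 1)
print(mnist_conv.train.labels.shape)          # (55000,)            integer labels 0-9
print(mnist_conv_onehot.train.images.shape)   # (55000, 28, 28, 1)
print(mnist_conv_onehot.train.labels.shape)   # (55000, 10)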

Baseline model

As a baseline model, a neural network with a single dense layer mapping the flattened image directly to the 10 class scores is built; this is equivalent to a multinomial logistic regression model. Because the image is flattened, the model ignores the connections between neighboring pixels.

The baseline model actually does a good job reaching around 91–93% accuracy.

The code is adapted from the Udemy course “Complete Guide to Tensorflow for Deep Learning with Python”.

# define placeholders
x = tf.placeholder(tf.float32, shape=[None, 28*28])
y_true = tf.placeholder(tf.float32, shape=[None, 10])
# define variables: weights and bias
W = tf.Variable(tf.zeros([28*28, 10]))
b = tf.Variable(tf.zeros([10]))
# create graph operations
y = tf.matmul(x,W)+b
# define loss function
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y))
# define optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.5)
train=optimizer.minimize(cross_entropy)
# create session
epochs = 5000
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for step in range(epochs):
        batch_x, batch_y = mnist_onehot.train.next_batch(50)
        sess.run(train, feed_dict={x: batch_x, y_true: batch_y})

    # EVALUATION
    correct_preds = tf.equal(tf.argmax(y, 1), tf.argmax(y_true, 1))
    acc = tf.reduce_mean(tf.cast(correct_preds, tf.float32))

    print('Accuracy on test set: {}'.format(
        sess.run(acc, feed_dict={x: mnist_onehot.test.images,
                                 y_true: mnist_onehot.test.labels})))

Accuracy on test set: 0.9124000072479248

CNN model

The CNN model is a simple version of the following:

  1. Convolutional layer (32 kernels)
  2. Max pooling
  3. Convolutional layer (64 kernels)
  4. Max pooling
  5. Flattening layer
  6. Fully connected layer (1024 output units)
  7. Dropout layer (50% dropout rate)
  8. Fully connected layer (10 output units, one for each digit)

The number of kernels, the dropout rate, and the number of output units are arbitrary here, with no parameter tuning. After 5000 batches, the model accuracy reaches about 99%, surpassing human-level accuracy.

x = tf.placeholder(tf.float32,shape=[None,28,28,1])
y_true = tf.placeholder(tf.float32,shape=[None,10])
hold_prob = tf.placeholder(tf.float32)
cnn = tf.keras.Sequential()
cnn.add(tf.keras.layers.Conv2D(32, kernel_size=5, padding='SAME', activation=tf.nn.relu))
cnn.add(tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"))
cnn.add(tf.keras.layers.Conv2D(64, kernel_size=5, padding='SAME', activation=tf.nn.relu))
cnn.add(tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"))
cnn.add(tf.keras.layers.Flatten())
cnn.add(tf.keras.layers.Dense(1024, activation=tf.nn.relu))
cnn.add(tf.keras.layers.Dropout(hold_prob))   # note: the Keras Dropout argument is the dropping rate
cnn.add(tf.keras.layers.Dense(10))
y_pred = cnn(x)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred))
optimizer = tf.train.AdamOptimizer(learning_rate=0.0001)
train = optimizer.minimize(cross_entropy)
steps = 5000
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

    for i in range(steps + 1):
        batch_x, batch_y = mnist_conv_onehot.train.next_batch(50)
        sess.run(train, feed_dict={x: batch_x, y_true: batch_y, hold_prob: 0.5})

        # print out a message every 500 steps
        if i % 500 == 0:
            matches = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_true, 1))
            acc = tf.reduce_mean(tf.cast(matches, tf.float32))
            print('Step {}: accuracy={}'.format(i, sess.run(acc, feed_dict={
                x: mnist_conv_onehot.test.images,
                y_true: mnist_conv_onehot.test.labels,
                hold_prob: 0.0})))   # dropping rate 0, i.e. no dropout at evaluation

Step 0: accuracy=0.20010000467300415
Step 500: accuracy=0.9563999772071838
Step 1000: accuracy=0.973800003528595
Step 1500: accuracy=0.9807999730110168
Step 2000: accuracy=0.9815000295639038
Step 2500: accuracy=0.9854000210762024
Step 3000: accuracy=0.9864000082015991
Step 3500: accuracy=0.9868000149726868
Step 4000: accuracy=0.9886000156402588
Step 4500: accuracy=0.9894999861717224
Step 5000: accuracy=0.9865999817848206

Bayesian CNN

I chose TensorFlow Probability to implement the Bayesian CNN purely out of convenience and familiarity with TensorFlow. The package minimizes the negative ELBO as the loss, using the Flipout gradient estimator: the expectation over the weight posterior is approximated by Monte Carlo sampling, with one shared weight perturbation per mini-batch that is decorrelated across the examples. Other implementations may be more efficient; for example, Shridhar et al 2018 applied the Local Reparameterization Trick, which samples the layer pre-activations from their induced distribution instead of sampling the weights.
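To give a flavour of what Flipout does, here is a toy NumPy sketch for a single dense layer (illustrative only, not the TFP implementation): one weight perturbation is shared by the whole batch, and per-example random sign vectors make the noise pseudo-independent across examples.

import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_out = 4, 5, 3

x = rng.normal(size=(batch, n_in))
w_mean = rng.normal(size=(n_in, n_out))           # posterior means
w_std = 0.1 * np.ones((n_in, n_out))              # posterior standard deviations

delta_w = w_std * rng.normal(size=(n_in, n_out))  # one shared perturbation per batch
r = rng.choice([-1.0, 1.0], size=(batch, n_out))  # per-example output sign flips
s = rng.choice([-1.0, 1.0], size=(batch, n_in))   # per-example input sign flips

# deterministic part + decorrelated stochastic part
out = x @ w_mean + ((x * s) @ delta_w) * r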

The code is modified from the example provided here. The code for the plots below is taken from the original example, and thus not displayed here.

images = tf.placeholder(tf.float32,shape=[None,28,28,1])
labels = tf.placeholder(tf.float32,shape=[None,])
hold_prob = tf.placeholder(tf.float32)
# define the model
neural_net = tf.keras.Sequential([
tfp.layers.Convolution2DReparameterization(32, kernel_size=5, padding="SAME", activation=tf.nn.relu),
tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"),
tfp.layers.Convolution2DReparameterization(64, kernel_size=5, padding="SAME", activation=tf.nn.relu),
tf.keras.layers.MaxPooling2D(pool_size=[2, 2], strides=[2, 2], padding="SAME"),
tf.keras.layers.Flatten(),
tfp.layers.DenseFlipout(1024, activation=tf.nn.relu),
tf.keras.layers.Dropout(hold_prob),
tfp.layers.DenseFlipout(10)])
logits = neural_net(images)

# training settings (defined before they are used below)
learning_rate = 0.001   # initial learning rate
max_step = 5000         # number of training steps to run
batch_size = 50         # batch size
viz_steps = 500         # frequency at which to save visualizations
num_monte_carlo = 50    # network draws used to compute predictive probabilities

# Compute the -ELBO as the loss, averaged over the batch size.
labels_distribution = tfp.distributions.Categorical(logits=logits)
neg_log_likelihood = -tf.reduce_mean(labels_distribution.log_prob(labels))
kl = sum(neural_net.losses) / mnist_conv.train.num_examples
elbo_loss = neg_log_likelihood + kl

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(elbo_loss)

# Build metrics for evaluation. Predictions are formed from a single forward
# pass of the probabilistic layers. They are cheap but noisy predictions.
predictions = tf.argmax(logits, axis=1)
accuracy, accuracy_update_op = tf.metrics.accuracy(labels=labels, predictions=predictions)

init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())

with tf.Session() as sess:
    sess.run(init_op)

    # Run the training loop.
    for step in range(max_step + 1):
        images_b, labels_b = mnist_conv.train.next_batch(batch_size)
        # heldout batch, used by the visualization code (not shown here)
        images_h, labels_h = mnist_conv.validation.next_batch(
            mnist_conv.validation.num_examples)

        _ = sess.run([train_op, accuracy_update_op],
                     feed_dict={images: images_b, labels: labels_b, hold_prob: 0.5})

        if step % viz_steps == 0:
            loss_value, accuracy_value = sess.run(
                [elbo_loss, accuracy],
                feed_dict={images: images_b, labels: labels_b, hold_prob: 0.5})
            print("Step: {:>3d} Loss: {:.3f} Accuracy: {:.3f}".format(
                step, loss_value, accuracy_value))

Step: 0 Loss: 161.928 Accuracy: 0.140
Step: 500 Loss: 135.825 Accuracy: 0.858
Step: 1000 Loss: 117.817 Accuracy: 0.907
Step: 1500 Loss: 99.129 Accuracy: 0.927
Step: 2000 Loss: 80.596 Accuracy: 0.938
Step: 2500 Loss: 63.682 Accuracy: 0.946
Step: 3000 Loss: 48.857 Accuracy: 0.950
Step: 3500 Loss: 36.574 Accuracy: 0.954
Step: 4000 Loss: 27.315 Accuracy: 0.957
Step: 4500 Loss: 20.480 Accuracy: 0.959
Step: 5000 Loss: 15.652 Accuracy: 0.961

Results

The regular CNN takes less time to run and achieves better accuracy than the Bayesian CNN with the same model structure. The one advantage the Bayesian CNN brings, however, is an uncertainty measure for the weights and the predictions.

The following plots show how the parameters of the weight posterior distributions converge over the training steps. At the beginning, the priors dominate, so all the distributions look similar; by the end, the posteriors differ, driven by the data. Recall the layers in the model (a sketch of how these posterior parameters are extracted follows the list below):

Layer 0: Convolutional layer (32 kernels)
Layer 2: Convolutional layer (64 kernels)
Layer 5: Fully connected layer (1024 output units)
Layer 7: Fully connected layer (10 output units, one for each digit)
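These plots are built from the variational posteriors exposed by the tfp layers. Roughly, the kernel posterior means and standard deviations can be pulled out as in the sketch below (following the original example; the layer indices match the list above):

# collect the weight posterior parameters of the probabilistic layers
qmeans, qstds, names = [], [], []
for i, layer in enumerate(neural_net.layers):
    if not hasattr(layer, 'kernel_posterior'):
        continue                              # skip pooling/flatten/dropout layers
    names.append('Layer {}'.format(i))
    qmeans.append(layer.kernel_posterior.mean())
    qstds.append(layer.kernel_posterior.stddev())

# evaluate inside the session at a visualization step, then plot histograms:
# qm_vals, qs_vals = sess.run([qmeans, qstds])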

The graphs below show the prediction uncertainties at training steps 1, 500 and 5000 (from left to right). Step 1 shows high uncertainty; after 500 training batches, the predictions become more confident in general, except for some unclear handwriting. For example, the second-to-last case is hard even for a human to call with certainty (3 or 5?). After step 5000, the model's confidence has improved significantly.
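These predictive distributions come from Monte Carlo sampling: every forward pass through the probabilistic layers draws a fresh set of weights, so running the network num_monte_carlo times yields a set of class-probability vectors whose spread reflects the model's uncertainty. A sketch of that step (following the original example; run inside the training session, where images_h is the validation batch fetched in the training loop above):

probs_op = tf.nn.softmax(logits)                  # build the op once, outside the loop
probs = np.asarray([sess.run(probs_op,
                             feed_dict={images: images_h[:10], hold_prob: 0.0})
                    for _ in range(num_monte_carlo)])   # shape (num_monte_carlo, 10, 10)
mean_probs = probs.mean(axis=0)                   # averaged predictive distribution
print('Predicted digits:', mean_probs.argmax(axis=1))
print('Std of the top-class probability across draws:',
      probs.max(axis=2).std(axis=0).round(3))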
