k-fold Cross-Validation in Keras Convolutional Neural Networks

Navpreet Singh
Apr 9, 2019

Data Overview: This article is based on an implementation of the paper Convolutional Neural Networks for Sentence Classification (EMNLP 2014). We will use the Movie Reviews dataset and build a movie review sentiment classifier with convolutional neural networks in Keras.

Data and Google word2vec download

For this analysis, we will use the Movie Reviews from the Sentiment polarity datasets by Bo Pang and Lillian Lee of Cornell University (2005). The reviews can be downloaded from http://www.cs.cornell.edu/people/pabo/movie-review-data/ and consist of 5331 positive and 5331 negative processed sentences/snippets.

We will represent each review as a matrix whose rows are the word2vec (Google's Word2Vec) vectors of the words in the sentence. word2vec transforms a word into a vector of size 300. Using these pre-trained vectors requires downloading the Google News word2vec model.

Data loading and preprocessing

After downloading the data, we can manually join the positive and negative reviews (in that order) into a single text file and load it with pandas:

import pandas as pd

reviews = pd.read_csv('../input/reviews/reviews.txt', delimiter='\n', header=None)
df = reviews
df.columns = ['Phrase']
df['Sentiment'] = labels  # labels: one sentiment value per review (see below)
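
Here labels is assumed to hold one sentiment value per review. Since the file was built by concatenating the positive reviews first and the negative ones after (5331 of each), one way to construct it is the following sketch:

import numpy as np

# 1 for the first 5331 (positive) reviews, 0 for the remaining 5331 (negative) ones
labels = np.concatenate([np.ones(5331, dtype=int), np.zeros(5331, dtype=int)])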

To load the word2vec binary file, we need to install the library 'gensim'. If you are using the Anaconda distribution, use

conda install -c anaconda gensim

And the binary word2vec file (GoogleNews-vectors-negative300.bin) can be loaded as

import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.models.keyedvectors import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('../input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin', binary=True)
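
As a quick sanity check that the binary loaded correctly, we can look up a word and confirm it maps to a 300-dimensional vector (a small sketch; the chosen words are arbitrary):

vector = word_vectors['movie']
print(vector.shape)                               # (300,)
print(word_vectors.similarity('movie', 'film'))   # cosine similarity between the two words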

We strip the punctuation from the text of the reviews, but we do not lowercase it because word2vec is case-sensitive. (I verified this by comparing 'b' with 'B' and 'c' with 'C' in the word2vec file.)

import re

def clean_str(text):
    text = str(text)
    # replace URLs with the token "url"
    text = re.sub(r"(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9]\.[^\s]{2,})", "url", text)
    # strip punctuation and other non-word characters
    text = re.sub(r'([^\s\w]|_)+', '', text)
    return text

df['text'] = df['Phrase'].apply(clean_str)

Next, we tokenize the text so that each word can be mapped to an index and, later, to its GoogleNews word2vec vector.

from keras.preprocessing.text import Tokenizer

NUM_WORDS = 22000  # the number of most common words we want to keep

tokenizer = Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(df['text'])
sequences_train = tokenizer.texts_to_sequences(df['text'])
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
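
The network we build later expects inputs of a fixed length, and the cross-validation code uses arrays X and y. The following is a sketch of how they can be built by padding every sequence to the length of the longest review (X and y are the names used in the cross-validation loop later):

from keras.preprocessing.sequence import pad_sequences
import numpy as np

sequence_length = max(len(seq) for seq in sequences_train)
X = pad_sequences(sequences_train, maxlen=sequence_length, padding='post')
y = np.array(df['Sentiment'])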

Vectorizing the Movie Reviews

The most important part of the implementation is the embedding layer of the network. The paper describes four model variants that differ in how the embedding layer is built; we will implement only the first three, listed below:

  1. CNN-rand - a model where all word vectors are randomly initialized and then modified during training.
  2. CNN-static - a model with pre-trained vectors from word2vec. All words, including the unknown ones that are randomly initialized, are kept static and only the other parameters of the model are learned.
  3. CNN-non-static - same as above, but the pre-trained vectors are fine-tuned for each task.

import numpy as np
from keras.layers import Embedding

EMBEDDING_DIM = 300
vocabulary_size = min(len(word_index) + 1, NUM_WORDS)
embedding_matrix = np.zeros((vocabulary_size, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= NUM_WORDS:
        continue
    try:
        embedding_vector = word_vectors[word]
        embedding_matrix[i] = embedding_vector
    except KeyError:
        # words missing from word2vec get a small random vector
        embedding_matrix[i] = np.random.normal(scale=0.1, size=(EMBEDDING_DIM,))

random_embedding = Embedding(vocabulary_size, EMBEDDING_DIM)
static_embedding = Embedding(vocabulary_size, EMBEDDING_DIM, weights=[embedding_matrix], trainable=False)
non_static_embedding = Embedding(vocabulary_size, EMBEDDING_DIM, weights=[embedding_matrix], trainable=True)

Building the model

The paper uses a simple convolutional neural network, a slight variant of the CNN architecture of Collobert et al., shown in the figure below.

Image Source: Convolutional Neural Networks for Sentence Classification by Yoon Kim

For this dataset, Kim uses rectified linear units, filter windows (h) of 3, 4 and 5 with 100 feature maps each, a dropout rate (p) of 0.5, an l2 constraint (s) of 3, and a mini-batch size of 50. The figure below shows the three different filter sizes, the pooling layer and the output layer.

Image source: A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification by Ye Zhang and Byron Wallace

from keras.layers import Input, Reshape, Conv2D, MaxPooling2D, concatenate, Flatten, Dropout, Dense
from keras.models import Model
from keras.optimizers import Adadelta
from keras import regularizers

def evaluate_model(X_train, X_val, y_train, y_val):

    sequence_length = X_train.shape[1]
    filter_sizes = [3, 4, 5]
    num_filters = 100
    drop = 0.5

    inputs = Input(shape=(sequence_length,), dtype='int32')
    embedding = non_static_embedding(inputs)
    reshape = Reshape((sequence_length, EMBEDDING_DIM, 1))(embedding)

    # three parallel convolutions with filter heights 3, 4 and 5
    convolution1 = Conv2D(num_filters, (filter_sizes[0], EMBEDDING_DIM), activation='relu')(reshape)
    convolution2 = Conv2D(num_filters, (filter_sizes[1], EMBEDDING_DIM), activation='relu')(reshape)
    convolution3 = Conv2D(num_filters, (filter_sizes[2], EMBEDDING_DIM), activation='relu')(reshape)

    # max-over-time pooling for each filter size
    maxpooling1 = MaxPooling2D((sequence_length - filter_sizes[0] + 1, 1), strides=(1, 1), padding='valid')(convolution1)
    maxpooling2 = MaxPooling2D((sequence_length - filter_sizes[1] + 1, 1), strides=(1, 1), padding='valid')(convolution2)
    maxpooling3 = MaxPooling2D((sequence_length - filter_sizes[2] + 1, 1), strides=(1, 1), padding='valid')(convolution3)

    merged = concatenate([maxpooling1, maxpooling2, maxpooling3], axis=1)
    flatten = Flatten()(merged)
    dropout = Dropout(drop)(flatten)
    output = Dense(units=1, activation='sigmoid', kernel_regularizer=regularizers.l2(3))(dropout)
    model = Model(inputs, output)

    epochs = 100
    batch_size = 50
    model.compile(loss='binary_crossentropy', optimizer=Adadelta(lr=1), metrics=['acc'])
    # save the freshly initialized weights so every fold can start from them
    model.save_weights('model.h5')
    # callbacks = [EarlyStopping(monitor='val_acc', patience=10)]

    model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=epochs, batch_size=batch_size, verbose=2)  # , callbacks=callbacks

    _, val_acc = model.evaluate(X_val, y_val, verbose=1)
    # restore the initial weights so the next fold starts from scratch
    model.load_weights('model.h5')

    return model, val_acc

The weights are saved just after compiling the model and loaded back after training, so that when we run the k-fold cross-validation in the next part, each fold is reset to the same initial weights.

10-fold cross-validation

Here, we run 10-fold cross-validation to validate our model. This step is usually skipped for CNNs because of the computational overhead. While implementing this project, this step was the hardest because there is not much documentation on running k-fold cross-validation in Keras. Saving and loading the initial weights makes it possible: after each fold, the learnt weights are reset to the weights that were stored just after compiling the model.

from sklearn.model_selection import train_test_split

n_folds = 10
cv_scores, model_history = list(), list()
for _ in range(n_folds):
    # split data
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=np.random.randint(1, 1000, 1)[0])
    # evaluate model
    model, val_acc = evaluate_model(X_train, X_val, y_train, y_val)
    print('>%.3f' % val_acc)
    cv_scores.append(val_acc)
    model_history.append(model)

print('Estimated Accuracy %.3f (%.3f)' % (np.mean(cv_scores), np.std(cv_scores)))

This implementation gives a mean accuracy of 80.8% with a standard deviation of 1.1%. The accuracy reported in the paper for the same model is 81.5%. We can try to tweak the hyper-parameters to beat the performance of the paper. One hyper-parameter that I played with was the l2 constraint. Setting it to a lower number gave better performance, and setting it to 0 gave me the best-performing model, with an accuracy of 83.5% and a standard deviation of 1.4%.
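
As an aside, the loop above draws a fresh random 90/10 split on each iteration rather than partitioning the data into ten disjoint folds. If strict k-fold splits are preferred, scikit-learn's KFold can supply the indices; a minimal sketch, reusing evaluate_model and the X, y arrays from above:

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
kfold_scores = []
for train_idx, val_idx in kfold.split(X):
    _, val_acc = evaluate_model(X[train_idx], X[val_idx], y[train_idx], y[val_idx])
    kfold_scores.append(val_acc)

print('Estimated Accuracy %.3f (%.3f)' % (np.mean(kfold_scores), np.std(kfold_scores)))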

For future work, I will run a grid search over the hyperparameters to find the best set of parameters. These hyperparameters include the l2 constraint, filter region size, dropout rate, pooling size, activation functions and batch size.
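
As a rough outline of that search, here is a hypothetical sketch that sweeps the l2 constraint; it assumes evaluate_model is modified to accept the constraint value as an extra argument, e.g. evaluate_model(X_train, X_val, y_train, y_val, l2_constraint):

# hypothetical sweep over the l2 constraint
for l2_constraint in [0, 0.5, 1, 3]:
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=42)
    _, val_acc = evaluate_model(X_train, X_val, y_train, y_val, l2_constraint)
    print('l2 = %s -> validation accuracy = %.3f' % (l2_constraint, val_acc))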

To be continued…
