Training on Large Datasets That Don’t Fit In Memory in Keras

Rajat Garg
Mar 19, 2019 · 6 min read


Are you training a deep learning model on a dataset too large to fit in memory? If so, this article will be of great help to you. In this article, we will discuss how to train a deep learning network in Keras on a dataset that does not fit in memory.

Introduction

Deep learning algorithms outperform most other algorithms and produce state-of-the-art results on a wide range of problems. A major reason for this success is the growing size of the datasets they are trained on. Today, deep learning models are routinely trained on datasets that do not even fit in memory. The question is: how do we train a model on such huge datasets? This article is divided into the following subparts:

  • Downloading and Understanding Dataset
  • Preparation of Dataset — To Load the Dataset in Batches
  • Shuffling and Splitting of the Dataset into Train and Validation Sets
  • Creation of Custom Generator
  • Defining Model Architecture and Training Model
  • Conclusion

As a running example, we will solve the Kaggle “Plant Seedlings Classification” challenge. The dataset for this challenge is not actually that big, but we will treat it as if it were too large to fit in memory and load it in batches.

The code for this challenge is written in Google Colab. All the code is shared in the GitHub repository.

Downloading and Understanding Dataset

You can download the dataset from here. Unzip the train.zip folder. The dataset contains 4,750 images of plant seedlings at various growth stages, classified into 12 plant species: Black-grass, Charlock, Cleavers, Common Chickweed, Common wheat, Fat Hen, Loose Silky-bent, Maize, Scentless Mayweed, Shepherd's Purse, Small-flowered Cranesbill, and Sugar beet. The goal of the competition is to create a classifier capable of determining a plant’s species from a photo.

The current directory looks like:

Folder Directory

If the dataset for your problem is not in this format, don’t worry; your dataset can be in any format. As we will see in the next section, the aim is to take all the data points (i.e. the images in our example) and save them to a single folder. Data points can be anything: images, audio files, etc.

Preparation of Dataset — To Load the Dataset in Batches

The next step is to take the whole dataset (i.e. all the data points; images in our example) and store it in one folder. We create a new folder named “all_images”, and the aim is to store every image in the dataset in this “all_images” folder.

We use the script below to store all the images in the “all_images” folder. You can write a similar script to take all the data points in your dataset (images, audio files, etc.) and store them in a new folder.
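
Here is a minimal sketch, assuming the unzipped train/<species>/<image>.png layout shown above; each copied file is prefixed with its species folder name so that filenames stay unique across species:

import os
import shutil

# Assumed layout after unzipping: train/<species>/<image>.png
src_root = 'train'
dst_dir = 'all_images'
os.makedirs(dst_dir, exist_ok=True)

for species in os.listdir(src_root):
    species_dir = os.path.join(src_root, species)
    for image_name in os.listdir(species_dir):
        # Prefix with the species folder name so every filename stays unique.
        shutil.copy(os.path.join(species_dir, image_name),
                    os.path.join(dst_dir, species + '_' + image_name))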

The next step is to store the name of each data point (i.e. the name of each image) in one array (let’s name this array filename), and to store the label associated with each data point in another array (let’s call this array labels).

Below is a script that stores the name of each image in the filename array and the label associated with that image in the labels array.
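
A sketch consistent with the copy script above: it walks the same train/<species> folders, records the prefixed filename of each image, and stores an integer class index (0 to 11) as its label:

import os

# Map each species folder to an integer class index (0..11).
species_list = sorted(os.listdir('train'))
species_to_index = {species: idx for idx, species in enumerate(species_list)}

filename = []
labels = []

for species in species_list:
    for image_name in os.listdir(os.path.join('train', species)):
        filename.append(species + '_' + image_name)  # same prefixed name used when copying
        labels.append(species_to_index[species])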

Note: Keep in mind that the name of each data point should be unique.

Now, you can save the “all_images” folder, the “filename” array and the “labels” array for later use.

Below, we convert filename and labels to NumPy arrays and save them as .npy files.
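
For example:

import numpy as np

filename = np.array(filename)
labels = np.array(labels)

np.save('filename.npy', filename)
np.save('labels.npy', labels)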

Shuffling and Splitting of the Dataset into Train and Validation Sets

The next step is to shuffle the dataset to remove any ordering from it (for example, all images of one species appearing together).
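
One way to do this, assuming the filename and labels arrays built above, is scikit-learn’s shuffle helper, which applies the same random permutation to both arrays:

from sklearn.utils import shuffle

# Shuffle filenames and labels with the same random permutation.
filename, labels = shuffle(filename, labels, random_state=42)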

Now, let’s split the dataset into a train and validation set. We can save these files also as these will be used later for training and validating of our model.

from sklearn.model_selection import train_test_split
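
A minimal sketch, with an assumed 80/20 split; the four resulting arrays are saved so they can be reloaded at training time:

import numpy as np
from sklearn.model_selection import train_test_split

X_train_filenames, X_val_filenames, y_train, y_val = train_test_split(
    filename, labels, test_size=0.2, random_state=1)

np.save('X_train_filenames.npy', X_train_filenames)
np.save('X_val_filenames.npy', X_val_filenames)
np.save('y_train.npy', y_train)
np.save('y_val.npy', y_val)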

You can also save the “all_images” folder in zip format, in case you want to share the dataset with other team members.

The lines below simply create an “all_images.zip” archive.
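
For example, using shutil:

import shutil

# Creates all_images.zip from the contents of the all_images folder.
shutil.make_archive('all_images', 'zip', 'all_images')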

Creation of Custom Generator

Note: As our dataset is too large to fit in memory, we have to load it from the hard disk into memory in batches.

To do so, we are going to create a custom generator. Our custom generator will load the dataset from the hard disk into memory in batches.
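
Below is a minimal sketch of such a generator, subclassing keras.utils.Sequence (use tensorflow.keras instead if you are on tf.keras). The all_images path, the 80×80 target size and the use of skimage for reading and resizing are assumptions; swap in your own preprocessing as needed:

import numpy as np
import keras
from skimage.io import imread
from skimage.transform import resize


class My_Custom_Generator(keras.utils.Sequence):

    def __init__(self, image_filenames, labels, batch_size):
        self.image_filenames = image_filenames
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches this generator produces per epoch.
        return int(np.ceil(len(self.image_filenames) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Slice out the filenames and labels of batch number idx,
        # then load, resize and normalise only those images.
        batch_x = self.image_filenames[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]

        images = np.array([
            resize(imread('all_images/' + str(file_name)), (80, 80, 3), preserve_range=True)
            for file_name in batch_x]) / 255.0

        return images, np.array(batch_y)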

Let’s try to understand the whole code:

The class definition: our custom generator class inherits from Keras’s Sequence class.

__init__: here we feed parameters to our generator. In this example, we pass the image filenames as image_filenames, the labels as labels and the batch size as batch_size.

__len__: this method computes the number of batches that this generator is supposed to produce. So, we divide the total number of samples by batch_size and return that value.

__getitem__: here, given the batch number idx, you put together the data batch and the ground truth (GT). In this example, we read batch_size images from disk and return them together with their labels in the form (image_batch, GT).

In the __getitem__(self, idx) method you decide what happens to your data as it is loaded in batches. This is also where you can put your preprocessing steps; for audio data, for example, you could compute the mel spectrogram of each file here.

The explanation above is adapted from this post by Ramin Rezaiifar.

OK, so we have created our data generator. The next step is to create instances of this class.
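
For example, with an assumed batch size of 32 and the filename/label arrays from the split above:

batch_size = 32  # assumed batch size

my_training_batch_generator = My_Custom_Generator(X_train_filenames, y_train, batch_size)
my_validation_batch_generator = My_Custom_Generator(X_val_filenames, y_val, batch_size)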

Here we instantiate two instances of My_Custom_Generator (one for training and one for validation) and initialize them with the image filenames and ground-truth labels of the corresponding split, along with the batch size.

Defining Model Architecture and Training Model

Let’s first import some libraries:

Now, let’s define our model architecture and compile the model. You can use any of your model architecture here.

Now, let’s train our model.

Results of our training:

Training Result

Our model is not giving good results and is overfitting. We could improve the accuracy with a different model architecture, better preprocessing of the dataset, data augmentation, or transfer learning, but that is not the aim of this article. I hope you now have a clear understanding of how to train a deep learning model on a huge dataset. If you have any doubts or questions, leave a comment and I will be happy to help.

Conclusion

With growing access to IoT devices and smartphones, a huge amount of data is collected every second. With the power of deep learning algorithms, we can create value on top of these huge datasets. In this article, I have tried to give you a clear understanding, with a worked example, of how you can train your own deep learning model on a dataset that does not fit in memory.

Note: It’s always better to preprocess your dataset and then feed it to the learning algorithm otherwise preprocessing of the dataset will happen on each epoch.

For more information, refer to the Keras documentation.

GitHub Code: The complete code for this post, written in Google Colab, is available in the GitHub repository.

LinkedIn Profile: You can follow me on LinkedIn as well.


