Beginner’s guide to feeding data in TensorFlow — Part 1

Vikas Sangwan
Coinmonks
Jun 23, 2018 · 5 min read


Hi, this series is about feeding data (images and numeric features) to your TensorFlow models. After reading all the posts, you will be able to:

  1. Feed in-memory data to your model.
  2. Feed data in the TFRecords format to your model.
  3. Feed raw images on disk to your model.

In this part I will focus on feeding in-memory data.

This post assumes the following skill set on the reader’s side: a basic understanding of how neural nets work, and the basics of TensorFlow or of building models in Keras.

The code for this post is available here.

Let’s get started.

INSTALLATION

pip install tensorflow-gpu keras

or if you don’t have a GPU, install the CPU version of tensorflow.

pip install tensorflow keras

DATASET

We will be using the popular MNIST dataset. It contains images of the digits 0–9. This dataset is generally available in numpy array format, so I will also provide the code to convert the data to the TFRecords format and to raw images on disk.

Using the Estimator API

TensorFlow provides the high-level Estimator API for defining your model with minimal effort. We will use this API to build a simple model with 2 hidden layers of 100 neurons each and an output layer of 10, since the dataset has 10 classes.

When you create a model with the Estimator API, it expects a list describing the feature columns of the dataset that will be fed during training. Since we are working with images of digits, we will provide a single feature column of numeric type.
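A minimal sketch of such a feature column definition (the key name “image” and the 784-element shape are illustrative choices for flattened 28x28 MNIST images):

import tensorflow as tf

# One numeric feature column holding the flattened 28x28 = 784 pixel values.
feature_columns = [
    tf.feature_column.numeric_column(key="image", shape=(784,))
]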

Here, key is the name we wish to give to that feature column. Note that the same key should be passed to the model when you are feeding the data.

For a more detailed list of feature column types, see TensorFlow’s documentation.

Now, to create the model, we will use one of the pre-made estimators: specifically DNNClassifier, a.k.a. the deep neural net classifier, in which you can add as many dense layers as you want.

Defining the model:
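A minimal sketch of the model definition, reusing the feature_columns list from above; the model_dir path is an illustrative choice:

model = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[100, 100],    # two hidden layers of 100 neurons each
    n_classes=10,               # digits 0-9
    model_dir="./mnist_model"   # where checkpoints and logs are written
)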

Now that we have defined our model, let’s define how the data flows into it.

  1. Working with numpy arrays
  • Since we already have the data as numpy arrays, the simplest approach is to pass them to the model directly, right? Correct. TensorFlow provides this feature under tf.estimator.inputs. Let’s pass the data and train the model for 101 steps, as sketched just below. Here, steps means the number of mini-batches seen by the model; the default batch size is 128.
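A minimal sketch of this step, assuming MNIST is loaded as numpy arrays (for example via keras.datasets.mnist) and flattened to shape (N, 784):

from tensorflow import keras

(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32")
y_train = y_train.astype("int32")

# numpy_input_fn builds an input function straight from numpy arrays.
numpy_train_input = tf.estimator.inputs.numpy_input_fn(
    x={"image": x_train},   # the key must match the feature column key
    y=y_train,
    batch_size=128,         # the default batch size mentioned above
    num_epochs=None,        # cycle through the data indefinitely
    shuffle=True
)

model.train(input_fn=numpy_train_input, steps=101)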

The output will look like this:

This works fine, but what if you want to do some preprocessing of the data before feeding it to the model? This is where the Dataset API comes to the rescue.

As you saw in the code above, model.train expects the data to be fed from an input function. Since that input function will need to call a preprocessing function whenever we do any preprocessing, let’s write one that converts the numpy arrays to tensors and changes the data type to float32, since the weights of the dense layers are of dtype float32.
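A minimal sketch of such a preprocessing function; the name _parse_preprocess and the “image” key match how the function is referred to later in the post:

def _parse_preprocess(image, label):
    # Cast to the dtypes the dense layers and the classifier expect.
    image = tf.cast(image, tf.float32)
    label = tf.cast(label, tf.int32)
    # Return (features dict, label); the dict key must match the
    # feature column key used when building the model.
    return {"image": image}, label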

NOTE: As a rule of thumb, always convert your data to tensors (float32, int32) first; otherwise you will get weird errors that cause a lot of problems.

Now let’s write the input function that returns an iterator to fetch the next batch of data.
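A minimal sketch of that input function, explained step by step below; the argument names and the shuffle buffer size are illustrative choices:

def train_input_fn(images, labels, batch_size=128):
    # Build a Dataset where each element is one (image, label) pair.
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    dataset = dataset.map(_parse_preprocess)       # apply the preprocessing
    dataset = dataset.shuffle(buffer_size=10000)   # shuffle the examples
    dataset = dataset.batch(batch_size)            # group into mini-batches
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()   # (features dict, labels) for the next batch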

Here we are using the Dataset API (tf.data.Dataset). Since we have the data as arrays, we can call the from_tensor_slices method and pass the data to it.

I know, the name of the method is a bit ambiguous!

The from_tensor_slices method can also take a list of filenames stored on disk. I will explain this in the next post.

If you have a dataset that contains various numeric and categorical features, you can use TextLineDataset from the Dataset API or pandas_input_fn from tf.estimator.inputs, but I would highly recommend the former.

The incoming data is then passed to the _parse_preprocess function that we wrote, which returns a tuple of 2 elements: the first is a dictionary with the key and the image data as a 784-element tensor of dtype tf.float32, and the second is the label of dtype tf.int32.

Now, as the names suggest, the .shuffle() and .batch() methods shuffle and batch the data respectively.

To iterate over these batches of data points, we call make_one_shot_iterator, which returns an iterator, a.k.a. a generator.

The .get_next() simply returns the next batch of data.

Now it’s time to train our model.

NOTE: the input_fn argument of estimator.train() does not accept a function call with parameters; it expects a zero-argument callable. The workaround is to use Python’s lambda feature.
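A minimal sketch, reusing the names from the snippets above:

model.train(
    # The lambda turns a call that needs arguments into a zero-argument
    # callable, which is what estimator.train() expects.
    input_fn=lambda: train_input_fn(x_train, y_train, batch_size=128),
    steps=101
)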

Conclusion:

The above code works for image data in numpy array format. To generalize: if, instead of just image data, you have data with various features, including numeric, ordinal, and nominal columns, then (a rough sketch follows this list):

  • Just pass the feature names in a list while creating the model.
  • Use TextLineDataset from the tf.data API instead of from_tensor_slices in train_input_fn().
  • Map the dataset to an appropriate preprocessing function that still returns a tuple containing the features (with the appropriate keys in a dict) and a label.
  • Then shuffle and batch according to your requirements, and finally make an iterator and return the next batch.
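A rough sketch of those steps, assuming a hypothetical CSV file train.csv with two numeric feature columns followed by an integer label; the file name, column names, and default values are all illustrative:

def csv_input_fn(filename, batch_size=128):
    def _parse_line(line):
        # Decode one CSV line; record_defaults fixes the column dtypes.
        f1, f2, label = tf.decode_csv(line, record_defaults=[[0.0], [0.0], [0]])
        features = {"feature1": f1, "feature2": f2}  # keys match the feature columns
        return features, label

    dataset = tf.data.TextLineDataset(filename).skip(1)  # skip the header row, if any
    dataset = dataset.map(_parse_line)
    dataset = dataset.shuffle(buffer_size=1000).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()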

The link for Part 2 is here.

Thank you for reading the post. If you liked it, please give some claps.

