Beginner’s guide to feeding data in TensorFlow: Part 1
Hi, this series is about feeding data (images and numeric features) to your TensorFlow models. After reading all the posts, you will be able to:
- Feed in-memory data to your model.
- Feed data in the TFRecords format to your model.
- Feed raw images on disk to your model.
In this part I will focus on feeding in-memory data.
This post assumes the following on the reader’s side: a basic working knowledge of neural nets, and the basics of TensorFlow or of model building in Keras.
The code for this post is available here.
Let’s get started.
INSTALLATION
pip install tensorflow-gpu keras
or, if you don’t have a GPU, install the CPU-only version of TensorFlow:
pip install tensorflow keras
DATASET
We will be using the popular MNIST dataset. It contains images of the digits 0–9. This dataset is generally available as NumPy arrays, so I will also provide the code to convert the data to the TFRecords format and to raw images on disk.
Using the Estimator API
TensorFlow provides the high-level Estimator API for defining your model with minimal effort. We will use this API to build a simple model with 2 hidden layers of 100 neurons each and an output layer of 10, since the dataset has 10 classes.
When you build a model with the Estimator API, it expects a list of the dataset’s feature columns that will be fed during training. Since our images are just arrays of numbers, we provide it a single feature column of numeric type.
feature_column = [tf.feature_column.numeric_column(key='image', shape=(784,))]
Here key is the name we wish to give to that feature column. Note that the same key must be passed to the model when you feed the data.
For a more detailed list of feature column types, see TensorFlow’s documentation.
Now, to create the model, we will use a pre-made estimator: specifically DNNClassifier, a deep neural net classifier in which you can add as many dense layers as you want.
## defining the model
model = tf.estimator.DNNClassifier([100, 100], n_classes=10, feature_columns=feature_column)
Now that we have defined our model, let’s define the data flow.
- Working with NumPy arrays
import numpy as np
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# since the model expects a single feature vector of size 784,
# convert each image from (28, 28) to 784
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)
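If the reshape step looks opaque, here is a quick sanity check on a synthetic array (a hypothetical stand-in for the MNIST arrays):

```python
import numpy as np

# A synthetic stand-in for MNIST: 5 images of 28x28 pixels.
fake_images = np.zeros((5, 28, 28), dtype=np.uint8)

# reshape(-1, 784) keeps the first axis (the number of images) and
# flattens each 28x28 image into a single 784-long feature vector.
flat = fake_images.reshape(-1, 784)
print(flat.shape)  # (5, 784)
```

The -1 tells NumPy to infer the first dimension from the total number of elements, so the same line works for both the 60,000-image training set and the 10,000-image test set.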
- Since we have the data as NumPy arrays here, the simplest way is to pass the data to the model directly, right? Correct. TensorFlow provides this feature under tf.estimator.inputs. Let’s pass the data and train the model for 101 steps. Here, steps means the number of mini-batches seen by the model; the default batch size is 128.
model.train(input_fn=tf.estimator.inputs.numpy_input_fn(
    dict({'image': x_train}), np.array(y_train, np.int32),
    shuffle=True), steps=101)
Training progress (the loss at each logged step) is printed to the console.
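To make the relationship between steps, batch size, and epochs concrete, here is the back-of-the-envelope arithmetic (using the MNIST training-set size of 60,000):

```python
# With numpy_input_fn's default batch size of 128, each step
# consumes one mini-batch of 128 examples.
train_examples = 60000   # size of the MNIST training set
batch_size = 128         # numpy_input_fn default
steps = 101

examples_seen = steps * batch_size             # examples consumed in 101 steps
steps_per_epoch = train_examples / batch_size  # steps for one full pass
print(examples_seen, steps_per_epoch)  # 12928 468.75
```

So 101 steps covers well under a quarter of one epoch; this short run is only meant to verify that the feeding pipeline works.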
This works fine. But what if you want to do some preprocessing of the data before feeding it to the model? Here the Dataset API comes to the rescue.
As you saw in the line of code above, model.train expects the data to be fed from an input function. That input function is also the place to hook in a preprocessing function, so let’s write one that converts the NumPy arrays to tensors and changes the data type to float32, since the weights of the dense layers have dtype float32.
NOTE: As a rule of thumb, always convert your data to tensors (float32, int32) first; otherwise you will get weird errors that cause a lot of problems.
'''parse function. It does the preprocessing of the data, e.g. reshaping,
converting NumPy arrays to tensors, one-hot encoding, etc.'''
def _parse_and_preprocess(x, y):
    x = tf.cast(x, tf.float32)  # cast to float32 as the weights are float32
    y = tf.cast(y, tf.int32)    # cast to a tensor of int32
    # return a tuple of (dict of features keyed as in the feature column, label)
    return (dict({'image': x}), y)
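To make the return structure concrete, here is a pure-NumPy analogue of the parse function above (for illustration only; the real version works on tensors inside the TensorFlow graph):

```python
import numpy as np

# A pure-NumPy analogue of the parse function: it returns the same
# (features_dict, label) tuple, with the dict key matching the
# feature column's key and the dtypes cast to float32/int32.
def parse_and_preprocess_np(x, y):
    x = x.astype(np.float32)  # match the dtype of the dense-layer weights
    y = np.int32(y)
    return ({'image': x}, y)

features, label = parse_and_preprocess_np(np.zeros(784), 7)
print(features['image'].dtype, label)  # float32 7
```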
Now write the input function that would return a generator to fetch the next batch of data.
## define the function that feeds the data to the model
def train_input_fn(x_train, y_train, batch_size=64):
    ## Here we are using the Dataset API.
    '''take the data from tensor slices, i.e. an array of
    data points in simple words'''
    dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    dataset = dataset.map(lambda x, y: _parse_and_preprocess(x, y)) \
                     .shuffle(buffer_size=128) \
                     .batch(batch_size)
    dataset_iterator = dataset.make_one_shot_iterator()
    return dataset_iterator.get_next()
Here we are using the Dataset API (tf.data.Dataset). Since we have the data as arrays, we can call the from_tensor_slices method and pass in the data.
I know that the name of the method is ambiguous!
The from_tensor_slices method also takes a list of filenames stored on disk. I will explain this in the next post.
If you have a dataset that contains various numeric and categorical features, then you should use the TextLineDataset method of the Dataset API or pandas_input_fn from tf.estimator.inputs, but I would highly recommend the former.
The incoming data is now passed to the _parse_and_preprocess function that we wrote, which returns a tuple of 2 elements: the first is a dictionary with the key and the image data as a 784-length tensor of dtype tf.float32, and the second is the label of dtype tf.int32.
Now, as the names suggest, the .shuffle() and .batch() methods shuffle and batch the data respectively.
To iterate over these batches of data points, we call make_one_shot_iterator, which returns the iterator, a.k.a. a generator.
The .get_next() call simply returns the next batch of data.
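For intuition only, here is a pure-Python sketch of what the shuffle/batch/get_next chain does. (This is a simplification: the real tf.data version streams through a sliding shuffle buffer rather than shuffling the whole list, and runs inside the TensorFlow graph.)

```python
import random

# A plain-Python analogue of the tf.data pipeline above.
def batches(xs, ys, batch_size=64, seed=0):
    pairs = list(zip(xs, ys))
    random.Random(seed).shuffle(pairs)      # like .shuffle(buffer_size=128)
    for i in range(0, len(pairs), batch_size):
        chunk = pairs[i:i + batch_size]     # like .batch(batch_size)
        yield [x for x, _ in chunk], [y for _, y in chunk]

it = batches(list(range(10)), list(range(10)), batch_size=4)
xb, yb = next(it)                           # like iterator.get_next()
print(len(xb))  # 4
```

Note that the last batch can be smaller than batch_size; tf.data’s .batch() behaves the same way unless you ask it to drop the remainder.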
Now it’s time to train our model.
NOTE: the input_fn in estimator.train() doesn’t accept a function that takes parameters. The workaround is to use Python’s lambda feature.
import time

t1 = time.time()
model.train(input_fn=lambda: train_input_fn(x_train, y_train, 64), steps=150)
t2 = time.time()
print('time taken ---- \t {}'.format(t2 - t1))
Conclusion:
The above code works for image data in NumPy array format. To generalize, if instead of just image data you have data with various features, including numeric, ordinal, and nominal types:
- Just pass the feature names in a list while creating the model.
- Use the TextLineDataset method from the tf.data.Dataset API instead of from_tensor_slices in train_input_fn().
- Map the dataset to an appropriate preprocessing function that still returns a tuple containing the features (with the appropriate keys in a dict) and a label.
- Then shuffle and batch according to your requirements, and finally make an iterator and return the next batch.
Link for part2 is here.
Thank you for reading the post. If you liked it, please give some claps.