Train a tf.keras model in TensorFlow 2.0 using feature columns

Published in

ML Book

5 min read · Aug 28, 2019

In this tutorial, we will see how to use a tf.keras model to classify structured data (a pandas dataframe) by building an input pipeline with feature columns (tf.feature_column) and tf.data.

You will learn:

Creating different types of feature columns using tf.feature_column
Creating an input data function using tf.data for the train, val and test sets
Creating, compiling and training a tf.keras model
Evaluating model
Prediction on test data

The Dataset

I have used the Titanic: Machine Learning from Disaster dataset from Kaggle, where you can download it and find its description. I worked in Google Colab, so I uploaded the data to Google Drive.

Mount Google Drive

I have uploaded the data to Google Drive. Learn how to use data from Google Drive here.

Import TensorFlow and other libraries

I have used the TensorFlow nightly build, which was an unstable version at the time of writing (August 2019).
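The import cell itself is not shown in the text. A plausible set of imports, inferred from the names used in the snippets that follow (pd, tf, feature_column, train_test_split), might look like this. The exact list in the original notebook is an assumption.

```python
# Assumed imports, inferred from the names used later in this tutorial.
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from sklearn.model_selection import train_test_split

print(tf.__version__)  # confirm which TensorFlow build is active
```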

Load and preprocess Data

Use Pandas to create a dataframe

Pandas is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to read the dataset from the mounted Google Drive and load it into a dataframe.

data = pd.read_csv('drive/My Drive/collab data/titanic/train.csv')
data.head(5)

data.shape
>> (891, 12)

Missing Data

Check missing values

data.isnull().sum()

Missing value handling

As you can see, there are missing values in 'Age', 'Embarked' and 'Cabin'. 'Cabin' has a large number of missing values, so we drop that column from the data. In 'Age' we fill the missing values with the mean, and in 'Embarked' with the most frequent value.

mean_value = round(data['Age'].mean())
mode_value = data['Embarked'].mode()[0]
value = {'Age': mean_value, 'Embarked': mode_value}
data.fillna(value=value, inplace=True)
# Drop every column that still contains missing values (only 'Cabin' at this point)
data.dropna(axis=1, inplace=True)

Explore data with the pandas_profiling library

import pandas_profiling as pdpf
pdpf.ProfileReport(data)

Run this code for a statistical and visual analysis of your data.

Train, val, test Split

We will split the data into train, validation and test sets in a 3:1:1 ratio.

train, test = train_test_split(data, test_size=0.2)
train, val = train_test_split(train, test_size=0.25)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
>> 534 train examples
   178 validation examples
   179 test examples
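The second test_size of 0.25 can look surprising next to the 3:1:1 ratio. It applies to the 80% that remains after the first split, and 0.25 × 0.8 = 0.2 of the full data. The arithmetic can be checked with a small sketch that does not call scikit-learn. It assumes the held-out partition is rounded up at each split, which is consistent with the counts printed above. The function name split_sizes is hypothetical.

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.25):
    """Return (train, val, test) counts for two successive splits."""
    n_test = math.ceil(n_rows * test_size)    # first split: hold out the test set
    remaining = n_rows - n_test
    n_val = math.ceil(remaining * val_size)   # second split: validation from the rest
    n_train = remaining - n_val
    return n_train, n_val, n_test

sizes = split_sizes(891)
print(sizes)
```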

Input pipeline

Create an input pipeline using tf.data

Next, we will wrap the dataframes with tf.data. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.

batch_size = 32
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

Understand the input pipeline

Now that we have created the input pipeline, let’s call it to see the format of the data it returns. We have used a small batch size to keep the output readable.

After running this code you can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.
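Neither the df_to_dataset helper nor the inspection loop appears in the text above. A minimal sketch of both follows, assuming the label column is 'Survived' (the target in the Kaggle Titanic data) and using a tiny made-up dataframe in place of the real one. The original notebook's helper may differ in detail.

```python
import pandas as pd
import tensorflow as tf

def df_to_dataset(dataframe, shuffle=True, batch_size=32, label="Survived"):
    """Wrap a dataframe in a tf.data.Dataset of (features dict, label) batches."""
    dataframe = dataframe.copy()     # leave the caller's dataframe untouched
    labels = dataframe.pop(label)    # assumed label column
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    return ds.batch(batch_size)

# Toy rows standing in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})
toy_ds = df_to_dataset(toy, shuffle=False, batch_size=2)

# Look at one batch: a dict keyed by column name, plus the matching labels.
for feature_batch, label_batch in toy_ds.take(1):
    print("Features:", list(feature_batch.keys()))
    print("A batch of ages:", feature_batch["Age"].numpy())
    print("A batch of labels:", label_batch.numpy())
```

Copying the dataframe before popping the label keeps the original train, val and test frames intact, so they can still be used later, for example to compute scaling statistics.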

Feature columns

Learn more about feature columns here.

Decide which types of features you have in the data

During data exploration, note the type of each feature, that is, whether it is numerical or categorical. If it is numerical, can it be grouped into buckets? If it is categorical, how many categories does it have, and should it become an indicator column or an embedding column? Are there two features that can be combined to create a new crossed feature? I recommend reading this very simplified tutorial on feature columns.

# numerical features
num_c = ['Age', 'Fare', 'Parch', 'SibSp']
bucket_c = ['Age']  # bucketized numerical feature
# categorical features
cat_i_c = ['Embarked', 'Pclass', 'Sex']  # indicator columns
cat_e_c = ['Ticket']  # embedding column

Scaler function

It is very important to scale numerical variables. Here I have used min-max scaling. We create a function named 'get_scal' that takes a list of numerical features and returns a 'minmax' function, which is passed to tf.feature_column.numeric_column() as the normalizer_fn parameter. The 'minmax' function itself takes a numerical value from a particular feature and returns the scaled value of that number.
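The scaler code is not shown in the text. The closure pattern it describes can be sketched in plain Python as follows. This version assumes get_scal receives the observed values of a single feature, which may differ from the original signature, and the sample values are illustrative only.

```python
def get_scal(values):
    """Return a min-max scaler fitted on the observed values of one feature."""
    lo, hi = min(values), max(values)

    def minmax(x):
        # Map lo -> 0.0 and hi -> 1.0; only - and / are used, so x may also be a tensor.
        return (x - lo) / (hi - lo)

    return minmax

fares = [0.0, 8.0, 16.0, 32.0]  # illustrative values, not taken from the dataset
scale_fare = get_scal(fares)
print([scale_fare(v) for v in fares])
```

With the real data, the values would typically come from the training split (for example train['Fare']), and the returned function would be passed as normalizer_fn when building the numeric column. Fitting the minimum and maximum on the training split only keeps validation and test statistics out of the preprocessing.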

Creating feature columns

Numerical Columns

Bucketized columns

Age = feature_column.numeric_column("Age")
# bucketized cols
age_buckets = feature_column.bucketized_column(Age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
feature_columns.append(age_buckets)
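To make the effect of the boundaries concrete: ten boundaries produce eleven buckets, (-inf, 18), [18, 25), and so on up to [65, +inf), and each age is represented as a one-hot vector over them. The following plain-Python sketch mimics that mapping without TensorFlow. The helper names bucket_index and one_hot are hypothetical.

```python
from bisect import bisect_right

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucket_index(age):
    """Index of the bucket holding age; each bucket includes its left boundary."""
    return bisect_right(boundaries, age)

def one_hot(index, size=len(boundaries) + 1):
    """One-hot vector with a 1 at the bucket position."""
    return [1 if i == index else 0 for i in range(size)]

for age in (17, 18, 29.5, 70):
    print(age, bucket_index(age), one_hot(bucket_index(age)))
```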

Categorical Indicator columns

Categorical Embedding columns

Crossed columns

Combination of 'Age' (age buckets) and 'Sex'

vocabulary = data['Sex'].unique()
Sex = tf.feature_column.categorical_column_with_vocabulary_list('Sex', vocabulary)
crossed_feature = feature_column.crossed_column([age_buckets, Sex], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)
print('Total number of feature columns: ', len(feature_columns))
>> Total number of feature columns:  10
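A crossed column does not enumerate every combination; it hashes each combination into one of hash_bucket_size slots. The sketch below illustrates the idea in plain Python, using MD5 as a stand-in hash. TensorFlow uses its own fingerprint function, so the actual slot numbers will differ, and crossed_index is a hypothetical helper.

```python
import hashlib

HASH_BUCKET_SIZE = 1000  # same value as hash_bucket_size above

def crossed_index(age_bucket, sex):
    """Hash an (age bucket, sex) pair into one of HASH_BUCKET_SIZE slots."""
    key = f"{age_bucket}_X_{sex}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()  # stand-in hash, not TensorFlow's own
    return int(digest, 16) % HASH_BUCKET_SIZE

# 11 age buckets x 2 sexes = 22 possible combinations.
pairs = [(b, s) for b in range(11) for s in ("male", "female")]
slots = {pair: crossed_index(*pair) for pair in pairs}
print(len(pairs), "combinations mapped into", HASH_BUCKET_SIZE, "slots")
print("distinct slots used:", len(set(slots.values())))
```

Since only 22 combinations exist here, most of the 1000-wide indicator vector is never active. A smaller hash_bucket_size would shrink the input at the cost of a higher chance that two combinations collide in the same slot.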

Create, compile and train the model

Create a feature layer

Now that we have defined our feature columns, we will use a DenseFeatures layer to input them to our Keras model.

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

tf.keras

Evaluation

loss, accuracy = model.evaluate(test_ds)
print("Accuracy: ", accuracy)
>> 6/6 [==============================] - 0s 70ms/step - loss: 0.6719 - accuracy: 0.7877
Accuracy:  0.7877095