Train a Linear Model and a Boosted Tree Model in TensorFlow 2.0 Using Feature Columns

By Siddhartha, published in ML Book, Aug 30, 2019

In this tutorial, we will use tf.estimator.LinearClassifier and tf.estimator.BoostedTreesClassifier to classify structured data (a pandas dataframe), building the input pipeline with feature columns (tf.feature_column) and tf.data.

You will learn:

  • Creating different types of feature columns using tf.feature_column
  • Creating an input data function using tf.data for the train, validation and test sets
  • Creating and training a tf.estimator model
  • Evaluating the model
  • Predicting on test data

The Dataset

I have used the Titanic: Machine Learning from Disaster dataset from Kaggle, where you can download it and find its description. I worked in Google Colab, so I uploaded the data to Google Drive.

Mount Google Drive

I have uploaded the data to Google Drive. Learn how to use data from Google Drive here

Import TensorFlow and other libraries

I have used the TensorFlow nightly build, which was an unstable version at the time of writing (Aug 2019).

Load and preprocess Data

Use Pandas to create a dataframe

Pandas is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to read the dataset from the mounted Google Drive and load it into a dataframe.

data = pd.read_csv('drive/My Drive/collab data/titanic/train.csv')
data.head(5)

Missing Data

Check missing values

data.isnull().sum()

Missing value handling

As you can see, there are missing values in 'Age', 'Embarked' and 'Cabin'. 'Cabin' has a large number of missing values, so we drop that column. We fill the missing 'Age' values with the mean and the missing 'Embarked' values with the most frequent value.

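mean_value = round(data['Age'].mean())
mode_value = data['Embarked'].mode()[0]
value = {'Age': mean_value, 'Embarked': mode_value}
# Fill 'Age' and 'Embarked'; 'Cabin' is left untouched
data.fillna(value=value, inplace=True)
# Drop every column that still has missing values, which is only 'Cabin'
data.dropna(axis=1, inplace=True)
data.shape
>> (891, 11)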

Explore data with pandas_profiling library

import pandas_profiling as pdpf
pdpf.ProfileReport(data)

Train, val, test Split

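We will split the data into train, validation and test sets in a 3:1:1 ratio: first hold out 20% of the rows for testing, then hold out 25% of the remaining rows for validation.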

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2)
train, val = train_test_split(train, test_size=0.25)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
>> 534 train examples
178 validation examples
179 test examples
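The 3:1:1 ratio follows from the two fractions: 20% of all rows go to test, and 25% of the remaining 80% is another 20% of the total. The sketch below reproduces the counts printed above with plain arithmetic. The helper name split_sizes is hypothetical, and the rounding rule (the held-out share is rounded up, the rest stays in the training side) is an assumption that happens to match the printed output.

```python
import math


def split_sizes(n_rows, test_fraction=0.2, val_fraction=0.25):
    """Sizes of the train/val/test sets for a two-step split.

    Assumption: each held-out share is rounded up and the remainder
    stays on the training side.
    """
    n_test = math.ceil(test_fraction * n_rows)
    n_rest = n_rows - n_test
    n_val = math.ceil(val_fraction * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


sizes = split_sizes(891)
print(sizes)  # (534, 178, 179)
```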

Input pipeline

Feature columns

Learn more about feature columns here

Decide which types of features your data has

During data exploration, note the type of each feature. Is it numerical or categorical? If it is numerical, can it be grouped into buckets? If it is categorical, how many categories does it have, and should it become an indicator column or an embedding column? Are there two features that can be combined into a new crossed feature? I recommend reading this simplified tutorial on feature columns.
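One way to gather these facts is to tabulate the dtype, the number of distinct values and the number of missing values for each column. The following is a minimal sketch, not part of the original notebook: the helper name summarize_columns and the toy dataframe are made up for illustration.

```python
import pandas as pd


def summarize_columns(frame):
    """Return dtype, distinct-value count and missing count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
        "n_missing": frame.isnull().sum(),
    })


# Toy rows for illustration only; in the tutorial you would pass `data`.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"],
})
summary = summarize_columns(toy)
print(summary)
```

As a rule of thumb, a categorical column with only a few distinct values is a candidate for an indicator column, while one with many distinct values, such as 'Ticket', is a candidate for an embedding column.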

num_c = ['Age', 'Fare', 'Parch', 'SibSp']
bucket_c = ['Age']
cat_i_c = ['Embarked', 'Pclass', 'Sex']
cat_e_c = ['Ticket']
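To make the bucket and indicator ideas concrete, here is a plain-Python sketch of what those two transformations compute for a single value. It does not use the TensorFlow API. The helper names and the age boundaries are hypothetical; the bucket convention (left boundary included, right boundary excluded) follows the one documented for tf.feature_column.bucketized_column.

```python
from bisect import bisect_right


def bucket_index(value, boundaries):
    """Index of the bucket that `value` falls into.

    With boundaries [b0, b1, ...] the buckets are
    (-inf, b0), [b0, b1), ..., [b_last, +inf).
    """
    return bisect_right(boundaries, value)


def one_hot(value, vocabulary):
    """Indicator vector with a 1 at the position of `value`."""
    return [1 if value == item else 0 for item in vocabulary]


age_boundaries = [18, 40, 60]     # hypothetical cut points
embarked_vocab = ["C", "Q", "S"]  # the three embarkation ports

print(bucket_index(29, age_boundaries))  # 1
print(one_hot("Q", embarked_vocab))      # [0, 1, 0]
```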

Scaler function

Numerical variables should be scaled, and here I use min-max scaling. We create a function named 'get_scal' that takes the list of numerical features and returns a 'minmax' function, which is passed to tf.feature_column.numeric_column() as the normalizer_fn parameter. The 'minmax' function takes a value of a particular feature and returns the scaled value.
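The scaler is a closure: the outer function captures the minimum and maximum, and the inner function applies them. Below is a minimal sketch of that pattern, not the original notebook code. As a simplifying assumption, this get_scal receives the observed values of a single feature (for example, the training values of 'Age') rather than a list of feature names.

```python
def get_scal(values):
    """Build a min-max scaler from the observed values of one feature."""
    low, high = float(min(values)), float(max(values))
    if high == low:
        raise ValueError("feature is constant; min-max scaling is undefined")

    def minmax(x):
        # Map `low` to 0.0 and `high` to 1.0.
        return (x - low) / (high - low)

    return minmax


scale_fare = get_scal([0.0, 10.0, 40.0])  # toy values for illustration
print(scale_fare(10.0))  # 0.25
```

Computing the minimum and maximum from the training set only keeps information from the validation and test sets out of the features.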

Creating feature columns

Create an input data function using tf.data

Next, we will wrap the dataframes with tf.data. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.

Here we will define a make_input_fn that returns an input_function for a given dataset. The input_function specifies how the data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. A tf.data.Dataset can take in multiple sources, such as a dataframe, a CSV-formatted file, and more. With tf.estimator we pass an input function to the model instead of the data itself, whereas tf.keras can consume the tf.data.Dataset directly.
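A make_input_fn along these lines could look like the sketch below. It is an approximation under stated assumptions, not the original notebook code: the label column is assumed to be 'Survived' (the Kaggle target), the default batch size and epoch count are arbitrary, and the toy dataframe stands in for the train split. It relies on TensorFlow 2.x eager execution to pull one batch for inspection.

```python
import pandas as pd
import tensorflow as tf


def make_input_fn(frame, label_key="Survived", num_epochs=1,
                  shuffle=True, batch_size=32):
    """Return a zero-argument function that builds a tf.data.Dataset."""
    features = frame.copy()
    labels = features.pop(label_key)
    # Convert each column to a NumPy array keyed by column name.
    feature_dict = {name: col.to_numpy() for name, col in features.items()}

    def input_function():
        dataset = tf.data.Dataset.from_tensor_slices(
            (feature_dict, labels.to_numpy()))
        if shuffle:
            dataset = dataset.shuffle(buffer_size=len(features))
        return dataset.repeat(num_epochs).batch(batch_size)

    return input_function


# Toy rows for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1],
})
train_input_fn = make_input_fn(toy, num_epochs=2, batch_size=4)
feature_batch, label_batch = next(iter(train_input_fn()))
print(sorted(feature_batch.keys()), label_batch.shape)
```

Returning a function rather than a dataset matters because the estimator calls the input function itself, so the dataset is built inside the graph the estimator manages.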

Train linear model

After adding all the base features to the model, let’s train the model. Training a model is just a single command using the tf.estimator API:

print(pd.Series(result))

Train boosted tree model

The TensorFlow boosted tree model does not support embedding columns (as of Aug 2019), so we create the list of feature columns without the embedding column.

feature_columns1 = list(set(feature_columns)-set([embeding]))
feature_columns1

Find the full tutorial on Google Colab or GitHub.
