Train a linear model and a boosted tree model in TensorFlow 2.0 using feature columns
In this tutorial, we will see how to use the tf.estimator.LinearClassifier and tf.estimator.BoostedTreesClassifier models to classify structured data (a pandas dataframe), creating an input pipeline with feature columns (tf.feature_column) and tf.data.
You will learn:
- Creating different types of feature columns using tf.feature_column
- Creating input data functions using tf.data for the train, validation and test sets
- Creating and training a tf.estimator model
- Evaluating the model
- Making predictions on test data
The Dataset
I have used the Titanic: Machine Learning from Disaster dataset from Kaggle; you can download it and find its description there. I have used Google Colab, and hence uploaded the data to Google Drive.
Mount Google Drive
I have uploaded the data to Google Drive. Learn how to use data from Google Drive here.
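As a quick sketch using the standard Colab API, the drive can be mounted like this; with the mount point below, files appear under drive/My Drive relative to the Colab working directory:
from google.colab import drive

# Mount Google Drive so the uploaded CSV is accessible under drive/My Drive
drive.mount('/content/drive')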
Import TensorFlow and other libraries
I have used the TensorFlow nightly build, which is an unstable version (as of Aug 2019).
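A minimal sketch of the imports used below (the nightly build itself can be installed with pip install tf-nightly):
# !pip install tf-nightly
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split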
Load and preprocess Data
Use Pandas to create a dataframe
Pandas is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas to read the dataset from the mounted Google Drive and load it into a dataframe.
data = pd.read_csv('drive/My Drive/collab data/titanic/train.csv')
data.head(5)
Missing Data
Check missing values
data.isnull().sum()
Missing value handling
As you can see, there are some missing values in 'Age', 'Embarked' and 'Cabin'. The number of missing values in 'Cabin' is large, so we delete this column from the data; in 'Age' we fill missing values with the mean, and in 'Embarked' with the most frequent value.
mean_value = round(data['Age'].mean())
mode_value = data['Embarked'].mode()[0]
value = {'Age': mean_value, 'Embarked': mode_value}
data.fillna(value=value, inplace=True)
data.dropna(axis=1, inplace=True)
data.shape
>> (891, 11)
Explore data with the pandas_profiling library
import pandas_profiling as pdpf
pdpf.ProfileReport(data)
Train, val, test Split
We will divide the data into train, validation and test sets in a 3:1:1 ratio.
train, test = train_test_split(data, test_size=0.2)
train, val = train_test_split(train, test_size=0.25)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
>> 534 train examples
178 validation examples
179 test examples
Input pipeline
Feature columns
Learn more about feature columns here
Decide which types of features you have in the data
During data exploration you should note the type of each feature: whether it is numerical or categorical; if numerical, whether it can be grouped into buckets; if categorical, how many categories it has and whether it can be converted into an indicator column or an embedding column; and whether any two features can be combined to create a new crossed feature. I recommend reading this very simple tutorial on feature columns.
num_c = ['Age', 'Fare', 'Parch', 'SibSp']
bucket_c = ['Age']
cat_i_c = ['Embarked', 'Pclass', 'Sex']
cat_e_c = ['Ticket']
Scaler function
It is very important to scale numerical variables; here I have used min-max scaling. We create a function named 'get_scal' that takes a numerical feature and returns a 'minmax' function, which is passed to tf.feature_column.numeric_column() as the normalizer_fn parameter. The 'minmax' function itself takes a raw value of that feature and returns its scaled version.
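The code for this step is not shown in the post; here is a plausible sketch, assuming the min and max statistics are computed from the training dataframe created above:
def get_scal(feature):
    # Compute scaling statistics for this feature from the training set
    mini = train[feature].min()
    maxi = train[feature].max()

    def minmax(x):
        # Map a raw value of this feature into the [0, 1] range
        return (x - mini) / (maxi - mini)

    return minmax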
Creating feature columns
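The original code for this step is likewise not shown; the sketch below builds all four kinds of columns from the lists defined earlier (the Age bucket boundaries and the embedding dimension of 8 are illustrative assumptions):
feature_columns = []

# Scaled numeric columns, using the minmax normalizer from get_scal
for header in num_c:
    feature_columns.append(
        tf.feature_column.numeric_column(header, normalizer_fn=get_scal(header)))

# Bucketized column for 'Age' (from bucket_c; boundaries are an assumption)
age = tf.feature_column.numeric_column('Age')
feature_columns.append(tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]))

# Indicator (one-hot) columns for the low-cardinality categorical features
for feature_name in cat_i_c:
    vocabulary = data[feature_name].unique()
    cat_c = tf.feature_column.categorical_column_with_vocabulary_list(
        feature_name, vocabulary)
    feature_columns.append(tf.feature_column.indicator_column(cat_c))

# Embedding column for the high-cardinality 'Ticket' feature
for feature_name in cat_e_c:
    vocabulary = data[feature_name].unique()
    cat_c = tf.feature_column.categorical_column_with_vocabulary_list(
        feature_name, vocabulary)
    embeding = tf.feature_column.embedding_column(cat_c, dimension=8)
    feature_columns.append(embeding)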
Create an input data function using tf.data
Next, we will wrap the dataframes with tf.data. This will enable us to use feature columns as a bridge to map from the columns in the Pandas dataframe to features used to train the model. If we were working with a very large CSV file (so large that it does not fit into memory), we would use tf.data to read it from disk directly. That is not covered in this tutorial.
Here we will define make_input_fn, which returns an input_function for the data. The input_function specifies how data is converted to a tf.data.Dataset that feeds the input pipeline in a streaming fashion. tf.data.Dataset can take in multiple sources such as a dataframe, a CSV-formatted file, and more. With tf.estimator we provide the model an input function instead of the data itself, whereas with tf.keras we can feed input data directly.
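A sketch along the lines of the official estimator tutorials (the batch size, shuffle buffer and epoch counts are assumptions):
def make_input_fn(data_df, label_df, num_epochs=10, shuffle=True, batch_size=32):
    def input_function():
        # Stream (features, label) pairs from the dataframe
        ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
        if shuffle:
            ds = ds.shuffle(1000)
        return ds.batch(batch_size).repeat(num_epochs)
    return input_function

# Separate the label column, then build one input function per split
y_train, y_val, y_test = train.pop('Survived'), val.pop('Survived'), test.pop('Survived')
train_input_fn = make_input_fn(train, y_train)
val_input_fn = make_input_fn(val, y_val, num_epochs=1, shuffle=False)
test_input_fn = make_input_fn(test, y_test, num_epochs=1, shuffle=False)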
Train linear model
After adding all the base features to the model, let's train it. Training a model is just a single command using the tf.estimator API:
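A sketch of that command, using the feature columns and input functions defined above:
# Build, train and evaluate the linear classifier
linear_est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
linear_est.train(train_input_fn)
result = linear_est.evaluate(val_input_fn)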
print(pd.Series(result))
Train boosted tree model
The TensorFlow boosted tree model does not support embedding columns (as of Aug 2019), hence we create the feature columns without the embedding column:
feature_columns1 = list(set(feature_columns)-set([embeding]))
feature_columns1
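A sketch of training, evaluating and predicting with the boosted tree model (n_batches_per_layer and max_steps are assumptions):
# n_batches_per_layer=1 means each boosting layer sees the whole dataset at once
est = tf.estimator.BoostedTreesClassifier(feature_columns1, n_batches_per_layer=1)
est.train(train_input_fn, max_steps=100)
result = est.evaluate(val_input_fn)
print(pd.Series(result))

# Prediction on test data: probability of the positive (Survived) class
pred_dicts = list(est.predict(test_input_fn))
probs = pd.Series([pred['probabilities'][1] for pred in pred_dicts])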