DatRet: Tensorflow implementation for structured tabular data

Timur Abdualimov
6 min read · Jan 22, 2023


My open-source project


A simple implementation of a deep neural network architecture for tabular data, with automatic layer generation and a layer-by-layer halving of the number of neurons, used in the same way as a classic machine learning method.

This article discusses the motivation for the library, walks through a short tutorial, and compares the prediction accuracy of DatRetClassifier and DatRetRegressor with classical machine learning methods.

Introduction

Classical machine learning methods are most often used to make predictions on tabular data, most commonly through scikit-learn. One of the advantages of this library is its ease of use: prepare the data, call fit and predict, done.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 0, 0, 0]]))

Using neural networks, in particular the Tensorflow or PyTorch libraries, involves building the architecture of a neural network model before training and predicting, which requires a higher entry threshold.
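For comparison, even a minimal Keras model for the same data requires explicitly defining the architecture, compiling it, and only then training (a rough sketch reusing the X, y from the scikit-learn example above; the layer sizes here are arbitrary):

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

# Define the architecture by hand (sizes chosen arbitrarily for illustration)
inputs = Input(shape=(4,))
x = Dense(16, activation="relu")(inputs)
outputs = Dense(1, activation="sigmoid")(x)
model = Model(inputs, outputs)

# Compile and train: optimizer and loss are choices the user has to make
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(X, y, epochs=10, verbose=0)
print(model.predict(X[:1]))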

Many ready-made neural network architectures have been implemented for working with images, text, and sound. Far fewer exist for tabular data; TabNet is one example.

The main goal of creating DatRet was to lower the entry threshold for working with neural networks. Training and prediction are implemented just as in classical methods such as RandomForestClassifier or CatBoostClassifier. To do this, I created automatic generation of the neural network architecture, based on the number of neurons selected for the first fully connected layer. The second goal was to approach the classical methods in the accuracy of predicting structured tabular data.

The model has three classes:

  • DatRetClassifier for classification tasks.
  • DatRetRegressor for regression problems.
  • DatRetMultilabelClassifier for “multilabel” classification.

Advantages

  • simplicity and ease of use: fit and predict, et voilà!
  • automatic generation of neural network architecture
  • quick adjustment of model parameters
  • GPU support
  • high prediction accuracy
  • support for multilabel classification
  • Tensorflow under the hood ;)

Where to get it?

The source code is currently hosted on GitHub: https://github.com/AbdualimovTP/datret

Binary installers for the latest released version are available at the Python Package Index (PyPI).

# PyPI
pip install datret

Dependencies

  • Tensorflow
  • NumPy
  • Pandas
  • Scikit-Learn

Quick start

Training and prediction of the model are implemented as in scikit-learn: prepare your train and test sets and run fit. Automatic data normalization for neural networks is supported.

NB! Don’t forget to install the dependencies before using the model. You will need Tensorflow, Numpy, Pandas and Scikit-Learn installed.

NB! No need to one-hot encode the predicted classes. The model will do it automatically.
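For example, plain integer class labels can be passed to fit directly; internally this amounts to something like the following (my assumption of the mechanism, shown for illustration only):

# Sketch: what automatic one-hot encoding of the labels means (assumed mechanism)
import numpy as np
import tensorflow as tf

y = np.array([0, 2, 1, 2])                   # plain integer class labels
y_onehot = tf.keras.utils.to_categorical(y)  # [[1,0,0], [0,0,1], [0,1,0], [0,0,1]]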

# load libraries
from sklearn.model_selection import train_test_split
from datret.datret import DatRetClassifier, DatRetRegressor, DatRetMultilabelClassifier

# prepare train, test split. As in sklearn.
# for example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Call the regressor or classifier and train the model.
DR = DatRetClassifier() # DatRetRegressor works on the same principle
DR.fit(X_train, y_train)
# predict the actual label (or class) over a new set of data.
DR_predict = DR.predict(X_test)
# predict the class probabilities for each data point.
DR_predict_proba = DR.predict_proba(X_test) # not available in DatRetRegressor and DatRetMultilabelClassifier

Custom model options

Parameters:

  • epoch: int, default = 30. Number of epochs to train the model.
  • optimizer: string (name of optimizer) or optimizer instance. See tf.keras.optimizers, default = Adam(learning_rate=0.001). On DatRetRegressor the default learning rate is 0.01. Built-in Tensorflow optimizer classes.
  • loss: loss function. May be a string (name of loss function). See tf.keras.losses; default for DatRetClassifier = CategoricalCrossentropy(), for DatRetRegressor = MeanSquaredError(). Built-in loss functions.
  • verbose: ‘auto’, 0, 1, or 2, default = 0. Verbosity mode. 0 = silent, 1 = progress bar, 2 = one line per epoch. ‘auto’ defaults to 1 for most cases, but 2 when used with ParameterServerStrategy.
  • number_neurons: int, default = 500. The number of neurons in the first fully connected layer. Subsequent layers are generated automatically with half as many neurons each.
  • validation_split: float between 0 and 1, default = 0. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
  • batch_size: int, default = 1. Number of samples per gradient update. steps_per_epoch is calculated automatically as X_train.shape[0] // batch_size (e.g., 1,000 training samples with batch_size = 100 gives 10 steps per epoch).
  • shuffle: True or False, default = True. Whether to shuffle the training data before each epoch. This argument is ignored when x is a generator or a tf.data.Dataset object. 'batch' is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks.
  • callback: list of callbacks, default = [EarlyStopping(monitor='loss', mode='auto', patience=7, verbose=1), ReduceLROnPlateau(monitor='loss', factor=0.2, patience=3, min_lr=0.00001, verbose=1)]. Callbacks are utilities called at certain points during model training.

Adjustable fit method parameters

Parameters:

  • normalize: True or False, default = True. Automatic normalization of the input data using MinMaxScaler.
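In other words, normalize=True roughly corresponds to the following preprocessing (a sketch of the equivalent behavior, assuming the scaler is fit on the training data and reused at predict time):

# Sketch of the preprocessing that normalize=True is equivalent to (assumption)
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # apply the same scaling to new data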

Example:

# load libraries
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam, Nadam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.losses import CategoricalCrossentropy, MeanSquaredError, BinaryCrossentropy
from sklearn.model_selection import train_test_split
from datret.datret import DatRetClassifier, DatRetRegressor, DatRetMultilabelClassifier

# prepare train, test split. As in sklearn.
# for example
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Call the regressor or classifier and train the model.
DR = DatRetClassifier(epoch=50,
                      optimizer=Nadam(learning_rate=0.001),
                      loss=BinaryCrossentropy(),
                      verbose=1,
                      number_neurons=1000,
                      validation_split=0.1,
                      batch_size=100,
                      shuffle=True,
                      callback=[])
DR.fit(X_train, y_train, normalize=True)
# predict the actual label (or class) over a new set of data.
DR_predict = DR.predict(X_test)
# predict the class probabilities for each data point.
DR_predict_proba = DR.predict_proba(X_test)

Model architecture

As an example, with number_neurons = 500 and 2 predicted classes, the model will automatically build the following architecture.

Model: "DatRet with number_neurons = 500"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, X_train.shape[0)] 0

dense (Dense) (None, 500) 150500

dense_1 (Dense) (None, 250) 125250

dense_2 (Dense) (None, 125) 31375

dense_3 (Dense) (None, 62) 7812

dense_4 (Dense) (None, 31) 1953

dense_5 (Dense) (None, 15) 480

dense_6 (Dense) (None, 7) 112

dense_7 (Dense) (None, 3) 24

dense_8 (Dense) (None, 2) 8
(2 predictable classes)
=================================================================
Total params: 317,514
Trainable params: 317,514
Non-trainable params: 0
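The generation logic behind this summary can be reconstructed in a few lines. Below is a minimal sketch (my reconstruction for illustration, not the library's actual code; the relu activations and the exact halving stop condition are assumptions) that reproduces the layer shapes above:

# Sketch: DatRet-style architecture generation (reconstruction, not library code)
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

def build_datret_like_model(n_features, n_classes, number_neurons=500):
    inputs = Input(shape=(n_features,))
    x = inputs
    neurons = number_neurons
    while neurons > n_classes:                     # 500 -> 250 -> 125 -> ... -> 3
        x = Dense(neurons, activation="relu")(x)   # activation assumed
        neurons //= 2                              # halve the width for the next layer
    outputs = Dense(n_classes, activation="softmax")(x)
    return Model(inputs, outputs)

model = build_datret_like_model(n_features=300, n_classes=2)  # e.g. 300 input features
model.summary()  # matches the 500 -> 250 -> ... -> 3 -> 2 shapes shown above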

Comparison of accuracy with classical machine learning methods

  • DatRetClassifier

To assess the accuracy of the classifier, we will use the Pima Indians Diabetes Database from Kaggle. The comparison metric is the ROC AUC score. We will compare DatRet with RandomForest and CatBoost out of the box.

# assumed setup for this benchmark (imports and results table):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
from tensorflow.keras.optimizers import Adam
from datret.datret import DatRetClassifier

dataFrameRocAuc = pd.DataFrame(index=['RandomForest', 'CatBoost', 'DatRet'],
                               columns=[f'{int(i*100)}%' for i in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])

for i in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
    X_train, X_test, y_train, y_test = train_test_split(data.drop(["Outcome"], axis=1), data["Outcome"],
                                                        random_state=10, test_size=i)
    # RandomForest
    RF = RandomForestClassifier(random_state=0)
    RF.fit(X_train, y_train)
    RF_pred = RF.predict_proba(X_test)
    dataFrameRocAuc.loc['RandomForest'][f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, RF_pred[:, 1]), 2)

    # CatBoost
    CB = CatBoostClassifier(random_state=0, verbose=0)
    CB.fit(X_train, y_train)
    CB_pred = CB.predict_proba(X_test)
    dataFrameRocAuc.loc['CatBoost'][f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, CB_pred[:, 1]), 2)

    # DatRet
    DR = DatRetClassifier(optimizer=Adam(learning_rate=0.001))
    DR.fit(X_train, y_train)
    DR_pred = DR.predict_proba(X_test)
    dataFrameRocAuc.loc['DatRet'][f'{int(i*100)}%'] = np.round(roc_auc_score(y_test, DR_pred[:, 1]), 2)
               10%   20%   30%   40%   50%   60%
RandomForest  0.79  0.81  0.81  0.79  0.82  0.82
CatBoost      0.78  0.82  0.82  0.80  0.81  0.82
DatRet        0.79  0.84  0.82  0.81  0.84  0.81
  • DatRetRegressor

To assess the accuracy of the regressor, we will use the Medical Cost Personal Datasets from Kaggle. The comparison metric is the root mean square error (RMSE). We will compare DatRet with RandomForest and CatBoost out of the box.

# assumed setup for this benchmark (imports and results table):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from catboost import CatBoostRegressor
from tensorflow.keras.optimizers import Adam
from datret.datret import DatRetRegressor

dataFrameRMSE = pd.DataFrame(index=['RandomForest', 'CatBoost', 'DatRet'],
                             columns=[f'{int(i*100)}%' for i in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])

for i in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
    X_train, X_test, y_train, y_test = train_test_split(data.drop(["charges"], axis=1), data["charges"],
                                                        random_state=10, test_size=i)
    # RandomForest
    RF = RandomForestRegressor(random_state=0)
    RF.fit(X_train, y_train)
    RF_pred = RF.predict(X_test)
    dataFrameRMSE.loc['RandomForest'][f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, RF_pred, squared=False), 2)

    # CatBoost
    CB = CatBoostRegressor(random_state=0, verbose=0)
    CB.fit(X_train, y_train)
    CB_pred = CB.predict(X_test)
    dataFrameRMSE.loc['CatBoost'][f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, CB_pred, squared=False), 2)

    # DatRet
    DR = DatRetRegressor(optimizer=Adam(learning_rate=0.01))
    DR.fit(X_train, y_train)
    DR_pred = DR.predict(X_test)
    dataFrameRMSE.loc['DatRet'][f'{int(i*100)}%'] = np.round(mean_squared_error(y_test, DR_pred, squared=False), 2)
               10%   20%   30%   40%   50%   60%
RandomForest  5736  5295  4777  4956  4904  4793
CatBoost      5732  5251  4664  4986  5044  4989
DatRet        5860  5173  4610  4927  5047  5780

Not bad results for an out-of-the-box model.

In the classification task, DatRet showed the best results for test sets of 10%, 20%, 30%, 40%, and 50% of the total dataset.

In the regression task, DatRet gives the best accuracy for test sets of 20%, 30%, and 40% of the total dataset.

In the future, I plan to evaluate the accuracy of the model on other datasets. I also see opportunities to improve the quality of forecasting, which I plan to implement in the next versions of the library.
