Stories by Anil Yildiz on Medium

Deep Learning with TabNet

Anil Yildiz — Fri, 24 Nov 2023 05:34:10 GMT

TabNet is a deep learning architecture specifically designed for tabular data, introduced in the paper “TabNet: Attentive Interpretable Tabular Learning” by Arik and Pfister from Google Research.

TabNet takes raw tabular data without heavy preprocessing and is trained with gradient descent-based optimization. At each decision step it uses sequential attention to choose which features to reason from. This improves interpretability and concentrates learning capacity on the most useful features.

Because feature selection is performed per instance, the selected features can differ for each row of the dataset. TabNet handles both feature selection and feature processing within a single deep learning architecture, a technique referred to as soft feature selection.

TabNet provides two types of interpretability: local interpretability, which visualizes the importance of features and shows how they are combined for a specific row; and global interpretability, which quantifies the contribution of each feature to the trained model across the entire dataset.
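
To make row-wise soft feature selection more concrete, here is a minimal NumPy sketch. It assumes made-up attention scores for two rows (in TabNet these scores are produced by a learned attention block) and normalizes them with sparsemax, the sparse alternative to softmax that the TabNet paper uses for its masks. The function and variable names are illustrative and not part of any library.

```python
import numpy as np

def sparsemax(scores):
    """Project a score vector onto the probability simplex, allowing exact zeros."""
    z_sorted = np.sort(scores)[::-1]            # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(scores) + 1)
    support = 1 + k * z_sorted > cumsum         # sorted entries that stay non-zero
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max       # threshold subtracted from every score
    return np.maximum(scores - tau, 0.0)

# Hypothetical attention scores for two rows over four features.
row_scores = np.array([
    [2.0, 0.1, -1.0, 0.3],
    [0.2, 1.5, 1.4, -0.5],
])
masks = np.array([sparsemax(r) for r in row_scores])
print(masks)  # each row sums to 1; zero entries mark features that are ignored
```

With these scores the first row puts all of its weight on feature 0, while the second row splits its weight between features 1 and 2, which is the sense in which the selection can vary from row to row.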

The advantages of TabNet can be listed as follows:

It allows you to train a multi-output regressor (or a multi-class classifier) with a single model instead of creating a separate model for each output.
It uses an attention mechanism to focus on specific features for each data point, and the resulting masks can be visualized to show which features received attention for a given decision. The number of features used can vary with the input.
It is trained end to end with backpropagation, which gives direct control over the loss and the weights.
Standard deep learning techniques such as learning-rate reduction, fine-tuning, and custom loss functions can be applied to it.
TabNet automates feature selection, so you don’t need to take care of this.

Its disadvantages are as follows:

TabNet is designed for tabular datasets only. (That was its purpose anyway 😄 )
TabNet, like other deep learning algorithms, is quite complex.
With the hand-written PyTorch training loop used below, Grid Search and Randomized Search cannot be applied directly. Therefore, hyperparameters (EPOCHS, BATCH_SIZE, LEARNING_RATE, etc.) must be configured manually.
The number of layers and neurons in the neural network architecture must be chosen carefully; otherwise the accuracy will be quite low.

TabNet Implementation

The dataset contains packet information generated over the network when smartphone applications are in use. Our objective is to predict the type of application through multi-class classification using the packet information.

First, let’s import the necessary libraries.

import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler

from sklearn.preprocessing import MinMaxScaler    
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

dataframe = pd.read_csv('dataset.csv')

Let’s load and visualize the dataset. (Categories are given according to the type of application, for example 0: Social Media, 1: Multimedia, 3: Communication, 4: Navigation, 5: Game, etc.)

df = pd.read_csv('dataset.csv')
rows = 100
dfhead = df.head(rows)
dftail = df.tail(rows)
dflast = pd.concat([dfhead, dftail])
dflast

We encode our target class.

class2idx = {
    'Entertainment':0,
    'Social':1,
    'Utility':2,
    'Lifestyle':3,
    'Productivity':4,
    'Game':5
}
idx2class = {v: k for k, v in class2idx.items()}
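# Assign the result back; calling replace(..., inplace=True) on a selected column is unreliable in recent pandas
dataframe['appcategory'] = dataframe['appcategory'].replace(class2idx)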

Since appcategory is our target, we separate it from the input features and then partition the data into Train (training), Val (validation) and Test sets.

X = dataframe.drop(columns=['appcategory'])
y = dataframe['appcategory']
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=69)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.1, stratify=y_trainval, random_state=21)
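
The two calls above split the data in two stages, so the final proportions are not 80/10/10. A small sketch of the arithmetic, assuming a hypothetical dataset of 10,000 rows (the real row count is not stated here) and rounding that may differ from scikit-learn by a row:

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.1):
    """Return (train, val, test) row counts for the two-stage split."""
    n_test = math.ceil(n_rows * test_size)        # first split: hold out the test set
    n_trainval = n_rows - n_test
    n_val = math.ceil(n_trainval * val_size)      # second split: validation from the rest
    n_train = n_trainval - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(10_000)      # 10_000 is an assumed row count
print(n_train, n_val, n_test)
```

The validation set is 10% of the remaining 80%, so the split works out to roughly 72% train, 8% validation and 20% test.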

We normalize the inputs.

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
X_train, y_train = np.array(X_train), np.array(y_train)
X_val, y_val = np.array(X_val), np.array(y_val)
X_test, y_test = np.array(X_test), np.array(y_test)
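
The scaler is fitted on the training rows only and then reused for the other splits. The NumPy sketch below reproduces that idea by hand on toy numbers (the values are invented for illustration) to show what this means for validation or test rows.

```python
import numpy as np

train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])
val = np.array([[15.0, 700.0]])   # second value lies above the training maximum

# "fit": learn per-column min and max from the training rows only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def scale(x):
    """Apply the training-set min/max to any split."""
    return (x - col_min) / (col_max - col_min)

train_scaled = scale(train)
val_scaled = scale(val)
print(train_scaled)
print(val_scaled)
```

Training values land in [0, 1], but a validation value outside the training range can fall outside it (1.25 in this toy case). That is expected when the statistics come from the training set alone.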

We visualize the class distribution in Train and Val datasets.

def get_class_distribution(obj):
    # Map each integer class code to its label.
    labels = {
        0: "rating_ECOMMERCE",
        1: "rating_OTHER",
        2: "rating_MULTIMEDIA",
        3: "rating_SOCIAL",
        4: "rating_COMMUNICATION",
        5: "rating_NAVIGATION",
        6: "rating_GAME",
    }
    count_dict = {label: 0 for label in labels.values()}

    for i in obj:
        label = labels.get(i)
        if label is None:
            print("Check classes.")
        else:
            count_dict[label] += 1

    return count_dict

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(25,7))

# Train
sns.barplot(data = pd.DataFrame.from_dict([get_class_distribution(y_train)]).melt(), x = "variable", y="value", hue="variable",  ax=axes[0]).set_title('Class Distribution in Train Set')
# Validation
sns.barplot(data = pd.DataFrame.from_dict([get_class_distribution(y_val)]).melt(), x = "variable", y="value", hue="variable",  ax=axes[1]).set_title('Class Distribution in Val Set')

We create Train, Val and Test data sets.

class ClassifierDataset(Dataset):
    
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
        
    def __len__ (self):
        return len(self.X_data)

train_dataset = ClassifierDataset(torch.from_numpy(X_train).float(), torch.from_numpy(y_train).long())
val_dataset = ClassifierDataset(torch.from_numpy(X_val).float(), torch.from_numpy(y_val).long())
test_dataset = ClassifierDataset(torch.from_numpy(X_test).float(), torch.from_numpy(y_test).long())

We build the weighted sampler. Each training row is weighted by the inverse frequency of its class, so rows from rare classes are drawn more often.

target_list = []
for _, t in train_dataset:
    target_list.append(t)
    
target_list = torch.tensor(target_list)

class_count = [i for i in get_class_distribution(y_train).values()]
class_weights = 1./torch.tensor(class_count, dtype=torch.float)

class_weights_all = class_weights[target_list]

print(class_weights)

weighted_sampler = WeightedRandomSampler(
    weights=class_weights_all,
    num_samples=len(class_weights_all),
    replacement=True
)
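
To see what inverse-frequency weights do, the sketch below repeats the same computation in NumPy on a toy, imbalanced label vector (the labels are invented) and works out the probability that a single draw lands in each class.

```python
import numpy as np

labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])   # toy, imbalanced targets
class_count = np.bincount(labels)                     # rows per class: 6, 3, 1
class_weights = 1.0 / class_count                     # inverse frequency per class
sample_weights = class_weights[labels]                # one weight per row

# Probability that a single weighted draw lands in each class
draw_prob = sample_weights / sample_weights.sum()
class_prob = np.array([draw_prob[labels == c].sum() for c in range(len(class_count))])
print(class_prob)
```

Every class ends up with the same total probability, so batches are roughly balanced even though the data is not. Note that a class with zero rows would produce an infinite weight, so the class counts should cover only classes that actually occur.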

We assign the hyperparameters. As mentioned above, these must be set manually. At this stage, examine what each parameter does and choose the values that best suit your dataset. You can review it here.

EPOCHS = 300
BATCH_SIZE = 32
LEARNING_RATE = 0.001
NUM_FEATURES = len(X.columns)
NUM_CLASSES = 7

We define the data loaders.

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=BATCH_SIZE,
                          sampler=weighted_sampler
)
val_loader = DataLoader(dataset=val_dataset, batch_size=1)
test_loader = DataLoader(dataset=test_dataset, batch_size=1)

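We design the neural network architecture. Note that the network defined below is a plain feed-forward classifier (three fully connected hidden layers with batch normalization, ReLU and dropout); it does not include TabNet’s attentive feature-selection steps. The layer and neuron counts are set manually, just like the parameters defined in the two previous steps, and poorly chosen values will lead to a low success rate.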

class MulticlassClassification(nn.Module):
    def __init__(self, num_feature, num_class):
        super(MulticlassClassification, self).__init__()
        
        self.layer_1 = nn.Linear(num_feature, 512)
        self.layer_2 = nn.Linear(512, 256)
        self.layer_3 = nn.Linear(256, 128)
        self.layer_out = nn.Linear(128, num_class) 
        
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        self.batchnorm1 = nn.BatchNorm1d(512)
        self.batchnorm2 = nn.BatchNorm1d(256)
        self.batchnorm3 = nn.BatchNorm1d(128)
        
    def forward(self, x):
        x = self.layer_1(x)
        x = self.batchnorm1(x)
        x = self.relu(x)
        
        x = self.layer_2(x)
        x = self.batchnorm2(x)
        x = self.relu(x)
        x = self.dropout(x)
        
        x = self.layer_3(x)
        x = self.batchnorm3(x)
        x = self.relu(x)
        x = self.dropout(x)
        
        x = self.layer_out(x)
        
        return x

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")    
    
model = MulticlassClassification(num_feature = NUM_FEATURES, num_class=NUM_CLASSES)
model.to(device)

criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

As a result of this design, there will be a neural network architecture like this.
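
The loss above passes class_weights to nn.CrossEntropyLoss. As a rough illustration of what that weighting does, here is a NumPy sketch of a class-weighted cross-entropy averaged the way PyTorch's default mean reduction is documented to work (a weighted sum divided by the sum of the weights). The logits, targets and weights are toy values.

```python
import numpy as np

def weighted_cross_entropy(logits, targets, class_weights):
    """Mean cross-entropy where each row is weighted by the weight of its true class."""
    shifted = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    row_loss = -log_probs[np.arange(len(targets)), targets]
    w = class_weights[targets]
    return (w * row_loss).sum() / w.sum()

logits = np.array([[2.0, 0.0, 0.0],    # scores favour class 0, and the true class is 0
                   [2.0, 0.0, 0.0]])   # same scores, but the true class is 2
targets = np.array([0, 2])
uniform = weighted_cross_entropy(logits, targets, np.array([1.0, 1.0, 1.0]))
rare_up = weighted_cross_entropy(logits, targets, np.array([1.0, 1.0, 5.0]))
print(uniform, rare_up)
```

With a larger weight on class 2, the misclassified class-2 row dominates the average, so the loss rises and the optimizer is pushed harder to fix mistakes on that class.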

We train and validate the model.

def multi_acc(y_pred, y_test):
    y_pred_softmax = torch.log_softmax(y_pred, dim = 1)
    _, y_pred_tags = torch.max(y_pred_softmax, dim = 1)    
    
    correct_pred = (y_pred_tags == y_test).float()
    acc = correct_pred.sum() / len(correct_pred)
    
    acc = torch.round(acc * 100)
    
    return acc

accuracy_stats = {
    'train': [],
    "val": []
}
loss_stats = {
    'train': [],
    "val": []
}

print("Begin training.")

for e in tqdm(range(1, EPOCHS+1)):
    
    # TRAINING
    train_epoch_loss = 0
    train_epoch_acc = 0
    model.train()  # switch back to training mode; validation below leaves the model in eval mode

    for X_train_batch, y_train_batch in train_loader:
        X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
        optimizer.zero_grad()
        
        y_train_pred = model(X_train_batch)
        
        train_loss = criterion(y_train_pred, y_train_batch)
        train_acc = multi_acc(y_train_pred, y_train_batch)
        
        train_loss.backward()
        optimizer.step()
        
        train_epoch_loss += train_loss.item()
        train_epoch_acc += train_acc.item()
                
    # VALIDATION    
    with torch.no_grad():
        
        val_epoch_loss = 0
        val_epoch_acc = 0
        
        model.eval()
        for X_val_batch, y_val_batch in val_loader:
            X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)
            
            y_val_pred = model(X_val_batch)
                        
            val_loss = criterion(y_val_pred, y_val_batch)
            val_acc = multi_acc(y_val_pred, y_val_batch)
            
            val_epoch_loss += val_loss.item()
            val_epoch_acc += val_acc.item()
            
    loss_stats['train'].append(train_epoch_loss/len(train_loader))
    loss_stats['val'].append(val_epoch_loss/len(val_loader))
    accuracy_stats['train'].append(train_epoch_acc/len(train_loader))
    accuracy_stats['val'].append(val_epoch_acc/len(val_loader))
                              
    
    print(f'Epoch {e+0:03}: | Train Loss: {train_epoch_loss/len(train_loader):.5f} | Val Loss: {val_epoch_loss/len(val_loader):.5f} | Train Acc: {train_epoch_acc/len(train_loader):.3f}| Val Acc: {val_epoch_acc/len(val_loader):.3f}')

We create dataframes and visualize accuracy and loss per epoch.

# Create dataframes
train_val_acc_df = pd.DataFrame.from_dict(accuracy_stats).reset_index().melt(id_vars=['index']).rename(columns={"index":"epochs"})
train_val_loss_df = pd.DataFrame.from_dict(loss_stats).reset_index().melt(id_vars=['index']).rename(columns={"index":"epochs"})
# Plot the dataframes
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,7))
sns.lineplot(data=train_val_acc_df, x = "epochs", y="value", hue="variable",  ax=axes[0]).set_title('Train-Val Accuracy/Epoch')
sns.lineplot(data=train_val_loss_df, x = "epochs", y="value", hue="variable", ax=axes[1]).set_title('Train-Val Loss/Epoch')

We test the model and print the result. (Finally 😄)

y_pred_list = []
with torch.no_grad():
    model.eval()
    for X_batch, _ in test_loader:
        X_batch = X_batch.to(device)
        y_test_pred = model(X_batch)
        _, y_pred_tags = torch.max(y_test_pred, dim = 1)
        y_pred_list.append(y_pred_tags.cpu().numpy())
y_pred_list = [a.squeeze().tolist() for a in y_pred_list]

# Class-based and average results
print(classification_report(y_test, y_pred_list))

Training ran for 300 epochs because the EPOCHS parameter was set to 300.

Class-based and average results.

Class distributions in train, test and val datasets.

Iteration-based plots of accuracy and losses.

Sources

TabNet: Attentive Interpretable Tabular Learning

TabNet on AI Platform: High-performance, Explainable Tabular Learning

Tabular Workflow for TabNet

Implementation Example Google

Implementation Example Kaggle

LightGBM: Light and Powerful Gradient Boosting Algorithm

Anil Yildiz — Fri, 24 Nov 2023 05:32:00 GMT

LightGBM, developed by Microsoft, is a Gradient Boosting Machine (GBM) framework built on tree-based learning algorithms. It has rapidly gained popularity and is widely used on Kaggle, largely because it trains faster than many comparable models while still ranking among those that achieve the best results.

LightGBM grows decision trees leaf-wise and uses histogram-based split finding, which reduces memory usage and improves efficiency. On top of this it introduces two techniques, Gradient-based One Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to reduce the cost of scanning every data instance and every feature when building histograms in conventional Gradient Boosting Decision Tree (GBDT) implementations. Used together, GOSS and EFB shape the characteristics of LightGBM and give it an advantage over other GBDT algorithms.

LightGBM Algorithm

Let’s first examine the difference between the leaf-wise tree growth adopted by LightGBM and the level-wise growth used by traditional decision tree learners.

The fundamental difference lies in how the tree grows and how branches expand.

Level-Wise

The tree grows level by level. In other words, all nodes are expanded at each level, and the children of these nodes are created.
This approach typically leads to shallower but wider trees.
Every node on a level is split, including nodes that offer little gain, which can spend processing time on unhelpful splits.

Leaf-Wise

The tree grows by adding a leaf that provides the maximum gain at each expansion step. In other words, only one leaf node is added at each expansion step.
This usually results in deeper but narrower trees. Deeper trees can offer more flexibility to capture complex feature relationships.
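For the same number of leaves it generally reaches a lower loss than the level-wise approach, which makes training more efficient. On small datasets it can overfit, so the tree depth is usually limited (max_depth).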

In summary, LightGBM’s leaf-wise strategy focuses on expanding the tree by adding leaves that provide the maximum gain, offering advantages such as increased depth for capturing intricate patterns and faster processing.
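
The difference between the two growth strategies can be sketched with a toy tree in plain Python. The gains below are invented for illustration, and the code is a simplified model of the idea rather than how LightGBM is implemented: both strategies get the same budget of three splits, and we compare which nodes they split and how much gain they collect.

```python
import heapq

# Hypothetical gain obtained by splitting each node (binary-heap indexing:
# the children of node i are 2*i + 1 and 2*i + 2). Unlisted nodes give 0 gain.
GAIN = {0: 10.0, 1: 1.0, 2: 8.0, 5: 6.0, 6: 0.5, 11: 4.0}

def grow_level_wise(n_splits):
    """Split nodes in breadth-first order, level by level."""
    order = list(range(n_splits))                 # with heap indexing, 0, 1, 2, ... is BFS order
    return order, sum(GAIN.get(n, 0.0) for n in order)

def grow_leaf_wise(n_splits):
    """Always split the current leaf with the largest gain."""
    heap = [(-GAIN.get(0, 0.0), 0)]               # max-heap via negated gains
    order, total = [], 0.0
    for _ in range(n_splits):
        neg_gain, node = heapq.heappop(heap)
        order.append(node)
        total += -neg_gain
        for child in (2 * node + 1, 2 * node + 2):
            heapq.heappush(heap, (-GAIN.get(child, 0.0), child))
    return order, total

print(grow_level_wise(3))   # splits nodes 0, 1, 2
print(grow_leaf_wise(3))    # splits nodes 0, 2, 5
```

In this toy case the level-wise tree spends one split on the low-gain node 1, while the leaf-wise tree goes deeper along the high-gain branch and collects more total gain with the same number of splits.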

Gradient-based One Side Sampling (GOSS)

GOSS starts from the observation that data instances contribute differently to information gain: instances with larger gradients have a greater impact. GOSS therefore keeps the instances with large gradients (e.g., those greater than a specific threshold or in the top percentiles) and randomly samples from the instances with small gradients, discarding the rest. To keep the information gain estimate accurate, the sampled small-gradient instances are up-weighted by a constant factor. When gradient magnitudes vary widely, this gives more accurate gain estimates than uniform random sampling at the same sampling ratio.
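
A minimal NumPy sketch of the sampling step follows. It assumes random toy gradients, and the parameters top_rate and other_rate are illustrative names for the two sampling ratios (a and b in the LightGBM paper). Following the paper, the sampled small-gradient rows are multiplied by (1 - a) / b so that their total contribution stays roughly unchanged.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return selected row indices and their weights following the GOSS idea."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(gradients))        # largest |gradient| first
    top_idx = order[:n_top]                        # always kept
    rest_idx = order[n_top:]
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)

    idx = np.concatenate([top_idx, other_idx])
    weights = np.ones(len(idx))
    # Up-weight the sampled small-gradient rows to compensate for the discarded ones.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return idx, weights

gradients = np.random.default_rng(42).normal(size=1000)   # toy gradients
idx, weights = goss_sample(gradients)
print(len(idx), "of", len(gradients), "rows kept")
```

Only 30% of the rows are used to estimate the gain here, yet all of the large-gradient rows, which matter most for that estimate, are still included.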

Exclusive Feature Bundling (EFB)

EFB provides an almost lossless way to represent high-dimensional sparse data with fewer features. In a sparse feature space, many features are mutually exclusive, meaning they do not take non-zero values at the same time. These features can be safely merged into a single feature (a bundle). As a result, the complexity of histogram construction decreases from O(#data × #feature) to O(#data × #bundle), where #bundle < #feature. This speeds up the algorithm while maintaining accuracy.
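
The merging step can be illustrated with two toy columns in NumPy. This is a simplified sketch of the idea with invented values; LightGBM itself works on binned feature values and also has to decide which features to bundle together.

```python
import numpy as np

# Two sparse features that are never non-zero in the same row.
feat_a = np.array([0, 3, 0, 1, 0, 0])   # values in 0..3
feat_b = np.array([2, 0, 0, 0, 1, 0])   # values in 0..2

assert not np.any((feat_a != 0) & (feat_b != 0)), "features must be mutually exclusive"

# Shift feat_b past the range of feat_a so the two value ranges cannot collide.
offset = feat_a.max()
bundle = np.where(feat_b != 0, feat_b + offset, feat_a)
print(bundle)            # one column now carries both features

# Both originals can be recovered from the bundle, so nothing was lost.
rec_a = np.where(bundle <= offset, bundle, 0)
rec_b = np.where(bundle > offset, bundle - offset, 0)
```

A histogram built on the single bundled column separates the values of both features, which is why the cost scales with the number of bundles rather than the number of original features.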

LightGBM Implementation

First, let’s import the necessary libraries.

import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

Let’s load and visualize the dataset. (Categories are given according to the type of application, for example 0: Social Media, 1: Multimedia, 3: Communication, 4: Navigation, 5: Game, etc.)

df = pd.read_csv('dataset.csv')
rows = 100
dfhead = df.head(rows)
dftail = df.tail(rows)
dflast = pd.concat([dfhead, dftail])
dflast

Load the data and split it into train and test sets (80%-20%).

df = pd.read_csv('dataset.csv')
X = df.drop(columns=['appcategory'])
y = df['appcategory']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

We instantiate the LightGBM classifier, set up hyperparameter optimization, and start training. GridSearchCV is used for hyperparameter optimization in this example; however, the optimal values were found beforehand, so each parameter list contains only that single value.

For hyperparameter optimization with GridSearchCV and LightGBM hyperparameters, you can check this article.

# Use a separate name for the estimator so the lgb module is not overwritten
model = lgb.LGBMClassifier()
parameters = {'num_leaves': [100], 'min_child_samples': [15], 'max_depth': [20],
              'learning_rate': [0.2], 'reg_alpha': [0.03]}
clf = GridSearchCV(model, parameters, cv=2)
clf.fit(X=X_train, y=y_train)
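
To see how much work a grid implies, the plain-Python sketch below expands a parameter dictionary into its candidate settings and counts the cross-validation fits. The helper grid_candidates and the wider grid are hypothetical and only serve as an illustration; they are not part of scikit-learn.

```python
from itertools import product

parameters = {
    'num_leaves': [100], 'min_child_samples': [15], 'max_depth': [20],
    'learning_rate': [0.2], 'reg_alpha': [0.03],
}

def grid_candidates(grid):
    """Expand a parameter grid into the list of candidate settings."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

cv = 2
candidates = grid_candidates(parameters)
print(len(candidates), "candidate(s),", len(candidates) * cv, "cross-validation fits")

# Hypothetical wider grid: two values each for two of the parameters.
wider = dict(parameters, num_leaves=[50, 100], learning_rate=[0.1, 0.2])
print(len(grid_candidates(wider)) * cv, "cross-validation fits")
```

With one value per parameter there is a single candidate and only two fits, so the search acts as a cross-validated fit of fixed settings. The number of fits grows multiplicatively as values are added.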

Finally, we complete the prediction process and write down the results.

predictions = clf.predict(X_test)
score = accuracy_score(y_test, predictions)
print(score)
# Category-based success rate: fraction of each true category predicted correctly
correct = (y_test == predictions)
print(correct.groupby(y_test).mean())
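
A category-based success rate is the number of correctly predicted rows of a category divided by all rows of that category, that is, the diagonal of the confusion table divided by the row totals. The NumPy sketch below shows this on invented labels and predictions.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 2])   # made-up predictions

n_classes = 3
confusion = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1                             # rows: true class, columns: predicted

per_class = np.diag(confusion) / confusion.sum(axis=1)   # correct / total, per true class
overall = np.trace(confusion) / confusion.sum()          # overall accuracy
print(confusion)
print(per_class, overall)
```

In this toy example class 2 is most often predicted as class 0, so the largest entry in its row is not on the diagonal. Using the row maximum instead of the diagonal would overstate that class's success rate.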

The output shows an overall success rate (accuracy) of 79%, followed by the category-based success rates.

Sources

Light GBM: A Powerful Gradient Boosting Algorithm

What is LightGBM (Light Gradient Boosting) + Example Python Code

Light-GBM & difference b/w LGBM & Xg-boost