MACHINE LEARNING

How to Extract Neural Network Embeddings

Enhancing Predictive Accuracy with Neural Network Embeddings

Priyanshu Chaudhary
CueNex


Automatic feature extraction

Introduction

In our previous feature engineering blog, we looked at ways of handcrafting features. In this blog, we will look at the automatic feature engineering done by neural networks, and at how to extract those embeddings and use them alongside handcrafted features.
Note that we have used the TensorFlow framework for all our neural-network-based pipelines, so this method applies when you are working with TensorFlow. Before diving into the process of extracting embeddings, let's first understand what embeddings are and how the idea came about.

Understanding embeddings

Though the idea of embeddings was introduced in the early 90s, it became famous in the modern era when, in 2013, Tomas Mikolov and colleagues introduced the Word2Vec algorithm, which used a neural network to generate embeddings that capture semantic relationships between words. This breakthrough led to the widespread adoption of word embeddings in NLP applications. Similarly, we can represent tabular data in the form of embeddings by passing it through one or more dense layers.

We can think of embeddings as outputs, or encoded representations, of data that machines can interpret and that capture temporal, spatial, and contextual information depending on the application. These encoded representations are subsequently passed to one or more classifier layers, which produce the desired output.

In our case, the data comes from credit default detection, and the embeddings are the outputs of the intermediate layers of the neural network. Normally these encodings are passed to the classifier head (a dense layer) of the neural network, but we can instead pass them to a different model, such as LightGBM (LGBM) or a support vector machine (SVM), which then acts as the classifier head.
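As a toy illustration of this idea (the arrays below are made up purely for the example, not taken from our data), any classical model can take the role of the classifier head once the encoded representations are available as plain arrays:

import numpy as np
from sklearn.svm import SVC

# Pretend these 16-dimensional rows are embeddings produced by an intermediate dense layer.
embeddings = np.random.rand(1000, 16)
labels = np.random.randint(0, 2, size=1000)

# An SVM acting as the classifier head instead of the final dense layer.
clf = SVC(probability=True)
clf.fit(embeddings, labels)

The rest of this blog shows how to obtain such embeddings from our actual credit default model.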

Model

We are using the neural network model that we defined in our previous blog. The network, along with the embedding layer, is shown below.

Neural Network with an embedding layer

The changes that need to be made are shown in the code below.

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, InputLayer, Add, Concatenate, Dropout, BatchNormalization

LR_START = 0.01  # initial learning rate

def my_model(n_inputs=len(features)):
    """Neural network with a skip connection.
    Returns a compiled instance of tensorflow.keras.models.Model."""
    activation = 'swish'
    l1 = 1e-7
    l2 = 4e-4
    inputs = Input(shape=(n_inputs, ))
    x0 = BatchNormalization()(inputs)
    x0 = Dense(1024,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               )(x0)
    x0 = Dropout(0.1)(x0)
    x1 = Dense(64,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               )(x0)
    x2 = Dense(64,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               )(x1)
    x3 = Concatenate()([x2, x0])
    x3 = Dropout(0.1)(x3)
    x4 = Dense(16,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               name='embeddings')(x3)  # specify the name of the embedding layer
    x5 = Dense(1,
               activation='sigmoid',
               name='output')(x4)  # specify the name of the output layer
    model = Model(inputs, [x5, x4])  # add x4 as a second output to extract its embeddings
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=LR_START, clipvalue=0.5, clipnorm=1.0),
                  loss=tf.keras.losses.BinaryCrossentropy(),
                  loss_weights=[1., 0.0])  # zero weight: the embedding output does not contribute to the loss
    return model
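To quickly verify that the model now exposes both outputs, a small shape check like the following can help (the random batch below is only for illustration; it assumes the features list from our previous blog is available):

import numpy as np

model = my_model(n_inputs=len(features))
preds, embeds = model.predict(np.random.rand(4, len(features)), verbose=0)
print(preds.shape)   # (4, 1)  -> sigmoid predictions from the 'output' head
print(embeds.shape)  # (4, 16) -> embeddings from the 'embeddings' layer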

Let’s understand the changes we have made.

Note that we can extract embeddings from any of the layers in our neural network (x0, x1, x2, x3, …). Here is an example of how we extract the embeddings of layer x4.

To extract the features, we have to add the x4 variable to the outputs of the Model, as illustrated below.

x4 = Dense(16,
           kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
           activation=activation,
           name='embeddings')(x3)  # specify the name of the embedding layer
x5 = Dense(1,
           activation='sigmoid',
           name='output')(x4)  # specify the name of the output layer
model = Model(inputs, [x5, x4])  # add x4 as a second output to extract its embeddings

Note that we have specified loss_weights when compiling the model so that the loss is computed only on the classifier output and not on the x4 embeddings (their weight is 0.0), as also highlighted by the red circle in the figure above.

model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=LR_START, clipvalue=0.5, clipnorm=1.0),
              loss=tf.keras.losses.BinaryCrossentropy(),
              loss_weights=[1., 0.0])  # weight 0.0 for the embedding output
return model
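As a side note, if retraining with a second output is not convenient, Keras can also build a standalone feature extractor from the named layer of an already trained network. A minimal sketch (X_scaled is a placeholder for any scaled feature matrix, not a variable from this blog):

# Map the original inputs to the activations of the layer named 'embeddings'.
extractor = tf.keras.Model(inputs=model.input,
                           outputs=model.get_layer('embeddings').output)

# One 16-dimensional embedding vector per row of X_scaled.
embeddings = extractor.predict(X_scaled, verbose=0)

In this blog, however, we stick with the two-output approach so the embeddings come out of the same predict call as the predictions.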

Training

Now that we have added the x4 embeddings to our outputs, we also have to tell the training callbacks to evaluate the model on the true-label output only, not on the embeddings; otherwise it can lead to a logical error.

import gc
import math
import datetime
import pickle
import numpy as np
from colorama import Fore, Style
from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler, QuantileTransformer, OneHotEncoder

# EPOCHS, EPOCHS_EXPONENTIALDECAY, CYCLES, LR_END, VERBOSE, BATCH_SIZE, USE_PLATEAU and DIAGRAMS
# are configuration constants defined earlier (see the referenced Kaggle notebook).

def fit_model(X_tr, y_tr, X_va=None, y_va=None, fold=0, run=0):
    '''Scale the data, train the model and validate it.
    Code ref: https://www.kaggle.com/code/ambrosm/amex-keras-quickstart-1-training
    '''
    global y_va_pred
    gc.collect()
    start_time = datetime.datetime.now()

    scaler = StandardScaler()          # scale the training data
    X_tr = scaler.fit_transform(X_tr)

    if X_va is not None:
        X_va = scaler.transform(X_va)  # scale the validation data
        validation_data = (X_va, [y_va, np.zeros((len(y_va), 16))])
    else:
        validation_data = None

    # Define the learning rate schedule and EarlyStopping
    if USE_PLATEAU and X_va is not None:  # use early stopping
        epochs = EPOCHS
        lr = ReduceLROnPlateau(monitor="val_output_loss", factor=0.7,  # monitor only the 'output' head
                               patience=4, verbose=VERBOSE)
        es = EarlyStopping(monitor="val_output_loss",  # stop training if results do not improve for 12 epochs
                           patience=12,
                           verbose=1,
                           mode="min",
                           restore_best_weights=True)
        callbacks = [lr, es, tf.keras.callbacks.TerminateOnNaN()]

    else:  # use exponential learning rate decay rather than early stopping
        epochs = EPOCHS_EXPONENTIALDECAY

        def exponential_decay(epoch):
            # v decays from e^a to 1 in every cycle
            # w decays from 1 to 0 in every cycle
            # epoch == 0                  -> w = 1 (first epoch of cycle)
            # epoch == epochs_per_cycle-1 -> w = 0 (last epoch of cycle)
            # higher a -> decay starts with a steeper decline
            a = 3
            epochs_per_cycle = epochs // CYCLES
            epoch_in_cycle = epoch % epochs_per_cycle
            if epochs_per_cycle > 1:
                v = math.exp(a * (1 - epoch_in_cycle / (epochs_per_cycle - 1)))
                w = (v - 1) / (math.exp(a) - 1)
            else:
                w = 1
            return w * LR_START + (1 - w) * LR_END

        lr = LearningRateScheduler(exponential_decay, verbose=0)
        callbacks = [lr, tf.keras.callbacks.TerminateOnNaN()]

    # Construct and compile the model
    model = my_model(X_tr.shape[1])

    # Train the model
    history = model.fit(X_tr, [y_tr, np.zeros((len(y_tr), 16))],  # dummy zero targets for the embedding output
                        validation_data=validation_data,
                        epochs=epochs,
                        verbose=VERBOSE,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        callbacks=callbacks)
    del X_tr, y_tr
    with open(f"scaler_{fold}.pickle", 'wb') as f:
        pickle.dump(scaler, f)         # save the scaler for inference
    model.save(f"model_{fold}")        # save the model weights
    history_list.append(history.history)
    callbacks, es, lr, history = None, None, None, None

    lastloss = f"Training loss: {history_list[-1]['loss'][-1]:.4f} | Val loss: {history_list[-1]['val_loss'][-1]:.4f}"

    # Inference for validation; [0] selects the 'output' head, not the embeddings
    y_va_pred = model.predict(X_va, batch_size=len(X_va), verbose=0)[0].ravel()

    # Evaluation: execution time, loss and metrics
    score = evaluation_metric(y_va, y_va_pred)
    print(f"{Fore.GREEN}{Style.BRIGHT}Fold {run}.{fold} | {str(datetime.datetime.now() - start_time)[-12:-7]}"
          f" | {len(history_list[-1]['loss']):3} ep"
          f" | {lastloss} | Score: {score:.5f}{Style.RESET_ALL}")
    score_list.append(score)

    if DIAGRAMS and fold == 0 and run == 0:
        # Plot training history
        plot_history(history_list[-1],
                     title="Learning curve",
                     plot_lr=True)

    # Scale the test data and predict; [0] again selects the 'output' head
    y_pred_list.append(model.predict(scaler.transform(test), batch_size=128 * 1024, verbose=0)[0].ravel())

The changes we need to make in our training code are:

  • Change the validation_data shape so that it provides a dummy (all-zeros) target for the embedding output, as defined below.
validation_data = (X_va, [y_va, np.zeros((len(y_va), 16))])
  • Change model.fit so that the training labels follow the same two-output shape.
history = model.fit(X_tr, [y_tr, np.zeros((len(y_tr), 16))],  # fit model
                    validation_data=validation_data,
                    epochs=epochs,
                    verbose=VERBOSE,
                    batch_size=BATCH_SIZE,
                    shuffle=True,
                    callbacks=callbacks)
  • Point the learning rate scheduler and the early stopping callback at val_output_loss (the validation loss of the 'output' head) so that the model is evaluated only on the true labels, not on the embeddings.
lr = ReduceLROnPlateau(monitor="val_output_loss", factor=0.7,  # scheduler monitors only the 'output' head
                       patience=4, verbose=VERBOSE)
es = EarlyStopping(monitor="val_output_loss",  # stop training if results do not improve for 12 epochs
                   patience=12,
                   verbose=1,
                   mode="min",
                   restore_best_weights=True)

Now everything is in place: during training and inference, the model returns the extracted embeddings alongside its predictions.
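For completeness, here is a minimal sketch of how fit_model could be driven by a cross-validation loop. The fold setup, the train and target dataframes and the global result lists below are assumptions based on the referenced Kaggle notebook, not code from this blog:

# Hypothetical driver loop, assuming `train` (feature dataframe), `target` (labels)
# and the `features` list are available from the earlier feature engineering steps.
history_list, score_list, y_pred_list = [], [], []

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, target)):
    X_tr, y_tr = train.iloc[idx_tr][features], target.iloc[idx_tr]
    X_va, y_va = train.iloc[idx_va][features], target.iloc[idx_va]
    fit_model(X_tr, y_tr, X_va, y_va, fold=fold)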

How to use the embeddings?

Using both automated and handcrafted features.

The embeddings generated by the neural network are in tabular format, so using them is as simple as concatenating them with the handcrafted features we created in our previous blogs, as sketched at the end of this section.

We can also extract embeddings from the GRU model we created here and use those features in our final model. Combining features that carry temporal information with the handcrafted features can be of high significance to the model.
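Here is a rough sketch of that final step. The handcrafted feature matrix X_handcrafted, the labels y, the fold index and the LightGBM parameters are assumptions for illustration; the scaler and model files are the ones saved by fit_model above:

import pickle
import numpy as np
import tensorflow as tf
from lightgbm import LGBMClassifier

fold = 0
# Load the scaler and the two-output model saved by fit_model.
with open(f"scaler_{fold}.pickle", "rb") as f:
    scaler = pickle.load(f)
model = tf.keras.models.load_model(f"model_{fold}")

# The second output of the model holds the 16-dimensional embeddings.
_, embeddings = model.predict(scaler.transform(X_handcrafted), verbose=0)

# Concatenate automated and handcrafted features and let LightGBM act as the classifier head.
X_combined = np.hstack([X_handcrafted, embeddings])
clf = LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X_combined, y)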

Conclusion

We now know that embeddings produced by a neural network can be a practical substitute for designing features by hand. By utilizing embeddings, we can enhance the effectiveness and accuracy of our machine learning models, ultimately producing better outcomes and adding more diversity to our pipeline.
