MACHINE LEARNING

How to Extract Neural Network Embeddings

Enhancing Predictive Accuracy with Neural Network Embeddings

Priyanshu Chaudhary
CueNex


Automatic feature extraction

Introduction

In our previous feature engineering blog, we looked at ways of handcrafting features. In this blog, we will look at the automatic feature engineering done by neural networks, and at how to extract those embeddings and use them alongside handcrafted features.
Note that we have used the TensorFlow framework for all our neural-network-based pipelines, so this method applies when you are working with TensorFlow. Before diving into the process of extracting embeddings, let's first understand what embeddings are and how the idea came about.

Understanding embeddings

Though the idea of embeddings was introduced in the early 90s, it became famous in the modern era when, in 2013, Tomas Mikolov and colleagues introduced the Word2Vec algorithm, which used a neural network to generate embeddings that capture semantic relationships between words. This breakthrough led to the widespread adoption of word embeddings in NLP applications. Similarly, we can represent tabular data in the form of embeddings by passing it through one or more dense layers.

We can think of embeddings as outputs, or encoded representations, of data that machines can interpret and that capture temporal, spatial, and contextual information depending on the application. These encoded representations are subsequently passed to one or more classifier layers, which produce the desired output.

In our case, the data comes from credit default detection, and the embeddings are the outputs of the intermediate layers of the neural network. Normally these encodings are passed to the classifier head (a dense layer) of the neural network, but we can instead pass them to a different model, such as LightGBM (LGBM) or a support vector machine (SVM), which then acts as the classifier head.
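As a toy illustration of this idea (the arrays below are made up purely for the example, not taken from our data), any classical model can take the role of the classifier head once the encoded representations are available as plain arrays:

import numpy as np
from sklearn.svm import SVC

# Pretend these 16-dimensional rows are embeddings produced by an intermediate dense layer.
embeddings = np.random.rand(1000, 16)
labels = np.random.randint(0, 2, size=1000)

# An SVM acting as the classifier head instead of the final dense layer.
clf = SVC(probability=True)
clf.fit(embeddings, labels)

The rest of this blog shows how to obtain such embeddings from our actual credit default model.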

Model

We are using the neural network model that we defined in our previous blog. The network, along with the embedding layer, is shown below.

Neural Network with an embedding layer

The changes that need to be made are shown in the code below.

import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, InputLayer, Add, Concatenate, Dropout, BatchNormalization

LR_START = 0.01  # initial learning rate

def my_model(n_inputs=len(features)):
    """Neural network with a skip connection.
    Returns a compiled instance of tensorflow.keras.models.Model."""
    activation = 'swish'
    l1 = 1e-7
    l2 = 4e-4
    inputs = Input(shape=(n_inputs, ))
    x0 = BatchNormalization()(inputs)
    x0 = Dense(1024,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               )(x0)
    x0 = Dropout(0.1)(x0)
    x1 = Dense(64,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               )(x0)
    x2 = Dense(64,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               )(x1)
    x3 = Concatenate()([x2, x0])
    x3 = Dropout(0.1)(x3)
    x4 = Dense(16,
               kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
               activation=activation,
               name='embeddings')(x3)  # specify the name of the embedding layer
    x5 = Dense(1,
               activation='sigmoid',
               name='output')(x4)  # specify the name of the output layer
    model = Model(inputs, [x5, x4])  # add x4 as a second output to extract its embeddings
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=LR_START, clipvalue=0.5, clipnorm=1.0),
                  loss=tf.keras.losses.BinaryCrossentropy(),
                  loss_weights=[1., 0.0])  # zero weight: the embedding output does not contribute to the loss
    return model
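To quickly verify that the model now exposes both outputs, a small shape check like the following can help (the random batch below is only for illustration; it assumes the features list from our previous blog is available):

import numpy as np

model = my_model(n_inputs=len(features))
preds, embeds = model.predict(np.random.rand(4, len(features)), verbose=0)
print(preds.shape)   # (4, 1)  -> sigmoid predictions from the 'output' head
print(embeds.shape)  # (4, 16) -> embeddings from the 'embeddings' layer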

Let’s understand the changes we have made.

Note that we can extract embeddings from any of the layers in our neural network (x0, x1, x2, x3, …). Here is an example of how we extract the embeddings of layer x4.

To extract the features, we have to add the x4 variable to the outputs of the Model, as illustrated below.

x4 = Dense(16,
           kernel_regularizer=tf.keras.regularizers.L1L2(l1=l1, l2=l2),
           activation=activation,
           name='embeddings')(x3)  # specify the name of the embedding layer
x5 = Dense(1,
           activation='sigmoid',
           name='output')(x4)  # specify the name of the output layer
model = Model(inputs, [x5, x4])  # add x4 as a second output to extract its embeddings

Note that we have specified loss_weights when compiling the model so that the loss is computed only on the classifier output and not on the x4 embeddings (their weight is 0.0), as also highlighted by the red circle in the figure above.

model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=LR_START, clipvalue=0.5, clipnorm=1.0),
              loss=tf.keras.losses.BinaryCrossentropy(),
              loss_weights=[1., 0.0])  # weight 0.0 for the embedding output
return model
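As a side note, if retraining with a second output is not convenient, Keras can also build a standalone feature extractor from the named layer of an already trained network. A minimal sketch (X_scaled is a placeholder for any scaled feature matrix, not a variable from this blog):

# Map the original inputs to the activations of the layer named 'embeddings'.
extractor = tf.keras.Model(inputs=model.input,
                           outputs=model.get_layer('embeddings').output)

# One 16-dimensional embedding vector per row of X_scaled.
embeddings = extractor.predict(X_scaled, verbose=0)

In this blog, however, we stick with the two-output approach so the embeddings come out of the same predict call as the predictions.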

Training

Now that we have added the x4 embeddings to our outputs, we also have to tell the training callbacks to evaluate the model on the true-label output only, not on the embeddings; otherwise it can lead to a logical error.

import gc
import math
import datetime
import pickle
import numpy as np
from colorama import Fore, Style
from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler, QuantileTransformer, OneHotEncoder

# EPOCHS, EPOCHS_EXPONENTIALDECAY, CYCLES, LR_END, VERBOSE, BATCH_SIZE, USE_PLATEAU and DIAGRAMS
# are configuration constants defined earlier (see the referenced Kaggle notebook).

def fit_model(X_tr, y_tr, X_va=None, y_va=None, fold=0, run=0):
    '''Scale the data, train the model and validate it.
    Code ref: https://www.kaggle.com/code/ambrosm/amex-keras-quickstart-1-training
    '''
    global y_va_pred
    gc.collect()
    start_time = datetime.datetime.now()

    scaler = StandardScaler()          # scale the training data
    X_tr = scaler.fit_transform(X_tr)

    if X_va is not None:
        X_va = scaler.transform(X_va)  # scale the validation data
        validation_data = (X_va, [y_va, np.zeros((len(y_va), 16))])
    else:
        validation_data = None

    # Define the learning rate schedule and EarlyStopping
    if USE_PLATEAU and X_va is not None:  # use early stopping
        epochs = EPOCHS
        lr = ReduceLROnPlateau(monitor="val_output_loss", factor=0.7,  # monitor only the 'output' head
                               patience=4, verbose=VERBOSE)
        es = EarlyStopping(monitor="val_output_loss",  # stop training if results do not improve for 12 epochs
                           patience=12,
                           verbose=1,
                           mode="min",
                           restore_best_weights=True)
        callbacks = [lr, es, tf.keras.callbacks.TerminateOnNaN()]

    else:  # use exponential learning rate decay rather than early stopping
        epochs = EPOCHS_EXPONENTIALDECAY

        def exponential_decay(epoch):
            # v decays from e^a to 1 in every cycle
            # w decays from 1 to 0 in every cycle
            # epoch == 0                  -> w = 1 (first epoch of cycle)
            # epoch == epochs_per_cycle-1 -> w = 0 (last epoch of cycle)
            # higher a -> decay starts with a steeper decline
            a = 3
            epochs_per_cycle = epochs // CYCLES
            epoch_in_cycle = epoch % epochs_per_cycle
            if epochs_per_cycle > 1:
                v = math.exp(a * (1 - epoch_in_cycle / (epochs_per_cycle - 1)))
                w = (v - 1) / (math.exp(a) - 1)
            else:
                w = 1
            return w * LR_START + (1 - w) * LR_END

        lr = LearningRateScheduler(exponential_decay, verbose=0)
        callbacks = [lr, tf.keras.callbacks.TerminateOnNaN()]

    # Construct and compile the model
    model = my_model(X_tr.shape[1])

    # Train the model
    history = model.fit(X_tr, [y_tr, np.zeros((len(y_tr), 16))],  # dummy zero targets for the embedding output
                        validation_data=validation_data,
                        epochs=epochs,
                        verbose=VERBOSE,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        callbacks=callbacks)
    del X_tr, y_tr
    with open(f"scaler_{fold}.pickle", 'wb') as f:
        pickle.dump(scaler, f)         # save the scaler for inference
    model.save(f"model_{fold}")        # save the model weights
    history_list.append(history.history)
    callbacks, es, lr, history = None, None, None, None

    lastloss = f"Training loss: {history_list[-1]['loss'][-1]:.4f} | Val loss: {history_list[-1]['val_loss'][-1]:.4f}"

    # Inference for validation; [0] selects the 'output' head, not the embeddings
    y_va_pred = model.predict(X_va, batch_size=len(X_va), verbose=0)[0].ravel()

    # Evaluation: execution time, loss and metrics
    score = evaluation_metric(y_va, y_va_pred)
    print(f"{Fore.GREEN}{Style.BRIGHT}Fold {run}.{fold} | {str(datetime.datetime.now() - start_time)[-12:-7]}"
          f" | {len(history_list[-1]['loss']):3} ep"
          f" | {lastloss} | Score: {score:.5f}{Style.RESET_ALL}")
    score_list.append(score)

    if DIAGRAMS and fold == 0 and run == 0:
        # Plot training history
        plot_history(history_list[-1],
                     title="Learning curve",
                     plot_lr=True)

    # Scale the test data and predict; [0] again selects the 'output' head
    y_pred_list.append(model.predict(scaler.transform(test), batch_size=128 * 1024, verbose=0)[0].ravel())

The changes we need to make in our training code are:

  • Change the validation_data shape so that it provides a dummy (all-zeros) target for the embedding output, as defined below.
validation_data = (X_va, [y_va, np.zeros((len(y_va), 16))])
  • Change model.fit so that the training labels follow the same two-output shape.
history = model.fit(X_tr, [y_tr, np.zeros((len(y_tr), 16))],  # fit model
                    validation_data=validation_data,
                    epochs=epochs,
                    verbose=VERBOSE,
                    batch_size=BATCH_SIZE,
                    shuffle=True,
                    callbacks=callbacks)
  • Point the learning rate scheduler and the early stopping callback at val_output_loss (the validation loss of the 'output' head) so that the model is evaluated only on the true labels, not on the embeddings.
lr = ReduceLROnPlateau(monitor="val_output_loss", factor=0.7,  # scheduler monitors only the 'output' head
                       patience=4, verbose=VERBOSE)
es = EarlyStopping(monitor="val_output_loss",  # stop training if results do not improve for 12 epochs
                   patience=12,
                   verbose=1,
                   mode="min",
                   restore_best_weights=True)

Now everything is in place: during training and inference, the model returns the extracted embeddings alongside its predictions.
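For completeness, here is a minimal sketch of how fit_model could be driven by a cross-validation loop. The fold setup, the train and target dataframes and the global result lists below are assumptions based on the referenced Kaggle notebook, not code from this blog:

# Hypothetical driver loop, assuming `train` (feature dataframe), `target` (labels)
# and the `features` list are available from the earlier feature engineering steps.
history_list, score_list, y_pred_list = [], [], []

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (idx_tr, idx_va) in enumerate(kf.split(train, target)):
    X_tr, y_tr = train.iloc[idx_tr][features], target.iloc[idx_tr]
    X_va, y_va = train.iloc[idx_va][features], target.iloc[idx_va]
    fit_model(X_tr, y_tr, X_va, y_va, fold=fold)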

How to use the embeddings?

Using both automated and handcrafted features.

The embeddings generated by the neural network are in tabular format, so using them is as simple as concatenating them with the handcrafted features we created in our previous blogs, as sketched at the end of this section.

We can also extract embeddings from the GRU model we created here and use those features in our final model. Combining features that carry temporal information with the handcrafted features can be of high significance to the model.
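Here is a rough sketch of that final step. The handcrafted feature matrix X_handcrafted, the labels y, the fold index and the LightGBM parameters are assumptions for illustration; the scaler and model files are the ones saved by fit_model above:

import pickle
import numpy as np
import tensorflow as tf
from lightgbm import LGBMClassifier

fold = 0
# Load the scaler and the two-output model saved by fit_model.
with open(f"scaler_{fold}.pickle", "rb") as f:
    scaler = pickle.load(f)
model = tf.keras.models.load_model(f"model_{fold}")

# The second output of the model holds the 16-dimensional embeddings.
_, embeddings = model.predict(scaler.transform(X_handcrafted), verbose=0)

# Concatenate automated and handcrafted features and let LightGBM act as the classifier head.
X_combined = np.hstack([X_handcrafted, embeddings])
clf = LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X_combined, y)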

Conclusion

We now know that embeddings produced by a neural network can be a practical substitute for designing features by hand. By utilizing embeddings, we can enhance the effectiveness and accuracy of our machine learning models, ultimately producing better outcomes and adding more diversity to our pipeline.
