What’s Happening to Embeddings During Training?
A study on the spatial dynamics under different training strategies.
Introduction
As data scientists and machine learning engineers, many of us use embeddings, explicitly or implicitly, for predictive modeling or descriptive analysis in our day-to-day work. In most cases, we care about how embeddings make a model more accurate for a task, but we rarely take a closer look at the underlying process that generates those numbers. In this article, I’ll share some preliminary insights into what actually happens to embeddings by looking at how different training strategies affect their spatial characteristics.
The results are reproducible using this notebook.
What Are Embeddings?
In simple terms, embeddings are numerical representations that carry certain topological information about the entities in the data. One example is the word embeddings that encode the semantic similarity among words. In practice, embeddings are multidimensional vectors composed of decimal numbers that are randomly initialized and incrementally shaped while the model is being trained.
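For concreteness, here is a minimal, hypothetical sketch of an embedding lookup in Keras (the entity count and dimension are arbitrary): an embedding layer is just a trainable table of vectors indexed by integer IDs.

import numpy as np
import tensorflow as tf

# A trainable table of 1,000 vectors with 8 dimensions each, randomly initialized
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)

# Looking up integer IDs returns the corresponding rows of the table
vectors = embedding_layer(np.array([3, 42, 7]))
print(vectors.shape)  # (3, 8)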
Now the question is, how is the embedding space created during training?
Characterize the Spatial Properties
To start the study, we first need some indicators or metrics that characterize the spatial properties of embeddings. Among the limited literature, the following measures were selected:
- Gini Index: Measures inequality of values, i.e., if the information in the vector is concentrated in a few dimensions.
- Vector Entropy: Measures distributional uncertainty that reveals how uniformly the embedding uses its dimensions.
- Hoyer Sparsity: Measures how many dimensions are effectively utilized.
- Spectral Entropy: Gives a frequency-domain view that reflects how smooth or noisy the embedding is.
The Python implementations used in this work are presented below.
import numpy as np

def gini(x):
    """Compute the Gini coefficient of a vector."""
    x = np.abs(x.flatten()) + 1e-12  # Avoid division by zero
    x_sorted = np.sort(x)
    n = len(x)
    cumulative = np.cumsum(x_sorted)
    gini_coeff = (n + 1 - 2 * np.sum(cumulative) / cumulative[-1]) / n
    return gini_coeff

def vector_entropy(x):
    """Compute entropy of normalized absolute vector components."""
    x = np.abs(x.flatten()) + 1e-12
    p = x / np.sum(x)
    return -np.sum(p * np.log(p))

def hoyer_sparsity(x):
    """Compute Hoyer's sparsity measure of a vector."""
    x = np.abs(x.flatten())
    n = len(x)
    l1 = np.sum(x)
    l2 = np.sqrt(np.sum(x ** 2))
    if l1 == 0:
        return 0.0
    return (np.sqrt(n) - (l1 / l2)) / (np.sqrt(n) - 1)

def spectral_entropy(E):
    """Compute spectral entropy from an embedding matrix E (rows = vectors)."""
    # Covariance across embedding dimensions
    cov = np.cov(E, rowvar=False)
    # Eigenvalues of the covariance matrix
    eigvals = np.linalg.eigvalsh(cov)
    eigvals = np.clip(eigvals, a_min=1e-12, a_max=None)  # avoid log(0)
    p = eigvals / np.sum(eigvals)
    return -np.sum(p * np.log(p))
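As a quick sanity check (not part of the original study), these metrics behave as expected on toy vectors: a vector whose mass is concentrated in one dimension scores high on the Gini index and Hoyer sparsity and low on vector entropy, while a uniformly filled vector does the opposite.

dense = np.ones(8)                    # all dimensions used equally
sparse = np.array([5.0] + [0.0] * 7)  # one dominant dimension

for name, v in [('dense', dense), ('sparse', sparse)]:
    print(name, round(gini(v), 3), round(hoyer_sparsity(v), 3), round(vector_entropy(v), 3))
# dense  -> Gini ~0.0, Hoyer ~0.0, entropy ~2.08 (= ln 8)
# sparse -> Gini ~0.88, Hoyer ~1.0, entropy ~0.0

# Spectral entropy operates on a matrix of embeddings rather than a single vector
print(round(spectral_entropy(np.random.randn(100, 8)), 3))  # close to ln 8 for isotropic noise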
Peek into the Embedding Space During Training
Now that we have defined our embedding measurements, it’s time to see how they evolve during training. We’ll run a simple deep learning experiment: build a matrix factorization model on the MovieLens dataset and monitor how the embedding space changes.
Data Collection and Processing
First, the dataset was downloaded, processed, and split into training and test sets following an 80/20 ratio.
import numpy as np
import tensorflow_datasets as tfds
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Load the MovieLens 100K ratings dataset (only the 'train' split is available)
ratings = tfds.load('movielens/100k-ratings', split='train')
# Convert the dataset to a Pandas DataFrame
ratings_df = tfds.as_dataframe(ratings)
# Preprocess the DataFrame: Convert bytes to strings and then to integers
ratings_df['movie_id'] = ratings_df['movie_id'].str.decode('utf-8').astype(np.int64)
ratings_df['user_id'] = ratings_df['user_id'].str.decode('utf-8').astype(np.int64)
# Create LabelEncoders and fit on the DataFrame
user_encoder = LabelEncoder()
movie_encoder = LabelEncoder()
user_encoder.fit(ratings_df['user_id'])
movie_encoder.fit(ratings_df['movie_id'])
# Transform the IDs in the DataFrame
ratings_df['user_id_encoded'] = user_encoder.transform(ratings_df['user_id'])
ratings_df['movie_id_encoded'] = movie_encoder.transform(ratings_df['movie_id'])
# Binarize the rating: 1 if the user rated the movie 4 or higher, else 0
ratings_df['label'] = ratings_df['user_rating'].apply(lambda x: 1 if x >= 4 else 0)
# Split the DataFrame into train and test sets (80/20)
train_df, test_df = train_test_split(ratings_df, test_size=0.2, random_state=42)
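The model and training code below refer to num_users, num_movies, and the train/test arrays, which are built from the encoded columns. A minimal sketch of how they can be assembled (the notebook’s exact code may differ):

# Vocabulary sizes for the embedding layers
num_users = ratings_df['user_id_encoded'].nunique()
num_movies = ratings_df['movie_id_encoded'].nunique()

# Model inputs are [user_ids, movie_ids]; labels are the binarized ratings
X_train = [train_df['user_id_encoded'].values, train_df['movie_id_encoded'].values]
y_train = train_df['label'].values
X_test = [test_df['user_id_encoded'].values, test_df['movie_id_encoded'].values]
y_test = test_df['label'].values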
Model Creation
Next, the model was created. It’s a simple matrix-factorization-style model that predicts a user’s movie preference from the user and movie embeddings (here combined by concatenation and a dense sigmoid layer rather than a plain dot product).
import tensorflow as tf

def build_model(emb_dim=16, optimizer='adam'):
    tf.keras.backend.clear_session()

    # Define the embedding dimensions
    user_embedding_dim = emb_dim
    movie_embedding_dim = emb_dim

    # Create the user embedding layer
    user_embedding_layer = tf.keras.layers.Embedding(
        input_dim=num_users,
        output_dim=user_embedding_dim,
        name='user_embedding_layer'
    )
    # Create the movie embedding layer
    movie_embedding_layer = tf.keras.layers.Embedding(
        input_dim=num_movies,
        output_dim=movie_embedding_dim,
        name='movie_embedding_layer'
    )

    # Define the model
    input_user_id = tf.keras.Input(shape=(1,), dtype=tf.int64)
    input_movie_id = tf.keras.Input(shape=(1,), dtype=tf.int64)
    user_embedding = user_embedding_layer(input_user_id)
    movie_embedding = movie_embedding_layer(input_movie_id)

    # Concatenate the embeddings
    concatenated_embeddings = tf.keras.layers.Concatenate()([user_embedding, movie_embedding])

    # Add a dense layer for prediction
    output = tf.keras.layers.Dense(1, activation='sigmoid')(concatenated_embeddings)
    output = tf.keras.layers.Reshape((1,))(output)

    # Create the model
    model = tf.keras.Model(inputs=[input_user_id, input_movie_id], outputs=output)
    # Compile the model
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model
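As a quick check (assuming num_users and num_movies have been defined as sketched above), the model can be instantiated and inspected:

model = build_model(emb_dim=8, optimizer='adam')
model.summary()  # two Embedding layers, a Concatenate, and a single sigmoid output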
Experiment Setup
To understand the dynamics of embeddings, I monitored how embedding measurements fluctuate over 20 epochs when the model is trained with different optimizers and batch sizes. To make the comparison representative, I selected two fundamentally different optimizers: Stochastic Gradient Descent (SGD) and Adam. SGD is a straightforward optimizer known for its strong generalization capabilities, whereas Adam is a more sophisticated technique that achieves rapid convergence through adaptive learning rates. Additionally, I examined various batch sizes, ranging from 1 to 32, to assess their impact on training dynamics and embedding representations.
- Optimizers: SGD, Adam
- Batch sizes: 1, 8, 16, 32
- Metrics: training loss, test loss, Gini index, Hoyer sparsity, vector entropy, spectral entropy
- Number of epochs: 20
- Dimensions of embeddings: 8
emb_dim = 8
num_epochs = 20

results = []  # One row of metrics per (optimizer, batch size, epoch); one simple way to record them
for optimizer in ['adam', 'sgd']:
    for batch_size in [1, 8, 16, 32]:
        model = build_model(emb_dim=emb_dim, optimizer=optimizer)
        for epoch in range(num_epochs):
            print(optimizer, batch_size, epoch)
            history = model.fit(X_train, y_train, epochs=1, batch_size=batch_size, verbose=0)

            # Extract the current user and movie embedding tables
            user_embeddings = model.get_layer('user_embedding_layer').get_weights()[0]
            movie_embeddings = model.get_layer('movie_embedding_layer').get_weights()[0]
            embeddings = np.concatenate([user_embeddings, movie_embeddings])

            # Evaluate the model on the test set
            test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)

            # Per-vector metrics are averaged over all embedding vectors
            gini_score = round(float(np.mean(list(map(gini, embeddings)))), 4)
            hoyer_sparsity_score = round(float(np.mean(list(map(hoyer_sparsity, embeddings)))), 4)
            vector_entropy_score = round(float(np.mean(list(map(vector_entropy, embeddings)))), 4)
            spectral_entropy_score = round(float(spectral_entropy(embeddings)), 4)

            results.append({
                'optimizer': optimizer, 'batch_size': batch_size, 'epoch': epoch,
                'train_loss': round(history.history['loss'][-1], 4), 'test_loss': round(test_loss, 4),
                'gini': gini_score, 'hoyer_sparsity': hoyer_sparsity_score,
                'vector_entropy': vector_entropy_score, 'spectral_entropy': spectral_entropy_score,
            })
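For reference, the curves in Figures 1 and 2 can be drawn from the collected results with a plotting sketch along these lines (my sketch; the exact styling of the original figures may differ):

import pandas as pd
import matplotlib.pyplot as plt

results_df = pd.DataFrame(results)
batch_sizes = [1, 8, 16, 32]
metrics = ['test_loss', 'gini', 'hoyer_sparsity', 'vector_entropy', 'spectral_entropy']

# One figure per optimizer: rows = batch sizes, columns = metrics, x-axis = epoch
for optimizer in ['sgd', 'adam']:
    subset = results_df[results_df['optimizer'] == optimizer]
    fig, axes = plt.subplots(len(batch_sizes), len(metrics), figsize=(20, 12))
    for i, batch_size in enumerate(batch_sizes):
        run = subset[subset['batch_size'] == batch_size]
        for j, metric in enumerate(metrics):
            axes[i, j].plot(run['epoch'], run[metric])
            axes[i, j].set_title(f'{metric}, batch={batch_size}', fontsize=8)
    fig.suptitle(f'Losses and embedding metrics over epochs ({optimizer})')
    plt.tight_layout()
    plt.show()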
Results and Discussion
Figures 1 and 2 present how embeddings evolve for the SGD and Adam optimizers, respectively. In each figure, the rows represent the losses and embedding metrics over 20 epochs for a given batch size.
Comparing these results side by side, there are a few key takeaways to share:
- Gini Index and Hoyer Sparsity. Generally, the embeddings trained with SGD showed an increasing trend in both the Gini index and Hoyer sparsity. This is likely due to SGD’s stochastic nature: the noisy updates fail to equalize the utilization of the different embedding dimensions. In comparison, Adam showed the opposite trend, thanks to its momentum and adaptive per-dimension learning rates, which stabilize the updates and spread learning more evenly across dimensions over time.
- Vector Entropy and Spectral Entropy. For both SGD and Adam, vector entropy tended to be inversely correlated with the Gini index and Hoyer sparsity during training. This is expected, because vector entropy and the Gini index intrinsically point in opposite directions: one rewards uniform use of the dimensions, the other rewards concentration in a few of them. More interestingly, spectral entropy decreased over the epochs regardless of optimizer and batch size, indicating that every training strategy eventually converges towards a more structured variation pattern across dimensions, with the variance concentrating in fewer spectral components.
- Special case: batch size of 1. The Gini index and Hoyer sparsity behave quite differently when the batch size equals 1, for both SGD and Adam. For SGD, the extremely noisy updates caused the Gini index to decrease over time, which motivates a deeper exploration of the impact of batch size. For Adam, the first few epochs showed the same trend as SGD with a batch size of 1; however, the curves quickly came to resemble those of SGD with larger batch sizes, suggesting a potential connection between the two optimizers.
Final Thoughts
This study is a preliminary piece of work exploring the dynamics of embeddings during training under different strategies.
At this stage, many aspects are not fully understood and deserve a more rigorous theoretical interpretation. I look forward to a broader exploration and discussion of this topic to better understand the mechanisms driving the observed results.
Your feedback and ideas are welcome.