Embeddings Merge 101: A Practical Guide to Merging Embeddings

Juan Martinez · Published in MantisNLP · 6 min read · Aug 7, 2024

Have you identified several fill-mask models (also known as masked language models or embedding models) that complement each other, and you want to use all of them to train, for example, a classifier on top? Easy-peasy.

The conditions you need to meet are:

  • For a simplified approach, the models must share the same architecture (say, both roberta-base or both bert-large-cased) and the same embedding dimensions.
  • However, if you are invested in merging models from different families (e.g., textual and image embeddings, or BERT and RoBERTa), take a look at the complex approach below.

Simplified approach: Merging models of the same architecture

Combining two fill-mask models into one in Hugging Face generally involves a few steps:

  1. Load the Models: Load the two models you want to combine.
  2. Combine the Embeddings: Merge the embeddings from both models.
  3. Create a New Model: Integrate the combined embeddings into a new model.

Step 1: Load the Models

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name_1 = 'model_name_1'
model_name_2 = 'model_name_2'

# Load the models
model1 = AutoModelForMaskedLM.from_pretrained(model_name_1)
model2 = AutoModelForMaskedLM.from_pretrained(model_name_2)

# Load the tokenizers
tokenizer1 = AutoTokenizer.from_pretrained(model_name_1)
tokenizer2 = AutoTokenizer.from_pretrained(model_name_2)
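Before combining anything, it can be worth sanity-checking that the two tokenizers are compatible, since the merging techniques below assume that the rows of both embedding matrices refer to the same tokens. Here is a minimal sketch using the tokenizers loaded above (an extra check, not part of the core recipe):

# Quick compatibility check: compare the two vocabularies
vocab1 = tokenizer1.get_vocab()
vocab2 = tokenizer2.get_vocab()

shared_tokens = set(vocab1) & set(vocab2)
print(f"Vocab sizes: {len(vocab1)} vs {len(vocab2)}, shared tokens: {len(shared_tokens)}")

# For the simplified approach you ideally want identical vocabularies,
# with every token mapped to the same ID in both tokenizers
print("Identical vocabularies:", vocab1 == vocab2)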

Step 2: Combine the Embeddings

You need to ensure that both models have compatible tokenizers, or you will have to handle token mappings yourself. There are several techniques for combining the embeddings:

Averaging Embeddings

The simplest technique is to average the two embedding matrices: sum them element-wise and divide by two.

import torch

# Get embeddings from both models
embeddings1 = model1.base_model.embeddings.word_embeddings.weight
embeddings2 = model2.base_model.embeddings.word_embeddings.weight

# Ensure the dimensions match
assert embeddings1.shape == embeddings2.shape, "Embedding dimensions do not match!"

# Combine embeddings (e.g., by averaging)
combined_embeddings = (embeddings1 + embeddings2) / 2

Concatenating the embeddings

Concatenating the embeddings from two models increases the dimensionality of the resulting embeddings, which may capture more information from both models.

import torch

# Concatenate embeddings along the hidden dimension (this doubles the embedding size)
combined_embeddings = torch.cat((embeddings1, embeddings2), dim=-1)

# Adjust the embedding layer of the new model (created in Step 3) to match the new dimension.
# Note: the rest of the model must also be built with this larger hidden size, so create
# the new model from a config whose hidden_size equals the concatenated dimension.
new_model.config.hidden_size = combined_embeddings.shape[-1]
new_model.base_model.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(combined_embeddings.detach())

Weighted Sum

Use a weighted sum to combine the embeddings. You can learn the weights during training or set them manually.

alpha = 0.7  # Weight for the first model
beta = 0.3   # Weight for the second model

# Combine embeddings with weights
combined_embeddings = alpha * embeddings1 + beta * embeddings2

# Set the combined embeddings to the new model (detach so the new weight is a fresh leaf tensor)
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings.detach())
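If you would rather learn the weights instead of fixing them manually, one option (a minimal sketch, not from the original recipe) is to wrap them in nn.Parameter inside a small module and train them jointly with your downstream task:

import torch
import torch.nn as nn

class WeightedCombiner(nn.Module):
    def __init__(self):
        super(WeightedCombiner, self).__init__()
        # Learnable mixing weights, initialized to an even 50/50 split
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, emb1, emb2):
        return self.alpha * emb1 + self.beta * emb2

combiner = WeightedCombiner()
combined_embeddings = combiner(embeddings1, embeddings2)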

Linear Transformations

Use a linear transformation to combine the embeddings. This approach allows learning a transformation matrix during training.

import torch.nn as nn

class LinearCombiner(nn.Module):
    def __init__(self, embedding_dim):
        super(LinearCombiner, self).__init__()
        self.transform = nn.Linear(embedding_dim * 2, embedding_dim)

    def forward(self, emb1, emb2):
        combined = torch.cat((emb1, emb2), dim=-1)
        return self.transform(combined)

# Note: the combiner is randomly initialized here; it should be trained jointly with the downstream task
combiner = LinearCombiner(embeddings1.shape[-1])
combined_embeddings = combiner(embeddings1, embeddings2)

# Set the combined embeddings to the new model (detach so the new weight is a fresh leaf tensor)
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings.detach())

Use a Neural Network to combine them

Use a small neural network to learn how to combine the embeddings.

class EmbeddingCombiner(nn.Module):
    def __init__(self, embedding_dim):
        super(EmbeddingCombiner, self).__init__()
        self.fc1 = nn.Linear(embedding_dim * 2, embedding_dim)
        self.fc2 = nn.Linear(embedding_dim, embedding_dim)
        self.relu = nn.ReLU()

    def forward(self, emb1, emb2):
        combined = torch.cat((emb1, emb2), dim=-1)
        combined = self.relu(self.fc1(combined))
        return self.fc2(combined)

# Note: the combiner is randomly initialized here; it should be trained jointly with the downstream task
combiner = EmbeddingCombiner(embeddings1.shape[-1])
combined_embeddings = combiner(embeddings1, embeddings2)

# Set the combined embeddings to the new model (detach so the new weight is a fresh leaf tensor)
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings.detach())

Neural network + Attention Mechanism

Use an attention mechanism to learn how to combine embeddings. This method allows the model to weigh the importance of each embedding dynamically.

class AttentionCombiner(nn.Module):
    def __init__(self, embedding_dim):
        super(AttentionCombiner, self).__init__()
        self.attention = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=1)

    def forward(self, emb1, emb2):
        # Treat the two embeddings as a sequence of length 2: shape (2, vocab_size, embedding_dim)
        combined = torch.stack((emb1, emb2), dim=0)
        attention_output, _ = self.attention(combined, combined, combined)
        return torch.mean(attention_output, dim=0)

# Note: the combiner is randomly initialized here; it should be trained jointly with the downstream task
combiner = AttentionCombiner(embeddings1.shape[-1])
combined_embeddings = combiner(embeddings1, embeddings2)

# Set the combined embeddings to the new model (detach so the new weight is a fresh leaf tensor)
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings.detach())

Step 3: Create a New Model

Create a new model with the combined embeddings. We can do this by initializing a new model and replacing its embeddings with the combined ones.

from transformers import BertConfig, BertForMaskedLM

# Use a configuration from one of the models or create a new one
config = BertConfig.from_pretrained(model_name_1)

# Create a new model
new_model = BertForMaskedLM(config)

# Replace the embeddings with the combined embeddings (detach so the new weight is a fresh leaf tensor)
new_model.base_model.embeddings.word_embeddings.weight = torch.nn.Parameter(combined_embeddings.detach())

Step 4: Save and Load the New Model

Save the model so you can load it later as needed.

# Save the new model and tokenizer
new_model.save_pretrained('path_to_combined_model')
tokenizer1.save_pretrained('path_to_combined_model')

Loading and Using the New Model

Now you can load and use the new model as usual.

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the new model and tokenizer
new_model = AutoModelForMaskedLM.from_pretrained('path_to_combined_model')
tokenizer = AutoTokenizer.from_pretrained('path_to_combined_model')

# Example usage (use the mask token defined by the tokenizer: [MASK] for BERT, <mask> for RoBERTa)
input_text = f"This is a {tokenizer.mask_token} example."
inputs = tokenizer(input_text, return_tensors='pt')
outputs = new_model(**inputs)
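As a small follow-up (a sketch, assuming a single mask token in the input), you can decode the model's top prediction for the masked position:

import torch

# Find the position of the mask token and take the highest-scoring prediction
mask_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))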

Complex approach: What if the embeddings come from different families?

Combining embeddings from different model families, such as textual and image embeddings, or a model like roberta-base with bert-base-uncased, can be more challenging than combining embeddings from the same family. Different architectures may have different embedding dimensions, tokenization strategies, and even pre-training objectives. However, it's not impossible.

The steps now additionally include:

  1. Tokenization Alignment: Different models often have different tokenizers. To combine embeddings, you need to align the tokenization strategies. One approach is to use a unified tokenizer that works with both models, but this can be complex.
  2. Embedding Dimension Alignment: If the embedding dimensions of the two models are different, you’ll need to align them. This can be done using techniques like linear transformation, zero-padding, or projection to a common space.

Here’s a more detailed example that combines embeddings from roberta-base and bert-base-uncased:

Step 1: Load the Models and Tokenizers

from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the models
model_name_1 = 'roberta-base'
model_name_2 = 'bert-base-uncased'

model1 = AutoModelForMaskedLM.from_pretrained(model_name_1)
model2 = AutoModelForMaskedLM.from_pretrained(model_name_2)

# Load the tokenizers
tokenizer1 = AutoTokenizer.from_pretrained(model_name_1)
tokenizer2 = AutoTokenizer.from_pretrained(model_name_2)

Step 2: Tokenization

You need to ensure the tokens from both tokenizers align. One way is to tokenize the input with both tokenizers and handle the alignment manually.

input_text = "This is a [MASK] example."

tokens1 = tokenizer1.tokenize(input_text)
tokens2 = tokenizer2.tokenize(input_text)

# Convert tokens to IDs
ids1 = tokenizer1.convert_tokens_to_ids(tokens1)
ids2 = tokenizer2.convert_tokens_to_ids(tokens2)

# Ensure alignment, e.g., by padding or truncating
max_length = max(len(ids1), len(ids2))
ids1 = ids1 + [tokenizer1.pad_token_id] * (max_length - len(ids1))
ids2 = ids2 + [tokenizer2.pad_token_id] * (max_length - len(ids2))

Step 3: Get Embeddings

Retrieve the embeddings from both models.

import torch

# Get embeddings
embeddings1 = model1.roberta.embeddings.word_embeddings.weight
embeddings2 = model2.bert.embeddings.word_embeddings.weight

Step 4: Align Embedding Dimensions

If the embedding dimensions differ, use a linear layer to project them to a common dimension.

import torch.nn as nn

# Assuming embeddings1 and embeddings2 have different dimensions
# (note: roberta-base and bert-base-uncased both use a hidden size of 768,
# so this projection step is shown for illustration)
dim1 = embeddings1.size(1)
dim2 = embeddings2.size(1)
common_dim = max(dim1, dim2)

# Linear layers to project to the common dimension
# (randomly initialized here; in practice they should be learned during fine-tuning)
linear1 = nn.Linear(dim1, common_dim)
linear2 = nn.Linear(dim2, common_dim)

projected_embeddings1 = linear1(embeddings1)
projected_embeddings2 = linear2(embeddings2)
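Zero-padding, mentioned above as an alternative to a learned projection, can also bring the smaller matrix up to the larger dimension without adding trainable parameters. A minimal sketch (assuming dim2 <= dim1):

import torch.nn.functional as F

# Pad the smaller embedding matrix on the right of its last dimension
padded_embeddings2 = F.pad(embeddings2, (0, dim1 - dim2))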

Step 5: Combine the Embeddings

Combine the projected embeddings using a chosen technique (e.g., concatenation, weighted sum).

# Combine embeddings, for example, by averaging.
# Note: element-wise averaging assumes both embedding matrices have the same number of rows
# (vocabulary size); roberta-base (~50k tokens) and bert-base-uncased (~30k tokens) differ,
# so in practice you first need to map both vocabularies to a shared one
# (see the tokenization-alignment step above).
combined_embeddings = (projected_embeddings1 + projected_embeddings2) / 2

# Create a new embedding layer (detach so the new weight is a fresh leaf tensor)
new_embedding_layer = nn.Embedding.from_pretrained(combined_embeddings.detach())

Step 6: Integrate into a New Model

Integrate the combined embeddings into a new model architecture.

from transformers import BertConfig, BertForMaskedLM

# Create a new configuration
config = BertConfig.from_pretrained(model_name_2)

# Initialize a new model
new_model = BertForMaskedLM(config)

# Replace the embeddings with the combined embeddings
new_model.bert.embeddings.word_embeddings = new_embedding_layer

# Save the new model
new_model.save_pretrained('path_to_combined_model')
tokenizer2.save_pretrained('path_to_combined_model')

Last Note

Regardless of the approach you have followed, it's crucial to fine-tune your model on a relevant dataset, to make sure it's fully aligned with your new data.
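For example, if your end goal is the classifier mentioned at the start, a minimal fine-tuning sketch with the Trainer API could look like this (train_dataset and eval_dataset are placeholders for your own tokenized, labeled dataset; num_labels=2 is an assumption):

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Load the combined model with a classification head on top
model = AutoModelForSequenceClassification.from_pretrained('path_to_combined_model', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('path_to_combined_model')

training_args = TrainingArguments(
    output_dir='finetuned_combined_model',
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: your tokenized training split
    eval_dataset=eval_dataset,    # placeholder: your tokenized validation split
)

trainer.train()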

Enjoy and don’t forget to drop a like ❤️!

Need help?

At Mantis, our experienced team of NLP engineers is ready to help. If you have any NLP-related questions, reach out to us at hi@mantisnlp.com.
