If you read my blog post from December 20 about answering questions from long passages using BERT, you know how excited I am about the impact BERT is having on natural language processing. BERT (Bidirectional Encoder Representations from Transformers), developed by Google, is a method of pre-training language representations that obtains state-of-the-art results on a wide array of natural language processing (NLP) tasks.
The BERT pre-trained models can be used for more than just question/answer tasks. They can also be used to determine how similar two sentences are to each other. In this post, I am going to show how to find these similarities using a measure known as cosine similarity. I do some very simple testing using 3 sentences that I have tokenized manually. If using a larger corpus, you will definitely want to tokenize the sentences with something like nltk.tokenize. The first two sentences (0 and 1) come from the same blog entry, while the third (2) comes from a separate blog entry. Sentences 0 and 1 should therefore be more similar to each other than either is to sentence 2 (I’ll explain later why I used the numbers 0–2 instead of 1–3). Let’s see if that is the case. The sentences are:
- BERT was developed by Google and Nvidia has created an optimized version that uses TensorRT
- One drawback of BERT is that only short passages can be queried
- I attended a conference in Denver
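For a larger corpus, typing token lists by hand is not practical. Here is a minimal sketch of automating it with only the standard library. The helper name simple_tokenize and the regular expression are my own assumptions for illustration; a real tokenizer such as nltk.tokenize.word_tokenize handles punctuation and contractions far better.

```python
import re

raw_sentences = [
    "BERT was developed by Google and Nvidia has created an optimized version that uses TensorRT",
    "One drawback of BERT is that only short passages can be queried",
    "I attended a conference in Denver",
]

def simple_tokenize(text, lowercase=True):
    """Split text into runs of letters and digits, optionally lowercased."""
    if lowercase:
        text = text.lower()
    return re.findall(r"[A-Za-z0-9]+", text)

# One token list per sentence, in the same order as raw_sentences.
tokenized = [simple_tokenize(s) for s in raw_sentences]
for tokens in tokenized:
    print(tokens)
```

Passing lowercase=False keeps the capital letters, which is useful when comparing the Uncased and Cased models.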
In this testing, I used the BERT-as-a-Service server and client. The testing was done two different ways. In the first, the BERT server and BERT client were both running on the same physical server. In the second, the BERT server was on a different physical server than the BERT client. Where the two differ, I will list the syntax needed for each.
Note: If running the BERT server in a Docker container on one physical server and the BERT client on a different physical server or PC, you must explicitly publish ports 5555 and 5556 for the container the BERT server is running in. If you do not, the client won’t be able to connect. To do this, add the -p switch to your docker run command (-p 5555:5555 -p 5556:5556). This is not necessary if the BERT server and BERT client are both running on the same physical server.
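If the client cannot connect, it helps to confirm from the client machine that both ports are reachable at all. The following is a rough sketch using only the standard library; the helper name port_is_open is hypothetical, and the IP address is just the example address used later in this post. A successful TCP connection only shows that something is listening on the port, not that it is the BERT server.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

if __name__ == "__main__":
    server_ip = "10.3.50.61"  # example address; replace with your BERT server's IP
    for port in (5555, 5556):
        state = "reachable" if port_is_open(server_ip, port) else "NOT reachable"
        print(f"{server_ip}:{port} is {state}")
```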
Installing the BERT Server and Client
The BERT server and client require TensorFlow version 1.10 or greater. I used version 1.14.0-rc0 in this testing. To install each, run the following:
pip install bert-serving-server bert-serving-client
pip install tensorflow-gpu==1.14.0-rc0
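Create a directory for the models to be saved in and download the models there. The server commands later in this post assume /home/username/bert_models.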
Get the pre-trained models (https://github.com/google-research/bert#pre-trained-models). I used BERT-Base, both the Uncased and Cased versions for testing.
- BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
- BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
unzip cased_L-12_H-768_A-12.zip
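Before starting the server, it can save time to confirm that the archive unpacked correctly. The sketch below checks a model directory for the files the BERT-Base archives contain (bert_config.json, vocab.txt, and the bert_model.ckpt.* checkpoint files). The function name missing_model_files is hypothetical, and the path in the example is the one assumed elsewhere in this post.

```python
from pathlib import Path

# Files shipped inside the BERT-Base zip archives.
EXPECTED_FILES = ("bert_config.json", "vocab.txt")
CHECKPOINT_PREFIX = "bert_model.ckpt"

def missing_model_files(model_dir):
    """Return the expected model files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    # The checkpoint is split across several files that share one prefix.
    if not any(root.glob(CHECKPOINT_PREFIX + ".*")):
        missing.append(CHECKPOINT_PREFIX + ".*")
    return missing

if __name__ == "__main__":
    # Path assumed from the commands in this post; adjust to your setup.
    print(missing_model_files("/home/username/bert_models/uncased_L-12_H-768_A-12"))
```

An empty list means nothing is missing.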
Starting the BERT Server
When starting the BERT server, you must specify which pre-trained model to use with the -model_dir switch. The -num_worker switch sets how many workers the server runs, which determines how many requests it can process concurrently. The -show_tokens_to_client switch is not necessary, but I use it here so the encoded sentences can be seen.
If using the Uncased model, run:
bert-serving-start -model_dir /home/username/bert_models/uncased_L-12_H-768_A-12/ -num_worker=1 -show_tokens_to_client
If using the Cased model, run:
bert-serving-start -model_dir /home/username/bert_models/cased_L-12_H-768_A-12/ -num_worker=1 -show_tokens_to_client
When the server is started, you will see something similar to this:
Connecting to the BERT Server with the BERT Client and Finding Sentence Similarity
The following script (BERT_sentence_similarity.py) can be found on my GitHub at https://github.com/pacejohn/BERT-Cosine-Similarities.
# Connect to a BERT server from a BERT client and determine the
# cosine similarity between 2 sentences

# Imports
from bert_serving.client import BertClient
from sklearn.metrics.pairwise import cosine_similarity

# Uncomment the following line if the BERT server is running
# locally (on the same physical server that the client will be
# running on).
#client = BertClient()

# Use the following line if the BERT server is running
# remotely (on a different physical server than the client will be
# running on).
# You must specify the IP of the remote server and the ports
client = BertClient(ip='10.3.50.61', port=5555, port_out=5556)

# Save tokenized sentences to variables. This makes it easier later.
# I started the numbering at 0 rather than 1 so it matches the
# indexes of the arrays that are created when the encoding happens.
sentence0 = ['bert', 'was', 'developed', 'by', 'google', 'and', 'nvidia', 'has', 'created', 'an', 'optimized', 'version', 'that', 'uses', 'tensorrt']
sentence1 = ['one', 'drawback', 'of', 'bert', 'is', 'that', 'only', 'short', 'passages', 'can', 'be', 'queried']
sentence2 = ['i', 'attended', 'a', 'conference', 'in', 'denver']
all_sentences = [sentence0, sentence1, sentence2]

# Specify which 2 sentences to compare.
first_sentence = 0
second_sentence = 1

# Encode the sentences using the BERT client. With show_tokens=True
# (and the server started with -show_tokens_to_client), encode()
# returns a tuple: the embedding array (one row per sentence) and
# the tokens the server used for each sentence.
embeddings, tokens = client.encode(all_sentences, show_tokens=True, is_tokenized=True)

# If you print 'tokens', it will show the encoded sentences. This can
# be interesting because it shows which words it did not recognize.
# They are denoted by [UNK].

# Calculate cosine similarity between the 2 sentences you specified
# and display it
cos_sim = cosine_similarity(embeddings[first_sentence].reshape(1, -1),
                            embeddings[second_sentence].reshape(1, -1))

# Show the sentences and their cosine similarity, but leave off the [CLS] at the beginning and [SEP] at the end
print("\n\nThe cosine similarity between sentence:\n" + ' '.join(all_sentences[first_sentence]) + "\n\nand sentence:\n\n" + ' '.join(all_sentences[second_sentence]) + "\nis " + str(cos_sim))

# Show the encoded versions of the sentences for comparison
print("\n******\nThe encoded sentences are")
print(tokens)
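The script compares one pair of sentences per run. To make the measure itself concrete, here is a small sketch that writes cosine similarity out by hand (the dot product of two vectors divided by the product of their lengths) and applies it to every pair at once. The vectors are made-up 3-dimensional stand-ins, not real BERT output (BERT-Base vectors have 768 dimensions), and the helper names cosine and pairwise_similarities are my own.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarities(vectors):
    """Map each index pair (i, j) with i < j to the cosine similarity of its vectors."""
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }

# Made-up stand-ins for three sentence embeddings.
toy_embeddings = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],  # points the same way as the first vector
    [0.0, 0.0, 3.0],  # perpendicular to the other two
]

similarities = pairwise_similarities(toy_embeddings)
for (i, j), sim in similarities.items():
    print(f"sentence {i} vs sentence {j}: {sim:.4f}")
```

With real data, the rows of the embedding array returned by the BERT client would take the place of toy_embeddings. scikit-learn's cosine_similarity also returns the full pairwise matrix when given a single 2-D array.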
As a final note, Tables 1 to 4 below show the differences in the cosine similarity between the sentences when capital letters are used or not used and when the Uncased or Cased model is used. In my next post, I will discuss these differences in more detail.
I welcome your feedback and suggestions. Be sure to follow me on Twitter @pacejohn, on LinkedIn (https://www.linkedin.com/in/john-pace-phd-20b87070/), and check out my blog at https://www.ironmanjohn.com/.
Comparison of Cosine Similarities Using BERT Uncased and Cased Models