1 line to BioBERT Word Embeddings with NLU in Python
Including Part of Speech, Named Entity Recognition, Emotion Classification in the same line! With Bonus t-SNE plots!
0. Introduction
0.1 What is NLU?
John Snow Labs NLU library gives you 1000+ NLP models and 100+ Word Embeddings in 300+ languages and infinite possibilities to explore your data and gain insights.
In this tutorial, we will cover how to get the powerful BioBERT Embeddings with 1 line of NLU code and then how to visualize them with t-SNE. We will compare sentiment with sarcasm and emotions!
0.2 What is t-SNE?
t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
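To make the objective a little more concrete, here is a minimal sketch of the Kullback-Leibler divergence between two discrete distributions in plain Python. The function name `kl_divergence` and the toy distributions are made up for illustration. t-SNE itself evaluates this quantity over pairwise similarity distributions, which this sketch does not construct.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions (illustrative values only)
p = [0.5, 0.3, 0.2]
q_close = [0.45, 0.35, 0.2]
q_far = [0.1, 0.1, 0.8]

print(kl_divergence(p, p))        # identical distributions
print(kl_divergence(p, q_close))  # small mismatch
print(kl_divergence(p, q_far))    # large mismatch
```

The divergence is zero when the two distributions match and grows as they drift apart, which is why minimizing it pulls the low-dimensional layout toward the neighborhood structure of the original data.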
0.3 What is BioBERT?
Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
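The improvements quoted above are reported as F1 score and MRR (mean reciprocal rank). As a rough sketch of how these two metrics are commonly defined, the snippet below computes both in plain Python. The helper names and the toy counts are hypothetical and are not taken from the BioBERT evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank of the first correct answer per question.

    A rank of None means no correct answer was returned and scores 0.
    """
    scores = [0.0 if r is None else 1.0 / r for r in first_correct_ranks]
    return sum(scores) / len(scores)

# Toy numbers (illustrative only)
toy_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
toy_mrr = mean_reciprocal_rank([1, 2, None, 4])
print(toy_f1, toy_mrr)
```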
1. Import NLU, load BioBERT, and embed a sample string in 1 line
import nlu
# Load BioBERT and embed a sample string in one line
nlu.load('biobert').predict('He was surprised by the diversity of NLU')
2. Load a larger dataset
The following snippet downloads a Reddit sarcasm dataset and loads it into a pandas DataFrame.
import pandas as pd

# Download the dataset (the leading "!" runs a shell command inside a notebook)
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp

# Load the dataset into pandas
df = pd.read_csv('/tmp/train-balanced-sarcasm.csv')
df
3. Predict on the dataset and also add Part of Speech, Emotion and Sentiment Classifiers
Since adding additional classifiers and getting their predictions is so easy in NLU, we will extend our NLU pipeline with a POS, Emotion, and Sentiment classifier which all achieve results close to the state of the art.
Those extra predictions will also come in handy when plotting our results.
We limit ourselves to a subset of the dataset because our RAM is sadly limited and we are not running on a cluster. With Spark NLP, you can take exactly the same models and run them in a scalable fashion inside a Spark cluster.
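pipe = nlu.load('pos sentiment emotion biobert')
df['text'] = df['comment']

# Predict on a subset of the rows only, to stay within RAM limits
# (500 is an arbitrary size chosen here; adjust it to your machine)
# output_level='token' makes NLU return one row per embedded word
predictions = pipe.predict(df[['text','label']].iloc[0:500], output_level='token')
predictions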
4. Emotion Plots
We can quickly plot the distribution of predicted emotions using pandas functions on the DataFrame.
# Some tokens are None, which we must drop first
predictions.dropna(how='any', inplace=True)
# Some emotions are 'na', which we must drop as well
predictions = predictions[predictions.emotion != 'na']
predictions.emotion.value_counts().plot.bar(title='Dataset emotion distribution')
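To see what the two cleanup steps do before plotting, here is a small sketch on a made-up DataFrame. The `toy` frame and its values are a hypothetical stand-in for the NLU output, which has many more columns.

```python
import pandas as pd

# Hypothetical stand-in for the NLU output: one row per token
toy = pd.DataFrame({
    "token":   ["great", None, "sure", "ok"],
    "emotion": ["joy", "joy", "na", "sadness"],
})

toy = toy.dropna(how="any")          # drops the row whose token is None
toy = toy[toy.emotion != "na"]       # drops the row with the 'na' placeholder
counts = toy.emotion.value_counts()  # the same call the bar plot is built from
print(counts)
```

Only the rows with a real token and a real emotion label survive, so the bar chart counts actual predictions rather than placeholders.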
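5. Prepare data for t-SNE
We prepare the data for the t-SNE algorithm by collecting the embedding vectors in a matrix with one row per token.
import numpy as np

# A plain 2-D array is used because np.matrix is discouraged by NumPy
# and rejected by recent scikit-learn releases
mat = np.array([x for x in predictions.biobert_embeddings])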
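scikit-learn's t-SNE expects a 2-D array with one row per sample. As a minimal sketch of that shape requirement, the snippet below stacks a few made-up vectors with `np.vstack`. The vectors and their length are hypothetical; real BioBERT vectors are much longer.

```python
import numpy as np

# Hypothetical per-token vectors, standing in for the embeddings column
toy_embeddings = [
    np.array([0.1, 0.2, 0.3, 0.4]),
    np.array([0.5, 0.6, 0.7, 0.8]),
    np.array([0.9, 1.0, 1.1, 1.2]),
]

# Rows are tokens, columns are embedding dimensions
toy_mat = np.vstack(toy_embeddings)
print(toy_mat.shape)  # (3, 4)
```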
6. Fit t-SNE
Finally, we fit the t-SNE algorithm and get a 2-dimensional representation of our BioBERT word embeddings.
from sklearn.manifold import TSNE

model = TSNE(n_components=2)
low_dim_data = model.fit_transform(mat)
print('Lower dim data has shape', low_dim_data.shape)
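Because the t-SNE cost function is not convex, two runs can produce different layouts. One common way to make a run repeatable is to fix `random_state`. The sketch below does this on random toy vectors rather than real embeddings; the data, the perplexity value, and the seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# 30 fake "tokens" with 8 dimensions each (random data, not real embeddings)
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(30, 8))

# perplexity must be smaller than the number of samples;
# random_state pins the random initialization
toy_model = TSNE(n_components=2, perplexity=5, init="random", random_state=42)
toy_low_dim = toy_model.fit_transform(toy_vectors)
print('Lower dim data has shape', toy_low_dim.shape)
```

Passing the same `random_state` to the tutorial's `TSNE(n_components=2)` call should help when you want to compare the plots from different sessions.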
7. Plot BioBERT Word Embeddings, colored by Part of Speech Tag
The following plot shows a scatter plot of the 2-D representation of the word embeddings. Each point represents a word in a sentence, and the color represents the POS class that word belongs to.
import seaborn as sns

tsne_df = pd.DataFrame(low_dim_data, predictions.pos)
tsne_df.columns = ['x','y']
ax = sns.scatterplot(data=tsne_df, x='x', y='y', hue=tsne_df.index)
ax.set_title('T-SNE BIOBERT Embeddings, colored by Part of Speech Tag')
8. Plot BioBERT Word Embeddings, colored by Emotion
The following plot shows a scatter plot of the 2-D representation of the word embeddings. Each point represents a word from a sentence that was classified with a particular emotion, which is reflected in the color.
tsne_df = pd.DataFrame(low_dim_data, predictions.emotion)
tsne_df.columns = ['x','y']
ax = sns.scatterplot(data=tsne_df, x='x', y='y', hue=tsne_df.index)
ax.set_title('T-SNE BIOBERT Embeddings, colored by Emotion')
9. Plot BioBERT Word Embeddings, colored by Sarcasm
The following plot shows a scatter plot of the 2-D representation of the word embeddings. Each point represents a word from a sentence that was labeled as sarcastic or not, which is reflected in the color.
tsne_df = pd.DataFrame(low_dim_data, predictions.label.replace({1:'sarcasm',0:'normal'}))
tsne_df.columns = ['x','y']
ax = sns.scatterplot(data=tsne_df, x='x', y='y', hue=tsne_df.index)
ax.set_title('T-SNE BIOBERT Embeddings, colored by Sarcasm label')
10. Plot BioBERT Word Embeddings, colored by Sentiment
The following plot shows a scatter plot of the 2-D representation of the word embeddings. Each point represents a word from a sentence that was classified with a particular sentiment, which is reflected in the color.
tsne_df = pd.DataFrame(low_dim_data, predictions.sentiment)
tsne_df.columns = ['x','y']
ax = sns.scatterplot(data=tsne_df, x='x', y='y', hue=tsne_df.index)
ax.set_title('T-SNE BIOBERT Embeddings, colored by Sentiment')
11. There are many more word embeddings!
To view all word embeddings, run the following command:
nlu.print_all_model_kinds_for_action('embed')
More NLU Medium articles
- Introduction to NLU
- One line of Python code for 6 Embeddings, BERT, ALBERT, ELMO, ELECTRA, XLNET, GLOVE, Part of Speech with NLU and t-SNE
- One-Line Bert Embeddings and t-SNE plots with NLU
- Easy sentence similarity with BERT Sentence Embeddings using John Snow Labs NLU
NLU Talks
- NLP Summit 2020: John Snow Labs NLU: The simplicity of Python, the power of Spark NLP
- John Snow Labs NLU: Become a Data Science Superhero with One Line of Python code
- Language Detection and Multi-lingual Text Mining in Spark NLP
More about NLU
- NLU website
- NLU Github
- NLU Documentation
- Have questions or want to share an idea? Join us on Slack!
- Overview of all NLU example notebooks
- Named Entity Recognition (NER) 18 class notebook
- Part of Speech (POS) notebook
- BERT Word Embeddings and T-SNE plotting notebook
- ALBERT Word Embeddings and T-SNE plotting notebook
- ELMO Word Embeddings and T-SNE plotting notebook
- XLNET Word Embeddings and T-SNE plotting notebook
- Spellchecking
- Typed Dependency Parsing notebook