How to Use Arabic Word2Vec Word Embedding with LSTM for Sentiment Analysis Task

Word Embedding

Waad Thuwaini Alshammari
Sep 26, 2021 · 5 min read

Word embedding is an approach for learning words and their relative meanings from a corpus of text and representing each word as a dense vector. A word vector is the projection of the word into a continuous feature vector space; see Figure 1 (A) for clarity. Words that have similar meanings should be close together in the vector space, as illustrated in Figure 1 (B).

Figure 1 (A): Visualizing high-dimensional Word2vec word embeddings [1].
Figure 1 (B): Word2vec word embeddings show that words with similar meanings lie close together [2].
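
To make "close together in the vector space" concrete, closeness is usually measured with cosine similarity between word vectors. The toy sketch below illustrates this with three made-up 3-dimensional vectors (real Word2vec vectors typically have 100-300 dimensions and come from a trained model):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two vectors: close to 1 means similar direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up toy "embeddings" for illustration only
v_happy = np.array([0.9, 0.1, 0.3])
v_glad = np.array([0.8, 0.2, 0.4])
v_table = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(v_happy, v_glad))   # high similarity (related meaning)
print(cosine_similarity(v_happy, v_table))  # lower similarity (unrelated meaning)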

Word2Vec

Word2vec is one of the most popular word embedding models in NLP. Word2vec comes in two variants, the Continuous Bag-of-Words model (CBOW) and the Continuous Skip-gram model [3]; the model architectures are shown in Figure 2. CBOW predicts a word from its given context, whereas Skip-gram predicts the context from a given word, which increases the computational complexity [3].

Figure 2: The CBOW and Skip-gram architectures [3].
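
To see concretely what each variant predicts, the sketch below lists the training pairs the two models would build from one toy sentence with a context window of 1 (the sentence and window size are chosen only for illustration):

sentence = ['the', 'weather', 'is', 'nice', 'today']
window = 1

cbow_pairs = []       # CBOW: (context words) -> target word
skipgram_pairs = []   # Skip-gram: target word -> one context word at a time

for i, target in enumerate(sentence):
    left = max(0, i - window)
    right = min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(left, right) if j != i]
    cbow_pairs.append((context, target))
    for c in context:
        skipgram_pairs.append((target, c))

print('CBOW pairs:', cbow_pairs)
print('Skip-gram pairs:', skipgram_pairs)

Because every target word yields several (target, context) pairs, Skip-gram produces more training examples than CBOW, which is where the extra computational cost comes from.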

How to use a pretrained Arabic word embedding as an embedding layer

In this tutorial, we will look at how to use a pre-trained word embedding for a sentiment analysis problem with an LSTM. AraVec is an open-source pre-trained Word2vec project [4]; it was trained on a large Arabic text corpus containing more than 1,169,075,128 tokens. First, import the required libraries.

import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# numpy is needed for array() here and for np.zeros() later
import numpy as np
from numpy import array
import gensim
from gensim.models import KeyedVectors
from gensim.models import word2vec

Next, define a small sentiment analysis problem that contains 8 examples. Each sentence is labeled as either positive (1) or negative (0) according to its polarity. We will define the sentences and labels:

docs = ['كان يوم سعيد',
        'ماشاءهللا عمل جيد',
        'ممتاز',
        'عمل مكتمل',
        'اعتقد بانه ضعيف',
        'يوجد ثغرات ونقاط ضعف',
        'ليس جيدا',
        'كان عمل متعب']
# define class labels
labels = array([1, 1, 1, 1, 0, 0, 0, 0])

Then, encode the documents, since Keras requires the input data to be integers. Thus, the embedding layer takes a sequence of integers as input. The Tokenizer() class splits each sentence into tokens, and texts_to_sequences() converts each word to an integer. Since the sequences have different lengths, pad_sequences() pads all sequences to the given maxlen, which in this example equals 4.

# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print('encoded_docs:\n', encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print('padded_docs:\n', padded_docs)

Output:

encoded_docs:
 [[2, 3, 4], [5, 1, 6], [7], [1, 8], [9, 10, 11], [12, 13, 14, 15], [16, 17], [2, 1, 18]]
padded_docs:
 [[ 2  3  4  0]
 [ 5  1  6  0]
 [ 7  0  0  0]
 [ 1  8  0  0]
 [ 9 10 11  0]
 [12 13 14 15]
 [16 17  0  0]
 [ 2  1 18  0]]

Next, load the AraVec Skip-gram word embeddings into memory as a dictionary mapping words to embedding vectors.

# load the whole embedding into memory
w2v_embeddings_index = {}
TOTAL_EMBEDDING_DIM = 300
embeddings_file = '…./full_grams_sg_300_twitter/full_grams_sg_300_twitter.mdl'
w2v_model = KeyedVectors.load(embeddings_file)
for word in w2v_model.wv.vocab:
    w2v_embeddings_index[word] = w2v_model[word]
print('Loaded %s word vectors.' % len(w2v_embeddings_index))

Output:

Loaded 1476715 word vectors.
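
As an optional sanity check (not part of the original steps), you can ask the loaded model for the nearest neighbours of a word. The query word below is just an example; the calls assume the same pre-4.0 gensim API used above, where w2v_model.wv.vocab and w2v_model.wv.most_similar() are available:

# nearest neighbours of an example word, if it exists in the AraVec vocabulary
query = 'سعيد'
if query in w2v_model.wv.vocab:
    for neighbour, score in w2v_model.wv.most_similar(query, topn=5):
        print(neighbour, round(score, 3))
else:
    print('word not in vocabulary')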

Then, create an embedding matrix for the words in the training dataset.

# create a weight matrix for words in the training docs
embedding_matrix = np.zeros((vocab_size, TOTAL_EMBEDDING_DIM))
for word, i in t.word_index.items():
    embedding_vector = w2v_embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print("Embedding Matrix shape:", embedding_matrix.shape)

Output:

Embedding Matrix shape: (19, 300)
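
Any training word that is missing from AraVec keeps an all-zero row in this matrix, so a quick coverage check can be useful here (an extra diagnostic, not part of the original tutorial):

# count how many training words received a pretrained vector
covered = sum(1 for word in t.word_index if word in w2v_embeddings_index)
print('Covered %d of %d vocabulary words' % (covered, len(t.word_index)))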

The embedding layer is seeded with the AraVec word embedding weights. The 300-dimensional Twitter Skip-gram model (version 3) was chosen; therefore, the embedding layer is defined with output_dim equal to 300.

embedding_layer = tf.keras.layers.Embedding(vocab_size, TOTAL_EMBEDDING_DIM, weights=[embedding_matrix], input_length=4, trainable=False)
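
Setting trainable=False keeps the AraVec weights frozen during training. If you instead wanted to fine-tune the embeddings on the sentiment task (a variation, not what this tutorial does), the only change would be that flag:

# variant: allow the pretrained embeddings to be updated during training
embedding_layer_finetune = tf.keras.layers.Embedding(vocab_size, TOTAL_EMBEDDING_DIM,
                                                     weights=[embedding_matrix],
                                                     input_length=4, trainable=True)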

LSTM

Define, compile, and fit the LSTM model.

# define model
input_placeholder = tf.keras.Input(shape=(4,), dtype='int32')
input_embedding = embedding_layer(input_placeholder)
lstm = tf.keras.layers.LSTM(units=10, activation='relu')(input_embedding)
preds = tf.keras.layers.Dense(1, activation='sigmoid', name="activation")(lstm)
model = tf.keras.models.Model(inputs=input_placeholder, outputs=preds)
# compile the model
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999),
              metrics=['accuracy'])
# summarize the model
print(model.summary())
print('\n # Fit model on training data')
# fit the model
history = model.fit(padded_docs, labels, epochs=50, verbose=0)

Output:

Model: "functional_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 4)]               0
_________________________________________________________________
embedding (Embedding)        (None, 4, 300)            5700
_________________________________________________________________
lstm_1 (LSTM)                (None, 10)                12440
_________________________________________________________________
activation (Dense)           (None, 1)                 11
=================================================================
Total params: 18,151
Trainable params: 12,451
Non-trainable params: 5,700
_________________________________________________________________
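
The parameter counts in the summary can be verified by hand: the embedding layer stores vocab_size × 300 = 19 × 300 = 5,700 non-trainable weights, the LSTM with 10 units on 300-dimensional inputs has 4 × (300 × 10 + 10 × 10 + 10) = 12,440 trainable weights (four gates, each with input weights, recurrent weights, and a bias), and the output neuron adds 10 + 1 = 11. A quick check:

embedding_params = 19 * 300                   # vocab_size x embedding dimension (frozen)
lstm_params = 4 * (300 * 10 + 10 * 10 + 10)   # 4 gates x (input + recurrent + bias weights)
dense_params = 10 * 1 + 1                     # output weights + bias
print(embedding_params, lstm_params, dense_params,
      embedding_params + lstm_params + dense_params)   # 5700 12440 11 18151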

Evaluate the model on the training set:

# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))

Output:

Accuracy: 62.500000

Testing the model with a positive sentence, which is predicted correctly:

text = ['عمل جيد']
encoded_text = t.texts_to_sequences(text)
print('encoded_text:\n', encoded_text)
# pad the document to a max length of 4 words
padded_text = pad_sequences(encoded_text, maxlen=max_length, padding='post')
print('padded_text:\n', padded_text)
result = int(model.predict(padded_text).round().item())
print('Input %s \n Prediction: %s' % (text, result))

Output:

encoded_text:
 [[1, 6]]
padded_text:
 [[1 6 0 0]]
Input ['عمل جيد']
 Prediction: 1

Testing the model with a negative sentence, which is predicted correctly:

text = ['يوم متعب']
encoded_text = t.texts_to_sequences(text)
print('encoded_text:\n', encoded_text)
# pad the document to a max length of 4 words
padded_text = pad_sequences(encoded_text, maxlen=max_length, padding='post')
print('padded_text:\n', padded_text)
result = int(model.predict(padded_text).round().item())
print('Input %s \n Prediction: %s' % (text, result))

Output:

encoded_text:
 [[3, 18]]
padded_text:
 [[ 3 18  0  0]]
Input ['يوم متعب']
 Prediction: 0

References

[1] S. Smetanin, “Google News and Leo Tolstoy: Visualizing Word2Vec Word Embeddings using t-SNE | by Sergey Smetanin | Towards Data Science,” Nov. 16, 2018. https://towardsdatascience.com/google-news-and-leo-tolstoy-visualizing-word2vec-word-embeddings-with-t-sne-11558d8bd4d (accessed Nov. 02, 2020).

[2] D. Ping, B. Xiang, P. Ng, R. Nallapati, S. Chakravarty, and C. Tang, “Introduction to Amazon SageMaker Object2Vec,” Amazon Web Services, Nov. 08, 2018. https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/ (accessed Nov. 02, 2020).

[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” Proc. Int. Conf. Learn. Represent. ICLR 2013, Jan. 2013, Accessed: May 22, 2019. [Online]. Available: http://arxiv.org/pdf/1301.3781v3.pdf.

[4] A. B. Soliman, K. Eissa, and S. R. El-Beltagy, “AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP,” Procedia Comput. Sci., vol. 117, pp. 256 – 265, Jan. 2017, doi: 10.1016/j.procs.2017.10.117.
