How to Use Arabic Word2Vec Word Embedding with LSTM for Sentiment Analysis Task

Word Embedding

Waad Thuwaini Alshammari
Word embedding is an approach that learns words and their relative meanings from a text corpus and represents each word as a dense vector. The word vector is the projection of the word into a continuous feature vector space; see Figure 1 (A). Words that have similar meanings should lie close together in the vector space, as illustrated in Figure 1 (B).

Figure 1 (A): visualizing high-dimensional Word2Vec word embeddings [1].
Figure 1 (B): Word2vec word embeddings show that words with similar meanings lie close together [2].
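
One common way to measure how close two word vectors are is cosine similarity. The sketch below uses three made-up 3-dimensional vectors (the values and the English keys are assumptions for illustration, not real Word2vec output) to show that a pair of similar words scores higher than a pair of dissimilar ones.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, invented for this illustration only.
toy_vectors = {
    "good": [0.9, 0.8, 0.1],
    "excellent": [0.8, 0.9, 0.2],
    "weak": [-0.7, -0.6, 0.3],
}

sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["excellent"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["weak"])
print(sim_close > sim_far)  # True
```

Real embeddings have hundreds of dimensions, but the comparison works the same way.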


Word2vec is one of the most popular word embedding methods in NLP. Word2vec has two variants, the Continuous Bag-of-Words model (CBOW) and the Continuous Skip-gram model [3]; both architectures are shown in Figure 2. CBOW predicts a word from its surrounding context, whereas Skip-gram predicts the context from a given word, which increases the computational complexity [3].

Figure 2: The CBOW and Skip-gram architecture [3].
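
The difference between the two variants can be sketched by listing the training examples each one would draw from a single sentence. The helper names (`context_windows`, `cbow_examples`, `skipgram_examples`) and the window size of 1 are assumptions made for this illustration; this is not the Word2vec training code itself.

```python
def context_windows(tokens, window=1):
    """Yield (center_word, context_words) for every position in tokens."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

def cbow_examples(tokens, window=1):
    """CBOW: the whole context is the input, the center word is the target."""
    return [(context, center) for center, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window=1):
    """Skip-gram: the center word is the input, each context word is one target."""
    return [
        (center, ctx)
        for center, context in context_windows(tokens, window)
        for ctx in context
    ]

tokens = ["كان", "يوم", "سعيد"]
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
print(len(cbow), len(skipgram))  # 3 4
```

Skip-gram produces one example per (word, context word) pair, so it yields more examples from the same text than CBOW, which is one way to see where its extra cost comes from.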

How to use pretrained Arabic word embeddings as an embedding layer

In this tutorial, we will look at how to use pre-trained word embeddings for a sentiment analysis problem with an LSTM. AraVec is an open-source project that provides pre-trained word2vec models [4]. AraVec was trained on a large Arabic text corpus that contains more than 1,169,075,128 tokens. First, import the required libraries.

import numpy as np
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from numpy import array
import gensim
from gensim.models import KeyedVectors
from gensim.models import word2vec

Next, define a small sentiment analysis problem that contains 8 examples. Each sentence is labeled as either positive “1” or negative “0” according to its polarity. We define the sentences and labels:

docs = ['كان يوم سعيد',
        'ماشاءالله عمل جيد',
        'ممتاز',
        'عمل مكتمل',
        'اعتقد بانه ضعيف',
        'يوجد ثغرات ونقاط ضعف',
        'ليس جيدا',
        'كان عمل متعب']
# define class labels
labels = array([1, 1, 1, 1, 0, 0, 0, 0])

Then, encode the documents, since Keras requires the input data to be integers: the embedding layer takes a sequence of integers as input. The Tokenizer() class splits each sentence into tokens, and texts_to_sequences() converts each word to an integer. Since the sequences have different lengths, pad_sequences() pads every sequence to the given maxlen, which in this example equals 4.

# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print('encoded_docs:\n', encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print('padded_docs:\n', padded_docs)


encoded_docs:
 [[2, 3, 4], [5, 1, 6], [7], [1, 8], [9, 10, 11], [12, 13, 14, 15], [16, 17], [2, 1, 18]]
padded_docs:
 [[ 2  3  4  0]
 [ 5  1  6  0]
 [ 7  0  0  0]
 [ 1  8  0  0]
 [ 9 10 11  0]
 [12 13 14 15]
 [16 17  0  0]
 [ 2  1 18  0]]
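
To make the encoding step less of a black box, here is a minimal pure-Python sketch of what the tokenizer and the padding do on this dataset. It is a simplification, not the Keras implementation: the helper names (`build_word_index`, `encode`, `pad_post`) are hypothetical, and the sketch only splits on whitespace, whereas the Keras Tokenizer also filters punctuation and lowercases text by default.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(texts, word_index):
    """Replace every known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_post(sequences, maxlen):
    """Right-pad with zeros; longer sequences keep only their last maxlen items."""
    trimmed = [seq[-maxlen:] for seq in sequences]
    return [seq + [0] * (maxlen - len(seq)) for seq in trimmed]

docs = ['كان يوم سعيد', 'ماشاءالله عمل جيد', 'ممتاز', 'عمل مكتمل',
        'اعتقد بانه ضعيف', 'يوجد ثغرات ونقاط ضعف', 'ليس جيدا', 'كان عمل متعب']

word_index = build_word_index(docs)
encoded = encode(docs, word_index)
padded = pad_post(encoded, maxlen=4)
print(encoded)
print(padded)
```

The most frequent word gets index 1, and index 0 never appears in the word index, which is why vocab_size is the number of distinct words plus one.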

Next, load the AraVec Skip-gram word embeddings into memory as a dictionary that maps each word to its embedding vector.

# load the whole embedding into memory
w2v_embeddings_index = {}
TOTAL_EMBEDDING_DIM = 300
embeddings_file = '…./full_grams_sg_300_twitter/full_grams_sg_300_twitter.mdl'
w2v_model = KeyedVectors.load(embeddings_file)
# gensim < 4.0 exposes the vocabulary as wv.vocab; gensim >= 4.0 uses wv.key_to_index instead
for word in w2v_model.wv.vocab:
    w2v_embeddings_index[word] = w2v_model.wv[word]
print('Loaded %s word vectors.' % len(w2v_embeddings_index))


Loaded 1476715 word vectors.

Then, create an embedding matrix for the words in the training dataset.

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, TOTAL_EMBEDDING_DIM))
for word, i in t.word_index.items():
    embedding_vector = w2v_embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print("Embedding Matrix shape:", embedding_matrix.shape)


Embedding Matrix shape: (19, 300)
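
Words that are missing from the pre-trained vocabulary keep an all-zero row in the matrix, so it can be useful to check which words those are. The sketch below shows the idea on toy data: the function name `build_embedding_matrix`, the 3-dimensional vectors, and the choice of which word is missing are all assumptions for illustration, not facts about AraVec.

```python
def build_embedding_matrix(word_index, pretrained, dim):
    """Return (matrix, missing): one row per index, zeros where no vector exists."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    missing = []
    for word, i in word_index.items():
        vector = pretrained.get(word)
        if vector is None:
            missing.append(word)  # this row stays all zeros
        else:
            matrix[i] = list(vector)
    return matrix, missing

# Hypothetical 3-dimensional vectors; the AraVec model used here has 300 dimensions.
toy_pretrained = {"عمل": [0.1, 0.2, 0.3], "جيد": [0.4, 0.5, 0.6]}
toy_word_index = {"عمل": 1, "جيد": 2, "ثغرات": 3}

matrix, missing = build_embedding_matrix(toy_word_index, toy_pretrained, dim=3)
print(len(matrix), len(matrix[0]))
print(missing)
```

A long list of missing words would suggest that the text preprocessing does not match the one used to train the embeddings.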

The embedding layer is seeded with the AraVec word embedding weights. The 300-dimensional Twitter Skip-gram model (version 3) was chosen; therefore, the embedding layer is defined with output_dim equal to 300. Setting trainable=False keeps these weights fixed during training.

embedding_layer = tf.keras.layers.Embedding(vocab_size, TOTAL_EMBEDDING_DIM, weights=[embedding_matrix], input_length=4, trainable=False)
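
Conceptually, a frozen embedding layer is a table lookup: every integer in the padded input is replaced by the matching row of the embedding matrix. The following sketch mimics this with plain lists; the function name `embed` and the 2-dimensional toy matrix are assumptions for illustration.

```python
def embed(padded_batch, embedding_matrix):
    """Replace every integer index by its row of the embedding matrix."""
    return [[embedding_matrix[index] for index in row] for row in padded_batch]

# Hypothetical 2-dimensional matrix for a vocabulary of two words.
toy_matrix = [
    [0.0, 0.0],  # index 0: padding, left as zeros
    [0.1, 0.2],  # index 1
    [0.3, 0.4],  # index 2
]
batch = [[1, 2, 0, 0], [2, 0, 0, 0]]

embedded = embed(batch, toy_matrix)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 4, 2)
```

With the tutorial's settings, the same lookup turns a batch of padded documents into a tensor of shape (batch size, 4, 300).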


Next, define, compile, and fit the LSTM model.

# define model
input_placeholder = tf.keras.Input(shape=(4,), dtype='int32')
input_embedding = embedding_layer(input_placeholder)
lstm = tf.keras.layers.LSTM(units=10, activation='relu')(input_embedding)
preds = tf.keras.layers.Dense(1, activation='sigmoid', name="activation")(lstm)
model = tf.keras.models.Model(inputs=input_placeholder, outputs=preds)
# compile the model
model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999), metrics=['accuracy'])
# summarize the model
model.summary()
print('\n # Fit model on training data')
# fit the model
history = model.fit(padded_docs, labels, epochs=50, verbose=0)


Model: "functional_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_2 (InputLayer)         [(None, 4)]               0
_________________________________________________________________
embedding (Embedding)        (None, 4, 300)            5700
_________________________________________________________________
lstm_1 (LSTM)                (None, 10)                12440
_________________________________________________________________
activation (Dense)           (None, 1)                 11
=================================================================
Total params: 18,151
Trainable params: 12,451
Non-trainable params: 5,700
_________________________________________________________________
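
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the function names are hypothetical helpers written for this check.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary index."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four weight blocks (input, forget, and output gates plus the cell candidate),
    each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """A weight per input-output pair plus one bias per output."""
    return input_dim * units + units

embedding = embedding_params(19, 300)   # 5700, frozen (non-trainable)
lstm = lstm_params(300, 10)             # 12440
dense = dense_params(10, 1)             # 11
total = embedding + lstm + dense        # 18151
trainable = lstm + dense                # 12451
print(embedding, lstm, dense, total, trainable)
```

The 5,700 non-trainable parameters are exactly the frozen embedding matrix of shape (19, 300).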

Evaluate the model on the training set:

# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))


Accuracy: 62.500000

Test the model with a positive sentence, which is predicted correctly:

text = ['عمل جيد']
encoded_text = t.texts_to_sequences(text)
print('encoded_text:\n', encoded_text)
# pad documents to a max length of 4 words
padded_text = pad_sequences(encoded_text, maxlen=max_length, padding='post')
print('padded_text:\n', padded_text)
result = int(model.predict(padded_text).round().item())
print('Input %s \n Prediction: %s' % (text, result))


encoded_text:
 [[1, 6]]
padded_text:
 [[1 6 0 0]]
Input ['عمل جيد']
 Prediction: 1
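
The model outputs a sigmoid probability, and the call to round() turns it into a class label. A minimal sketch of the same decision with an explicit threshold is shown below; the function name `to_label` and the example probabilities are hypothetical and are not outputs of the trained model.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a label: 1 = positive, 0 = negative."""
    return int(probability > threshold)

# Hypothetical probabilities, chosen only to show the three cases.
for p in (0.91, 0.50, 0.12):
    print(p, to_label(p))
```

For values between 0 and 1 this matches round(), because NumPy rounds exactly 0.5 to the nearest even integer, which is 0. Keeping the raw probability around also shows how confident the model is in each prediction.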

Test the model with a negative sentence, which is predicted correctly:

text = ['يوم متعب']
encoded_text = t.texts_to_sequences(text)
print('encoded_text:\n', encoded_text)
# pad documents to a max length of 4 words
padded_text = pad_sequences(encoded_text, maxlen=max_length, padding='post')
print('padded_text:\n', padded_text)
result = int(model.predict(padded_text).round().item())
print('Input %s \n Prediction: %s' % (text, result))


encoded_text:
 [[3, 18]]
padded_text:
 [[ 3 18  0  0]]
Input ['يوم متعب']
 Prediction: 0
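
Both test sentences contain only words seen during training. When no oov_token is set, the Keras Tokenizer silently skips words that are not in its word index, so a new sentence can lose words before it reaches the model. The sketch below imitates this in plain Python; `prepare_text` is a hypothetical helper, and the small word index holds only the four entries visible in the outputs above.

```python
def prepare_text(text, word_index, maxlen=4):
    """Encode one sentence and right-pad it; returns (padded_ids, unknown_words)."""
    words = text.split()
    ids = [word_index[w] for w in words if w in word_index]
    unknown = [w for w in words if w not in word_index]
    ids = ids[-maxlen:]  # keep at most maxlen indices
    return ids + [0] * (maxlen - len(ids)), unknown

# Subset of the tutorial's word index, read off the encoded outputs shown earlier.
word_index = {"عمل": 1, "يوم": 3, "جيد": 6, "متعب": 18}

seen_ids, seen_unknown = prepare_text("يوم متعب", word_index)
new_ids, new_unknown = prepare_text("عمل رائع", word_index)
print(seen_ids, seen_unknown)
print(new_ids, new_unknown)
```

In the second call the unseen word is dropped and the model would receive only the index of the first word followed by padding, so it is worth checking the list of unknown words before trusting a prediction.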


