LSTM networks on Quality of Stack Overflow Questions

Asking well-defined questions is crucial for programmers. Most questions have already been answered, but in the plethora of information on Google it can be hard to find the exact answer one is looking for.

The data set we’ll be analyzing in this blog post is a collection of questions taken from Stack Overflow (the largest question-and-answer site for programmers), separated into three categories based on the quality of the question.

Link to the data set, where you can download it and see a brief description.


I’ll be using Google Colab for this tutorial, taking advantage of its GPU computing power, and also because it’s a platform that everyone should be familiar with by now.

Also, I’ll try to limit myself to as few libraries as I can, to help you get familiar primarily with TensorFlow 2 + Keras (no scikit-learn or other machine learning libraries).

To start, I’ll need to import the data. I’ve uploaded the data set to Google Drive, and with the code below I’m granting Colab access to pull the data from my Drive.

from google.colab import drive
drive.mount('/content/gdrive')

After we run that cell, we’ll be given a link where we have to allow Colab to access Drive. Once we’ve done that, we get a unique code that we paste into the Colab cell as a password.

We can also upload files to Colab manually, as shown below. That saves us some time, since we skip mounting the Drive.

[Screenshot: uploading files manually in the Colab file manager]

Once we’ve finished all the steps above, we’ll see a new folder called gdrive in our file manager. We find our data there, copy its path (right-click on the file and press “Copy Path”) and use that to load it.

import pandas as pd

data_path = '/content/gdrive/MyDrive/StackOverflowQuestions/data.csv'
data = pd.read_csv(data_path)
data.head()

We’re calling the .head() function just to get an idea of what the data looks like, which we can see below.

[Screenshot: output of data.head()]

As these are a bunch of questions from Stack Overflow, it wouldn’t be crazy to guess that the text data is not very clean. We’ll print one question using data.Body[3], just to see what it looks like. Output:

<p>I am attempting to overlay a title over an image — with the image darkened with a lower opacity. However, the opacity effect is changing the overlaying text as well — making it dim. Any fix to this? Here is what is looks like:</p>\n\n<p><a href=”" rel=”noreferrer”><img src=”" alt=”enter image description here”></a></p>\n\n<p>....

As we can see, there is a lot going on in that body of text: HTML tags, links and newline characters. We’ll define a function to clean it up a little:

import re

def cleant(raw_t):
    # Strip HTML tags such as <p> or <a href=...>
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_t)
    # Remove email addresses
    cleantext = re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', cleantext)
    # Remove URLs (http, https, ftp)
    cleantext = re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', cleantext)
    return cleantext

data.Body = data['Body'].apply(cleant)
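
To see what the cleaning step does, here is a small self-contained check on a made-up string. The sample text and the name strip_markup are mine, for illustration only; the three substitutions (tags, then email addresses, then URLs) use the same patterns as above.

```python
import re

# Same three patterns as the cleaning function: tags, emails, URLs
TAG_RE = re.compile(r'<.*?>')
EMAIL_RE = re.compile(r'[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')
URL_RE = re.compile(
    r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))'
    r'([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

def strip_markup(raw_text):
    """Remove HTML tags, email addresses and URLs, in that order."""
    text = TAG_RE.sub('', raw_text)
    text = EMAIL_RE.sub('', text)
    text = URL_RE.sub('', text)
    return text

# Made-up sample, not taken from the data set
sample = '<p>Mail me at jane@example.com or see https://example.com/docs for <b>details</b>.</p>'
cleaned = strip_markup(sample)
print(repr(cleaned))
```

The removed pieces leave double spaces behind. As far as I can tell that is harmless here, because the Keras Tokenizer used later splits on whitespace and drops empty tokens.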

We can define X and y now. Keep in mind that our y (categorical) values are represented as strings, as we’ve seen above. We need to turn them into numerical values, so we will manually label encode them.

X = data.Body
# Assumption: the label column is named 'Y'; change this to match your CSV
y = data['Y'].map({'LQ_CLOSE': 0, 'LQ_EDIT': 1, 'HQ': 2})
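
One thing worth knowing about manual label encoding with a dictionary: pandas’ .map() quietly produces NaN for any label missing from the dictionary. The sketch below shows the same encoding in plain Python with an explicit check for unexpected labels, plus the reverse mapping, which is handy later for turning predictions back into class names. The helper name encode_labels and the sample labels are hypothetical.

```python
LABEL_TO_ID = {'LQ_CLOSE': 0, 'LQ_EDIT': 1, 'HQ': 2}
# Reverse lookup: integer class id -> label name
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}

def encode_labels(labels):
    """Map string labels to integer ids, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f'unexpected labels: {unknown}')
    return [LABEL_TO_ID[label] for label in labels]

# Made-up labels for illustration
sample_labels = ['HQ', 'LQ_EDIT', 'LQ_CLOSE', 'HQ']
encoded = encode_labels(sample_labels)
decoded = [ID_TO_LABEL[i] for i in encoded]
print(encoded, decoded)
```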

Split the data into train and test sets with a 70:30 ratio, and print the shapes (always print and double-check the shapes, since a lot of errors arise from mismatched shapes).

train_size = int(len(X) * 0.7)
X_train, y_train = X[0:train_size], y[0:train_size]
X_test, y_test = X[train_size:], y[train_size:]

print("X_train shape:", X_train.shape)  # (42000,)
print("X_test shape:", X_test.shape)    # (18000,)
print("y_train shape:", y_train.shape)  # (42000,)
print("y_test shape:", y_test.shape)    # (18000,)
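
The split above takes the first 70% of the rows as they appear in the file. If the rows happen to be ordered in some way (by date or by class, for example), it can help to shuffle first. Below is a minimal pure-Python sketch of a shuffled 70:30 split that works on row indices; the function name and the seed are my own choices, not part of the original pipeline.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.7, seed=42):
    """Return shuffled (train_indices, test_indices) for n_rows rows."""
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible
    random.Random(seed).shuffle(indices)
    train_size = int(n_rows * train_fraction)
    return indices[:train_size], indices[train_size:]

train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))
```

With a pandas Series, the resulting index lists could then be used as X.iloc[train_idx] and X.iloc[test_idx].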

Now we start playing with TensorFlow. We use Tokenizer to map the words in our data to integer indices. Before it can do that, it has to be fitted on the training text so that it can build its vocabulary.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 30000   # size of our vocabulary
embedding_dim = 16
max_length = 150     # length of every input that we will feed our DNN

tokenizer = Tokenizer(vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index
print(len(word_index))  # how many unique words there are in X_train
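
To make the tokenizer less of a black box, here is a simplified pure-Python sketch of the idea behind it. It is not the Keras implementation: it only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation. What it does mirror is the indexing scheme as I understand it: index 1 is reserved for the out-of-vocabulary token, the remaining words are numbered by descending frequency, and any word whose index is not below num_words is replaced by the OOV index. The function names and sample sentences are made up.

```python
from collections import Counter

def build_word_index(texts, oov_token='<OOV>'):
    """Number words by descending frequency; index 1 is the OOV token."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    word_index = {oov_token: 1}
    # most_common() sorts by count; ties keep first-seen order
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_ids(texts, word_index, num_words, oov_token='<OOV>'):
    """Replace each word by its index, or by the OOV index if unknown or too rare."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word)
            ids.append(idx if idx is not None and idx < num_words else oov_id)
        sequences.append(ids)
    return sequences

train_texts = ['how to parse json in python', 'how to sort a list in python']
toy_index = build_word_index(train_texts)
toy_sequences = texts_to_ids(['how to parse yaml in python'], toy_index, num_words=7)
print(toy_index)
print(toy_sequences)
```

Note that the full word index keeps every word seen during fitting; the vocabulary limit is only applied when texts are converted to sequences. That is why len(word_index) above can be larger than vocab_size.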

Now we need to convert the text into sequences of indices, and after that pad those sequences to a fixed length, because the network expects inputs of a uniform shape (much like the fixed-size images we feed to Convolutional Neural Networks).

sequences = tokenizer.texts_to_sequences(X_train)
padded = pad_sequences(sequences, maxlen=max_length, padding='post', truncating='post')

test_sequences = tokenizer.texts_to_sequences(X_test)
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding='post', truncating='post')

print(padded.shape)       # (42000, 150)
print(test_padded.shape)  # (18000, 150)
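
For intuition, this is roughly what padding='post' and truncating='post' do to each sequence: short sequences get zeros appended at the end, and long ones lose everything past maxlen. The helper below is a plain-Python illustration with a hypothetical name, not the Keras function itself.

```python
def pad_post(sequences, maxlen, pad_value=0):
    """Cut each sequence to maxlen from the end, then right-pad with pad_value."""
    padded = []
    for seq in sequences:
        clipped = list(seq[:maxlen])                       # truncating='post'
        clipped += [pad_value] * (maxlen - len(clipped))   # padding='post'
        padded.append(clipped)
    return padded

example = pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(example)  # [[5, 8, 2, 0], [7, 1, 9, 4]]
```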

Model Architecture

[Figure: model architecture]

Time to build, compile and fit our model. We’ll keep the model simple; I’ll let you experiment with it further if you want.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50)),
    tf.keras.layers.Dropout(0.3),
    # integer division: Dense expects a whole number of units
    tf.keras.layers.Dense(units=vocab_size // 100, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
model.summary()

Model: "sequential"
Layer (type)                    Output Shape        Param #
embedding (Embedding)           (None, 150, 16)     480000
bidirectional (Bidirectional    (None, 150, 100)    26800
bidirectional_1 (Bidirection    (None, 100)         60400
dropout (Dropout)               (None, 100)         0
dense (Dense)                   (None, 300)         30300
dense_1 (Dense)                 (None, 3)           903
Total params: 598,403
Trainable params: 598,403
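
If you are wondering where the numbers in the Param # column come from, they can be reproduced by hand. The sketch below uses the standard parameter-count formulas (an LSTM has four gates, each with input weights, recurrent weights and a bias; a Bidirectional wrapper doubles that). The helper names are mine.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embedding_dim = 30000, 16
layer_params = {
    'embedding': embedding_params(vocab_size, embedding_dim),
    # Bidirectional = forward LSTM + backward LSTM
    'bidirectional': 2 * lstm_params(embedding_dim, 50),
    # Input is 100 wide: 50 forward + 50 backward outputs of the layer before
    'bidirectional_1': 2 * lstm_params(100, 50),
    'dense': dense_params(100, vocab_size // 100),
    'dense_1': dense_params(vocab_size // 100, 3),
}
total = sum(layer_params.values())
print(layer_params)
print(total)  # 598403
```

The embedding layer alone accounts for about 80% of the parameters, so vocab_size and embedding_dim are the first knobs to turn if you want a smaller model.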

This is where I usually have the most fun, as there are so many things we can tweak. I didn’t want to make my model too complex, so there are no regularizers, no fancy weight initializers and only one Dropout layer, but you’re more than welcome to add those to improve performance. The optimizer I’ve decided to use is Adam.

model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              metrics=['accuracy'])
history = model.fit(padded, y_train, epochs=10)
Epoch 7/10
1313/1313 [==============================] - 38s 29ms/step - loss: 0.1300 - accuracy: 0.9613
Epoch 8/10
1313/1313 [==============================] - 38s 29ms/step - loss: 0.1095 - accuracy: 0.9668
Epoch 9/10
1313/1313 [==============================] - 38s 29ms/step - loss: 0.0918 - accuracy: 0.9722
Epoch 10/10
1313/1313 [==============================] - 38s 29ms/step - loss: 0.0763 - accuracy: 0.9768
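
A quick note on the 1313 in the log: that is the number of batches per epoch. We did not pass a batch size to fit, so Keras falls back to its default of 32, and 42,000 training samples split into batches of 32 gives 1313 steps (the last batch is only half full). The helper name below is hypothetical.

```python
import math

def steps_per_epoch(n_samples, batch_size=32):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

steps = steps_per_epoch(42000)
print(steps)  # 1313
```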

About 97% on the training data! Not too shabby. But what about the test data? That should be our main focus.

scores = model.evaluate(test_padded, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))  # Accuracy: 81.67%

Considerably lower than on the training data, although still good. We will also plot the history of training:

import matplotlib.pyplot as plt

plt.style.use('ggplot')
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4), sharex=True)

ax1.plot(history.history['accuracy'], label='Train')
ax1.set_xlabel('Epochs')
ax1.set_ylabel('Accuracy')
ax1.set_title('Accuracy per Epoch')
ax1.legend()

ax2.plot(history.history['loss'], label='Train')
ax2.set_xlabel('Epochs')
ax2.set_ylabel('Loss')
ax2.set_title('Loss per Epoch')
ax2.legend()
[Figure: training accuracy and loss per epoch]

A steady trend, which is good. Had I left it running for 100+ epochs, the model would most likely have overfitted badly: the gap between training accuracy (about 97%) and test accuracy (81.67%) shows it is already overfitting after 10 epochs.
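
One common way to keep overfitting in check is early stopping: hold out some validation data, watch the validation loss after every epoch, and stop once it has not improved for a few epochs. The sketch below shows only the stopping logic in plain Python. The loss values are invented for illustration and are not results from this model; the function name and the patience of 2 are my own choices.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), with epochs counted from 1.

    Training stops once `patience` epochs pass without a new best loss.
    """
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improving at first, then getting worse
val_losses = [0.62, 0.51, 0.47, 0.49, 0.55, 0.63]
stop_epoch, best_epoch = early_stopping_epoch(val_losses)
print(stop_epoch, best_epoch)  # 5 3
```

In Keras itself you would not write this by hand. As far as I know, passing validation_split to model.fit together with a tf.keras.callbacks.EarlyStopping callback gives the same behavior.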

Predict our own data

I always find it tricky to predict on new data, or my own data, with TensorFlow models. There is always a lot of hassle with dimension matching, reverse engineering what I did to my train/test data, and so on. So here I decided to show you how you can do that too.

We will define a function that transforms our text in the same way that our train/test data was transformed:

def get_encode(texts):
    # texts is a list of raw question strings
    texts = [cleant(t) for t in texts]
    x = tokenizer.texts_to_sequences(texts)
    # same padding settings as the training data
    x = tf.keras.preprocessing.sequence.pad_sequences(x, maxlen=max_length, padding='post', truncating='post')
    return x

I found a random question on Stack Overflow and converted it into HTML text, to make it similar to the question format in our data:

text = ['<p>I want to write a bash script which takes different arguments. It should be used like normal linux console programs:</p><p>my_bash_script -p 2 -l 5 -t 20 So the value 2 should be saved in a variable called pages and the parameter l should be saved in a variable called length and the value 20 should be saved in a variable time.</p><p>What is the best way to do this?</p>']

# clean, tokenize and pad with the same settings as the training data
text_clean = [cleant(t) for t in text]
text_sequences = tokenizer.texts_to_sequences(text_clean)
text_padded = pad_sequences(text_sequences, maxlen=max_length, padding='post', truncating='post')
text_padded.shape  # (1, 150)

y_pred = model.predict(text_padded)
y_pred = y_pred.round()  # e.g. [[1., 0., 0.]]

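As you can see, it’s pretty straightforward. Make sure your text is wrapped in a list ([]), create sequences with the tokenizer, and then pad them to the length the model expects, ideally with the same padding and truncating settings used for the training data. The function model.predict returns one row of 3 values per input (because our output layer is a Dense layer with 3 outputs, for 3 classes), and the 3 values are the softmax probabilities of each class. We then call .round(), which rounds every probability to the nearest integer: when one class has a probability above 0.5, we get a 1 in that position and 0 elsewhere. If no class clears 0.5, every entry rounds to 0, so taking the index of the largest value (for example with np.argmax) is the more robust way to pick the class.

Thanks for reading my first blog post! If you’d like to see more, let me know.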
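
Here is a small plain-Python sketch of turning a row of softmax probabilities into a class name by picking the largest value, next to what rounding does with the same row. The probabilities are made up to show both a confident and an uncertain case; the function name is hypothetical, and the label ids follow the encoding used earlier.

```python
ID_TO_LABEL = {0: 'LQ_CLOSE', 1: 'LQ_EDIT', 2: 'HQ'}

def predicted_label(probabilities):
    """Return the label whose probability is largest (argmax)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return ID_TO_LABEL[best_index]

# Made-up softmax outputs, not actual model predictions
confident = [0.91, 0.06, 0.03]
uncertain = [0.30, 0.45, 0.25]

rounded_confident = [round(p) for p in confident]
rounded_uncertain = [round(p) for p in uncertain]

print(rounded_confident, predicted_label(confident))  # [1, 0, 0] LQ_CLOSE
print(rounded_uncertain, predicted_label(uncertain))  # [0, 0, 0] LQ_EDIT
```

In the uncertain case rounding loses the answer entirely, while picking the largest value still returns a class.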
