A virtual-scrum-master using text classification and ML models

Ishmeet
Nov 11, 2019


I used text classification to predict which developer a defect should be assigned to. In other words, part of a scrum master's work, automated using AI.

In a large software system, the number of bugs can grow tremendously. These need to be assigned and fixed as early as possible: an assigner assigns each bug to a developer, who in turn fixes it. Quite often, a bug is assigned to the wrong developer. To deal with this problem, a supervised learning approach is used to predict the right developer.

Here are the basic steps:
1. Pick a dataset (preferably one with lots of defects; I used my company's defect logs from JIRA, exported to Excel)
2. Download and install Anaconda (https://www.anaconda.com/distribution/)
3. Data preprocessing (drop NA rows, stemming, lemmatization, stop-word removal)
4. Apply count vectorization and a TFIDF transform
5. Apply multiple ML models:
5.1 Multinomial Naive Bayes
5.2 SGD Classifier
5.3 Decision Tree
5.4 Random Forest Classifier
5.5 K Neighbors Classifier
6. Deploy the models using pickle

Step 1: Pick a dataset

I am not sharing the dataset I used; you should use your own. You can easily export defects to Excel from any bug-tracking tool, such as JIRA or Bugzilla, in your organisation.

Step 2: Download and install Anaconda

Anaconda Python/R Distribution (free download): the open-source Anaconda Distribution is the easiest way to perform Python/R data science and machine learning (www.anaconda.com).

Step 3: Data preprocessing

First, I imported a set of libraries:

import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
import numpy as np
import pickle
import scipy.sparse as sp

Then I read the Excel file into a pandas DataFrame:

dat1 = pd.read_excel('ABC1234.xlsx', header=0)  # encoding/error_bad_lines are read_csv options, not needed here

“Assignee” is my target column, i.e. I will predict the assignee from the “Summary” and “Description” columns.

# Data preprocessing
dat1['Assignee'].isnull().sum()  # how many defects have no assignee
dataframeZ = dat1[pd.notnull(dat1['Assignee'])]  # keep only rows with an assignee
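The step list above also mentions stemming, lemmatization and stop-word removal, which I have not shown here. A minimal sketch of that cleanup using NLTK (an assumption on my part; any NLP library would do) could look like this:

# Sketch of the text cleanup from step 3, using NLTK (assumed installed:
# pip install nltk, then download 'stopwords', 'punkt' and 'wordnet')
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # lowercase, tokenize, drop stop words/punctuation, lemmatize, then stem
    tokens = nltk.word_tokenize(str(text).lower())
    kept = [t for t in tokens if t.isalpha() and t not in stop_words]
    return ' '.join(stemmer.stem(lemmatizer.lemmatize(t)) for t in kept)

dataframeZ = dataframeZ.copy()
dataframeZ['Summary'] = dataframeZ['Summary'].apply(clean_text)
dataframeZ['Description'] = dataframeZ['Description'].apply(clean_text)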

Use train_test_split to split the dataset: 80% for training and 20% for testing.

# Train/test split
X = dataframeZ[['Summary', 'Description']]
y = dataframeZ['Assignee']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Step 4: Applying count vectorization and TFIDF transformation

Text preprocessing, tokenizing, and filtering of stop words are all handled by CountVectorizer, which builds a dictionary of features and transforms documents into feature vectors. I use a separate vectorizer for the “Summary” and “Description” columns:

# Vectorization: one vectorizer per text column
Vector1 = CountVectorizer()
Vector2 = CountVectorizer()
X_train_1 = Vector1.fit_transform(X_train['Summary'].values.astype('U'))
X_train_2 = Vector2.fit_transform(X_train['Description'].values.astype('U'))
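To see what CountVectorizer actually produces, here is a tiny illustration on two made-up bug summaries (not part of the pipeline):

# Illustration only: CountVectorizer builds a vocabulary and counts terms
demo = CountVectorizer()
counts = demo.fit_transform(['login page crashes', 'page loads slowly'])
print(demo.get_feature_names_out())  # get_feature_names() on older scikit-learn
# ['crashes' 'loads' 'login' 'page' 'slowly']
print(counts.toarray())
# [[1 0 1 1 0]
#  [0 1 0 1 1]]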

TFIDF (term frequency times inverse document frequency) divides the number of occurrences of each word in a document by the total number of words in that document, and downscales the weights of words that occur in many documents in the corpus and are therefore less informative.

from scipy.sparse import hstack
X_train_vector = hstack([X_train_1,X_train_2])
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_train = TfidfTransformer()
X_train_vector_tfidf = tfidf_train.fit_transform(X_train_vector)

X_test_1 = Vector1.transform(X_test['Summary'].values.astype('U'))
X_test_2 = Vector2.transform(X_test['Description'].values.astype('U'))
X_test_vector = hstack([X_test_1, X_test_2])
X_test_vector_tfidf = tfidf_train.transform(X_test_vector)
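For intuition, here is a small, self-contained illustration (again, not part of the pipeline) of the downweighting: a term that appears in every document gets a lower weight than a rarer term in the same row.

# Illustration only: 'error' occurs in every document, so TF-IDF gives it
# a lower weight than the rarer terms in each row
from sklearn.feature_extraction.text import TfidfVectorizer
demo_tfidf = TfidfVectorizer()
weights = demo_tfidf.fit_transform([
    'error on login page',
    'error on checkout page',
    'error in report export',
])
row = weights[0].toarray().ravel()
for term, w in zip(demo_tfidf.get_feature_names_out(), row):
    if w > 0:
        print(term, round(w, 2))
# 'error' gets the smallest weight in this row; 'login' the largest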

Step 5: Apply multiple machine learning models

Multinomial Naive Bayes Classifier

#ML Algorithm
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_vector_tfidf, y_train)
predicted = clf.predict(X_test_vector_tfidf)
np.mean(predicted == y_test) * 100  # accuracy (%); same pattern below for the other models

SGD Classifier

from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_vector_tfidf, y_train)
predicted_sgd = sgd.predict(X_test_vector_tfidf)
np.mean(predicted_sgd == y_test)*100

Decision Tree Classifier

from sklearn import tree
Dtreeclf = tree.DecisionTreeClassifier().fit(X_train_vector_tfidf, y_train)
predicted_dtree = Dtreeclf.predict(X_test_vector_tfidf)
np.mean(predicted_dtree == y_test)*100

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
randomclf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0).fit(X_train_vector_tfidf, y_train)
predicted_randomf = randomclf.predict(X_test_vector_tfidf)
np.mean(predicted_randomf == y_test)*100

K Neighbors Classifier

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3).fit(X_train_vector_tfidf, y_train)
predicted_neigh = neigh.predict(X_test_vector_tfidf)
np.mean(predicted_neigh == y_test)*100
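Accuracy alone can be misleading when defects are spread unevenly across developers, so the metrics module imported earlier is useful for per-assignee numbers. A quick sketch:

# Per-assignee precision/recall/F1 for the Naive Bayes model
print(metrics.classification_report(y_test, predicted))

# And a one-line accuracy comparison across all five models
for name, preds in [('MultinomialNB', predicted),
                    ('SGDClassifier', predicted_sgd),
                    ('DecisionTree', predicted_dtree),
                    ('RandomForest', predicted_randomf),
                    ('KNeighbors', predicted_neigh)]:
    print(name, round(metrics.accuracy_score(y_test, preds) * 100, 2))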

Step 6: Deploying the models using pickle

# Persist the vectorizers, the TF-IDF transformer, and the fitted models.
# Note: the two CountVectorizers must be saved too, or new defects cannot
# be transformed at prediction time.
pickle.dump(Vector1, open("Vector1.pkl", "wb"))
pickle.dump(Vector2, open("Vector2.pkl", "wb"))
pickle.dump(tfidf_train, open("tfidf_train.pkl", "wb"))
pickle.dump(clf, open("MultinomialNB.pkl", "wb"))
pickle.dump(sgd, open("SGDClassifier.pkl", "wb"))
pickle.dump(Dtreeclf, open("DecisionTreeClassifier.pkl", "wb"))
pickle.dump(randomclf, open("RandomForestClassifier.pkl", "wb"))
pickle.dump(neigh, open("KNeighborsClassifier.pkl", "wb"))
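At prediction time, the pickled artifacts are loaded and a new defect is pushed through the same transform chain. A minimal sketch (the defect text is made up):

# Minimal inference sketch using the pickled artifacts above
import pickle
from scipy.sparse import hstack

vec1 = pickle.load(open('Vector1.pkl', 'rb'))
vec2 = pickle.load(open('Vector2.pkl', 'rb'))
tfidf = pickle.load(open('tfidf_train.pkl', 'rb'))
model = pickle.load(open('MultinomialNB.pkl', 'rb'))

summary = ['Login page crashes on submit']              # made-up defect
description = ['Clicking Submit returns a 500 error']   # made-up defect

features = tfidf.transform(hstack([vec1.transform(summary),
                                   vec2.transform(description)]))
print(model.predict(features))  # predicted assignee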

Codebase

Please post any queries or reviews. All constructive criticism is welcome.

Thank you.

