I used text classification to predict which developer a defect should be assigned to. In effect, this automates part of a scrum master's work using AI.
In a large software system, the number of bugs can grow tremendously. These need to be assigned and fixed as early as possible. An assigner assigns each bug to a developer, who in turn fixes it. It often happens that a bug is assigned to the wrong developer. To deal with this problem, I use a supervised learning approach to predict the right developer.
The basic steps are:
1. Pick a dataset (preferably one with lots of defects; I used my company's defect logs from JIRA, exported to Excel)
2. Download and install Anaconda (https://www.anaconda.com/distribution/)
3. Data preprocessing (drop NA, stemming, lemmatization, stop words removal)
4. Applying count vectorization, TFIDF transform.
5. Applying multiple ML models.
5.1 Multinomial Naive Bayes
5.2 SGD Classifier
5.3 Decision Tree
5.4 Random Forest Classifier
5.5 K neighbours Classifier
6. Deploying the model using pickle
Step1: Pick a dataset
I am not sharing the dataset I used. You should use your own; you can easily export one to Excel from any bug tracking tool used in your organisation, such as JIRA or Bugzilla.
Step2: Download and install Anaconda
Step3: Data Preprocessing
First, I imported a set of libraries:
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
import numpy as np
import pickle
import scipy.sparse as sp
Here I read the Excel file into a pandas DataFrame.
# Read the sheet; the first row holds the column names
dat1 = pd.read_excel('ABC1234.xlsx', header=0)
The “Assignee” column is my target, i.e. I will predict the assignee based on the “Summary” and “Description” columns.
# Data preprocessing
# Count the rows that have no assignee
dat1['Assignee'].isnull().sum()
Z = dat1['Assignee'].dropna()
# Keep only the rows that have an assignee
dataframeZ = dat1[pd.notnull(dat1['Assignee'])]
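The step list mentions stop word removal, but the code above only drops rows without an assignee. As a rough illustration of what a cleaning step could look like, here is a minimal sketch in plain Python. The `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical and not part of the original code; stemming and lemmatization would need a library such as NLTK and are not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "to", "of", "and", "when"}

def clean_text(text):
    """Lowercase the text, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", str(text).lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(clean_text("The login page is crashing when the user clicks on Submit!"))
# login page crashing user clicks submit
```

A helper like this could be applied to a text column before vectorization, for example with `dataframeZ['Summary'].apply(clean_text)`.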
I use train_test_split to split the dataset: 80% for training and 20% for testing.
# Train test split
X = dataframeZ[['Summary', 'Description']]
y = dataframeZ['Assignee']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
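Because the split is random, the accuracy numbers will change from run to run unless the shuffle is seeded (train_test_split accepts a `random_state` argument for this). The idea behind a seeded 80/20 split can be sketched in plain Python as follows. The `split_80_20` function and the `BUG-n` identifiers are made up for illustration; this is not how scikit-learn implements it.

```python
import random

def split_80_20(rows, seed=42):
    """Shuffle a copy of rows with a fixed seed and split it 80/20."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

defects = [f"BUG-{i}" for i in range(10)]
train, test = split_80_20(defects)
print(len(train), len(test))  # 8 2

# The same seed always gives the same split.
print(split_80_20(defects) == (train, test))  # True
```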
Step 4: Applying count vectorization and TFIDF transformation
CountVectorizer handles text preprocessing and tokenizing, and can also filter stop words through its stop_words parameter. It builds a dictionary of features and transforms documents into feature vectors.
# Vectorization
Vector1 = CountVectorizer()
Vector2 = CountVectorizer()
X_train_1 = Vector1.fit_transform(X_train['Summary'].values.astype('U'))
X_train_2 = Vector2.fit_transform(X_train['Description'].values.astype('U'))
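To make the "dictionary of features" concrete, here is a small plain-Python sketch of count vectorization. It only approximates what CountVectorizer does (lowercasing, tokens of two or more word characters, alphabetical column order); the function names and sample sentences are invented for this illustration.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Turn each document into a list of token counts; unknown tokens are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train_docs = ["Login button crash", "Crash on login page"]
vocab = fit_vocabulary(train_docs)
print(vocab)  # {'button': 0, 'crash': 1, 'login': 2, 'on': 3, 'page': 4}
print(transform(train_docs, vocab))  # [[1, 1, 1, 0, 0], [0, 1, 1, 1, 1]]

# Text seen later reuses the training vocabulary; "logout" is unknown and dropped.
print(transform(["Logout crash crash"], vocab))  # [[0, 2, 0, 0, 0]]
```

This is also why the test data is later passed through `transform` rather than `fit_transform`: the columns must mean the same thing for training and test text.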
Next I apply TF-IDF (term frequency times inverse document frequency). The idea is to scale word counts relative to document length, so that long documents do not dominate, and to downscale the weights of words that occur in many documents in the corpus and are therefore less informative.
from scipy.sparse import hstack
X_train_vector = hstack([X_train_1,X_train_2])
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_train = TfidfTransformer()
X_train_vector_tfidf = tfidf_train.fit_transform(X_train_vector)
X_test_1 = Vector1.transform(X_test['Summary'].values.astype('U'))
X_test_2 = Vector2.transform(X_test['Description'].values.astype('U'))
X_test_vector = hstack([X_test_1, X_test_2])
X_test_vector_tfidf = tfidf_train.transform(X_test_vector)
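The TF-IDF weighting itself can be sketched in a few lines of plain Python. This sketch assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as the TfidfTransformer defaults; the `tfidf` function and the sample counts are illustrative only. Unlike the transformer above, it fits and transforms in one call rather than reusing the training idf values on test data.

```python
import math

def tfidf(count_rows):
    """Apply smoothed idf weighting and L2-normalize each row of counts."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted = [[c * w for c, w in zip(row, idf)] for row in count_rows]
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

# Two documents, three terms; term 0 appears in both documents.
counts = [[1, 1, 0], [1, 0, 2]]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first row both terms occur once, but the term that appears in every document ends up with the smaller weight.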
Step5: Apply multiple machine learning models
Multinomial Naive Bayes Classifier
#ML Algorithm
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_vector_tfidf, y_train)
predicted = clf.predict(X_test_vector_tfidf)
np.mean(predicted == y_test)*100
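For readers who want to see what a multinomial Naive Bayes classifier computes, here is a minimal plain-Python sketch with Laplace smoothing (alpha = 1.0, which is also the MultinomialNB default). It works on raw word counts of whitespace-split text rather than TF-IDF vectors, and the function names, sample defects, and assignee names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect the counts needed for multinomial Naive Bayes."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    return vocab, class_counts, token_counts, alpha

def predict_nb(model, doc):
    """Return the label with the highest log prior plus summed log likelihoods."""
    vocab, class_counts, token_counts, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(token_counts[label].values()) + alpha * len(vocab)
        for tok in doc.split():
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((token_counts[label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = ["login crash", "login timeout", "report export", "report layout"]
labels = ["alice", "alice", "bob", "bob"]
model = train_nb(docs, labels)
print(predict_nb(model, "login error"))    # alice
print(predict_nb(model, "export report"))  # bob
```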
SGD Classifier
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train_vector_tfidf, y_train)
predicted_sgd = sgd.predict(X_test_vector_tfidf)
np.mean(predicted_sgd == y_test)*100
Decision Tree Classifier
from sklearn import tree
Dtreeclf = tree.DecisionTreeClassifier().fit(X_train_vector_tfidf, y_train)
predicted_dtree = Dtreeclf.predict(X_test_vector_tfidf)
np.mean(predicted_dtree == y_test)*100
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
randomclf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0).fit(X_train_vector_tfidf, y_train)
predicted_randomf = randomclf.predict(X_test_vector_tfidf)
np.mean(predicted_randomf == y_test)*100
K Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3).fit(X_train_vector_tfidf, y_train)
predicted_neigh = neigh.predict(X_test_vector_tfidf)
np.mean(predicted_neigh == y_test)*100
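Each model above is scored with `np.mean(predicted == y_test)*100`, which is the percentage of test defects whose assignee was predicted correctly. The plain-Python sketch below shows the same calculation, plus a per-assignee breakdown that can reveal whether a model only does well for the developers with the most defects. The helper names and the sample labels are invented for illustration.

```python
from collections import Counter

def accuracy_percent(predicted, actual):
    """Percentage of positions where the predicted label equals the actual one."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

def per_assignee_recall(predicted, actual):
    """For each actual assignee, the fraction of their defects predicted correctly."""
    total = Counter(actual)
    hits = Counter(a for p, a in zip(predicted, actual) if p == a)
    return {name: hits[name] / total[name] for name in total}

actual = ["alice", "bob", "alice", "carol", "bob"]
predicted = ["alice", "alice", "alice", "carol", "bob"]

print(accuracy_percent(predicted, actual))     # 80.0
print(per_assignee_recall(predicted, actual))  # {'alice': 1.0, 'bob': 0.5, 'carol': 1.0}
```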
Step6: Deploying the model using pickle
# Save the fitted TF-IDF transformer and each trained model to disk
pickle.dump(tfidf_train, open("tfidf_train2.pkl", "wb"))
pickle.dump(clf, open("MultinomialNB2.pkl", "wb"))
pickle.dump(sgd, open("SGDClassifier2.pkl", "wb"))
pickle.dump(Dtreeclf, open("DecisionTreeClassifier2.pkl", "wb"))
pickle.dump(randomclf, open("RandomForestClassifier2.pkl", "wb"))
pickle.dump(neigh, open("KNeighborsClassifier2.pkl", "wb"))
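To use a saved model later, it has to be loaded back with `pickle.load`. The round trip can be sketched as below; the plain dict is only a stand-in for a fitted object, and the temporary file path is an assumption made for this example. In a real deployment the fitted count vectorizers would need to be pickled as well, because new defect text must be transformed with the same vocabulary before a model can predict on it. Only unpickle files from a source you trust.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model or vectorizer (hypothetical; any picklable object works).
model = {"name": "MultinomialNB", "classes": ["alice", "bob"]}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save: the with-block closes the file even if dump raises.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load, for example in a separate prediction script.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```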
Codebase
Please post any questions or reviews. All constructive criticism is welcome.
Thank you.