This is what happens when you machine-learn JIRA tickets

Gathering knowledge from XGB trees

Steffen H
5 min read · Jul 27, 2020

For software developers, one of the most debated and maybe even most hated questions is “…and how long will it take?”. I’ve experienced those discussions myself, and they often lacked precise information on the requirements. What I’ve learned is: if only sparse information is available, a reliable estimate is almost impossible. To make matters worse, developers find themselves under pressure once they’ve issued a wild guess and then need more time.

Experiencing the other side

When I started working in direct contact with customers, I (reluctantly) realized that the collaboration often benefits from providing a schedule. In my experience, time becomes an important factor when customers have plans and projects of their own that can’t continue without knowing when a missing piece will arrive.

Understanding each other (Business Understanding)

In the end, this comes down to a simple problem: “How can managers and developers work together on a project and each get what they need?”. For those who interact with customers, this means they need estimates to keep the collaboration running smoothly. In turn, developers need accurate requirements, and their need for some flexibility has to be respected as well.

Learn!

With that in mind, I decided to use a data-science-driven approach to gain insights into the estimation problem. You can find the code that I used, along with more detailed technical explanations, in my GitHub repository. What I wanted to know was:

  1. As a baseline reference, what is the average time that a “New Feature”, “Bug”, etc. spends in implementation (i.e. in the status “in progress”)?
  2. Is it possible to estimate the time spent “in progress” by analyzing the text in the summary and description of a ticket?
  3. Which words in the description account for large / small durations?

Data Understanding

As data, I gathered around 20,000 tickets from the RTFACT repository of the JFrog open source project. For each ticket, the following fields are available: issue type (e.g. “Bug”, “New Feature”), summary, description, and the time spent “in progress”. Some initial data exploration showed that out of all the tickets, only 10% (2,258) have a nonzero “in progress” time. All the others were either never worked on or never put into that status.
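For reference, here is a minimal sketch of how such tickets can be pulled from JIRA’s standard REST search endpoint. The host name is a placeholder, and deriving the exact time spent “in progress” additionally requires walking each issue’s changelog, which I omit here:

```python
import requests

# Placeholder host; the real JIRA instance for the RTFACT project may differ.
JIRA_URL = "https://issues.example.org"

def fetch_issues(jql, fields, batch_size=100):
    """Page through JIRA's REST search endpoint and collect all matching issues."""
    issues, start = [], 0
    while True:
        resp = requests.get(
            f"{JIRA_URL}/rest/api/2/search",
            params={
                "jql": jql,
                "fields": ",".join(fields),
                "startAt": start,
                "maxResults": batch_size,
            },
        )
        resp.raise_for_status()
        batch = resp.json()["issues"]
        issues.extend(batch)
        if len(batch) < batch_size:  # last page reached
            return issues
        start += batch_size

tickets = fetch_issues(
    jql="project = RTFACT",
    fields=["issuetype", "summary", "description"],
)
```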

To get a feeling for the data, I checked the counts of tickets by their issue type. As you can see in the next image, there is a large variation across the types, with Bugs having the highest count.
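This exploration boils down to a few lines of pandas. Here, df is an assumed DataFrame with one row per fetched ticket and the “in progress” time already extracted:

```python
import matplotlib.pyplot as plt
import pandas as pd

# df is assumed to have the columns 'issuetype', 'summary',
# 'description' and 'in_progress_hours'.
share = (df["in_progress_hours"] > 0).mean()
print(f"{share:.0%} of all tickets have a nonzero 'in progress' time")

# Ticket counts per issue type (the bar chart referenced above).
df["issuetype"].value_counts().plot.bar()
plt.show()
```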

Prepare Data

As a first cleaning step, I only kept entries with a nonzero “in progress” time and removed outliers (beyond the 96% quantile). Now, keep in mind that statistical models only understand numbers, not text. To translate the strings of characters into numbers, I computed TF-IDF (term frequency / inverse document frequency) features. These numerically represent the occurrence and importance of certain words in a text document.
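A sketch of this preparation step, using scikit-learn’s TfidfVectorizer. The max_features cap is an assumption, and I additionally one-hot encode the issue type, since it shows up among the top features later on:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep tickets that were actually worked on and drop outliers
# beyond the 96% quantile.
data = df[df["in_progress_hours"] > 0]
data = data[data["in_progress_hours"] <= data["in_progress_hours"].quantile(0.96)]

# TF-IDF features over summary + description; max_features is an assumed cap.
vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)
X_text = vectorizer.fit_transform(
    data["summary"].fillna("") + " " + data["description"].fillna("")
)

# One-hot encode the issue type and stack it next to the text features
# (an assumption; the results below list issue types among the top features).
type_dummies = pd.get_dummies(data["issuetype"], dtype=float)
X = hstack([type_dummies.to_numpy(), X_text]).tocsr()
y = data["in_progress_hours"].to_numpy()

feature_names = list(type_dummies.columns) + list(vectorizer.get_feature_names_out())
```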

Data Modeling

A powerful and insightful family of models for analyzing data is decision trees / random forests. One branch of this family is gradient boosted trees. They are my model of choice due to their performance (they have won several Kaggle competitions) and their interpretability, which mainly means that we can draw further insight from the decisions made in the trees.
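A minimal training sketch with XGBoost’s scikit-learn interface; the hyperparameters shown are illustrative defaults, not the tuned values from my actual run:

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Hold out 20% of the tickets to check generalization later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative hyperparameters; the real values would come from tuning.
model = XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
```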

Evaluate the Results

So, the first question concerned a baseline for the duration of a ticket. As you can see in the next image, the mean duration per issue type spans between ~10h and ~100h. Note that the standard deviation is very large (~50h or higher), which calls for additional estimation information through, e.g., the boosted trees.
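The baseline itself is a simple group-by over the cleaned data:

```python
# Baseline: mean and spread of the "in progress" time per issue type.
baseline = data.groupby("issuetype")["in_progress_hours"].agg(["mean", "std", "count"])
print(baseline.sort_values("mean"))
```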

For the trees, the performance on the training set is good (and can be tuned to “great”). However, on the test set, the model generalizes badly. This is why I consider these to be early results. As you can see in the next image, the ground truth (blue) deviates significantly from the estimated values (orange).
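A sketch of how to reproduce this comparison on the held-out split from above:

```python
import matplotlib.pyplot as plt

print("train R^2:", model.score(X_train, y_train))
print("test R^2:", model.score(X_test, y_test))

# Sort by ground truth so the deviation is easy to see.
order = y_test.argsort()
plt.plot(y_test[order], label="ground truth")
plt.plot(model.predict(X_test)[order], label="estimate", alpha=0.7)
plt.ylabel("hours 'in progress'")
plt.legend()
plt.show()
```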

I think it’s still interesting to take a look at which keywords contribute in a positive way (longer time “in progress”) or a negative way (shorter time “in progress”):

As you can see from the results, the issue types “Bug” and “New Feature” have the largest positive impact on the estimate. On the other end of the spectrum are “error” and “com”, which have the largest negative impact. For the 15 words with the highest positive / negative impact, see the figure below.
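One way to obtain such signed per-feature contributions from a boosted-tree model is via SHAP values. This is an assumption on my part; the original analysis may have derived the signs differently:

```python
import numpy as np
import shap

# SHAP values assign each feature a signed contribution per prediction;
# averaging over the test set gives an overall positive/negative impact.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

mean_impact = shap_values.mean(axis=0)
order = np.argsort(mean_impact)
print("largest negative impact:", [feature_names[i] for i in order[:15]])
print("largest positive impact:", [feature_names[i] for i in order[-15:][::-1]])
```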

Future Work

What else needs to be done?

  1. The dataset is not very large (the model had to be trained on only ~2,200 valid samples). The next step would be to find a ticket repository with a larger number of valid tickets.
  2. Instead of only estimating the implementation time (time “in progress”), the cycle time would also be interesting to know.
  3. Is it possible to estimate (classify) the ‘resolution’ (Fixed, Duplicate, Won’t Fix, …) of a ticket?

Thanks

This was my first article on Medium! Thanks a lot for taking the time to read it. If you have any feedback or insights that you’d like to share, I’d be glad to hear them.
