It’s Not a Bug, It’s a Feature…. No, It’s a Bug

Greg Vannoni
Sep 3, 2019 · 9 min read

At first blush, you may not think that machine learning (ML) and behavioral change go together; we didn’t either. Human behavior modification requires influencing and convincing. There are a multitude of studies and prescriptions for changing human behavior, each with its own caveats. Here, we wanted to leverage our Jira data to understand the current human behavior behind bug misclassification. We took this approach after being inspired by the way Seth Stephens-Davidowitz looked at Google search trends: instead of asking people about their behavior, he looked at their actions first. We developed an ML model (with 94% accuracy) that could tell us if a given Jira ticket is a bug.

Given this knowledge, we then analyzed Jira tickets that are not classified as a bug to see if they should have been. Thanks to Stephen J. Dubner/ Freakonomics Radio, we sought to get big returns from thinking small. Instead of an elaborate behavioral change exercise, we went low effort and sent surveys to a subset of people who had misclassified bugs as non-bugs. This small survey idea turned into big returns as we’ve since noticed a reduction in misclassified bugs. When people stop misclassifying bugs, you fix customer issues faster. When you fix customer issues faster, you save money and make customers happy. Instead of relying on a human to review bugs, our model can now review thousands of Jira tickets in a matter of seconds. Thanks to this, we are now able to add automatic quality control to everything we do: automating customer happiness.

In April of 2018, our newly founded Site Reliability Engineering: Actionable Insights (or SRE-AI for short, ironic, right?) team was in search of its first problem to tackle. We examined our Jira data, which we considered to be both “clean” (that is, high quality) and “dark data” (that is, untouched by data science and machine learning wands). We sought first to break down what it means to resolve bugs faster. We knew that if we could drive down Mean Time to Resolve (MTTR), we would make a plethora of folks happy: call center teammates, technical support engineers, SREs, product owners, and maybe even some executives.

We took a look at some general problem statements that would be within our control to test.

Three main drivers for bug MTTR

The Priority Problem

We’ve got a handy matrix that is integrated into our SRE Issue Submission Portal (SISP), so we’re pretty good at roughly calculating the impact of a problem. Still, accidents happen, and some problems get misclassified as a lower priority. Higher priority problems come with a faster expected time-to-resolve service level agreement (SLA). If we’re off in our prioritization, it may take some time to realize and correct.

To address this, we developed an ML model that would predict the proper priority and subsequently escalate lower priority bugs to get them more visibility…more on that in another blog post.

The Reassignment Problem

In addition to assigning priority, the SRE Issue Submission Portal tool also routes problems to teams, but routing relies on the submitter to select the correct product or service. With many products or services, we’re bound to see bugs incorrectly routed — and you always have that “I’m not sure which product is the problem” scenario.

Analysis of MTTR when assignment is changed

We’ve done some Student’s t-test analysis (we’re talking p < 0.001) to show that a single reassignment can add days to our mean time to resolve. In the graphic above, you’ll see that when there is no assignment change, our MTTR is close to zero, but when a bug is reassigned, our MTTR is larger. We’ve got some ideas currently in flight to tackle this one, with another blog post to come!
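For readers who want to see what such a comparison involves, here is a minimal sketch of a two-sample Student’s t statistic with pooled variance. The resolution times are invented for illustration, the function name student_t is our own, and this is not the analysis code behind the graphic above.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes independent samples with
    roughly equal variances, which is the classic Student's setup.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    dof = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, dof


# Hypothetical resolution times in days, invented for illustration only.
reassigned = [4.0, 6.5, 5.0, 7.5, 6.0]
not_reassigned = [1.0, 0.5, 2.0, 1.5, 1.0]

t_stat, dof = student_t(reassigned, not_reassigned)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

A large absolute t value relative to the degrees of freedom corresponds to a small p-value; turning t into p requires the t distribution, which a statistics library would normally provide.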

The Classification Problem

Much like the curly brace postulate or its corollary, the tabs to spaces transform, developers also love the age-old debate of “bug or improvement?” In our case, we don’t have hard SLAs around improvements or features; only bugs, depending on their priority, have an expectation of how quickly they should be resolved. If a bug is classified as an improvement, we typically see an increase in time to resolve from a few days to a few weeks. As customer champions, we need to resolve customer issues as fast as possible, and this type of misclassification has the most severe impact.

Calling All Experts!

Instead of manually reviewing every feature or improvement ourselves, we went the supervised learning route by asking three valiant SREs to help us decide, after reading the comments and key details for each ticket, if the feature or improvement should have been classified as a bug. After 100 tickets were labeled, we found that a Random Forest Classifier gave us an F1 score of 0.88.
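As a reminder of what the F1 score measures, it is the harmonic mean of precision and recall on the positive class. The sketch below computes it by hand on a handful of made-up labels; the label lists and the helper name f1_score are illustrative assumptions, not our ticket data or training code.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # No true positives: precision and recall are both zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = "should have been a bug", 0 = "correctly not a bug".
expert_labels = [1, 1, 1, 0, 0, 1, 0, 1]
model_labels = [1, 1, 0, 0, 1, 1, 0, 1]

score = f1_score(expert_labels, model_labels)
print(f"F1 = {score:.2f}")
```

Because F1 ignores true negatives, it is a useful score when the class we care about (misclassified bugs) is the rarer one.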

The image below illustrates some of the inputs to the ML model along with our encoding method. We then used grid search and cross validation to select the model.

Inputs to the ML model training process

Here’s Where it Gets Interesting — Survey

In true scientific method fashion, we crafted control and experiment cohorts. Now that we had a model able to flag bugs that were incorrectly moved from bug to feature or improvement, we analyzed all historical data to find the set of people who had incorrectly moved them. After randomly dividing them between experiment and control, we sent a simple survey over email to the 150 people in the experiment group.
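Random assignment like this takes only a few lines. The sketch below assumes an even split and a hypothetical list of 300 user IDs (only the 150-person experiment group size comes from our numbers); split_cohorts is an illustrative name.

```python
import random


def split_cohorts(people, seed=0):
    """Shuffle a copy of the list and cut it in half: (experiment, control)."""
    shuffled = list(people)
    # A seeded generator keeps the assignment reproducible for later analysis.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical user IDs; the even 150/150 split is an assumption.
people = [f"user{i:03d}" for i in range(300)]
experiment, control = split_cohorts(people, seed=42)
print(len(experiment), len(control))
```

Fixing the seed means the same people land in the same cohort every time the script runs, which matters when the comparison happens weeks after the survey goes out.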

It looked something like this:

This survey was sent to our experiment group

This survey helped us learn our first valuable lesson: no one reads email. Our email lacked the ethos and pathos to make people fill out a survey. After waiting 4 weeks and sending a reminder email, we took our message to another platform: Slack.

This Slack approach got 66 additional people to complete the survey within 24 hours, compared to 46 email-based survey responses over one month, including a reminder email.

Slack is more effective than email

I’m obsessed with work hacks that make my life easier. In this case, because I was going to be reaching out to over 100 people, I wanted to make it easier on me, so I jumped into Python and scripted it as shown below:

Automation of “personal” Slack messages
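In case the screenshot does not come through, here is a minimal sketch of the general idea, not the original script. The message wording, the recipient fields, and the survey URL are all hypothetical, and the actual delivery is left to a sender function you supply (for instance, a thin wrapper around Slack’s chat.postMessage Web API method).

```python
from string import Template

# Hypothetical message wording; the real survey text is not reproduced here.
MESSAGE = Template(
    "Hi $first_name! You recently moved $ticket from bug to $new_type. "
    "Could you spare a minute for a short survey? $survey_url"
)


def build_messages(recipients, survey_url):
    """Return one (slack_user_id, text) pair per recipient."""
    return [
        (r["slack_id"], MESSAGE.substitute(survey_url=survey_url, **r))
        for r in recipients
    ]


def send_all(messages, send):
    """Deliver each message through the supplied send(user_id, text) callable."""
    for user_id, text in messages:
        send(user_id, text)
    return len(messages)


# Hypothetical recipients, invented for illustration.
recipients = [
    {"slack_id": "U0000001", "first_name": "Ada", "ticket": "ABC-101", "new_type": "improvement"},
    {"slack_id": "U0000002", "first_name": "Lin", "ticket": "ABC-202", "new_type": "feature"},
]

# Dry run: collect the messages in a list instead of calling Slack.
outbox = []
sent = send_all(
    build_messages(recipients, "https://example.com/survey"),
    lambda user_id, text: outbox.append((user_id, text)),
)
print(sent, "messages prepared")
```

Keeping the sender as a parameter makes a dry run trivial, which is worth doing before messaging over 100 colleagues.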

After the surveys were complete, we dug in and observed for a few weeks. Would we see a measurable improvement in the behavior of reclassifying bugs?


  • For both control and experiment: t=6.816, p=0.0000000. We’re calling this the “halo effect” of our survey: both the control group and the experiment group were measurably better.
  • For people who received the survey: t=5.686, p=0.0000001. Receiving the survey changed bug misclassification behavior.
  • For people who answered the survey: t=4.719, p=0.0000106. Answering the survey also changed misclassification behavior, but with slightly less confidence than for people who only received the survey.

There’s a significant behavior change for the people who took the survey (red line)

With the feedback we received from the survey, we updated our model with additional intelligence, which gave us an F1 score of 0.9153.

In general, we saw:

  1. Fewer bugs stay open when they remain classified as bugs: 26% of bugs incorrectly moved to improvement or feature are still open, compared to 7% of bugs that remained bugs.
  2. 4% fewer bugs were incorrectly classified as features or improvements after the experiment.
  3. Lower priority bugs were moved to improvement or feature less frequently after the experiment.
  4. We looked at the MTTR of all the closed tickets moved from bug to improvement or feature incorrectly before our experiment time window and compared it with MTTR of all the closed tickets that stayed as a bug in the same timeframe. Bringing in our behavior change metrics, we calculated the decrease in MTTR as millions in revenue saved over 1 year. More on this math below.

The savings calculation

Whenever people claim to save millions of dollars a year in revenue, everyone always wants to see the numbers. Instead of leaving you wanting more, I will transparently walk you through our thought process for this.

We applied our bug/non-bug ML model against 5 months’ worth of Jira data from before we sent out the surveys that created a behavior change. We gathered the MTTR for the bugs that were incorrectly moved to non-bug and compared it with the MTTR for the bugs that were closed as bugs.

For our calculation, we assumed that our behavior change was successful, in that bugs would no longer be incorrectly classified as features. Further, we assumed that if we apply our model in real time against Jira, we will always catch incorrectly categorized bugs. The comparison of MTTR by priority is shown below. A P0 bug (the highest priority and the least frequent) that remains a bug is resolved 11% faster than a P0 bug that was incorrectly moved to a feature. We resolve P1 and P2 bugs 68% and 61% more quickly, respectively, than if they were features.

We resolve bugs faster than features

Knowing that we resolve bugs faster than features, we reviewed our SRE Priority Matrix, a lookup table that correlates priority to lost revenue. Using these numbers along with our MTTR efficiencies gained by keeping bugs as bugs, we calculated millions in annual revenue savings.
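The shape of that calculation can be sketched as follows. Only the 11%, 68%, and 61% speedups come from our comparison; the baseline MTTRs, ticket counts, and revenue-per-day figures are placeholders invented for illustration, and we read “X% faster” as “MTTR is X% lower”, which is an assumption.

```python
# Speedups from the MTTR comparison: bugs kept as bugs resolve this much faster.
SPEEDUP = {"P0": 0.11, "P1": 0.68, "P2": 0.61}

# Everything below is hypothetical and does NOT come from the SRE Priority Matrix.
MISCLASSIFIED_MTTR_DAYS = {"P0": 2.0, "P1": 10.0, "P2": 20.0}
REVENUE_LOSS_PER_DAY = {"P0": 50_000.0, "P1": 5_000.0, "P2": 500.0}
MISCLASSIFIED_PER_YEAR = {"P0": 5, "P1": 40, "P2": 100}


def annual_savings(speedup, mttr_days, loss_per_day, tickets_per_year):
    """Sum, over priorities, of days saved x revenue lost per day x ticket count."""
    total = 0.0
    for priority, fraction in speedup.items():
        # Assumption: "X% faster" means the MTTR shrinks by X%.
        days_saved = mttr_days[priority] * fraction
        total += days_saved * loss_per_day[priority] * tickets_per_year[priority]
    return total


savings = annual_savings(
    SPEEDUP, MISCLASSIFIED_MTTR_DAYS, REVENUE_LOSS_PER_DAY, MISCLASSIFIED_PER_YEAR
)
print(f"Illustrative annual savings: ${savings:,.0f}")
```

With real inputs, the per-priority terms also show where the savings come from: a modest speedup on a rare, expensive priority can matter less than a large speedup on a common one.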

Our Big Takeaway

We were able to drive a cultural change, not by sending emails, but by collecting data, using experts, developing a control/experiment, getting feedback, and Slacking. This type of data-driven approach is fundamental to our decision-making process at PayPal. We worked together with experts to label Jira tickets as bug/not bug as we sprinkled in supervised machine learning to allow this solution to scale. The addition of a survey to this process, while low touch, provided a human element and the big returns from thinking small (props to Stephen J. Dubner/ Freakonomics Radio for the concept) surprised our team. We also surprised our patent team with this concept and have already filed this idea with the USPTO!

Thanks to Everyone Involved

We would have been nowhere without our experts who helped to label data: Siva Gujavarthy, Josh Hardison, Uthkarsh Suryadevara. Nilesh Vyas volunteered his busy schedule to orchestrate all of this behind the scenes during his TLP rotation.

Yifan Liu, Haiou Wang, and idea architects Jon Arney and Sree Velaga brought this project to light, flawlessly executed it, and are part of our patent submission, “Detecting Incorrect Field Values of User Submissions Using Machine Learning Techniques.”

What’s next?

Armed with both the data and the outcome we collected from our misclassification project, we had enough evidence to add an automated message when a bug is reclassified. We plan to revisit this data to ensure we continue to see the same positive trend. We would love to present this idea at conferences and talk with people about using ML for behavior change, so please share this within your network to improve our chances of doing so. Finally, PayPal SRE is hiring! Be sure to check out our jobs on PayPal Careers if you want to work on out-of-the-box and challenging projects like this!

PayPal Engineering

The PayPal Engineering Blog

Written by Greg Vannoni, Software Samurai @ PayPal
