Every Expert Has A Severity Scale: Which Is Most Effective?

Courtney Jordan
Apr 7, 2017


One of the major challenges in solidifying the position of usability testing across industries is that the severity scales used to classify how important it is to fix a problem vary from one expert usability specialist to the next. Not only does every expert have his or her own severity scale, but every company engaging in usability testing has its own scale as well. The range of products and problems makes it very difficult to apply a standard scale across all industries. Even applying a standard within the software industry alone has proven difficult, due to the wide range of software products and the biases of the existing usability specialists.

In addition to the wide range of personal biases on the part of the experts, these experts are often not even working with the actual system itself, but with written descriptions of the problem, when making their severity judgments (Nielsen, 1994, 48). It is also difficult to get good severity ratings from experts because they are focused on finding new usability problems rather than rating the problems they uncover during the evaluation session (Nielsen, 1994, 48). As Nielsen points out, each expert evaluator of a particular product will find only a small number of the usability problems, so the set of problems found by a particular evaluator will be incomplete (Nielsen, 1994, 48). Each expert's corresponding severity ratings are therefore likely to be skewed, since he or she is working with only a small subset of the uncovered usability problems.

Due to a combination of these and possibly other, as yet unknown, factors, even when considering an identical problem, different experts will rate it according to their own preferred severity scales and in the context of the problems they have either discovered or been asked to rate. This paper considers how the following problem would be assessed under several severity rating methods:

All participants are asked to complete a task of entering data into a table and saving the results. After entering the data, two of the participants forget to click the Save button. Instead they close the window using the Close Window control. The entries they made are not saved. The other three don’t have the problem because they click Save. How would each of the methods categorize this problem?

One of the factors many experts should consider when rating issues or “bugs” is the set of attributes of the problem (Wilson, 1999), but few take these into account. Expert Chauncey Wilson defines these attributes as: performance, probability of loss of critical data, probability of error, violations of standards, impact on profit, revenue or expenses, and aesthetic problems due to illegibility and clutter on a web page or graphical user interface (GUI). For the problem above, the relevant attributes are: probability of loss of critical data, probability of error, and impact on the company’s profits, revenue or expenses.

Rather than displaying a dialog box prompting the user to save his or her changes when the user attempts to exit the table, the software closes without giving the user an opportunity to save. It also does not automatically save the information to a default table name, so the user has no recourse for retrieving his or her work after incorrectly exiting the table. This goes against (Windows) user expectations: most software programs prompt users to save if they attempt to exit without doing so, and indeed many users rely on this prompt and do not save their work manually. Thus there is a high probability that users will make this error, at least the first time, resulting in lost time spent re-entering data and potentially missed deadlines in tight release schedules, which in turn would lead to decreased revenue and a tarnished reputation for the company as a whole. Because the data is not saved anywhere upon an incorrect exit, the probability of loss of critical data is also high.

Causing users to lose critical data and leaving frustrated users in its wake can be the death knell of a company that is already struggling, and can tarnish the reputation of even a well-established company. This problem has the potential to alienate users, because they feel that the software is not helping them and that the designers did not have their best interests at heart. Alienated users reject systems, which in turn causes system implementations within a company to fail. If users will not use a product, or, if forced to do so, refuse to use it at its optimal level, then the company that spent the money on the product loses. That company spreads its dissatisfaction with the product to other companies in the industry, and the company that provided the product loses not only that customer but countless current and potential future customers.
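Before moving on to Wilson’s severity levels, the attribute analysis can be pinned down in a concrete form. The short Python sketch below is my own illustration, not something from Wilson’s article; the UsabilityProblem helper and the save_on_exit name are invented for this example, and it simply records which of Wilson’s six attributes apply to the save-on-exit problem.

```python
from dataclasses import dataclass, field

# Wilson's problem attributes, as listed above (Wilson, 1999).
WILSON_ATTRIBUTES = [
    "performance",
    "probability of loss of critical data",
    "probability of error",
    "violations of standards",
    "impact on profit, revenue or expenses",
    "aesthetic problems (illegibility, clutter)",
]

@dataclass
class UsabilityProblem:          # hypothetical helper, not from Wilson
    description: str
    attributes: list = field(default_factory=list)

# Tagging the save-on-exit problem with the attributes identified above.
save_on_exit = UsabilityProblem(
    description="Closing the window without clicking Save silently discards the entered data",
    attributes=[
        "probability of loss of critical data",
        "probability of error",
        "impact on profit, revenue or expenses",
    ],
)

print(f"{len(save_on_exit.attributes)} of {len(WILSON_ATTRIBUTES)} attributes apply")
```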

After the attributes of the problem have been determined, Wilson’s severity scale can be applied. The scale has five levels. Level 1 involves a hardware or software crash. Level 2 is a severe problem causing possible loss of data, for which the user has no workaround. Level 3 is a problem such as an important function or feature not working as expected; it does not cause permanent loss of data, but it wastes time and forces people to learn a workaround. Level 4 is a minor, irritating problem that does not cause loss of data but slows users down; problems at this level violate guidelines that affect appearance and perception. The final level, Level 5, involves a minor cosmetic or consistency issue (Wilson, 1999).

Based on these levels, the afore-mentioned problem would be classified as a Level 2 severity issue, because it causes loss of data and provides no way for the user to recover the lost data. It could arguably be classified as a Level 3, since an important function, the Save function, does not operate as expected. But given the critical nature of the save functionality, failing to meet expectations here results in users losing their work with no way to recover their data, so the problem carries a higher severity than a less critical function not working as expected. Once the issue was known internally, users could work around it by making certain to save before exiting the software; however, each new employee would need to learn this workaround, and many more users would likely lose their data before they were either informed of the unexpected save behavior or discovered it the hard way.
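The reasoning above can be summarized in a small sketch. The level descriptions come from Wilson (1999) as paraphrased in this section, but the decision function is my own illustrative mirror of the argument, not an algorithm Wilson prescribes.

```python
WILSON_SEVERITY = {
    1: "Hardware or software crash",
    2: "Severe problem: possible loss of data, no way to recover it",
    3: "Important function not working as expected; wastes time, forces a workaround",
    4: "Minor, irritating problem; violates appearance/perception guidelines",
    5: "Minor cosmetic or consistency issue",
}

def rate_with_wilson(crashes: bool, unrecoverable_data_loss: bool,
                     important_function_broken: bool, cosmetic_only: bool) -> int:
    # Mirrors the reasoning in the text; Wilson (1999) describes the levels
    # in prose and does not prescribe a decision procedure like this one.
    if crashes:
        return 1
    if unrecoverable_data_loss:
        return 2
    if important_function_broken:
        return 3
    if not cosmetic_only:
        return 4
    return 5

# The save-on-exit problem: no crash, but the lost data cannot be recovered.
level = rate_with_wilson(crashes=False, unrecoverable_data_loss=True,
                         important_function_broken=True, cosmetic_only=False)
print(level, "-", WILSON_SEVERITY[level])   # 2 - Severe problem: ...
```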

Another expert, Jakob Nielsen, recommends two severity rating scales. The first, a single rating scale, rates problems from 0–4. A problem with a rating of 0 is not considered a usability problem; a rating of 1 denotes a cosmetic problem that will be fixed only if time allows; a rating of 2 denotes a minor usability problem with a low fix priority; a rating of 3 is a major usability problem that is important to fix and therefore a high priority; and a rating of 4, the most severe, denotes a usability catastrophe that must be fixed before product release (Nielsen, 1993).

Based on this scale, the afore-mentioned problem would receive a rating of 3. The fact that the program does not automatically save a user’s work, and does not even prompt the user to save when the user incorrectly exits the program, is a major usability problem. This is unexpected behavior in a critical function, and users will be upset and frustrated over the loss of their data and their time.
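For comparison, here is the same problem run through a minimal encoding of Nielsen’s single 0–4 scale. The dictionary restates the level definitions above, and the rating of 3 simply follows the argument in the preceding paragraph; none of this comes from Nielsen directly.

```python
NIELSEN_SINGLE_SCALE = {
    0: "Not a usability problem",
    1: "Cosmetic problem; fix only if time allows",
    2: "Minor usability problem; low fix priority",
    3: "Major usability problem; important to fix, high priority",
    4: "Usability catastrophe; must be fixed before release",
}

# Unexpected behaviour of a critical function (Save) that destroys the
# user's work is a major usability problem, as argued above.
rating = 3
print(f"Save-on-exit problem -> {rating}: {NIELSEN_SINGLE_SCALE[rating]}")
```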

Both of the scales above are single-rating scales that consider the nature of the problem more than the number of users who will potentially be affected by it. In Nielsen’s second scale, he “judges severity on a combination of the two most important dimensions of a usability problem: how many users can be expected to have the problem, and the extent to which those users who do have the problem are hurt by it” (Nielsen, 1993, 104). In this scale, a severity table is used:

Few users affected, small impact: Low severity
Few users affected, large impact: Medium severity
Many users affected, small impact: Medium severity
Many users affected, large impact: High severity

(Nielsen, 1993, 104).

As we can see, it is more difficult to rate the same problem using this scale. The easiest approach is to first gauge whether the problem impacts a few or many of the users (Few/Many) and then whether it causes a small or large inconvenience to the impacted users (Small/Large impact). As we have determined, the problem will likely impact 40% of the product’s users. That is a high percentage and a lot of users, so the problem already earns a Medium or High severity rating from this criterion alone. Since a user who incorrectly exits the program loses all of his or her data and cannot recover the data or the lost time spent inputting it, the inconvenience to the impacted users is large. Thus, because the problem impacts many users and causes a large inconvenience, it would be given a High severity rating.
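A small lookup makes the two-dimensional judgment explicit. The cells follow the table above; the numeric cutoff for “many” users is my own assumption, since Nielsen does not give one.

```python
# Nielsen's two-dimensional judgment: proportion of users affected x impact
# on those users (Nielsen, 1993, 104). The cells follow the table above.
SEVERITY_TABLE = {
    ("few", "small"): "Low",
    ("few", "large"): "Medium",
    ("many", "small"): "Medium",
    ("many", "large"): "High",
}

def rate_two_dimensional(proportion_affected: float, impact: str,
                         many_threshold: float = 0.25) -> str:
    """many_threshold is an assumed cutoff; Nielsen gives no numeric value."""
    users = "many" if proportion_affected >= many_threshold else "few"
    return SEVERITY_TABLE[(users, impact)]

# 2 of 5 participants (40%) lost unrecoverable work: many users, large impact.
print(rate_two_dimensional(0.40, "large"))   # High
```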

Nielsen does note that the severity rating of a usability problem would be reduced if users could learn to overcome the inconsistency (Nielsen, 1993, 105). In this case, users could be taught to click the Save button before exiting the program, providing an effective workaround. Since this workaround would keep the problem from continuing to reduce the usability of the program for experienced users (Nielsen, 1993, 105), the rating under this scale could arguably be lower.

Nielsen proposed a slightly different rating scale in 1994, in which he determines that the severity of a usability problem is a combination of three factors: the frequency with which the problem occurs; the impact of the problem, or whether it is easy or difficult for the user to overcome; and the persistence of the problem, which asks whether this is a one-time issue that users can overcome once they know about it or a problem that will repeatedly bother users over time (Nielsen, 1994, 47). He adds a fourth variable into the mix, stating that one must also assess the market impact of the problem, putting forth the idea that “certain usability problems can have a devastating effect on the popularity of a product, even if they are ‘objectively’ quite easy to overcome” (Nielsen, 1994, 47).
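Nielsen does not prescribe a formula for combining these factors, so the sketch below only records them side by side for the example problem. The field values chosen for the save-on-exit problem and the simple “needs attention” rule are assumptions added for illustration, not part of Nielsen’s method.

```python
from dataclasses import dataclass

@dataclass
class SeverityFactors:
    frequency: str      # "rare" or "common": how often the problem occurs
    impact: str         # "easy" or "hard": how hard it is to overcome
    persistence: str    # "one-time" or "recurring": does it keep bothering users?
    market_impact: str  # "low" or "high": effect on the product's popularity

    def needs_attention(self) -> bool:
        # Illustrative rule only; Nielsen (1994) leaves the weighting of
        # these factors to the evaluator's judgment.
        return (self.frequency == "common"
                or self.impact == "hard"
                or self.market_impact == "high")

save_on_exit = SeverityFactors(frequency="common", impact="hard",
                               persistence="one-time", market_impact="high")
print(save_on_exit.needs_attention())   # True
```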

According to Dumas & Redish, problems can also be rated by importance, an attribute with two dimensions: severity and scope (Dumas & Redish, 1999, 322). Severity should be measured first, so we begin with Dumas & Redish’s severity scale. The scale starts at Level 1, at which the problem prevents completion of a task. Level 2 is a problem that creates delay and frustration, such as a lack of feedback causing the user to redo the same task to make sure it has actually been done. Level 3 problems have a minor effect on usability, and Level 4 problems often point to a potential future enhancement to the product or application. Using this scale, the example problem would be rated Level 1 because, in causing the user to lose all of his or her work, the program effectively prevents the user from completing the task. If the user does not immediately understand that he or she should have clicked the Save button before exiting, the problem could continue unabated, with the user becoming more and more frustrated by the increased time and energy expended in trying to complete the task.
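Encoded as data, the Dumas & Redish severity levels and the Level 1 rating argued for above look like this; the snippet is an illustrative restatement, not anything from the book itself.

```python
DUMAS_REDISH_SEVERITY = {
    1: "Prevents completion of a task",
    2: "Creates significant delay and frustration",
    3: "Minor effect on usability",
    4: "Points to a possible future enhancement",
}

# Losing all entered work effectively prevents the task from being completed.
print(f"Save-on-exit problem -> Level 1: {DUMAS_REDISH_SEVERITY[1]}")
```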

After determining severity using the Dumas & Redish method, we must determine the scope of the problem. Scope pertains to how widespread the problem is, while severity pertains to how critical it is. The important rule of scope is that “global problems are more important than local problems” (Dumas & Redish, 1999, 322). A global problem has a scope larger than a single screen or page, while a local problem is confined to one screen, dialog box, or manual page. Since local problems do not have as widespread an impact as global problems, they are generally given a lower importance rating. Based on Dumas & Redish, the example problem would be a global problem. The program does not give the user feedback (Dumas & Redish, 1999, 322), either by prompting the user to save his or her work, which would be the expected program behavior, or by warning the user that the work will be lost unless it is saved. This makes the user feel that he or she is not in control (Dumas & Redish, 1999, 322). Thus, using Dumas & Redish’s combination of severity and scope, the problem is a global issue of Level 1 severity.
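Because Dumas & Redish state the “global beats local” rule in prose rather than as a procedure, the following sketch shows one way a team might turn severity plus scope into a priority ordering. The tuple-based sort key and the two extra sample problems are my own invention for illustration; only the save-on-exit problem comes from the text.

```python
from typing import Literal

def importance_rank(severity: int, scope: Literal["global", "local"]) -> tuple:
    """Sort key encoding Dumas & Redish's rule that global problems outrank
    local ones; the tuple representation itself is my own convenience."""
    scope_rank = 0 if scope == "global" else 1
    return (severity, scope_rank)   # lower tuple = more important

# Hypothetical problem list for illustration; only the first comes from the text.
problems = [
    ("Save-on-exit data loss", importance_rank(1, "global")),
    ("Missing feedback on one screen", importance_rank(2, "local")),
    ("Confusing label on one dialog", importance_rank(3, "local")),
]
for name, rank in sorted(problems, key=lambda p: p[1]):
    print(name, rank)
```

Sorting by the tuple lists lower severity numbers first and, within a severity level, global problems ahead of local ones, matching the rule quoted above.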

Not only does every expert have his or her own severity standards for issues; each company also has its own severity standards for usability issues. Although some might argue that usability bugs are less important than functionality bugs, the users of the product do not classify bugs and thus experience both types at the same level (Baekdal, 2005). If a user cannot print because of an error in the program, or because a cluttered, unintuitive interface prevents him or her from finding the print function, the result is the same: the user cannot print (Baekdal, 2005). Having worked as a senior software test engineer and senior quality assurance analyst at several software companies, I outline below how usability problems were rated on each company’s severity scale.

At InstallShield Software Corporation, a software installation solution company based in Schaumburg, Illinois, the afore-mentioned problem would be classified as a major bug and would definitely have been fixed prior to release, as it would have frustrated customers and damaged both business and InstallShield’s reputation as the leading provider of software installation solutions. As Desurvire notes in “Faster, Cheaper!”, because businesses are facing increasing competitive pressures [in the new global economy], companies recognize the necessity of designing more usable internal and commercial products (Desurvire, 191). Since InstallShield is a leader in its market, it is crucial that it continue providing, if not cheaper, at least better products at a faster rate than its closest competitor, Wise Installation. Thus both functionality and usability issues were a major concern, even in 2000.

At InstallShield, issues were classified on a three-level scale: minor, major, and critical. Minor bugs were cosmetic issues, such as a misspelled word, a truncated button, or some element in the software that was aesthetically unpleasing. Major bugs were cases in which something did not work as expected or as it was supposed to, such as shortcuts or registry file entries not being created upon installation, or elements of the software installation not being populated, or being improperly populated, into the resulting MSI database. Critical bugs were those that crashed the system, often requiring reboots; these bugs were fixed the same day, and if not, the schedule was adjusted accordingly. If a user’s data was not saved, this would be classified as a major bug under InstallShield’s scale. Even though a workaround does exist (clicking the Save button before exiting the program), the program does not function according to user expectations, and the bug would thus be fixed prior to release. From the perspective of a software test engineer, InstallShield’s method of rating bugs was very straightforward, and the cognitive load on new software test engineers learning this severity scale was low.

At Delorme Mapping, a software mapping company in Yarmouth, Maine, issue severity was also based on a three-level scale: minor, normal, and major. An example of a minor bug was a particular street missing from the map; a normal bug was directions instructing a user to take a right off an overpass rather than to cross the overpass before turning right; and a major bug was one in which the mapping application crashed or data was not saved. Under this severity scale, the example problem would be rated as a major bug: even with a workaround, the user’s data was not saved, and that constituted a major usability bug. According to Baekdal, classifying a bug as “normal” makes the engineering team and management feel that it is not very important; attaching “normal” to a bug strips it of any sense of urgency (Baekdal, 2005). In this scale, however, a “normal” bug is the second level of severity and, according to company standards, would have been fixed prior to release. Moreover, the very word “normal” suggests that the bug has been present in the product for several versions (Baekdal, 2005); “normal” indicates expected functionality that belongs there, which this behavior obviously does not. A more appropriate term, such as “serious”, would indicate that the bug was more important to fix (Baekdal, 2005). Rating bugs at this company was not nearly as intuitive, due partly to the terminology and partly to a new and inexperienced quality assurance department with ineffective management that did not understand the impact of each bug on users.

At Boston Communications Group Incorporated, a cellular phone billing company in Westbrook, Maine, errors were rated on a four-level scale: minor, major, normal urgent, and mission critical urgent. A minor error was an aesthetic or cosmetic issue such as a misspelled word. An example of a major error was the database program timing out and crashing, requiring the program to be restarted. A normal urgent error was one that impacted revenue, such as not being able to initiate new customer contracts in the database program, while a mission critical urgent bug was one without whose fix business could not proceed, such as the switch interface being down and leaving all customers unable to use their cellular phones. Using this scale, the example problem would be classified as a normal urgent error because it impacted revenue when the user’s data was not saved. All tables are linked in a database at Boston Communications Group, so if a user is unable to save something to a table, it is not billable, and company revenue is impacted as a result. I found bugs very difficult to rate here as well: the word “normal” in “normal urgent” still undercuts the sense of urgency, as Baekdal observes (Baekdal, 2005). Since this rating indicates that a bug will impact company finances, a more descriptive term, such as perhaps “financial urgent”, should have been used to convey this urgency.
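Setting the three company scales side by side underscores the point that the identical problem lands in a different bucket at each company. The mapping below simply restates the ratings described in the preceding paragraphs in code form.

```python
# Severity scales and the rating the save-on-exit problem received at each
# company, as described in the preceding paragraphs.
COMPANY_SCALES = {
    "InstallShield": ("minor", "major", "critical"),
    "Delorme Mapping": ("minor", "normal", "major"),
    "Boston Communications Group": ("minor", "major", "normal urgent",
                                    "mission critical urgent"),
}

SAVE_ON_EXIT_RATING = {
    "InstallShield": "major",
    "Delorme Mapping": "major",
    "Boston Communications Group": "normal urgent",
}

for company, rating in SAVE_ON_EXIT_RATING.items():
    scale = COMPANY_SCALES[company]
    print(f"{company}: '{rating}' (level {scale.index(rating) + 1} of {len(scale)})")
```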

As we have seen, the same usability issue received a different rating on every severity scale. In considering the range of severity scales, I found Wilson’s to be the most effective. Wilson’s scale encompasses catastrophic errors as well as minor cosmetic problems, covering the gamut of usability problems. It also assesses the attributes of the problem, which I felt helped capture the financial and legal impact of the problem. If a problem causes at least 40% of customers to lose their data, it has quite a financial impact on the company. First, customers are frustrated that they have lost their work, wasted their time, and must now spend the time and effort to recreate that work. Second, they have lost their trust in the system. They no longer believe that the software was designed to help them or that the designers had their best interests at heart. They feel that the software has prevented them from finishing their work; rather than being a help, it has been a hindrance. Widespread disillusionment with not only the product but the company that created it will spread throughout the customer base. This could spell financial disaster for a company striving to remain competitive in a globalized economy, where reputation and customer loyalty are critical, and customer loyalty lasts only as long as the product continues to meet and exceed customer expectations. Once a company’s reputation is tarnished by a problem that impacts so many users, it must work much harder and may never recapture its market position and customer base.

Although the problem example did not involve the other three attributes of Wilson’s severity rating method, they bear discussion. If a bug causes a program to violate industry standards, it could cost the company its standards certification, thereby causing the company to lose customers as well as face. Although a lesser problem, aesthetic issues such as hard-to-read or cluttered graphical user interfaces and websites can also mar a company’s reputation. If a competitor provides a product or website that offers similar functionality with a more usable interface, it will not take long before customers defect.

From the perspective of a veteran software test engineer, I found it easiest to classify the problem using this scale. Dumas & Redish’s scale seemed focused primarily on more minor usability issues that would not be given a high priority in a release-driven software environment. I also found Nielsen’s scale to be less specific, and appreciated the guidance Wilson provided in classifying a problem based on its attributes as well as its severity. Nielsen’s impact-table scale was good, but I believe the severity of a bug matters more than the number of times users might encounter it. The bug classifications at Delorme and Boston Communications used terminology that lacked urgency and was ambiguous. InstallShield’s scale was very straightforward, but it did not take into account financial impact and other problem attributes, which I feel would have made the scale much more robust and encompassing. Adding this dimension could also have shortened release cycles, if certain attributes, such as financial impact, were given priority over others. Wilson’s method of defining problem attributes, combined with a comprehensive and descriptive severity scale, made it easy to rate the severity of a problem not only on its own merits but on how the usability problem could negatively impact the company as a whole.

(I wrote this in 2006 as a graduate student of Bentley University’s McCallum School of Business, where I earned an MS in Human Factors in Information Design. I’m publishing it in the interest of sharing information and research and building our UX community knowledge.)

References

Baekdal, T. (2005). Usability Severity Rating — Improved. Retrieved October 11, 2005, from http://www.baekdal.com/articles/Usability/usability-severity-rating/

Desurvire, H. (1994). Faster, Cheaper!! Are Usability Inspection Methods as Effective as Empirical Testing? In D. D. Cerra (Ed.), Usability Inspection Methods (p. 191). New York, NY: John Wiley & Sons.

Dumas, J. & Redish, G. (1999). Tabulating and Analyzing Data. A Practical Guide to Usability Testing (Revised ed., pp. 322–326). London: Intellect Books.

Nielsen, J. (1993). The Usability Engineering Lifecycle. Usability Engineering (pp. 71–113). Boston, MA: Academic Press.

Nielsen, J. (1994). Heuristic Evaluation. In D. D. Cerra (Ed.), Usability Inspection Methods (pp. 47–49). New York, NY: John Wiley & Sons.

Wilson, C. (1999). Readers’ Questions: Severity Scales. Usability Interface, 5(4). Retrieved September 30, 2005, from http://www.stcsig.org/usability/newsletter/9904-severity-scale.html
