Software Reliability for dummies

[Part 1] An introductory article on software reliability and its significance, explained in lucid language.

Yogesh Singla
y.reflections
3 min read · Dec 28, 2017

--

Setting the stage

Ever wondered about the software that is running the airplane (a safety-critical system) you are sitting in? Those moments when the wheels touch down and the brakes are applied: there is software in the picture, isn’t there? Do you rely on it? How much? Software reliability is much more important than we might imagine it to be. Just as a bridge is designed and tested by civil engineers and a car by mechanical engineers (I know, engineers from different departments work together on a bridge or a car; it’s just for the analogy’s sake here!), software is tested and its reliability ensured.

Profiling

“90 percent of execution time in a typical program is spent in only 10 percent of the instructions in the program”

This 10 percent of the program is called the core; the rest is the non-core. Now, given any program, the general question arises: how do you find out which parts are core and which are non-core? The technique we use is called profiling. The tool we will use is perf. It can be installed using the following command:

sudo apt install linux-tools-common
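
Note: on Ubuntu, linux-tools-common usually installs only a wrapper script; the perf binary that matches your running kernel typically comes from the kernel-specific tools package, so you may also need something like:

sudo apt install linux-tools-generic linux-tools-$(uname -r)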

Using this sample C code:

/* A 1000x1000 global array, traversed column by column below.
   The column-major access pattern makes poor use of the cache,
   which is what makes this program interesting to profile. */
static char array[1000][1000];

int main (void)
{
    int i, j;

    for (i = 0; i < 1000; i++)
        for (j = 0; j < 1000; j++)
            array[j][i]++;   /* note the swapped indices: array[j][i] */

    return 0;
}

Save this in a file named temp.c. Next, compile it and generate a report with perf:

gcc temp.c
perf stat ./a.out

This kind of profiling shows which parts of the software can be optimized. We will not go into detail right now; the key idea is to identify the core and non-core parts. For the above example, recording a profile and then annotating it with

sudo perf record ./a.out
sudo perf annotate

gives a breakdown of the instructions and the share of execution time spent in each.

Removing a bug from a core part of the software improves reliability far more than removing one from a non-core part.
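
As a quick illustration of the kind of optimization such a profile points at (this is a sketch of the standard fix, not something the profile produces for you), swapping the indices so the array is traversed row by row makes the memory accesses sequential and cache-friendly:

static char array[1000][1000];

int main (void)
{
    int i, j;

    /* Row-major traversal: array[i][j] walks memory sequentially,
       so consecutive accesses stay within the same cache lines. */
    for (i = 0; i < 1000; i++)
        for (j = 0; j < 1000; j++)
            array[i][j]++;

    return 0;
}

On most machines, perf stat -e cache-references,cache-misses ./a.out should report noticeably fewer cache misses and a shorter run time for this version.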

Hardware vs Software reliability

[Figure: failure rate vs. time curves for hardware and software. Source: Fundamentals of Software Engineering by Rajib Mall, 4th Edition]

The curves here are fairly self-explanatory (sorry for the poor resolution :P). Hardware products follow a bathtub curve when reliability is measured by failure rate: early failures are high and are worked out under lab conditions during the burn-in phase, then the product enjoys a useful life with a low, roughly constant failure rate, after which physical wear and tear pushes the failure rate back up and reliability drops. For software, on the other hand, reliability keeps increasing after the testing phases as bugs are removed. But then, these products might become obsolete even though they are still reliable.

That’s one reason why we still see Windows XP in government offices, or floppy disks being used by the US Army for their nuclear missiles. Don’t believe me? Check this out. Since the software is reliable and working, the cost of shifting to the latest software is often judged not to be worth it.

Reliability Metrics

Rate of Occurrence of Failure (ROCOF)

  • Number of failures per unit time
  • Time variant
  • Frequency based

Mean Time to Failure (MTTF)

  • Average time between two consecutive failures
  • Time variant
  • Average based

Mean Time to Repair (MTTR)

  • Average time taken to track down and fix an error
  • Time variant
  • Average based

Mean Time Between Failures (MTBF)

  • Sum of MTTF and MTTR
  • Time variant

Probability of Failure on Demand (POFOD)

  • Likelihood that the system fails when a service request is made
  • Time invariant
  • Useful for products that do not run continuously

Availability

  • Fraction of time the system is up and usable, roughly MTTF / (MTTF + MTTR)
  • Time variant
  • Useful for products that run continuously (a small worked example follows this list)
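
To make the relationships between these metrics concrete, here is a minimal sketch in C. The failure and repair figures, and the choice of hours as the unit, are made up purely for illustration; it simply averages a small hypothetical log and derives MTBF, ROCOF and availability from it.

#include <stdio.h>

int main (void)
{
    /* Hypothetical log: hours of operation before each failure,
       and hours spent repairing each failure. */
    double uptime[] = { 120.0, 200.0, 150.0, 180.0 };
    double repair[] = {   2.0,   4.0,   3.0,   1.0 };
    int n = 4;

    double total_up = 0.0, total_repair = 0.0;
    for (int k = 0; k < n; k++) {
        total_up += uptime[k];
        total_repair += repair[k];
    }

    double mttf = total_up / n;                    /* Mean Time to Failure        */
    double mttr = total_repair / n;                /* Mean Time to Repair         */
    double mtbf = mttf + mttr;                     /* Mean Time Between Failures  */
    double rocof = n / (total_up + total_repair);  /* failures per hour of total time */
    double availability = mttf / mtbf;             /* fraction of time usable     */

    printf("MTTF = %.1f h, MTTR = %.1f h, MTBF = %.1f h\n", mttf, mttr, mtbf);
    printf("ROCOF = %.4f failures/h, availability = %.3f\n", rocof, availability);
    return 0;
}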
