Software Analysis (Basic Introduction)

ThadT
12 min read · Feb 9, 2020


Software analysis comprises methods and specialized tools for diagnosing and testing software in order to improve its quality and correctness, encompassing reliability, security, and performance. Software testing and debugging account for over 50% of the cost of software development.

Programming bugs can lead to security vulnerabilities and even serious accidents, which makes program analysis very important.

Types of Program Analysis

Program analysis is the process of automatically discovering useful facts (e.g. programming errors) about programs. There are 3 types:

  1. Dynamic — Run-time analyses. These analyses discover information by running the program and observing its behavior.
  2. Static — Compile-time analyses. These analyses discover information by inspecting the source code or binary code of the program.
  3. Hybrid — Combine aspects of both dynamic and static analyses, by combining runtime and compile-time information.

Dynamic program analysis infers facts about a program by monitoring its runs. Some tools include:

  • Purify — For checking memory accesses, such as array bounds, in C and C++ programs.
  • Valgrind — For detecting memory leaks (program fails to release memory that it no longer needs) in x86 binary programs.
  • Eraser — For detecting data races in concurrent programs where two threads in a concurrent program attempt to simultaneously access the same memory location, and at least one of those accesses is a write (the order in which the accesses occur can produce different results from run to run).
  • Daikon — For finding likely invariants (a program fact that is true in every run of the program).
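
To make the idea concrete, here is a toy sketch in the spirit of Daikon (not Daikon itself; the function and candidate invariants are my own illustrations): run the program on a set of inputs and keep only the candidate invariants that no observed run violates.

```python
# Toy sketch of likely-invariant detection (illustrative only, not Daikon).
# We observe the return value of a function across runs and keep the
# candidate invariants that were never violated.

def absolute(x):
    return x if x >= 0 else -x

# Candidate invariants over the observed return value r (my own examples).
candidates = {
    "r >= 0": lambda r: r >= 0,
    "r > 0":  lambda r: r > 0,
    "r == 0": lambda r: r == 0,
}

def likely_invariants(func, inputs):
    surviving = dict(candidates)
    for x in inputs:
        r = func(x)
        # Discard any candidate violated by this run.
        surviving = {name: check for name, check in surviving.items()
                     if check(r)}
    return sorted(surviving)

# "r > 0" and "r == 0" are each ruled out by a single observed run;
# "r >= 0" survives, but remains only a *likely* invariant.
print(likely_invariants(absolute, [-3, 0, 5]))  # ['r >= 0']
```

Note how a single run suffices to rule a candidate out, but no finite number of runs can prove the survivor.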

Static program analysis infers facts about a program by inspecting its code. Some tools include:

  • Tools such as Lint, FindBugs, and Coverity — Inspect the source code of C, C++, or Java programs for suspicious error patterns.
  • SLAM — Checks whether C programs respect API usage rules.
  • Facebook Infer — For detecting memory leaks in Android applications.
  • ESC/Java — For specifying and verifying invariants in Java programs.

Different tools can identify, for example, invariants with different effectiveness. Dynamic analysis discovers information by running the program a finite number of times, so it cannot establish facts that would require observing an unbounded number of runs; it can therefore only detect likely invariants. However, it can conclusively rule out entire classes of invariants, even from a single run that violates them.

To establish an invariant conclusively, we need static analysis. By inspecting the source code, the analysis can prove that the invariant holds in every run.

Terminology

Static analysis typically operates on a suitable intermediate representation of the program (e.g. control-flow graph which summarizes the flow of control in all possible runs of the program).

Control-Flow Graph
  • Sacrifices completeness — Since the program is not run, static analysis tracks an abstract state (e.g. whether each variable holds a constant value at each program point) instead of the concrete state, which tracks the actual values in a particular run. This may result in failing to represent the value of a variable accurately in the abstract state, and in missing variables that do have a constant value.
  • Sound — Whenever the analysis concludes that a variable has a constant value, this conclusion is indeed correct in all runs of the program.
  • Example: Iterative approximation might need to visit the same node multiple times because of loops, which can require the analysis to update facts that were previously inferred at a certain node.
Iterative Approximation
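
As an illustration of iterative approximation, here is a minimal constant-propagation sketch over a tiny hand-built control-flow graph (the graph, abstract values, and helper names are my own, not from any particular tool). It iterates until a fixed point, merging facts at join nodes:

```python
# Minimal sketch of iterative constant propagation over a tiny CFG.
# Abstract values: None (no information yet), an int (known constant),
# or "NAC" (not a constant).

def merge(a, b):
    # Join abstract values at a CFG merge point.
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else "NAC"

# CFG shape:  entry -> b1 -> b3  and  entry -> b2 -> b3
# b1 and b2 both assign x := 1; b3 just merges the two paths.
preds = {"b1": ["entry"], "b2": ["entry"], "b3": ["b1", "b2"]}
assigns = {"entry": None, "b1": 1, "b2": 1, "b3": None}

def const_prop():
    out = {n: None for n in ["entry", "b1", "b2", "b3"]}
    changed = True
    while changed:                       # iterate to a fixed point
        changed = False
        for node in ["b1", "b2", "b3"]:
            inval = None
            for p in preds[node]:
                inval = merge(inval, out[p])
            newval = assigns[node] if assigns[node] is not None else inval
            if newval != out[node]:      # a fact got updated: keep iterating
                out[node] = newval
                changed = True
    return out

# x is the constant 1 at b3, since both predecessors assign x = 1.
print(const_prop()["b3"])  # 1
```

If the two branches assigned different constants, merge() would yield "NAC" at b3, illustrating how the abstract state loses precision.
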
Cost and Effectiveness

Program analysis cannot realistically guarantee both soundness and completeness (i.e. neither false positives nor false negatives). Designing a program analysis is thus an art that involves a trade-off between termination, soundness, and completeness.

There are 3 primary consumers of program analysis:

  1. Compilers — Use program analyses to generate efficient code on a target architecture for programs written in a high-level source language
  2. Software quality tools — Use program analyses for various purposes such as finding programming errors, proving program invariants, generating test cases, and localizing the causes of errors (similar to examples on static analysis in content above)
  3. Integrated development environments — Use program analyses to help programmers understand and refactor programs (restructuring a program without changing its external behavior)
How a Compiler Uses Program Analysis to Generate Efficient Code

Testing

Software testing is the process of checking the correctness of a given piece of software. There are some characteristics of testing:

  • Program specifications have to be explicit for developers and testers to communicate properly.
  • Development and testing of a program are often done independently to ensure that errors are caught.
  • Finite resources available to development teams so not every bug can be found and caught.
  • Program specifications are not static; they evolve over time, and programming teams need to adapt.

The goal of testing is to check whether the implementation of a program is consistent with its specification. If the specification and the implementation are both wrong, testing may fail to reveal the problem. As such, the tester and developer should be independent to minimize the risk of this happening.

Also, limited resources mean that writing, maintaining, and updating tests must be prioritized along with other requirements. This problem adds to the appeal of automated testing.

Testing Approaches

Approaches to testing software can be classified according to two orthogonal axes: manual vs. automated and black-box vs. white-box.

Approaches to Software Testing

The vertical axis describes the amount of human participation in the testing process while the horizontal axis describes the amount of access the testing apparatus has to the tested program’s source code.

  • Black-box testing refers to testing where the tester can see nothing about the tested program’s internal mechanisms (issue inputs, observe outputs)
  • White-box testing refers to testing in which the internal details of the program being tested are fully available to the tester
  • Gray-box testing refers to a combination of both where some internal details are available while others are hidden

Pros and Cons

Automated Testing:

  • Find bugs more quickly
  • No need to write tests
  • If software changes, no need to maintain tests (computer will generate new tests)

Manual Testing:

  • Efficient test suite (automatically generated test suites are often bloated)
  • Potentially better code coverage

With these trade-offs, generally the best way is a combination of both approaches.

Black-Box Testing:

  • Can work with code that cannot be modified
  • Does not need to analyze or study code
  • Code can be in any format (managed, binary, obfuscated)

White-Box Testing:

  • Efficient test suite
  • Potentially better coverage

Automated Testing

We do not fully automate our testing largely because automation is difficult, especially for large programs with many states and paths. Therefore, it is almost impossible to test entire systems comprehensively.

More importantly, automation requires specifications. One common mechanism is pre- and post-conditions. A pre-condition is a predicate that is assumed to hold before the execution of some function, and a post-condition is a predicate that is expected to hold after the execution of the function whenever the pre-condition holds.

These pre- and post-conditions are most useful if they are executable, written into the program in the same language. They also need not be precise: a complete specification can become intractable or overly complex, and even partial conditions are valuable.
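
A minimal sketch of executable pre- and post-conditions, written as plain assertions in the host language (the function and its conditions are my own example):

```python
# Executable pre- and post-conditions as plain assertions (a hand-rolled
# sketch; dedicated contract tools offer richer support).

def sqrt_floor(n):
    # pre-condition: assumed to hold before execution
    assert n >= 0, "pre-condition: input must be non-negative"
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    # post-condition: expected to hold whenever the pre-condition held
    assert r * r <= n < (r + 1) * (r + 1), "post-condition violated"
    return r

print(sqrt_floor(10))  # 3
```

An automated tester can now generate inputs satisfying the pre-condition and flag any run where the post-condition assertion fires.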

Pre- and Post-Conditions

The quality of a test suite is tricky to determine. Too few tests and we may miss bugs, especially regressions. Too many tests and we get higher costs, bloat, redundant tests, and difficult maintenance. Two ways to quantify the quality of a test suite:

  1. Code coverage metrics — Metric to quantify the extent to which a program’s code is tested by a given test suite (0–100%). Higher code coverage is generally indicative of a better test suite and a better tested program (100% is very rare).
  2. Mutation analysis — Founded on the “competent programmer assumption”: the program is almost correct, apart from small errors. We therefore test variations (mutants) of the program under test, e.g. changing a minus sign to a plus. If our test suite is robust, each mutant should fail some test while the original program passes all tests; that is, the original program must be distinguishable from every mutant.

Aspects of programs used to measure code coverage:

  • Function coverage measures the number of functions called out of the total number of functions in the program.
  • Statement coverage measures the number of statements that are executed out of all statements in a program.
  • Branch coverage measures the fraction of branches of each control structure that were taken (e.g. only taking the “true” path of if-statements would result in 50% branch coverage).
  • And there are many others, such as line coverage, condition coverage, basic block coverage, and path coverage.
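
For instance, branch coverage can be sketched with hand-inserted instrumentation (real coverage tools automate this bookkeeping):

```python
# Hand-instrumented sketch of branch coverage measurement (illustrative;
# real coverage tools insert this bookkeeping automatically).

taken = set()

def classify(x):
    if x > 0:
        taken.add("if-true")
        return "positive"
    else:
        taken.add("if-false")
        return "non-positive"

classify(5)                        # exercises only the "true" path
print(len(taken) / 2)              # 0.5: 50% branch coverage

classify(-1)                       # now the "false" path as well
branch_coverage = len(taken) / 2
print(branch_coverage)             # 1.0: full branch coverage
```
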

For mutation analysis, one problem that can arise is a mutant that is equivalent to the original program, so that no test kills it. A surviving mutant then causes confusion: is the test suite lacking in robustness, or is the mutant simply equivalent? This situation is common and not easily solved, so most cases require manual human inspection.

Random Testing (or “fuzzing”)

This is one specific paradigm for automated testing. This is based on the Infinite Monkey Theorem which states that “a monkey hitting keys at random on a typewriter keyboard will produce any given text, such as the complete works of Shakespeare, with probability approaching 1 as time increases.”

Basically, we feed a program a set of random inputs and observe whether it behaves “correctly” on each input (based on its specification, or simply whether it crashes). Note the contrast with mutation analysis: fuzzing randomly perturbs the program’s inputs, whereas mutation analysis perturbs the program itself.

Random testing is only a paradigm, not a complete recipe: it requires the test inputs to be generated from a reasonable distribution, which in turn is specific to the given program.
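
A minimal fuzzing loop might look as follows (the function under test and the input distribution are my own illustration):

```python
import random

# Minimal fuzzing loop (illustrative): feed random inputs to the function
# under test and record every input that crashes it.

def buggy_div(a, b):
    return a // b          # crashes when b == 0

def fuzz(func, runs=1000, seed=42):
    rng = random.Random(seed)   # seeded so the run is reproducible
    failures = []
    for _ in range(runs):
        a, b = rng.randint(-10, 10), rng.randint(-10, 10)
        try:
            func(a, b)
        except Exception:
            failures.append((a, b))
    return failures

crashing = fuzz(buggy_div)
# Every recorded failure has b == 0 (division by zero).
print(all(b == 0 for _, b in crashing))  # True
```

The choice of distribution matters: drawing b from [-10, 10] makes the crashing input likely, whereas a wider range would find it far more slowly.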

Some case studies for fuzzing:

  1. UNIX utilities: Univ. of Wisconsin’s Fuzz study → Started in the 1990s on command-line programs (25–33% of programs crashed) and expanded to GUI-based programs in 1995. Although most bugs were left unresolved, one major one was fixed: the gets() function in C causes buffer overflows because it has no parameter to limit the length of input that is read (gets() was deprecated in favor of fgets(), which includes such a length-limiting parameter). Fuzzing can thus be effective at uncovering memory corruption errors in C and C++ programs!
  2. Mobile apps: Google’s Monkey tool for Android → The Monkey tool generates a sequence of TOUCH events (which can also be interpreted as gestures), separated by a set delay, at random pixels on the device’s display (x- and y-coordinates within the device’s screen ranges). Even more sophisticated input events (e.g. an incoming call or a change in location) can be simulated. Correct behavior can be checked across runs by varying the sequence of events or the delays between them.
  3. Concurrent programs: Cuzz tool from Microsoft → Unlike a sequential program, which consists of a single computation, a concurrent program consists of multiple threads of computation executing simultaneously and potentially interacting. Bugs can arise not only from the input but also from the order of thread execution (dictated by the OS scheduler), which is typically non-deterministic. The predominant approach today is to introduce random delays via calls to a system function Sleep(), which lowers a thread’s priority. With Cuzz, the calls to Sleep() are introduced automatically rather than manually by a human tester, and they are introduced systematically before each statement in the program. Cuzz even provides a good probabilistic guarantee on finding concurrency bugs.
Grammar of Monkey Events
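
A Monkey-style event generator can be sketched as below (the event shape, screen size, and delay are assumptions for illustration, not Android’s actual grammar):

```python
import random

# Sketch of Monkey-style random UI event generation. The event tuple
# layout and display resolution are my own assumptions.

WIDTH, HEIGHT = 1080, 1920   # hypothetical display resolution

def random_events(n, delay_ms=100, seed=0):
    rng = random.Random(seed)
    events = []
    for _ in range(n):
        x = rng.randrange(WIDTH)     # random pixel coordinates
        y = rng.randrange(HEIGHT)
        events.append(("TOUCH", x, y, delay_ms))
    return events

stream = random_events(5)
print(len(stream))  # 5
```
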

The depth of a concurrency bug is the number of ordering constraints that a thread schedule has to satisfy in order for the bug to be triggered. An ordering constraint is a requirement on the ordering between two statements in different threads.

For the case above, if the statement in thread 2 runs before the one in thread 1, an exception is thrown because the variable t is not yet defined. This bug has 1 constraint on the ordering of statements across threads, so its depth is 1.

Ordering constraints within a thread don’t count towards the bug depth, because a thread’s control flow implicitly defines constraints on the order in which statements are executed within a thread. Also, the greater the bug depth, the more constraints on program execution need to be satisfied in order to find the bug. Typically, concurrency bugs have a small depth (“small test case” hypothesis). Therefore, when we run Cuzz, we’ll restrict our search space by only looking for bugs of small depth (reducing need to run too many test cases).
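
The depth-1 scenario can be sketched by replaying both interleavings by hand (a simulation, since real thread interleaving is non-deterministic; the names are illustrative):

```python
# Sketch of a depth-1 ordering bug: "thread 2" reads a variable that
# "thread 1" must define first. We replay both interleavings explicitly
# to show which ordering triggers the bug.

state = {}

def thread1_stmt():
    state["t"] = 42              # defines t

def thread2_stmt():
    return state["t"] + 1        # raises KeyError if t is not yet defined

def run(schedule):
    state.clear()
    try:
        result = None
        for stmt in schedule:
            result = stmt()
        return ("ok", result)
    except KeyError:
        return ("bug", None)

print(run([thread1_stmt, thread2_stmt]))  # ('ok', 43)
print(run([thread2_stmt, thread1_stmt]))  # ('bug', None)
```

Exactly one ordering constraint (thread 1’s statement before thread 2’s) must hold to avoid the bug, so the bug depth is 1.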

Cuzz Algorithm

The Cuzz algorithm calls the Initialize() function once at the start of the program, and the Sleep() function before executing each instruction in each thread.

The Initialize() function randomly assigns the priority values d+1 through d+n to the n threads.

Triggering a bug of depth d requires d−1 changes in thread priorities over the entire execution. So the Initialize() function also picks d−1 random priority change points k1, …, k(d−1) in the range [1, k]. Each priority change point ki has an associated priority value of i. This is where the lower priority values 1 through d−1 get used.

The underlying thread scheduler schedules the threads by honoring their assigned priorities in the array pri[]. When a thread reaches the i-th change point (that is, when it executes the ki-th step of the execution), its priority is changed, that is, lowered, to i. This is done in the call to the Sleep() method before each instruction in each thread.

The probabilistic guarantee that Cuzz provides on finding concurrency bugs is that a bug of depth d is triggered in a single run with probability at least 1/(n · k^(d−1)), where n is the number of threads and k the number of execution steps. In other words, we expect to find the bug once in about n · k^(d−1) runs, which is a tractable number for typical n and k. This is the worst-case guarantee; typically Cuzz finds bugs with far fewer runs.
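
A much-simplified sketch of the randomized priority scheduler described above (my own simplification with a tiny two-thread workload, not Microsoft’s implementation):

```python
import random

# Simplified sketch of Cuzz-style randomized priority scheduling.
# Thread step counts, parameter values, and structure are my assumptions.

def initialize(n, k, d, rng):
    # Priorities d+1 .. d+n, randomly assigned to the n threads.
    pri = list(range(d + 1, d + n + 1))
    rng.shuffle(pri)
    # d-1 random priority change points in [1, k]; the i-th change point
    # lowers the running thread's priority to i.
    change_points = sorted(rng.sample(range(1, k + 1), d - 1))
    return pri, change_points

def schedule(threads, pri, change_points):
    # threads: per-thread remaining step counts; returns execution order.
    remaining = list(threads)
    order = []
    step = 0
    while any(remaining):
        # Run the highest-priority thread that still has steps left.
        t = max((i for i in range(len(remaining)) if remaining[i]),
                key=lambda i: pri[i])
        remaining[t] -= 1
        step += 1
        order.append(t)
        if step in change_points:            # priority change point hit:
            pri[t] = change_points.index(step) + 1   # lower to value i
    return order

rng = random.Random(1)
pri, cps = initialize(n=2, k=6, d=2, rng=rng)
order = schedule([3, 3], pri, cps)   # two threads, three steps each
print(order)
```

The single random change point here plays the role of the systematically inserted Sleep() call: it demotes whichever thread is running at that step.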

Example

In a multi-threaded scenario, since Cuzz randomly assigns initial thread priorities, the probability that Thread 1 has a lower priority than Thread 2 is one-half (and with n threads, the probability that Thread 1 has the lowest priority of all is 1/n). Thread 1 needs the lowest priority because any other arrangement might allow intervening statements that break the flow required to trigger the concurrency bug.

The next step is for thread 2’s priority to drop below thread 1’s after statement X is executed. As Cuzz picks the statements where thread priorities change uniformly over all statements, the probability of picking a point between Statement X and Statement Z is at least 1/k. Since these events are independent, the probability is at least 1/n · 1/k = 1/(nk).

For a bug of depth d, thread priorities are changed d−1 times; that is, Cuzz needs to pick d−1 statements in the program. The probability of picking the right set of d−1 statements for changing priorities is at least 1/k^(d−1), so the probability of triggering a bug of depth d is at least 1/(n · k^(d−1)).
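
Plugging example numbers into the bound (my own numbers, for illustration) shows why the number of runs stays tractable:

```python
# The lower bound above, evaluated for concrete (illustrative) values:
# P(trigger a depth-d bug in one run) >= 1 / (n * k**(d - 1))

n, k, d = 2, 100, 2        # 2 threads, 100 steps, bug depth 2
bound = 1 / (n * k ** (d - 1))
print(bound)               # 0.005: at least a 1-in-200 chance per run
```

So roughly n · k^(d−1) = 200 runs suffice in expectation for this worst case.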

Cuzz normally finds bugs more commonly than this probability (guaranteed lower bound) suggests because:

  • This measures the hardest-to-find bug of a given depth
  • While the theoretical lower bound decreases as the number of threads increases, in practice we see that the probability of finding a bug increases as the number of threads increases (more ways to trigger bugs)

Focusing on bugs with very small depths is likely to be enough to cover most of a program’s concurrency errors. Also, systematic randomization improves concurrency testing. Cuzz is also superior to stress testing due to its effectiveness, simplicity, scalability to thread count and test duration, and a low adoption barrier (fully automated).

There are pros and cons to these random testing techniques:

Pros:

  • Easy to implement
  • Provably good coverage given enough tests
  • Can work with programs in any format
  • Appealing for finding security vulnerabilities

Cons:

  • Inefficient test suite
  • Might find bugs that are unimportant
  • Poor coverage — e.g. when feeding random character strings to a compiler, the lexer will see all of these inputs and will (hopefully!) reject almost all of them as invalid Java programs. Perhaps only one thousandth of the inputs will pass the lexer and reach the parser, and even fewer will reach the backend.
Inefficiency in Later Stages of Compiler

As such, random testing should be used to complement rather than replace systematic and formal testing.
