Automated Security Testing for Developers

Most known security issues arise from undetected incorrect behaviour that doesn’t visibly break the main functionality of the software. As a result, buggy software gets shipped and used. Such bugs are of the most cunning kind: they don’t surface easily because everything seems to be working as intended. Separate testing that ensures that new commits and builds don’t introduce security problems is an essential practice for software with high security requirements. And in the modern landscape, everything is security-sensitive, or at least it should be.

The merits of continuous security testing

Security tests are just like any other kinds of tests — except that most of them don’t verify predefined behaviour. Instead, they check for the absence of certain behaviours and weaknesses known to lead to security risks.

Run repeatedly as a part of continuous integration, security tests help keep code free from:

  • memory bugs,
  • input bugs,
  • performance-hindering issues,
  • insecure behaviour,
  • undefined behaviour.

However, before the rise of continuous integration, writing and implementing such tests wasn’t everyone’s idea of a great resource allocation even within the security engineering community (although performing security checks manually is an example of an even worse time expenditure). These days, when security-related scares and threats arise repeatedly, writing such tests should become everyone’s favourite hobby because every piece of software contributes to the overall security or the lack thereof.

Security testing 101

A good place to start with security testing is to find out what security controls are actually present in your product (application, website, etc.). The next step is gaining an understanding of how these controls are affected by execution flow and input data: whether they can be altered, and whether the expected behaviour can fail unexpectedly.

The process of security testing can be divided into 4 categories:

  • Functional security tests that verify that security controls of your software work as expected.
  • Non-functional tests against known weaknesses and faulty component configurations (e.g. usage of crypto that is known to be weak, source code analysis for memory leaks and undefined behaviour, etc.).
  • Holistic security scans, in which an app or infrastructure is tested as a whole.
  • Manual testing and code review — sophisticated work that still cannot be quite algorithmised and delegated to machines so it requires human attention.

Testing security controls

Testing security controls and security components boils down to making sure that they behave as expected under chosen circumstances. Examples include mounting active and passive attacks on API calls wrapped in HTTPS, passing SQL injection patterns into user input, manipulating parameters to mount path traversal attacks, etc.
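As a rough illustration of a functional security test, here is a minimal Python sketch that throws path traversal patterns at a security control and checks that every one of them is refused. The `safe_join` helper, the `PathTraversalError` exception, and the payload list are hypothetical stand-ins made up for this example; in a real project the test would call your own control instead.

```python
import os
import tempfile


class PathTraversalError(ValueError):
    """Raised when a requested path escapes the allowed base directory."""


def safe_join(base_dir, user_path):
    """Hypothetical security control: resolve user_path inside base_dir or refuse."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, user_path))
    # commonpath equals base only if candidate is base itself or lies under it.
    if os.path.commonpath([base, candidate]) != base:
        raise PathTraversalError("path escapes base directory: %r" % user_path)
    return candidate


# Inputs an attacker might try; the control must reject every one of them.
TRAVERSAL_PAYLOADS = [
    "../etc/passwd",
    "../../../../etc/passwd",
    "uploads/../../secret.txt",
    "/etc/passwd",
]


def rejected_payloads(base_dir):
    """Return the payloads the control refused (ideally all of them)."""
    rejected = []
    for payload in TRAVERSAL_PAYLOADS:
        try:
            safe_join(base_dir, payload)
        except PathTraversalError:
            rejected.append(payload)
    return rejected
```

A CI job could then assert that `rejected_payloads(base_dir)` equals `TRAVERSAL_PAYLOADS` and that a legitimate relative path still resolves inside the base directory, so the test covers both the refusal and the normal behaviour.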

Some of the helpful open-source tools for such testing are:

  • BDD-security suite used as a testing framework for functional security testing, infrastructure security testing, and application security testing.
  • Gauntlt, a set of Ruby hooks to security tools that you can integrate into your CI infrastructure.
  • OWASP ZAP and OWASP Zapper (a Jenkins plugin), which automate an attack proxy to test some of the possible attacks.
  • Mittn, F-Secure’s security testing tooling for CI.

Testing memory behaviour

Buffer overflows and remote code execution are among the most dangerous and damaging security issues your code might create. Detecting memory problems, undefined behaviour, and other glitches that lead to attacks on execution flow can be automated through source code analysis. The warnings to look at first concern memory leaks, buffer sizes, and the like. Here your starting points should be:

Fuzzing to find vulnerabilities

“Never trust your input” is one of the cardinal rules of computer programming. And fuzzing is an automated process in software testing that takes advantage of this rule and searches for exploitable bugs through feeding random, invalid, and unexpected inputs to the tested software.

The data provided by the fuzzer is just “ok” enough for the parser not to reject it, but is rubbish otherwise and the result of feeding it into the tested software can lead to quite unexpected outcomes. This helps surface the vulnerabilities that would be undetectable otherwise. Another advantage of fuzzing is that this kind of testing is almost fully devoid of false positives (which quite often take place with static analysers).

Security-wise, fuzzing is about testing both the security controls and memory behaviour, as bugs like the famous Heartbleed (which combines poor security controls with unexpected memory behaviour) could’ve been found easily by fuzzing.

For security testing, fuzzing is especially useful for feeding the app pseudo-valid inputs that cross the trust boundary. A trust boundary violation takes place when the tested app is made to trust the unvalidated data fed into it. This approach mimics an adversary trying to feed malicious content into the app in the hope of achieving privilege escalation or plain malfunction, crash, etc.

Trying fuzzing on a filter that only lets through rubber duckies of a certain kind.

All fuzzing approaches will find vulnerabilities, and the more fuzzing is done — the better. However, to get fuzzing done right, you need to have at least a general idea of what you’re trying to accomplish. Fuzzing tests need to be well thought out, well planned, and well written. Fuzzing is also affected by the execution environment (the ripple effect), configuration, and capabilities of the test suite. It will not be a magic bullet for all your automated security testing, but it will take you far — but only as far as you’re willing to invest time and effort into preparation. To get started with fuzzing, you may want to visit this curated list of fuzzing resources.
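To make the idea more tangible, here is a minimal sketch of a mutation fuzzer in Python. Everything in it is an assumption made for illustration: `parse_record` is a hypothetical parser with a deliberately planted bug (it trusts a declared length, loosely in the spirit of Heartbleed), and `mutate` and `fuzz` are toy helpers, not a replacement for a real fuzzing tool. The fuzzer treats `ValueError` as the documented way to reject input and reports any other exception as a finding.

```python
import random
import struct


def parse_record(data):
    """Hypothetical parser: 2-byte big-endian length, then a payload ending in a newline.

    Documented contract: return the payload without the newline, or raise ValueError.
    """
    if len(data) < 2:
        raise ValueError("truncated header")
    (length,) = struct.unpack(">H", data[:2])
    if length == 0:
        raise ValueError("empty payload")
    payload = data[2:2 + length]
    # Deliberate bug for the demo: the declared length is trusted without checking
    # how many bytes actually arrived, so this index can run past the payload.
    if payload[length - 1] != 0x0A:
        raise ValueError("missing terminator")
    return payload[:-1]


def mutate(seed, rng):
    """Apply one to three small random mutations to a valid seed input."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 3)):
        choice = rng.randrange(3)
        if choice == 0 and data:      # flip a single bit
            data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
        elif choice == 1 and data:    # delete a byte
            del data[rng.randrange(len(data))]
        else:                         # insert a random byte
            data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    return bytes(data)


def fuzz(target, seed, iterations=2000, rng_seed=1234):
    """Feed mutated inputs to target; collect inputs that break its contract."""
    rng = random.Random(rng_seed)  # fixed seed keeps a CI run reproducible
    findings = []
    for _ in range(iterations):
        sample = mutate(seed, rng)
        try:
            target(sample)
        except ValueError:
            continue  # documented rejection, not a bug
        except Exception as exc:  # anything else violates the contract
            findings.append((sample, type(exc).__name__))
    return findings
```

Starting from a valid seed such as `struct.pack(">H", 6) + b"hello\n"`, the mutated inputs stay “ok” enough to get past the header check, which is exactly what lets them reach the unchecked length. Each finding keeps the offending input so it can be replayed later as a regression test.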

Testing larger entities and high-level behaviour

Apart from actually testing the code you write, it’s useful to test larger entities: whole services and infrastructural components. This matters when the product you’re developing consists of many high-level entities and micro-services, and has many autonomous dependencies.

There are vulnerability scanners that also help with automating security testing through scanning your websites and/or network for a huge number of known risks. The result of such testing is usually a list of vulnerabilities detected in your infrastructure and recommendations on how they can be patched or otherwise secured. Sometimes the patching process can also be performed by the automatic vulnerability scanners.

This is especially relevant for software that is composed from many services, libraries, and chunks of code rather than written top-down. Some of the popular free network vulnerability scanners include Open Vulnerability Assessment System (OpenVAS), Microsoft Baseline Security Analyzer, Nexpose Community Edition, Retina CS Community, SecureCheq.

It is worth noting that infrastructures should be checked when they are complete and functional (live or near-live) for the maximum impact and usefulness of the check-up.

Performance testing for security purposes

Performance testing is something that doesn’t really spring to mind in the context of “security testing”, but performance reliability is really the first step towards ensuring a safe and secure functioning of a system. One of the risks to consider is a denial of service caused by an overload or by (D)DoS-type attacks. Regardless of the possible cause, when the system is being designed you need a solid estimate of the expected load level and of the point beyond which the denial of service happens.

Despite the lack of a single yardstick for different systems, it is possible to get useful testing results. If they are recorded and published along with the characteristics of the testing platform, they will help the system architects. Running performance tests on the target equipment after the installation and configuration of the software will also yield more precise threshold levels. This, in turn, will help to configure the load-limiting and alert systems accordingly.

The quantitative results of performance testing can be:

  • The number of operations performed per unit of time (both overall and segmented into groups: normal mode of operation, invalid data, “light”/“heavy” input data, etc.);
  • The number of errors that arise during the execution of an operation (both general and type-based);
  • The amount of resources needed for each testing mode.
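The first two kinds of results can be collected with very little machinery. The sketch below is one possible way to do it in Python, under the assumption that the operation under test is a plain callable; `measure` and the group names are hypothetical, and resource consumption is left out because it is usually gathered by external monitoring.

```python
import time
from collections import Counter


def measure(operation, inputs_by_group, clock=time.perf_counter):
    """Run operation over each input group; report throughput and errors by type.

    inputs_by_group maps a group name ("normal", "invalid", "heavy", ...)
    to a list of inputs. The clock is injectable so the function can be tested.
    """
    report = {}
    for group, inputs in inputs_by_group.items():
        errors = Counter()
        start = clock()
        for item in inputs:
            try:
                operation(item)
            except Exception as exc:
                errors[type(exc).__name__] += 1  # type-based error statistics
        elapsed = clock() - start
        report[group] = {
            "operations": len(inputs),
            "ops_per_second": len(inputs) / elapsed if elapsed > 0 else float("inf"),
            "errors_total": sum(errors.values()),
            "errors_by_type": dict(errors),
        }
    return report
```

Storing such a report next to a description of the test machine makes results from different runs comparable, which is what the system architects need when they set load limits.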

Three different approaches towards the methodology of running such tests exist. The suitable type is selected depending on the end goal:

  • Research of a separate software block (or a group of blocks). The goal of such testing is to find the “bottlenecks” and to estimate the performance of a particular item. In this case, the tested block should run with its regular resource allocation, while all the other blocks are not limited in their resources.
  • Complex testing. In this case, the main goal is to evaluate the interaction between separate elements of the software tested as a whole. The testing environment may differ from the real target environment.
  • Testing with approximation of real conditions. The goal of such testing is to evaluate software performance under the anticipated operating conditions. In this case, the testing environment should emulate the real target working conditions as closely as possible and eliminate the influence of additional side-components.

At the development stage, a sudden change (either upward or downward) in the app’s performance can serve as an indicator of an error. A sudden decrease in the performance stats can be caused by errors that lead to performance degradation and overload failures. A sudden increase in performance can indicate changes in the logic of the working pieces of code, e.g. an erroneous exclusion of the incoming data validation step.

When things work “better” than expected, start worrying.
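A check of this kind is easy to automate. The following Python sketch compares a fresh throughput measurement with a stored baseline and flags a change in either direction; the `classify_change` name and the 20% tolerance are assumptions chosen for illustration, and a sensible tolerance depends on how noisy your measurements are.

```python
def classify_change(baseline_ops, current_ops, tolerance=0.2):
    """Compare throughput with a baseline and flag changes in both directions.

    Returns "slower", "faster", or "ok". A "faster" result deserves the same
    attention as "slower": a step such as input validation may have been skipped.
    """
    if baseline_ops <= 0:
        raise ValueError("baseline must be positive")
    change = (current_ops - baseline_ops) / baseline_ops
    if change < -tolerance:
        return "slower"
    if change > tolerance:
        return "faster"
    return "ok"
```

For example, with a baseline of 1000 operations per second, a result of 1500 is reported as "faster" and would fail the build just like a result of 700.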

There are numerous testing patterns, but the following three are the most relevant to the subject:

  • Stress testing. A step-by-step load increase which helps to identify the performance limits. It also helps to evaluate the estimated nominal load level.
  • Endurance testing. A long-term software performance evaluation. It helps to find memory leaks and cumulative errors.
  • Spike testing. Testing with sudden spikes in the load. Helps surface the problems that can arise during the breakdowns in the normal functioning of balancing systems, routing, and during (D)DoS attacks.
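The step-by-step load increase used in stress testing can be sketched in a few lines of Python. Here `run_step` is a hypothetical callable that pushes a given number of operations through the system and returns how many of them failed; the function name, the step sizes, and the 1% error threshold are all assumptions for this example.

```python
def find_load_limit(run_step, start=10, step=10, max_load=1000, max_error_rate=0.01):
    """Raise the load step by step; return the highest load that stayed healthy.

    run_step(load) must return the number of failed operations out of `load`.
    Returns 0 if even the starting load exceeds the allowed error rate.
    """
    last_good = 0
    load = start
    while load <= max_load:
        failed = run_step(load)
        if failed / load > max_error_rate:
            break  # performance limit reached at this step
        last_good = load
        load += step
    return last_good
```

Passing a `run_step` that sends invalid or “heavy” data instead of normal requests turns the same loop into a combined test of the validation mechanism under load.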

It is recommended to expand the classic performance testing methods by combining them with other kinds of testing. This will provide a more complete picture than running performance tests only in the regular modes of work. For instance, stress testing with invalid input data makes it possible to assess how the validation mechanism works. Endurance testing with decidedly “heavy” data makes it possible to get a valid estimate of the resource consumption.

Performance testing for security purposes is something that needs paying attention to because the popular attack methods are often based on the attempts to cause (D)DoS through invoking non-standard operation modes. The attackers rightly assume that in this case, most products will experience a striking performance drop and the deep branching of the program logic may be untested for such cases and, as a result, vulnerable.

Incident recovery testing

Most likely, you do backups for the data inside your system. Data recovery can be a stressful scenario in itself, and it doesn’t need the additional pressure of worrying whether the backups are valid or not. The solution, of course, is to test that backups have worked by restoring data. Testing the backups is a hard and time-consuming task that doesn’t yield obvious immediate returns.

Backup testing should include testing of physical recovery, virtual recovery (using virtual environments), data recovery, and full application recovery. In a perfect world, every backup should be tested after it’s created, but a more practical approach would be to include backup testing into the regular backup cycle or perform it after significant changes in the application or application data.
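For the data recovery part, a basic check can be automated along the following lines. This Python sketch assumes that the backup is an archive that `shutil.unpack_archive` can read and that a manifest of file digests was recorded when the backup was made; `tree_digest` and `verify_backup` are hypothetical helper names, and a real system would add application-level checks on top.

```python
import hashlib
import os
import shutil
import tempfile


def tree_digest(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def verify_backup(archive_path, expected_digests):
    """Restore the archive into a scratch directory and compare with the manifest."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(archive_path, scratch)
        return tree_digest(scratch) == expected_digests
```

The manifest would be produced with `tree_digest(source_dir)` right before the backup is taken, and `verify_backup` would run as a part of the regular backup cycle. A `False` result means the archive restores to something other than what was backed up.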

Is it possible to forego the backup testing and only concentrate on testing the main system, hoping the backups will just mirror it? Well, they say nothing’s impossible in this world, but assuming that something works is not the same as testing it and knowing for sure. A story of struggle and loss (of several hours’ worth of backups) would be the time when some untested backups failed GitLab.

The challenges of continuous security testing

Having worked on products which underwent extensive security testing for many years, we see how frequently people get uncomfortable with running an additional test infrastructure to ensure security. We’re still strong believers that even though ensuring code security always brings usability trade-offs, it’s worth it.

The downside? Security tests are not your regular unit tests or functional tests: they take a long time to run, and sometimes they need considerable time to accumulate data. Depending on the type and criticality of development, you might be tempted to run them in parallel with the main test pipeline, as non-blocking tests. But this is laziness at its worst because when you’re (finally) getting serious about security, you must only run them as blocking tests.

A slower testing process sometimes leads to the detection of dependency failures and vulnerabilities not directly related to your code. For example, last year we had to deal with a non-transparent change in the compiler’s behaviour towards external dependencies (C imports in Go, which changed significantly from Go 1.3 to Go 1.6). It could’ve manifested as a serious security issue if our testing and benchmarking hadn’t included volume tests on input. Had we not tested this beforehand, it would have been a ticking time-bomb, even though Go is known to be an extremely safe language when it comes to memory, and such issues should never have emerged.

Wider scope, larger problems

Running tests on larger entities has its own issues and pitfalls. While testing for vulnerabilities in your product or website, automated vulnerability tests are basically trying to wreak as much havoc and do as much damage as possible in the process. Better them than an actual enemy, but the process, if the testing is being carried out on a live infrastructure, can result in breaking down your application (e.g. when malicious injections or simulated (D)DoS attacks work too well). Your email can be flooded, logs can overflow, sensitive links can be crawled and exposed for the whole world to see, and the server can go down due to the overly (let’s say) efficient work of automated vulnerability scanners on whatever they’ve been set out to battle-test.

A diligent approach to automated security testing can sometimes make you feel slightly overwhelmed.

Most scanners provide settings that allow you to choose and restrict the processes that you do not wish to be tested (or to be carried out on the tested material). Still, it is better to see everything messed up and broken once, intentionally, and to fix it knowing that the worst has already happened, without terrible consequences and with your total control and blessing. The sorry alternative is seeing some minuscule and (previously) seemingly irrelevant component that was left out of the check become the entrance into your system for a truly malicious outside attacker.

Sometimes security tests require human interpretation. Such tests are better run on builds, rather than on every commit, but they are challenging for modern CI/CD approaches.

From the trenches (How we test)

At Cossack Labs, working to build security products ourselves, we carry out automated security testing wherever we can, using both ready-made third party testing suites and our own heavily customised solutions.

For example, we run automated tests on Themis, our multi-platform cryptographic services library, which is the foundation of most of our products. During the build process, we check every commit with CircleCI, which runs a set of tests on the whole code base. In CI, every commit gets checked by:

  • Valgrind as a dynamic analysis tool that detects memory leaks and memory management problems because we understand that it’s poor memory management that is often the source of the most catastrophic bugs.
  • Splint as a static analysis tool that detects potential poor coding practices.
  • A number of cryptography-specific tests to ensure that no errors in cryptographic dependencies and random number generation might creep into this exact build.
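To give an idea of what cryptography-specific checks can look like, here is a small Python sketch. It is not taken from the Themis test suite; the function names and the thresholds are assumptions made for this illustration. It combines a known-answer test (the standard SHA-256 test vector for the message "abc") with a crude monobit frequency check on the system random number generator.

```python
import hashlib
import os

# Standard SHA-256 test vector for the three-byte message "abc".
SHA256_ABC = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"


def sha256_known_answer_ok():
    """Known-answer test: catches a broken or miscompiled hash dependency."""
    return hashlib.sha256(b"abc").hexdigest() == SHA256_ABC


def monobit_ok(sample, sigmas=6.0):
    """Crude frequency check: the share of 1-bits should be close to one half.

    For n fair bits the count of ones has mean n/2 and standard deviation
    sqrt(n)/2. Passing says little about quality, but failing is a loud alarm.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    n_bits = len(sample) * 8
    ones = sum(bin(byte).count("1") for byte in sample)
    return abs(ones - n_bits / 2) <= sigmas * (n_bits ** 0.5) / 2


def rng_sanity_ok(n_bytes=125000):
    """Draw bytes from the OS generator and apply the frequency check."""
    return monobit_ok(os.urandom(n_bytes))
```

The wide six-sigma bound keeps false alarms practically out of CI while still catching a generator that returns constant or heavily biased output, such as a buffer that was never filled.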

Valgrind and Splint are used for testing the very core of the Themis library, which is written in C. Since there are also a number of wrappers written in other popular languages, those wrappers are tested using the standard language-specific means. For example, wrappers in Python and Ruby are tested with unittest and test-unit respectively. Testing mobile wrappers for iOS and Android has its own difficulties: the Android emulator needs 5–10 minutes just to start up, and iOS testing requires using the macOS platform.

You can study the Themis GitHub repository for crypto-specific tests and the approach towards running all of that in CircleCI.

Test environment means a lot

An important aspect of automated testing is the engine which performs the tests. While using CircleCI is great for public repositories with stable products, it’s not the most efficient tool for projects in active development, where test suites evolve with the product and test scenarios change overnight.

For that, internally we use the BuildBot continuous integration framework, which provides very high flexibility of scenarios: blocking, non-blocking, containers, types of deployments, types of artefacts gathered. Everything is extremely configurable. While regular tests can be written to fit the system, involving third party instrumentation to test security properties sometimes requires complex integration scenarios, and having a flexible CI framework helps leave no stone unturned.

Security testing is still not a fully automated endeavour

Similarly to code reviews in traditional software development, having a human eye on code changes is a must: some behaviours just cannot be detected automatically. Having a third party review of a major release is a crucial practice security-wise.

Summing things up

The nature of work most of us have been doing for the larger part of our careers is such that the security issues are the first that come to mind when developing and testing software. Which is kind of backwards as compared to the non-security related developer community that is focused on shipping fast (always), consistent (sometimes), and reliable (rarely). Reading the news about yet another breach, the maxim “everything will be broken” rings as true as before, with no chance of changes for the better in the foreseeable future due to the insulting carelessness ubiquitously practised security-wise. And apart from usability trade-offs and plain sloppiness, it’s always a question of knowing how and what should be tested.

Well, now you’ve got a few reference points.

If you would like to add something about the processes in automated security testing or have a story to share — we’d love to hear from you! Please reach out to us via info@cossacklabs.com or @cossacklabs.