Do you Waterproof your Software?

Osher El-Netanany
Published in Israeli Tech Radar
12 min read · Mar 15, 2022
waterproof?

In a world where we rush to every release, and releases are continuous and frequent, what do you do to safeguard your software? What can you do? What are the options?

From a definition of what exactly CI/CD is:

Software CI is a process whose goal is (…) a distribution that is as proven as possible (…).

While “as proven as possible” is subject to interpretation, I claim it boils down to using as many tests as the budget allows, the budget being both money and time.
If there is a test you can afford that would add to your confidence, it should be part of your strategy.

Tests, Generally speaking

The goal of tests is that when at least one of them fails, the brakes are pulled, stopping issues and bugs from compromising SLAs and SLOs.

Automated tests means tests that are run by code.
This divides the code produced by developers roughly into two:

  • Everything that gets to be part of the distribution — a.k.a shippable, or production code.
    Ideally, this is the bare minimum that is necessary to provide the value to the users of the software.
  • All the rest is auxiliary, or non-production code.
    Utility programs and configurations that help build the distribution, proof it, test it, ship it, debug it, develop against it, monitor it, support it, etc.

Who guards the guard?


The non-production part has a crucial, direct impact on the resulting distribution even when it’s not strictly a part of it. Plus, it’s essentially still code and requires maintenance.

Thus, the non-production code should face the same standard as the production code. Despite that, many a team relaxes some of the rules concerning the non-shippable parts, resulting in a double standard.

We can debate whether the efficiency of test code is as important as the efficiency of production code. However, a bug in the test code is all too likely to mean a bug in production…

I myself am an advocate against the double-standard:

All code is code, and should comply with all the code standards applicable to it.

SUT = System Under Test

Regardless of whether we proof the shippable or the auxiliary code, I’ll refer to it here as the SUT, which stands for System-Under-Test.

You’ll find in the literature many forms of (X)UT, where X can be C for Class or Code, D for Deployment, E for Environment, F for Function, H for Hardware, M for Module, P for Process, Product or Platform, U for Unit, S for Service, Server or Software, and more, all of which are special cases of a system. So yeah: SUT.

Test Code

The lion’s share of the auxiliary code is programs that are developed side by side with the shippable code, usually by the same team, and that, when run, make sure the shippable part is functional, performant, and meets all its requirements.

Automated tests done right can express requirements and become an automated qualification criterion for the shippable code. This has led to trends of writing them first (TDD, BDD). Some tools even aim to make them human-readable text, accessible to non-coders (e.g. cucumber & gherkin).

Ratchet

While all tests aspire to give a pass/fail answer, some types of tests result in a quantifiable metric that is compared to a bar to make the pass/fail decision. When this bar is raised every time a new peak is reached, we commit to continuous improvement. When the bar moves automatically, we call it a Ratchet.

Up the bar on every win

The brainy part about Ratchets is not just how to measure and store the bar, but how to design a mechanism flexible enough to let humans override it or update its bar whenever it gets corrupted with an unrealistic number or impedes you from delivering as fast as your business requires.
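As a concrete illustration (a sketch of my own, not a prescribed implementation), a minimal upward ratchet can keep its bar in a committed JSON file: the build fails when the metric drops below the bar, the bar rises whenever a new peak is reached, and humans can always edit or delete the file to override it. The file name and the metric source are assumptions for the example.

```ts
// ratchet.ts - a minimal upward ratchet; the bar lives in a committed JSON file
import { existsSync, readFileSync, writeFileSync } from "node:fs";

const BAR_FILE = ".ratchet.json"; // hypothetical location, committed to the repo

export function ratchet(metric: number): void {
  // read the current bar; humans can edit or delete this file to override the ratchet
  const bar: number = existsSync(BAR_FILE)
    ? JSON.parse(readFileSync(BAR_FILE, "utf8")).bar
    : 0;

  if (metric < bar) {
    console.error(`ratchet: metric ${metric} is below the bar ${bar}`);
    process.exit(1); // pull the brakes: fail the build
  }

  if (metric > bar) {
    // a new peak: raise the bar so the team can never regress below it
    writeFileSync(BAR_FILE, JSON.stringify({ bar: metric }, null, 2));
    console.log(`ratchet: bar raised from ${bar} to ${metric}`);
  }
}

// usage, e.g. with a coverage percentage computed elsewhere in the build:
// ratchet(87.4);
```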

Now that we’ve aligned on some terms, let’s explore the prominent test types applicable to integration flows.

Static Code Analysis

Static means proofing the code without compiling or running it. These analyzers don’t care if the code compiles or runs (although many of them can detect broken code); they are interested in the code itself, from which they deduce their issues with it.

Architecturally, they are built from a processor and a modular set of reusable rules that teams can choose to include.
A complete selection of rules makes a policy.
Custom policies are usually a part of the auxiliary code in a codebase.
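To make that architecture tangible, here is a toy sketch of my own (not any real analyzer): a rule is a small function that inspects source text and reports findings, and a policy is simply the set of rules the team chose to include.

```ts
// static-analysis.ts - a toy processor plus modular rules, to illustrate the architecture
type Finding = { rule: string; line: number; message: string };
type Rule = (source: string) => Finding[];

// two example rules; real analyzers ship hundreds of these
const noTodoComments: Rule = (source) =>
  source.split("\n").flatMap((text, i) =>
    text.includes("TODO") ? [{ rule: "no-todo", line: i + 1, message: "unresolved TODO" }] : []
  );

const maxLineLength = (limit: number): Rule => (source) =>
  source.split("\n").flatMap((text, i) =>
    text.length > limit ? [{ rule: "max-len", line: i + 1, message: `line exceeds ${limit} chars` }] : []
  );

// the policy: the selection of rules this team chose to include
const policy: Rule[] = [noTodoComments, maxLineLength(120)];

// the processor: runs the whole policy over a file and reports the findings
export const analyze = (source: string): Finding[] =>
  policy.flatMap((rule) => rule(source));
```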

A known issue with all static code analyzers is their rate of false positives. As such, they will produce noise that requires a human to go over and decide upon, and that is their main ongoing cost.

Lint

Aimed mainly at the form of the code.
Lint policies are called Code-Styles.

Most lint rules fall under one of these motives:

  • prohibit forms that are known to be prone to bugs
    E.g. unused variables, use of undeclared variables, unhandled errors, suppressed errors, over-nested inline-if expressions, overly complicated expressions, abused abstractions, and more.
  • prohibit forms that are known to reduce maintainability
    E.g. short or cryptic names, code-organization problems, over-complex expressions, deviations from the team’s preferred style.
  • enforce a single form that is easy to get used to
    This basically creates consistency by preferring a single way to do things out of all the ways the language supports.
    E.g. declaration styles, spacing around reserved words and syntax delimiters, bracket styles, empty lines, block indentation, maximum line length, maximum statements in a function, statement termination (semicolons), and more.
    This is the part for which linters are known to limit your liberty and hurt your feelings, for the price of more consistency in the team.

An eternal Trade-off

Adding features to a language makes it more expressive, letting it express the same things in many ways, but it raises the bar for maintainers and the barrier to entry for new team members.

Linters, on the other hand, keep code simple-minded by limiting it to a narrow, preferred selection of all the forms supported by their target languages (preferred by… who?).

Lint tools for each file
While each programming language comes with its own styles and linting tools, it is still common to see linters in one language lint another, especially for peripheral or markup languages that do not have a runtime of their own, like HTML, CSS, JSON, YAML, Markdown, and more.

Lint an old repo
Applying a linter to an existing repo can result in a huge number of rejects, and applying all the fixes in one big bang could be risky. An alternative is to ratchet down the number of rejects: e.g. fail the build if the number of rejects increases, and drive it down with every build.
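A minimal sketch of such a downward ratchet, the mirror image of the upward one shown earlier (the file name is illustrative):

```ts
// lint-ratchet.ts - fail the build when the reject count grows; lower the bar when it shrinks
import { readFileSync, writeFileSync } from "node:fs";

export function ratchetDown(rejects: number, barFile = ".lint-bar.json"): void {
  const bar: number = JSON.parse(readFileSync(barFile, "utf8")).bar;
  if (rejects > bar) {
    console.error(`lint rejects grew: ${rejects} > ${bar}`);
    process.exit(1);
  }
  if (rejects < bar) {
    // fewer rejects than ever before: lock in the improvement
    writeFileSync(barFile, JSON.stringify({ bar: rejects }, null, 2));
  }
}
```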

Dependencies scan

Prohibit the use of dependencies that are deprecated or have known show-stopper issues (usually security exploits).

The older a version is, the more time hackers have had to find its exploits, and the fewer fixes it has for exploits that were found before.

Dependency scanners usually offer a free tier of detection, while paid tools will not only detect issues but also offer mitigations when available (e.g. snyk).
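A toy sketch of the detection side, assuming a hypothetical, hard-coded advisory list; real scanners such as snyk pull this knowledge from a live vulnerability database.

```ts
// dep-scan.ts - a toy dependency scan against a hypothetical advisory list
import { readFileSync } from "node:fs";

// in reality this comes from a live vulnerability database; here it is a stub
const advisories: Record<string, { fixedIn: string; issue: string }> = {
  "some-lib": { fixedIn: "2.3.1", issue: "remote code execution" },
};

// naive semver-ish comparison, good enough for the illustration
const isOlder = (a: string, b: string): boolean => {
  const [x, y] = [a, b].map((v) => v.split(".").map(Number));
  for (let i = 0; i < 3; i++) if (x[i] !== y[i]) return x[i] < y[i];
  return false;
};

const deps: Record<string, string> =
  JSON.parse(readFileSync("package.json", "utf8")).dependencies ?? {};

const findings = Object.entries(deps)
  .filter(([name, version]) =>
    advisories[name] && isOlder(version.replace(/^[\^~]/, ""), advisories[name].fixedIn))
  .map(([name, version]) => `${name}@${version}: ${advisories[name].issue}`);

if (findings.length) {
  console.error(findings.join("\n"));
  process.exit(1);
}
```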

License Scans

Open-Source Software (OSS) comes with a license that defines its applicable legal use. However, not all licenses are born equal: some of them can entitle the rights-owner of the OSS to claims that could undo a business model.

License scans enforce a license policy so that none of the 3rd-party code in use comes with a license that could hinder business growth or entitle its rights-owner to destructive claims.
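A minimal sketch of such enforcement: check each installed package’s declared license field against an allowlist. The allowlist shown is only an example; what is acceptable is a legal and business decision, not a technical one.

```ts
// license-scan.ts - reject dependencies whose declared license is not on the allowlist
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// example allowlist; your legal counsel decides what actually goes here
const ALLOWED = new Set(["MIT", "ISC", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"]);

const offenders: string[] = [];
for (const name of readdirSync("node_modules")) {
  if (name.startsWith(".")) continue;
  try {
    const pkg = JSON.parse(readFileSync(join("node_modules", name, "package.json"), "utf8"));
    if (pkg.license && !ALLOWED.has(pkg.license)) offenders.push(`${name}: ${pkg.license}`);
  } catch {
    // not a plain package directory (e.g. a @scope folder); skipped for brevity
  }
}

if (offenders.length) {
  console.error(`disallowed licenses:\n${offenders.join("\n")}`);
  process.exit(1);
}
```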

Security Scans

Focused on detecting security exploits, most of which result from preventable bad practices, e.g. saving passwords in code, code-injection vulnerabilities, etc.

Such scans usually require complex analysis and accumulated knowledge obtained with greater effort than the former ones, so many of them require a license, either for their scanning capabilities or for tools that help track all the issues they find (e.g. whitesource, sonarqube, fortify).

Like lint, once applied to an old codebase, one should expect a long list of rejects. Here too, a downward ratchet on the number of detected issues can ease the assimilation period.
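The simplest member of this family can even be home-grown: a naive scan for hard-coded secrets. A sketch, with purely illustrative patterns; real scanners maintain far richer rule sets and far smarter analysis.

```ts
// secret-scan.ts - naively flag hard-coded secrets in a source file
import { readFileSync } from "node:fs";

// illustrative patterns only
const patterns: Record<string, RegExp> = {
  "password assignment": /password\s*[:=]\s*["'][^"']+["']/i,
  "AWS access key id": /AKIA[0-9A-Z]{16}/,
  "private key block": /-----BEGIN [A-Z ]*PRIVATE KEY-----/,
};

export function scanFile(path: string): string[] {
  const findings: string[] = [];
  readFileSync(path, "utf8").split("\n").forEach((line, i) => {
    for (const [name, re] of Object.entries(patterns)) {
      if (re.test(line)) findings.push(`${path}:${i + 1} looks like a ${name}`);
    }
  });
  return findings;
}
```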

Build / Compile

This stage is the core of CI for every language in which the files that developers author are not the files used by the runtime, and it is an opportunity to perform many validations on the code that would otherwise surface only at runtime.

The name of this stage is the reason a distribution is also known as a build. It is the father of all static code analysis, as every compiler starts with its own static scan using the language’s own set of rules.

But on top of this scan — it has a target task:

Compilers read all the source files into a model in which all parts of the program are connected, validate that all the parts connect well and there are no loose ends, and finish by outputting the files that the runtime consumes.

Thus, on top of a report of rejections and warnings, upon success it produces the build.
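In the TypeScript ecosystem, for example, the essence of this stage fits in a few lines of the compiler API: build the program model, collect diagnostics, and emit only when the model holds together. A sketch; the entry file and compiler options are illustrative.

```ts
// build.ts - compile, report diagnostics, and emit only on success
import ts from "typescript";

const program = ts.createProgram(["src/index.ts"], { // entry file is illustrative
  outDir: "dist",
  strict: true,
  module: ts.ModuleKind.CommonJS,
  target: ts.ScriptTarget.ES2020,
});

// the static scan plus model validation: syntax, types, unresolved references, etc.
const diagnostics = ts.getPreEmitDiagnostics(program);
if (diagnostics.length) {
  for (const d of diagnostics) {
    console.error(ts.flattenDiagnosticMessageText(d.messageText, "\n"));
  }
  process.exit(1); // rejections reported, no build produced
}

program.emit(); // success: the files the runtime consumes land in dist/
```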

Functional Tests

Show that the SUT does what it is expected to do — i.e. that it’s functional.

The Elephant in the Room

This is the elephant in this room: the best-known type of test. Sadly, most times when dev teams say tests they mean functional tests, as if there were no other type of test in this world.

Being that well known, I will only say that they divide roughly into unit tests, which proof each unit in isolation and are good at the details; end-to-end tests, which proof that it all works together, at the expense of the details; and component-level tests, which fumble around the semantics of what makes a unit or a whole, trying to win the best of both worlds.
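A minimal example of the unit flavor, using Node’s built-in test runner; the function under test is, of course, made up.

```ts
// price.test.ts - a unit test proofs one unit in isolation, down to the detail
import { test } from "node:test";
import assert from "node:assert/strict";

// the unit under test: a made-up example
export const applyDiscount = (price: number, percent: number): number =>
  Math.round(price * (1 - percent / 100) * 100) / 100;

test("applies a 10% discount", () => {
  assert.equal(applyDiscount(200, 10), 180);
});

test("a 100% discount means free", () => {
  assert.equal(applyDiscount(50, 100), 0);
});
```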

There’s a long discussion to be made about these semantics and how they serve a coverage strategy, but it is too long to be put here and deserves a post of its own. Anyway, that brings us to:

Minimum Coverage

These are tools that monitor the SUT while it is being tested by the test code, and are able to tell which parts of the SUT got to play a role in that test, and how many times they were activated.

They produce a coverage breakdown and a bottom line: the total percentage of the code that was exercised, a.k.a. the coverage. Ideally this is done after consolidating the coverage data from ALL of the tests in the codebase.

The check fails when the established coverage does not meet a minimum level.
I.e. the coverage uses the SUT to proof the test code: the build will fail when the test suite is not covering enough (it guards the guard).

Being a ratio between the code exercised in tests and the total code, coverage decreases when uncovered production code is added, or when tests are removed or modified so that they no longer involve code units that were covered before.

Coverage rate is the most common ratchet, ensuring coverage can only go up (e.g. coveralls, codecov, sonarqube).
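A minimal sketch of the gate itself, assuming a coverage-summary.json in the format produced by Istanbul-style reporters; the path and the threshold are illustrative.

```ts
// coverage-gate.ts - fail the build when total line coverage drops below the minimum
import { readFileSync } from "node:fs";

const MIN_COVERAGE = 80; // illustrative; combine with a ratchet so it can only ever rise

const summary = JSON.parse(readFileSync("coverage/coverage-summary.json", "utf8"));
const actual: number = summary.total.lines.pct;

if (actual < MIN_COVERAGE) {
  console.error(`coverage ${actual}% is below the required ${MIN_COVERAGE}%`);
  process.exit(1); // the guard of the guard pulled the brakes
}
console.log(`coverage ${actual}% meets the ${MIN_COVERAGE}% minimum`);
```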

Performance Tests

OK, your iRobot Roomba gives a clean end result. But will it do it in time? Performance tests ask similar questions about your software.

Benchmark

This type of test is focused strictly on the performance metrics of the SUT, usually speed and/or memory consumption. The test is run, the metrics are gathered, and the test fails if a baseline is not met. This stops code that impedes performance or abuses memory.

In CI the result is compared to a baseline, to which you can apply a Ratchet, although I rarely see that done.
Outside CI, however, benchmarks can compare the SUT to its competition…
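A bare-bones sketch of the idea: time an operation over a few runs and compare the average against a committed baseline. The operation and the numbers are made up.

```ts
// benchmark.ts - time an operation and compare against a baseline
import { performance } from "node:perf_hooks";

const BASELINE_MS = 250; // the baseline a Ratchet could tighten over time
const RUNS = 20;

// the operation under test: made up for the example
const parseBigPayload = () =>
  JSON.parse(JSON.stringify({ items: Array(50_000).fill({ id: 1 }) }));

const start = performance.now();
for (let i = 0; i < RUNS; i++) parseBigPayload();
const avgMs = (performance.now() - start) / RUNS;

if (avgMs > BASELINE_MS) {
  console.error(`benchmark regressed: ${avgMs.toFixed(1)}ms > ${BASELINE_MS}ms baseline`);
  process.exit(1);
}
console.log(`benchmark ok: ${avgMs.toFixed(1)}ms`);
```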

Load tests

Proof that a SUT that should run concurrent tasks — can.

keep it all up

The term load test is often used interchangeably with benchmark, which is not such a big error: load tests are a special case of benchmarks that deal specifically with scale, and often enough with concurrency.

Wasteful code may abuse runtime resources to perform its task: memory, CPU, disk, network, etc. Functional tests usually run with rather limited concurrency so that the root cause of a failure can be pinpointed, and that is exactly what prevents them from detecting failures in handling load.

Since load failures usually happen under harsh and complex conditions, with many things happening in parallel, load tests give a definitive but general answer rather than pinpoint the detrimental points of failure.
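A naive sketch of that definitive-but-general answer: fire a batch of concurrent requests and judge the error rate of the batch as a whole. The URL and the thresholds are placeholders, and dedicated tools (e.g. k6, artillery) do this far better.

```ts
// load-test.ts - fire concurrent requests and judge the batch as a whole
// assumes Node 18+ (global fetch) and an ES module context (top-level await)
const TARGET = "http://localhost:3000/health"; // placeholder URL
const CONCURRENCY = 200;
const MAX_ERROR_RATE = 0.01; // allow up to 1% failures

const results = await Promise.allSettled(
  Array.from({ length: CONCURRENCY }, () => fetch(TARGET))
);

const failures = results.filter(
  (r) => r.status === "rejected" || !r.value.ok
).length;

const errorRate = failures / CONCURRENCY;
if (errorRate > MAX_ERROR_RATE) {
  console.error(
    `load test failed: ${(errorRate * 100).toFixed(1)}% errors under ${CONCURRENCY} concurrent requests`
  );
  process.exit(1);
}
```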

Stress Test

A stress test is explicitly interested in the range of conditions known to bring the SUT to stress.
Systems that are designed to face stress know how to detect that they are under stress and can sacrifice some of their function in order to stay alive: it’s better for a stressed system to respond to a request with “sorry, can’t help you now” or “I’ve written your request down, you may check on it later” than to die processing the request. Stress tests are designed to show that the system knows to do just that.

Data leak test

Every piece of software communicates via the messages it sends to the outside world. This test gathers all the messages from the SUT and makes sure they do not leak any sensitive data.

This is true for views the SUT provides as part of its protocol as well as so-called side effects such as logs and metrics.

The test will fail if sensitive data is found in any of these messages.
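A sketch of the assertion at the heart of such a test: capture the SUT’s output (here, messages handed in as an array of strings) and scan it for sensitive-looking patterns. The patterns are illustrative only.

```ts
// leak-check.ts - assert that captured output does not contain sensitive-looking data
const leakPatterns: Record<string, RegExp> = {
  "email address": /[\w.+-]+@[\w-]+\.[\w.]+/,
  "bearer token": /Bearer\s+[A-Za-z0-9\-._~+\/]+=*/,
  "credit-card-like number": /\b(?:\d[ -]?){13,16}\b/,
};

export function assertNoLeaks(messages: string[]): void {
  const leaks = messages.flatMap((msg, i) =>
    Object.entries(leakPatterns)
      .filter(([, re]) => re.test(msg))
      .map(([name]) => `message #${i + 1} looks like it contains a ${name}`)
  );
  if (leaks.length) throw new Error(`data leak detected:\n${leaks.join("\n")}`);
}

// usage: collect the log lines, responses and metrics emitted during a test run, then:
// assertNoLeaks(capturedMessages);
```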

Compatibility tests

These tests aim to make sure that the distribution is able to run on all the platforms it’s expected to run on and against any version of peer components.

Microservices know exactly what platform they run on and what versions of peers they expect — it’s usually known where they are shipped to. This is not true for software that is shipped to users: libraries, mobile applications, and to some extent — web-applications.

A compatibility test will try to run the distribution against every platform the distribution officially supports, and against every supported peer version, e.g. a DB driver that should support different versions of the DB it facilitates.

The cross product of all the varying platforms and versions is called the compatibility matrix, which is then tested using test grids that keep available all the different platforms the software is required to perform on.
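The matrix itself is just a cross product. A sketch of generating it; the platform and peer versions are made up.

```ts
// matrix.ts - the compatibility matrix is a cross product of platforms and peer versions
const platforms = ["node@18", "node@20", "node@22"];                // made-up axis
const peerVersions = ["postgres@14", "postgres@15", "postgres@16"]; // made-up axis

type Cell = { platform: string; peer: string };
const matrix: Cell[] = platforms.flatMap((platform) =>
  peerVersions.map((peer) => ({ platform, peer }))
);

// each cell becomes one job on the test grid, running the same suite
for (const { platform, peer } of matrix) {
  console.log(`run the suite on ${platform} against ${peer}`);
}
```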

Test grids can conveniently be outsourced to 3rd-party services (e.g. browserstack, saucelabs).

Conclusion

The final goal is to proof distributions: to prevent risks to SLOs, security threats, and bugs from getting to production.

This list is far from complete. There is no end to the number of tests that can be performed; the budget, however, IS by definition limited.

Once tests are integrated, it takes resources to maintain them and to run them as part of every build. Even if you’ve got capacity and money, it still boils down to running in-build as many tests as an agreeable build time allows, while tests that cannot run as part of the build are run either on a schedule or as part of the delivery process.

Choosing which tests to implement and when to run them is the core design challenge of the CI flows that provide you with waterproof software.
