Do you have the Basics of Functional Tests?
It is the most common test type. As such, almost every coder has learned it by example, yet very few understand it in depth.
The internet is so full of code snippets that too few take the effort to understand more than the minimum it takes to make them work in a project. This has set a very low bar: so much so that even good and talented authors of recommended test tools ship examples that demonstrate how their tool is used, but mislead on how to write a good test.
This post goes through the core concepts of Functional Tests and brings up methods to increase their maintainability.
If you’re interested in the ecosystem they’re a part of — you may read about other test types and why to perform each of them in the previous post about proofing software in CI.
To understand CI/CD — read the post about what is CI/CD.
First — a term: SUT = System Under Test
SUT — stands for System-Under-Test, which is a generalization for Unit-Under-Test, Class-Under-Test, Endpoint-Under-Test, Component-Under-Test, Service-Under-Test, Software-Under-Test etc. — all systems. So yeah. SUT.
What are Functional Tests?
Functional tests check whether a working SUT provides the result expected of it.
Consider a mathematical approach: any SUT works in a context of a set of parameters, and yields a result — like a mathematical function — hence the name. That’s all.
Not related to functional code!!
The discussion about functional code, and the distinction between the returned value and side-effects, is an important and tempting one — but it is not our current discussion!
On the contrary: the goal is a GENERIC definition that will suit ANY code whatsoever, functional or not.
What Functional Tests AREN’T (or, what test runners ARE)?
Purely speaking, Functional Tests are not interested in how much time the SUT took to get to the expected result, how many resources it occupied to do so, or whether it did its job efficiently — only that the result is correct.
That’s purely speaking.
In practice — most test tools come with inherent features outside the domain of functional testing — e.g. timeout mechanisms: but when you ask if it did it in time — you ask how long it took, and that’s a performance measure.
The takeaway is that the type of test is a logical thing and is not an implication of the tool you’re using.
On the contrary — good test tools aspire to let you do everything, if only they could.
Everything? Yes. Functional, performance, coverage, load, bench — anything that runs.
That’s why test runners are, at their core, task managers: your test-cases tree is your task list. You can run them all (like in CI) — or run selective tasks if you must (like when debugging a single test-case).
This has an implication: ideally, test cases must not depend on each other. Each test is a task which should be runnable on its own.
When a case depends on other cases being run before — it’s called case sliding. The upside — in very heavy systems, while you’re at it — you just might check one more pesky thing…
The downside — it doesn’t give a reliable status. When a case fails — all the cases that come after it probably won’t even run, or will run and give misleading failures.
func(params) = expected
No, that’s not an assignment. That’s a mathematical statement.
It evaluates to true or false, which correlates to PASS or FAIL.
The expected result
To be able to tell if a function worked as expected — we need to know what result we’re expecting.
A common misbelief about the expected result is that it is a single value. IMHO, this is a legacy of the early programming languages, which inspired each other in defining functions by the type of their returned value, and thus required their functions to return a single value. Anything outside this returned value ended up being called a side-effect — a cheap cheat and a loophole in terms of testing, designed to let us ignore things the function does, even though they are definitely being done by our code — and therefore must be accounted for and tested.
Another common workaround is returning a single value that is a data-structure. But then you still need to verify every part of that structure you’re interested in; thus, a data-structure is a Set of results — just like the set made of the returned value and any other values assigned to globals, delivered by reference, or passed as a message by the SUT.
The business-level calls them objectives
Now that we have acknowledged that a result can consist of several values, we’ll go up to the business level: the expected result often includes a few objectives.
Consider the following example:
A successful registration means:
- a new user is created — with the correct fields
- an email has been sent — with the expected message
- a log line has been produced — with the expected information
- a confirmation message has been displayed — with correct data
The key takeaway is that ANY effect that is caused by the SUT may be a part of the expected result.
BTW — effects that are applied and aren’t required are likely to be a waste!
The causality of parameters
Causality — describes a relation between cause and effect. Purely speaking, parameters propagate through the SUT and together they cause a result.
Parameters — in their wider sense: can be an input, an inner state, or an outer state.
- add(a: Number, b: Number): Number — takes input
- count(): Number — relies on inner state
- fetchNext(): Entity — talks to an external system
Obviously, SUT can work with any combination of the three.
Some parameters are totally in our control, like b of the add. Some are not in our hands — like the current time, which is constantly moving, or an answer from an external system that is not in our control.
If we cannot set them ourselves — we should be able to sample them. When the sampling itself incurs a change of state — the test code must account for that too.
Identify and account for ALL the parameters affecting the expected result.
The AAA Method (Arrange, Act, Assert)
This is a method that comes to increase maintainability.
When something goes wrong in the test — how easily can you spot the culprit?
It directs code-organization around the following test-concerns:
Arrange
- This stage organizes all the parts the SUT comes into contact with — finds their state or sets it explicitly.
- As such, it includes constructors and factory calls, preparing mocks, laying traps, and/or IO operations to remote systems.
- We may not succeed in doing just that — and the test-case will fail.
A failure here means the test code and/or any of the helpers it uses are broken: The test-code did not even get to interact with the SUT.
Act
- This is the part where we interact with the SUT to operate it and gather all the results.
- This will ideally be a single API call: a method call, the firing of an event, the sending of a message, or even a network request.
In some cases, there will be more data-gathering IO operations right after.
- We may not succeed in performing the action at all — and the test will fail.
A failure here will mostly mean that the SUT is not resilient or is unstable. When the test-code does not communicate well with the SUT — we expect the SUT to communicate it well in the errors it throws.
Assert
- This is the part where we go over every value in the gathered results and make sure it is what it is expected to be.
All values have been gathered — now we validate them.
- It runs completely synchronously, entirely in memory.
- This part is the Purpose of the test: This is where we validate that the SUT is doing what it’s expected to do — that everything that should happen did, and that no undesired result is left in the world the SUT lives in.
- Ideally, each assertion will produce in the report a single title.
Obviously, there are shortcuts — like assertions that compare an entire data-structure.
A failure here means the SUT does not meet its spec.
The AAA is described in many sources on the web.
Let’s add to that a small contribution:
Oh, I did not invent this practice — Arrange + Act together are known in textbooks as setup, and Aftermath is known as teardown. I just chose a name that starts with A to go with the other AAAs.
Aftermath
- This is where we make sure that the test suite itself did not leave any undesired results on the world it lives in.
- It’s made of untying mocks and traps, destructors, memory deallocations, and/or IO operations to remote systems.
A failure here is a bug in the Test-Code.
This stage sometimes seems to be neglected, especially when the test-suite runs in complete isolation with databases that are spun-up and destroyed at the end of the entire suite.
This does not mean this concern is not there — it’s just moved from the test-cases code to the orchestration level. I mean to get into how to leverage that in a future post.
The key takeaway is the distinction between the stages. When a test fails — the stage that fails tells us what should be worked out.
We communicate it to future maintainers by organizing the code into the 4 test concerns and externalizing it all the way to the test report.
The Trinity of Tests
This trinity does not deal with divinity, but makes a test suite self-explanatory to increase its maintainability.
The SUT
Often it’s a part of the production-code, but it can be a test-helper or the test itself.
Here’s a mind-bender: when you measure coverage, they change roles: your SUT is your test suite — you want to know if it covers you well — and the production code gets to be in the role of the Test-Code...
The Test Code
This is the element that interacts with the SUT and measures its effects. This includes any test-helpers, test data-fixtures, and the implementation of the test-cases with their setup/teardown and their assertions.
The Spec
This part was suppressed from tests for too long. Systems like JUnit and the rest of the (X)Unit family originally let coders sneak the specs into the names of test-methods and classes — which was very limiting, and the rest was either truncated and lost or, in the better cases, snuck into comments and was not apparent in reports. Although we have evolved since, the entire industry still bears the scars of this past.
Ok, nice. How is this useful?
Often tests are very clear once written, but become gotcha-full head-scratchers even for their author when she meets them in the future.
1 — It’s all about what the test communicates to its readers
When a coder comes across a failing test — she needs to exercise judgment. When the intent of the test is not clear, the coder has to speculate too much in order to judge between the test and the SUT:
- In the midst of bug hunt: is the bug in the SUT or in the Test-Code?
- Is it a matter of evolution? Did the SUT evolve and the test did not, and now the test expresses a requirement that is no longer relevant? Or does the SUT now break a spec?
2 — The thing is: Trinities do not divide in two
The spec helps to judge between the SUT and the Test-Code. Usually the spec agrees with one, indicating that the other should adhere. But I’ve seen cases where the SUT and the Test-Code agree, and the title is wrong (a poor copy-paste).
Putting it all together
The bad example — in the following gist:
The better example — in the following gist:
Oi, the verbosity
It is not much about the verbosity as it is about the intent to leave clues for all your future readers:
- The reader of the report in the CI output
- The reader of the test-report ran in the IDE
- The reader of the coverage report
- The reader of the code itself
Please note: the specs appear in all of them, while the code appears only in the last. Make sure your breadcrumbs get to ALL of them.
I confess that when the entire test-case is for a single assertion, even I am tempted to do the entire four A’s in a single it(..). But I do make sure to leave enough info about what I mean the test to do as part of what context.
The takeaway is — whatever titles you write, make sure that they explain your intent and are unambiguous…
This is worth a post of its own, but my conscience will not be clean if I do not mention it.
Often a set of test-cases require the same setup and teardown. Given the recommended test verbosity demonstrated above — a naïve implementation will result in some copy-pastes with poor maintainability.
The Wise packs the repetitive code into test-helpers that assist in these repeating setup / teardown tasks. The Even Wiser features a helper API that manages a complete test-context with setup and teardown hooks, providing the test-code with a context ready for the Assert stage.
A test-helper becomes a test-factory when it features a single API that expects arguments that declare not only how to perform any required setup and teardown of the test context, but also what has to be asserted as part of what spec. Calling it generates a complete test-case that appears in the report with readable titles.
For a simplistic example — pretend the language did not provide you with indexOf and you had to implement it yourself.
Now consider the following gist:
Do not be afraid to Delete
Once you get proficient at producing tests, when specs change it often becomes easier to write new tests instead of refactoring existing ones.
Given that you kept your test-cases isolated (no case sliding), kept your AAA(A) organized, and placed reusable code in appropriate helpers, writing a case should not take much time — it is a matter of declaring values.
It is hard to delete a case when you like its aesthetics. But if it has lost relevance and/or does not add information or coverage — get rid of it!
The goal of a test-suite is to affirm that our software works as expected. However, beyond this desired affirmation, when it is denied — we want to know what went wrong and what must be fixed.
The first level of defense is pointing out what went wrong by organizing our test code into recognizable segments, whose goals are defined by the AAA(A) method.
To complement this, we bring the spec into the test and communicate the goals of each test-case in the suite, so that when they fail we know what the failure points at without needing to read the test. And when they are refactored — we know to preserve these goals and maintain coverage.