Lean QA (aka QA Ops)
With the advent of DevOps, developers have been working more and more in Operations and Infrastructure. Testers, however, haven’t.
Thus far, the testing personnel have been mostly or wholly assigned to application testing work. As SOFTWARE testers, we have only worked on software — and then mostly only on application software.
I pose the questions: What about infrastructure as code? Should that not be explicitly tested?
And: if Testers are meant to be testing the system, why then have they not explicitly been testing the whole system, infrastructure included?
I am going to make a case here for including QA in Operations and Infrastructure, by clarifying how I see QA fitting into the DevOps world.
Beware the cargo cult
DevOps is meant to be a partnership between all disciplines to achieve common goals, which ultimately means working software in a production system, delivering value to customers. The term ‘DevOps’ means development teams working with Operations teams, not just individuals in developer and operations roles working together. The QA role is considered to be part of the Development team.
As such, QAs are and have always been members of DevOps. Some, however, have excluded QA from infrastructure and operations conversations, and relegated them to applications-only conversations.
While DevOps is also about tools and automation, it cannot succeed without the right culture of collaboration. That collaboration cannot exist when QAs are excluded from operational conversations.
When this kind of selective exclusion is permitted but still called ‘DevOps’, it is in fact a cargo cult: an imitation and approximation of the real thing, made because the real reasons WHY have not been fully grasped.
Project churn, conflict, friction, slow response, unreliable and unstable systems, needless expenses, lengthy handovers, defects, time delays, failed projects, manual overheads and many similar issues are all waste.
‘Agile’ and ‘DevOps’ were conceived fundamentally to reduce waste. (I generally prefer not to use these capitalised nouns as they’ve become overloaded, so use them here cautiously).
“Anything that does not create value for a customer is waste. A part that is sitting around waiting to be used is waste. Making something that is not immediately needed is waste. Motion is waste. Transportation is waste. Waiting is waste. Any extra processing steps are waste. And of course defects are waste.”
“The way to reduce the impact of defects is to find them as soon as they occur. Thus, the way to reduce waste due to defects is to test immediately, integrate often, and release to production as soon as possible.”
– Mary Poppendieck, Lean Software Development (2014)
Until now, testers have not explicitly tested the infrastructure, even when they have been ‘system testing’. With the cloudification of infrastructure, it is more regularly feasible to create, maintain and test on production-like systems, and in production itself. That makes it possible to test the individual components as well as the system itself, which includes infrastructure.
The ‘whole system’ includes the infrastructure.
In the image above, the roots of the flower might be the infrastructure: that which might be ‘invisible’ but holds up and makes the entire flower possible. On their own, the roots are valueless. They not only need to be part of the system to matter, but are also as important as the pretty bits above ground. (And on their own, the petals would never have come into being.)
What? Testing in production? With good judgement, yes indeed: QA in Production. Production is, in our business, the ultimate system.
QAs are not there to ‘verify requirements’ or just ‘to check UIs’
Automated tests (or ‘checks’) can do that, and they can be written by developers and testers alike.
Your testing personnel are there to add intelligence, curiosity, and out-of-the-box thinking and exploration to the mix, to build in quality, and to explore a system before, during and after development.
Again, this should include the whole system including infrastructure.
What does a QA bring to Ops conversations?
The role of QA is there to question, to explore and to assist with critical thinking. What could a QA help question? What requires critical thinking? How could waste be kept to a minimum?
In infrastructure, questions could be along the lines of:
- Is there an interface/infrastructure/architecture diagram?
- What CPU size are we speccing for the EC2 instance, and why?
- Can we reuse another EC2 instance?
- What happens when that file gets there?
- How critical is that data and does it need backing up?
- What are these strange entries in the log?
- Is there some way we can uncouple these two services?
- What is that spike in load?
- How do we know it works?
- What does ‘done’ look like?
My proposition is to apply to infrastructure what already works for us in application testing.
The test pyramid can be applied, using for instance the following tests, which we could (in my opinion, should) TDD:
- Unit (highest quantity, lowest value): checking that the web server is up, “ps aux | grep nginx”
- Service/Integration (high value): Did that service react appropriately to input from another service (not necessarily a mock!)?
- E2E (low quantity, highest value): Key metrics and monitoring
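The unit-level check in the list above can be sketched in Python. This is a minimal sketch rather than a definitive implementation: the function names `process_is_listed` and `nginx_is_running` are hypothetical, and it assumes a Unix-like host whose `ps aux` output uses the usual 11-column layout with the command in the last column. Matching on the command column, instead of grepping the whole line, avoids the familiar pitfall where `ps aux | grep nginx` also matches the grep process itself.

```python
import os
import subprocess


def process_is_listed(ps_output: str, name: str) -> bool:
    """Return True if a process whose executable is `name` appears in `ps aux` output."""
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # the 11th column holds the full command
        if len(fields) < 11:
            continue  # not a process row in the assumed layout
        first_token = fields[10].split()[0]
        # Handles both "/usr/sbin/nginx -g ..." and "nginx: master process ..."
        executable = os.path.basename(first_token).rstrip(":")
        if executable == name:
            return True
    return False


def nginx_is_running() -> bool:
    """Run `ps aux` on the local host and look for nginx (hypothetical helper)."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return process_is_listed(result.stdout, "nginx")
```

Keeping the parsing separate from the `ps` call means the check itself can be tested against canned output, without needing a running web server.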
Test Driven Development (and Test Driven Infrastructure)
Test Driven Infrastructure (TDI) is a thing: https://spin.atomicobject.com/2014/10/28/test-driven-infrastructure-tdi/
For each layer of the Test Pyramid, Test Driven Development (TDD) or TDI could be applied.
I would opt for minimalism in tests, not writing too many but only writing the ones that really matter.
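One way to picture test-driven infrastructure is to write down the expected state first, watch the check fail against an unprovisioned host, and then make it pass. The following is a minimal sketch in Python, assuming the observed state has already been gathered into a plain dictionary. The names `EXPECTED` and `failed_expectations`, and the keys used, are hypothetical and not taken from any particular tool.

```python
from typing import Any, Dict, List

# Written first, before any provisioning exists (hypothetical expectations).
EXPECTED: Dict[str, Any] = {
    "nginx_installed": True,
    "port_443_open": True,
    "tls_min_version": "1.2",
}


def failed_expectations(
    expected: Dict[str, Any], observed: Dict[str, Any]
) -> List[str]:
    """Return a readable message for every expectation the observed state misses."""
    failures = []
    for key, want in expected.items():
        if key not in observed:
            failures.append(f"{key}: expected {want!r}, but nothing was observed")
        elif observed[key] != want:
            failures.append(f"{key}: expected {want!r}, observed {observed[key]!r}")
    return failures


# Red: nothing has been provisioned yet, so every expectation fails.
before = failed_expectations(EXPECTED, {})

# Green: after provisioning, the observed state meets every expectation.
after = failed_expectations(
    EXPECTED,
    {"nginx_installed": True, "port_443_open": True, "tls_min_version": "1.2"},
)
```

The expectation table is deliberately short, in keeping with writing only the tests that really matter.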
In the case of the highest value tests, such as monitoring and metrics, display these results on a low-noise (only key, targeted information) dashboard for the whole team to see.
Include the quality requirements (aka non-functional requirements) in the test-early approach. These can and should be tested early and continuously, and they include security, performance, load and so on, all of which are directly influenced by the marriage between infrastructure and application code.
What an application does is determined mostly by functionality. How it does it is determined by its quality (usability, scalability, resilience, security, etc. — all the ‘-ilities’).
It’s the ‘how’ that I am suggesting we test early too, not just the ‘what’.
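A quality requirement such as performance can be written as an executable check from the first iteration. The sketch below assumes a hypothetical budget of 300 ms at the 95th percentile and uses the nearest-rank percentile method. The function names and the numbers are illustrative only.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples to evaluate")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_latency_budget(
    samples_ms: Sequence[float], budget_ms: float = 300.0, pct: float = 95.0
) -> bool:
    """Return True if the chosen percentile of the latency samples is within budget."""
    return percentile(samples_ms, pct) <= budget_ms
```

A check like this can run against a production-like environment on every build, so a change to either the infrastructure or the application code that pushes latency over the budget is caught early.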
Metrics and monitoring
Define key metrics and monitoring early and use these as the basis for your highest value (low quantity) tests. Define success and value. Measure these and monitor the system, from as early as possible.
Display the results of these metrics on a low-noise dashboard that the team looks at and cares about.
Define early, measure, and make key metrics a (low-noise) focal point for the team.
That means building in the tools to measure and monitor as early as possible too. If you are applying TDD, these metrics will act as the skeleton upon which the meat will take shape. They will fail at first, then pass as you meet the criteria, and become continuous feedback from then on.
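The idea of metrics that fail first and pass once the criteria are met can be sketched as a small threshold table. Everything named here (`KeyMetric`, `KEY_METRICS`, `evaluate`, the metric names and the limits) is a hypothetical illustration. A missing measurement is deliberately treated as a failure, so that the skeleton starts out red.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class KeyMetric:
    name: str
    limit: float
    higher_is_better: bool

    def passes(self, value: Optional[float]) -> bool:
        """A metric that has not been measured yet counts as failing."""
        if value is None:
            return False
        if self.higher_is_better:
            return value >= self.limit
        return value <= self.limit


# A deliberately short list: only the metrics the team agreed to care about.
KEY_METRICS = (
    KeyMetric("availability_percent", 99.9, higher_is_better=True),
    KeyMetric("error_rate_percent", 1.0, higher_is_better=False),
)


def evaluate(measurements: Mapping[str, float]) -> Dict[str, bool]:
    """Map each key metric name to pass (True) or fail (False)."""
    return {m.name: m.passes(measurements.get(m.name)) for m in KEY_METRICS}
```

The resulting pass/fail mapping is the kind of small, targeted result that suits a low-noise dashboard.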
For every actionable bug we find, we not only repair the bug but we also help prevent its recurrence. (By actionable, I mean something we care about and want to fix.)
The reason why the bug occurred is determined and then addressed through a bug fix and a new test. If the test can be automated, it should be.
To reduce churn and manual overhead, automate as much as possible, from tests to deployments to monitoring. Anything repeatable should be automated, if at all feasible. Use your humans for what they’re good at: intelligence and unpredictable decisions and tasks.
What about existing systems? How do you apply TDD/TDI and the test pyramid to systems that have, for instance, had no tests and monitoring, or badly implemented ones, up to this point?
An existing system likely cannot, and shouldn’t, be completely retrofitted with new tests.
But with every tweak made to improve or alter it, apply test-early and the test pyramid, thus slowly and incrementally improving quality.
Lean is about reducing waste in pursuit of value, and thus quality.
Not doing QA on the entire system as early and as frequently as possible leads to waste, because the work then has to be done later and bugs are found later.
The reduction of waste (rework, defects, friction) by holistically testing the whole system, early and continuously (even into production), including measuring what really matters, and incrementally improving thereupon, is lean QA.