Game on!
After a successful first year of operating a bespoke warehouse management system (WMS), we had many thoughts on how to improve our processes and software.
However, a suggestion from an external technical architect, who reviewed our entire system, led the team to re-examine some core security principles.
Instead of a “dry” review or audit of our processes and software, it was suggested that we should run a ‘disaster day’ or role-play exercise. This is becoming common in the cloud infrastructure space and is known as a game day.
Historical aside: whence the game day?
The idea of game days originates in disaster planning and recovery, something many IT operations staff will be familiar with. However, the game day concept has crystallised around cloud infrastructure in particular, because individual components and hosts are inherently unreliable when running such large-scale distributed systems.
The main drivers of game days are the usual suspects in the cloud space: Amazon, Netflix and Google. The acknowledged founder of the concept is Jesse Robbins, who led the initiatives at Amazon and has had a large impact on the devops space since. Of course, no discussion of cloud and disaster planning would be complete without calling out the monkey in the room.
Chaos Monkey is used by many companies running distributed systems in the cloud to help engineers understand the inherent unpredictability of operating such systems, and to prevent software being developed under the fallacies of distributed computing.
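To make the idea concrete, here is a minimal, illustrative sketch of the chaos-monkey approach, assuming boto3, an AWS environment and a hypothetical chaos-opt-in tag marking instances that may be terminated; the real Chaos Monkey adds scheduling, opt-in groups and many more safety checks.

```python
# Minimal chaos-monkey-style sketch: terminate one random opted-in instance.
# The tag name, region and lack of safeguards are illustrative only.
import random

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Find running instances that have explicitly opted in to chaos testing.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Terminating {victim} - the surrounding service should survive this")
    ec2.terminate_instances(InstanceIds=[victim])
```

If the service degrades when a single instance disappears, the team has learned something cheaply that a real outage would otherwise have taught them expensively.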
Back to our scenario
Within the team, it was quickly decided that our first game day exercise should focus on a single aspect of system reliability: the security of our system.
The whole enchilada
In comparison to a web system or a typical cloud-hosted SaaS system, our warehouse management system has three distinct architectural layers within which its processes operate:
- Hardware — physical (automated) conveyors, trolleys, totes, forklift trucks
- Software — the high-level WMS encompassing the business rules of the warehouse, and the warehouse control systems (WCS) that drive the automated conveyors in response to commands from the WMS (sketched after this list), and finally
- “Wetware” — operators physically moving stock around the warehouse and informing the WMS of their actions
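Purely as an illustration of how the software layer splits between the WMS and the WCS (every class and field name here is hypothetical, not taken from our actual system), the hand-off might look something like this:

```python
# Illustrative sketch only: a WMS-level command handed to a conveyor WCS.
# Names and structure are invented for the example.
from dataclasses import dataclass


@dataclass
class MoveTote:
    """A WMS instruction: move a tote between warehouse locations."""
    tote_id: str
    from_location: str
    to_location: str


class ConveyorWCS:
    """Warehouse control system driving the automated conveyors."""

    def execute(self, command: MoveTote) -> None:
        # In reality this would be translated into low-level conveyor/PLC
        # signals; here we simply log the intent.
        print(f"Routing tote {command.tote_id}: "
              f"{command.from_location} -> {command.to_location}")


# The WMS applies the business rules, then hands the physical work to the WCS.
ConveyorWCS().execute(
    MoveTote(tote_id="T-1234", from_location="GOODS-IN", to_location="PICK-01")
)
```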
A security-based game day scenario could look to expose flaws in any of these layers. Obviously, as a software guy, I assume my code is perfect and would have looked to the hardware/wetware for the security flaws ;-) but I was overruled by the team.
Initial planning
Unlike our complementary THG Ingenuity tech stack, the warehouse systems do not store customer or billing information, so a standard ‘hack’ would gain a ne’er-do-well little information of resale value.
Our focus therefore shifted from cracking the system to disrupting it and, from a financial perspective, to outright theft of goods from the warehouse.
Without revealing too much about our internal operations, this initial planning phase led us to expose some flaws in our processes that we hadn’t previously investigated — already the game day was paying off and we hadn’t even run the scenario yet!
Beta
After we had planned our scenario, we created some realia to aid its execution. Then, armed with our ‘fakes’ (mock objects, if you will), Josh infiltrated the warehouse and photographed them in situ.
Armed with realia and photos of a ‘security’ situation, we ran through the pre-planned scenario with the senior engineering team and managers.
This beta, or trial run, was extremely helpful in fine-tuning how we would run the scenario, and it surfaced yet more potential flaws in our processes, which we again added to our risk register.
One of the results of the beta session was that the simulated security breach was too complex and should be simplified, to reduce the scope of where the break-in could take place.
The scenario plays out
We divided the entire team into two groups. Each group was given the same initial information and had to work out what was happening and how to react to the situation. The goals were the same for each group:
- What is the root cause?
- How do we limit the operational impact?
- What is missing from your tools/dashboards/access that would allow you to debug and recover from similar situations in the future?
Although the two groups had very different mixes of experience (one had the most experienced process team members, while the other had some of the more experienced software team members but more junior process team members), both discovered the root cause of the security breach in a similar amount of time.
Feedback from the participants was positive, with many of them enjoying the lateral-thinking parts of the exercise.
Results
Our security-focused game day exercises yielded the following results:
- Greater awareness of potential security flaws in our code, processes and physical infrastructure across the whole team
- A number of risks were identified, which we have categorised and are now working to eliminate or mitigate
- An audit of the tools and access the team needs to troubleshoot similar issues identified some missing access for individual team members
- An audit of access rights highlighted some oversights, for example a script that was executing as a particular user rather than as a system user with limited access (see the sketch below)
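As a hedged illustration of the kind of guard that would have caught that last oversight (the account name is hypothetical, not our real service user), a batch script can refuse to start unless it is running as its dedicated, limited system account:

```python
# Illustrative least-privilege guard: abort unless running as the expected
# low-privilege service account. Unix-only (uses the pwd module).
import os
import pwd
import sys

EXPECTED_SERVICE_USER = "wms-batch"  # hypothetical dedicated system account


def assert_service_user() -> None:
    current = pwd.getpwuid(os.geteuid()).pw_name
    if current != EXPECTED_SERVICE_USER:
        sys.exit(
            f"Refusing to run as '{current}'; this job must run as the "
            f"limited '{EXPECTED_SERVICE_USER}' account."
        )


if __name__ == "__main__":
    assert_service_user()
    # ... the actual job would follow here ...
```

Pairing a guard like this with properly scoped permissions for the service account limits the blast radius if the script is ever misused.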
We will plan and hold follow-on exercises with different agendas in the future.