Examining the limits of serverless
Recently the team I work for has been working with serverless technologies. Specifically, we have been implementing a new and simpler method of processing the many millions of records generated when our audiences interact with BBC online services. This provides insight into how people are using BBC services, with the aim of ensuring that everybody gets value from our output. While the pros of the new approaches are compelling, we have also been reminded to consider the limitations of these technologies and how they may impact the security of our services.
Continuous change
During our relatively short time on this project the team has come to the practical realization that the rate of change in cloud platform services is such that the traditional model of writing something well and then maintaining it for years to come no longer holds true. The expectation now is that we will need to constantly reinvent how we do things in order to meet the expectations of our stakeholders, while handling increased volumes and still reducing costs.
Typical industry solutions for big data processing include using MapReduce across a set of servers to batch process the data. This involves creating clusters of machines, which can be non-trivial, not just in building the solution but also in maintaining and managing it through its lifetime. However, with the advent of serverless technologies such as AWS Lambda and Google’s Cloud Functions, new and more cost-effective approaches have become viable.
The positives of these new approaches are many, from simplified architectures to reduced implementation complexity. The big wins for us, though, are cost reduction and the promise of seamless scalability to meet the demands of big news days without having to reserve otherwise unused capacity. To give some context to this last statement, a big news event in the UK or the rest of the world causes a significant uplift in the number of people using our services. For us this is the difference between dealing with several billion daily events and an average of one billion.
To deal with such a range in a traditional context you would need to reserve extra capacity to cope seamlessly with these volumes. The introduction of serverless solutions has made a dramatic difference to the way in which we can cope with the highs and lows of world events, and moves us along our journey towards the goal of real-time processing and analytics.
All that glisters
That said, all that glisters is not gold, and we need to be mindful of the unexpected effects of new ways of doing things. Problems have a tendency to lurk in dark crevices and can threaten not just the performance but also the security of a service. Specifically, we need to consider the new techniques by not just looking at the benefits but also by examining the limits of these new methods. Further, we need to explore what happens if they are pushed beyond those limits.
Before pressing on with an example, I must give credit to Dinis Cruz who has played a key role in helping to shape my thinking on issues such as this. I encourage you to read his blog which has a wealth of security related information.
One such example of where we can run into unexpected issues comes from the temptation to keep throwing serverless functions at problems. Let’s face it, we all like to play with new toys, but not thinking carefully about the effects on the system can be detrimental in unexpected ways.
Consider therefore a system comprising several independent data flows. Each flow is logically separate from the others and follows a similar pattern: data received from an external source triggers serverless functions to post-process it before making it available for analysis.
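To make that concrete, here is a minimal sketch of what one of those post-processing functions might look like, assuming a Kinesis-triggered AWS Lambda written in Python. The record fields and pipeline name are purely illustrative, not our actual implementation.

import base64
import json

def handler(event, context):
    """Decode incoming records, post-process them and hand them on for analysis."""
    processed = []
    for record in event.get("Records", []):
        # Kinesis delivers each payload base64-encoded
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        # Hypothetical post-processing step: tag the record with its pipeline
        item["pipeline"] = "audience-interactions"
        processed.append(item)
    # In a real pipeline the results would be written to a store or stream
    return {"processed": len(processed)}

Each pipeline has its own handful of functions along these lines, triggered as data arrives, which is what makes the scaling feel so effortless.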
In isolation each set of processes seems very simple and can be tuned quite nicely to optimize throughput. As more events occur, additional functions are invoked by the cloud service to scale seamlessly. All good? Well, not quite.
Threat modelling
After detailed security threat modelling of our new architecture we realized that we had introduced a potential new vulnerability into the system.
Despite our taking care to logically separate out each pipeline so that there were no clear dependencies, it turns out that a malicious attack on any individual data flow may actually cause all of the services to fail. Not good.
What we had overlooked was the fact that a service such as AWS Lambda has an account-wide limit on the maximum number of parallel invocations. As such, an attack on one independent pipeline could in theory exhaust the account’s pool of concurrent executions and therefore not just interrupt the targeted service, but starve all services within that account.
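If you are running on AWS you can see this limit for yourself. The following is a small boto3 sketch that reads the account-level concurrency settings; it assumes your credentials and region are already configured.

import boto3

lambda_client = boto3.client("lambda")

settings = lambda_client.get_account_settings()
limits = settings["AccountLimit"]

# The total pool of parallel executions shared by every function in the account,
# and how much of it remains once any reserved concurrency has been carved out.
print("Concurrent executions allowed:", limits["ConcurrentExecutions"])
print("Unreserved concurrency:", limits["UnreservedConcurrentExecutions"])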
We have coined the term ‘cross-pipeline dependency’ to describe the threat. Technically this is not really a new class of problem; systems have always had inter-dependencies. With traditional platforms the limits of a system are much more apparent. With modern cloud service platforms, though, it is frighteningly easy to fall into the trap of assuming near-limitless capabilities.
Threat mitigation
So how do you mitigate this risk? In this case there are several strategies.
Firstly, isolating critical services in separate accounts with individual resource limits is one approach, albeit one that introduces more management overhead.
Another option is to try to prevent the problem in the first place by restricting the number of events that can occur, throttling them upstream. This is possible, but if the upstream system is not in your control it can be awkward.
Alternatively, and with impeccable timing, a new feature of AWS Lambda has recently been released that allows you to throttle concurrent invocations on a per-function basis, and consequently gives you the ability to isolate your services within a single account.
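As a sketch of what that looks like in practice, the boto3 call below reserves a fixed slice of the account’s concurrency for a single function, which both caps that function and guarantees it capacity. The function name and the limit of 100 are illustrative values rather than our production settings.

import boto3

lambda_client = boto3.client("lambda")

# Cap this function at 100 parallel executions; the reservation also stops
# other functions from consuming the capacity set aside for it.
lambda_client.put_function_concurrency(
    FunctionName="audience-pipeline-postprocess",  # hypothetical function name
    ReservedConcurrentExecutions=100,              # cap chosen from observed peaks
)

# The reservation can be removed again with delete_function_concurrency.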
In all cases it is essential to understand how the system performs in practice so you can set the right throttling levels.
Great monitoring
Clearly, with any of the above mitigation strategies it is imperative to understand how the system is actually performing and, even more importantly, to know when the system is misbehaving. Having great monitoring of your service in place is therefore essential.
As a data-focused team, we are big fans of measuring how our services operate. It has become a prerequisite to build an appropriate dashboard as part of each new piece of work.
Understanding what normal is for our services has become part of the daily routine. As the project has matured, we have learned more and more about the relative importance of the different metrics we measure.
We have found that increasingly we treat our dashboards not as something set in stone, but rather as living documents which we modify regularly as we are presented with new and more compelling information about what is really important. For us, good metrics are the ones that tell us something useful about the system; anything else is just noise and can be removed.
Through this data-driven approach we are able to tell at a glance if the system is operating within normal limits, and be alerted when more or less data than expected is being processed.
With this infrastructure in place it is relatively straightforward to focus on what is important when it comes to serverless functions, namely the number of parallel invocations, the frequency of invocations and the average duration of each invocation. By continuously monitoring these attributes it becomes clear how the system behaves throughout the peaks and troughs of the news year.
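As an illustration, a sketch along the following lines pulls those three attributes back from CloudWatch for a single function. The function name is hypothetical, and in practice we surface these metrics on dashboards and alerts rather than printing them.

from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")
function_name = "audience-pipeline-postprocess"  # hypothetical function name
now = datetime.utcnow()

# The three Lambda metrics we watch most closely, with a sensible statistic for each.
for metric, statistic in [
    ("ConcurrentExecutions", "Maximum"),
    ("Invocations", "Sum"),
    ("Duration", "Average"),
]:
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric,
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,  # five-minute buckets
        Statistics=[statistic],
    )
    datapoints = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    latest = datapoints[-1][statistic] if datapoints else None
    print(metric, statistic, latest)

Dropping the FunctionName dimension from the ConcurrentExecutions query gives the account-wide view, which is the number that matters for the cross-pipeline threat described earlier.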
Feeding this information back into the throttling mechanisms mentioned before then helps mitigate the security risk of cross-pipeline contamination while still allowing us to benefit from the simplicity and reliability of the new serverless approach.