It’s all about the topology: what FaaS can learn from stream processing
The Functions-as-a-Service paradigm is steadily progressing, but it won't hit its stride until it learns from established paradigms
Let me say up front: I’m a big fan of the so-called Functions-as-a-Service (FaaS) paradigm in computing. I think that it bears a great deal of promise for developers and I’m eagerly tracking its development in OSS projects like OpenFaaS and cloud products like Google Cloud Functions and AWS Lambda. Years after its inception, however, it still feels like we’re constantly on the cusp of reaping the FaaS harvest and never quite there.
I’m going to venture a guess as to why that is: the FaaS paradigm currently has a fundamental abstraction problem that’s limiting its development and preventing it from flourishing. FaaS remains caught in thinking about functions as atomic units. Sanitize your inputs with this function! Make this function bridge your data layer and your API gateway!
FaaS currently has a fundamental abstraction problem that’s limiting its development and preventing it from flourishing
Single-use functions are fine, but I think that FaaS will really take off when it enables developers to easily create function topologies encapsulating complex, end-to-end chains of processing logic. In other words, FaaS needs to provide abstractions above individual functions to truly live up to its promise.
Topology-based stream processing systems
Fortunately, I think that the way forward for FaaS is already here in the form of stream processing systems like Apache Storm, Apache Heron, and Apache Beam. Those systems convert processing logic of any desired degree of complexity into physical processing topologies (or pipelines in Beam) that can be run and managed as single units.
The best known of these systems is probably Storm, which was initially developed at BackType (later acquired by Twitter) and remains widely used. Heron is the successor to Storm, providing backward compatibility with Storm topologies along with a broad array of improvements, such as an updated topology creation API. Apache Beam is an analogous system that serves as the programming model behind Google Cloud Platform’s Dataflow product.
Deep dive: Apache Heron
I’ll provide a bit of a deep dive into Heron because it’s the system I’m most familiar with (I also think it’s the best existing system but the reasons needn’t detain us here). I created the website, wrote a bunch of the docs, and worked on some of the APIs (such as the Streamlet API).
Heron converts your stream processing code (Java or Python), called a topology, into a logical plan of processing steps. That logical plan is in turn converted into a physical plan describing which parts of the topology will run where. Heron then uses one of a variety of schedulers (Mesos, Kubernetes, Nomad, YARN, and others) to run and manage the topology (and in a fundamentally multi-tenant fashion to boot).
Heron, like Storm and Beam, is a system for translating declarative processing logic in your code, written using special libraries, into easily deployable and manageable processing systems of any degree of complexity. If your processing logic can be modeled as a directed acyclic graph (DAG) then these systems can run it at just about any scale you can imagine (unless you think that the Twitter timeline is small potatoes, in which case you’re probably big enough to roll your own system).
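To make that concrete, here’s a minimal sketch of what a Streamlet API topology looks like. Treat it as illustrative rather than copy-paste ready: package names and exact signatures have shifted across Heron releases (older versions lived under com.twitter.heron.streamlet), so the details here are assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.heron.streamlet.Builder;
import org.apache.heron.streamlet.Config;
import org.apache.heron.streamlet.Runner;

public final class RandomIntegersTopology {
  public static void main(String[] args) {
    Builder builder = Builder.newBuilder();

    // Declare the DAG: a source of random integers, a filter, a transform,
    // and a logging sink. No routing or serialization code in sight.
    builder.newSource(() -> ThreadLocalRandom.current().nextInt(1, 100))
        .filter(i -> i % 2 == 0)
        .map(i -> i * 10)
        .log();

    // Heron turns this declaration into a logical plan, then a physical plan,
    // and hands the result to whichever scheduler you've configured.
    new Runner().run("random-integers-topology", Config.defaultConfig(), builder);
  }
}
```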
What FaaS needs to learn
One of the core benefits of Heron and kindred systems is that the entire processing topology is the basic unit of abstraction. The entire end-to-end pipeline is something that you can declare in code; the plumbing that wires those bits of logic together is abstracted away for you.
And that’s what FaaS is supposed to do!! Right? It’s supposed to free us from the plumbing work. But every time I sit down to write FaaS functions, regardless of the platform, I spend most of my time thinking about the routing logic between functions, logging and debuggability, and, most aggravatingly, type serialization and deserialization.
FaaS is supposed to free us from plumbing work; instead, it’s created new and different, but equally frustrating, plumbing work
But let’s be real: most of this should really be handled at compile time. If there’s a type mismatch in a Streamlet API topology for Heron, for example, my topology simply won’t compile. If one step in my topology emits a string but the next processing step expects an integer, I’ll know before I even try to debug the thing in motion.
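Sticking with the hedged Streamlet-style sketch from above, here’s roughly how that mismatch surfaces: the moment one stage’s output type stops matching the next stage’s expected input, the Java compiler refuses to build the topology at all.

```java
// Continuing the builder from the earlier sketch: a Streamlet<Integer> source...
builder.newSource(() -> ThreadLocalRandom.current().nextInt(100))
    // ...mapped into a Streamlet<String>...
    .map(i -> "value-" + i)
    // ...so this next stage receives a String. Comparing it to an int is a
    // compile-time error; the topology never builds, let alone deploys.
    .filter(s -> s > 10);  // does not compile
```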
There’s still room for thinking about functions atomically, of course. Complex topologies aren’t always the answer, there’s nothing intrinsically wrong with writing one-off functions that fill in important gaps, and systems like Heron are major overkill for less demanding use cases. But whenever you try to move beyond single functions, FaaS ends up producing a lot of unnecessary frustration. It shouldn’t be that way; FaaS is supposed to be a leap forward, not a 1:1 trade of one set of frustrations for another.
We’re missing libraries, not systems
The good news: I don’t think that the currently existing FaaS systems need to be rewritten or recreated or anything like that. I think that they’re probably already amenable to the DAG/topology approach. What’s missing is libraries that enable us to write out complex processing graphs and then automagically translate those graphs into a topology of functions. At that point, the existing magic of FaaS can take over: function auto-scaling, “serverless” management, and so on.
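To be explicit about what I mean, here’s a purely hypothetical sketch. None of these classes or methods exist in any FaaS platform or library today; the names are invented to show the shape of the thing, a Heron-style builder whose output is a graph of managed functions rather than a Heron topology.

```java
// Hypothetical API: every type and method below is invented for illustration.
FunctionGraph graph = FunctionGraph.newBuilder()
    .source("order-events", OrderEvent.class)          // e.g. an API gateway or queue trigger
    .map("sanitize", OrderEvent::sanitized)            // deployed as one function
    .filter("drop-empty", order -> !order.isEmpty())    // deployed as another
    .sink("persist", DatastoreSink.table("orders"))     // and a third
    .build();

// The library, not the developer, would generate the per-function deployment
// artifacts, the routing between them, and the (de)serialization glue, then
// hand the result to the existing FaaS machinery for scaling and management.
graph.deploy("order-ingest");
```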
So if you’re working in the FaaS space, come talk to me about stream processing. The FaaS paradigm is at an impasse and the way forward is already here and has been for years.