Serverless computing and data engineering
Serverless computing is set to be a hot topic in 2018. In case you don't know what it is: a simple definition would be a program, a function or set of functions, that runs in a standalone environment. "Serverless" indicates that, from the developer's perspective, there is no server, only a piece of code to run.
Serverless computing still requires servers, hence it’s a misnomer. The name “serverless computing” is used because the server management and capacity planning decisions are completely hidden from the developer or operator. Serverless code can be used in conjunction with code deployed in traditional styles, such as microservices. Alternatively, applications can be written to be purely serverless and use no provisioned services at all. [https://en.wikipedia.org/wiki/Serverless_computing]
Function as a Service (FaaS) has been around for more than a decade, yet it was only in 2014 that a public cloud vendor offered a serverless computing service: AWS Lambda. Today you can choose the service from a variety of vendors: Microsoft Azure Functions, IBM Cloud Functions, Google Cloud Functions, and others. The relevance of the FaaS model to our contemporary demands on data processing and distributed computing is driving interesting evolutions, such as the growing diversity of programming languages in which you can deploy serverless services.
Instead of building a general list of advantages and disadvantages that you can easily find on Wikipedia, perhaps it is more fruitful to list some key points from my experience in the serverless computing realm, mainly using AWS Lambda.
Since I started using serverless computing within my data pipelines (and recently for serving websites), I can't envision architectures that would not benefit from it. Whether mining and transforming data, performing distributed requests, or executing simple tasks such as transferring data, I keep getting very satisfactory results with the FaaS model.
One crucial design concept to keep in mind when developing ETL processes with serverless computing is to restrict the scope of each function to a data-nuclear level: one row, one document, one event. The functional service will give you a lot of headaches if you need to perform transformations, or even parsing, that require a "reduce" step or an overall perspective on the data. For that, something like Spark will be a much better fit. In the serverless world, it is better to think of distribution as a parallel computing effort.
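A minimal sketch of what "data-nuclear scope" looks like in practice: a handler that only ever sees one document, so it can be fanned out freely. The event shape (a dict with a "doc" key) and the field names are illustrative assumptions, not a fixed AWS contract.

```python
import json

def handler(event, context=None):
    """Transform a single document: a map-only unit of work.

    The event shape (a dict carrying one "doc") is an assumption
    for illustration; the point is that the function never needs
    to see the rest of the bucket.
    """
    doc = event["doc"]
    # Keep the scope "data-nuclear": this function only knows one doc.
    transformed = {
        "id": doc["id"],
        "value_cents": int(round(doc["value"] * 100)),  # normalize units
        "source": doc.get("source", "unknown"),
    }
    return {"statusCode": 200, "body": json.dumps(transformed)}
```

Because the handler carries no cross-document state, a thousand copies of it can run at once without coordination, which is exactly the shape serverless platforms reward.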
Imagine you have data rows or documents indexed daily, and your data flow from the primary data API to the first persistence point saves the data in monthly buckets.
Now, to make things a little more complicated, let's say that your serverless service fetches the monthly buckets from this first persistence service, then enriches, transforms, and sends the data to the analytics DB. Unfortunately, the enrichment process calls a free third-party API that takes 5 seconds to respond for each document. With roughly one document per day in a monthly bucket, you have a process that takes at least 150 seconds in each run, which in the serverless billing model probably does not represent a good investment. Moreover, the distributed attribute of serverless computing is not being put to its optimal use: you could have 1,000 processes running at the same time, but you are running just a few.
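The gap between the bucket-at-a-time and doc-at-a-time shapes can be sketched locally. In this simulation, a short sleep stands in for the 5-second third-party call (shrunk to milliseconds), and a thread pool stands in for concurrent function invocations; the numbers and names are illustrative, not measurements of any real service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

API_DELAY = 0.05  # stand-in for the 5-second third-party call

def enrich(doc):
    # Simulated slow enrichment; in the real pipeline this would
    # hit the third-party API.
    time.sleep(API_DELAY)
    return {**doc, "enriched": True}

docs = [{"id": i} for i in range(30)]  # ~one doc per day in a monthly bucket

# Sequential: one function walks the whole bucket (the 150-second shape).
start = time.perf_counter()
sequential = [enrich(d) for d in docs]
seq_elapsed = time.perf_counter() - start

# Fan-out: one worker per doc, as if each doc triggered its own function.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(docs)) as pool:
    fanned_out = list(pool.map(enrich, docs))
fan_elapsed = time.perf_counter() - start

print(f"sequential: {seq_elapsed:.2f}s, fan-out: {fan_elapsed:.2f}s")
```

The sequential elapsed time grows with the bucket size, while the fan-out time stays close to a single API call, which is the "parallel computing effort" framing from above.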
If you have essential features that relate to whole buckets, as in the monthly case, reducing the process scope to a single row or document means more networking and data duplication. To be honest, in the circumstances where I faced this kind of challenge, accepting the data duplication and the extra network calls was still the better solution, both in performance and in cost.
Serverless computing is a very sexy and easy solution for pipeline integration and simple daily tasks. Imagine you need to move a significant amount of data around, enough that you would have to consider parallel processes. The Function-as-a-Service model will let you upload the data-transfer snippet and trigger a good number of processes in a matter of minutes. If you have a piece of code that requires complex dependencies, it is also possible, but the deployment will be more painful.
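Triggering "a good number of processes in a matter of minutes" can look roughly like the sketch below, using AWS Lambda's asynchronous invoke (`InvocationType="Event"`, a real Lambda API option). The function name, payload shape, and `fan_out` helper are assumptions for illustration; the `dry_run` flag just lets you inspect the invocations before sending anything.

```python
def fan_out(function_name, payloads, dry_run=True):
    """Prepare (and optionally send) one async Lambda invocation per payload.

    `function_name` and the payload shape are placeholders;
    InvocationType="Event" is what makes the call fire-and-forget.
    """
    import json

    requests = [
        {
            "FunctionName": function_name,
            "InvocationType": "Event",  # asynchronous: do not wait for a result
            "Payload": json.dumps(payload),
        }
        for payload in payloads
    ]
    if not dry_run:
        # boto3 is only needed for the real call, so import it lazily.
        import boto3
        client = boto3.client("lambda")
        for req in requests:
            client.invoke(**req)
    return requests

# Build (but do not send) one invocation per chunk of data to transfer.
chunks = [{"bucket": "raw-data", "key": f"2018/01/{day:02d}.json"}
          for day in range(1, 31)]
requests = fan_out("transfer-worker", chunks, dry_run=True)
print(len(requests), "invocations prepared")
```

Each invocation carries only the pointer to its chunk, so the workers stay independent and the whole transfer parallelizes trivially.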
Programs with complex dependency requirements are harder to develop and deploy, given the service's options for code transfer and environment setup. In this case, automating the dependency setup and the code upload is a must, so that the repetition does not grow into a nightmare.
Finally, serverless computing is close to being a wildcard solution for pipeline architectures and for a significant number of daily tasks. Though, like everything in life, and especially in data engineering, there is no magical solution.