A Serverless Data Architecture — Is it possible?
If you keep up with application development at all, you have no doubt heard that all applications should be built on a serverless architecture. Cloud platforms are offering more and more services that let application developers move further away from the traditional web-server-based application. For example, Amazon Web Services has a service called Lambda, which allows you to package together code that is executed when an event triggers it. This means that if you design an application that needs to retrieve a picture, you don’t need to host all the pictures on your web servers. Instead, an event can trigger a Lambda function that gets the picture from S3 (the AWS storage service) and returns it for display in your application.
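To make the picture example concrete, here is a minimal sketch of what such a function could look like in Python. It is an illustration under assumptions, not a reference implementation: the event shape (a dictionary with a "key" field), the bucket name, and the make_handler helper are all made up for this post. The S3 client is passed in so the logic can be tried without an AWS account.

```python
import base64


def make_handler(s3_client, bucket):
    """Build a Lambda-style handler that fetches one picture from S3 per event."""

    def handler(event, context):
        # Hypothetical event shape: {"key": "<object key of the picture>"}
        key = event["key"]
        # get_object returns a dict; "Body" is a stream with a read() method.
        obj = s3_client.get_object(Bucket=bucket, Key=key)
        image_bytes = obj["Body"].read()
        return {
            "statusCode": 200,
            "headers": {
                "Content-Type": obj.get("ContentType", "application/octet-stream")
            },
            # Binary payloads are base64-encoded so they survive a JSON response.
            "body": base64.b64encode(image_bytes).decode("ascii"),
            "isBase64Encoded": True,
        }

    return handler


# In a real deployment you would wire in the AWS SDK client, for example:
#   import boto3
#   handler = make_handler(boto3.client("s3"), "my-picture-bucket")  # made-up bucket
```

Nothing in this sketch runs until an event arrives, which is the point: the pictures sit in storage, and compute exists only for the duration of each request.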
Let’s back up for a second and ask the obvious question: What is Serverless? On the surface, serverless means exactly what it sounds like, which is that no servers are involved in the architecture. As you dig further into it, though, it’s not that simple. Really, it means that there are no servers that you manage and keep running regardless of the workload. Servers are still part of the architecture, but they run only when the compute power is needed and are shut down when it is not. This helps to bring down the overall cost of running the platform.
However, this concept does not apply as straightforwardly to a data architecture. Databases, at their very core, require a data server for storage and compute. Traditional relational databases, like Oracle, SQL Server, or MySQL, all require database servers that are always on, consuming compute resources even when they are not being used. For most enterprises, that’s not an issue because they have workloads that are constantly querying their databases. However, they will always run into concurrency issues, and managing those can consume a DBA’s entire job. Scheduling workloads is a full-time job in itself, and the schedule rarely works the way it was planned. Ad hoc queries, analysts who write inefficient queries that tie up the CPU, and workload failures are just a few of the things that can bring down an entire data server.
So, the question is, how can an organization leverage the power of a serverless architecture and apply it to their data platforms? And what is the level of development required in order to achieve a “serverless” data architecture?
Compute vs. Storage
As I said in a previous blog post (for those of you who haven’t read it yet, you can find it here), one of the key components of a modern data architecture (MDA) is that it is decoupled. No piece of your architecture should rely on any other piece to keep working as it should. Realistically, that means that your data feeds, data processing, and data access layers should all be able to stand on their own, or even be swapped out for some other technology solution without significant upstream or downstream development. So what happens if you apply this concept not to each layer of an MDA, but rather to a specific layer?
Traditional databases, as I stated earlier, were built to be run on servers. These servers both store the data and supply the compute resources used to run jobs against it. But what if you took this decoupling concept and applied it to a database? You would end up with a need for storage and a separate need for compute. In a traditional, on-premises datacenter, that would not be efficient, or even effective. You would end up with separate servers for storage and compute, with a lot of wasted resources. However, with the rise of IaaS platforms like AWS and Azure, this is not only effective, it’s incredibly cost efficient.
With storage costs going down every year across all cloud providers (e.g. AWS recently lowered the cost of storing in S3 to $23/TB/Month), the allure of decoupling data storage has become stronger and stronger. Platforms like AWS are working hard to keep up, providing tools that allow developers to read the data in S3 for analytical purposes. Data Lakes are becoming more and more popular as a data warehousing and storage option. However, the compute process is not so easily left behind.
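To put the storage price quoted above in concrete terms, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes a flat $23 per TB per month, and it ignores request charges, data transfer, and storage tiers, all of which affect a real bill. The function name and the sample volumes are made up for illustration.

```python
PRICE_PER_TB_MONTH = 23.0  # USD, the S3 figure quoted above; real pricing varies


def monthly_storage_cost(terabytes, price_per_tb=PRICE_PER_TB_MONTH):
    """Return the flat-rate monthly storage cost in USD."""
    return terabytes * price_per_tb


for tb in (1, 10, 50):
    monthly = monthly_storage_cost(tb)
    print(f"{tb} TB: ${monthly:,.2f} per month, ${monthly * 12:,.2f} per year")
```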
As more and more enterprises have started to move their data storage away from traditional databases, they have struggled to find workable solutions for the compute side. The main obstacle is that, from a code perspective, the most efficient language for data operations is SQL. SQL generally requires a database engine to run, so in order to run SQL scripts in a serverless environment, the scripts need to be wrapped in another programming language, like Python or Perl. This requires a level of knowledge that most data engineers don’t have.
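As a rough illustration of what "wrapping SQL in another language" can mean, the sketch below uses Python and its built-in sqlite3 module. The data is a CSV string standing in for a file that would live in object storage such as S3. The function loads it into a throwaway in-memory database, runs one SQL statement, and discards the database. The function name, table name, and sample data are all made up, and a real pipeline would likely use a different engine; this only shows the shape of the idea.

```python
import csv
import io
import sqlite3


def run_sql_over_csv(csv_text, table, query):
    """Load CSV text into a throwaway in-memory SQLite database and run one query."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")  # exists only for the duration of this call
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
        return conn.execute(query).fetchall()
    finally:
        conn.close()  # the compute goes away; the stored file is untouched


# Stand-in for a file that would live in object storage.
SALES_CSV = "region,amount\neast,100\nwest,250\neast,50\n"

totals = run_sql_over_csv(
    SALES_CSV,
    "sales",
    "SELECT region, SUM(CAST(amount AS INTEGER)) "
    "FROM sales GROUP BY region ORDER BY region",
)
print(totals)
```

The SQL itself stays plain SQL. What the wrapper adds is everything a database server used to do implicitly: finding the data, loading it, and cleaning up afterward.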
There is no one right answer to the question posed in the title. Yes, it is possible to have a serverless data architecture, but it is certainly not easy. The good news is there are many different tools and solutions to cover this type of architecture. Platforms like Snowflake Computing have sprung up from this need, covering the data warehousing part of the architecture. Nevertheless, there is no one-stop-shop product that truly provides a complete serverless data architecture.
Still, a true modern data architecture requires a lot of moving parts, with many different tools and components. As progress is made in this space, these tools will continue to mature, including new integration development. This progress will make a serverless data architecture more and more realistic.