Serverless Data Integration — Part II (2019)

A single version of the data truth - Part II

Gaja Krishna Vaidyanatha
19 min read · Jun 12, 2019

Recap

In Part I of this 2-part article series (Serverless Data Integration — Part I (2019)), we discussed the importance of data integration and why every enterprise needs to proactively engage in it. We also discussed the core value proposition of serverless computing. In Part II, we discuss the architecture, required components and relevant guidelines for deploying serverless data integration platforms.

Architecture of a Cloud-Native Serverless Data Integration Hub

The 4 Layers

There is a common English saying — ‘Well begun is half done’. Blame it on my craft’s occupational hazard, but I say ‘Well architected is half done’. The importance of a solid and scalable architecture enabling a Data Integration initiative cannot be overstated. Figure 1 provides a high-level view:

Figure 1 — High-level Architecture of a Cloud-Native Serverless Data Integration Hub

We start our data journey with the various data sources in the orange bucket of Figure 1. Data flows through the 4 layers in the Hub, with each layer supporting an independent function of the Data Integration Architecture. Let’s take a closer look at the layers. They are:

  1. Ingestion
  2. Persistence
  3. Transformation
  4. Analytics/ML/AI

Ingestion is the first layer (Layer #1), where data from multiple sources enters the Data Integration Hub. Most enterprises will probably start their data integration journey by ingesting nightly batch extracts first. For batch extracts, the landing zone for the data can be a simple storage bucket service. A microservice processes the data on the event of it successfully landing in the storage bucket. In due course, these batch extracts need to be converted to real-time data streams, to garner immediate business value instead of ‘end-of-day’ value. In the long run, real-time ingestion is the preferred and recommended mode. Thus, Layer #1 in Figure 1 illustrates the use of a serverless real-time data stream processing service.
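
To make this concrete, here is a minimal ingestion sketch, assuming AWS with a Lambda function fired by an S3 ‘object created’ event. The file format, helper and table names are illustrative, not prescriptive:

```python
# Minimal ingestion sketch (AWS assumed): a Lambda function fired by an
# S3 "object created" event reads the batch extract and hands each record
# to the Persistence layer. Names such as persist_raw are illustrative only.
import csv
import io
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the landed batch extract (CSV assumed for illustration)
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))

        # Hand the rows to the Persistence layer (Layer #2) for RawDB storage
        persist_raw(rows, source=key)

    return {"status": "ingested"}

def persist_raw(rows, source):
    # Placeholder: in a real deployment this would write to the RawDB
    # tables/collections of the serverless NoSQL service (e.g., DynamoDB).
    print(f"{len(rows)} rows from {source} ready for RawDB")
```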

Persistence, the second layer (Layer #2) of the architecture, plays the crucial role of storing data in the Data Integration Hub. This is the layer where a serverless NoSQL database service comes into focus. Persistence is the centerpiece of Data Integration, as it stores, manages and maintains data. When data is ingested in Layer #1, it is transported to Layer #2 and stored in near-raw format within RawDB (a logical container with a set of tables/collections for raw data). The idea is to bring in data, perform some basic datatype validations and store it. Additional and more complex data validation routines are performed in Transformation.
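
A hedged sketch of the ‘validate lightly, then store near-raw’ idea, assuming DynamoDB as the serverless NoSQL service; the table and field names are purely illustrative:

```python
# Persistence sketch (AWS DynamoDB assumed): perform basic datatype checks
# and store the record in near-raw format in RawDB. Table and field names
# are illustrative.
from datetime import datetime, timezone
from decimal import Decimal, InvalidOperation
import boto3

raw_table = boto3.resource("dynamodb").Table("RawDB_Orders")

def store_raw(record):
    item = dict(record)

    # Basic datatype validation only; richer rules belong in Transformation
    try:
        item["order_amount"] = Decimal(str(item["order_amount"]))
    except (KeyError, InvalidOperation):
        item["order_amount"] = None  # flagged later by Transformation

    item["ingested_at"] = datetime.now(timezone.utc).isoformat()
    raw_table.put_item(Item=item)
```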

Transformation is the third layer (Layer #3), which handles the complexity of converting raw data into clean data. The transformation layer’s complexity and heavy lifting relate to ensuring that data conforms to data quality rules, business rules and compliance rules/policies. The arrival of data in RawDB of Layer #2 is the next event; it triggers the main transformation microservice and its related sub-microservices to fire. These event-based microservices can be deployed utilizing the relevant serverless Function as a Service (FaaS) of a Cloud Service Provider (CSP) — AWS Lambda/Azure Functions/GCP Cloud Functions. Data that fully conforms to all predefined transformation rules is persisted in CleanDB (another logical container with a set of relevant tables/collections for clean data) and erroneous/non-conformant data is stored in ErrorDB (the 3rd logical container).
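
Below is a minimal sketch of such a transformation microservice, assuming DynamoDB Streams on RawDB as the trigger; the rules, table names and attributes are illustrative only:

```python
# Transformation sketch (AWS assumed): a Lambda fired by the RawDB stream
# applies data quality/business rules and routes each record to CleanDB or
# ErrorDB. Rule logic and table names are illustrative.
from decimal import Decimal
import boto3
from boto3.dynamodb.types import TypeDeserializer

dynamodb = boto3.resource("dynamodb")
clean_table = dynamodb.Table("CleanDB_Orders")
error_table = dynamodb.Table("ErrorDB_Orders")
deserializer = TypeDeserializer()

def passes_rules(item):
    # Illustrative rules only: a positive numeric amount and a customer_id
    amount = item.get("order_amount")
    return isinstance(amount, Decimal) and amount > 0 and bool(item.get("customer_id"))

def lambda_handler(event, context):
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        # Convert the DynamoDB stream wire format into plain Python values
        item = {k: deserializer.deserialize(v)
                for k, v in record["dynamodb"]["NewImage"].items()}
        if passes_rules(item):
            clean_table.put_item(Item=item)
        else:
            item["error_reason"] = "failed data quality rules"
            error_table.put_item(Item=item)
```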

Analytics/Machine Learning (ML)/Artificial Intelligence (AI) is the fourth layer (Layer #4), where Business Insights (BI) are generated. This layer is triggered by the arrival of data in CleanDB within Layer #2 (the next event). It supports the calculation and storage of Key Performance Indicators (KPIs) for the business and the management of Analytics/ML/AI jobs. The results of all jobs (BI) are stored in InsightsDB (the 4th logical container). Serverless services can also be utilized for the implementation of this layer.
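
As an illustration, a KPI-maintenance microservice in this layer could look like the following sketch, again assuming DynamoDB Streams on CleanDB; the KPI, keys and attributes are hypothetical:

```python
# KPI sketch (AWS assumed): a Lambda fired by the CleanDB stream keeps a
# running daily-revenue KPI in InsightsDB using an atomic counter update.
# KPI name, keys and attributes are illustrative.
import boto3
from boto3.dynamodb.types import TypeDeserializer

insights_table = boto3.resource("dynamodb").Table("InsightsDB_KPIs")
deserializer = TypeDeserializer()

def lambda_handler(event, context):
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        item = {k: deserializer.deserialize(v)
                for k, v in record["dynamodb"]["NewImage"].items()}

        # Accumulate the KPI atomically, keyed by business date
        insights_table.update_item(
            Key={"kpi_name": "daily_revenue", "kpi_date": item["order_date"]},
            UpdateExpression="ADD kpi_value :amt",
            ExpressionAttributeValues={":amt": item["order_amount"]},
        )
```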

When data passes through the 4 layers, it completes its data integration cycle. The cycle is repeated, as and when new/modified data flows into the Data Integration Hub from back-end systems.

Notes:
a. Analytics is defined as a journey through data that generates questions (BI). When you embark on an analytics journey, you really don’t know what you are looking for. The BI generated by an analytics job creates additional data mining opportunities. Analytics should not be confused with KPI calculations.

b. In many regulated industries, it is a requirement to maintain a detailed accounting of all data transformation functions. This upholds the integrity of the data and prevents covert data modification that profits vested interests. It also drives the rationale behind temporarily maintaining 2 copies of the Core Data (RawDB and CleanDB).

c. The additional copy of Core Data also ensures that if Transformation goes bad for some unforeseen reason on a given day, there is a fallback option to re-run Transformation from RawDB, once the relevant issues have been resolved. Thus, the temporary cost associated with doubling the data footprint (of Core Data) is justified for operational flexibility, data integrity and regulatory compliance.

d. In the long run, the persistence and management of historical data (RAW & CLEAN) needs to be addressed via standard Information Lifecycle Management (ILM) measures that utilize various tier-based data persistence services. All 3 CSP vendors, namely AWS, Azure & GCP, support ILM. This ensures data persistence costs are maintained at optimal levels while adhering to all regulatory requirements.

e. The rationale behind separating ErrorDB into its own logical container is to enable the office of the Chief Data Officer (CDO) to manage data quality issues. This visibility enables the process of engaging the relevant data stewards to fix data quality issues ‘at source’. A simple visual dashboard on ErrorDB can highlight the relevant data issues and assist in proactive data quality management.

Visualization also plays an important role in information delivery within the enterprise. Although not listed as a layer of the Data Integration Hub Architecture, it still needs to be given its due. To maintain the serverless paradigm, cloud-native visualization services such as AWS QuickSight, Azure Power BI (SaaS) & GCP’s Data Studio should be leveraged for the delivery of BI to the business. Breaking news — Google’s acquisition of Looker makes its BI and analytics visualization offering much stronger. The integration of Looker into Data Studio should make things interesting.

ETL vs. ELT

The ingestion and processing of data outlined in this section conforms to Extract Load Transform (ELT) instead of the classical method of Extract Transform Load (ETL). The accurate representation of ELT in a Data Integration Hub is ELTn. Why is that? The rationale behind ELTn is grounded in the real-life business need for transformation not to be a one-time process. Business transformation can (and will) occur many times in the life of a business, often without any forewarning. Data Integration Hubs thus need to support a business’s requirement to embrace change in a seamless fashion, by supporting data architectures that facilitate flexible persistence mechanisms.

In the traditional ETL model, data that does not conform to the ‘rules’ is left out of the database. Non-conforming data thus finds its way to some dark substrata of an enterprise’s data landscape, rarely to be seen again. I subscribe to the view that ALL data should be brought into the hub and then classified into CLEAN and ERROR in the Transformation phase. Over a period of time, the percentage of data in ERROR will reduce as the quality of data improves (issues fixed at source). Thus, ELTn is a fundamental paradigm shift in data management and data processing, as it departs from the classic ETL method we have adhered to for more than 2 decades.

One more thing while we are on this subject — if you have ever attempted to architect Data Integration Hubs using relational databases and ETL, you will be familiar with how every new business requirement that warrants a column modification (change) causes cascading failures: from applications that access the changed object, to ETL jobs that ingest data into (or from) the changed object, and finally to replication of the changed object to other databases. A change to a single column in a table can cause multiple failures. Support for frequent changes in a dynamic business environment is one of the fundamental areas where IT/data practitioners have historically struggled. With a document-based approach to data integration using NoSQL databases, and applications processing JSON from REST (Representational State Transfer) endpoints, that problem is now solved.

Consumption Pattern

On generation of BI, applications need to be enabled to consume data from InsightsDB & CleanDB in real-time. This consumption is done via the Data Access Layer (DAL), as illustrated in Figure 1. Utilizing REST endpoints, consumption of BI and the underlying data for the various business objects can be supported. It is important to note that the DAL is the single doorway for data consumption. This Data API layer virtually eliminates unwarranted external data breaches and data leaks, by serving data only when the data request is accompanied by the required credentials. These required credentials (security credentials and data entitlements) are received from an enterprise’s federated authentication systems (AD, LDAP etc.) in concert with the Identity Access Management (IAM) service of a given CSP.

The well-defined REST endpoints created as part of the DAL not only obviate the need for persistent connection management to the Data Hub, but also ease the impact of changes to data structures. For example, an application that receives and processes JSON documents from the RESTful endpoints can continue to do so regardless of any data structure change (adding or dropping of data elements). This ensures runtime flexibility of both the applications and the Data Hub, in an ever-changing business environment.
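
A small consumer-side sketch illustrates the point; the endpoint URL, token scheme and field names are hypothetical:

```python
# Consumer sketch: an application reading BI from a (hypothetical) DAL REST
# endpoint. Because it accesses only the fields it needs and defaults the
# rest, added or dropped data elements do not break it.
import requests

def fetch_customer_insights(customer_id, token):
    # Endpoint URL and auth scheme are illustrative assumptions
    resp = requests.get(
        f"https://api.example.com/dal/insights/customers/{customer_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    doc = resp.json()

    # Tolerant access: unknown new fields are ignored, missing ones default
    return {
        "customer_id": doc.get("customer_id", customer_id),
        "lifetime_value": doc.get("lifetime_value", 0),
        "churn_risk": doc.get("churn_risk", "unknown"),
    }
```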

Note:
It is prudent to run the DAL on a Cloud service such as AWS’s API Gateway, Azure’s API Management or GCP’s Cloud Endpoints. This ensures performance and scalability of the data consumption layer. Equally important is the seamless integration of the microservices and orchestration layers with the DAL. In the case of AWS & Azure, that is well and truly in place with their respective FaaS’s integration across all relevant services.

Important Lessons Learnt — Serverless Computing Guidelines

By this time, I hope you are appreciating the complete paradigm shift in how one has to approach serverless computing, at least as it relates to data management. In this section, I will summarize what I believe are some of the important guidelines to consider while running serverless workloads. These are lessons learnt and experience gathered. Please use them as appropriate.

Get to Know Your System’s Quiet Time

I have already addressed this topic in an earlier section of Part I of this series. Just to reiterate, the more you understand about when and how much quiet time your application has, the better success you will reap with the design and deployment of your serverless architecture. Running a serverless system with zero quiet time may not generate the desired outcomes of serverless computing.

Optimize for Elapsed Time

In the world of performance optimization, the following mantra is gospel:

Response Time = Service Time + Wait Time

Response Time is another name for Elapsed Time. The optimization goal for serverless computing should be no different. Our goal is to ensure that each microservice runs in the shortest possible elapsed time.

In the serverless realm (using the example of AWS), there is a direct correlation between the amount of memory one allocates for a function and its projected cost. The cost per unit of execution time increases proportionately to the amount of memory allocated. However, it is also a well-known fact that larger memory allocations result in more CPU and networking capacity being provisioned to the function. This is the case with AWS Lambda, where modifying the amount of memory for a function has a proportional impact on CPU and networking.

At the time of writing, AWS Lambda allocates a single CPU core proportionate to the amount of memory (up to 1.8 GB). Beyond 1.8 GB of memory, 2 cores are allocated and proportionally scaled. Multi-threaded microservices can and will leverage the additional provisioned resources for better performance, reduced elapsed times and even lower costs. Net-net, you are charged for the runtime of your program, so please take the time to optimize the resource allocation for your microservices, to ensure that all function runtimes are optimal. Needless to say, there is always the Law of Diminishing Returns to consider when it comes to resource allocation. This depends on the design and profile of your function’s program (CPU-intensive vs. memory-intensive vs. I/O-intensive).
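
One practical way to find the sweet spot is to probe a function at a few memory settings and compare elapsed times. Here is a rough sketch using boto3; the function name and payload are placeholders, and billed duration is best read precisely from the function’s CloudWatch logs:

```python
# Rough memory-vs-elapsed-time probe (AWS assumed): re-configure a function's
# memory, invoke it with a representative payload and record the wall-clock
# time of each call. Function name and payload are illustrative.
import json
import time
import boto3

lam = boto3.client("lambda")

def probe(function_name, payload, memory_sizes=(128, 512, 1024, 1792, 3008)):
    results = {}
    for mb in memory_sizes:
        lam.update_function_configuration(FunctionName=function_name, MemorySize=mb)
        time.sleep(5)  # crude wait for the configuration change to settle

        start = time.time()
        lam.invoke(FunctionName=function_name, Payload=json.dumps(payload))
        results[mb] = round(time.time() - start, 3)
    return results

# Example: probe("my-transform-fn", {"sample": True})
```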

Function Runtime Limits

When AWS Lambda was released in 2014, it had a runtime limit of 300 seconds. At face value, this meant that a single execution of a function could not exceed 5 minutes. However, it is very easy to overcome this limitation via orchestration using AWS Step Functions. It is as simple as keeping track of the runtime from the start of the program’s execution and re-launching the function from the step function before the 300-second timer runs out. This, for all practical purposes, removes the 5-minute limitation.
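
A minimal sketch of this pattern: the function watches its remaining time via the Lambda context object and returns its progress so the state machine can re-launch it with that state as input. The work-loading helpers here are hypothetical:

```python
# Sketch of the "re-launch before the timer runs out" pattern: the Lambda
# checks its remaining execution time and, instead of being killed, returns
# its progress so a Step Functions Choice state can invoke it again with
# that state as input. Field names and the work unit are illustrative.
SAFETY_MARGIN_MS = 30_000  # stop well before the hard timeout

def lambda_handler(event, context):
    cursor = event.get("cursor", 0)
    work_items = load_work_items()  # hypothetical: whatever the job iterates over

    while cursor < len(work_items):
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            # Hand control back; the state machine re-launches with this input
            return {"done": False, "cursor": cursor}
        process(work_items[cursor])  # hypothetical unit of work
        cursor += 1

    return {"done": True, "cursor": cursor}
```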

Further, in October 2018, AWS increased the maximum runtime to 900 seconds (15 minutes). This new ceiling gives more than adequate time for microservices to complete. In the end, the following salient points need to be considered:

  1. Microservices are intended to be small, simple programs, thus 15 minutes is plenty
  2. Orchestration of parallel microservice execution with multiple threads further allows for runtime reduction
  3. When required, orchestration can support longer runtimes

Latency/Cold Start

There are many design decisions that need to be made to ensure the optimal functioning of serverless architectures. That aside, one of the most frequently discussed issues in the serverless realm is the elapsed time to launch an AWS Lambda function. This is called function latency, and it relates to the time it takes to reuse an already provisioned micro-container. It is a relevant discussion point when multiple independent executions occur within a short period of time. However, when function executions are separated by longer gaps (roughly 30 minutes or more), latency is even higher, as the micro-container needs to be re-provisioned. This is referred to as a cold start. There is a lot that needs to be learned in understanding how AWS Lambda truly works. To get a complete idea of the ‘anatomy and physiology’ of AWS Lambda, please check out AWS’s documentation in addition to Best practices and hard lessons learned of serverless applications by Chris Munns (Principal Developer Advocate — Serverless@AWS).

Further, for those serverless implementations where such latencies are completely unacceptable, there is also an open-source ‘Lambda warmer’ project on GitHub that could help. Your mileage may vary depending on your prior design decisions. Philosophically, there is definitely a need to review and rethink the rationale and value proposition of deploying on serverless systems, if your application’s sensitivity to latency is that high and if you are trying to ‘maintain state’. Lastly, if your programming language is Python or Node.js, you may also evaluate Binaris, a FaaS that helps applications proactively deal with latency/cold-start issues. For the record, I do not have any vested interest in Binaris.
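
For completeness, the commonly used ‘keep warm’ pattern boils down to a scheduled ping that the handler short-circuits. A generic sketch, in which the event marker and the real handler are assumptions:

```python
# Common "keep warm" pattern (not specific to any one project): a scheduled
# rule pings the function with a marker event; the handler short-circuits so
# the micro-container stays provisioned without doing real work.
def lambda_handler(event, context):
    if event.get("warmer"):
        return {"warmed": True}

    # ... normal business logic continues here ...
    return handle_business_event(event)  # hypothetical real handler
```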

CSP Vendor Lock-In

Could we lock ourselves into a specific CSP just because we deployed a serverless application? I don’t think so. Could we lock ourselves into a specific CSP due to hard-coded proprietary SDK API calls in our serverless functions? Oh, most certainly yes! To the best of my knowledge, there is currently no ‘standard’ inter-Cloud SDK that we can leverage for CSP-independent API calls. Although PyPI (Python Package Index) is a great initiative and a repository developed by the Python community, work needs to be done in the area of a standard Python Cloud SDK that is seamless across multiple CSPs. What does this mean to you? An AWS Lambda function, programmed in Python 3.6 with at least one AWS Python SDK call (boto3 or botocore), cannot be lifted and shifted to another CSP ‘as-is’. Thus, the inevitable has to be confronted — maintenance of multiple codebases (one for each CSP). Let’s discuss this further and see whether we can design around this issue.

Consider a simple microservice that takes 2 simple parameters (storage_bucket_name & object_name) to retrieve and download an object from a CSP’s bucket. The main Python SDK API calls for this task (across the 3 CSPs) are illustrated in Figure 2:

Figure 2 — Python SDK API Calls for AWS, GCP & Azure to download an object from a storage bucket

It is clear from Figure 2 that the microservice to perform this simple task will be different in each of the CSPs, due to their proprietary SDKs. You may require 3 orchestration programs for the 3 different CSPs and thus need to maintain 3 different codebases. This seems both painful and inevitable.

Could we handle this in the design of the orchestration and microservices programs? If the microservices and the orchestration designs are kept simple, cross-CSP microservice execution and orchestration could become a reality. This is done with a little bit of JSON juggling along with a simple code deployment configuration. Let’s now find a way to get around hard-coding a CSP’s SDK API calls in the orchestration program, such that we maintain only 1 copy of the program. Figure 3 provides an abstraction method using JSON to design a CSP-specific microservice at code deployment time with a single orchestrator:

Figure 3 — Simple Python program abstraction using JSON for a microservice in AWS, GCP & Azure

As noted in Figure 3, the separation of the CSP’s SDK API calls into a microservice will facilitate a single orchestration program. Here are the steps that do the trick:

  1. Abstract the microservice with the SDK API calls across all CSPs
  2. Create a JSON doc called get_obj_from_bkt with the definition illustrated in Figure 3
  3. Store get_obj_from_bkt in a NoSQL document collection — let’s call this table/collection CodeAbstract
  4. During code deployment (within your CI/CD pipeline), run a pre-processing program that generates the correct/relevant microservices for the chosen CSP. The CI/CD pipeline should be well aware of where it is deploying the code, but explicitly handling the target CSP is important — let’s call this Python function CodePreProc
  5. Pass a runtime argument (aws, azr, gcp etc.) to CodePreProc
  6. Based on this runtime argument, CodePreProc queries CodeAbstract (in the above case specifically, the get_obj_from_bkt JSON doc) to generate the CSP-specific microservice.
  7. Based on the runtime argument passed in Step #5, the code array for the CSP is selected from the JSON object. If aws is passed to CodePreProc at runtime, the code array for aws_get_obj_from_bkt is selected from the JSON document.
  8. CodePreProc then ‘concatenates’ newline escape sequences (‘\n’) between the elements of a given CSP’s code array (using the join() function) and creates the appropriate Python program for AWS — the aws_get_obj_from_bkt.py file.
  9. This AWS-specific microservice is now ready for testing and deployment in the next stage of the CI/CD pipeline.
  10. So basically, we have encapsulated CSP-specific SDK API calls within the confines of a single function. It is conceivable that creating one JSON doc per SDK API call, for CSP-specific functionality, will do the trick. Again, this method is no silver bullet, but it is a feasible alternative to hard-coding CSP SDK API calls in all programs. A minimal sketch of such a pre-processing step is shown right after this list.
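
Here is a minimal sketch of CodePreProc under these assumptions. The JSON document is read from a local file for brevity; in practice it would be fetched from CodeAbstract, and the exact keys are illustrative:

```python
# Minimal CodePreProc sketch: given the target CSP, pull the matching code
# array out of the get_obj_from_bkt JSON document and write it out as the
# CSP-specific microservice file. The document structure mirrors Figure 3
# conceptually; exact keys are illustrative.
import json
import sys

def code_pre_proc(csp, abstraction_doc_path="get_obj_from_bkt.json"):
    with open(abstraction_doc_path) as f:
        doc = json.load(f)            # in practice, read from CodeAbstract

    # e.g. doc["aws_get_obj_from_bkt"] is an array of source lines
    code_lines = doc[f"{csp}_get_obj_from_bkt"]
    source = "\n".join(code_lines)    # re-assemble the program text

    out_file = f"{csp}_get_obj_from_bkt.py"
    with open(out_file, "w") as f:
        f.write(source + "\n")
    return out_file

if __name__ == "__main__":
    print(code_pre_proc(sys.argv[1]))  # e.g. python code_pre_proc.py aws
```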

There are multiple advantages to the aforementioned approach:

a) First and foremost, it enables us to leverage a single orchestration/wrapper program that has no CSP-specific function calls. This allows support for multiple CSPs with the same codebase, with independent CSP-specific SDK API microservices (functions).

b) Any change in the SDK (arguments/names etc.) for a given CSP, affects ONLY one specific CSP’s code array within the JSON document. As long as correct syntax of the JSON document is maintained, there is ZERO impact for the other CSPs and the orchestrator program.

c) For those applications that are not latency-sensitive, program abstraction for CSP-specific SDK API calls, can also be done dynamically at runtime. Now that is a pure case of Cloud Microservices Deployment Nirvana :)

In summary, CSP-specific SDK API call abstraction requires upfront design and may not be feasible for all existing projects. However, for new projects it does provide a method to decouple microservices with CSP-specific SDK API calls. This helps in maintaining generic base/orchestration programs while keeping us free from vendor lock-in!

Serverless Functions — Idempotent vs. Duplicate Processing

Idempotence is a fundamental construct of computer science that defines the end state of a program’s execution. Idempotent execution requires that, with the same set of input values, the end state of the program remains exactly the same, no matter how many times the program is run. For example, if we run a program to modify the address of a customer to ‘123 Main Street’, no matter how many times we run it, the end state of the customer’s address is ‘123 Main Street’. This is an example of an idempotent program/function. Said another way, there are no adverse side effects of running this function multiple times. So, whether a function is idempotent or not depends on the design and programming of the said function.

There is confusion in some circles regarding this term as it relates to AWS Lambda. Due to the dynamic and asynchronous nature of how a Lambda function comes to life (including the launching of micro-containers during a cold start), there are some edge cases where multiple containers can be launched simultaneously for a single execution request. As practitioners of Cloud Computing, we need to design with one important principle — ‘everything that can break, will break’. Thus, the environmental side effect of AWS Lambda discussed above needs to be addressed. It is a simple case of preventing duplicate data processing.

Regardless of serverless computing, I am philosophically inclined toward designs that protect the system from any accidental duplicate processing of the same data. These safeguards are usually factored within the confines of the logic of the program. The scenario of duplicate data processing is similar to consumption management of a message queue. To ensure the integrity of processing, one needs to design one’s programs correctly, such that exactly-once processing is preferred over at-least-once processing. The subject of idempotence is irrelevant here, unless of course the processing is similar to the address change discussed earlier. Thus, idempotence should not be confused with design patterns that prevent duplicate data processing. They are two different aspects of a system and need to be treated as such.
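
One common safeguard is a conditional write against a dedupe table, so a second delivery of the same event is silently skipped. A sketch assuming DynamoDB; the table, key and handler names are illustrative:

```python
# Duplicate-processing guard sketch (AWS DynamoDB assumed): record each
# processed event id with a conditional write; a second delivery of the same
# event fails the condition and is skipped. Table/attribute names are
# illustrative.
import boto3
from botocore.exceptions import ClientError

dedupe_table = boto3.resource("dynamodb").Table("ProcessedEvents")

def process_once(event_id, handler, payload):
    try:
        dedupe_table.put_item(
            Item={"event_id": event_id},
            ConditionExpression="attribute_not_exists(event_id)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return None  # already processed; skip the duplicate
        raise
    return handler(payload)
```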

Hard-wiring functions to a VPC

The basic recommendation for function and VPC association goes like this — using AWS Lambda as the example, if your function does not access resources within a VPC, do not force it to run within a VPC. As of the authoring of this article, there is a network configuration overhead on the micro-container (related to Elastic Network Interfaces — ENIs), and hard-wiring a function to a VPC comes at a cost. If the function requires access to other resources within the VPC, it is recommended to place it in a private subnet for security and segregation.

Function Testing — Local vs. On the Cloud

Every programmer has his/her favorite Integrated Development Environment (IDE). There is no doubt about the value an IDE brings to program testing. Cloud functions need to be tested ‘locally’ first, before they are pushed to the Cloud. Let’s face it, testing functions on the Cloud without a Cloud-aware IDE can be very time-consuming. You end up spending the bulk of the time pushing code through the CSP’s code management plumbing rather than performing the actual testing.

It is understandably faster to generate the events locally and validate an entire execution flow of a string of microservices, before attempting to push the code, run it on the Cloud and deal with a slew of errors. Mercifully, there is a solution that should make our coding/testing/deployment life much easier, developed by Stackery. While you are at Stackery’s website, be sure to check out the video of serverless function testing/deployment. Again, for the record, I do not have any vested interest in Stackery.

Function Observability — Pain-free Production Monitoring & Troubleshooting

One of the pet peeves of IT professionals who have transitioned from a server-based environment to the serverless realm is the lack of function observability. This is the ability to observe the details associated with a function’s execution. In a server-based environment, one always had the ability to observe program execution via various tracing methods. Although there is a default standard set of metrics logged in AWS CloudWatch, for applications that make many function calls a day, data mining function-specific details from CloudWatch (or any other monitoring service for that matter) can pose some challenges.

Sifting through CloudWatch logs to find a function’s execution details at a specific hour, minute and second on a given day, especially when there are many other services/metrics logged in CloudWatch, is analogous to finding a needle in a haystack. In my experience, this problem, within the context of data integration hubs, can be very easily solved by instrumenting each microservice with a logging function that writes to a log table of our choice. Yes, there is AWS X-Ray for function debugging and tracing. However, I believe we need better control of our logging destiny.

Basically, we need to take the responsibility to define what we want to observe in a function’s execution. A little investment in the design of the log table and its key data elements (function_name, run_date, run_time, total_elapsed_time, call parameters etc.) goes a long way in solving this problem. Start with a simple log table design and augment it according to your evolving needs (a minimal logging-helper sketch follows the list below). This adds significant production monitoring value for your serverless deployments. Here is how:

  1. You define the structure of the function execution logs relevant to your application, i.e., you don’t have to live with some default one-size-fits-all imposed standard
  2. You control the amount of logging based on execution criteria and logging levels, by defining higher granularity during debugging/troubleshooting vs. lower granularity during normal times
  3. You utilize visualization services (such as AWS QuickSight) to convert logs into relevant production information that provides a clear view of the working health of your application — in our case, the health of the Data Integration Hub
  4. You control the amount of log history to what is appropriate in your environment (helps with establishing performance baselines)
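
To tie this together, here is a hedged sketch of such a logging helper, assuming DynamoDB as the log table and using the data elements suggested above; the log-level switch and attribute names are illustrative:

```python
# Self-managed logging sketch: a small helper each microservice calls to
# write one execution record into a log table of our choosing (DynamoDB
# assumed). Column names follow the elements suggested above; the log-level
# switch is illustrative.
import os
import time
from datetime import datetime, timezone
from decimal import Decimal
import boto3

log_table = boto3.resource("dynamodb").Table("FunctionExecutionLog")
LOG_LEVEL = os.environ.get("LOG_LEVEL", "NORMAL")  # NORMAL | DEBUG

def log_execution(function_name, started_at, call_params, status, detail=None):
    # started_at is the time.time() captured at the start of the microservice
    now = datetime.now(timezone.utc)
    item = {
        "function_name": function_name,
        "run_date": now.strftime("%Y-%m-%d"),
        "run_time": now.strftime("%H:%M:%S.%f"),
        "total_elapsed_time": Decimal(str(round(time.time() - started_at, 4))),
        "status": status,
    }
    if LOG_LEVEL == "DEBUG":
        # Higher granularity only while troubleshooting
        item["call_params"] = str(call_params)
        item["detail"] = str(detail)
    log_table.put_item(Item=item)
```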

Function Observability is thus a critical aspect of production management of serverless applications, and the return on investment of a self-sufficient logging model is very high.

Separation of Serverless & State

The topic of serverless applications supporting state also seems to generate a lot of heated debate. Is there a real need for serverless to better support ‘stateful’ applications? That is the burning question for most. I believe that serverless applications, true to their runtime environment, should not have to worry about maintaining state (normally defined as the entire contents of an application’s memory).

Given the ephemeral nature of micro-containers in the serverless realm, I personally believe that it is not a good idea to force serverless computing environments to save state. There could be multiple unwanted side effects of this in the long run. If there is a need for an application to maintain temporal state, then the onus is on the application’s programming logic to persist state data in a relevant datastore (Memcached or Redis, as applicable).
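
A small sketch of this separation, assuming Redis via the redis-py client; the host, key naming and business logic are placeholders:

```python
# Sketch of keeping temporal state outside the function (Redis assumed via
# the redis-py client): the function loads its state by key at the start and
# writes it back before returning, so nothing depends on the micro-container
# surviving between invocations. Host and key naming are illustrative.
import json
import os
import redis

r = redis.Redis(host=os.environ.get("STATE_HOST", "localhost"), port=6379)

def lambda_handler(event, context):
    state_key = f"job-state:{event['job_id']}"
    state = json.loads(r.get(state_key) or "{}")

    state = do_work(event, state)          # hypothetical business logic

    r.set(state_key, json.dumps(state), ex=3600)  # expire stale state after 1h
    return {"status": "ok"}
```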

Conclusion — Is Serverless Dead?

Now that’s a controversial question! With all the hype that revolves around serverless — what it is, what it isn’t, why it has failed, why it has succeeded — it is useful to revisit its humble beginnings and understand the original rationale behind its creation. Find out here whether Serverless is Dead. Well, in my humble opinion it isn’t, not by a long shot. But there is no harm in getting a few good laughs as you go through this presentation.

I hope this 2-part article series has provided some food for thought on the relevant topics associated with serverless computing, specifically as it relates to data management. I wish you the very best in your data integration journey on the Cloud. As we part ways, I have these final words for you — Be happy, live long and prosper (à la Spock) :)

Originally published at https://www.linkedin.com.

