How we achieved near real-time analytics with a scalable, serverless architecture

Vanguard Tech
Jun 29, 2023

Webinars are one of the most popular forms of B2B marketing events, enabling businesses to reach and engage with a large audience in real time. These events can help businesses better understand their audience, generate more leads, target marketing efforts, and drive growth. Activity data from a webinar typically includes a variety of metrics, such as events, attendees, registrants, engagement, and other valuable signals. This data can provide insights into customer behavior, such as the topics customers (in our case, financial advisors) are interested in, how long they engage with the content, and what questions they ask. These insights can help businesses tailor their marketing messages and product offerings to better meet their customers’ financial goals and identify opportunities for new acquisitions, cross-sells, and upsells. However, capturing and leveraging the activity data from these events can be a complex and resource-intensive process.

This blog explores how a serverless-first, event-driven architecture can be used to efficiently ingest and process webinar data in near real time, while generating actionable insights for marketing and sales teams. The architecture uses AWS (Amazon Web Services) Lambda, a serverless computing service, to process the data in near real time without the need for the team to manage any servers, and AWS Step Functions, a state machine service, to manage the workflow of the event processing. We also show the cost savings achieved by leveraging AWS Lambda functions over AWS Glue Python Shell jobs for the same process.

Serverless-first, event-driven architecture

To effectively capture and leverage webinar activity data, businesses need a scalable, reliable, and cost-effective architecture. A serverless-first, event-driven architecture can provide these benefits by leveraging cloud computing services and implementing functions as a service (FaaS) for event processing.

For this architecture, we leveraged AWS Lambda, a serverless computing service, to process the data in near real time. After carefully considering a few serverless options, we chose Lambda over Glue due to Lambda’s quick bootstrap times, cost efficiency, and scalability. This choice also aligned well with our internal technology stack, which predominantly supports event-driven architecture. To enable near real-time analytics and marketing activation use cases, we aimed to run the process as soon as the data was made available for processing by the vendor (in this case, within 30 minutes). This allowed us to process the data quickly, providing insights into our customers’ behavior as close to real time as possible.

We also used AWS Step Functions, a state machine service, to manage the workflow of the event processing. This allowed us to have greater control over execution while still maintaining a centralized location to track status, improving the efficiency and reliability of the overall system. The figure below depicts the high-level architecture for processing the webinar data.
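For illustration, here is a minimal sketch (in Python, rendering the Amazon States Language definition as JSON) of the kind of state machine that could drive such a workflow. The state name, Lambda ARN, and retry settings are placeholders rather than our actual configuration.

```python
import json

# Illustrative-only: a minimal Step Functions state machine that invokes an
# ingestion Lambda and retries failed executions. The ARN and names below are
# placeholders, not the actual resources behind this architecture.
state_machine_definition = {
    "Comment": "Orchestrates the webinar data ingestion Lambda",
    "StartAt": "IngestWebinarData",
    "States": {
        "IngestWebinarData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ingest-webinar-data",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "End": True,
        }
    },
}

print(json.dumps(state_machine_definition, indent=2))
```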

In addition, since our implementation relies on making REST API calls to vendor endpoints, pagination limits the number of records returned per page. To retrieve all records, we often need to invoke multiple API requests with the right parameters, such as the start and end datetimes and the page offset.
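As a rough sketch, the snippet below shows what such a paginated retrieval loop can look like; the endpoint, parameter names, and response fields are illustrative assumptions rather than the vendor’s actual API contract.

```python
import requests  # assumed HTTP client; the real implementation may use a vendor SDK

def fetch_all_pages(endpoint, start_dt, end_dt, page_size=100):
    """Retrieve every record between start_dt and end_dt, one page at a time.

    The parameter names (fromDateTime, toDateTime, pageOffset, pageSize) and the
    "items" response field are illustrative placeholders.
    """
    records, offset = [], 0
    while True:
        resp = requests.get(endpoint, params={
            "fromDateTime": start_dt,
            "toDateTime": end_dt,
            "pageOffset": offset,
            "pageSize": page_size,
        })
        resp.raise_for_status()
        page = resp.json()
        records.extend(page["items"])
        if len(page["items"]) < page_size:  # last page reached
            break
        offset += page_size
    return records
```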

High-level serverless Lambda architecture for acquiring and processing webinar data

Managing Lambda timeouts during peak loads

A Lambda function can run for a maximum of 15 minutes before it is terminated, a limit we are likely to hit during peak loads. This means that if we receive a large volume of data during peak activity, we may not be able to process it all in one invocation. To overcome this limitation, we implemented a variant of the exponential backoff technique: we dynamically adjust the start and end datetime parameters to filter the right amount of data and safely commit the work before Lambda times out, allowing the next execution to resume from where the previous one left off.

Our variant of the exponential backoff technique to compute the end datetime parameter

To set the start datetime parameter, we use the last processed timestamp plus one second. To determine the end datetime parameter, we use a variant of the exponential backoff strategy. We begin by issuing an API call using the max end datetime, which returns the total number of records available for processing as part of its metadata. We then check the elapsed time and determine the amount of data that can be processed before Lambda times out, which we call Lambda “capacity.” If the data available for processing is less than Lambda’s capacity, we pick that end datetime and recursively invoke the API, persisting all the data. If the data exceeds Lambda’s capacity, we exponentially back off and pick a shorter time window until we determine the correct end datetime.

For example, the first figure below illustrates how the time window is adjusted, starting with a 60-minute window and reducing it exponentially with each subsequent attempt. To conserve resources, we stop the process after five attempts. The subsequent figure shows high-level pseudocode for computing the end datetime parameter.

Exponential back-off to determine end datetime
High-level pseudocode for processing the end datetime parameter
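In code, the windowing logic can be sketched roughly as follows. Here, count_records stands in for the API metadata call that reports the number of records in a window, and remaining_capacity for the estimate of how many records the Lambda can still process before its timeout; both are placeholders.

```python
from datetime import timedelta

MAX_ATTEMPTS = 5                         # stop after five attempts, as described above
INITIAL_WINDOW = timedelta(minutes=60)   # first fallback window of 60 minutes

def choose_end_datetime(start_dt, max_end_dt, count_records, remaining_capacity):
    """Pick an end datetime whose record count fits the remaining Lambda capacity."""
    # First attempt: try the full range up to the max end datetime, then fall
    # back to a 60-minute window that shrinks exponentially on each attempt.
    candidates = [max_end_dt]
    window = INITIAL_WINDOW
    while len(candidates) < MAX_ATTEMPTS:
        candidates.append(min(start_dt + window, max_end_dt))
        window = window / 2
    for end_dt in candidates:
        if count_records(start_dt, end_dt) <= remaining_capacity:
            return end_dt   # this window fits within the remaining capacity
    return None             # give up; the next invocation resumes from start_dt
```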

Managing vendor throttling and server timeouts gracefully

In our architecture, we encountered another crucial scenario that required careful consideration — devising an effective retry strategy to tackle temporary failures in vendor API requests. These failures could stem from factors like vendor throttling or server errors, which have the potential to disrupt the system. Such errors encompass HTTP (Hypertext Transfer Protocol) response codes like 408 (request timeout), 429 (too many requests), and 5xx (server error). While our primary goal is to minimize errors, it’s practically impossible to eliminate them entirely. Therefore, we must ensure our systems are designed to gracefully handle errors. Thankfully, many errors fall into the category of being transient, allowing us to enhance the reliability and availability of our service by implementing a retry mechanism for failed API requests.

This is where the concept of exponential backoff again comes into play. By utilizing the standard exponential backoff technique, we can fine-tune the initial retry wait time and the scaling factor to determine the retry wait time for each attempt. This pattern lets us avoid overwhelming the vendor with additional calls during a service failure.
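A minimal sketch of such a retry wrapper is shown below. The status codes match those listed above, while the initial wait, scaling factor, and jitter values are illustrative assumptions rather than our production settings.

```python
import random
import time

import requests  # assumed HTTP client

RETRYABLE_STATUSES = {408, 429} | set(range(500, 600))

def call_with_retries(url, params, max_retries=5, base_delay=1.0, backoff_factor=2.0):
    """Retry transient vendor failures with standard exponential backoff.

    base_delay (initial retry wait) and backoff_factor (scaling factor) are the
    two knobs mentioned above; the specific values here are illustrative.
    """
    for attempt in range(max_retries + 1):
        resp = requests.get(url, params=params)
        if resp.status_code not in RETRYABLE_STATUSES:
            resp.raise_for_status()   # surface non-retryable errors immediately
            return resp.json()
        if attempt == max_retries:
            resp.raise_for_status()   # out of retries; surface the transient error
        # Wait base_delay * backoff_factor**attempt seconds, plus a little jitter
        # so retries do not all hit the vendor at the same instant.
        time.sleep(base_delay * (backoff_factor ** attempt) + random.uniform(0, 0.5))
```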

Our architecture also handles pagination by saving the current page response and correctly framing parameters for the next API call. We loop until all pages are retrieved. To ensure data consistency, we keep track of the parameters used in the last invocation, allowing the process to figure out where to resume and frame parameters for the next invocation. The parameters are persisted as part of object names in AWS S3. For example, in the figure below, the last two values in the object name are the start and end datetime parameter values respectively.

Snapshot of how parameters are persisted as part of object names
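The snippet below sketches how a process can recover its resume point from such object names; the bucket, prefix, and key layout are illustrative placeholders rather than our production naming convention.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-webinar-raw"   # placeholder bucket name
PREFIX = "webinar/attendance/"   # placeholder prefix

def checkpoint_key(start_dt, end_dt):
    # Encode the processed window in the object name itself; ISO timestamps
    # contain no underscores, so the last two "_"-separated values are always
    # the start and end datetime parameter values.
    return f"{PREFIX}{start_dt}_{end_dt}.json"

def last_processed_window(bucket=BUCKET, prefix=PREFIX):
    """Recover the start/end datetimes of the most recent object under the prefix,
    so the next invocation knows where to resume."""
    # A production version would paginate list_objects_v2 beyond 1,000 keys.
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", [])
    if not objects:
        return None   # nothing processed yet
    latest = max(objects, key=lambda obj: obj["LastModified"])
    stem = latest["Key"].rsplit("/", 1)[-1].removesuffix(".json")
    start_dt, end_dt = stem.split("_")[-2:]
    return start_dt, end_dt
```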

Finally, we persist the data into a storage layer (e.g., S3 bucket). We strike a delicate balance between persisting the data too often (resulting in too many small files) and waiting too long (which might monopolize memory and may result in data loss).
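One way to express that trade-off is a buffered writer that flushes on either a record-count or an age threshold, whichever is hit first, as in the sketch below; the thresholds and names are illustrative, not the values we use in production.

```python
import json
import time

import boto3

s3 = boto3.client("s3")

class BufferedWriter:
    """Flush to S3 when the buffer is large enough or old enough, whichever
    comes first. Thresholds and names are illustrative placeholders."""

    def __init__(self, bucket, prefix, max_records=5000, max_age_seconds=120):
        self.bucket, self.prefix = bucket, prefix
        self.max_records, self.max_age_seconds = max_records, max_age_seconds
        self.buffer, self.last_flush = [], time.monotonic()

    def add(self, records):
        self.buffer.extend(records)
        too_big = len(self.buffer) >= self.max_records
        too_old = time.monotonic() - self.last_flush >= self.max_age_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        key = f"{self.prefix}{int(time.time())}.json"
        s3.put_object(Bucket=self.bucket, Key=key, Body=json.dumps(self.buffer))
        self.buffer, self.last_flush = [], time.monotonic()
```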

High-level architecture for managing vendor throttling and intermittent server errors

The results: Cost savings upward of 90%

One of the most significant benefits of our architecture is the cost savings we’ve realized by leveraging Lambda serverless functions over Glue Python Shell jobs for the same process. Here is how we’ve achieved cost savings upward of 90%.

Glue measures its resource allocation in Data Processing Units (DPUs), with each DPU comprising four virtual CPUs and 16 GB of memory. The cost of each DPU is $0.44 per hour, with a minimum charge of one minute for a Python Shell job. Considering four Glue Python Shell jobs running every 30 minutes, each data ingestion lasting an average of one minute with 0.0625 DPU, the annual cost to ingest would be $31.68. For data preprocessing, such as transforming the files once they are ingested into S3 buckets, assuming an average of two files to be processed per run with an average execution time of ten seconds (each run still billed the one-minute minimum), the annual cost would be $63.36.

Comparatively, with four Lambda functions each assigned 128 MB of memory and 512 MB of disk space, running every 30 minutes, and each ingestion lasting an average of one minute, the total annual cost for ingestion would be approximately $8.64. Likewise, the total cost of preprocessing, with an average execution time of ten seconds, would be about $1.44. The combined cost of ingestion and preprocessing is roughly 90% lower than the cost of leveraging Glue Python Shell jobs. See the cost comparison table below.

Estimated cost comparison between Lambda and Glue
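For readers who want to reproduce the estimates above, the back-of-the-envelope calculation below arrives at the same figures under the assumption of a 360-day year of 30-minute schedules and current public pricing; treat it as illustrative rather than a formal cost model.

```python
# Assumptions: Glue at $0.44 per DPU-hour with a one-minute minimum per Python
# Shell run; Lambda at roughly $0.0000166667 per GB-second of duration.

RUNS_PER_YEAR = 4 * 48 * 360   # 4 jobs/functions, every 30 minutes, 360 days

# Glue Python Shell: 0.0625 DPU, each run billed the one-minute minimum
glue_ingest = RUNS_PER_YEAR * (1 / 60) * 0.0625 * 0.44        # ~ $31.68
glue_preprocess = 2 * glue_ingest                              # two files/run -> ~ $63.36

# Lambda: 128 MB (0.125 GB), billed per second of actual duration
lambda_ingest = RUNS_PER_YEAR * 60 * 0.125 * 0.0000166667      # ~1 min/run  -> ~ $8.64
lambda_preprocess = RUNS_PER_YEAR * 10 * 0.125 * 0.0000166667  # ~10 s/run   -> ~ $1.44

glue_total = glue_ingest + glue_preprocess         # ~ $95.04
lambda_total = lambda_ingest + lambda_preprocess   # ~ $10.08
print(f"Savings: {1 - lambda_total / glue_total:.0%}")   # roughly 90%
```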

Considering the frequency and speed of the data retrieval and processing steps, Lambda becomes the more suitable choice if the process needs to run at a very high frequency or if it completes significantly faster than Glue’s one-minute minimum charge. On the other hand, if your process requires more memory or CPU capacity than Lambda can offer, a Glue Python Shell job becomes an appealing option.

From a developer-experience point of view, Lambda offers a broader range of use cases, allowing developers to write code in their preferred language; use cases range from implementing microservices and APIs to building event-driven processes. Glue, on the other hand, is geared more toward data integration and ETL (Extraction, Transformation, and Loading) processes, and its Python Shell jobs are limited to Python.

Conclusion

In today’s data-driven business world, it is essential to ingest and process large volumes of data quickly and efficiently. By leveraging a serverless-first approach, our data engineering team was able to keep costs down while still providing the scalability, flexibility, and reliability we needed to process real-time data. Furthermore, by using Lambda (or any FaaS offered by other public cloud providers) and exponential backoff techniques, we were able to ensure that our system could handle large volumes of webinar data during peak loads without being overwhelmed by processing times or vendor limitations. The case study discussed above illustrates the benefits of this approach, which can be applied to a wide range of data engineering use cases.

Come work with us!
Vanguard’s technologists design, architect, and build modernized cloud-based applications to deliver world-class experiences to 50 million investors worldwide. Hear more about our tech — and the crew behind it — at vanguardjobs.com.

©2023 The Vanguard Group, Inc. All rights reserved.
