How we rapidly churn out production-ready Big Data applications @Groupon
In the past few years, the technology industry has witnessed the rapid adoption of big data technologies, which have huge potential to profoundly impact business functions when implemented the right way.
At Groupon, the Consumer Data Authority Platform (CDAP) is responsible for processing data for ~500 million users on a daily basis and ensuring that appropriate consumer information is made available to all downstream systems easily and efficiently. To do this, CDAP has to build and execute jobs frequently to crunch data that enables downstream systems to drive consumer engagement and personalization, and thereby increase the conversion rate.
To achieve this mission, it is crucial to have robust tools and frameworks in place to enable the following unique qualities:
- One-Stop-Shop for Data Applications — developers should be able to focus on the business problems and quickly build solutions for changing market conditions instead of spending unnecessary resources on tackling the engineering challenges.
- Nimble Consumer Insights — since time to market is very critical for consumer data applications, with each passing day having an impact on the revenue, it’s important for developers to have an arsenal of tools to quickly build and deploy solutions that can be used by the business to instantly make decisions.
Need for a Big Data Framework
In general, writing production-ready Big Data applications from scratch is a fairly complex task, and a lot of effort goes into making sure that the different technologies coexist. Furthermore, if the application is not coded modularly, then adding more libraries, upgrading the versions of existing tools and libraries, and making subsequent application amendments become a nightmare.
Even when challenges associated with building a Big Data application are overcome, making it production-ready is an added headache involving testing, deployment, logging, alerting, failure handling, monitoring, etc.
Therefore, a framework that lets developers build data applications without worrying about these challenges frees them to focus on the real problem and create high-quality, production-ready applications far faster.
And Voila! The Consumer Authority Big Data Framework
The CDAP Big Data framework does just that: it allows developers to easily create and deploy production-ready big data applications, with a range of inbuilt features across the various phases of application development:
- Command-line arguments & logging
- Email/PagerDuty alerting with support for custom channels
- Execution tracking and configurable retry support on failure
- Pluggable processing engines & heterogeneous sources/sinks
- Out-of-the-box connectivity to popular Big Data tools
- Support for streaming data applications
- Cloud compatible with support for popular AWS services
- Seamless integration with Airflow, with auto-generation of DAG code
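To make the retry and alerting items above concrete, here is a minimal Python sketch of configurable retry with a failure hook. The function `run_with_retry` and its parameters are illustrative assumptions, not the framework's actual API; the `on_failure` callback stands in for an alert channel such as email or PagerDuty.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
    on_failure: Optional[Callable[[int, Exception], None]] = None,
) -> T:
    """Run `task`, retrying on any exception until `max_attempts` is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            # Hypothetical alert hook: called once per failed attempt.
            if on_failure is not None:
                on_failure(attempt, exc)
            if attempt == max_attempts:
                raise  # retries exhausted, surface the last error
            time.sleep(delay_seconds)
    # Unreachable: the loop above either returns or raises.
    raise AssertionError("unreachable")
```

A job runner could wrap each execution in such a helper and take `max_attempts` and `delay_seconds` from the application configuration.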
Plug and Play is in the genes!
The Big Data tech stack has been evolving at a rapid pace ever since it came into existence. Multiple processing engines and data stores have entered this space in the last decade, each offering advantages for specific use cases. This poses a serious challenge to developers, who have to worry about the technicalities of these tools while solving business problems.
The ability to use pluggable modules such as processing engines and data sources is a must for any generic Big Data application framework, and we have placed it at the heart of ours.
The various modules used in the framework like processing engines, data stores, configuration parsers, alert channels, etc. are designed in a way that enables developers to seamlessly plug in any tool of their choice.
The main pluggable components of the CDAP framework are:
Job Interface — the basic abstraction provided by the framework; any application built on it inherits all the in-built features.
Processing Engine Interface — responsible for managing the data transformations in the application. Developers can choose a suitable processing engine depending on the use case at hand.
Connection Interface — allows developers to effortlessly connect to the various data stores in the Big Data tech stack.
Job Interface — Genesis Block!
This is the starting block developers use to build any application with the framework. Below is a sample job:
Application configurations are based on the environment and should be provided in the following format. We generally use three configuration files (dev, staging, prod) with a standard naming convention: application_<env>.conf
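A minimal Python sketch of what such a job might look like. The framework's real API is not shown in this post, so `Job`, `build_parser`, `execute`, `run`, and `WordCountJob` are hypothetical names chosen for illustration; the sketch only shows how a base class can give every application argument parsing and logging for free.

```python
import argparse
import logging
from abc import ABC, abstractmethod
from typing import List, Optional


class Job(ABC):
    """Hypothetical base class: parses arguments, logs, and runs the job."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log = logging.getLogger(name)
        self.args: Optional[argparse.Namespace] = None

    def build_parser(self) -> argparse.ArgumentParser:
        # Arguments every job gets; subclasses may add their own.
        parser = argparse.ArgumentParser(prog=self.name)
        parser.add_argument("--env", choices=["dev", "staging", "prod"], default="dev")
        return parser

    @abstractmethod
    def execute(self) -> object:
        """Business logic supplied by the application."""

    def run(self, argv: Optional[List[str]] = None) -> object:
        # With argv=None, argparse falls back to the process arguments.
        self.args = self.build_parser().parse_args(argv)
        self.log.info("starting %s in %s", self.name, self.args.env)
        try:
            result = self.execute()
        except Exception:
            self.log.exception("%s failed", self.name)
            raise
        self.log.info("%s finished", self.name)
        return result


class WordCountJob(Job):
    """Example application: only the business logic is written by hand."""

    def build_parser(self) -> argparse.ArgumentParser:
        parser = super().build_parser()
        parser.add_argument("--text", default="")
        return parser

    def execute(self) -> int:
        return len(self.args.text.split())
```

Calling `WordCountJob("word-count").run(["--env", "dev", "--text", "big data at scale"])` runs the job with explicit arguments, which is also convenient in tests.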
The default format is JSON, but developers can use other formats like YAML, XML, etc. as the configuration parser is a pluggable module.
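As a rough illustration of the naming convention and the pluggable parser, the following Python sketch resolves `application_<env>.conf` and parses it with a parser looked up by format name. The names `PARSERS`, `config_path`, `load_config`, and `parse_key_value` are assumptions made for this example, as is the key=value format used to show a custom parser.

```python
import json
from pathlib import Path
from typing import Callable, Dict

ConfigParser = Callable[[str], dict]
ENVIRONMENTS = ("dev", "staging", "prod")

# Registry of parsers keyed by format name; JSON is the default.
PARSERS: Dict[str, ConfigParser] = {"json": json.loads}


def parse_key_value(text: str) -> dict:
    """Example custom parser: one `key=value` pair per line."""
    pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
    return {key.strip(): value.strip() for key, value in pairs}


def config_path(config_dir: Path, env: str) -> Path:
    """Build the path of the configuration file for one environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return config_dir / f"application_{env}.conf"


def load_config(config_dir: Path, env: str, fmt: str = "json") -> dict:
    """Read the environment's file and parse it with the chosen parser."""
    text = config_path(config_dir, env).read_text(encoding="utf-8")
    return PARSERS[fmt](text)
```

Registering another format is then a one-line change, for example `PARSERS["kv"] = parse_key_value`.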
Processing Engine Interface — Spark Support
At Groupon, we use Apache Spark™ extensively as the processing engine for the majority of the applications. Since we constantly spend time tuning and optimising Spark, it is important to enable the developers across projects to take advantage of improvements already added.
The CDAP Big Data framework provides a SparkJob abstraction that enables developers to easily develop and deploy Spark applications in production.
The SparkJob comes with the following features:
In-built SparkClient — introduced by the framework to interact with various Spark functionalities and bootstrap Spark applications on the fly.
Enhanced SparkConf — provided by the framework, with generic tuning and optimisation parameters already set. The developers can override the parameters to suit their application needs.
Spark Event Listeners — allow developers to write logic on the creation/destruction of the Spark context.
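The override and listener ideas can be sketched in plain Python without a Spark installation. Everything here is an assumption for illustration: the default values, `build_spark_conf`, and `SessionLifecycle` are hypothetical, and only the property names are real Spark configuration keys. With PySpark, the resulting key/value pairs could be passed to `SparkSession.builder.config(key, value)`.

```python
from typing import Callable, Dict, List, Optional

# Illustrative defaults only; the framework's actual tuning values are not public.
DEFAULT_SPARK_CONF: Dict[str, str] = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.shuffle.partitions": "400",
    "spark.dynamicAllocation.enabled": "true",
}

Listener = Callable[[Dict[str, str]], None]


def build_spark_conf(overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Start from shared defaults and let the application override any key."""
    conf = dict(DEFAULT_SPARK_CONF)  # copy so the defaults stay untouched
    conf.update(overrides or {})
    return conf


class SessionLifecycle:
    """Runs listeners before and after a body that uses the session."""

    def __init__(self) -> None:
        self._on_create: List[Listener] = []
        self._on_destroy: List[Listener] = []

    def on_create(self, listener: Listener) -> None:
        self._on_create.append(listener)

    def on_destroy(self, listener: Listener) -> None:
        self._on_destroy.append(listener)

    def run(self, conf: Dict[str, str], body: Callable[[Dict[str, str]], object]) -> object:
        for listener in self._on_create:
            listener(conf)
        try:
            return body(conf)
        finally:
            # Destroy listeners run even when the body raises.
            for listener in self._on_destroy:
                listener(conf)
```

Keeping the defaults in one place is what lets a tuning improvement reach every application that has not explicitly overridden that key.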
Connection Interface — Multiple Sources & Sinks!
Big Data applications have to deal with a variety of sources, and connecting to and processing data from all of them is challenging for developers. To enable developers to build applications faster, it is important to give them seamless integration with different input and output systems.
To enable this, the framework provides a rich Connection interface that developers can use to integrate with various data sources. The framework provides out-of-the-box connectivity to the following:
- Apache Kafka
- Apache HBase
- Apache Hive
- Apache Cassandra
Apart from the above sources, the Connection interface allows the developers to introduce additional sources/sinks easily.
Below is a sample job connecting to multiple data sources and the associated configuration:
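A minimal Python sketch of a job reading from two sources and writing to one sink, driven by configuration. The `Connection` class, the `CONNECTION_TYPES` registry, and the configuration layout are hypothetical, and an in-memory connection stands in for real stores such as Kafka or Hive so that the example runs anywhere.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class Connection(ABC):
    """Hypothetical common interface for every source and sink."""

    @abstractmethod
    def read(self) -> List[Record]: ...

    @abstractmethod
    def write(self, records: Iterable[Record]) -> None: ...


class InMemoryConnection(Connection):
    """Stand-in store used here instead of a real data store."""

    def __init__(self, settings: dict) -> None:
        self.records: List[Record] = list(settings.get("records", []))

    def read(self) -> List[Record]:
        return list(self.records)

    def write(self, records: Iterable[Record]) -> None:
        self.records.extend(records)


# New sources/sinks are added by registering another factory here.
CONNECTION_TYPES: Dict[str, Callable[[dict], Connection]] = {"memory": InMemoryConnection}

CONFIG = {
    "connections": {
        "users": {"type": "memory", "records": [{"user_id": 1, "city": "Chicago"}]},
        "orders": {"type": "memory", "records": [
            {"order_id": 10, "user_id": 1},
            {"order_id": 11, "user_id": 2},
        ]},
        "enriched_orders": {"type": "memory"},
    }
}


def open_connections(config: dict) -> Dict[str, Connection]:
    """Instantiate each configured connection from its registered type."""
    return {
        name: CONNECTION_TYPES[settings["type"]](settings)
        for name, settings in config["connections"].items()
    }


def enrich_orders(connections: Dict[str, Connection]) -> int:
    """Join orders with users and write the result to the sink."""
    users = {user["user_id"]: user for user in connections["users"].read()}
    enriched = [
        {**order, "city": users[order["user_id"]]["city"]}
        for order in connections["orders"].read()
        if order["user_id"] in users
    ]
    connections["enriched_orders"].write(enriched)
    return len(enriched)
```

Because the job only sees the `Connection` interface, swapping a source means changing the `type` in the configuration rather than the job code.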
Modularising Data Applications
One of the things we have noticed while building data applications is that a modular implementation has a big impact on overall stability, extensibility, and manageability. To help developers in this regard, the framework provides a modular DataJob abstraction that enables them to easily develop and deploy extensible data applications.
Processor is one of the main components of the DataJob. The DataJob can be visualised as a collection of Processors, each performing its own logic on the given data set. This enables the developers to decouple the logic based on data, making it easy to maintain/upgrade the application.
Because each Processor's logic is independent of the others, maintainability and code readability improve drastically. This also greatly reduces the effort needed for unit testing and regression testing.
The CDAP framework also provides an out-of-the-box SparkDataJob abstraction to support the majority of our data applications. In this case, the Processor is a SparkProcessor with the SparkSession injected into it. SparkProcessors deal with Spark-specific data types like DataFrame, RDD, and Row, whereas the generic DataProcessor can process data of any type.
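The Processor idea can be sketched in a few lines of Python. This is an assumption-laden illustration, not the framework's API: `Processor`, `DataJob`, and the two example processors are hypothetical, and the sketch assumes each processor runs independently on the same input rather than being chained.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Users = List[Dict[str, Any]]


class Processor(ABC):
    """One independent unit of logic applied to the job's data set."""

    @abstractmethod
    def process(self, data: Any) -> Any: ...


class ActiveUserCount(Processor):
    def process(self, data: Users) -> int:
        return sum(1 for user in data if user["active"])


class UsersByCity(Processor):
    def process(self, data: Users) -> Dict[str, int]:
        counts: Dict[str, int] = {}
        for user in data:
            counts[user["city"]] = counts.get(user["city"], 0) + 1
        return counts


class DataJob:
    """A named collection of processors, each run on the same input."""

    def __init__(self, processors: Dict[str, Processor]) -> None:
        self.processors = dict(processors)

    def run(self, data: Any) -> Dict[str, Any]:
        return {name: proc.process(data) for name, proc in self.processors.items()}
```

Since no processor depends on another, each one can be unit tested on its own with a handful of sample records, and adding a new output means adding one class and one registry entry.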
Lambda Architecture in Action!
Real-time analytics allows organisations to derive insights from data as soon as it becomes available. Businesses can easily use fresh data to find new opportunities that lead to more profit, improved customer service, and new customer ventures.
One of the common architectures used to solve real-time use cases is the Lambda architecture. To make it easier for developers to embrace this architecture, the CDAP framework provides a LambdaSparkJob Interface which enables developers to build Spark applications that support streaming and batch layers.
This interface provides an implicit stream_mode flag which enables developers to write specific logic for streaming and batch datasets.
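A minimal Python sketch of how such a flag might be used, without Spark. The class `PurchaseCountJob` and its behaviour are hypothetical: the same counting logic is shared, and the assumed `stream_mode` attribute decides whether results replace the totals (batch layer) or are added to them (streaming layer).

```python
from collections import Counter
from typing import Dict, Iterable, Optional


class PurchaseCountJob:
    """Counts purchases per user in either batch or streaming mode."""

    def __init__(self, stream_mode: bool, totals: Optional[Dict[str, int]] = None) -> None:
        self.stream_mode = stream_mode
        self.totals: Counter = Counter(totals or {})

    def process(self, user_ids: Iterable[str]) -> Dict[str, int]:
        counts = Counter(user_ids)  # logic shared by both layers
        if self.stream_mode:
            # Streaming layer: add the micro-batch to the running totals.
            self.totals.update(counts)
        else:
            # Batch layer: recompute from the full data set and replace.
            self.totals = counts
        return dict(self.totals)
```

In this sketch a batch run over `["a", "b", "a"]` yields totals of 2 for `a` and 1 for `b`, and a later streaming run seeded with those totals adds each new micro-batch on top.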
The framework also provides utilities that allow developers to easily process standard streaming data formats. The initial message deserialisation and schema inference are already taken care of for common formats like Avro, JSON, etc., and developers can plug in additional custom formats.
The framework also provides a SparkStreamingJob abstraction which enables developers to build streaming-only applications. This interface allows developers to use either Spark Structured Streaming or DStream-based streaming. Basic offset management for Kafka streaming is provided implicitly by the framework, but developers can plug in custom offset management logic as required.
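The format handling can be pictured as a registry of deserialisers. The Python sketch below is hypothetical throughout (`DESERIALIZERS`, `register_format`, `deserialize`, `infer_schema`, and the pipe-delimited format are made up for illustration); it covers JSON only, since Avro needs a third-party library, and the schema inference is a deliberately naive mapping of field names to Python type names.

```python
import json
from typing import Callable, Dict

Deserializer = Callable[[bytes], dict]


def parse_json(raw: bytes) -> dict:
    return json.loads(raw.decode("utf-8"))


# Built-in formats; custom ones are added through register_format.
DESERIALIZERS: Dict[str, Deserializer] = {"json": parse_json}


def register_format(name: str, deserializer: Deserializer) -> None:
    """Plug in a custom message format under the given name."""
    DESERIALIZERS[name] = deserializer


def deserialize(raw: bytes, fmt: str = "json") -> dict:
    """Decode one raw message using the deserialiser for `fmt`."""
    return DESERIALIZERS[fmt](raw)


def infer_schema(record: dict) -> Dict[str, str]:
    """Naive schema inference: field name mapped to Python type name."""
    return {field: type(value).__name__ for field, value in record.items()}


def parse_pipe_delimited(raw: bytes) -> dict:
    """Example custom format: `user_id|city`."""
    user_id, city = raw.decode("utf-8").split("|")
    return {"user_id": int(user_id), "city": city}
```

After `register_format("pipe", parse_pipe_delimited)`, the same `deserialize` call handles both formats, so the downstream processing code does not change when a new format is introduced.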
Testing And Deployment
Another headache for developers when it comes to Big Data applications is local testing, as it is very difficult to set up an environment with all the data required for applications to run locally. Debugging syntax errors and validating schema-related issues is tedious because developers have to build, deploy, and run the application in the cluster.
If developers can test the applications locally, the overall testing time is significantly reduced because they have to run the application in the cluster less frequently. To address these issues, the CDAP framework provides a dedicated testing package with all the necessary modules to test applications locally. Developers have to run the application in a cluster only for data-related validations.
Some of the main components of the testing package are:
LocalTest Interface — Since the Spark session is injected by the framework into all the applications, it’s easy to control the way the Spark session behaves. When developers extend LocalTest, a local SparkSession is injected into the tests, enabling the application to run locally without any code changes.
LocalHiveMetastoreGenerator — Responsible for setting up the local Hive metastore with all the data required for the applications. The developers just have to add the sample data required by the application in a given folder in the specified format and the framework will take care of setting up the environment and linking the proper database and table references in code to run the application locally without any code change.
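The two components above rely on dependency injection plus sample data loaded from a folder. The Python sketch below imitates that idea without Spark or Hive, and every name in it is an assumption: `Session` is a stand-in for a SparkSession, the folder layout `<database>/<table>.csv` is invented for the example, and `ActiveUsersJob` is a toy application.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Session:
    """Stand-in for a SparkSession: a master URL plus named tables."""

    master: str
    tables: Dict[str, List[dict]] = field(default_factory=dict)


def load_sample_tables(sample_dir: Path) -> Dict[str, List[dict]]:
    """Read <sample_dir>/<database>/<table>.csv into 'database.table' keys."""
    tables: Dict[str, List[dict]] = {}
    for csv_file in sorted(sample_dir.glob("*/*.csv")):
        with csv_file.open(newline="", encoding="utf-8") as handle:
            name = f"{csv_file.parent.name}.{csv_file.stem}"
            tables[name] = list(csv.DictReader(handle))
    return tables


def local_session(sample_dir: Path) -> Session:
    """Session a test would inject: local master, sample data as tables."""
    return Session(master="local[*]", tables=load_sample_tables(sample_dir))


class ActiveUsersJob:
    """Application code: unaware of whether the session is local or not."""

    def __init__(self, session: Session) -> None:
        self.session = session

    def execute(self) -> List[str]:
        users = self.session.tables["consumer.users"]
        # csv.DictReader yields strings, so compare against "true".
        return [user["user_id"] for user in users if user["active"] == "true"]
```

Because the job receives its session from outside, a test can pass `local_session(sample_dir)` while production passes a cluster-backed session, and the job code stays identical.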
Numerous use cases have been built using the framework; below are a few of the interesting ones:
- Consumer Data Lake with 500+ derived attributes.
- Real-time consumer attributes pipeline.
- Groupon deal analytics pipeline.
- Scale data science models deriving consumer affinity and propensity.
- Derive customer location attributes for millions of users.
As a result, the CDAP framework is currently used by several teams across Groupon to build Big Data applications faster than ever. As we embark on the journey to address more complex challenges, the framework is constantly enriched with more capabilities to harness the sheer power of Big Data.
Hope this blog gave you a glimpse of how Big Data applications are swiftly built & deployed at Groupon. Stay tuned for more updates!!!