Antelopes for pulling Data Carts

Vladimir Elvov
Nov 5 · 6 min read

Real-time customer journeys using Kudu

Offering personalized customer journeys in order to provide an optimized customer experience is a focus of most B2C companies. Typically, the contents of customer journeys are stored in a campaigns system database. When a journey is displayed to the customer, a webservice retrieves the data for that journey from the campaigns system's database. The architecture in this case looks as follows:

Yet the problem with such a solution is that it fails under load: webservice requests overload the relational database, which the campaigns system also uses for its normal operations. A possible solution is to split the database in two, replicating the data from the campaigns system via CDC to a second database that the webservice uses to serve its journey-retrieval calls. Here's what such an architecture looks like:

CDC tools such as Oracle GoldenGate are, however, expensive, and replicating data adds complexity to the application landscape. To solve these problems, the relational database can be replaced with a storage technology that offers the core features of an analytical database, scales like a NoSQL database, and still enforces basic principles of a relational database. Kudu covers all of these requirements. The solution we implemented lets the campaigns system and the journey-retrieving webservice share the same database, avoiding duplication while remaining scalable under load. The architecture now looks as follows:

The campaigns system loads customer journeys into Kudu. Whenever a customer is on the web and a journey needs to be displayed, an Akka microservice is invoked. The Akka microservice publishes a request to Kafka, from where a Spark Streaming job retrieves it, looks up the respective journey data and sends a response back to Kafka. From there the response is picked up by Akka and eventually displayed to the customer on the web. The additional components, Kafka and Spark Streaming, address the security issues that arise when using Kudu: since Kudu does not offer fine-grained access control, letting Akka write directly to Kudu would create security problems. With Kafka acting as a DMZ and Spark Streaming accessing Kudu in a secure way, this challenge is solved.
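As a rough illustration, the request/response round trip can be sketched with in-memory queues standing in for the Kafka topics and a dict standing in for the Kudu table. All names and fields here are hypothetical, not taken from our actual implementation:

```python
import queue

# In-memory stand-ins for the Kafka request/response topics and the Kudu table.
request_topic = queue.Queue()
response_topic = queue.Queue()
kudu_journeys = {"customer-42": {"journey": "winback-offer", "step": 2}}

def akka_publish_request(customer_id):
    """The Akka service publishes a keyed request instead of reading Kudu directly."""
    request_topic.put({"correlation_id": f"req-{customer_id}",
                       "customer_id": customer_id})

def spark_streaming_worker():
    """The streaming job consumes one request, looks up Kudu, answers via Kafka."""
    req = request_topic.get()
    journey = kudu_journeys.get(req["customer_id"])
    response_topic.put({"correlation_id": req["correlation_id"],
                        "journey": journey})

akka_publish_request("customer-42")
spark_streaming_worker()
print(response_topic.get()["journey"])  # {'journey': 'winback-offer', 'step': 2}
```

The correlation id is what lets the Akka side match a response on the shared topic back to the waiting web request; only the streaming job ever touches the Kudu side.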

Generally speaking, this use case is a good example of how Big Data technologies can enable light-weight data provisioning for customer journeys and solve the associated architectural problems.

Machine Learning Data Provisioning with Impala

Big Data delivers its greatest benefits in analytical use cases, particularly those involving Machine Learning. Machine Learning can, for example, be used to calculate dynamic prices for offers in support of churn-management initiatives.

Machine Learning models require a lot of data for training. In most cases these troves of data need to be prepared to ensure correct data quality and consistency as well as the actual content of the training dataset. This process is called data wrangling, and data scientists spend most of their time on it. It is therefore essential to provide them with a querying engine that executes queries fast. Conventional analytics solutions often fail here: sometimes preparing a training data set takes more than two days because queries execute so slowly.

A possible solution for querying the data is Impala, since it excels at query speed. Impala is an MPP query engine; in most cases the data is persisted on HDFS, although other storage layers such as Kudu are also possible. In addition to fast queries, data scientists may also wish for dynamic updates from the source system, so that fresh data is available for both training and scoring. However, HDFS does not support updates, only appends. This constraint means Kudu has to be used to offer updates on the data. At the same time, Kudu is not actually designed as a data warehouse. Things get critical if Kudu also serves other important transactional use cases that need the dynamic-update feature. Adding a heavy analytical load (with queries requiring terabytes of memory) to a storage system that is a cornerstone of other important use cases is therefore not the best idea. So how can the analytical workload be split from the transactional one while keeping all use-case owners happy?

To address this question, the data arriving in Kudu can be offloaded to Impala Parquet tables on a nightly basis, with the Parquet files persisted on HDFS. Parquet's columnar format makes it well suited to fast analytical queries. Each night the entire Kudu table, including all updates that came in during the day, is loaded into an Impala Parquet table while the old Parquet table is dropped. Since the entire table is moved from Kudu, no where-clauses are applied; the offload therefore executes quite fast even for tables containing hundreds of millions of records. The analytical queries of the data scientists now hit Impala and HDFS instead of Kudu. Once all data-wrangling ETLs have been applied to the data, it is ready for use in the data scientist's workbench. The architecture looks as follows:
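As a sketch, the nightly offload boils down to two Impala statements of roughly this shape. The table names and the execute() callback are assumptions for illustration (a real job might run them via a connector such as impyla), not our actual implementation:

```python
# Minimal sketch of the nightly full-table offload from Kudu to Parquet,
# assuming hypothetical table names and an execute() callback that would
# run each statement against Impala in a real job.
def nightly_offload(kudu_table, parquet_table, execute):
    statements = [
        # Drop the stale Parquet copy from the previous night ...
        f"DROP TABLE IF EXISTS {parquet_table}",
        # ... then rebuild it from the full Kudu table. Deliberately no WHERE
        # clause: the whole table moves, so all overnight updates are captured.
        f"CREATE TABLE {parquet_table} STORED AS PARQUET AS "
        f"SELECT * FROM {kudu_table}",
    ]
    for stmt in statements:
        execute(stmt)
    return statements

executed = []
nightly_offload("kudu_db.journeys", "parquet_db.journeys", executed.append)
```

Rebuilding the table wholesale is what keeps the job simple and fast: there is no change tracking to maintain, only a full scan and rewrite.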

What is worth mentioning about the data flow is that the data from the actual source (BI tables) can be split into two categories. One type contains only closed records, i.e. records that will not be updated anymore. The other type contains open records, i.e. records that will become closed with a future update; they are also the ones requiring update functionality. To further reduce the load on Kudu, all closed records were moved to Impala tables. These tables are then combined with the respective Impala tables containing the open records offloaded from Kudu, and thus become available for querying. In addition, on weekends, when there is no analytical ad-hoc load from data scientists, a housekeeping job runs on the Kudu tables. It offloads all records that became closed during the week to the Impala Parquet tables containing the closed records.
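The split and the weekend housekeeping can be sketched like this, with plain Python lists standing in for the Kudu and Parquet tables and an assumed status field marking each record as open or closed:

```python
# Toy sketch of the open/closed record split, assuming each record is a dict
# with a hypothetical "status" field. Closed records belong in Parquet-backed
# Impala tables; only open records stay in Kudu, where they can be updated.
def split_records(records):
    closed = [r for r in records if r["status"] == "closed"]
    still_open = [r for r in records if r["status"] == "open"]
    return closed, still_open

def weekend_housekeeping(kudu_table, parquet_closed_table):
    """Move records that became closed during the week out of Kudu."""
    newly_closed, still_open = split_records(kudu_table)
    parquet_closed_table.extend(newly_closed)  # append to the closed-record table
    kudu_table[:] = still_open                 # Kudu keeps only updatable records

kudu = [{"id": 1, "status": "closed"}, {"id": 2, "status": "open"}]
parquet_closed = []
weekend_housekeeping(kudu, parquet_closed)
# kudu now holds only the open record; parquet_closed holds the closed one
```

Running the job on weekends exploits the idle window: the Kudu tables shrink back to just the updatable records without competing with ad-hoc analytical queries.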

It is also important to point out that the data model does not change when moving from the legacy system to the data platform with Impala. Because of this, the data scientists do not need to adapt to any changes; instead, they can start using the data platform right away.

With this new architecture, the time for preparing training data sets dropped from two days to just 15 minutes. This strongly reduces time to market for Machine Learning projects and is a great example of how a data platform can create additional business value.


About the Authors: Mark and Vladimir Elvov work in different functions in Data Architecture. New technologies are our core passion — using architecture to drive their adoption is our daily business. In this blog we write about everything that is related to Big Data and Cloud technologies: From High-Level strategies via use cases down into what we’ve seen in the front-line trenches of production and daily operations work.
