Akka Data Stream Processing
Most people associate big data processing with Hadoop (or Spark) and the MapReduce paradigm (or its extensions). In this article I will describe the drawbacks of MapReduce, explain why we decided to move away from it, and show how we adapted Akka and Akka Cluster as a replacement.
Data Management Platform
The task we need big data tooling for is user segmentation. The class of systems used for user segmentation is usually called a Data Management Platform, or DMP for short. A DMP receives data about users’ actions as input (first of all, visits to particular web pages) and produces a “user profile” as output: the user’s sex, age, interests, intentions and so on. This profile is then used for ad targeting, personal recommendations and content personalization in general. If you want to learn more about DMPs, see: http://digitalmarketing-glossary.com/What-is-DMP-definition.
Since a DMP works with data from a large number of users, the volume of data to process can be formidable. For example, our DMP, Facetz.DCA, handles data from 600 million browsers and processes almost half a petabyte of data daily.
DMP Facetz.DCA Architecture
Processing such volumes of data requires a well-scalable architecture. Initially we built our system on the Hadoop stack. A thorough description of the architecture deserves a dedicated article, so here I will only describe it briefly:
1. User action logs are stored in HDFS, a distributed file system and one of the core components of the Hadoop ecosystem.
2. From HDFS the data goes into raw data storage implemented on Apache HBase, a distributed scalable database based on the ideas of Bigtable. HBase is a key-value database that is very convenient for bulk processing. All user data is stored in one big table, “facts”. One user corresponds to one HBase row, which makes it quick and easy to fetch all the necessary information about them.
3. Once a day the Analytic Engine runs: a big MapReduce job that performs the user segmentation. The Analytic Engine is essentially a container for segmentation rules prepared separately by analysts. For example, one script may determine the user’s sex while another detects their interests, and so on.
4. The computed user segments are stored in Aerospike, a key-value database tailored for fast reads: 99% of read requests complete in under 1 millisecond, even at high loads of tens of thousands of requests per second.
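To make step 3 more concrete, a segmentation script can be thought of as a pure function from a user’s action history to a set of segment labels. The sketch below is illustrative only; the names and the example rule are assumptions, not actual Facetz.DCA code.

```scala
// Illustrative sketch of a segmentation rule as a pure function.
// All names here (UserAction, SegmentScript, FootballFanScript) are hypothetical.
final case class UserAction(url: String, timestampMs: Long)

trait SegmentScript {
  def segments(actions: Seq[UserAction]): Set[String]
}

// Example rule: three or more visits to football pages mark a "football fan".
object FootballFanScript extends SegmentScript {
  def segments(actions: Seq[UserAction]): Set[String] =
    if (actions.count(_.url.contains("football")) >= 3) Set("football-fan")
    else Set.empty
}
```

Because each script is a pure function over the action list, the same rules can in principle run both inside a daily batch job and in a real-time engine.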
This architecture performed well: it allowed us to scale quickly to processing the profiles of all Runet users and to mark them with hundreds of scripts (each of which can assign a user to several segments). However, it had its drawbacks. The major issue was the lack of interactivity: MapReduce is an offline data processing paradigm. If a user looks up football tickets today, they may only appear in the “football fan” segment tomorrow. In some cases such a delay is critical. A typical example is retargeting: advertising goods a user has already looked at. The graph shows how the probability of a user buying an item after viewing it changes over time:
The probability of a purchase is highest in the first few hours. With the batch approach, the system would learn about the user’s intent to buy only a day later, when the conversion probability has almost leveled off.
Clearly, minimizing this delay requires a real-time stream processing mechanism. At the same time, we wanted to keep the processing versatile: the ability to create user segmentation rules of arbitrary complexity.
After giving the matter some thought, we decided that the reactive programming paradigm and the actor model were best suited for this task. An actor is a concurrency primitive that can:
• Receive messages
• Send messages
• Create new actors
• Decide how to respond to the next message it receives
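In Akka (classic API) these four capabilities map directly onto the `Actor` trait. The sketch below is a minimal, hypothetical example of such an actor, not code from our system:

```scala
import akka.actor.{Actor, ActorSystem, Props}

case class Visit(url: String) // hypothetical message types
case object SpawnChild

class UserActor extends Actor {
  private var visits = List.empty[String]

  def receive: Receive = {
    case Visit(url) =>                       // receive a message
      visits = url :: visits
      sender() ! visits.size                 // send a message back
    case SpawnChild =>
      context.actorOf(Props(new UserActor))  // create a new actor
  }
}
```

The actor’s internal state (here, the accumulated `visits` list) determines how it responds to the next message, which is the fourth capability on the list.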
The actor model was formulated by Carl Hewitt back in 1973 and popularized by the Erlang community; today it is implemented for many programming languages.
For Scala, the language chosen for our DMP, there is an excellent toolkit called Akka. It serves as the foundation of several popular frameworks and is well documented. Coursera also offers a decent online course, Reactive Programming, where these principles are explained using Akka as an example. The Akka Cluster module deserves a separate mention: it allows actor-based solutions to scale across several servers.
Real-Time DMP Architecture
The final architecture looks like this:
1. The Dispatcher reads messages about user actions from RabbitMQ. Several Dispatchers may work independently.
2. An actor is created for each online user. The Dispatcher sends a message about the new event (read from RabbitMQ) to the corresponding actor (or creates a new actor if this is the user’s first action and no actor exists for them yet).
3. The actor appends the action to the user’s action list and launches the segmentation scripts (the same ones the Analytic Engine runs during MapReduce processing).
4. The computed segments are stored in Aerospike. Segment and action data is also accessible via an API connected directly to the actors.
5. If no data about a user arrives within an hour, the session is considered over and the actor is killed.
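Steps 3 and 5 can be sketched with Akka’s `ReceiveTimeout` mechanism. This is a simplified, hypothetical version: all names are illustrative, and the actual segmentation call is stubbed out as a comment.

```scala
import akka.actor.{Actor, PoisonPill, ReceiveTimeout}
import scala.concurrent.duration._

case class UserEvent(userId: String, url: String) // hypothetical message types
case object GetActionCount                        // hypothetical query message

class UserSessionActor extends Actor {
  // Step 5: if no events arrive within an hour, consider the session over.
  context.setReceiveTimeout(1.hour)

  private var actions = Vector.empty[UserEvent]

  def receive: Receive = {
    case e: UserEvent =>
      actions :+= e // step 3: append to the user's action list...
      // runSegmentationScripts(actions)  // ...then run the segmentation scripts
    case GetActionCount =>
      sender() ! actions.size
    case ReceiveTimeout =>
      self ! PoisonPill // step 5: kill the idle actor
  }
}
```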
Cluster sharding of the actors, their life cycle and their termination are managed by Akka, which significantly simplified development.
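With Cluster Sharding, Akka routes each event to the right per-user actor across the cluster based on an entity id extracted from the message. A rough sketch using the classic sharding API follows; the message type, actor, and shard count are assumptions for illustration, not our production code.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardRegion}

case class UserEvent(userId: String, url: String) // hypothetical message type

class UserSessionActor extends Actor {
  def receive: Receive = {
    case e: UserEvent => // ...append the action, run segmentation scripts
  }
}

object UserShardRegion {
  // The user id doubles as the entity id, so all events of one user
  // land on the same actor, wherever in the cluster it lives.
  val extractEntityId: ShardRegion.ExtractEntityId = {
    case e: UserEvent => (e.userId, e)
  }
  // A fixed number of shards (100 here, an arbitrary choice) spread over the nodes.
  val extractShardId: ShardRegion.ExtractShardId = {
    case e: UserEvent => (math.abs(e.userId.hashCode) % 100).toString
  }

  def start(system: ActorSystem) =
    ClusterSharding(system).start(
      typeName = "user-session",
      entityProps = Props(new UserSessionActor),
      settings = ClusterShardingSettings(system),
      extractEntityId = extractEntityId,
      extractShardId = extractShardId)
}
```

Sending a `UserEvent` to the returned shard region delivers it to the correct user actor, creating the actor on some node if it does not exist yet.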
Some figures from production:
• A six-node Akka cluster;
• A data stream of 3,000 actions per second;
• 4–6 million users online (depending on the day of the week);
• Average completion time of a single user segmentation script is under 5 milliseconds;
• Average delay between an event and segmentation based on that event is one second.
Our real-time engine has shown good results and we plan to develop it further. Here are the steps we intend to take:
• Persistence — currently the real-time engine segments users based only on the last session. We plan to add loading of historical data from HBase when a new user appears.
• Currently only part of our data has been switched to real-time processing. We plan to gradually move all our data sources to stream processing, after which the processed stream will grow to 30,000 events per second.
• Once the switch to real-time is complete, we will be able to drop the daily MapReduce computation, which will save on servers, because only the users who were actually active on the internet that day will be processed.
Links to Similar Solutions
Thanks for your time, I’m ready to answer your questions.