Real-time Stream Analytics and User Scoring Using Apache Druid, Flink & Cassandra at Deep.BI

Hisham Itani
Mar 24, 2020 · 3 min read

Last October, we were honored to join the likes of Netflix, Alibaba, Salesforce, Airbus, Uber, Goldman Sachs, Yelp, Pinterest and many others presenting at Flink Forward in Berlin, covering our area of expertise: working with Apache Flink in combination with Apache Druid and Cassandra for real-time user and event scoring applications.

One of the hardest challenges we have tackled at Deep.BI is delivering customizable insights based on billions of data points in real time, at every scale from a single user's perspective up to millions of users.

At Deep.BI we track user habits, engagement, and product and content performance — processing up to terabytes of data, or billions of events, daily. Our goal is to provide real-time insights based on custom metrics built from a variety of self-defined dimensions. The platform supports tasks from various domains, such as adjusting websites using real-time analytics, running AI-optimized marketing campaigns, providing a dynamic paywall based on user engagement and AI scoring, or detecting fraud based on data anomalies and adaptive patterns.

To accomplish this, our system collects every user interaction. We use Apache Flink for event enrichment, custom transformations, aggregations and serving machine learning models. The processed data is then indexed by Apache Druid for real-time analytics and by Apache Cassandra for delivery of the scores. Historical data is also stored on Apache Hadoop for machine learning model building. Using the low-level DataStream API, custom Process Functions, and Broadcast State, we have built an abstract feature engineering framework that provides reusable templates for data transformations. This allowed us to easily define domain-specific features for analytics and machine learning, and to migrate our batch data preprocessing pipeline from Python jobs deployed on Apache Spark to Flink, resulting in a significant performance boost.
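The reusable-template idea can be illustrated with a minimal, framework-agnostic sketch. This is plain Python, not Deep.BI's actual framework or Flink's API; the registry, the `engagement` feature, and its weights are all hypothetical, standing in for transformations that would run inside a Flink Process Function.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class FeatureRegistry:
    """A registry of reusable feature templates applied to every event."""
    features: Dict[str, Callable[[dict], Any]] = field(default_factory=dict)

    def register(self, name: str):
        """Decorator that registers a feature template under a name."""
        def wrap(fn: Callable[[dict], Any]) -> Callable[[dict], Any]:
            self.features[name] = fn
            return fn
        return wrap

    def enrich(self, event: dict) -> dict:
        # Attach every registered feature's value to the event.
        return {**event, **{name: fn(event) for name, fn in self.features.items()}}


registry = FeatureRegistry()


@registry.register("engagement")
def engagement(event: dict) -> float:
    # Hypothetical domain feature: weighted scroll depth and dwell time.
    return 0.6 * event.get("scroll_depth", 0.0) + 0.4 * event.get("dwell_s", 0.0) / 60


enriched = registry.enrich({"user": "u1", "scroll_depth": 0.8, "dwell_s": 90})
# enriched now carries the original fields plus the "engagement" score
```

Defining features as registered templates rather than inline code is what makes them reusable across analytics and model-training pipelines: the same definition can be applied in streaming and in batch.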

This talk covers our challenges with building and maintaining our platform and lessons learned along the way, namely how to:

  • Evolve a continuous application processing an unbounded data stream,
  • Provide an API for defining, updating and reusing features for machine learning,
  • Handle late events and state TTL,
  • Serve machine learning models with the lowest latency possible,
  • Dynamically update the business logic at runtime without redeployment, and
  • Automate the data pipeline deployment.
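Two of those points — handling late events and state TTL — come down to watermark and expiry bookkeeping. The toy sketch below mimics those semantics in plain Python; it is not Flink's API (Flink provides this via watermarks, allowed lateness, and `StateTtlConfig`), and the class and parameter names are illustrative only.

```python
from typing import Dict, Tuple


class KeyedTtlState:
    """Per-key score state that expires entries older than ttl_s and
    discards events arriving further behind the watermark than
    allowed_lateness_s (i.e., late events)."""

    def __init__(self, ttl_s: float, allowed_lateness_s: float):
        self.ttl_s = ttl_s
        self.allowed_lateness_s = allowed_lateness_s
        # key -> (accumulated score, event time of last update)
        self.state: Dict[str, Tuple[float, float]] = {}
        self.watermark = float("-inf")

    def on_event(self, key: str, event_ts: float, value: float) -> bool:
        """Fold one event into the key's score. Returns False if dropped as late."""
        self.watermark = max(self.watermark, event_ts)
        if event_ts < self.watermark - self.allowed_lateness_s:
            return False  # late event: drop (or route to a side output)
        score, last_ts = self.state.get(key, (0.0, event_ts))
        if event_ts - last_ts > self.ttl_s:
            score = 0.0  # state expired under TTL: start fresh
        self.state[key] = (score + value, event_ts)
        return True


store = KeyedTtlState(ttl_s=3600.0, allowed_lateness_s=60.0)
accepted = store.on_event("u1", 100.0, 1.0)  # advances watermark to 100
late = store.on_event("u1", 30.0, 1.0)       # 30 < 100 - 60: dropped as late
```

In production the same decisions are delegated to the stream processor, which handles them per key at scale; the value of spelling them out is seeing that both are just comparisons against timestamps the framework already tracks.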

Challenges with Apache Druid, Flink or Cassandra? We can help.

You can find the full presentation here:

Speaker Information

Michał Ciesielczyk is the Head of AI Engineering at Deep.BI. He is responsible for researching, building and integrating machine learning tools with a variety of technologies including Scala, Python, Flink, Kafka, Spark, and Cassandra. Previously, he worked as an assistant professor at Poznan University of Technology, where he received a Ph.D. in computer science and was a member of a research team working on numerous scientific and R&D projects. He has published more than 15 refereed journal and conference papers in the areas of recommender systems and machine learning.

Sebastian Zontek

Sebastian Zontek is the CEO, CTO and co-founder of Deep.BI, a predictive customer data platform with real-time user scoring. He is an experienced IT systems architect with particular emphasis on the production use of open source big data systems such as Flink, Cassandra, Hadoop, Spark, Kafka, and Druid in BDaaS (Big Data as a Service), SaaS (Software as a Service), and PaaS (Platform as a Service) solutions. Previously, he was the CEO and main platform architect at Advertine. The Advertine network matched product ads with user preferences, predicting their purchasing intent using ML and NLP techniques.

Original post published by Hisham Itani on Deep.BI. Get in touch at ai@deep.bi!

Deep.BI

The next generation BI & AI platform

Hisham Itani

Written by

Heading marketing @ datamechanics.co — The simplest way to run Apache Spark through a serverless platform for data scientists and engineers built on Kubernetes.

