Complete Big Data Solution for Click Stream Events

Setting up a Centralised Data Lake at Deutsche Telekom on AWS

Parteek Singhal
Deutsche Telekom Digital Labs
5 min read · Jul 13, 2020



Deutsche Telekom’s self-care app, OneApp, has been a huge success, and with that success came a tremendous amount of data. A couple of months back we decided to delve into the big data world to get more out of the data we were generating and to provide our users with a better Telekom experience.

When you first start exploring the data world, it is easy to feel overwhelmed: there are many approaches you can take and many tools you can use.

In this post, I will help you understand how to pick the appropriate tools and how to build a fully working data pipeline in the cloud using the AWS stack, based on the pipeline we built for OneApp.

Overview

OneApp generates user click events that are streamed via RabbitMQ over MQTT and stored in ElasticSearch, where they are available through Kibana.

I know this is a lot of information for one line, but don’t worry: you can read the article below if you want to go into the details.

Now, there must be a question coming to your mind: if there is already a system in place for handling the data, why do we need to explore another one?

What is the Need for another System?

The previous system gives us real-time analytics and powers our real-time campaigns. It is ideal for real-time use cases, but with the growing data size and our hunger for insights, data-driven decision making is not that easy.

Along with fast real-time analytics, we needed a system that would let us go deeper into our data and provide advanced analytics and reporting capabilities.

Following are some of the use cases that we had in mind before deciding to go with a Big Data Solution:

  • The capability to store data over long periods in a cost- and performance-effective manner.
  • Aggregating different types of data from different data sources.
  • Executing complex analytical queries that need multiple joins over huge amounts of data.
  • Distributed processing engines that scale with the growing data.
  • Enabling machine learning use cases.
  • A general-purpose BI application that can integrate with any data source.

Initial Requirements to be fulfilled by the System

  • Capturing raw data in a central data lake
  • Writing generic ETL jobs to load streamed data from raw storage, then clean, transform, and store it as processed data.
  • Monitoring and scheduling of these ETL jobs.
  • A distributed SQL engine to enable business owners and data analysts to query the data stored in the data lake.
  • Setting up a dashboarding platform to query and visualise the same data.

In the longer run, we are targeting a data lake that covers raw, cleaned, processed, and aggregated data storage needs, and that provides the capability to run heavy distributed computation and machine learning jobs.

Architecture

Ingesting Raw Data

Logstash is used here to write raw JSON files to S3. You can configure the size at which data files are rotated and written to S3; 128 MB is a good choice, as it matches the default block size in Hadoop.

Logstash is a lightweight, open-source tool that allows you to collect data from a variety of sources, transform it on the fly, and send it to your desired destination. We were already using Logstash to consume from RabbitMQ and load data into ElasticSearch.
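To give a rough idea, a Logstash pipeline for this ingestion step could look like the sketch below. This is only an illustration: the broker host, queue, bucket, and thresholds are placeholders, not our actual configuration.

    input {
      rabbitmq {
        host    => "rabbitmq.internal"     # placeholder broker host
        queue   => "clickstream-events"    # placeholder queue name
        durable => true
      }
    }

    output {
      s3 {
        region    => "eu-central-1"
        bucket    => "oneapp-raw-events"   # placeholder bucket name
        prefix    => "raw/"
        codec     => "json_lines"
        size_file => 134217728             # rotate at ~128 MB, matching the HDFS block size
        time_file => 15                    # or every 15 minutes, whichever comes first
      }
    }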

Transforming Raw Data

Processing data at this scale requires distributed computation, and the clear winner there is Apache Spark.

We use Spark batch jobs to read the raw JSON data from S3, then clean and process it. After processing, the data is transformed and written back to S3 in the Parquet format. Due to the scale of our data and the low-latency requirements of our analytics, we store data as columns rather than rows.

These Spark jobs run on AWS EMR, a managed service from AWS for running big data applications without worrying about setting up processing frameworks such as Hadoop, Hive, and Spark.
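To make this concrete, here is a minimal PySpark sketch of such a batch job. The paths, column names, and cleaning logic are illustrative placeholders rather than our actual transformation code.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("clickstream-raw-to-parquet").getOrCreate()

    # Read the raw JSON files that Logstash wrote to S3 (placeholder path)
    raw = spark.read.json("s3://oneapp-raw-events/raw/2020/07/12/")

    # Illustrative cleaning: drop rows missing key fields, derive a date column
    # (assumes an ISO-formatted "timestamp" field, purely as an example)
    cleaned = (
        raw.dropna(subset=["userId", "eventName"])
           .withColumn("event_date", F.to_date(F.col("timestamp")))
    )

    # Write columnar, partitioned output for downstream querying (placeholder path)
    (cleaned.write
            .mode("overwrite")
            .partitionBy("event_date")
            .parquet("s3://oneapp-processed-events/clickstream/"))

On EMR, a job like this would typically be packaged and submitted with spark-submit as a cluster step, which is also how we trigger it from Airflow (more on that below).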

Why Parquet?

  • Organizing by column allows for better compression, as the data is more homogeneous. The space savings are very noticeable at the scale of a Hadoop cluster.
  • I/O is reduced as we can efficiently scan only a subset of the columns while reading the data. Better compression also reduces the bandwidth required to read the input.
  • As each column stores data of a single type, we can use encodings better suited to modern processors by making instruction branching more predictable.
  • Predicate pushdown in Parquet enables Athena queries to fetch only the blocks they need, improving query performance. When an Athena query filters on specific column values, it uses statistics from data block predicates, such as max/min values, to determine whether to read or skip a block (illustrated in the sketch after this list).
  • Parquet files are splittable, which allows Athena to spread the reading of data across multiple readers and increase parallelism during query processing.
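As a small, hedged illustration of the column-pruning and pushdown points above (the same mechanics apply whether the Parquet files are read back by Spark or queried through Athena), the sketch below only touches two columns, and its filter lets the reader skip row groups whose min/max statistics rule them out. Column and path names are placeholders.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("parquet-pruning-sketch").getOrCreate()

    events = spark.read.parquet("s3://oneapp-processed-events/clickstream/")

    # Only event_date and eventName are read from the Parquet files (column pruning),
    # and the equality filter is pushed down so non-matching row groups are skipped.
    daily_logins = (
        events.select("event_date", "eventName")
              .filter(F.col("eventName") == "login")
              .groupBy("event_date")
              .count()
    )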

Workflow Management using Airflow

  • Airflow is used to schedule and monitor the Spark jobs in our data pipeline (a minimal DAG sketch follows this list).
  • Airflow gives us a better way to build data pipelines by serving as a sort of ‘framework’ for creating them.
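Here is a minimal sketch of what such a DAG could look like, assuming Airflow 2.x and that spark-submit is available where the task runs; the DAG id, schedule, and paths are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="clickstream_raw_to_parquet",   # placeholder DAG id
        start_date=datetime(2020, 7, 1),
        schedule_interval="0 2 * * *",         # run daily at 02:00
        catchup=False,
    ) as dag:

        transform = BashOperator(
            task_id="spark_transform",
            bash_command=(
                "spark-submit --deploy-mode cluster "
                "s3://oneapp-artifacts/jobs/raw_to_parquet.py "  # placeholder job script
                "--date {{ ds }}"                                # Airflow's execution date
            ),
        )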

Schema Registry

  • Using a central schema registry is very important for data governance in any data-related system.
  • We use the HortonWorks Schema Registry, an open-source application, which we have extended with new features to meet our needs.
  • It is used by the Spark jobs to enforce the schema of the data during transformation (illustrated in the sketch below).
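The registry client itself is out of scope here, but to illustrate the enforcement step: a Spark job can apply an explicit schema (in our case built from the definition stored in the registry) when reading the raw JSON, so that records that don’t conform are handled explicitly instead of silently polluting the processed data. The fields below are purely illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("schema-enforcement-sketch").getOrCreate()

    # In practice this StructType would be derived from the schema registry;
    # the fields here are placeholders.
    clickstream_schema = StructType([
        StructField("userId", StringType()),
        StructField("eventName", StringType()),
        StructField("timestamp", StringType()),
    ])

    # DROPMALFORMED discards records that cannot be parsed against the schema.
    raw = (spark.read
                .schema(clickstream_schema)
                .option("mode", "DROPMALFORMED")
                .json("s3://oneapp-raw-events/raw/"))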

Athena — SQL Query Engine

  • Athena is based on Presto, a distributed SQL query engine.
  • It has a serverless architecture, so you only pay for the queries you run.
  • It enables us to run SQL queries over the data stored in S3 for various analytics needs (see the sketch below).
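Queries can be run interactively from the Athena console, and also programmatically. As a hedged sketch (the database, table, and bucket names below are placeholders), a query over the processed Parquet data can be submitted with the standard boto3 Athena client:

    import boto3

    athena = boto3.client("athena", region_name="eu-central-1")

    # Submit a query over the processed Parquet data (names are placeholders).
    response = athena.start_query_execution(
        QueryString="""
            SELECT event_date, count(*) AS events
            FROM processed_events
            WHERE event_date >= '2020-07-01'
            GROUP BY event_date
            ORDER BY event_date
        """,
        QueryExecutionContext={"Database": "clickstream"},
        ResultConfiguration={"OutputLocation": "s3://oneapp-athena-results/"},
    )

    query_id = response["QueryExecutionId"]
    # Results can then be polled with get_query_execution / get_query_results.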

Redash — BI Tool

  • Redash is a general-purpose BI tool that can connect to almost any data source.
  • It helps you build dashboards and reports very easily.
  • You can connect it to Athena for querying data in S3.
Example Visualisation on Redash

Data Warehouse and ML Capability

  • AWS provides Redshift for managing a data warehouse. You can write a script to periodically load data from S3 into Redshift and schedule it with Airflow (a minimal sketch follows this list).
  • AWS SageMaker enables data scientists to write, train, test, and deploy machine learning models.
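A minimal sketch of such a load step (the connection details, table, bucket, and IAM role below are all placeholders) could issue Redshift’s COPY command over a standard Postgres connection and be wrapped in an Airflow task:

    import psycopg2

    # Placeholder connection details.
    conn = psycopg2.connect(
        host="redshift-cluster.example.eu-central-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="etl_user",
        password="...",
    )

    # COPY loads the processed Parquet files from S3 into a Redshift table.
    copy_sql = """
        COPY analytics.clickstream_events
        FROM 's3://oneapp-processed-events/clickstream/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS PARQUET;
    """

    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)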

In upcoming blogs, I will discuss every component in greater detail.

