Building University of Indonesia’s Realtime Analytics Pipeline

Favian Hazman
5 min read · Jun 15, 2020


How we designed and implemented University of Indonesia’s big data streaming architecture as part of my bachelor thesis.

The Faculty of Computer Science, University of Indonesia, runs a learning management system named Scele. Built on Moodle, Scele hosts all online learning activities such as course materials, assignment submissions, quizzes, discussion forums, and academic announcements, and serves more than 1,500 active users each day. On the offline side, we have Presentronik, which records class attendance when students tap their student cards. With the ongoing digitalization of the learning process, the Faculty is seeking analytical solutions to monitor and make use of student activity data.

Welcome to CSUI-DATASTREAM: a big data streaming architecture to monitor student activity in realtime.

Illustration of CSUI-DATASTREAM big data streaming architecture (raw vector from: vecteezy.com)

Based on Jay Kreps’s concept of the Kappa Architecture, the system is all stream. Since building an analytical solution is like building with Lego, and we were under time constraints, we defined the main requirement of the architecture: DATASTREAM has to be an adequate base architecture that can be scaled and onboarded with new services without writing too much boilerplate code or revamping the structure.

Structural Overview

The big data streaming architecture uses Apache Kafka as the distributed messaging queue. The cluster consists of three brokers and one Zookeeper instance. We also attach Confluent Schema Registry to the cluster to manage Avro schemas, as Avro is our main data serialization system.
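Running three brokers means every topic can be replicated for fault tolerance. As a minimal sketch, assuming the confluent-kafka Python client (the topic name and broker addresses here are illustrative, not the real deployment’s), creating a fully replicated topic looks like this:

```python
# A minimal sketch: create a topic replicated across all three brokers.
# The topic name and broker addresses are illustrative placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka1:9092,kafka2:9092,kafka3:9092"})

# replication_factor=3 keeps a copy of every partition on each broker,
# so the topic survives a single broker failure.
futures = admin.create_topics(
    [NewTopic("scele.activity", num_partitions=3, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()  # raises an exception if creation failed
    print(f"created {topic}")
```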

A more detailed diagram of the architecture

We came up with four layers: acquisition, speed, serving, and visualization. Each layer consists of modules running as containers in a Docker environment. We chose Docker to speed up the implementation process.

Acquisition Layer

The acquisition layer handles ingestion from the data sources into the Kafka cluster. Data from the acquisition layer is produced into the corresponding Kafka topics, serialized with Avro. There are two ways to ingest data with the acquisition layer:

  • RDBMS Write-Ahead Logs: We use Debezium to capture data changes in real time from PostgreSQL’s write-ahead log. This is used for connecting to Scele and Presentronik (registering a connector is sketched after this list).
  • REST-to-Message: We built a platform named Vroducer to ingest data into Kafka via REST requests and serialize it into Avro. Vroducer is built with Python and Flask (a minimal sketch also follows below).
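For illustration, a Debezium connector is registered through the Kafka Connect REST API. In this hedged sketch, the connector name, hostnames, and credentials are placeholders rather than our actual configuration:

```python
# Sketch of registering a Debezium PostgreSQL connector via the
# Kafka Connect REST API; every name and host here is a placeholder.
import requests

connector_config = {
    "name": "scele-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "scele-db",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "<password>",
        "database.dbname": "moodle",
        # Prefix for the change-event topics, e.g. scele.public.mdl_user
        "database.server.name": "scele",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector_config)
resp.raise_for_status()
```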
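And here is a simplified sketch of the REST-to-message idea behind Vroducer, assuming the confluent-kafka Avro client; the schema, endpoint, and addresses are invented for illustration and are not Vroducer’s actual code:

```python
# A simplified sketch of the Vroducer idea: accept a REST request,
# serialize it against an Avro schema, and produce it to Kafka.
# The schema, topic, and URLs are illustrative, not the real ones.
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
from flask import Flask, request

value_schema = avro.loads("""
{
  "type": "record",
  "name": "Activity",
  "fields": [
    {"name": "userid", "type": "long"},
    {"name": "action", "type": "string"}
  ]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "kafka1:9092",
        "schema.registry.url": "http://schema-registry:8081",
    },
    default_value_schema=value_schema,
)

app = Flask(__name__)

@app.route("/produce/<topic>", methods=["POST"])
def produce(topic):
    # The Avro serializer rejects payloads that do not match the schema.
    producer.produce(topic=topic, value=request.get_json())
    producer.flush()
    return "", 204
```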

Speed Layer

The speed layer is where we do real-time data processing using the Apache Flink framework. This layer consumes data from the Kafka cluster, processes it, and sinks the results into the serving layer. A processing job consists of a data source module, a real-time aggregation and/or enrichment module, and a data sink module.

To speed up the implementation of data processing jobs, we created a framework named Khodza to abstract away Flink’s boilerplate (fun fact: the name comes from the street where my friends and I rented a house during our uni days). With Khodza, pre-written modules such as sources and sinks are readily available, and it compiles the jobs into a single JAR that is easy to deploy to the Flink JobManager.
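Khodza itself is JVM-based, since the jobs compile into a single JAR, but a rough PyFlink sketch conveys the source-process-sink shape it wires together. The topic name, broker address, and transformation below are invented for illustration:

```python
# Illustrative only: a PyFlink pipeline with the source-process-sink
# shape that Khodza composes (the real jobs are JVM-based).
# Requires the Flink Kafka connector JAR on the classpath.
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer

env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical topic and broker address.
source = FlinkKafkaConsumer(
    topics="scele.activity",
    deserialization_schema=SimpleStringSchema(),
    properties={"bootstrap.servers": "kafka1:9092", "group.id": "speed-layer"},
)

stream = env.add_source(source)
stream.map(lambda record: record.upper()).print()  # stand-in for enrichment
env.execute("khodza-style-job")
```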

In the currently running data processing jobs, we separate fact data from dimensional data. Dimensional data is immediately sunk into an Elasticsearch cluster, while fact data is enriched with the stored dimensional data and then sunk into InfluxDB for the visualization dashboard.
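To make the enrichment step concrete, here is a plain-Python sketch of its logic, written outside Flink for readability; the index, measurement, and field names are invented:

```python
# Sketch of the enrichment step: look a dimension up in Elasticsearch,
# then write the enriched fact to InfluxDB as a time-series point.
# Index, measurement, and field names are invented for illustration.
from elasticsearch import Elasticsearch
from influxdb import InfluxDBClient

es = Elasticsearch(["http://elasticsearch:9200"])
influx = InfluxDBClient(host="influxdb", database="scele")

def enrich_and_sink(fact):
    # The fact only carries a user id; the dimension adds a readable name.
    user = es.get(index="dim_users", id=fact["userid"])["_source"]
    influx.write_points(
        [{
            "measurement": "scele_activity",
            "tags": {"user": user["fullname"], "course": fact["courseid"]},
            "time": fact["timecreated"],  # epoch seconds
            "fields": {"count": 1},
        }],
        time_precision="s",
    )
```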

Relationship between the speed layer’s Flink cluster and the other parts of the architecture

Serving Layer

The serving layer and the speed layer depend on each other. The serving layer consists of two NoSQL data stores and a Kafka sink:

  • Elasticsearch: The Elasticsearch cluster stores dimensional and other non-time-series data, used for the enrichment process and for visualization purposes.
  • InfluxDB: InfluxDB stores the time series data visualized in Grafana.
  • Kafka Sink: Producing the processed data into a Kafka topic allows further use of the data and integration with other systems.

Visualization Layer

Data stored in the serving layer is visualized with the help of Grafana. We implemented basic dashboards that display student activity on Scele and Presentronik in realtime. Faculty members can log in and access Grafana to see the dashboards.

An example of a realtime student activity dashboard for a selected student and timeframe
An example of a realtime student activity dashboard for a selected course and timeframe

Monitoring and Administration

For monitoring and administrative tasks, we use open-source tools such as:

  • Yahoo Kafka Manager: Used for administrative tasks on the Kafka cluster
  • Schema Registry UI: Used for administrative tasks on Avro schemas
  • Portainer: Used to manage the Docker environment and monitor each container
  • Docker and system monitoring dashboard: This Grafana dashboard template displays system statistics for the host and for every Docker container, using Prometheus, cAdvisor, and node-exporter.

The connection between the source RDBMS and Debezium is often interrupted. This crashes the data connector, which then has to be restarted; until it is, no data is produced. To ensure data uptime, we implemented a webhook service that attempts automated recovery whenever Grafana fires a no-data alert (a sketch follows below the figure). We also forward the alerts to Telegram.

Webhook service attempting automated recovery on Grafana’s no data alert
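A minimal sketch of that webhook, assuming Grafana’s legacy alert webhook payload, Kafka Connect’s connector-restart endpoint, and the Telegram Bot API; the connector name, chat id, and token are placeholders:

```python
# Minimal recovery webhook sketch: on a Grafana no-data alert, restart
# the Debezium connector through Kafka Connect's REST API and forward
# the alert to Telegram. Names, token, and chat id are placeholders.
import requests
from flask import Flask, request

app = Flask(__name__)
CONNECT_URL = "http://kafka-connect:8083"
TELEGRAM_URL = "https://api.telegram.org/bot<token>/sendMessage"

@app.route("/alert", methods=["POST"])
def on_alert():
    alert = request.get_json()
    # Ask Kafka Connect to restart the (hypothetically named) connector.
    requests.post(f"{CONNECT_URL}/connectors/scele-connector/restart")
    # Forward the alert to Telegram for visibility.
    requests.post(TELEGRAM_URL, json={
        "chat_id": "<chat-id>",
        "text": f"Attempting recovery after alert: {alert.get('title', 'no data')}",
    })
    return "", 204
```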

My Learnings

Some opinions and learnings from building my first big data architecture, and from working on the final thesis itself:

  • The COVID-19 outbreak made learning management systems (LMS) essential for any academic institution. Learning analytics systems can help teachers and lecturers examine students’ online activity to measure the effectiveness of online learning activities.
  • Data comes in various shapes and sizes. Always think of corner cases when building data solutions. Don’t let a single record break the whole system.
  • I found myself to be an aspiring data solution architect!

Recommended Readings

Kleppmann, M. (2016). Making Sense of Stream Processing: The Philosophy Behind Apache Kafka and Scalable Stream Data Platforms.

Akidau, T., Chernyak, S., & Lax, R. (2018). Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing.

Please reach out to me if you have any feedback or anything to share about this project or the data environment. The final thesis was supervised by Mr. Denny, PhD. and Mr. Maman Sutarman from the Faculty of Computer Science, University of Indonesia.
