Change Data Capture for Adaptive Learning

Anirudh Bhardwaj
SplashLearn Engineering blog
Mar 3, 2021 · 5 min read

Authors: Naresh Sherwal, Sunni Kumar, Suraj Singh Rathore, Anirudh Bhardwaj

Overview: This article captures how we use Change Data Capture to improve learning outcomes at SplashLearn.

The Learning Context

Have you ever wondered about the difference between a good teacher and an online course you register for? A good teacher understands the student and adapts their teaching style to ensure the student is both engaged and challenged. Online courses suffer from the curse of unidirectionality, where information flows in only one direction: from the online medium to the student.

SplashLearn removes the fear of learning by creating a similar feedback loop, where the system automatically tailors the instruction based on how the student engaged with the previous instruction.

Learning Loop

The Learning Engine is where the magic happens in the SplashLearn world. However, a critical engineering problem that must be solved for this magic to happen is the real-time propagation of every action taken by the learner to the Learning Engine. At SplashLearn, every intent of a learner is used to improve the learning outcome, and that data flows into multiple services, each implementing a specific aspect of the learning journey. Propagating this data across multiple systems in real time is what this article covers.

The Change Data Capture Problem @ SplashLearn

At SplashLearn, we have three types of user data: (i) user attributes, (ii) user learning actions, and (iii) user sessions. The systems that implement these capabilities use MySQL, PostgreSQL, and MongoDB respectively to store the data. Our challenge was to capture the change data (termed the binlog in MySQL and the oplog in MongoDB), process it through a near real-time data pipeline, and solve the following problems (an example change event is sketched after this list):

  • Pushing all CDC data to Redshift, where it can be used for aggregation and for creating derived attributes (e.g., mastery of a topic or regularity).
  • Creating a feedback loop from this data into the Learning Engine.
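
For context, a log-based CDC tool emits one event per row change. A simplified change event for a hypothetical learning-activity table might look roughly like the structure below; the table and field names are illustrative, not our actual schema.

```python
# Illustrative shape of a row-level change event emitted by a log-based
# CDC tool such as Debezium. Table and field names are hypothetical.
change_event = {
    "op": "u",                          # c = insert, u = update, d = delete
    "source": {"db": "learning", "table": "user_learning_actions"},
    "before": {"id": 42, "user_id": 7, "status": "in_progress"},
    "after":  {"id": 42, "user_id": 7, "status": "completed"},
    "ts_ms": 1614748800000,             # when the change was committed
}
```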

Architecture and Solution

The high-level architecture of our solution is described below.

Implementation Choices

The ecosystem around data pipelines is very rich, but we narrowed it down to three promising options for a robust implementation.

First Approach — Debezium

Debezium is one of the most popular open source change data capture tools that supports multiple databases.
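
As an illustration, registering a Debezium MySQL connector is a single call to the Kafka Connect REST API. The sketch below is hedged: hostnames, credentials, and table names are placeholders, not our production setup.

```python
import requests  # assumes the Kafka Connect REST API is reachable at connect:8083

# Hypothetical Debezium MySQL connector config; hosts, credentials and
# table names are placeholders.
connector = {
    "name": "learning-actions-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.server.id": "184054",
        "database.server.name": "learning",           # also the Kafka topic prefix
        "table.include.list": "learning.user_learning_actions",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.learning",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()
```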

Benefits

  • Out-of-the box solution for capturing change data.
  • Easy integration with Kafka & Redshift using pre-built sink connectors.
  • Minimal memory & CPU footprint, as it streams the databases' binlog/oplog and thereby puts no additional load on the application.

Limitations

  • Applying additional logic to modify the data before pushing it to the sink is not possible.
  • Limited documentation for managed DB services (RDS, Aurora), making it difficult to run in our environment.
  • Debezium takes a table lock for the initial sync, and the lock duration can become significant for large tables.

Second Approach — In-house tool

A white-box approach to change data pipelines is based on adding columns that track the update time of every row in the database. The application is responsible for ensuring that all inserts and updates carry a timestamp, and a custom tool periodically queries the database for all rows that have changed since the last run. The tool can then filter and process the changed data as needed and send it to one or more data pipelines.
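
A minimal sketch of this polling approach, assuming an application-maintained updated_at column; the table name, columns, and driver are hypothetical.

```python
import time
import psycopg2  # assuming a PostgreSQL source; any DB-API driver works similarly

POLL_INTERVAL_SECONDS = 30

def poll_changes(conn, last_seen):
    """Fetch rows modified since the previous poll, using an
    application-maintained updated_at column (hypothetical table)."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, user_id, status, updated_at "
            "FROM user_learning_actions "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        rows = cur.fetchall()
    if rows:
        last_seen = rows[-1][-1]    # advance the watermark
    return rows, last_seen

def run(conn, sink):
    last_seen = "1970-01-01"
    while True:
        rows, last_seen = poll_changes(conn, last_seen)
        for row in rows:
            sink.send(row)          # filter/transform and forward downstream
        time.sleep(POLL_INTERVAL_SECONDS)
```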

Benefits

  • Capability to build customized ETLs to process & transform data before pushing it to Redshift.
  • Maintenance and modification of the pipeline are fully under our control.
  • Columns unsupported by the consumer can be handled gracefully.

Limitations

  • Development time can be several times higher than with off-the-shelf solutions.
  • Higher load on the databases, as the system would query the DB at a pre-defined frequency instead of reading the binlogs.
  • A query-based approach is very difficult to make near real-time, as the query frequency would need to become very high, which would in turn increase the load on the database.

Third Approach — AWS DMS

As we host our application on AWS, we explored AWS Database Migration Service (DMS).
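
For reference, a DMS replication task is created against pre-configured source and target endpoints. A minimal boto3 sketch follows; the ARNs and table mappings are placeholders, not our actual resources.

```python
import json
import boto3

dms = boto3.client("dms")

# Placeholder table-selection rule: replicate every table in the "learning" schema.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-learning-tables",
        "object-locator": {"schema-name": "learning", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="learning-cdc-task",
    SourceEndpointArn="arn:aws:dms:...:endpoint:source",      # placeholder ARN
    TargetEndpointArn="arn:aws:dms:...:endpoint:target",      # placeholder ARN
    ReplicationInstanceArn="arn:aws:dms:...:rep:instance",    # placeholder ARN
    MigrationType="full-load-and-cdc",   # initial copy, then ongoing replication
    TableMappings=json.dumps(table_mappings),
)
```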

Benefits

  • Setup can be done in a few minutes with proper monitoring, as it is a managed service from AWS.
  • Good for real-time migrations of small databases (<100 GB).
  • Cost was low when we started using it, but the pricing changed at a later stage.

Limitations

  • We can’t apply any transformations to the data.
  • Some data types are not supported, such as JSON in the case of MongoDB-to-Redshift migrations.
  • In case of a schema change, we need to resync the whole database, as we can’t resync individual tables.
  • Sometimes DMS jobs failed for various reasons, and we then needed to resync the complete job again.

Solution Overview

We looked at the strengths and weaknesses of all three approaches and tried to find a solution that retains the positives of each while minimizing the drawbacks. We ended up building a hybrid solution to achieve our goals.

Our hybrid solution splits the CDC problem into two parts:

  • Data Capture: Debezium has built-in support to read the binlogs of all our databases and emit change events in near real time. We used Debezium only for data capture and emitted the change event stream into Kafka.
  • Data Processing: We needed control over processing and transforming the change event stream. Hence, we implemented our own processing layer that reads events from Kafka, processes them, and then forwards the processed data to the Learning Engine as well as Redshift (a minimal sketch follows this list).
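
The sketch below shows what such a processing layer can look like, using the kafka-python client. The topic name follows Debezium's <server>.<database>.<table> convention, and the downstream clients are hypothetical stand-ins for our actual services.

```python
import json
from kafka import KafkaConsumer  # kafka-python client

class LearningEngineClient:
    """Hypothetical client for the Learning Engine's ingestion API."""
    def publish(self, record):
        pass  # e.g., POST the record to the Learning Engine

class RedshiftLoader:
    """Hypothetical micro-batcher that stages records and loads them into Redshift."""
    def buffer(self, record):
        pass  # e.g., append to S3 staging files, then COPY periodically

def transform(event):
    """Flatten the Debezium envelope into the shape downstream systems expect."""
    payload = event.get("payload", event)      # with or without the schema wrapper
    return {"op": payload["op"], **(payload.get("after") or {})}

consumer = KafkaConsumer(
    "learning.learning.user_learning_actions",  # Debezium topic: <server>.<db>.<table>
    bootstrap_servers=["kafka:9092"],
    group_id="cdc-engine",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

learning_engine = LearningEngineClient()
redshift_loader = RedshiftLoader()

for message in consumer:
    record = transform(message.value)
    learning_engine.publish(record)   # feedback loop into the Learning Engine
    redshift_loader.buffer(record)    # analytics and derived attributes in Redshift
```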

The above diagram captures our solution, where Debezium pushes the change data via Kafka to our CDC Engine, which then processes and sends the data on to the systems that need it. Our system is currently processing 5,000 events/second, and the end-to-end data latency is under 100 seconds.

For historical reasons, we got the chance to see all three systems in action at the same time. This allowed us to assess their relative strengths and then build a robust hybrid solution that achieves the right tradeoff between speed of development and performance.
