Simplilearn's tale of decoupled architecture: solving Learner Operations challenges using Kafka
Simplilearn is the world's #1 online bootcamp and one of the world's leading certification training providers. Our focus is on upskilling learners in the key skills of the IT industry. Traditionally, the core learning platform at Simplilearn was broadly divided into the learning management system (LMS), which rendered the learning content to the user, and the learner record store (LRS), which stored learner progress.
The operational activities pertaining to trainer profile management, teaching assistant mapping, scheduling of live sessions, and assignment of courses and projects to learners were being performed manually. This led to a multitude of challenges as both the number of course offerings and the number of learners accessing the systems at Simplilearn grew over time. There was a need for a system (let's call it System X) that could centralise these operational activities seamlessly and be used by the key stakeholders at Simplilearn: the faculty relations groups, the teaching assistants, the trainers and the learners.
Key System Challenges
The key challenges we faced were as follows:
- Inefficient manual activities: Excel-based trackers were used to manage the contact details of stakeholders such as trainers and teaching assistants. Scheduling of sessions was being done over email. Session details, along with completion status, projects, assessments, NPS and more, were being stored on spreadsheets. Progress updates were tracked manually, and all of this was a logistical nightmare.
- Engagement issues: The trainers and teaching assistants lacked real-time knowledge of classes and courses that would let them drive up progress. There was no centralised touch point for the details of project submissions and evaluations. Capturing learner feedback involved referring to multiple data sources, hurting engagement levels in the process.
- Lack of standardisation and transparency: The process lacked an audit log as well as any record of trainer performance. Trainer invoicing and payments were managed over email.
Hunt for a solution
The challenges above necessitated the creation of a centralised application at Simplilearn, System X, that could serve as a touch point to fulfil the needs charted out in figure 1. Architecturally, (1) System X clearly needed to be decoupled from the LMS, since an outage of System X should not prevent learners from progressing with their learning goals.
(2) System X had to support a high volume of real-time learner metrics without its performance degrading.
(3) System X needed the capability to gather data in real time from upstream systems consisting of MySQL tables (holding course-specific and learner data), Firebase RTDB (holding learner progress records) and MongoDB.
Also, (4) given that learner metrics are vital data points, System X needed to be resilient to failures, with no loss of data.
The platform development team at Simplilearn decided that System X would be built on AngularJS for the front end and NodeJS for the back end. The design had to manage inherent complexities given the multitude of expected features, views and integrations. For coherence, the rest of this article focuses on the architectural aspects only, particularly the need for a highly scalable, decoupled architecture. In order to track changes to course, learner or class objects, we evaluated polling the database every few seconds, using database- or application-level triggers, and reading the database transaction logs to capture changes. We also explored message broker solutions. While technologies like RabbitMQ and AWS SQS seemed to meet some of these requirements, the volume and scalability requirements pushed us towards a real-time streaming solution.
Enter Apache Kafka.
Apache Kafka is an open-source distributed event streaming platform. Kafka stores key-value messages that come from source processes called producers. The incoming data can be partitioned into different "partitions" within different "topics". Within a partition, messages are ordered by their offsets (the position of a message within the partition), and are indexed and stored together with a timestamp. Recipient processes called "consumers" read messages from these partitions. All of this happens through the core Kafka APIs.
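To make the producer and consumer flow concrete, here is a minimal sketch using the kafkajs client for Node.js. The broker address, topic name and consumer group below are placeholders for illustration, not our actual configuration.

```typescript
import { Kafka } from "kafkajs";

// Placeholder client id and broker list; replace with your cluster details.
const kafka = new Kafka({ clientId: "system-x", brokers: ["broker1:9092"] });

async function produce(): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  // Messages with the same key always land in the same partition,
  // which preserves per-key ordering.
  await producer.send({
    topic: "learner-progress", // hypothetical topic name
    messages: [
      { key: "learner-42", value: JSON.stringify({ courseId: 7, percentComplete: 80 }) },
    ],
  });
  await producer.disconnect();
}

async function consume(): Promise<void> {
  const consumer = kafka.consumer({ groupId: "system-x-consumers" });
  await consumer.connect();
  // kafkajs v2 subscribe syntax; older versions take a single `topic` string.
  await consumer.subscribe({ topics: ["learner-progress"], fromBeginning: false });
  await consumer.run({
    // Each message arrives with its partition and offset, as described above.
    eachMessage: async ({ partition, message }) => {
      console.log(partition, message.offset, message.value?.toString());
    },
  });
}

produce().then(consume).catch(console.error);
```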
Some of the key benefits that Kafka brings to the table are:
- High throughput with low latency
- Scalability up to trillions of messages per day and petabytes of data (yes, you read that right!)
- Stream data storage in durable, distributed and fault tolerant clusters
- High availability through clusters stretched across regions or availability zones
- Built-in stream processing features, including event-time semantics and exactly-once processing
- Connects to almost anything — supports hundreds of event sources and sinks
Kafka runs on a cluster of one or more servers (called brokers), and the partitions of all topics are distributed across the cluster nodes. Additionally, partitions are replicated to multiple brokers. Cluster node configuration and synchronisation are coordinated by the ZooKeeper service.
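To illustrate how partitioning and replication are specified when a topic is created, the sketch below uses the kafkajs admin client; the topic name, partition count and replication factor are purely illustrative.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "system-x-admin", brokers: ["broker1:9092"] });

async function createTopic(): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  // Three partitions spread across the brokers; each partition is replicated
  // to three brokers so the topic survives the loss of a node.
  await admin.createTopics({
    topics: [{ topic: "learner-progress", numPartitions: 3, replicationFactor: 3 }],
  });
  await admin.disconnect();
}

createTopic().catch(console.error);
```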
The devil is in the detail
Kafka itself is free and open source; the only cost incurred is for the compute resources (and the associated block storage) provisioned to run the service. However, setting up the Kafka and ZooKeeper clusters for the first time, determining the right scale and configuration, and carrying out periodic maintenance and recovery activities demand a substantial degree of expertise and committed time from the technology teams. This can sometimes be a put-off and steer the decision away from using Kafka as a solution. It also makes fully managed streaming platforms a very lucrative proposition (read AWS Kinesis and Azure Event Hubs, for instance), but that is a discussion for another day.
The role of Managed Kafka services
While the extent of being "managed" varies, there are now providers who offer to run the Kafka service for you, abstracting away the overheads of cluster setup and maintenance.
Two of the key players in the market are Confluent Cloud and Amazon Managed Streaming for Apache Kafka (MSK). We will not delve into a comparison here, as that is beyond the scope of this article.
For our solution, Simplilearn opted for a managed Kafka service provided by Aiven. Aiven provides an interactive GUI to provision the Kafka cluster, configure its settings and create the relevant topics. This avoided the need to manually provision ZooKeeper and the broker nodes, and the usual manual activities that follow, enabling us to focus on building the application.
And now the Solution
The System X schema involved MySQL tables pertaining to courses, learners, users and their relationships, with the focus on their utility to the users of System X. The application needed real-time data to flow into it from three kinds of sources: (1) upstream data changes "as is" from MySQL tables, (2) data from MongoDB and Firebase RTDB, and lastly (3) "processed and massaged" data in real time from the MySQL source databases.
This was achieved by setting up a Kafka cluster. Relevant topics were created on this cluster, and System X consumed data from these topics through subscriptions. The publishers to the cluster broadly included the following:
- MySQL Debezium connector: This was meant for incoming data that needed to flow through to System X as is, whenever it was inserted into the source. Debezium is an open-source CDC connector that listens to the MySQL binlogs and publishes the changes to Kafka, for processing by the designated consumers (a sketch of registering such a connector appears after this list). Do check out the Debezium documentation for further details on the connector.
- REST API calls from MongoDB and Firebase RTDB: Database- and application-level triggers were configured to make a REST API call that produces data to the relevant topic on the cluster.
- REST API calls from the processed-data logic: This covers data from upstream MySQL tables that needs to be processed before it can be sent to the Kafka cluster. The CDC connector cannot be used here, so the REST API performs the required processing before sending data to the topic.
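For illustration, registering a Debezium MySQL connector typically amounts to POSTing a JSON configuration to the Kafka Connect REST API. The sketch below is a minimal example with placeholder hosts, credentials and table names (and is not necessarily how our team deployed it); the exact configuration keys also vary across Debezium versions.

```typescript
// Minimal sketch of registering a Debezium MySQL connector with Kafka Connect.
// All hostnames, credentials, table names and the server id are placeholders.
const connectorConfig = {
  name: "system-x-mysql-connector",
  config: {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "cdc_password",
    "database.server.id": "184054",
    "database.server.name": "systemx", // newer Debezium versions use topic.prefix instead
    "table.include.list": "lms.courses,lms.learners",
    "database.history.kafka.bootstrap.servers": "broker1:9092",
    "database.history.kafka.topic": "schema-changes.systemx",
  },
};

// Kafka Connect exposes a REST endpoint (default port 8083) for managing connectors.
fetch("http://connect.internal:8083/connectors", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(connectorConfig),
}).then((res) => console.log("Connector registration status:", res.status));
```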
Figure 4 below illustrates this setup, showing the multiple data sources and the Kafka cluster.
Kafka is resilient by design, and as a layer on top of that, the developers at Simplilearn also built a robust retry mechanism. Each topic configured on the cluster had a corresponding retry topic as well as a delinquent (dead-letter) queue topic, with automated scripts to process any failed data records.
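A common way to implement such a retry layer, and a rough approximation of the mechanism described above (the topic names and retry limit are placeholders, not our production values), is to catch processing failures in the consumer and republish the message to the retry topic, falling back to the delinquent queue topic once retries are exhausted.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "system-x-worker", brokers: ["broker1:9092"] });
const producer = kafka.producer();
const consumer = kafka.consumer({ groupId: "system-x-workers" });

const MAX_RETRIES = 3; // illustrative retry budget

async function processRecord(payload: string): Promise<void> {
  // Placeholder for the actual System X processing logic.
  JSON.parse(payload);
}

async function run(): Promise<void> {
  await Promise.all([producer.connect(), consumer.connect()]);
  // Consume both the main topic and its retry topic (kafkajs v2 syntax).
  await consumer.subscribe({ topics: ["learner-progress", "learner-progress.retry"] });

  await consumer.run({
    eachMessage: async ({ message }) => {
      const attempt = Number(message.headers?.attempt?.toString() ?? "0");
      try {
        await processRecord(message.value?.toString() ?? "");
      } catch (err) {
        console.error("processing failed", err);
        // Route the failed record to the retry topic, or to the delinquent
        // (dead-letter) topic once the retry budget is exhausted.
        const target =
          attempt + 1 >= MAX_RETRIES ? "learner-progress.dlq" : "learner-progress.retry";
        await producer.send({
          topic: target,
          messages: [
            { key: message.key, value: message.value, headers: { attempt: String(attempt + 1) } },
          ],
        });
      }
    },
  });
}

run().catch(console.error);
```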
With the application nearing the end of development, the build teams at Simplilearn were posed with another challenge: the migration of historical data to this new application. This was a beast by itself, given the scattered nature of the data that needed to be evaluated for the migration, and the multiple rules and use cases that the load process needed to cater to. A script was written for this purpose and tested thoroughly before the setup was launched in production.
Also, in order to monitor this setup accurately, alerts were configured in Datadog to track the health of the cluster nodes as well as the consumer lag on the topics. These monitors alert the team to act whenever the lag increases or there is a general issue with the cluster nodes.
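We relied on Datadog's Kafka integration for the actual monitoring, but the underlying idea of consumer lag (the latest offset of each partition minus the consumer group's committed offset) can be sketched with the kafkajs admin client. The topic and group names below are placeholders, and the fetchOffsets call shown assumes the kafkajs v2 signature.

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "lag-checker", brokers: ["broker1:9092"] });

// Rough consumer-lag calculation: latest offset (high watermark) minus the
// consumer group's committed offset, per partition.
async function reportLag(topic: string, groupId: string): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();

  const latest = await admin.fetchTopicOffsets(topic); // high watermarks per partition
  const committed = await admin.fetchOffsets({ groupId, topics: [topic] });
  const groupPartitions = committed.find((t) => t.topic === topic)?.partitions ?? [];

  for (const { partition, offset: highWatermark } of latest) {
    const groupOffset = groupPartitions.find((p) => p.partition === partition)?.offset ?? "0";
    const lag = Number(highWatermark) - Number(groupOffset);
    console.log(`partition ${partition}: lag ~ ${lag}`);
  }

  await admin.disconnect();
}

reportLag("learner-progress", "system-x-consumers").catch(console.error);
```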
Outcome and the road ahead
The launch of System X was a major breakthrough with respect to automating the manual activities previously performed by the users of the platform. It also brought a high degree of accuracy to the transactions and sped up operations, with limited involvement of the customer-facing teams at Simplilearn. In addition, none of the overhead from this build was passed on to the customer, given that the architecture was decoupled, and system performance was as good as before, if not better.
As of today, this architecture has been operating for close to eighteen months without any significant issues or blockers. There have been some operational challenges along the way, and the team has found solutions and standard operating procedures to work around them. Part 2 of this blog will focus on some of the operational lessons and best practices of working with a managed Kafka setup, including upgrades, pricing and managing the CDC connector.
Meanwhile, if you found this interesting, I would encourage you to read up further on Kafka through the various helpful resources that are available online. Additionally, you could also check out our well-structured Apache Kafka Certification course at Simplilearn which dives deep into the Kafka architecture, integrations and administration.