Spark Streaming at PS

Intro

In the Pluralsight product and engineering world, there are a few central organizing principles. Our product teams are cross-functional: they include engineers, a product manager, a product designer, and often a data scientist or analyst. Each product team (PXT) owns one or more bounded contexts (BCs), where a BC is a narrow domain of the business encompassing a full tech stack (its own database, front end, back end, and data).

Team autonomy is another central product principle at Pluralsight. We organize around bounded contexts and microservices so that teams can deploy code without inter-team release dependencies. Because BCs are isolated this way, no single team can take the entire application down. This autonomy also means that teams choose their own technology stack: some use C#, some Node, some Python. The same applies on the data side. While we described some of our teams’ ksqlDB efforts here, other teams are working with Spark. Yes, both ksqlDB and Spark are streaming analytics solutions, and you might be wondering whether that makes lessons learned less transferable across the business. Occasionally it does, but the autonomy and the ability to quickly test and incorporate new technologies is a huge tailwind for engineers and data scientists.

The Tech Leader

While much of the Pluralsight platform is focused on individual learners, many PXTs focus on the tech leader, who is often a CTO or engineering leader looking to skill up their teams.

My team, the Plan Analytics PXT, focuses primarily on providing reports for these leaders. We provide insights into an organization’s use of the Pluralsight platform so that they can inventory the skills of their engineering teams and identify areas for improvement. These descriptive analytics show tech leaders how Pluralsight is providing value to their company and how their teams are improving. That brings us to our product data tools.

Why Kafka

Most product data at Pluralsight has historically been transmitted via RabbitMQ. We are currently transitioning from RabbitMQ to Apache Kafka, a highly resilient system that stores a durable log of messages in key/value format. Data in the cluster is organized into Kafka topics; streaming technologies can subscribe to a topic to consume its data, or produce data to a topic for others to consume.

As teams increasingly consume from Kafka, highly effective tools have been developed to replicate any Kafka topic to a Postgres table. My team has a slightly different use case than simply consuming raw data, however. We need the ability to control how our data is shaped and aggregated before it is ready to be replicated. The core of what we need to do is:

  1. Migrate data between plans when a plan merge occurs
  2. Associate any content usage with a plan
  3. Provide accurate analytics for our customers in real time

Most data at Pluralsight does not have a plan already associated with it and is frequently too large to query in its pre-aggregated state. Basic replication doesn’t handle these situations, unfortunately. Enter Spark.

Why Spark

Over the course of a year or so, my team and I have experimented with different streaming technologies; we briefly looked into Kafka Streams and KSQL, but they didn’t seem to support all the features we were looking for (at the time). We even dug into a streaming .NET library, but it wasn’t mature enough (once again, at the time) to suit our needs. We ended up choosing Apache Spark; it had the ability to join multiple streams of data together, it offered both batching and real time processing, stateful operations were attractive, and it came highly recommended from our Data Platform team.

Many of the use cases I find on the interwebs for Spark involve handling different types of events and associating them together within a timeframe, something akin to fraud detection or catching error-rate increases. We wanted to do something very similar, with one caveat: associate a plan with a user consuming content and maintain a historical record. This was important because it would enable other teams in similar spaces to reuse the aggregated data and reduce duplicative skilling up on streaming technologies.

Another significant benefit to having aggregated streams of data published into Kafka would be data consistency: if other teams are using these aggregated datasets it would reduce the amount of divergent business logic around how the data should be handled.

Skilling Up

Maybe now is a good time to say this: my team is almost exclusively a .NET team, and none of us had any experience with streaming technologies, so this was an entirely new adventure for us. While many concepts from software engineering carry over into streaming, there is a mindset change. We have found over time that these technologies aren’t for every software engineer; some don’t want to make the mindset change or simply aren’t interested in the potential career shift. Streaming technologies are probably best categorized as data engineering, and that is an extremely important consideration.

That being said, this created a unique opportunity for my team and me to approach the problem with, perhaps, a different mindset than a data engineer might take. We knew we needed to figure out how Test Driven Development (TDD) works in the Spark world, we needed a full CI/CD pipeline, and we needed repeatable patterns so that all team members could participate in developing Spark jobs. Moving from the .NET ecosystem to the Java Virtual Machine (JVM) ecosystem wasn’t necessarily difficult, but it was a change for my team. We started out using Scala and SBT from the command line with VS Code, and found that IntelliJ provided much-needed value (debugging, syntax highlighting, etc.).
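The TDD pattern we converged on can be sketched in plain Scala: keep the transformation logic in pure functions that can be unit tested without a running Spark cluster, then wire them into the streaming job separately. The names and schema below are illustrative, not our real event shapes.

```scala
// A simplified content-usage event (hypothetical schema for illustration).
case class UsageEvent(userId: String, planId: Option[String], seconds: Long)

object UsageTransforms {
  // Pure function: only events associated with a plan count toward
  // plan analytics. Testable with no Spark dependency at all.
  def totalPlanSeconds(events: Seq[UsageEvent]): Map[String, Long] =
    events
      .collect { case UsageEvent(_, Some(plan), secs) => plan -> secs }
      .groupMapReduce(_._1)(_._2)(_ + _)
}

object UsageTransformsDemo extends App {
  val events = Seq(
    UsageEvent("u1", Some("plan-a"), 60),
    UsageEvent("u2", Some("plan-a"), 30),
    UsageEvent("u3", None, 999) // no plan: excluded from the aggregate
  )
  // A plain assert stands in for a real test framework here.
  assert(UsageTransforms.totalPlanSeconds(events) == Map("plan-a" -> 90))
}
```

In the actual job, a thin Spark layer maps a Dataset through functions like this, so the business logic stays test-driven even though the streaming runtime is not.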

Another change was to our CI/CD pipeline and where we would host our Spark jobs. We only seriously looked into Amazon Web Services’ (AWS) big data service, Elastic MapReduce (EMR). We currently use EMR to host our Spark jobs, and it has been great, though there were additional learnings around managing resources on those EMR clusters so that our jobs stayed performant. If we were to seriously evaluate something a little more hands-off, we would probably take a harder look at Databricks.

Moving Beyond Hello World

While I wouldn’t say that getting a word count program up and running in Spark is particularly difficult (there are a million word count examples out there), it was challenging to understand some of the more complex concepts and features Spark provides and how to properly architect data pipelines. We found limited information on the errors we would encounter and on architectural patterns that would work for our problem space, and little guidance on how to guarantee data integrity in this new streaming world. Data integrity is critical to what we do, and we strive to be consistent and accurate.
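For reference, the hello world in question: word count reduces to a few lines. This plain-Scala version mirrors the shape of the Spark version (flatMap, group, reduce) without needing a cluster; everything past this is where the real difficulty starts.

```scala
object WordCount {
  // Mirrors Spark's flatMap -> groupBy -> reduce pipeline on plain collections.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // tokenize each line
      .filter(_.nonEmpty)                   // drop empty tokens
      .groupMapReduce(identity)(_ => 1)(_ + _) // word -> occurrence count
}
```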

If I were to summarize my team’s largest hurdle in confirming data integrity with Spark, it would be understanding how data is aggregated and joined together. I find it helpful to think of these joins just as I would a SQL statement. A join can be made from one stream of data (a Kafka topic, a file system, etc.) to another source (another stream, a database), and Spark offers plenty of features that determine how long that data is available in memory (e.g., watermarks, windows). Given that my team deals with historical records going back 10+ years, some data sets simply will not fit into memory. To work around this, we began by joining incoming content usage (a Kafka topic) with a Postgres table containing plan data. We called this method of joining a stream of data to a database a “stream to cache” join. That worked okay but came with complications: because we were joining against a database, changes on the database side couldn’t trigger any processing in our job. As a result, we had to “restart” the job from the beginning when large events occurred (such as a plan merge).
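A stripped-down model of a “stream to cache” join, in plain Scala: the names and shapes are illustrative (the real job uses Spark’s DataFrame API against Postgres, not an in-memory map), but the structure shows why the cache side can’t drive the job.

```scala
case class Usage(userId: String, contentId: String)
case class PlanMembership(userId: String, planId: String)
case class PlanUsage(planId: String, userId: String, contentId: String)

object StreamToCache {
  // The "cache" side is a static snapshot loaded from the database.
  // Because it is only ever looked up, never streamed, a change on this
  // side (e.g. a plan merge) cannot trigger reprocessing -- the
  // limitation described above.
  def join(stream: Iterator[Usage],
           cache: Map[String, PlanMembership]): Iterator[PlanUsage] =
    stream.flatMap { u =>
      cache.get(u.userId).map(m => PlanUsage(m.planId, u.userId, u.contentId))
    }
}
```

Usage events whose user isn’t in the cache are simply dropped here; the real pipeline has to account for them, which is part of the late-data problem discussed below.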

We took that learning and started looking into some of the built-in features we didn’t fully understand during our first foray into Spark. We found that certain datasets could actually be stored in memory and compacted with stateful operations (e.g., flatMapGroupsWithState). Our original concern was how large these datasets would become, but after testing we realized that Spark did a lot of the heavy lifting for us and managed state really well. This allowed us to source our plan data from a Kafka topic instead of a Postgres table, creating opportunities for “stream to stream” joins and for triggering events from both sides of the join. These stream to stream joins allow us to process in real time.
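A plain-Scala caricature of the keyed state that flatMapGroupsWithState manages: per key, Spark keeps a compacted state value and invokes an update function as events arrive. In real code this is Dataset.groupByKey(...).flatMapGroupsWithState with a GroupState and timeouts; the event names below are hypothetical.

```scala
// Events on either side of the join, keyed by userId.
sealed trait Event { def userId: String }
case class PlanAssigned(userId: String, planId: String) extends Event
case class ContentViewed(userId: String, contentId: String) extends Event

case class JoinedUsage(userId: String, planId: String, contentId: String)

object StatefulJoin {
  // State per key: the user's current plan, compacted to a single value,
  // analogous to what Spark keeps for us in the state store.
  type State = Option[String]

  // Update the state and emit any joined output for one event.
  def update(state: State, event: Event): (State, Seq[JoinedUsage]) =
    event match {
      case PlanAssigned(_, planId) => (Some(planId), Nil)
      case ContentViewed(uid, cid) =>
        (state, state.map(p => JoinedUsage(uid, p, cid)).toSeq)
    }

  // Fold one key's events through the state, collecting joined output.
  def run(events: Seq[Event]): Seq[JoinedUsage] =
    events.foldLeft((Option.empty[String], Vector.empty[JoinedUsage])) {
      case ((st, out), ev) =>
        val (next, emitted) = update(st, ev)
        (next, out ++ emitted)
    }._2
}
```

The key point the model captures: because plan data now arrives as events rather than sitting in a database, a plan change can itself trigger processing, which is what the stream-to-cache approach couldn’t do.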

This is only part of the solution. As I mentioned previously, not all data sets fit into memory, and there needs to be some process for handling “old” or “late” arriving data. Our proposed solution is to leverage both “stream to stream” and “stream to cache” joins. This lets us keep usage available in real time while maintaining a historical record for the larger data sets. All data happening in real time will be handled by our “stream to stream” joins, and anything that is not happening in real time, or that is considered old or late, we will send to another data pipeline for a “stream to cache” join. This latter pipeline will process at a slower rate but will handle the historical records.
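The routing step between the two pipelines can be sketched as a simple partition on event time against the watermark. The one-hour-style cutoff and names below are illustrative, not our production settings; in Spark the watermark itself is declared with withWatermark on the streaming Dataset.

```scala
import java.time.Instant

case class TimedEvent(userId: String, eventTime: Instant)

object LateDataRouter {
  // Events older than the watermark go to the slow "stream to cache"
  // pipeline; everything else stays on the fast "stream to stream" path.
  // Returns (fastPath, slowPath).
  def route(events: Seq[TimedEvent],
            watermark: Instant): (Seq[TimedEvent], Seq[TimedEvent]) =
    events.partition(e => !e.eventTime.isBefore(watermark))
}
```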

This pattern of a slow and fast pipeline working in parallel might sound familiar to anyone who has read about Lambda and Kappa architectures; we are taking the Lambda approach to handle these larger data sets and the Kappa approach where we can with smaller data sets.

Conclusion

We envision that the creation of these data pipelines will empower teams here at Pluralsight to create new experiences that delight our customers consistently, accurately, and with greater speed. The difficulty of aggregating data will be abstracted from those teams so they can focus on our customers’ jobs to be done. With this abstraction, more mature analytics can become the priority.
