Real-Time Funneling of Incremental Changes Into S3, Using Spark Streaming and Kafka

Engineering Team
ZipRecruiter Tech
5 min read · Nov 7, 2022


How ZipRecruiter brings live data into an at-rest data lake

This article shares some insight into the challenges the ZipRecruiter Tech Team works on every day. If you’re interested in working on solutions to problems like these, please visit our Careers page to see open roles.

ZipRecruiter’s mission is to actively connect people to their next career opportunity. We serve as a marketplace for hundreds of millions of job seekers and employers. Using advanced machine learning algorithms, we find the best match between jobs and job seekers, saving valuable time and focusing efforts.

One of our most important assets is the data describing open positions offered by our various employer customers. This Jobs dataset alone holds almost 20 billion historical and current job versions, reaching a size of more than 100TB.

Every time new jobs are created or updated, they are immediately made available to our users as well as to the ML models used in matching. However, in order to update those ML models and learn from the new jobs, the changes also need to be reflected in our data lake, where models are trained in batch. By the end of 2020 we were only propagating changes to the data lake in 24-hour cycles. As a result, our ML matching engine and analytics were operating against a relatively stale snapshot of our data and missed changes that occurred during the day.

This article explains how we went from daily snapshots to frequent updates where data is delayed by no more than an hour. Our goal is to provide our users with the best matches as soon as possible. In hiring, every minute counts.

A problem of freshness, completeness, and costs

Our team’s mission was to implement a data pipeline that would introduce every new or changed job into our data lake with a latency of no more than one hour.

We use DynamoDB, an AWS-hosted NoSQL database service, as our online, random-access data store and source of truth for all job listings. DynamoDB (DDB) does a great job serving all our online web, mobile, and search applications, but we did not find it suitable for aggregations, analytical reports, batch analysis of large amounts of data, or training and batch-applying ML algorithms that require lots of historical data. This is where the data lake and data processing and analytics engines tend to be the right tools.

Prior to implementing this project, we were unloading a snapshot of our jobs dataset from DDB into AWS S3 once daily, and also loading only that day's active jobs into AWS Redshift, which our analytics team uses for business reporting.

Unloading the data only once a day made it difficult to develop big data and machine learning applications that react to changes within the day, a ‘freshness’ problem. We also missed changes like job activations and inactivations when querying data for analytical purposes, a ‘completeness’ problem.

In addition, because the complete dataset was in DDB, storage costs were very high and data consumers were frequently querying it directly to retrieve historical data, even though that data, by definition, didn’t change. That is not an efficient use of DDB.

Capturing incremental changes in S3

Unloading incremental data from DDB directly to S3 is not technically possible (at the time of writing), so we needed to capture all changes to the Jobs DynamoDB store in a different way.

A good solution for this is AWS DynamoDB Streams, a service that creates a message for each change, with metadata such as the type of change and the old and new representations of the item. These messages are known as Change Data Capture (CDC) records.
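
For reference, a DynamoDB Streams record for a job update looks roughly like the sketch below. The top-level structure follows the DynamoDB Streams record format; the job attributes shown are hypothetical and trimmed to a couple of fields for brevity.

```python
# Illustrative (and heavily trimmed) shape of a single CDC record emitted by
# DynamoDB Streams for a job update. The job attributes are hypothetical.
cdc_record = {
    "eventName": "MODIFY",  # one of INSERT | MODIFY | REMOVE
    "dynamodb": {
        "Keys": {"job_id": {"S": "job-123"}},
        "OldImage": {"job_id": {"S": "job-123"}, "status": {"S": "active"}},
        "NewImage": {"job_id": {"S": "job-123"}, "status": {"S": "inactive"}},
        "SequenceNumber": "300000000000000499659",
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
}
```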

We streamed these messages into an AWS-hosted Kafka service (Amazon MSK). We chose Kafka because it easily serves CDC data in the form of a Kafka topic that can be consumed by online services. Kafka also acts as our intermediate, short-lived store to be later unloaded into the data lake. In addition, Kafka has significant support within the big data ecosystem, making integrations with streaming and batch-processing applications straightforward.
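
One common way to bridge DynamoDB Streams into MSK (a sketch of the pattern, not necessarily the exact mechanism we use) is a small Lambda function triggered by the stream that republishes each record to a Kafka topic. The example below assumes the kafka-python client; the broker address and topic name are hypothetical.

```python
import json

from kafka import KafkaProducer  # assumes the kafka-python client is packaged with the function

# Hypothetical MSK broker and topic; a real deployment would also configure TLS/IAM auth.
producer = KafkaProducer(
    bootstrap_servers=["b-1.msk.example.internal:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handler(event, context):
    # The function is triggered by the DynamoDB stream; each invocation carries
    # a batch of CDC records, which we forward to Kafka unchanged.
    for record in event["Records"]:
        producer.send("jobs-cdc", value=record)
    producer.flush()
```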

Architecture and data flow overview

Using best practice frameworks to go from inflight to at-rest data

To achieve our goal of having all new and changed jobs in the data lake within an hour, we had to capture our streaming data from Kafka in short intervals of no more than a few minutes.

To do so, we chose Spark Streaming as our unloading library. Spark (and more specifically Spark on Kubernetes) was already our main technology for big data batch applications. Since Spark Streaming easily supports persisting data from Kafka into an AWS S3 data lake, adding it to our tech stack was natural.
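
As a rough sketch of this unload step, the job below reads the CDC topic with Spark Structured Streaming and appends micro-batches to a Delta table on S3 every few minutes. It assumes PySpark with the spark-sql-kafka and delta-spark packages on the classpath; the broker, topic, and S3 paths are placeholders rather than our actual configuration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("jobs-cdc-unload").getOrCreate()

# Read the CDC topic as an unbounded stream.
cdc_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "b-1.msk.example.internal:9092")  # hypothetical broker
    .option("subscribe", "jobs-cdc")                                      # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
    .select(col("key").cast("string"), col("value").cast("string"), "timestamp")
)

# Persist micro-batches to the data lake every few minutes.
query = (
    cdc_stream.writeStream.format("delta")
    .option("checkpointLocation", "s3://data-lake/checkpoints/jobs_cdc/")  # hypothetical path
    .trigger(processingTime="5 minutes")
    .outputMode("append")
    .start("s3://data-lake/raw/jobs_cdc/")                                 # hypothetical path
)
```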

Another library we chose was delta.io, the open-source Delta Lake project, which provides some really neat capabilities, enabling us to treat S3 as an ACID-like datastore and perform actions like merge/insert/update on top of S3.

From CDC to complete jobs tables

Once we had all the changes in the data lake, it was time to apply transformations that turn the CDC events into real job entities.

At this point, all our CDC data was already located in S3, with a Delta Lake table created on top of it. This allowed us to calculate some of the in-row transformations using Spark Streaming and store the result as an intermediate incremental dataset of job versions.
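A minimal sketch of that transformation step, again in PySpark: it streams from the raw CDC Delta table, extracts a few job fields from the record payload, and writes an intermediate job-versions table. The schema, column names, and paths are hypothetical simplifications, and for brevity the payload is assumed to already be flat JSON job fields rather than the nested DynamoDB Streams format.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("jobs-cdc-transform").getOrCreate()

# Hypothetical subset of the job attributes carried in each CDC record.
job_schema = StructType([
    StructField("job_id", StringType()),
    StructField("title", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Stream from the raw CDC Delta table and flatten each record into a job-version row.
job_versions = (
    spark.readStream.format("delta")
    .load("s3://data-lake/raw/jobs_cdc/")                      # hypothetical path
    .withColumn("job", from_json(col("value"), job_schema))
    .select("job.*", col("timestamp").alias("ingested_at"))
)

(job_versions.writeStream.format("delta")
    .option("checkpointLocation", "s3://data-lake/checkpoints/job_versions/")  # hypothetical path
    .trigger(processingTime="5 minutes")
    .start("s3://data-lake/staging/job_versions/"))             # hypothetical path
```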

This data was then loaded into two final datasets: one storing only the latest representation of each job, and another storing the full history of all job changes.

Using Delta Lake allowed us to deduplicate the data prior to loading it (using the MERGE action), and to update an already existing job with a new representation.
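
Here is a sketch of how that MERGE-based upsert into the "latest job" table can look with delta-spark and foreachBatch: deduplication keeps only the newest version of each job within a micro-batch before merging. Paths, the join key, and column names are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("jobs-latest-upsert").getOrCreate()

# Hypothetical paths for the intermediate job-versions stream and the final table.
job_versions = spark.readStream.format("delta").load("s3://data-lake/staging/job_versions/")
latest_jobs = DeltaTable.forPath(spark, "s3://data-lake/jobs_latest/")

def upsert_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch: keep only the newest version of each job.
    newest = Window.partitionBy("job_id").orderBy(col("updated_at").desc())
    deduped = (batch_df
               .withColumn("rn", row_number().over(newest))
               .filter("rn = 1")
               .drop("rn"))
    # Upsert: update jobs that already exist, insert the ones that don't.
    (latest_jobs.alias("t")
        .merge(deduped.alias("s"), "t.job_id = s.job_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(job_versions.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://data-lake/checkpoints/jobs_latest/")  # hypothetical path
    .trigger(processingTime="5 minutes")
    .start())
```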

The bottom line

Using a streaming approach for loading jobs now provides our ML algorithms with a full, up-to-date view of the data, including all changes and events in near real-time. We can learn faster and match our job seekers with more accurate and relevant jobs, increasing their satisfaction and shortening the time to their first day on the job.

By removing older job versions from DDB and backfilling S3 once with five years' worth of historical data, we have significantly reduced our DDB storage costs. We are now migrating our internal batch data consumers to use the S3 data lake instead of DDB, which will eliminate slow and expensive unloads of large datasets.

The planning and execution of this project took 10 months, and the Jobs team is now a permanent part of the organization, tasked with facilitating access to and usability of jobs data. For ZipRecruiter, it was also an experiment in introducing new technologies that we can now apply in other projects.

This article shares some insight into the challenges the ZipRecruiter Tech Team works on every day. If you’re interested in finding solutions to problems like these, please visit our Careers page to see open roles.
