Building A Nearline System For Scale And Performance: Part 1

Rolland He
Glassdoor Engineering Blog
Jul 24, 2020 · 7 min read

How we designed and built a production-scale close-to-real-time system to improve search relevance at Glassdoor

Note: This post is the first in a two-part series. Part 1 introduces the challenge at hand and provides a high-level overview of the design, while Part 2 fleshes out some of the technical considerations and tradeoffs.

Building production systems can be daunting, often due to the overwhelming number of design choices. In this post, I will discuss Glassdoor’s process for building a nearline system that prevents us from continually showing a user the same job listing. Such a feature could be used in numerous contexts, including desktop job search, email campaigns, and mobile notifications, with the overarching goal of improving search relevance. I will examine the options we considered at each stage of the design, including the cost and performance considerations of different data storage patterns and data processing technologies. For the sake of demonstration, I will focus on how this system can improve our email campaigns, but the same logic extends more broadly to delivering relevant content of all kinds to users.

What are we solving for?

At Glassdoor, we want to provide users with the most relevant information to aid in their job search. One way we can do this is through email campaigns. For instance, a user can subscribe to a job alert email for a specific search query; Glassdoor then surfaces relevant jobs to them through a daily or weekly digest email. These emails help users discover new jobs and help employers reach more prospective employees.

We want to surface consistently high-quality job listings that are tailored to each user. At the same time, we want to avoid showing the same job listings over and over. In the past we addressed this problem by only surfacing job listings created within the last day. This simple approach presents two challenges:

  1. Relevant job listings created more than a day ago are filtered out
  2. Users can potentially see the same job listings across separate daily email campaigns

Consequently, there was a clear opportunity to build a system that improved upon this simple approach.

The (Offline) Tortoise and the (Nearline) Hare

To launch our design process, we determined at a high level which approach would achieve the results we wanted. We didn’t want to spend time building a complex system if our requirements didn’t call for it. Conversely, we didn’t want to under-engineer a solution, only to learn that what we built wouldn’t achieve our goals.

Starting with the Basics

A straightforward implementation would be an offline system that filters out sent jobs:

  1. Log the jobs that we show for each user during email generation
  2. Schedule a batch processing job to read from those logs
  3. Write the data we need to some database

The next time we generated an email for a user, we would just read from that database and filter out all the jobs the user had already seen.
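
To make this concrete, here is a minimal sketch of that offline flow in Python. The store interface and the record fields (user_id, job_id) are hypothetical stand-ins for illustration, not our actual schema:

```python
from collections import defaultdict

def build_sent_jobs_table(impression_logs, store):
    """Batch step: aggregate logged impressions into per-user sent-job sets."""
    sent_jobs = defaultdict(set)
    for record in impression_logs:      # one record per job shown in an email
        sent_jobs[record["user_id"]].add(record["job_id"])
    for user_id, job_ids in sent_jobs.items():
        store.write(user_id, job_ids)   # persist for lookup at email-generation time

def filter_already_sent(store, user_id, candidate_jobs):
    """At email-generation time, drop jobs we have already sent this user."""
    already_sent = store.read(user_id) or set()
    return [job for job in candidate_jobs if job["job_id"] not in already_sent]
```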

This type of offline processing is common for many data systems because it can handle large amounts of data at once (i.e., high throughput), is straightforward to implement, and is relatively fault-tolerant.

This offline system allows us to increase our lookback window for jobs, since we now have a mechanism for filtering out sent jobs — challenge 1 solved. Before our project began, there was already such an offline system to help filter out repeated jobs for users.

Need for Speed

Why would we need to do additional work if we already had an offline system that functioned reasonably well?

The problem is that users are often subscribed to multiple email campaigns and may receive more than one email in a day. Our offline pipeline has relatively high data latency — sometimes as long as a day — since it runs infrequently and is computationally intensive. This means that if recent data hasn’t been fully processed yet and the user receives multiple emails in a day, we might end up showing the same job listing across these different emails. Hence, challenge 2 from above is not actually addressed.

To remedy this data latency issue, we needed a process that incurs much lower latency. We needed a real-time (online) or close-to-real-time (nearline) system.

All Systems Online (Nearly)

First things first: our system wouldn’t have to be fully real-time. After all, we probably wouldn’t be sending two emails to the same user within seconds of each other. When designing this part of the system, it was useful to know that the latency requirement was less stringent: we didn’t want to spend unnecessary time or system resources over-optimizing the data flow. This is why we ultimately ended up with a nearline system instead of a fully online one.

The nearline system needed to function similarly to the offline system: it should process raw data and write the needed information to a database table that we could easily query when showing jobs to a user. But instead of reading logged data on a periodic basis, the nearline system would need to continually process event data, writing to the table whenever new event information arrived.
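
As a rough sketch, that continuous path might look like the following. The post hasn’t named the queueing technology at this point, so Kafka (via kafka-python) and the DataStax cassandra-driver are assumptions for illustration, as are the topic, keyspace, and column names:

```python
import json

from cassandra.cluster import Cluster
from kafka import KafkaConsumer

session = Cluster(["cassandra-host"]).connect("search")
insert = session.prepare(
    # A TTL (here ~2 weeks, an assumed value) bounds retention automatically
    "INSERT INTO sent_jobs_nearline (user_id, job_id, sent_at) "
    "VALUES (?, ?, ?) USING TTL 1209600"
)

consumer = KafkaConsumer(
    "email-send-events",
    bootstrap_servers=["kafka-host:9092"],
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

# Continually consume send events, upserting one row per (user, job) impression
for message in consumer:
    event = message.value
    for job_id in event["job_ids"]:
        session.execute(insert, (event["user_id"], job_id, event["sent_at"]))
```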

The focus of this project, and this blog series, is the nearline portion of this system. I will also discuss how we connected these two parts of the system.

The Power of 2

With a nearline system in place, we would ideally be able to retire the offline pipeline to reduce system complexity and avoid redundant data processing. However, one important requirement prevented us from doing away with the offline system: determining whether a user has opened an email.

Glassdoor uses open tracking, based on data we receive from SendGrid, to determine which emails users have opened. If an email hasn’t been opened for some period of time, the user has not actually seen any of the job listings in it, so it doesn’t make much sense to filter those jobs out of future emails. To give users ample time to open their emails, the offline process handles open tracking. We currently don’t have any near-real-time source for this information, and therefore cannot tackle it solely with our nearline system.

Consequently, we needed to utilize the power of both approaches. The offline system provides ground-truth data for open tracking, whereas the nearline system gives us quasi-real-time knowledge of the job listings we have sent a user. By using the nearline system to prevent showing the same jobs in the short term, while leveraging the offline system to prevent showing the same jobs only if they have actually been seen, we could address both of our initial problems.
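
In pseudocode terms, the combined filtering rule looks something like the following; the store interfaces and field names are hypothetical stand-ins, not our actual code:

```python
def jobs_to_exclude(user_id, nearline_store, offline_store):
    """Union of both systems' suppression sets."""
    # Nearline: suppress anything sent recently, whether or not it was opened,
    # since the user may simply not have gotten to the email yet
    recently_sent = nearline_store.sent_job_ids(user_id)
    # Offline: suppress older jobs only when open tracking confirms they were seen
    seen = {
        impression.job_id
        for impression in offline_store.impressions(user_id)
        if impression.email_opened
    }
    return recently_sent | seen
```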

A Quick Peek at the Solution

The Eagle Eye View

Before I present the technical details of our full system design in the second article of this series, I want to share the big picture. Below is a representative architecture of the system I will discuss:

Full architecture for the representative nearline + offline system

The data flow is fairly straightforward. Let’s break down what’s going on:

  • API Layers — Each of our email campaigns performs a search when generating emails for users, just like our web search does. Similarly, we also perform searches when recommending jobs to users. This post focuses on email generation, but it’s useful to note that we can apply this system to all of our different searches.
  • UI/UX Layers — The part of the display we’re interested in is the job listings a user sees and interacts with, either on our website/app or in their email. We keep track of what the user has seen, in order to avoid showing them the same jobs in the future. As with the API layer, this post focuses just on the email display.
  • Offline processing — We log, process, and store the data from a user’s interactions and Glassdoor’s sent emails in a data warehouse.
    A scheduled batch job (using Airflow) processes the data, which we then store in a fast, queryable format in Cassandra tables. We maintain four versions of the table, which we call variants, each storing a different collection of impressions with processing logic that can vary by use case. We retain around two weeks’ worth of data in each table.
  • Nearline processing — An event queue stores data from all the emails that we send. We process this data and store it in a Cassandra table in a format similar to the offline tables. I will discuss this process in more depth in the next section.
  • Search Backend — The backend powers all of our search functionality and retrieves the job listings we want to show a user on the website/app or in an email. Part of this functionality includes filtering out jobs that the user has already seen (a rough sketch of this read path follows this list). We built our search backend using Apache Solr.
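
Putting the pieces together, the read path might look roughly like this. The pysolr client and all table, field, and host names are illustrative assumptions; only the general shape — Cassandra lookups feeding a negative Solr filter query — reflects the architecture above:

```python
import pysolr
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("search")
solr = pysolr.Solr("http://solr-host:8983/solr/jobs")

def search_jobs(user_id, query):
    # Union the offline (variant) and nearline impression tables for this user
    seen = set()
    for table in ("sent_jobs_offline", "sent_jobs_nearline"):
        rows = session.execute(
            f"SELECT job_id FROM {table} WHERE user_id = %s", (user_id,)
        )
        seen.update(row.job_id for row in rows)

    # Exclude already-seen jobs with a negative filter query
    params = {"rows": 20}
    if seen:
        params["fq"] = "-job_id:({})".format(" OR ".join(str(j) for j in seen))
    return solr.search(query, **params)
```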

Final Destination

When designing systems, it’s often useful to start with the end goal and figure out how to build a system based on those requirements.

For this particular pipeline, the desired end result is the ability to support real-time queries for all the jobs a user has seen. This means that there are certain latency/throughput requirements for this system — namely, querying the final database tables shouldn’t add more than 10–20ms to the search time.
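
One simple way to honor that budget — an illustrative pattern, not necessarily what we shipped — is a per-query client-side timeout that fails open:

```python
from cassandra import OperationTimedOut, ReadTimeout
from cassandra.cluster import Cluster

session = Cluster(["cassandra-host"]).connect("search")

def seen_job_ids(user_id, budget_seconds=0.02):
    """Fetch a user's seen jobs, spending at most ~20ms on the lookup."""
    try:
        rows = session.execute(
            "SELECT job_id FROM sent_jobs_nearline WHERE user_id = %s",
            (user_id,),
            timeout=budget_seconds,  # client-side timeout on this query
        )
        return {row.job_id for row in rows}
    except (OperationTimedOut, ReadTimeout):
        # Fail open: a repeated job is a better outcome than a slow search
        return set()
```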

Additionally, we wanted the nearline pipeline to have minimal latency, although we could certainly tolerate a couple of minutes of delay.

With this understanding of the challenge at hand and the nearline system we needed to build, I can start delving into some of the technical details. In the second post of the series, I will cover design considerations such as the data schema, the technologies we used, and the tradeoffs we made.
