3 Powerful Features of ZipRecruiter’s Search

Engineering Team
Published in ZipRecruiter Tech
11 min read · Oct 2, 2023

The intricacies of autosuggest, how even low yield queries continue the journey, and considerations when ranking the results page.

1. Autosuggest search with popularity sorting

1.1 Why autosuggest?

1.2 Ranking autosuggest results by popularity in a given timeframe

1.3 What you suggest affects your brand — filter it

2. Ensuring results for job-scarce locations or positions

2.1 The Related Searches feature and predictive model

2.2 Measuring the impact of Related Searches

3. Ranking search results with GBDT and predicted CTR

3.1 Step 1 — Decision tree scoring

3.2 Step 2 — Predicted click-through-rates (CTRs)

3.3 Ranking results is the single most influential driver of engagement

4. Zip’s balance between quick testing and robust pipelines

On ZipRecruiter’s job marketplace, the search bar is a popular starting point for job seeker journeys. It enables effortless interaction with our massive database of jobs, salaries and career insights, yet its complexity is often taken for granted.

Because we have millions of job listings currently active across the US, helping each person cut to the chase and find the right employment opportunity as quickly and hassle-free as possible is our core value proposition.

The Search Development team constantly improves the search experience and we’re excited to share 3 features that have contributed to making our product so successful. These methodologies can be applied to any search tool in almost any domain.

1. Autosuggest search with popularity sorting

1.1 Why autosuggest?

Perhaps the most common feature of any search these days is autosuggest. A user starts typing and 10 relevant suggestions pop up based on the letters typed so far. It may seem like a given, but let’s recap the ‘why’ for a moment.

The rationale is simple — anticipate what the user wants, help with standard phrasing, and provide examples of an expanded range of possibilities. In this way, we save a job seeker’s time while educating and adding value.

As it turns out, many users are in exploration mode — not exactly sure what they want or how to describe it. So, a few tips along the way are very useful. When a user types in ‘data’ they’ll quickly see options for ‘data entry’, ‘data analyst’, ‘data scientist’, ‘data engineer’, and more. If they’re not sure what a data scientist is or how much they’d make, the career pages will help them learn more.

Additionally, our research shows that autosuggest promotes a job seeker’s confidence in the system. When a user sees their intended search as part of the suggestion list, they know they’re on the right track, and by choosing from a system-suggested term we avoid embarrassing outcomes with no results.

To gain more insight, we launched two autosuggest variants with the same suggestions but in a different order. We found that people tended to click on whatever was higher on the list. There seems to be an intuitive assumption that whatever’s at the top is in some way ‘better’.

People tend to click on whatever is higher on the autosuggest list

The cherry on top is that some users who choose from the suggestion list may add even more words, thereby increasing the accuracy of their search and the likelihood that they’ll end up applying for a job.

Image 1 — ZipRecruiter’s autosuggest tool with highlighting, popularity ranking, and multiple intent types.

1.2 Ranking autosuggest results by popularity in a given timeframe

For the upgraded version of our search bar, we used an out-of-the-box open-source tool called OpenSearch, which text-matches the input string against an existing search index.

When developing the new search bar, we took advantage of an OpenSearch feature which allows weighting the suggested results for sorting purposes. In our case, we chose to sort suggestions based on how popular they were amongst user searches in the last 2 weeks.

We sort suggestions based on the last 2 weeks’ popularity

The logic behind this is that popular suggestions are more likely to be useful for the most people. For example “work from home” is a more popular search query than “wordpress developer” or “water polo instructor”. So, given the letter “w”, we prioritize the first option in order to serve the most people.
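For illustration, here is a minimal sketch of how weighted suggestions can be served with OpenSearch’s completion suggester using the opensearch-py client. The index name, field names, and popularity counts are assumptions for the example, not our production configuration.

```python
# Minimal sketch: a popularity-weighted completion suggester in OpenSearch.
# Index name, field names, and counts are illustrative only.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# The "completion" field type powers prefix matching; "weight" controls
# the sort order among suggestions that match the prefix.
client.indices.create(
    index="query-suggestions",
    body={"mappings": {"properties": {"suggest": {"type": "completion"}}}},
)

# Weight each suggestion by how often it was searched in the last 2 weeks.
recent_popularity = {
    "work from home": 152_300,      # hypothetical counts
    "warehouse driver": 48_100,
    "wordpress developer": 3_900,
}
for query, count in recent_popularity.items():
    client.index(
        index="query-suggestions",
        body={"suggest": {"input": [query], "weight": count}},
    )
client.indices.refresh(index="query-suggestions")

# Typing "w" returns the highest-weight matches first.
response = client.search(
    index="query-suggestions",
    body={
        "suggest": {
            "query-suggest": {
                "prefix": "w",
                "completion": {"field": "suggest", "size": 10},
            }
        }
    },
)
for option in response["suggest"]["query-suggest"][0]["options"]:
    print(option["text"], option["_score"])
```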

We chose 2 weeks as the ideal window for capturing behavior and creating the search index after some hyperparameter tuning. Analyzing our data revealed that 14 days is the shortest period that catches all head and torso queries as well as a good fraction of the trailing tail. Head queries are searched all the time (e.g. ‘entry level’ or ‘full time’), torso queries are searched very often (e.g. ‘nurse practitioner’ or ‘warehouse driver’) and tail queries are quite unique (e.g. ‘remote senior python engineer’). A shorter time period would mean missing out on tail queries, while a longer one wouldn’t add much value and would just lengthen the training run time (which is typically 30–60 minutes).
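To make that trade-off concrete, the coverage of each candidate window can be estimated directly from the search log. A rough sketch, assuming a pandas DataFrame of (query, user_id, ts) rows; the file name, column names, and the k=5 unique-user floor (reused from the filtering step described below) are assumptions.

```python
# Rough sketch: how many suggestible queries, and what share of total
# search volume, does each candidate window capture? Column names and
# thresholds are assumptions.
import pandas as pd

searches = pd.read_parquet("search_log.parquet")  # columns: query, user_id, ts
searches["ts"] = pd.to_datetime(searches["ts"])
now = searches["ts"].max()

for days in (7, 14, 28):
    window = searches[searches["ts"] >= now - pd.Timedelta(days=days)]
    users_per_query = window.groupby("query")["user_id"].nunique()
    kept = users_per_query[users_per_query >= 5]          # k=5 popularity floor
    coverage = window["query"].isin(kept.index).mean()
    print(f"{days}d window: {len(kept):,} suggestible queries, "
          f"{coverage:.1%} of search volume covered")
```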

In addition, we expanded the search index from only job titles to more fields, like companies, skills, job type (e.g. full time, part time, or remote), workplaces (e.g. hospital, movie theater), and industries. Notice some of these examples in Image 1.

As far as the UI goes, we made sure to follow autosuggest best practices like highlighting differences, keeping the list short, emphasizing the cursor location, and avoiding scrollbars.

1.3 What you suggest affects your brand — filter it

The results of our initial pilot made it clear that what we suggest affects not only the user experience but also the way our brand is perceived, and even how we influence the job market. Filtering out terms you don’t want suggested to a user, even if they are ‘frequent’, is crucial for providing a professional, world-class experience. Examples include profanity or legally questionable jobs.

These four filters clean up our historical query database prior to autosuggest index building (a simplified sketch follows the list):

Image 2 — Data pipeline and filtering for autosuggest
  1. Infrequent queries — the core of our autosuggest is popularity, so the first thing we do is filter out queries with fewer than k=5 unique users in the given period.
  2. Un-tagged queries — Overall, we avoid suggesting queries for which we know it will be hard to produce very relevant results. One type of tough-to-serve query is the ‘entity-less’ kind — those in which we can’t discern a clear job title, skill, company or other entity. For example, although the intention behind a search for ‘good pay’ is clear to a human, the query doesn’t contain a string that fits into any of our entity categories. Our Query Tagger algorithm (which you can read about in an upcoming post) reviews the queries from the last n weeks (n=2 in our case), filtering out those in which we can’t identify any entity. This is a data quality check that balances popular queries against our ability to serve them. Of course, a user will still be able to type ‘good pay’ voluntarily, and the results page will be based on text matching alone. Ideally, we could serve a list prioritized by salary, for example, but dealing with such unique (yet surprisingly frequent) queries is an entirely separate project.
  3. Inappropriate or disallowed words — We maintain an internal taxonomy of common words and phrases that are recognizable in our field; queries containing these words are obviously allowed. The taxonomy is augmented with a periodically updated list of additional “valid” and “invalid” words curated during data exploration. For instance, an additional allowed phrase was “ventilating”, as in “heating ventilation and air conditioning technician”, which we realized is a frequently searched and valid term that simply hadn’t made it into the company’s taxonomy yet. Explicitly disallowed autosuggestions include phrases referring to illegal or unethical work, which, even if they appear in real job postings, we avoid actively suggesting.
  4. Transposed phrases and similarities — In order to keep all 10 suggestions that pop up (or down) distinct and useful, we identify transposed phrases and highly similar ones, and show only the most popular version.
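Taken together, the four filters could be prototyped along the lines below. This is a simplified sketch: tag_entities, the word lists, and the token-set de-duplication are stand-ins for the internal Query Tagger, curated taxonomy, and similarity logic.

```python
# Simplified sketch of the four filters above. The Query Tagger and the
# allowed/disallowed lists are internal components; the stand-ins here
# are placeholders for illustration only.
from collections import Counter

K_MIN_USERS = 5
DISALLOWED_PHRASES = {"example disallowed phrase"}          # curated internally
KNOWN_ENTITIES = {"nurse", "driver", "python", "engineer"}  # stand-in taxonomy

def tag_entities(query: str) -> list[str]:
    # Placeholder for the internal Query Tagger: return recognized entities.
    return [tok for tok in query.lower().split() if tok in KNOWN_ENTITIES]

def build_suggestion_candidates(search_log):
    """search_log: iterable of (query, user_id) pairs from the last 2 weeks."""
    users_per_query: dict[str, set] = {}
    freq: Counter = Counter()
    for query, user_id in search_log:
        users_per_query.setdefault(query, set()).add(user_id)
        freq[query] += 1

    seen_keys, suggestions = set(), []
    for query, count in freq.most_common():                  # most popular first
        if len(users_per_query[query]) < K_MIN_USERS:         # 1. infrequent
            continue
        if not tag_entities(query):                           # 2. un-tagged
            continue
        if query in DISALLOWED_PHRASES:                       # 3. disallowed
            continue
        key = frozenset(query.split())                        # 4. transposed dupes
        if key in seen_keys:                                  #    (crude similarity check)
            continue
        seen_keys.add(key)
        suggestions.append((query, count))                    # (query, popularity weight)
    return suggestions
```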
Image 3 — Popularity ranked autosuggest lifts CTR by 6% versus no autosuggest.

Thanks to popularity sorting, smart filtering and adding multiple entity types to our autosuggest, we achieved a 6% lift in autosuggest engagement over the out-of-the-box autosuggest system.

Looking forward, we are considering incorporating the CTR of actual job listings related to the input query into the autosuggest ranking itself. In other words, ranking suggestions already at the query-input stage based on how likely people are to click through to a job post on the search results page.

2. Ensuring results for job-scarce locations or positions

2.1 The Related Searches feature and predictive model

Some specific searches or locations don’t produce many results. For example, a “Proctor” position in “Redding, CA” may only have a handful of job listings or none at all. For us at ZipRecruiter, finding opportunities for everyone, especially those in job-scarce areas, is a must. If people are willing to be flexible, Search must be too.

We created the ‘Related Searches’ feature which appears on sparse results pages and offers new search suggestions that are guaranteed to produce at least 5 results.

Image 4 — Related Searches feature on a low yield results page.

The Related Searches are produced by one of the following two algorithms:

  1. Order aware algorithm — A machine learning model takes into account previous users’ sequence of behavior in the time domain to predict what a job seeker might want to search for next, given their current search query. The vast scale of historical search data that we’ve collected over the years is key to achieving robust predictions.
  2. Order un-aware algorithm — A separate algorithm produces potential queries semantically similar to the low yield query. Any company with a search history database of any size can offer this.

The resulting static lookup tables map each of the most frequent ‘low result’ queries to its related queries. Related suggestions that themselves have fewer than 5 active job listings are removed, and those that remain are ordered by descending popularity (i.e. frequency).
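As a toy version of the order-unaware path, the lookup table could be built roughly like this. Token overlap stands in for the real semantic-similarity model, and active_job_count for a lookup against the live job index; both are assumptions for illustration.

```python
# Toy sketch of building a Related Searches lookup table. Token overlap
# stands in for the real semantic-similarity model; `active_job_count`
# stands in for a lookup against the live job index.
from collections import Counter

MIN_RESULTS = 5

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def build_related_searches(low_result_queries, query_frequency: Counter,
                           active_job_count, top_n: int = 5):
    table = {}
    for low_q in low_result_queries:
        candidates = [
            (q, freq) for q, freq in query_frequency.items()
            if q != low_q
            and token_overlap(low_q, q) > 0             # "related" stand-in
            and active_job_count(q) >= MIN_RESULTS      # must yield results
        ]
        candidates.sort(key=lambda pair: pair[1], reverse=True)  # by popularity
        table[low_q] = [q for q, _ in candidates[:top_n]]
    return table
```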

Interestingly, thus far our A/B testing has shown no noticeable difference in performance between these two models. Therefore, a given user will be served Related Search results from one of the models at random.

2.2 Measuring the impact of Related Searches

When testing the impact of the Related Searches feature, we measured how likely a user was to click on a job during their search session, across three use cases. These use cases differ in the number of jobs displayed around the Related Searches box.

Image 5 — Impact of the Related Search model on click likelihood for three use cases

As you can see, although the usefulness of this feature is clear when zero to five jobs are displayed, with more jobs than that it had no significant impact on user experience. The reason for this is that if a user’s journey is going well, i.e. they’ve been served many relevant jobs, they probably won’t be starting a new search.

We are aware, however, that our UI design choice may have impacted case #3. During testing we chose to insert the Related Searches box after the 5th result even on pages with many more results. Perhaps users just kept scrolling and if we had put the Related Searches at the end, it would have drawn more engagement.

3. Ranking search results with GBDT and predicted CTR

For any typical query, we fetch up to 1,000 job listings. The order in which they are presented influences which listing the user will engage with. Ideally, we want to serve the most fitting, most likely-to-succeed results first.

3.1 Step 1 — Decision tree scoring

Behind the scenes, a gradient boosted decision tree (GBDT) machine learning model takes into account different parameters related to the job listing and the user’s query, to come up with a score within a fixed range, say [-5,5]. Some examples of the parameters are: salary offered, employment type (part-time, full-time, remote), distance from job, text match of job, and semantic match of job. The score signifies the ‘relevance’ of each position to the query intent and the higher the score the higher a listing should be in the ranking.

In essence, we have a series of decision trees with yes/no questions. For a question like ‘Is the salary of this job greater than 80k?’ the model continues down a different decision branch depending on the answer. The next question may be ‘Is this job less than 5 miles from the user’s location?’ and so it goes. Each node, or question, adds a value and the outcome is the score.
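As an illustration of the idea, and not our production model, a gradient boosted tree regressor can be trained on query-job features to emit such a relevance score. The features, labels, and hyperparameters below are made up for the example.

```python
# Illustrative only: a small GBDT relevance scorer over query-job features.
# Features, labels, and hyperparameters are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 5_000

# Each row is one (query, job) pair: salary, employment-type match flag,
# distance in miles, lexical text-match score, semantic-match score.
X = np.column_stack([
    rng.uniform(30_000, 150_000, n),   # salary offered
    rng.integers(0, 2, n),             # employment type matches the query
    rng.uniform(0, 50, n),             # distance from the job (miles)
    rng.uniform(0, 1, n),              # text match
    rng.uniform(0, 1, n),              # semantic match
])
# Synthetic relevance label standing in for engagement-derived targets.
y = 2 * X[:, 3] + 2 * X[:, 4] + 0.5 * X[:, 1] - 0.03 * X[:, 2] + rng.normal(0, 0.2, n)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
model.fit(X, y)

# Score a candidate listing for a query; higher scores rank higher.
candidate = np.array([[95_000, 1, 3.2, 0.8, 0.7]])
print("relevance score:", model.predict(candidate)[0])
```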

3.2 Step 2 — Predicted click-through-rates (CTRs)

Although the previous step may seem like enough, it is only one side of the coin. Besides the relevance of a listing to the user’s query, business needs require us to estimate the likelihood that a user will click on a given listing (i.e. its CTR) after they see the results, and to incorporate that likelihood into the ranking ahead of time.

It turns out that a direct mapping from the [-5,5] score range to CTR can’t be made. Searches for ‘registered nurse’, ‘truck driver’ or ‘work from home’ could each produce the exact same GBDT score for a given job listing, yet have very different CTRs once the results are shown. Usually, vague queries like ‘full time’ have lower CTRs than specific ones.

Image 6 — different search queries have very different baseline click-through-rates

To solve this, we use the known CTRs of millions of historical queries to set a baseline for the specific query at hand. This, combined with the GBDT ranking and business considerations, gives the final search result ranking.
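The post doesn’t spell out the exact combination, so the sketch below is purely hypothetical: it maps the GBDT score to a multiplicative lift around the query’s historical baseline CTR.

```python
# Hypothetical illustration only: the exact formula is not specified in
# the post, so the mapping and constants below are made up.
import math

def predicted_ctr(gbdt_score: float, baseline_ctr: float) -> float:
    # Squash the [-5, 5] relevance score into a multiplicative lift
    # centered on the query's historical baseline CTR.
    lift = 1.0 / (1.0 + math.exp(-gbdt_score))   # 0..1, 0.5 at score 0
    return min(1.0, baseline_ctr * 2.0 * lift)

# The same GBDT score yields different predicted CTRs for queries with
# different historical baselines.
for query, baseline in [("registered nurse", 0.12), ("full time", 0.04)]:
    print(query, round(predicted_ctr(gbdt_score=1.5, baseline_ctr=baseline), 3))
```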

3.3 Ranking results is the single most influential driver of engagement

When we tested the impact of serving search results ranked by an ML model versus a rudimentary text-scoring method, we saw an astounding 7% lift in the ‘Impression Set CTR’ metric: the likelihood that, after a user searches for something, they’ll click on at least one of the search results.

This was a huge success and one of the biggest engagement lifts we’ve seen from a single feature.
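For concreteness, Impression Set CTR can be computed from event logs roughly as the fraction of searches whose result set received at least one click. The file and column names below are assumptions.

```python
# Rough sketch: Impression Set CTR = share of searches with >= 1 click
# on the result set. File and column names are assumptions.
import pandas as pd

events = pd.read_parquet("search_events.parquet")  # columns: search_id, event_type
clicked_per_search = (
    events.assign(is_click=events["event_type"].eq("job_click"))
          .groupby("search_id")["is_click"]
          .any()
)
print(f"Impression Set CTR: {clicked_per_search.mean():.1%}")
```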

4. Zip’s balance between quick testing and robust pipelines

All of these features rely on an ML model, a trending-query-and-behavior dataset, or the current active jobs dataset, all of which require refreshing on some periodic cadence.

At Zip, we strive to build PoCs fast and test them quickly! Once the value and impact of a feature is clear, we then set out to plan a more robust training pipeline and cadence for refreshing our models and datasets.

Obviously, this has to be balanced with other priorities. So, to decide when the time is right to start building a recurring training pipeline, we always have a primary metric. As long as that metric is exhibiting lift relative to the original baseline, we’re good as is.

If you’re interested in working on solutions like these, visit our Careers page to see open roles.

About the Author

Ritvik Kharkar has been a data scientist on ZipRecruiter’s Search team for over 2 years, where he builds ML models and tools that improve how jobs are served to users. He loves digging into new challenging projects every day, and is passionate about producing data science videos on his YouTube channel.
