How to Search for Data We Don’t Have

Published in

CarMax Engineering Blog

7 min readJul 29, 2024

Search Facets

The CarMax website and mobile apps allow customers to search for vehicles based on a variety of data points. These include a few different types of data:

Strings — These are simple fields where a vehicle has a single value (or null). Examples include Model (e.g. “4Runner” or “Civic”) and Car Size (e.g. “Compact” or “Midsize”).
Collections — These fields are also strings, but a single vehicle could have multiple values. Examples include Features (e.g. “Rear Seat Heaters” or “Parking Sensors”) and Body Type (e.g. “Convertibles” or “Sedans”).
Numbers — These fields could be integers or doubles. Like strings, each vehicle has a single value (or null). Examples include Price (e.g. $22,998) and MPG Highway (e.g. 31). For numbers, instead of searching for specific values, customers generally want to search with a maximum and/or minimum value.

Our search is powered by Azure Search. Each of the above items is used in Filters to limit the set of vehicles returned to those that the customer is most interested in. They are also used in Facets. Facets allow us to show a count of documents (vehicles) that match each value. For example, of the 100 cars we are showing you right now, 37 are “Compact”, 52 are “Midsize” and 11 are “Large”. The customer knows that if they choose to add a filter down to Compact cars, they will have 37 results to consider. For number fields, these Facets are able to work on intervals of a consistent size. For example, we use an interval of 1000 for price, so you can get counts of cars with a price of $20,000–20,999, a price of $21,000-$21,999, and so on.

Missing Data

This works well for us most of the time. But it turns out that our data isn’t always perfect. Sometimes we have missing data points. So you might see that, of the 100 cars we are showing you right now, 35 are “Compact”, 49 are “Midsize” and 8 are “Large”. This only adds up to 92 cars, with 8 that have no assigned Size value at all. Once you start filtering by size, there’s no way to see those 8 vehicles. It seems a little strange, but for certain filters, if you check off every possible option in the filter list to include them all, the number of results you see will decrease from where you started. Because those vehicles with null values are no longer returned.

Some fields we know won’t be populated on all cars. For example, Battery Range is applicable to Electric Vehicles and Plug-in Hybrids, but not to Gas and Diesel powered vehicles. Cab Style will only be populated for Pickup Trucks.

For many data points, every car does have a valid value in the real world, but sometimes we don’t know what that value is. This is obviously not ideal. Our first choice is to fill those gaps, and have every data point populated on every vehicle for which it is relevant. We are always working towards improving our data quality and making sure we give the customer all the information they need to make the right purchase for them.

Existing data gaps are a problem for several reasons. When a customer is looking at a specific car, we can’t give them all of the information about that car to help inform their decision. They can visit the store or reach out by phone or chat to get questions answered, but this is not very convenient.

When a customer is using our search page to filter down to the cars they are interested in, some cars that are actually relevant to their search may not show. In our previous example, say a customer filtered their results down to those 35 Compact vehicles. It may be the case that a couple of the 8 vehicles missing values are also Compact, but the user won’t see those.

Finding Vehicles With Missing Data

Some customers only want to see cars that they know have exactly what they want. That’s the way search on CarMax has worked for years. If you search for a car with All Wheel Drive, you don’t want to see any results that are not All Wheel Drive. Other customers are more concerned with making sure they don’t miss cars, especially if they are looking for something uncommon. For these customers, it’s more important to make sure that every single car with All Wheel Drive is returned for their search, even if it means a couple other cars slip in too.

We recently decided that we need to enable both scenarios. We want to leave the default behavior as it always has been — we only return cars that we know are a match. But also enable customers to opt-in to seeing cars where we aren’t sure. We are using the term “unspecified” for this data. And it’s specific to individual properties on a vehicle. A car might have most of its data complete, but have an unspecified value for Drivetrain, for example.

What we want to provide is another option in the filter that tells the user the number of vehicles that have an unspecified value that otherwise match their current search, and a way for them to choose to include those vehicles in search results. Azure Search didn’t make this easy.

Idea Number 1 — Just look for nulls

The first thing we considered was just leaving the data as it currently exists, with null values in the fields we don’t know. This is the most natural way to store things. Unfortunately it makes it hard to query.

When searching for a list of values using something like the search.in function, values must be real strings. It isn’t possible to include ‘null’ as one of the values in the list, so you have to add an extra clause to your search. Our searches are already a complicated mess of clauses ANDed and ORed together in various combinations, so this would only add to the mess.

And Facet counts don’t include null values. To get the count of vehicles with an unspecified value, we’d have to do a separate query for each field. We really didn’t want to do that — performance of the search page is important to us. This performance issue, far more than the complicated query syntax, led us to discard this idea before we even got to implementation.

Idea Number 2 — Special Unspecified Values

The next idea we tried was putting magic values in place of null. This would have actually been a fine solution if we were only dealing with string properties. We could store the string “unspecified”, and show it in the list of values alongside all the others. It doesn’t work as simply for numbers, but it was possible for us to build something functional.

For strings, we’d store “unspecified”; for collections of strings, we’d store [“unspecified”]; and for numbers, we’d store -99999. All of our numerical fields happen to only allow non-negative values, so this -99999 could never be confused for a real value.

In some ways this worked great. It was easy to search for this value alongside other values we were querying for, since it was the same type. It gets included in the Facet counts we get back from search. It provided just what we needed.

But it also required us to be really careful. Every single query we did on either vehicle data or Facets needed to have its results filtered so that these values weren’t shown to customers, or used in any backend processes. We obviously don’t want them showing up in the list of regular Facet values to choose from, or in the data we list about a car. (An MPG Highway of -99999 somehow wouldn’t look quite right.) And we had to be careful when filtering numbers by only a maximum value. If a customer is looking for cars with Mileage < 50,000, we don’t want to always include unspecified results even though -99999 < 50000.

And it’s not just the interface for customers — we have a lot of other support processes that use the same search indexes. We don’t want these values fed into our search typeahead suggestions, caught in text search processing, counted in certain monitoring jobs we run, etc.

We did implement this approach, but ultimately decided it was too risky and the values were going to leak out here and there, leading to unexpected and undesirable behavior.

Idea Number 3 — Track Unspecified Data in a New Field

As often seems to happen, our third idea turned out to be a winner. We ultimately added a new property to vehicles, a collection of strings, that holds the name of each field that is unspecified. So if a vehicle has null values for MpgHighway and Drivetrain, then the UnspecifiedFieldNames field would be [“MpgHighway”, “Drivetrain”]. This takes care of all of the issues we had with previous ideas.

It keeps the Unspecified data management in its own new space, separate from all the existing data, so we are confident it won’t leak through where we don’t want it.
It’s easy to filter for vehicles with Unspecified values for any field, regardless of the data type of that field, by simply looking for the field name in this collection. We can set up the filter the same way, whether the base field was a number, a string, or a collection of strings.
When getting Facet counts, we now always get counts for this field as well. That tells us how many vehicles have unspecified values for each field that may be in the list without any additional queries or work necessary.

This is a recent addition, and the UI to support it is still being built. But we are feeling great about having the functionality in the API in a way that will be easy to sustain in the long term. And this solution isn’t specific to vehicles in any way, it can be applied to any domain where users are filtering and faceting on a set of fields that may not always be populated.