Autocomplete with Crabs 🦀

Published in

comsystoreply

18 min readDec 22, 2020

Dear reader,

The times are dire! It’s cold, it’s dark, Corona keeps knocking at our doors and there is not a single Christkindlmarkt in sight. What better way to warm up one’s heart and soul during such times, than to do an exciting little side project and learn a thing or two about crabs in the process.

Not so long ago, a group of public transportation enthusiasts within our company Comsysto founded a focus team called “Public Transport”. Together, we spent a lot time working on ticketing and routing systems for our customers, and realized that all those systems had one thing in common: the need for a source and destination search. As the lazy software engineers that we are, we immediately asked ourselves the question: can’t we just extract and reuse that component? Yes we can! And to make it even more fun, let’s do it in Rust!

Note: In case you just want to play around with the end result before reading through the whole article, you can check it out here. 😉

Rule #1 for starting a side project: set some goals for the project, so you don’t get sidetracked or lost in details. So here we go:

Goals:

Build a working (and reasonably smart!) autocomplete engine specifically for Public Transport nodes in Germany
Learn Rust 🦀

But first things first, let’s define some of the terms for our use-case:

What is an autocomplete engine?

An especially interesting aspect of the source/destination search is the autocomplete feature. Autocomplete is so widespread and common nowadays, that we don’t event realize anymore that we are using that feature on a daily basis. Every key press we are performing on a search engine, map app or even within a chat app is immediately sent to the provider, analyzed and transformed into a (hopefully) helpful suggestion on what to enter next. With the term “autocomplete engine” I’m referring to the back-end application that is actually receiving and processing such queries. In order to be useful for the user, the response must be swift. Ideally, there shouldn’t even be a noticeable delay for the user between key press and seeing the suggestions returned by the autocomplete engine, but in reality this is hard to achieve. Besides the inevitable network latencies (around 10 to 300ms depending if the destination is close, or maybe on the other side of the planet), the actual processing time of the autocomplete engine is crucial. It should be built in a way, so that regardless of the content or size of the query, the engine’s processing time remains steady, which is essential for offering a consistently good user experience.

What are public transport nodes?

A public transport node is any scheduled stop position of a public transport vehicle (subway, bus, train, tram, ferry, cable car etc.), including additional meta information about this stop position, such as id, name, city, district, district code etc. There is an open source data set called ZHV, which contains all transport nodes of Germany. In the next section we will have a closer look at that data set.

Did I forget something? Oh yes, the 🦀

The crab actual has a name and it is called “Ferris”, the unofficial mascot of the Rust programming language . I will use Ferris throughout this blogpost to mark sections that are Rust specific in contrast to sections that are about the general concepts of an autocomplete engine. If you are not familiar with Rust, some parts of the code snippets will most likely leave you confused (e.g. life time annotations like &’a, which are a concept unique to Rust). I will try however to select code snippets that can also be read in a “pseudo-code” fashion, so you still can get a general understanding of the concepts, even when you don’t understand the Rust specific parts in full detail. In case you might acquire a taste for Rust while reading this blog post and you want to learn more, I can fully recommend this introductory tour for your first steps. You can also checkout Ferris’s personal website in case you want to see more of him.

Building blocks of an autocomplete engine

In this section we will have a closer look at the most important “moving parts” that our autocomplete engine is composed of. Note that there are a lot of different ways to implement such an application and the architecture and algorithms presented here are by no means intended as a paramount example. In hindsight some of the technical decision made during the implementation turned out to work quite well, and others still need some improvement. You can see the following as a collection of ideas and approaches that were already tried out, so you don’t have to.

The data

ZHV data set

If we want to offer smart autocomplete suggestions for public transportation nodes in Germany, then we need a suitable data set first of all. Luckily for us there is the non-profit organization DELFI e.V. , which is the result of a cooperation between the German government and public transport companies from all federal states. The goal of DELFI e.V. is to create consistent travel information, routing and ticketing services across federal state borders and, with a view to the future, also across European country borders. You can think of DELFI e.V. as a large data sink, where local public transportation companies from all over Germany ingest their data. This can be e.g. timetable information, pricing information, real-time traffic information and also information about public transport nodes. One of the data sets provided by DELFI e.V. is called “Zentrales-Haltestellen-Verzeichnis” (ZHV), which means “central station directory” and contains exactly the data we are interested in.

Example item from the ZHV-dataset:

The data set not only contains name and position, but also additional meta information like the city or DHID, which is a special unique identifier for a public transport node. The DHID itself also encodes some additional meta information, which we will use later on to calculate some interesting metrics for public transport nodes. All in all, the data set contains more than 730.000 public transport nodes.

2. OSM data set

Now that we are good to go with the ZHV data set that contains all public transport nodes, the question is if that is already enough for our purpose? The clear answer to that is: No! When we are looking at a state-of-the-art example for a spacial autocomplete service, like Google Maps e.g., we automatically expect to find street addresses and special point-of-interests (POIs) like shops, restaurants, sights etc. as well. Same goes for public transport specific searches of course, people not only want to go from station A to station B, but also get routing information from their current position to their grandmothers place e.g. This means we need more data! Luckily for us, there is one place that we can always go when we are in need of map data: OpenStreetMaps!

Beginning with the complete OpenStreetMaps data set of Germany , we use the opensource command line tools osmfilter and osmconvert to first extract the desired data (streets and POIs) and then convert it into a CSV file. The resulting file has a size of about 220 MB and still contains large amounts of duplicates, which is due to street segmentation within OpenStreetMaps and our tag selection during the filter step. After an additional cleaning step the final data set contains around 620.000 entries. Note that the final selection does neither contain every single street in Germany, nor every single POI, but it is a large enough sample for testing purposes.

Example items from the OSM data set:

Both data sets combined contain around 1.35 Mio. data points. Inside the application, both data sets are stored separately, so we can index, query and rank them individually, depending on the received query, which offers more flexibility. Some of the metrics and filters we will be using on the data sets later on, are specific to one of the data sets and cannot be used on the other. For reasons of simplicity, I will just use the term “public transport nodes” in the following and want to refer to any possible data point from either ZHV or OSM data set.

How it works

This sections is about the heart of the engine, the algorithm that transforms the user’s query (a sequence of chars), into a list of autocomplete suggestions, containing the most relevant public transport nodes. While playing around with different approaches and data structures to make the data sets searchable in an efficient way, it became clear very soon that the solution wouldn’t consist of a single metric alone. To achieve acceptable results, it would be necessary to apply different combinations of metrics to the data sets. After some back and forth we ended up with the following architecture for the autocomplete engine, which gives a lot of flexibility for calibrating and “rewiring” the individual parts.

The diagram below illustrates what happens inside the autocomplete engine when a user sends a query containing the letters “Ba”. On a very abstract level, the data flow through the individual components is the following:

The “Ba” query arrives at the autocomplete engine
Depending on the query content (e.g. query length, special keywords contained in the query), the engine selects a Strategy, which defines how to process the query further. A Strategy consists mainly of two building blocks: a Matcher and one or more Rankers.
The query hits the Matcher first, where it is transformed into a list of suggestions. A suggestion for the query “Ba” could be “Barbarastrasse” e.g. since it begins with “Ba”.
The list of possible suggestions now hits the Rankers. Job of a Ranker is to decide how well a suggestion fits the given query. This “how well” is indicated by a number between 0 and 1, where 1 means perfect fit and 0 means no fit at all.
From the ranked list of suggestions, the top 10 best fitting suggestions are returned to the user.

Data flow of a user’s query through the autocomplete engine

Here are some more detailed definitions of the terms and concepts, which in combination make up the autocomplete engine:

Matcher: a Matcher is a data set that can be queried for a list of suggestions (= data points from the data set), given a user query. Which suggestions from the data set are “matching” a query is up to the Matcher, the current implementation supports “starts with” and “contains” semantics. In order to return suitable suggestions as fast as possible, the indices use special data structures that are optimized for this purpose. E.g. to find out quickly if a query string is contained in one of the 730k ZHV items, we use a suffix array data structure provided by the awesome suffix crate.

Ranker: a Ranker takes the output of a Matcher (a vector of suggestions) and assigns a floating-point number from the range [0,1] to each suggestion, which indicates how well a suggestion fits the query. A Ranker can rank a suggestion based on any of the available attributes of this particular suggestion. This means that two different Rankers can have a very different opinion on the same suggestion. E.g. for Ranker-A a good fit might be a suggestion that is spatially close to the user, whereas for Ranker-B it could mean that the query contains the keyword “Flughafen”.

Selector: a Selector consists of a Matcher and one or more Rankers. The vector of suggestions that the Matcher outputs is given to every single Ranker of the Selector. Every Ranker computes a ranking value for every suggestion it receives. This means a single suggestion ends up with n ranking values if there are n Rankers defined for the Selector. The final ranking value for this suggestion is then simply computed as the weighted average over all n ranking values. After all ranking values are computed, the Selector sorts the suggestions and returns the highest ranked results.

Strategy: application of one or more Selectors to a user query, where the outputs of the selectors are merged into a single vector of ranked suggestions in the end. Depending on the query string (e.g. how many characters?) and if there is a current GPS position for the user available, a different Strategy can be chosen to answer the query. The application’s algorithm that transform a query string into a list of suggestions is therefore implemented as a “Strategy”.

Here is an example of a simple Selector configuration 🦀

Matcher::ZhvLandmark means that this particular Selector uses a Landmark-Index (see next section for details) computed over the ZHV dataset as Matcher. This also means that this particular Selector will only ever return elements from the ZHV data set. As you can see there are 4 different Rankers defined for this Selector. These are also no regular Rankers, but WeightedRankers, which means the ranking value that each of these WeightedRankers produces, does not equally contribute to the final ranking value. In this configuration for example, the rating given by the PrefixPattern Ranker is considered more valuable and decisive, compared to the rating given by the SpecialKeywords Ranker. Let’s have a closer look at one of the Rankers to better understand how it works.

Here is the implementation of the Ranker called “Distance” that is used in the Selector above: 🦀

This Ranker only uses the user’s GPS-position (position: Option<Location> parameter of the function) and the GPS-position of the suggestion (item.latitude and item.longitude) to calculate the distance between the two. The closer they are together, the higher the ranking value that this Ranker assigns to the particular suggestion. In case there is no user GPS-position available (the None branch of the pattern matching), the Ranker returns the value 0.5, which represents a “neutral” ranking value.

Any Ranker must implement the “Rankable” trait, which is defined as 🦀

To wrap everything up in the end, a Strategy is simply defined by the following trait, which any struct that wants to be a Selector must implement: 🦀

The required apply() method takes three arguments:

the query string that was submitted by the user,
an optional GPS-position of the user, which some of the Rankers can use to compute their ranking value,
the application state containing all data indices.

It returns a sorted vector of tuples containing a SaftItem (= internal representation of a user suggestion) and a floating-point number representing the ranking value. This vector of suggestion can be more less directly mapped into JSON and returned to the user.

As the data flow diagram from above shows, a user query can invoke different strategies within the autocomplete engine. Strategies can differ in the following points:

Matcher: which data set(s) should be used to answer the query?
Rankers: which Rankers should be used to decide which suggestions are most relevant?
Weight of the Rankers: how decisive should individual Rankers be for that decision?

This leaves us with an adaptable and flexible way of generating autocomplete suggestions, while maintaining a simple and straightforward program flow.

Landmark Indexing

In this section we will have a look at a concrete Strategy called “Landmark Indexing” that is used every time the user’s query string is really short. The name is somehow borrowed from a routing strategy called “Landmark routing”, because the idea behind is a bit similar. But before we go into depth here, let’s describe the problem first that “Landmark Indexing” tries to solve by comparing two different user queries, a very short query and a longer one:

“B”
“Oskar-von-Miller-”

Executed on the OSM and ZHV data set, query (1) has more than 100k matches, which means that the names of more than 100k public transport nodes start with a capital “B”. The matching itself is not a problem in this case, the used data structures (mostly the suffix array that was mentioned already in the last section) also provide excellent performance, when there is a large number of matches. The expensive part is the ranking however. To rank each of the 100k public transport nodes by multiple Rankers and also to sort them afterwards can become very expensive depending on the number of items. Although a single ranking operation might be cheap (e.g. check if a special keyword is contained in the query string), in total it takes time to process all matches. Running a full index search for single letter queries like this turned out to be quite inefficient and let to processing times far beyond 1s for some characters. Query (2) on the other hand only has around 100 matches across both data sets, which leads to lightning fast processing times of around 2ms. To sum this up: the longer the query, the cheaper it is to process it!

So what can we do to avoid such long running full index searches on very short queries? We can try to treat very short queries differently. Looking at query (1) again, the single letter “B” alone does not contain much information about what the user is looking for. Gathering every single public transport node starting with the letter “B” and then trying to rank all of them does not create much value for the user. Instead a smarter approach is to keep a separate smaller index of public transport nodes that are “more likely” to be selected by users and then just serve very short queries through this index. This not only reduces the processing times drastically, but also increases the probability that a user receives helpful suggestions upon his first key press. The next problem is then however: how can we know which public transport nodes are “more likely” to be selected? We don’t have any data on the frequency of public transport searches, but as a workaround, we used a metric based on the ZHV data set instead. Using the DHID of a ZHV item, it is possible to compute a size metric for public transport nodes. The hierarchical structure of the DHID makes it possible to count how many “sub-nodes” a public transport node has. Each platform of a large train station would be considered a “sub-node” for example. The logic behind it is the following: the more sub-nodes a public transport node has, the more transport connections are being served through this node, which means more people will be interested in this node. Our working hypothesis is therefore: on very short user queries it is most helpful for the user, if we just return the largest public transport nodes that we can find in the user’s proximity. These “largest transport nodes” are what we call Landmarks.

We haven’t talked about the proximity part though! Since we want to build an autocomplete engine that makes good use of the user’s GPS position, it makes sense to include it in this Landmark Indexing approach. Instead of building a single index, we chose the 200 biggest German cities and build an index for each by taking the 10k biggest public transportation nodes within a 30km radius around each city. If a user now submits a very short query together with his current GPS-position, we select the Landmark index that is closest and only consider the public transport nodes contained in this index for this query. By applying this strategy, we can achieve processing times as fast as 20ms, depending on the GPS-location and query string.

The following image is a visualization of the Landmark Index distribution across Germany. As you can see, there is a lot of overlapping, especially near Cologne, which means we are storing duplicates in this area. This approach could very well use some fine-tuning to be more memory efficient, but to demonstrate the concept, this coarse partitioning using 200 indices should be sufficient:

To summarize the idea of Landmark Indexing: for very short queries that are not distinctive enough yet, serve the user with some common Landmarks that match the query. As soon as the user entered enough characters for a more precise search, take all available data into account.

Performance

Let’s have a look at the overall performance of the autocomplete engine by doing some benchmark tests. Since it makes a big difference if a query is served using the Landmark Index or not, we also conducted two different benchmark tests to distinguish between both cases:

Queries of length 3 or shorter, which are always handled through a Landmark Index
Queries of length 4 or longer, which are never handled by a Landmark index

The Strategy configuration of the engine that was used for the benchmarks makes sure that all queries with more than 3 characters always trigger a full index search. Given this configuration, we took the “worst case” query length from both groups, which is query length 1 for the group (1) and query length 4 for the group (2). This means the most expensive query in terms of execution time for the Landmark Index is a single character query, because this results in the largest possible number of matching suggestions. The same holds true for queries of length 4. There are no shorter queries that trigger a full Index search, therefore length 4 queries will generate the largest possible result sets. Note that both benchmarks use randomly generated alphabetical characters in their queries. Also note that the “cut-off point” for Landmark processing was set to 3 characters, because for queries of length 4 or longer, the full index search always completes fast enough for our data sets.

The benchmarks are executed using the awesome criterion crate, which not only provides the wrapper code to conduct performance evaluations, but also outputs very nice reports with all the charts and statistics that you need.

The following two graphs are taken directly from the criterion reports and show the runtime probability distributions for both benchmarks:

On the left: random single character query evaluation, On the right: random 4-character query evaluation

Note: the benchmarks were executed on my local developer machine with an Intel® Core™ i7–8665U CPU @ 1.90GHz × 8 processor.

As the graphs show. the single character queries are still a lot more costly compared to the 4 character longer queries, although they are handled through a Landmark Index. The performance of both query types is somewhat acceptable however, the user will hardly notice any visible delay with an average execution time below 30ms. As long as there are no outliers!

One problem that the Landmark Index approach causes however, and that can lead to massive outliers potentially, is not reflected in the benchmarks. What happens if a user searches for “Y” and the target Landmark Index contains no public transport nodes that start with the character “Y”, but the complete data set does? In that case the user would receive an empty list of suggestions, although there would have been matching suggestions. This problem is currently solved by running a full index search in case the Landmark Index search didn’t retrieve any suggestions. This however can still be very inefficient in some cases and lead to unacceptable processing times. Maybe a better solution here would be to introduce some kind of caching for very expensive queries.

Conclusion

Having no prior knowledge about autocomplete engines before starting this side project, it was really fun to research and assemble the bits and pieces to make it work. Getting a glimpse of what is going on behind the scenes, makes me appreciate event more how well this feature actually works when we use it in our day-to-day life. Take one of the usual map service providers for example: a key press inside their search box does not only search for suggestions inside Germany, but worldwide, and still manages to offer good response times for the user.

The initial goal for the side project is also reached, we have a working prototype and I also managed to pick up some basic Rust skills on the way 🦀 . Still there is a long Todo list of optimizations and improvements, that I couldn’t implement for lack of time. Some major items on this Todo list that I would like to keep working on in the future:

parallelization of query processing: I already put some effort into parallelizing the initialization steps of the application startup to accelerate the feedback cycle during development, but I haven’t optimized the actual query processing yet. By using the rayon crate on some of the ranking operations, it should already be possible to achieve a noticeable performance increase.
more Rankers: the set of Rankers that is currently used is still rather limited and relies heavily on distance and public transport node size calculations. To make the autocomplete engine smarter, we can add more Rankers that focus on different attributes. One example: combining the ZHV data set with timetable data, we could calculate a “centrality” metric, that tells us how many public transport lines a single public transport node serves.
weight calibration: in its current state the autocomplete engine already consists of a lot weights and switches internally that need to be configured. Doing this manually by trial and error and coming up with appropriate test cases and heuristics is quite cumbersome and not effective at all. An interesting alternative approach would be to automate this through the use of machine learning models, maybe in terms of a Bayesian Optimization. The problem however: where to get suitable test data from?
user feedback: the best way to improve the system is still asking the users of the system, right?. In our case, it would be pretty easy to gather valuable user feedback by just saving the public transport nodes that a user selected in the end. And wait a minute! That data could then be used as test set for our automated weight optimizations 🙂

If you made it this far you must really be interested in autocomplete engines! The topic is indeed very exciting and there are a lot parts that I didn’t write about in this short blog post. Especially a technical deep dive into the data structures that we used for the indexing are another interesting topic worth writing about.

You can also try out the autocomplete engine yourself by using our Saft Engine front-end. To simulate a user’s current GPS-location, you can right-click on the map. The position is then indicated by a blue marker. Have fun playing around with it!

Any questions, suggestions and/or constructive criticism is always welcome! This autocomplete engine is also not the only public transport solution that we are currently building at Comsysto, so check out our public transport landing page for more!

Thanks for reading and happy coding over the holidays!

This blog post is published by Comsysto Reply GmbH