NeurIPS 2021 Announcement: The Billion-Scale Approximate Nearest Neighbor Search Challenge

George Williams
Published in Big-ANN-Benchmarks
May 18, 2021 · 8 min read

We are excited to announce that this year’s NeurIPS 2021 Conference will host a first-of-its-kind competition in large-scale approximate nearest neighbor search (ANNS). We will be inviting teams from the search community, both from academia and industry, to enter and submit their algorithms. Please see our official call-for-participation.

During the competition, participants will be evaluated across a set of challenging datasets, each containing at least 1 billion vector records. We have chosen benchmarks that compare algorithms by hardware cost, accuracy, and performance, and we’ll be posting the results and rankings on public leaderboards. Finally, at the NeurIPS 2021 Conference in December, we will announce the winning teams.

In this blog, we kick things off by providing some background on large-scale ANNS and touching upon our motivation for organizing this competition. We’ll also discuss some of the competition logistics, the timeline of events leading up to NeurIPS 2021, and, importantly, how you can participate.

What is Approximate Nearest Neighbor Search?

Approximate Nearest Neighbor Search (ANNS) is a problem of fundamental importance to search, retrieval, and recommendation. In this problem, we are given a dataset P of points along with a pairwise distance function, typically the d-dimensional Euclidean metric or cosine similarity, with d ranging from 50 to 1000. The goal is to design a data structure that, given a target k and a query point q, efficiently retrieves the k nearest neighbors of q in the dataset P according to the given distance function.
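
As a concrete (if naive) reference point, the exact version of the problem can be solved by brute force: compute the distance from the query to every point and keep the k smallest. Below is a minimal NumPy sketch of that baseline for the Euclidean metric; the dataset, query, and sizes are made-up placeholders, and at a billion points this is exactly the cost that practical systems try to avoid.

```python
import numpy as np

def exact_knn(P, q, k):
    """Return indices of the k nearest neighbors of q in P (Euclidean metric)."""
    # Squared Euclidean distance from q to every row of P.
    dists = np.sum((P - q) ** 2, axis=1)
    # Indices of the k smallest distances, sorted nearest-first.
    nearest = np.argpartition(dists, k)[:k]
    return nearest[np.argsort(dists[nearest])]

# Toy example: 100,000 random 128-dimensional points (the competition uses 1B+).
rng = np.random.default_rng(0)
P = rng.random((100_000, 128), dtype=np.float32)
q = rng.random(128, dtype=np.float32)
print(exact_knn(P, q, k=10))
```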

The challenge for ANNS algorithm designers is to craft a data structure that enables fast retrieval of the k nearest neighbors even as the database size grows. Several contemporary large-scale ANNS algorithms employ a class of data structure called an index, which eliminates the need to compute pairwise distances between the search query and every record in the database. It does this at the expense of some accuracy, which is where the “approximate” in ANNS comes into play. The trade-off between accuracy (typically measured as recall) and search speed (typically measured as query throughput or latency) is an important consideration when designing and evaluating ANNS performance.
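
To make that trade-off concrete, here is a minimal sketch (not a competition baseline) that uses the open-source FAISS library to build an inverted-file (IVF) index over synthetic data and sweeps a single search parameter, nprobe, which trades recall against throughput. All sizes and parameter values here are made-up placeholders.

```python
import time
import numpy as np
import faiss  # open-source similarity-search library from Facebook AI Research

d, n, nq, k = 128, 200_000, 1_000, 10   # toy sizes; the competition datasets hold 1B+ points
rng = np.random.default_rng(0)
xb = rng.random((n, d), dtype=np.float32)
xq = rng.random((nq, d), dtype=np.float32)

# An exact (flat) index provides the ground-truth neighbors used to measure recall.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, k)

# Approximate IVF index: partition the points into nlist clusters and,
# at query time, scan only the nprobe clusters closest to the query.
nlist = 256
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)
ivf.add(xb)

for nprobe in (1, 8, 64):               # larger nprobe: higher recall, lower throughput
    ivf.nprobe = nprobe
    t0 = time.time()
    _, found = ivf.search(xq, k)
    qps = nq / (time.time() - t0)
    recall = np.mean([len(set(f) & set(g)) / k for f, g in zip(found, gt)])
    print(f"nprobe={nprobe:3d}  recall@{k}={recall:.3f}  throughput={qps:,.0f} qps")
```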

In the past few years, we’ve seen a lot of new research and creative approaches tackling large-scale ANNS, including:

  • Hash-based, partition-based, and graph-based indexing strategies, as well as hybrid indexing approaches.
  • Mixing RAM and SSD storage to efficiently store and process large datasets that exceed the size of RAM (see the memory-footprint sketch after this list).
  • Running on accelerator hardware such as GPUs, FPGAs, and other custom in-memory silicon.
  • Leveraging machine learning for dimensionality reduction of the original vectors.
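
One reason the memory-oriented ideas above matter: a billion 128-dimensional float32 vectors already occupy roughly 512 GB before any index overhead, well beyond the RAM of a typical server. Compressed representations such as product quantization (PQ) shrink each vector to a handful of bytes. Below is a hedged FAISS sketch with toy sizes and illustrative parameters (not competition settings).

```python
import numpy as np
import faiss

d, n = 128, 200_000                      # toy sizes; 1B float32 vectors would be ~512 GB raw
xb = np.random.default_rng(0).random((n, d), dtype=np.float32)

# IVF + product quantization: each 512-byte vector is stored as an m-byte code.
nlist, m, nbits = 256, 16, 8             # 16 sub-quantizers x 8 bits = 16 bytes per vector
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)
index.add(xb)

print(f"raw vectors : {n * d * 4 / 1e6:.0f} MB")
print(f"PQ codes    : {n * m / 1e6:.0f} MB (~{d * 4 // m}x smaller, before index overhead)")
```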

While academic interest has seen an uptick, many of these algorithms have also made their way out of the research lab and into the real world. Several implementations of large-scale ANNS now appear in production and high-availability datacenter contexts, powering enterprise-grade, mission-critical, and web-scale search applications. In these deployment scenarios, benchmarks such as cost, preprocessing time, and power consumption become just as important as the recall-vs-latency trade-off.

The following are just some examples of the applications that leverage large-scale ANNS:

  • Document Ranking: Retrieve the most appropriate document for a user query from a large corpus.
  • Question Answering: Highlight the sentence or paragraph from a large knowledge store (e.g. Wikipedia) that best answers a question.
  • Entity Linking: Match a user query to a database of known entities.
  • Ads Retrieval: Match a user query to a database of ad keywords.
  • Image Match: Find the best match to an image from a pool of known images.
  • Text-to-Image Search: Find the images that best match a given text query.
  • Map Match: Find the best match to a picture from satellite data.
  • Extreme Classification: Find appropriate labels from amongst a billion candidates for a data point (labels here can be text categories, relevant pages, users, etc.).
  • Link Prediction: Predict links using graph embeddings.
  • Contrastive Sampling: Find hard negatives for training models using sparse data.
  • Molecule Similarity Search: Find structurally similar molecules to a virtual query molecule in computational drug discovery.

Why Do We Need This Competition?

Clearly, large-scale ANNS is a very important real-world problem that cuts across application domains. Despite the broad range of algorithms and approaches for ANNS, most empirical evaluations have focused on smaller datasets of around 1 million points. However, deploying recent algorithmic advances in ANNS for search, recommendation, and ranking at scale requires support at billion, trillion, or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale.

We believe that this challenge will be impactful in several ways:

  • Provide a comparative understanding of algorithmic ideas and their application at scale.
  • Promote the development of new techniques for the problem and demonstrate their value.
  • Provide a compilation of datasets to enable future development of algorithms.
  • Introduce a standard benchmarking approach.

We also believe that by unifying many of the interested parties in this problem (ranging from theory to possible applications), we will foster more cross-collaboration and collectively advance the field at a more rapid pace.

We hope to share what we learn during this competition in a research publication which will be publicly available.

What Datasets Will The Competition Use?

We will evaluate participants on the following billion-scale datasets:

  • Microsoft-Turing-ANNS is a new dataset being released by the Microsoft Turing team for this competition.
  • Microsoft-SPACEV1B is a new dataset being released by Microsoft Bing for this competition.
  • Facebook-simsearchnet is a new dataset being released by Facebook for this competition.
  • Text-to-Image search from Yandex is a new cross-modal dataset, where database and query vectors can potentially have different distributions in a shared representation space.
  • The BIGANN dataset, which contains 1 billion SIFT descriptors extracted from a large image dataset (see the file-format sketch after this list).
  • The Yandex DEEP1B dataset consisting of the outputs of a DNN for a billion images on the web.
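
For readers who want to explore the data early, the original BIGANN SIFT descriptors have historically been distributed in the .bvecs layout, where each vector is stored as a 4-byte little-endian dimension followed by that many unsigned-byte components. The sketch below reads such a file; the competition may ship its datasets in a different binary format, so treat this purely as an illustration (the file name is a placeholder).

```python
import numpy as np

def read_bvecs(path, count=None):
    """Read vectors from a classic .bvecs file (int32 dimension, then d uint8 values)."""
    with open(path, "rb") as f:
        d = int(np.frombuffer(f.read(4), dtype="<i4")[0])
        record = 4 + d                    # bytes per vector: header + components
        f.seek(0)
        nbytes = -1 if count is None else count * record
        raw = np.fromfile(f, dtype=np.uint8, count=nbytes)
    return raw.reshape(-1, record)[:, 4:]  # drop each per-vector dimension header

# Hypothetical usage:
# base = read_bvecs("bigann_base.bvecs", count=1_000_000)
# print(base.shape)  # (1000000, 128) for SIFT descriptors
```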

The datasets, as well as the ground truth nearest neighbors of a query set, will be made available for download prior to the start of the competition. Please note that datasets are subject to change. More details about the competition datasets can be found at our competition website.

How Are Participants’ Algorithms Evaluated?

Participants will submit their ANNS implementation as a Docker container (the required container format will be provided to participants before the competition begins). During the evaluation phase of the competition, we will download the container onto a private machine and run the participant’s code on each of the aforementioned datasets. We will support up to 10 different search parameter sets for each dataset.
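
The exact container layout and API will be specified by the organizers. Purely to illustrate the shape of a submission, here is a hypothetical Python wrapper that an evaluation harness could drive, with separate hooks for index construction and for each search-parameter set; none of these class or method names are the official interface.

```python
import numpy as np

class MyANNSubmission:
    """Hypothetical algorithm wrapper; NOT the official competition interface."""

    def __init__(self, metric: str = "euclidean"):
        self.metric = metric
        self.index = None
        self.search_params = {}

    def build(self, vectors: np.ndarray) -> None:
        # Construct the index over the full dataset here (placeholder).
        self.index = ...  # e.g. an IVF, graph, or hash-based structure

    def set_query_arguments(self, **params) -> None:
        # Called once per search-parameter set, e.g. {"nprobe": 64};
        # up to 10 such sets per dataset will be evaluated.
        self.search_params = params

    def query(self, queries: np.ndarray, k: int) -> np.ndarray:
        # Return an array of shape (len(queries), k) with neighbor ids.
        raise NotImplementedError("replace with the actual search call")
```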

The competition will leverage datacenter-grade machines that are under the complete control of the organizers. The exact hardware specification depends on the competition track. In the T1 track, we will constrain the size of RAM to 64GB. In the T2 track, in addition to 64GB of RAM, the machine will host a 1TB SSD drive.

In the T1 and T2 tracks, we will collect the following benchmarks:

  • search time (measured as recall vs. throughput; see the recall sketch after this list)
  • index build time
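
For reference, the recall side of “recall vs. throughput” is typically the average fraction of the true k nearest neighbors (from the published ground truth) that appear in an algorithm’s returned list, while throughput is queries answered per second. A minimal sketch of the recall computation, using made-up toy arrays:

```python
import numpy as np

def recall_at_k(returned_ids, ground_truth_ids, k):
    """Average fraction of the true k nearest neighbors found, per query."""
    hits = [
        len(set(ret[:k]) & set(gt[:k]))
        for ret, gt in zip(returned_ids, ground_truth_ids)
    ]
    return float(np.mean(hits)) / k

# Toy usage: 2 queries, k = 3.
returned = np.array([[4, 9, 7], [1, 2, 3]])
truth    = np.array([[9, 4, 5], [3, 2, 8]])
print(recall_at_k(returned, truth, k=3))  # (2/3 + 2/3) / 2 ≈ 0.667
```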

Note that we will exclude participants that are unable to build an index within a certain number of days. This time limit, the exact details of how we collect and compute the benchmarks, how the benchmarks translate into leaderboard rankings, as well as specific machine and operating system details, will be made public soon, before the competition begins. The latest updates can be found on the competition website.

What If I Have Custom Hardware?

The T3 track of the competition will support non-standard hardware such as GPUs, AI accelerators, FPGAs, and custom in-memory silicon. In this track, participants will either (1) send their hardware (such as PCIe boards) to us at their own expense, or (2) give us remote access to their systems.

In the T3 track, we will collect the following benchmarks:

  • search time (measured as recall vs. throughput)
  • index build time
  • cost and power (measured as throughput/watt and MSRP/watt)

We will ask participants to provide details and proof of their hardware’s cost in the form of its manufacturer’s suggested retail price (MSRP).

As in the T1/T2 tracks, we will exclude participants that are unable to build an index within a certain number of days. This time limit, the exact details of how we collect and compute the benchmarks, and how the benchmarks translate into leaderboard rankings will be made public on our competition website before the competition begins.

What’s The Competition Timeline?

We roughly aim for the following timeline:

  • May: Release of data, guidelines, and the call for participation. The competition registration website goes online.
  • Mid-June: Baseline results, testing infrastructure and final ranking metrics released.
  • June 30th: Participants in need of compute resources to submit an expression of interest.
  • Mid-July: Allocation of compute resources.
  • July 30th: Final deadline for participants to submit an expression of interest through CMT.
  • October 22nd: End of competition period. Teams release their code in containerized form and complete a pull request to the eval framework with code to run their algorithms.
  • October 29th: Participants submit a brief report outlining their algorithm and results.
  • Mid-November: Release of preliminary results on standardized machines. Review of code by organizers and participants. Participants can raise concerns about the evaluation.
  • Early December: Final results published, and competition results archived (the competition will go on if interest continues).
  • During NeurIPS: Organizers will provide an overview of the competition and the results. Organizers will also invite the best entries (including leaderboard toppers and promising new approaches) to present an overview for further discussion.

For the latest updates on the timeline, please visit the competition website.

Who’s Running This Competition?

The following people have generously donated their expertise, time and resources to organize and run this competition:

  • Harsha Vardhan Simhadri, Microsoft Research India
  • George Williams, GSI Technology
  • Martin Aumüller, IT University of Copenhagen
  • Artem Babenko, Yandex
  • Dmitry Baranchuk, Yandex
  • Qi Chen, Microsoft Research Asia
  • Matthijs Douze, Facebook AI Research
  • Ravishankar Krishnaswamy, Microsoft Research India and IIT Madras
  • Gopal Srinivasa, Microsoft Research India
  • Suhas Jayaram Subramanya, Carnegie Mellon University
  • Jingdong Wang, Microsoft Research Asia

How Do I Enter The Competition?

Registration is now open! Please visit the official call-for-participation for the details.

What’s Next?

We will be blogging throughout the competition on various topics, including:

  • Registration instructions and more details regarding competition logistics and rules.
  • Detailed specification of the evaluation machine hardware and operating system.
  • Further details on how we collect and compute the competition benchmarks.
  • Descriptions of various kinds of ANNS algorithms and how they work.
  • Updates on leaderboard activity.
  • Other late-breaking news and announcements.

So please stay tuned here or get the latest updates at the official competition website.
