Design a proximity server like NearBy or Yelp Part — 1

Salil Arora
The Startup

--

What are Proximity Servers?

Proximity servers are applications like NearBy or Yelp which are helpful in discovering attractions like restaurants, theaters, temples, or recreational places that are adjacent to your location.

Functional Requirements

  1. Search functionality is the core of a proximity server. Users should be able to search all the proximate places within a given radius for any location (latitude and longitude).
  2. Users can add new places or edit the old ones with the basic details like images, brief description, etc.
  3. Users can give ratings(1 to 5 stars) and reviews(text and images) to places.

Non-Functional Requirements

  1. The system should be highly available with real-time search experience and minimum latency.
  2. We can expect this application to be read-heavy as there will be a myriad of search requests compared to the frequent addition of new places.

Scale Estimation

We can estimate the total number of places in the system to be 200 Million with 100K requests per second. Considering the future scale for 5 years with 20% growth per year we should build our system for at least a scale of 5 years.

  • Our system should be ready to manage the mammoth scale of 400 Million.
  • Our system should be able to handle the load of 200K requests per second.

Basic Database Schema

APIs Design:

1. Search API:

GET /searchRequest: query_param = {
search_terms: "Pizza",
radius: 10, (in kilometres)
category_filter: "Italian Restaurants", (optional field)
filter: "5 start rated restaurants", (optional field)
maximun_no_of_results: 20 (optional field)
}
Response: A JSON containing information about a list of popular places matching the search query. Each result entry will have the place name, address, category, rating, and image.

2. Add places:

POST /placesRequest: body = {
place_name: "Julia's Kitchen",
latitude: 85,
longitude: 10,
category: "Restaurant",
description: "Best Italian restaurant",
photos: ["url1", "url2", "url3"] (Array of photos)
}
Response: Response of the newly created place with place_id

3. Add reviews:

POST /reviewsRequest: body = {
user_id: "uid121",
place_id: "place1",
rating: 4, (integer between 1 to 5)
review: "Restaurant was good and clean",
photos: ["url1", "url2", "url3"] (Array of photos)
}
Response: Response of the newly created review with it's id.

System Design

Before moving to the detailed component design, we need to decide where are we going to store the data. Let’s see what are different ways to store this data and also find out which method will suit best for our use cases:

1. Simple SQL based storage

Places, reviews, User details can be stored in SQL DB easily and latitude and longitude can be indexed for search optimization.

Basic Search Flow:
To find all the places nearby for a given location with latitude and longitude (100, 85) within a radius 10Km, the query will be :

Select * from Places where latitude between 100–10 and 100+10 and longitude between 85–10 and 85+10);

Considering our scale 400 Million places, this query will not be efficient for such a mammoth load as we have two separate indexes and each index can return a huge list of places and performing an intersection on these two lists will be slow.

If we can reduce the size of these lists then the query performance could increase. Another approach to solving this problem more efficiently is by using the concept of 2D grids.

2. Grids

We can divide the entire world map into small squares. Considering the area of the earth is 500 Million square km, and having a fixed search radius is 10km. We will have 500M/10 = 50 Million squares with a fixed grid size of 10km. With fixing the grid size to the query radius, we will only need to search within the grid and it’s neighboring 8 grids.

Every place with location (latitude and longitude) will belong to a specific grid. Each grid will have a unique id which can be indexed and stored in the places table.

Source: https://www.earthdatascience.org/courses/use-data-open-source-python/intro-vector-data-python/spatial-data-vector-shapefiles/geographic-vs-projected-coordinate-reference-systems-python/

Basic Search Flow:
As our grids are statically defined with search radius being equivalent to the grid size, therefore, we can find the grid id for any location and it’s neighboring 8 grids. To find all the places nearby for a given location with latitude and longitude(100, 85) within a radius of 10Km.

  1. Get the main grid id and it’s 8 neighboring grid ids :
Left:         ([x1-10,x2-10], [y1,y2])
Right: ([x1+10,x2+10], [y1,y2])
Top: ([x1,x2], [y1-10,y2-10])
Bottom: ([x1,x2], [y1+10,y2+10])
Top-Left: ([x1-10,x2-10], [y1-10,y2-10])
Top-Right: ([x1+10,x2+10], [y1-10,y2-10])
Bottom-Left: ([x1-10,x2-10], [y1+10,y2+10])
Bottom-Right: ([x1+10,x2+10], [y1+10,y2+10])

Execute the following query:

Select * from Places where Latitude between 100–10 and 100+10 and Longitude between 85–10 and 85+10 and GridID in (mainGridID, Left, Right,Top,Bottom, Top-Left, Top-Right, Bottom-Left, Bottom-Right)

This will definitely reduce and improve the overall runtime of the query as the scope of the search query execution got reduced to just 9 grids.

Data Caching
We can make it faster by storing the grid’s number and the list of its places in memory. As discussed above, we will be having 50 Million grids and we can assume that each grid id will be of 6 bytes and as with the mammoth scale of 400 Million places, we can assume the place_id to be of 8 bytes. Therefore the total memory required to cache grid ids and place ids would be 4 GB.

(50M * 6) + (400M * 8) ~= 4GB

Problems with this approach

  1. This approach will execute slow for popular places like Dam Square(Amsterdam) or Sentosa (Singapore) and this can cause an imbalance in the grids where some grids will be densely populated and others will be sparsely populated with places like coastal regions or islands.
  2. If at all we can dynamically adjust the grid sizes by keeping a threshold of a maximum number of places in a grid but this approach will not be a cakewalk and will furthermore add more implementation complexities.

What’s Next?

A surfeit of doubts may definitely arise after seeing the challenges with the grid-based design approach such as:

  1. How can we solve the problem of grid imbalance?
  2. What should be the final system design?
  3. How we can solve the extremely critical non-functional requirements of high availability and fault tolerance?

Don’t worry, all these questions will be answered in the coming part.

Coming up in this series

There are many more topics we need to discuss in detail. In the next few pieces, I’ll write in detail about the following topics:

  • Design a proximity service like Nearby or Yelp Part — 2.
  • Design Notification System.
  • Design a movie ticketing system like BookMyShow or TicketMaster.

Thanks for reading.

--

--