Why we didn’t use the Google Search API on App Engine to search through 200,000 meals.

Joachim
Devs @ FOODit
Apr 13, 2015

A new feature we’re working on required us to learn how best to implement fast location and text-based searches. For example, finding a restaurant within walking distance that has tofu on the menu.

Because the number of restaurants in a given area can be very high (for example, over 21,000 dishes within a 2-mile radius in Central London) and each menu is extensive, we wanted a solution that performed well in simple tests.

Options

Based on initial research, we had a few options for performing this search:

Using our existing App Engine Datastore

Currently we use Google Datastore for all of our data. Google Cloud Datastore is a fully managed, schemaless database for storing non-relational data.

It is fast, and it can grow and scale as we need it to. However, it was not designed for searching: finding a particular record requires a number of indexes, and it has no ability to query by coordinates or address.
We needed something better for searching.

Google BigQuery

We are using BigQuery for data warehousing and analytics. It can do location queries, and it can do text search.
The limitation that rules it out is the speed at which results are returned. By design, BigQuery takes around 3 seconds to return a result set, which is too long for a user-facing search!

Google App Engine Search API

One of the products in the App Engine family is the Search API. It is a fully managed document store designed for full-text and geolocation searches.

This could be a viable option, so we needed to test its feasibility…

To simulate restaurant menu information, I generated 100,000 documents containing made-up names, descriptions (thanks, baconipsum.com) and geo coordinates, then loaded these into the Search API as separate documents.
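The original generation script isn't shown in the post; as a rough sketch, the test data can be produced along these lines. The field names, the London centre point and the coordinate spread are illustrative assumptions, and random strings stand in for the baconipsum.com descriptions:

```python
import random
import string

def make_fake_meals(n, centre=(51.5074, -0.1278), spread=0.05):
    """Generate n fake meal documents with made-up names and geo
    coordinates scattered around a centre point (here: Central London)."""
    meals = []
    for i in range(n):
        name = "".join(random.choices(string.ascii_lowercase, k=8))
        meals.append({
            "id": str(i),
            "name": name,
            "description": f"Placeholder description for {name}",
            "lat": centre[0] + random.uniform(-spread, spread),
            "lng": centre[1] + random.uniform(-spread, spread),
        })
    return meals
```

Each dictionary then becomes one document in whatever store is being benchmarked.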

Then we searched the records to find all meals within a 5 km radius of a set point, and timed the responses.
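The query itself isn't reproduced in the post, but as an illustration of what a 5 km radius search has to compute, here is the great-circle (haversine) distance check in plain Python; the field names follow the generated documents above and are assumptions:

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two (lat, lng) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def meals_within(meals, lat, lng, radius_km=5.0):
    """Filter meal documents to those within radius_km of (lat, lng)."""
    return [m for m in meals
            if haversine_km(lat, lng, m["lat"], m["lng"]) <= radius_km]
```

A managed search service does this filtering server-side against a geo index rather than scanning every document.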

After some investigation and tweaks, we noticed several things about the Search API that ruled it out:

  • No ability to do partial matches within the text search.
    Therefore if someone searches for “cheese” we will not return “cheeseburger”, because only whole-term matching is possible. (http://stackoverflow.com/a/13171181)
  • The results are inconsistent.
    Especially when limiting results to a distance, we often received results that did not fit the criteria. The workaround is to manually validate and remove invalid results. We also noticed some inconsistency around text searching: a search for “beef” would sometimes fail to return documents containing “beef”.
    (https://code.google.com/p/googleappengine/issues/detail?id=8824)

Lucene on App Engine

Still wanting to remain in the App Engine ecosystem we attempted to run a project that allows Lucene to operate on App Engine (https://code.google.com/p/gaelucene/).

Although this gets around the text- and location-search issues above, Datastore now becomes the bottleneck.

The project takes the index files that Lucene creates and stores them as blobs within datastore. Due to the size restrictions within Datastore, the Lucene port is effectively sharding indexes across multiple rows.
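Datastore entities have a hard size limit (roughly 1 MB), which is why the index files have to be split. A minimal sketch of that kind of blob sharding, with assumed names and chunk size rather than gaelucene's actual code:

```python
CHUNK_SIZE = 1_000_000  # ~1 MB, around Datastore's per-entity limit

def shard_blob(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a Lucene index file's bytes into Datastore-sized chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def unshard_blob(chunks):
    """Reassemble the original index bytes from the stored chunks."""
    return b"".join(chunks)
```

Every search then has to fetch and reassemble these chunks before Lucene can read the index, which is where the latency comes from.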

When we increased the number of records in the Lucene index, the response times increased dramatically (doubling the records doubled the response times), and the only way to mitigate this was to use a larger instance size within App Engine.

Instance Size — Average Search Response Time (milliseconds)
F1 = 664.3
F2 = 682.6
F3 = 406
F4 = 302.4

This variation in response times, the inability to tune or properly back up Lucene, the age of the supported version (2.3.1, where the current release is 5.0.0), and the general hackiness rule this option out.

ElasticSearch on Compute Engine VM Instance

Although throughout development we have tried to keep dev-ops to a minimum, running our own search application within a virtual machine is the current preferred option.

Using Elasticsearch allows us to perform full-text searches with like-style queries, partial matching, and stemming (reducing words and phrases to their root form). (http://www.elastic.co/guide/en/elasticsearch/guide/master/languages.html)
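Partial matching is the capability the Search API lacked. One way Elasticsearch supports it is an edge n-gram analyzer, which indexes every prefix of a term so that “cheese” matches “cheeseburger”. A sketch of such index settings, with illustrative field and analyzer names rather than our actual mapping, plus a tiny emulation of what the filter indexes:

```python
# Illustrative index settings: an edge_ngram token filter applied to
# the "name" field, so prefixes of each term are indexed.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram", "min_gram": 2, "max_gram": 15,
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"name": {"type": "text", "analyzer": "autocomplete"}}
    },
}

def edge_ngrams(term, min_gram=2, max_gram=15):
    """Emulate the prefixes an edge_ngram filter indexes for one term."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]
```

Because “cheeseburger” is indexed as “ch”, “che”, …, “cheese”, …, a plain match query for “cheese” now finds it.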

Because it is a standalone product there are various resources for backing it up, customising it and tuning it as we need.

Running Elasticsearch on a Compute Engine VM instance means that traffic from App Engine to our search engine stays within the Google ecosystem. A new Compute Engine feature also makes its logs visible in the same place as App Engine's, which makes analysis and debugging easier. (https://cloud.google.com/logging/docs/)

Now what?

We are currently using Elasticsearch to search through 112,818 real restaurant meals.
As we add data, features and requirements to our search APIs, we may need to reassess which search platform we use, especially around the growth and scaling of Elasticsearch.

But for now, the feature set, response times and customisability of Elasticsearch on Compute Engine outweigh the auto-scaling and ease of management that the Google Search API would give us.

Joachim Davies is a Senior Java Developer at FOODit — a young startup with a passion for food and tech, creating software to help independent restaurants grow. FOODit is always on the lookout for talented developers and is currently hiring. Connect with us via LinkedIn and Twitter.
