How We Lowered the Cost of Fuzzy Matching Hybrids at Scale by 60% (and made it faster!)

Granular Engineering · Oct 9, 2020

Zac Oler (Software Architect) and Jeff Glover (Senior Software Engineer II)

Farmer pouring a bag of Pioneer corn seed into the hopper of a planter.

The Problem

In the Ag Tech industry, it is necessary to store canonical catalogs of crops, the types of crops grown on commercial farms, and inputs, the seed, chemical, and fertility products used in farming. These catalogs allow accurate representation of farm records: what is planned to occur or what has occurred on the farm. These catalog services have standard list and get-by-ID endpoints for retrieving crops or hybrids. Additionally, they have a matching endpoint that performs fuzzy matching on the names typed in by the operator of a tractor, harvester, or applicator, also known as machine data.

In our first pass at building these services for our platform, we used a kops-based, EC2-hosted Kubernetes (K8s) cluster to host the API and an Amazon RDS database running PostgreSQL. Furthermore, we built Amazon Kinesis streams using the event sourcing pattern so that data could be replicated to other systems that needed it, such as our pricing service. This design got us through the initial launch of the platform on which we released Granular Insights.

After the release, it became clear that our catalog services would not scale sufficiently, and they required large relational databases. Additionally, load tests that we ran against these APIs exposed poor response times. Under a load of 100 requests per second, we saw the average response time for our inputs service climb to over 3 seconds for simple get and list endpoints, and to greater than 10 seconds for our matching endpoint (with similar response times from our crops service). These numbers were unacceptable long term, and we hold ourselves to higher standards, so we went back to the whiteboard. As we deliberated over the state of our catalog services, we decided we needed to rewrite them to be more scalable, performant, and ready for our future needs.

Enter Serverless and AWS-Hosted Solutions

Even though we had made investments in running our APIs on K8s, we discovered during a recent POC that using Kong API Gateway to invoke AWS Lambda functions was a more cost-effective and scalable way of hosting our new APIs. While it would “cost” time (and thereby money) to redo our catalog services, we knew this might be our only chance to do so before we launched new products on our platform (which would increase the risk of changing it). Note: we recognized there is no such thing as a free lunch (except on Mondays and Wednesdays in the office, back when we could be there before the COVID-19 pandemic!) with Lambda-hosted APIs. With Lambda, we would incur “cold-start” costs that add latency to our APIs. To help mitigate the cold-start latency, we used provisioned concurrency. Combined with better choices in our data layer, this architecture would considerably improve the response times of our catalog services.
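To make that mitigation concrete, here is a minimal sketch of configuring provisioned concurrency with boto3. The function name, alias, and capacity are hypothetical placeholders, not our actual configuration.

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep a pool of execution environments initialized for a published alias,
# so requests routed through it skip the cold-start initialization phase.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="catalog-inputs-api",   # hypothetical function name
    Qualifier="live",                    # alias (or version) to keep warm
    ProvisionedConcurrentExecutions=10,  # illustrative capacity; tune to load
)
```

The same setting can of course be managed through CloudFormation or Terraform instead of API calls; the point is that a fixed number of warm environments absorbs the cold-start penalty for steady traffic.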

For the data layer, we used Amazon DynamoDB and Amazon Elasticsearch Service. This proved to be a match made in the cloud. We wanted the performance and simplicity of DynamoDB for the core datastore because it provided the fast key/value access patterns common to catalog services. It also gave us a consistent, transactional database from which we could easily create an event sourced stream. We chose Elasticsearch to handle our needs for searching, matching, and querying our catalog data because, well, it is in the name: “search.” Elasticsearch, the industry-leading tool for plaintext searching and fuzzy matching, was perfect for the high request-per-second load of fuzzy matching calls from processing machine data.
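For illustration, a catalog get-by-ID in this design reduces to a single-key DynamoDB lookup. Here is a minimal boto3 sketch; the table name and key schema are hypothetical.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("inputs-catalog")  # hypothetical table name

# Get-by-ID is a single-key read, which DynamoDB serves in consistent
# single-digit-millisecond time regardless of how large the catalog grows.
response = table.get_item(Key={"id": "hybrid-1234"})  # hypothetical key schema
item = response.get("Item")  # None if the ID is not in the catalog
```

Because every retrieval is a key lookup rather than a SQL query, response times stay flat as the catalogs grow.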

Architecture diagram pictorially representing the connections described in the article.
Figure 1 — Service Architecture

To build the event sourced stream of data from DynamoDB, we connected an event source mapping from DynamoDB Streams to our “producer” Lambda. The producer’s job is to translate changes made to rows in DynamoDB into ProtoBuf messages and publish them onto a Kinesis stream. Next, we hooked up another event source mapping from Kinesis to an “ingest” Lambda (this helped us dogfood our Kinesis event sourced stream). Ingest’s job is to populate our Elasticsearch index. Our initial implementation allows us to fuzzy match the names of crops and products via trigram analysis. This index yields comparable (though not identical) results to the trigram index we had used in RDS. However, unlike our RDS implementation, it beat our desired response times under load.
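As an illustration of the trigram approach, here is a minimal sketch of how such an index could be created and queried with the elasticsearch-py client. The endpoint, index name, and field are hypothetical, and these are representative analyzer settings rather than our exact configuration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://search-catalog.example.com")  # hypothetical endpoint

# Analyze product names into overlapping 3-character grams so a typo like
# "pionee 1234" still shares most of its tokens with "Pioneer P1234".
es.indices.create(
    index="inputs",  # hypothetical index name
    body={
        "settings": {
            "analysis": {
                "tokenizer": {
                    "trigrams": {"type": "ngram", "min_gram": 3, "max_gram": 3}
                },
                "analyzer": {
                    "trigram": {
                        "type": "custom",
                        "tokenizer": "trigrams",
                        "filter": ["lowercase"],
                    }
                },
            }
        },
        "mappings": {"properties": {"name": {"type": "text", "analyzer": "trigram"}}},
    },
)

# A plain match query then ranks documents by how many trigrams they share
# with the operator-typed name, which is what makes the matching "fuzzy".
hits = es.search(index="inputs", body={"query": {"match": {"name": "pionee 1234"}}})
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])
```

Ranking by shared trigrams tolerates the typos and truncations common in machine data, which is the same idea behind the Postgres trigram index it replaced.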

The punchline: we CRUSHED it. We averaged under 200 ms response time across all endpoints at 600 requests per second for our crops service. On the inputs catalog, our get-by-ID endpoints, which go directly to DynamoDB, respond in under 100 ms. For fuzzy matches, we average about 2,400 ms under a load of 300 requests per second.

Service response times plot as described in the article (https://gist.github.com/zwing99/f1cab7306c86af02f0fad528ebe385f9)
Figure 2 — Service Response Times

Success

While the biggest goal of the rewrite was to meet our scaling demands, the ginormous auxiliary win was cost savings. We went from requiring two expensive RDS databases to several inexpensive DynamoDB tables (using pay-per-request billing) and one Elasticsearch Service domain shared by both catalogs. This architecture was “more” infrastructure in terms of individual pieces deployed, but it was right-sized and single-purposed; therefore, it saved us BIG money. All in all, it was 60% cheaper than our original approach.

In conclusion, we believe that using serverless technologies and AWS hosted solutions gave us a clear advantage over self-hosted K8s and Postgres’ trigram indices for our catalog services. Out of the box, Elasticsearch Service offered us scalable, robust, and accurate text searching capabilities. We know that in the future, as the need arises and we launch more products on our platform, we can extend Elasticsearch with features such as language analysis and typeahead completion to make our catalog services even more powerful and useful. Finally, the most significant learning was the proof that Lambda-hosted APIs, with limited exceptions, should become the new standard and target for our platform services.


Zac Oler
Granular Engineering

Software Architect at Granular Ag. Worked in the AgTech industry for 12+ years. Passionate about software, cloud, architecture, and developer experience.