My Little Over-Engineering

Gal Koren
skai engineering blog
5 min read · May 2, 2018

I love to write new code. Therefore, I was very happy to be assigned a new project called “Channel-Bee”, which had to be designed and coded from scratch (and it’s a microservice, too!).
This story is about how my urge to code and my new knowledge of Amazon’s cool building blocks (like Kinesis) inspired me to create a great design. So great, in fact, that its complexity could be reduced by half…

The new software downloads data from Google AdWords accounts (using the AdWords API) into our data lake. This data is later queried by our research group to build a statistical model.

So the challenge is scale

We performed some back-of-the-envelope calculations and got these results:

  • We manage about 60K AdWords accounts
  • Within them, about 50M “interesting” AdGroups (the objects to fetch)
  • All of this needs to be fetched every hour
  • The system needs to support future growth of 100X
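In other words, plain division says 50M AdGroups per hour is roughly 14K objects to fetch every second, and 100X growth pushes that toward 1.4M per second.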

To make life easier:

  • A delta is enough (fetch only changed data), and changes can be detected directly in AdWords
  • It’s OK if downloading an account fails in any given hour, because the missing data will arrive with the next hour’s delta

Milestone 1 — Small Scale

The research group wants to start experimenting with our data ASAP, for just a few AdWords accounts, so we don’t need to support scale right now.
However, at Kenshoo we have an ecosystem called “Microcosm” that creates and sets up a new microservice for you within an hour of work.

This includes:

  • Auto CI/CD to Amazon Elastic Beanstalk
  • MySQL
  • Redis

So scaling out by adding nodes (machines) already comes out of the box.
We chose Quartz to distribute tasks to the nodes, and defined a task as downloading a single AdWords account.
And… Voilà! We came up with this:
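In essence: one Quartz job per account, fired hourly. Here’s a minimal sketch of the idea (class and identity names are hypothetical, not the actual Channel-Bee code):

```java
import java.util.List;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Hypothetical sketch: one hourly Quartz job per AdWords account.
public class DownloadAccountJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        long accountId = context.getMergedJobDataMap().getLong("accountId");
        // Fetch the changed AdGroups of this account via the AdWords API
        // and send the resulting CSV to the data lake (omitted here).
    }

    // Register an hourly trigger per account.
    public static void scheduleAll(List<Long> accountIds) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();
        for (long accountId : accountIds) {
            scheduler.scheduleJob(
                JobBuilder.newJob(DownloadAccountJob.class)
                          .withIdentity("download-" + accountId)
                          .usingJobData("accountId", accountId)
                          .build(),
                TriggerBuilder.newTrigger()
                              .withIdentity("hourly-" + accountId)
                              .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                                  .withIntervalInHours(1)
                                  .repeatForever())
                              .build());
        }
    }
}
```

With Quartz’s clustered JDBC job store (presumably backed by the MySQL that Microcosm already provides), each firing job runs on exactly one of the nodes, which is what distributes the work.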

For now, the research group is happy with its three accounts running once an hour.

Milestone 2 — Large Scale

60K accounts means 60K tasks on Quartz… can Quartz handle the load?
We used our existing system to test that, and the answer was… no.
At about 15K tasks, Quartz becomes amazingly slow.
Actually, this is a known limitation of Quartz (still, it was more fun to see it for myself). [1] [2]

One immediate solution is to group the accounts, redefining a task as downloading a whole Kenshoo customer (all of its AdWords accounts).
We have about 500 customers, so this should be a piece of cake for Quartz.
But let’s hold that thought while we consider some more interesting limitations, this time coming from the Big Data group:

  • For the queries to be efficient, the data needs to be partitioned by the customer id column. This means the CSV files we send to our data lake need to be divided into folders, one folder per customer (see the sketch after this list).
  • The group doesn’t want many small files in their file system. 500 files an hour is too many… can we make it fewer than 100?
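For illustration, the partitioned layout could look something like this (all names here are made up):

```
data-lake/
├── customer_id=17/
│   └── adgroups-2018-05-02-10.csv
└── customer_id=42/
    └── adgroups-2018-05-02-10.csv
```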

These limitations would require me to aggregate the data in my system for several hours before sending it to our data lake.
And to make it even more fun, I can’t use my machine’s local disk, because Elastic Beanstalk is really elastic, meaning machines can be added and removed at any time.
But enough about the problems. Let’s get some solutions.

Distributing tasks between nodes at high scale can be done with a task queue. I chose Redis plus the Jesque library for Java. This is the easy part.
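To make it concrete, here is roughly what enqueuing and consuming look like with Jesque (queue and task names are made up, not the actual Channel-Bee code):

```java
import static net.greghaines.jesque.utils.JesqueUtils.entry;
import static net.greghaines.jesque.utils.JesqueUtils.map;

import java.util.Arrays;
import net.greghaines.jesque.Config;
import net.greghaines.jesque.ConfigBuilder;
import net.greghaines.jesque.Job;
import net.greghaines.jesque.client.Client;
import net.greghaines.jesque.client.ClientImpl;
import net.greghaines.jesque.worker.MapBasedJobFactory;
import net.greghaines.jesque.worker.Worker;
import net.greghaines.jesque.worker.WorkerImpl;

public class TaskQueueSketch {

    // A task is "download one Kenshoo customer"; Jesque materializes it
    // from the queued JSON by matching the constructor arguments.
    public static class DownloadCustomerTask implements Runnable {
        private final String customerId;

        public DownloadCustomerTask(String customerId) {
            this.customerId = customerId;
        }

        @Override
        public void run() {
            // Download all AdWords accounts of this customer (omitted here).
        }
    }

    public static void main(String[] args) {
        Config config = new ConfigBuilder().build(); // points at the Redis server

        // Producer side: enqueue one task per customer.
        Client client = new ClientImpl(config);
        client.enqueue("downloads", new Job("DownloadCustomerTask", "customer-42"));
        client.end();

        // Consumer side: every Channel-Bee node runs a worker polling the queue.
        Worker worker = new WorkerImpl(config, Arrays.asList("downloads"),
            new MapBasedJobFactory(map(entry("DownloadCustomerTask", DownloadCustomerTask.class))));
        new Thread(worker).start();
    }
}
```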

Aggregating data requires a storage service. I chose Kinesis because:

  • It’s a data stream with 24-hour persistence
  • It’s cheap
  • It can scale out by adding shards

I just need another microservice, an “aggregator”, to consume from Kinesis every few hours, group the data by customer, and send it to our data lake.
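For flavor, the aggregator’s consuming side might have looked something like this with the AWS Java SDK (stream and shard names are hypothetical, and a real consumer would use the Kinesis Client Library to track all shards and checkpoints):

```java
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;
import com.amazonaws.services.kinesis.model.ShardIteratorType;

public class AggregatorSketch {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        // Start reading one shard from the oldest available record
        // (the stream retains 24 hours of data).
        String iterator = kinesis.getShardIterator(new GetShardIteratorRequest()
                .withStreamName("channel-bee-adgroups")
                .withShardId("shardId-000000000000")
                .withShardIteratorType(ShardIteratorType.TRIM_HORIZON))
            .getShardIterator();

        for (Record record : kinesis.getRecords(
                new GetRecordsRequest().withShardIterator(iterator)).getRecords()) {
            String csvLine = new String(record.getData().array(), StandardCharsets.UTF_8);
            // Group csvLine by its customer id, buffer for a few hours,
            // then write one file per customer to the data lake.
        }
    }
}
```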

I was very proud of this design because it theoretically supports any scale we’ll ever need.

Our first rejection came from our SRE team regarding Kinesis.
Unlike Redis or MySQL, Kinesis is an Amazon proprietary product that cannot be installed outside of the AWS cloud. Using it will vendor-lock us to Amazon.
Unfortunately, back then we didn’t have any other streaming solution that was as easy to set up, so I struggled to convince SRE to approve it.
In a meeting with team representatives and managers about this issue, a smart man asked me this:

What is the actual traffic?
Gee… that’s embarrassing — I don’t know.
I mean, I know we have 50M AdGroups across our AdWords accounts, but I didn’t measure the traffic after applying all our filters, which are:

  • Only AdGroups that changed in the last hour
  • Only AdGroups with actual performance (clicks) on AdWords
  • Only AdGroups with a certain bidding strategy

We agreed to schedule a new meeting after I came up with the numbers.
So I modified the existing Channel-Bee system to produce metrics by loading all the accounts into it and disabling the module that sends files to our data lake.
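The measurement itself can be tiny, for example a byte meter around the download path; here is a sketch with Dropwizard Metrics (assuming that library, any counter would do):

```java
import java.nio.charset.StandardCharsets;
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

// Hypothetical sketch: meter the bytes that would have been shipped,
// instead of actually sending files to the data lake.
public class TrafficMeter {
    private static final MetricRegistry REGISTRY = new MetricRegistry();
    private static final Meter CSV_BYTES = REGISTRY.meter("adgroup.csv.bytes");

    // Called for every CSV line that passed all the filters.
    public static void record(String csvLine) {
        CSV_BYTES.mark(csvLine.getBytes(StandardCharsets.UTF_8).length);
    }
}
```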

And the answer was… 10MB per hour.
So maybe I don’t need Kinesis with its endless scaling capabilities…
And maybe I don’t even need a new dedicated “aggregator” service (one of the existing Channel-Bee nodes can definitely handle this extra task).

The final design had no Kinesis or aggregator service.
Instead, I just added a per-customer queue on Redis for the aggregation, and ended up with one microservice and one Redis server.
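A minimal sketch of that aggregation scheme with Jedis (key names are made up):

```java
import java.util.List;
import redis.clients.jedis.Jedis;

// Hypothetical sketch of the per-customer aggregation on Redis.
public class RedisAggregation {

    // Download path: every node appends its CSV lines to the owning customer's list.
    public static void buffer(Jedis jedis, long customerId, String csvLine) {
        jedis.rpush("agg:customer:" + customerId, csvLine);
    }

    // Every few hours, one node drains each customer's list into a single
    // CSV file under that customer's folder in the data lake.
    // (A real implementation would drain atomically, e.g. inside MULTI/EXEC.)
    public static List<String> drain(Jedis jedis, long customerId) {
        String key = "agg:customer:" + customerId;
        List<String> lines = jedis.lrange(key, 0, -1);
        jedis.del(key);
        return lines;
    }
}
```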

Am I happy with the solution?
Well, the little engineer in me still thinks that Kinesis is the real streaming solution and not Redis (“Hey, it’s not by the book!”).
But as time passes, I realize that the real art in our job is writing less code (less code is cleaner code).
Also, having a single service instead of two makes the end-to-end test automation much easier. And I have a component test that checks the entire download process.

So, the most valuable lesson for me was: get real metrics before committing to a complicated design… my “back-of-the-envelope” calculations, taken from database queries, needed to be validated along the way.
