How Circles.Life Created A Solution To Analyze An Obscene Amount of Data

Published in

Circles.Life

6 min readOct 18, 2019

Our custom-built solution, Wormhole, crunches stored data.

The story of Circles.Life began in 2016 when we disrupted the telco industry in Singapore, but our journey doesn’t stop there. Beyond our telco offerings, we have gone on to provide personalized digital services — Discover events and Movies are the first few of many more — and have since embarked on growing our presence internationally.
In case you haven’t heard, we’ve expanded outside of Singapore: Circles.Life is now live in Taiwan and Australia!

Introduction

Here is the challenge we face: there is a lot of data to process each month — and this amount is steadily growing. With an increasing customer base and the launch of various ecosystem products that fall outside of our telco realm, our data grows at an alarming rate of 4X every month.

At this scale, ordinary data warehouse solutions fail to process data faster, which meant we had to innovate and come up with an alternative viable solution.

Now enter Wormhole, an open-source Dockerized solution for deploying Presto and Alluxio clusters for blazing fast analytics on filesystems (in our set-up, we use S3, GCS, OSS).

This blog article will go through the process in which we decided on and implemented our solution, Wormhole.

Research and Proof-of-Concept

We began by exploring multiple OLAP solutions and came across two promising candidates: Presto and Ignite.

Upon further exploration and testing, we noticed a few limitations of Ignite: it is difficult to set up, it exists as a monolithic component, and there is no certainty of it being cloud agnostic.

As a result, we decided to go with Presto. Before doing so, we had to produce data to back our choice — and thus we compared Presto with common solutions like Hive and Spark, and collected the results as shown below.

***Query time comparison on various tools for the same dataset***

Perfect! With these results, we could see that Presto was definitely the way forward. As we explored the possibilities of Presto, we also came across Alluxio.

Alluxio, a distributed cache on Filesystem, is meant to work in tandem with Presto. What Alluxio does is accelerate the subsequent fetch time for Presto workers to get data from buckets, as it caches data on the first run. We compared the queried results again and we obtained the telling chart below:

**Presto+Alluxio setup outlays every other engine in querying the same dataset**

Making it work at scale

The proof-of-concept could be considered a success, but we ran into the next problem: deployment of this scattered giant.

It was becoming extremely painful to deploy the various masters for High Availability (HA) and horizontally scaling the worker nodes manually.

We decided to Dockerize all the components which could later go hand-in-hand with Kubernetes. As we were not deploying the beta setup on Kubernetes, we faced the issue of container discovery in the cluster because IPs get replaced by ContainerIDs. Therefore, we utilized the capability of Docker overlay network and used Consul for dynamic discovery.

To help you visualize it better, here is a high-level architecture diagram of our solution:

The Wormhole Architecture

Want to set up your own Wormhole Architecture? We’ll explain each component in the order in which they should be set up.

Consul — Starting with Consul, this is a service networking solution to connect and secure services across any runtime platform and public or private cloud. It helps in identification of a container by its containerID to other containers. You can find setup instructions here.
Docker — Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. In our setup, we have configured every service as a Docker container. You can find setup instructions here.
Alluxio Master — Alluxio masters should be deployed in High Availability (HA). These are responsible for making a query execution plan, distributing the query to workers,joining the individual results back and then sending it back to the requester. You can find setup instructions here.
Alluxio Worker — Alluxio workers are the actual cache storage of Alluxio. When data is queried from the Alluxio filesystem (FS), it fetches data from underlying FS (could be S3, GCS or any configured FS) and stores it in the RAM in LRU fashion. On the next query, it doesn’t need to go all the way to the FS, but can return data blocks from its own cache. You can find setup instructions here.
Hive metastore — Hive metastore is the collection of metadata of all the hive tables. In our setup, we create the metastore on MySQL for storing Alluxio located data. You can find setup instructions here.
Presto Coordinator — The Presto coordinator should be deployed in High Availability (HA). Similar to the Alluxio Master, these are responsible for making the query execution plan, distributing the query to workers, joining the individual results back and sending it back to the requester. You can find setup instructions here.
Presto Worker — Presto workers are responsible for crunching the data from underlying Alluxio. These pull up the data, perform operations on it and send the individual results back to the coordinator. You can find instructions here.

Apart from the components above, a Zookeeper quorum setup is required for making the Alluxio master and Presto coordinator highly available (HA). For the complete documentation of this setup, please visit this link.

Time for some action

Now that we have set up Presto on top of Alluxio, how do we make it available for everyone to use? Luckily for us, we use Metabase, which has Presto connectivity out of the box. We only needed to add the appropriate configurations and it began working like a charm for all analyses.

Presto and Alluxio also provides a UI to track the real-time status of the implementation, which can be really useful.

Business Impact

Let’s take a step back for a moment and consider this: by scaling up our analytical capabilities, we are able to keep up with (perhaps even stay on top of) the massive data produced every minute.

As Circles.Life is a growing company, equipping our analysts with timely and relevant analyses allows them to not only be able to continue making iterative improvements to existing products, but also to create and test new ideas and solutions to existing problems.

Conclusion

With modern solutions like Presto and Alluxio today, we’re able to harness its power through a customized setup like Wormhole, as demonstrated above.

As excited as we are, this solution is still in the process of being adopted across the entire organization. The reduction in latency of already-migrated queries has enabled faster business insights and more of such use cases have surfaced since then. We believe that through continued innovation in optimization techniques, it will greatly help us in our mission to make personalized services across the world a reality!

We are on a mission to give power back to customers and deliver great personalized digital services.

To succeed in this mission, we are looking for heroes who want to join our journey to solve challenging problems across a complete engineering paradigm. If you fancy yourself part of the Circles.Life family, apply here!