Introducing Wormhole: Dockerized Presto & Alluxio setups for blazing fast analytics
This blog introduces Wormhole — open source Dockerized solution for deploying Presto & Alluxio clusters for blazing fast analytics on file system(we use S3, GCS, OSS). When it comes to analytics, generally people are hands-on in writing SQL queries and love to analyse data which resides in a warehouse(e.g. MySQL database). But as data grows, these stores start failing and hence arises a need for getting the faster results in same or less time frame. This can be solved by distributed computing and Presto is designed for that. When attached with Alluxio, it works even more faster. That’s what Wormhole is all about.
Here is the high level architecture diagram of solution:
Let us explain each component in the order which they should be setup in wormhole:
- Consul — Consul is a service networking solution to connect and secure services across any runtime platform and public or private cloud. It helps in identification of a container by its containerID to other containers. Setup instructions here.
- Docker — Docker is a set of platform-as-a-service products that use OS-level virtualization to deliver software in packages called containers. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels. In our setup, we have put every service as a Docker container. Setup instructions here.
- Alluxio Master — Alluxio masters should be deployed in High Availability(HA). These are responsible for making query execution plan and distributing the query to workers and then the joining individual result back and sending it back to the requester. Setup instructions here.
- Alluxio Worker — Alluxio workers are the actual cache storage of Alluxio. When data is queried from Alluxio filesystem(FS), it fetches data from underlying FS(maybe S3, GCS or any configured FS) and stores in RAM in LRU fashion. Next time it doesn’t need to go till FS and returns data blocks from its own cache. Setup instructions here.
- Hive metastore — Hive metastore is the collection of metadata of all the hive tables. In our setup, we create metastore on MySQL for storing Alluxio located data. Setup instructions here.
- Presto Coordinator — Presto coordinator should be deployed in High Availability(HA). These are responsible for making jiquery execution plan and distributing the query to workers and then the joining individual result back and sending it back to the requester. Setup instructions here.
- Presto Worker — Presto workers are responsible for crunching the data from underlying Alluxio. These pull the data, perform operations on it and send the individual results to coordinator. Setup instructions here.
Apart from above components, we require a Zookeeper quorum setup which is required for making Alluxio master and Presto coordinator highly available(HA). For complete documentation on setup, please refer here.
Time for some action
Now since we have setup Presto on top of Alluxio, but how to make it available for everyone to use? So the answer can be some other tools like Metabase, which provide connectivity to Presto. Just we needed to add the appropriate configurations and it works like a charm for all sorts of analysis.
Presto and Alluxio also provides UI to track the current state of things and that helps a lot.
Next focus will be mostly on making the solution self-serve through a user interface and make it self scalable(possibility of deployment on Kubernetes).