Real time data processing using Google Cloud and OpenTSDB

Dennis Mårtensson
Greta.io
Jul 11, 2017 · 3 min read

This is the first in a series of posts on how Greta.io performs high-scale, real-time streaming data processing. We will outline the architecture of the system and give a short summary of its different parts.

If you have read any of the previous posts by us, you will know that we are running on the Google Cloud Platform and that we have been very successful with using different offerings from Google to be able to run Greta.io at high scale.

First, some parts of this setup were inspired by Server Density's setup. You will see that our system is not the same, but it has a few parts in common.

Before we start, we would like to thank Daniel Bergqvist from Google for helping out and sharing insights with us! 🙌🏽

Client: The system we are building collects data from all the users of sites that use Greta, producing a large volume of data flowing into our system.

HTTP(S) Load Balancing: The users of Greta are distributed all around the world, which requires global presence. We achieve this by using Google's globally distributed HTTP(S) load balancer. This gives us very quick and reliable data ingestion from clients all over the world.

Container Engine: We run all our services on Kubernetes with the help of Container Engine.

Ingestion: We currently ingest data in four clusters (EU west, US west, US east and Asia east), and we constantly evaluate where to add new presence. This is where we consume HTTP requests and apply some data enhancements before publishing to the queue.
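The enhancement step at ingestion can be sketched roughly like this: stamp each incoming event with server-side metadata before it is encoded for the queue. The field names here are illustrative placeholders, not Greta's actual schema.

```python
import json
import time

def enhance_event(raw_event: dict, region: str) -> bytes:
    """Attach server-side metadata to an event, then encode it for the queue.

    The field names are illustrative, not Greta's actual schema.
    """
    event = dict(raw_event)
    event["ingest_ts"] = int(time.time() * 1000)  # server receive time, ms
    event["ingest_region"] = region               # e.g. "europe-west1"
    return json.dumps(event, sort_keys=True).encode("utf-8")
```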

Cloud Pub/Sub: Used to move and distribute data. This service is hosted and run by Google. To us, the most important part is that it is globally deployed, so we can use it to move data from all the ingestion points around the world back to the EU for streaming data processing.
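Publishing an encoded event from an ingestion point looks roughly like the sketch below, using the google-cloud-pubsub client library. The project and topic names are placeholders, and the actual publish requires GCP credentials.

```python
def topic_path(project_id: str, topic_id: str) -> str:
    """Fully qualified topic name in the form Pub/Sub expects."""
    return f"projects/{project_id}/topics/{topic_id}"

def publish_event(project_id: str, topic_id: str, payload: bytes) -> str:
    """Publish one encoded event; blocks until the server acknowledges it.

    Needs the google-cloud-pubsub library and credentials to actually run.
    """
    from google.cloud import pubsub_v1  # deferred: optional dependency

    publisher = pubsub_v1.PublisherClient()
    future = publisher.publish(topic_path(project_id, topic_id),
                               data=payload, origin="ingestion")
    return future.result()  # message ID once the publish is acknowledged
```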

Apache Beam/Cloud Dataflow: Allows for real-time and scalable data processing. We use Beam's functionality for windowing, grouping and a number of other operations. This lets us collect a number of different output values, and quickly and easily add new values and new computations on the stream of data.

BigQuery: We write all the data that flows through Dataflow into BigQuery, which allows us to perform analytics and query the data at scale. We use BigQuery extensively to extract data, optimize data delivery for our customers, and give them insights into very specific parts of their data delivery network.
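A typical analytics query over the event table might look like the sketch below, here built as a Standard SQL string in Python. The table and column names are illustrative placeholders, not our actual schema.

```python
def delivery_stats_query(table: str, hours: int = 24) -> str:
    """Build a BigQuery Standard SQL query over recent delivery events.

    The table and column names are placeholders for illustration.
    """
    return f"""
    SELECT delivery_method,
           COUNT(*) AS events,
           AVG(load_time_ms) AS avg_load_ms
    FROM `{table}`
    WHERE ingest_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {hours} HOUR)
    GROUP BY delivery_method
    ORDER BY events DESC
    """
```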

OpenTSDB: A time series database with a number of functions built in, allowing us to query data very effectively. We can iterate rapidly on queries and use aggregations, downsampling, expressions, tags and more to visualize the data, giving insights into how the data distribution is performing.
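Those query features map directly onto OpenTSDB's HTTP /api/query endpoint. The sketch below builds a request body using the endpoint's real shape (aggregator, downsample, tags); the metric and tag names are placeholders for our own series.

```python
def opentsdb_query(metric: str, region: str = "*", start: str = "1h-ago") -> dict:
    """Build a POST body for OpenTSDB's /api/query endpoint.

    The request shape (aggregator, downsample, tags) is OpenTSDB's;
    the metric and tag names are illustrative.
    """
    return {
        "start": start,
        "queries": [{
            "metric": metric,
            "aggregator": "sum",     # combine matching series
            "downsample": "1m-avg",  # one averaged point per minute
            "tags": {"region": region},
        }],
    }
```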

Cloud Pub/Sub to OpenTSDB writer: This is a small static writer that subscribes to Cloud Pub/Sub and sends write operations to OpenTSDB. The biggest benefit of doing it this way is that it enables us to keep writing to the different data sources and to recover the data as long as it remains in Cloud Pub/Sub.
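The core of such a writer is a conversion from a Pub/Sub message body into an OpenTSDB /api/put datapoint, followed by an HTTP POST. The /api/put format is OpenTSDB's; the message schema and the OpenTSDB URL below are assumptions for illustration.

```python
import json

def to_opentsdb_point(message_data: bytes) -> dict:
    """Convert one Pub/Sub message body into an OpenTSDB /api/put datapoint.

    Assumes the message is JSON with metric/timestamp/value/tags fields;
    that schema is illustrative, not Greta's exact format.
    """
    event = json.loads(message_data)
    return {
        "metric": event["metric"],
        "timestamp": event["timestamp"],  # seconds since epoch
        "value": event["value"],
        "tags": event.get("tags", {}),
    }

def write_point(point: dict, tsdb_url: str = "http://opentsdb:4242") -> None:
    """POST a datapoint to OpenTSDB; requires a reachable instance."""
    import urllib.request

    req = urllib.request.Request(
        tsdb_url + "/api/put",
        data=json.dumps(point).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```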

BigTable: We run OpenTSDB with Bigtable as the storage backend, which gives us close to linear scaling of reads and writes. It is also completely managed and performs operations very quickly.

Read API: We run an API that authenticates users and allows our front end to query OpenTSDB for data, which is then displayed in our dashboard.

All in all, there are a lot of parts, but they make the data processing very scalable and resilient. We will go into the different parts of the system in separate blog posts, to eventually provide a good in-depth overview of the system.

Greta.io is dedicated to helping users increase site performance via an innovative approach to content delivery. We use machine learning to make content routing decisions in the client based on real-time network data. For every piece of content, Greta automatically decides if the content should be delivered over multi-CDN, peer-to-peer, client cache or an overlay network. As a result, customers experience shorter content load times, higher content quality and server offload during peak traffic.
