Collecting User Data and Usage
This is a blog post I wrote for my company’s tech blog (Rounds). Since then the company has been acquired by Kik, and the blog has been somewhat hidden away. Nevertheless, I am posting it here for documentation’s sake.
I’m Ory, a Backend developer here at Rounds. Today I’d like to present our internal data and event flows.
Knowing what our users are doing with our app is important: what they like, what they don’t, the quality of our video calls, and so on. Gathering and storing this information, however, is quite a task, especially when we have more than one million events reported every minute. At Rounds, we use two data stores for live monitoring, search, and BI. One is for immediate, live data, and the other is for long-term data warehousing and research.
Live Data with Elasticsearch
For live data and monitoring, we’re using Elastic’s (formerly Elasticsearch) ELK stack: Elasticsearch, Logstash, and Kibana. Elasticsearch is a document-based data store that supports text search and exposes a RESTful interface for insertion and querying. It is based on Lucene, a powerful text search library. We use it, along with Kibana, to view live data and see what is happening right now with our system all over the world. Kibana is a magnificent front-end UI for Elasticsearch that lets you view, search, and filter Elasticsearch data in real time, straight from your browser. It’s mobile-friendly as well.
In addition, all our instances report logs to Elasticsearch via Logstash and logstash-forwarder (formerly lumberjack) — and they too can be viewed and searched via Kibana.
Big Data with Google BigQuery
For BI and data warehousing, we are using Google BigQuery. BigQuery is a fast, SQL-like big data store. It is simple to use and set up, and its queries are asynchronous and fast. Very fast: we wait no more than a couple of seconds for one to return. Furthermore, data can be inserted in massive amounts, either in asynchronous load jobs or via live streaming. It’s a SaaS, saving you the (mental) cost of having to manage your store yourself.
For us, BigQuery was a perfect match — capable of storing the huge amount of data we receive every second.
However, having all our back end services and mobile clients report directly to Elasticsearch and BigQuery is not practical, and even if it were, it’s not a good idea. Storing authentication data in mobile clients, and having too many unsupervised inputs report at the same time to a single point of entry, can have bad consequences. We needed something to manage all these inputs, multiplexing them into Elasticsearch and BigQuery and controlling which data goes where.
For this purpose we have created a micro-service called Event Collector. This service is capable of receiving tens of thousands of requests per second, and reports them directly to Elasticsearch and BigQuery. It is written in Go, because of its high concurrency, extraordinary tool chain, and great syntax. Event Collector is capable of controlling which data goes where, mutating it along the way if necessary. It can also retry failed insert requests to Elasticsearch and BigQuery.
This gives us the flexibility to change the underlying data stores we use, and even add more — without having to change anything in our reporting services and clients — to them it looks like they are always reporting to the same middleman.
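To illustrate the idea of a middleman that decides which data goes where, here is a minimal sketch in Go. It is not the Event Collector's actual code: the Event, Sink, Route, and dispatch names, the in-memory sink, and the "ip" field are all hypothetical, chosen only to show routing with an optional mutation step.

```go
package main

import "fmt"

// Event is a hypothetical event payload; the real schema is not shown in this post.
type Event map[string]interface{}

// Sink is a hypothetical destination, such as Elasticsearch or BigQuery.
type Sink interface {
	Insert(e Event) error
}

// memorySink records events in memory; it stands in for a real data store client.
type memorySink struct {
	events []Event
}

func (s *memorySink) Insert(e Event) error {
	s.events = append(s.events, e)
	return nil
}

// Route sends events accepted by Match to Sink, after an optional Mutate step.
type Route struct {
	Match  func(Event) bool
	Mutate func(Event) Event // may be nil
	Sink   Sink
}

// dispatch offers the event to every route and returns how many sinks received it.
func dispatch(routes []Route, e Event) (int, error) {
	delivered := 0
	for _, r := range routes {
		if !r.Match(e) {
			continue
		}
		out := e
		if r.Mutate != nil {
			out = r.Mutate(e)
		}
		if err := r.Sink.Insert(out); err != nil {
			return delivered, err
		}
		delivered++
	}
	return delivered, nil
}

func main() {
	live := &memorySink{}
	warehouse := &memorySink{}
	routes := []Route{
		// Only call-quality events go to the live store.
		{Match: func(e Event) bool { return e["type"] == "call_quality" }, Sink: live},
		// Everything goes to the warehouse, minus the (hypothetical) ip field.
		{
			Match: func(Event) bool { return true },
			Mutate: func(e Event) Event {
				out := Event{}
				for k, v := range e {
					if k != "ip" {
						out[k] = v
					}
				}
				return out
			},
			Sink: warehouse,
		},
	}
	dispatch(routes, Event{"type": "call_quality", "ip": "10.0.0.1"})
	dispatch(routes, Event{"type": "login", "ip": "10.0.0.2"})
	fmt.Println(len(live.events), len(warehouse.events)) // prints "1 2"
}
```

Because reporters only ever talk to dispatch, swapping or adding a data store in this sketch means editing the routes slice and nothing else.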
In addition, multiple event collectors can be deployed to multiple data centers around the globe, where they operate independently of one another. This gives us the scalability we need as our data throughput grows.
For inserting to Elasticsearch, we’re using elastigo, an Elasticsearch client library written in Go.
For inserting data into BigQuery, we use streaming, which is BigQuery’s term for live inserts in bulk. This saves us the task of having to dump data into a file and upload it using a load job. When the Event Collector receives an event report from one of our services or clients, it queues it up in an internal queue. Once a certain size or time threshold is reached, the queued data is flushed in bulk into Elasticsearch and/or BigQuery.
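The queue-then-flush behavior described above can be sketched with a channel and a ticker. This is an illustration under assumptions, not the Event Collector's code: runBatcher, Event, and the flush callback are hypothetical names, and the callback stands in for a bulk insert into a data store.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical event payload.
type Event map[string]interface{}

// runBatcher reads events from in and calls flush with the queued batch when
// maxRows events are queued, or on every maxDelay tick, whichever comes first.
// It flushes any remainder when in is closed, then returns.
func runBatcher(in <-chan Event, maxRows int, maxDelay time.Duration, flush func([]Event)) {
	batch := make([]Event, 0, maxRows)
	ticker := time.NewTicker(maxDelay)
	defer ticker.Stop()

	// emit flushes the current batch, if any, and starts a fresh one.
	emit := func() {
		if len(batch) == 0 {
			return
		}
		flush(batch)
		batch = make([]Event, 0, maxRows)
	}

	for {
		select {
		case e, ok := <-in:
			if !ok {
				emit()
				return
			}
			batch = append(batch, e)
			if len(batch) >= maxRows {
				emit()
				ticker.Reset(maxDelay) // restart the time threshold after a size flush
			}
		case <-ticker.C:
			emit()
		}
	}
}

func main() {
	in := make(chan Event)
	done := make(chan struct{})
	go func() {
		defer close(done)
		runBatcher(in, 3, time.Second, func(b []Event) {
			fmt.Printf("flushing %d events\n", len(b))
		})
	}()
	for i := 0; i < 7; i++ {
		in <- Event{"id": i}
	}
	close(in)
	<-done
}
```

With seven events and a size threshold of three, this run flushes batches of 3, 3, and 1; the last one is the remainder flushed when the input closes. The time threshold matters when traffic is slow, since it bounds how long an event can sit in the queue.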
Open Source: go-bqstreamer
For the purpose of streaming into BigQuery, we are happy to release our go-bqstreamer Go package. This package provides highly concurrent, fast, and durable streaming into BigQuery. It is production ready and thoroughly tested. We have been using it in production daily for a couple of months now and are very happy with the results.
It provides two types: a Streamer and a MultiStreamer. A Streamer is a single worker which reads rows, queues them, and streams them in bulk into BigQuery once a certain threshold is reached. Thresholds can be either a number of rows queued, or based on time, inserting once a certain time has passed.
This provides flush control, inserting in set sizes and quickly enough. In addition, the Streamer knows how to handle BigQuery server errors (HTTP 500 and the like), and attempts to retry insertions several times on such failures. It also sends errors on an error channel, which can be read and handled.
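The retry-and-report pattern can be sketched as follows. This is not go-bqstreamer's API: insertWithRetry and errServer are hypothetical names, and errServer simply stands in for a retryable server-side failure such as an HTTP 500 response.

```go
package main

import (
	"errors"
	"fmt"
)

// errServer is a hypothetical sentinel for a retryable server-side failure.
var errServer = errors.New("server error")

// insertWithRetry calls insert up to maxAttempts times. Every failure is sent
// on errs. Server errors are retried; any other error stops the loop.
// It returns true if one of the attempts succeeded.
func insertWithRetry(insert func() error, maxAttempts int, errs chan<- error) bool {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := insert()
		if err == nil {
			return true
		}
		errs <- fmt.Errorf("attempt %d: %w", attempt, err)
		if !errors.Is(err, errServer) {
			return false // not retryable
		}
	}
	return false
}

func main() {
	errs := make(chan error, 10) // buffered so reporting does not block in this demo
	calls := 0
	// flaky fails twice with a server error, then succeeds.
	flaky := func() error {
		calls++
		if calls < 3 {
			return errServer
		}
		return nil
	}

	ok := insertWithRetry(flaky, 5, errs)
	close(errs)
	for err := range errs {
		fmt.Println("insert error:", err)
	}
	fmt.Println("succeeded:", ok, "after", calls, "calls")
}
```

A production version would normally wait between attempts, for example with an increasing delay, which this sketch leaves out for brevity.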
MultiStreamer operates multiple Streamers (i.e. workers) concurrently. It reads rows and distributes them to the Streamers. This allows a higher insert throughput, since numerous workers are queuing rows and inserting concurrently.
As with Streamer, errors are reported from each worker and sent to a unified error channel, where you can decide to read and handle them if necessary.
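The fan-out to several workers with one shared error channel can be sketched like this. Again, this is not go-bqstreamer's API: runWorkers, Row, and errRejected are hypothetical names, and each worker here inserts row by row rather than batching, to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Row is a hypothetical row payload.
type Row map[string]interface{}

// errRejected is a hypothetical insert failure used by the demo below.
var errRejected = errors.New("row rejected")

// runWorkers starts n workers that all read from rows and call insert for each
// row. Errors from every worker go to the single returned channel, which is
// closed once rows is closed and all workers have finished.
// The caller must drain the returned channel, or workers will block on it.
func runWorkers(n int, rows <-chan Row, insert func(worker int, r Row) error) <-chan error {
	errs := make(chan error)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for r := range rows {
				if err := insert(id, r); err != nil {
					errs <- fmt.Errorf("worker %d: %w", id, err)
				}
			}
		}(i)
	}
	// Close the unified error channel after the last worker exits.
	go func() {
		wg.Wait()
		close(errs)
	}()
	return errs
}

func main() {
	rows := make(chan Row)
	errs := runWorkers(4, rows, func(worker int, r Row) error {
		if r["id"].(int)%5 == 0 {
			return errRejected // pretend every fifth row fails
		}
		return nil
	})

	// Feed rows from a separate goroutine so main can drain errors meanwhile.
	go func() {
		for i := 1; i <= 20; i++ {
			rows <- Row{"id": i}
		}
		close(rows)
	}()

	for err := range errs {
		fmt.Println(err)
	}
}
```

Which worker picks up which row is up to the scheduler, so the order of the reported errors varies between runs.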
This concludes our data workflow for today. Feel free to contribute to go-bqstreamer — we welcome pull requests!
Until next time,
Ory @ Rounds
Originally published at rounds.com on March 26, 2015.