Stream Highlights — Data Processing Architecture

A deep dive into the Stream Highlights architecture

In the first part, we saw how the web application was built. This second part focuses on the stream processing backend, whose purpose is to connect to the Twitch API, analyze its data, and extract clips.

[Figure: Stream Processing Architecture]

I — Current Architecture

Our stream processing pipeline consists of several components:

  • A Node.js service, called the emitter, which embeds an IRC client. It connects to the top forty streams on Twitch. Using event-based handlers, every message sent through the IRC channels is received by our client and sent to a time series database, InfluxDB. The content of the message is not stored, as it would not be relevant in our case; only an entry recording that a message occurred is inserted (a sketch of the emitter is shown after this list).
  • InfluxDB holds two measurements: live_messages and downsampled_live_messages. A continuous query downsamples the data from the first measurement into the second in order to give us a meaningful window to work on: it aggregates live messages over a ten-second window (a sketch of such a query follows this list).
[Figure: Influx Downsampling Process]
  • Kapacitor is a stream processing engine. It subscribes to the metrics written to InfluxDB and applies custom scripts to them. It is mainly used as an anomaly detection system, coupled with built-in functions that manipulate time series data. In our case, as described in the first article, we use Kapacitor to join time series and compare the moving average against the derivative. A sketch of the corresponding TICKscript is shown after this list.
[Figure: Kapacitor Architecture]
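The article does not include the emitter's source, but a minimal sketch of it could look like the following, assuming the tmi.js IRC client and the node-influx driver (the channel list, host, and database name are placeholders):

    // Emitter sketch: an anonymous IRC client recording one entry per chat message.
    const tmi = require('tmi.js');
    const Influx = require('influx');

    const influx = new Influx.InfluxDB({
      host: 'localhost',   // placeholder host
      database: 'twitch',  // hypothetical database name
    });

    const client = new tmi.Client({
      // In the real service, the top forty channels would be resolved dynamically.
      channels: ['channel1', 'channel2'],
    });

    client.on('message', (channel, userstate, message, self) => {
      // The message content is discarded; only its occurrence is recorded.
      influx.writePoints([
        { measurement: 'live_messages', tags: { channel }, fields: { value: 1 } },
      ]).catch((err) => console.error(`InfluxDB write failed: ${err.message}`));
    });

    client.connect();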
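The continuous query itself could be written as follows; this is a sketch assuming a database named twitch, a count aggregation, and a per-channel tag, none of which are spelled out above:

    -- Downsample raw message entries into ten-second counts per channel.
    CREATE CONTINUOUS QUERY "cq_messages_10s" ON "twitch"
    BEGIN
      SELECT count("value") AS "count"
      INTO "downsampled_live_messages"
      FROM "live_messages"
      GROUP BY time(10s), "channel"
    END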
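The original TICKscript is embedded in the article and not reproduced here; a minimal sketch of the join-and-compare logic it describes could look like this (the field names, window sizes, threshold, and receiver URL are assumptions):

    // TICKscript sketch: compare the message-rate derivative against its moving average.
    var rate = stream
        |from()
            .database('twitch')
            .measurement('downsampled_live_messages')
            .groupBy('channel')
        |derivative('count')
            .unit(10s)
            .as('rate')

    var avg = stream
        |from()
            .database('twitch')
            .measurement('downsampled_live_messages')
            .groupBy('channel')
        |movingAverage('count', 30)
            .as('avg')

    rate
        |join(avg)
            .as('rate', 'avg')
            .tolerance(10s)
        |alert()
            // Alert when the instantaneous rate spikes well above the moving average.
            .crit(lambda: "rate.rate" > "avg.avg" * 2.0)
            .post('http://receiver:3000/alerts')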

Kapacitor ships with a built-in alerting system. As soon as our condition is matched, a new alert is raised to our custom endpoint, called a receiver. When receiving a new alert, the receiver ensures that the credentials used are still valid (no token expiration) before creating a clip. If the credentials are valid, the receiver immediately requests a new clip from the Twitch API, and the clip is then stored in MongoDB. A sketch of the receiver is shown below.
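This sketch assumes an Express endpoint, the axios HTTP client, Kapacitor's standard alert JSON (with the channel carried as a series tag), and the Helix Create Clip endpoint; getValidToken and resolveBroadcasterId are hypothetical helpers standing in for the credential and channel-lookup logic:

    // Receiver sketch: Kapacitor POSTs alerts here; a valid token triggers clip creation.
    const express = require('express');
    const axios = require('axios');
    const { MongoClient } = require('mongodb');

    const app = express();
    app.use(express.json());

    app.post('/alerts', async (req, res) => {
      try {
        // Kapacitor's alert payload carries the series tags, including the channel.
        const channel = req.body.data.series[0].tags.channel;
        const token = await getValidToken();                       // hypothetical: refreshes expired tokens
        const broadcasterId = await resolveBroadcasterId(channel); // hypothetical lookup
        const response = await axios.post(
          `https://api.twitch.tv/helix/clips?broadcaster_id=${broadcasterId}`,
          null,
          { headers: { 'Client-ID': process.env.TWITCH_CLIENT_ID, Authorization: `Bearer ${token}` } },
        );
        const clip = response.data.data[0]; // contains the clip id and edit URL
        const mongo = await MongoClient.connect('mongodb://localhost:27017');
        await mongo.db('highlights').collection('clips').insertOne(clip);
        res.sendStatus(201);
      } catch (err) {
        res.status(500).send(err.message);
      }
    });

    app.listen(3000);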

II — About Scalability

I built my application with the goal of monitoring every single stream on Twitch. Being currently limited in terms of VMs, I chose to focus on the top forty streams.

Influx is very powerful when it comes to writing points to its databases. I use the standard JavaScript API to write my points. To improve on that, we could use the line protocol, which is much faster, coupled with batch processing (sketched below). Another architectural possibility is to use Influx clusters.
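A sketch of what batched line protocol writes could look like, using InfluxDB 1.x's /write HTTP endpoint (the host, database, and flush interval are placeholders):

    // Buffer points in memory and flush them in a single line protocol request.
    const axios = require('axios');

    const batch = [];

    function record(channel) {
      // Line protocol: measurement,tag_set field_set timestamp
      batch.push(`live_messages,channel=${channel} value=1 ${Date.now()}`);
    }

    setInterval(async () => {
      if (batch.length === 0) return;
      const body = batch.splice(0, batch.length).join('\n');
      // precision=ms tells InfluxDB the timestamps above are in milliseconds.
      await axios.post('http://localhost:8086/write?db=twitch&precision=ms', body);
    }, 1000);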

[Figure: Influx & Kapacitor clusters]

The architecture consists of a cluster of Node.js services sending all received messages to a central load balancer. Writes are then forwarded through relays to the different Influx instances in the cluster. Each relay is connected to every single Influx instance and duplicates incoming writes to all of them, keeping data consistent across the cluster.
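Assuming the influxdata/influxdb-relay project, the relay's configuration could be sketched as follows, with one HTTP listener fanning every write out to two Influx instances (names and addresses are placeholders):

    # influxdb-relay sketch: each write received on :9096 is duplicated to both instances.
    [[http]]
    name = "twitch-relay"
    bind-addr = "0.0.0.0:9096"
    output = [
        { name = "influx-a", location = "http://influx-a:8086/write" },
        { name = "influx-b", location = "http://influx-b:8086/write" },
    ]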

The final product is accessible using this link: Stream Highlights

As a dedicated and passionate engineer, I would really value your feedback on my application; feel free to reach out to me on the social networks below.

SCHKN

LinkedIn

GitHub
