WHY DEVOPS NEEDS DATAFLOW ANALYTICS

August 25, 2015

by Henri Dubois-Ferriere

At Jut, we’ve placed a dataflow language at the center of our platform for analytics and visualization. In this blog post, we want to tell you why we think dataflow is so well suited to data analytics.

While we’ll use Juttle, our dataflow programming language, to illustrate our points, most of them are equally applicable to other dataflow languages and libraries. We just happen to think that Juttle is the sweetest and simplest dataflow language around (no surprise there!), but we’re fans of many other incarnations (such as Riemann, Spark Streaming, or Storm), and think they’re great in their respective domains. The target domains of Juttle are:

  • log analytics
  • metric analytics, and
  • user analytics.

These are the kinds of analytics that matter to “devops” and the other data-centric ops, for example the data-hungry users sometimes referred to as “marketing ops” and “growth ops”. A common thread across all of them is the interplay of unstructured (or semi-structured) data such as logs and alerts with structured data such as system and application metrics; being able to analyze these different data types within one framework is a requirement for the sanity of all involved. That’s what dataflow can provide.

Dataflow this, dataflow that?

The term ‘dataflow’ comes in many flavors, with related but different meanings in contexts that include hardware architecture, signal processing, and programming languages. So let’s clarify what we mean by ‘dataflow’. The context we’re interested in here is programming languages, and this definition (from Wikipedia) captures it well:

“Dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations”

Simple and general. But also very different from the imperative or functional programming paradigms, which (very roughly) express computation as a sequence of steps that modify program state, or as the evaluation of expressions; in neither case does the concept of a data flow graph appear in any way.
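To make “a directed graph of the data flowing between operations” concrete without any Juttle at all, here is a tiny sketch in plain Python (purely illustrative, and nothing like Jut’s actual runtime): a source node feeds a filter node, which feeds a counting node.

# Toy dataflow: a program is a chain of operations, and data points
# flow through it. A sketch in plain Python, not Juttle.

def source():
    # emit a few data points (dicts), analogous to log events
    for code in (200, 500, 200, 500, 500):
        yield {"response_code": code}

def filter_errors(points):
    # keep only the points whose response_code is 500
    for p in points:
        if p["response_code"] == 500:
            yield p

def count(points):
    # reduce the stream to a single number
    return sum(1 for _ in points)

# wire up the graph: source -> filter -> count
print(count(filter_errors(source())))   # prints 3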

A simple example

Let’s look at a devops-inspired example to illustrate dataflow. In this example, we have a live stream of web logs coming in, and we want to compute and visualize two key statistics:

  • the number of errors for each individual user, updated every minute, and
  • the number of errors for each account (each user is part of one account; there are more accounts than users), charted over time.

The Juttle program to do this (which you can see in action in the Juttle Playground) is:

(
  events response_code = 500
  | reduce -every :1m: count() by hostname;
  source metadata
)
| join hostname
| (
  sort first_name
  | @barchart;
  reduce company_error_count = sum(error) by company
  | @timechart
)

In the above Juttle program, we have a number of ‘processors’ that are chained together using the ‘|’ operator. We also split and merge the stream in a couple of places using parentheses. A visual way to represent the dataflow topology is like so:

[Figure: the flowgraph of the program above. Dataflow allows you to take metrics, logs, and events and combine them into streaming analytics in a flexible way.]

In the above, each node represents an operation, and data points flow from the left to the right. A directed graph of data flowing between operations! Now, let’s see why that’s a good thing.

Easy mix of batch and live

With dataflow, an analytics computation is expressed in the same way whether it operates on live or historical data. The runtime behavior is not the same: in one case the flowgraph is long-lived and incoming data is continuously fed into it; in the other, the relevant historical data (from storage or from a database) is read in and processed, and the computation completes. But that all happens under the hood at runtime, and just as you don’t need to worry about distributed operation when writing a dataflow program, you don’t need to worry about live vs. batch data. Except for specifying which kind of data you’re interested in, of course! Take a look at this example in the Juttle Playground to see a flowgraph that processes both live and historical data.
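As a rough sketch of that idea in plain Python (not Juttle, and not how Jut’s engine works), the same pipeline function can be pointed at a finite historical collection or at an unbounded live stream; only the source changes:

import itertools
import random

def errors_per_window(points, window=5):
    # tumbling count of error points per `window` points -- the same
    # pipeline regardless of where the points come from
    batch = []
    for p in points:
        batch.append(p)
        if len(batch) == window:
            yield sum(1 for q in batch if q["status"] == 500)
            batch = []

def historical_source():
    # finite "batch" input, e.g. read back from storage
    return [{"status": random.choice((200, 500))} for _ in range(20)]

def live_source():
    # unbounded "live" input, e.g. requests arriving over time
    while True:
        yield {"status": random.choice((200, 500))}

# batch: runs to completion
print(list(errors_per_window(historical_source())))

# live: long-lived; here we just peek at the first three results
print(list(itertools.islice(errors_per_window(live_source()), 3)))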

A match for streaming, time-ordered data

Most data in devops is temporal in nature. Streams of metrics, logs, or events, all live in the time domain. And dataflow is a great match for streams of ordered data, because it abstracts away the time ordering and makes it implicit. Since points stream through the flowgraph in order, there is no need to worry about indexes or sorting of points. For example, let’s say we have a stream of web request logs and want to count the number of requests every 5 seconds. A really simple computation that should be equally simple to express. And in Juttle, it would be written as:

... | reduce -every :5s: count() | ...

whereas in SQL, it might be written as:

select from_unixtime(floor(unix_timestamp(time) / 5) * 5),
       count(*)
from table1
group by 1
order by 1;

(This isn’t to say that computations expressed in SQL are inherently more complicated than in a dataflow language like Juttle. For the most part, a relational query on non-temporal data would be simpler to express in SQL than in Juttle. The point is: different domains, different abstractions, different languages!)
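To see why the ordering helps, here is a rough Python sketch (again, not Juttle) of a 5-second tumbling count: because the points arrive in time order, a single pass with a moving window boundary is enough, with no index and no sort.

# Sketch of a tumbling-window count over a time-ordered stream.
def count_every(points, every=5):
    # `points` are dicts with a numeric "time" field, already in time order
    boundary, n = None, 0
    for p in points:
        if boundary is None:
            boundary = p["time"] - p["time"] % every + every
        while p["time"] >= boundary:
            # close the current window (empty windows are emitted too)
            yield {"time": boundary, "count": n}
            boundary, n = boundary + every, 0
        n += 1
    if n:
        yield {"time": boundary, "count": n}

requests = [{"time": t} for t in (0, 1, 3, 5, 6, 12)]
print(list(count_every(requests)))
# three windows: 3 requests in [0,5), 2 in [5,10), 1 in [10,15)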

You might be wondering: what does dataflow do for data that is not temporal in nature, such as asset listings, or customer information stored in a SQL database? Well, we’ve made sure that Juttle can use and manipulate that data too, including individual records (‘points’ in Juttle parlance) that are not timestamped. For example, in the Juttle example above, we’re using a streaming join to annotate our primary stream of temporal events with data coming from a set of customer records.
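As a loose model of what such a join does (a Python sketch of the concept, not Juttle’s actual join semantics), you can think of the non-temporal records as a lookup table that each streaming point gets annotated from as it flows by:

# Sketch: annotate a temporal stream with fields from static,
# non-timestamped records -- a simplified model of a streaming join.

customers = {
    "cust-1": {"company": "Acme"},
    "cust-2": {"company": "Globex"},
}

def join_customers(points, table):
    # attach the matching customer record to each streaming point
    for p in points:
        yield {**p, **table.get(p["cust_id"], {})}

stream = [
    {"time": 1, "cust_id": "cust-1", "error": 1},
    {"time": 2, "cust_id": "cust-2", "error": 0},
]
for point in join_customers(stream, customers):
    print(point)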

Declarative

Declarative programming allows the user to focus on the “what”, not the “how”. Mapped to the domain of dataflow, a declarative approach encourages you to think of your data processing as a sequence of transformations on your data. You don’t need to worry about the mechanics of flow control, pushing vs. pulling data, buffering, iterating over individual records, distributed operation, and so on; you just declare the sequence of transformations needed to get to your result. (Of course, that sequence of transformations might itself be complex if you’re doing complex data processing, but with a declarative dataflow language you can focus on the computation itself rather than on its mechanics.)
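For a rough sense of the contrast in plain Python (not Juttle), the imperative version of even a small aggregation drags in the mechanics of iteration and accumulators, while the more declarative version reads as a single stated transformation:

from collections import Counter

requests = [{"host": "a"}, {"host": "b"}, {"host": "a"}, {"host": "a"}]

# The "how": explicit iteration, an accumulator, manual bookkeeping.
counts = {}
for r in requests:
    if r["host"] not in counts:
        counts[r["host"]] = 0
    counts[r["host"]] += 1

# Closer to the "what": declare the result as one grouping transformation.
declared_counts = Counter(r["host"] for r in requests)

print(counts, dict(declared_counts))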

Of course, the more declarative the language is, the more removed programs in that language become from the actual execution of the program. And that gap must be filled by… a more advanced compiler and run-time. We’re working hard to make things easy!

Composable

Dataflow is inherently composable: each processing node has the same kind of inputs and outputs, and just as individual processing nodes can be assembled into a chain (or another form of flowgraph), flowgraphs themselves can be assembled into larger flowgraphs. In Juttle, we’ve added subgraphs, a language construct that makes it easy to factor out and stitch together flowgraph elements in a modular way. For example, as we noted in our inaugural blog post, a core pattern for Juttle dataflow programs is:

query | analyze | view

Each of the three ‘stages’ above might be a single processor, or it might be a more complex sequence of processing nodes. For example, we might have a module with a number of sources that each extract a different type of customer event from our transaction events, and join them with customer information:

sub get_events(action) {
  (
    events apptype="transactions" && action=action;
    source http://path/to/customer/data.json
  )
  | join cust_id
}

export sub purchase_events() {
  get_events -action "purchase"
}

export sub refund_events() {
  get_events -action "refund"
}

export sub cart_add_events() {
  get_events -action "add_to_cart"
}

and we might have a simple analytics computation that counts events per user every hour and keeps the top users, followed by a table view. How do we assemble these? Just by importing the above and putting any of these sources ahead of our analytics and visualization pipeline, like so:

import 'sources.juttle' as sources;

sub analyze() {
  batch :1h: | reduce count() by user_id | sort count | head 5
}

sources.purchase_events | analyze | @table

Easy. And of course, we could switch out the purchase_events source with another one in the above example, or similarly change the analysis or the view portion of the flowgraph. (If this feels like flowgraph Lego, you might not be surprised to learn that Lego is headquartered in… Jutland!)

Easy distributed operation

Big data, fast data… it’s pretty much a given that any data analytics system needs to be able to scale out. Even if you start out with a small deployment of just a couple of nodes, you’ll likely grow to more nodes over time. And irrespective of the number of nodes in your system, you definitely don’t want to think about the underlying distributed operation when expressing your analytics computation.

And that’s where dataflow is a really nice match: because each processor in a dataflow computation is (conceptually) a standalone element, there are a number of boundaries where the computation can be split, distributed, and even dynamically migrated in response to changing load. You declare your analytics computation, and the runtime figures out how to make it run.
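A toy way to see the “split at a boundary” idea (a Python sketch with threads and a queue; Jut’s engine does not work this way): because every stage consumes and produces the same kind of stream, two stages can be moved onto separate workers with a queue at the boundary between them.

import queue
import threading

SENTINEL = object()   # marks the end of the stream

def stage_filter(inbox, outbox):
    # stage 1: pass along only error points
    while True:
        p = inbox.get()
        if p is SENTINEL:
            outbox.put(SENTINEL)
            return
        if p["status"] == 500:
            outbox.put(p)

def stage_count(inbox, results):
    # stage 2: count whatever reaches it
    n = 0
    while True:
        p = inbox.get()
        if p is SENTINEL:
            results.append(n)
            return
        n += 1

q1, q2, results = queue.Queue(), queue.Queue(), []
threading.Thread(target=stage_filter, args=(q1, q2)).start()
counter = threading.Thread(target=stage_count, args=(q2, results))
counter.start()

for status in (200, 500, 500, 404, 500):
    q1.put({"status": status})
q1.put(SENTINEL)
counter.join()
print(results)   # [3]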

The good old Unix shell model comes to mind: a fully concurrent solution that can leverage multiple cores/CPUs, with a sophisticated under-the-hood implementation, all abstracted away from the shell user writing piped commands. (Full disclosure: while the Jut Data Engine will be clustered at launch, it won’t initially be able to run a single flowgraph in a distributed way; each program will run on a single node of the cluster.)

Wrap-up

We think the dataflow paradigm is a natural fit for “* ops” analytics, where the ability to process and manipulate data of multiple types (logs, events, metrics) in a unified way, across both historical (batch) and live domains, is invaluable. Devops is the most prominent of these “* ops”, but marketing and growth teams are also increasingly asking combined questions across live+historical, event+metric data that are best expressed with dataflow. Please check out the Juttle Playground to get a taste of Juttle-flavored dataflow for yourself!
