What is data flow?

James Urquhart
Published in Digital Anatomy
Oct 18, 2016 · 6 min read

This publication, Digital Anatomy, is intended to help me (and anyone interested in following along) dive deeper into a concept I’ve been contemplating since I joined my current employer, SOASTA. Within a couple of weeks of joining that company, I was amazed at how much the space of application performance management (testing, monitoring, and optimization) had changed since the last time I ran an enterprise development team.

What changed? Two things. First, more data than ever is being collected. Second, and more importantly, more data than ever is being combined: correlated, aggregated, and statistically modeled in order to understand how our applications are really performing, in business terms.

The concept of data “moving” through the business (and between businesses) is something that I call, for lack of a better term, data flow. Data flow isn’t about the data itself as much as it is about how the use of data evolves as stakeholders in that data find new ways to glean value from it.

Why does data “flow”?

Years (oh, so many years) ago, I wrote a post about software fluidity. At the time, one of the biggest topics of discussion in “datacenter automation” — this was actually before I was using the term cloud regularly — was the ability to move VMs between servers, and ultimately data centers. Many in the industry thought this would be huge.

(It wasn’t, but why is an interesting story for another time.)

In defining the term “software fluidity”, I said:

The term I want to introduce to describe the ability of application components to migrate easily is “fluidity”. The definition of the term fluidity includes “the ability of a substance to flow”, and I don’t think it’s much of a stretch to apply the term to software deployments. We talk about static and dynamic deployments today, and a fluid software system is simply one that can be moved easily without breaking the functionality of the system.

Even though server fluidity failed to be a critical feature of cloud, software fluidity still has its place. Not so much in terms of moving an entire system without shutting it down, but certainly in terms of moving software components without rebuilding them. Containers are a great example of trends supporting this, as are standard OS components, etc.

Data, however, might turn out to be the most “viscous” of digital artifacts within enterprise digital software systems. We might move applications between clouds occasionally, reuse open source components more frequently, or even deploy code elements to production many times a day. Data, though, is constantly flowing; within any digital business, thousands or millions of individual data elements are processed, stored, retrieved, correlated, aggregated, and so on, every second.

Or, at least, they should be. But there are significant barriers to successfully enabling data flow within a business. In my opinion, most of this has to do with economics, but there are also social barriers created by organizational communication patterns — aka Conway’s Law.

But to answer the question posed in the heading of this section, data flows because if it is used correctly it creates information, which creates knowledge and insight, which creates competitive advantage.

Flow and Economics

As I noted earlier, perhaps the biggest thing hindering the free flow of information within an enterprise has simply been economics. The technologies needed to collect, store, transfer, and analyze data have traditionally been expensive. ETL (Extract, Transform and Load) and other data integration technologies alone were an estimated $2.5B business in 2014, growing at 10%, according to Gartner.

Open source software certainly had an impact on data flow, but I’d argue it was less about the tools than about the practices the open source communities were driving. APIs, for example, made getting access to data sources more intuitive and predictable. Streaming APIs, which enable realtime access to some data sources, are also seeing growing adoption.
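
To make the pull-versus-push distinction concrete, here is a minimal sketch of polling a data source over a conventional API. The endpoint, path, and parameters are hypothetical; the point is the pattern, not any particular service.

```python
# Minimal sketch of pulling data from a (hypothetical) metrics API.
# The base URL, path, and query parameter are illustrative only.
import json
import urllib.request


def fetch_recent_metrics(base_url: str, since_ts: int) -> list:
    """Poll the source for measurements recorded after since_ts."""
    url = f"{base_url}/v1/metrics?since={since_ts}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

# A streaming API inverts this interaction: instead of the consumer
# polling on a schedule, the source pushes each new record as it occurs
# over a long-lived connection, which is what enables realtime access.
```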

But I’d say the biggest impact on the economics of data flow so far has been the reduced cost of actually analyzing the data. This started with Hadoop (followed by Apache Spark), which enabled processing of massive amounts of data at rest, and is now being disrupted again by real-time data analysis and processing technologies, many of them coming from the Internet of Things (IoT) industry.
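
For a rough sense of what the “data at rest” side of that shift looks like, here is a minimal sketch using PySpark. The bucket path, column names, and the idea of summarizing page-load times per region are all hypothetical; real pipelines will look different, but the economic point is the same.

```python
# Minimal batch-analysis sketch (assumes pyspark is installed).
# The S3 path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-at-rest-sketch").getOrCreate()

# Load a (hypothetical) set of page-load measurements sitting at rest...
events = spark.read.json("s3://example-bucket/load-times/*.json")

# ...and reduce it to one summary row per region.
summary = (
    events.groupBy("region")
          .agg(F.avg("load_time_ms").alias("avg_load_ms"),
               F.count("*").alias("samples"))
)
summary.show()
```

Stream processing tools apply essentially the same kinds of aggregation continuously, as records arrive, rather than over a stored data set.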

And, of course, all of this is available today in the cloud, as a service, which greatly reduces the upfront financial risk of experimenting with these technologies.

The other thing that is changing, albeit much more slowly, is the availability of data analysts, visualization artists, and those with “data narrative” skills. As more people begin to understand when, why and how to extract value from data, more innovation will happen, and…

Data flow and Jevons Paradox

Image: William Stanley Jevons (Public Domain, Wikipedia)

Lowering the cost and increasing the availability of a technology (or capability) often has an interesting effect: it actually increases overall demand for the technology, and thus its overall value.

This effect, known as Jevons Paradox, is critical in the case of data flow. I believe that the easier and cheaper it becomes to move, integrate and analyze data in new ways, the more common it will be for businesses to explore how the different data sets they can access relate to better business outcomes.

Add in machine learning and other AI techniques, and it is possible that data relationships will begin to be explored digitally, and that data flows will be created and evolved independent of explicit human decisions. Think about that for a second…imagine computers routinely finding optimal business knowledge and insight efficiently, with minimal human direction.

In the end, efficient flows will bring an explosion of both benefits and challenges. As I’ve said, I’m excited about the possibilities here, but I’m also concerned that there may be emergent behaviors of which we should be cautious.

Why flow will crawl into being

Now, lest you think I’m saying the day of automatically shifting data flows is upon us, let me point out some serious roadblocks on the path to true flow simplicity:

  • Data gravity is still a thing. The cost of moving large amounts of data, in terms of both money and time, is still much too high for anything but the most basic flows. Some are willing to pay that price (high-speed traders, for example), but there is work to be done here. Interestingly, I think the three big cloud providers (AMZN, MSFT and GOOG) will be the most likely to solve this particular issue.
  • Data formats don’t make things easy. Yeah, you can more easily use tools and services to build connections between data sources, and between data sources and analytics platforms. However, there is a significant cost when it comes to processing that data: code must often be written to manipulate values and types, aggregate data into consumable “chunks”, and map data sets into the structures and formats receiving systems expect. (A small sketch of this kind of glue code follows this list.)
  • We have very little understanding of what metrics matter. Some industries have a better understanding than others, but in truth we often use simple-to-get-at metrics as a proxy for the much more complex metrics structure that better represents business conditions. Most analytics today are custom-built for each specific use case, although some common approaches are being productized. There are no really good commoditized analytics sets out there, unless you count cloud operations data (e.g. Amazon CloudWatch) or perhaps government analysis. (PLEASE, prove me wrong here. I’m eager to learn more.)
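
To illustrate the data-formats point above, here is a minimal sketch of that kind of glue code in Python. The source records, field names, and target schema are all hypothetical; the point is the shape of the work, not the specifics.

```python
# A sketch of the "glue" code the data-formats bullet describes: coerce
# types, aggregate raw records into consumable chunks, and map them onto
# the schema a receiving system expects. All names here are hypothetical.
from collections import defaultdict
from statistics import mean

raw_events = [
    {"ts": "1476800000", "region": "us-east", "load_ms": "412"},
    {"ts": "1476800005", "region": "us-east", "load_ms": "388"},
    {"ts": "1476800009", "region": "eu-west", "load_ms": "530"},
]

# 1. Manipulate values and types: strings from the source become numbers.
cleaned = [
    {"ts": int(e["ts"]), "region": e["region"], "load_ms": float(e["load_ms"])}
    for e in raw_events
]

# 2. Aggregate into consumable "chunks": group samples by region.
by_region = defaultdict(list)
for e in cleaned:
    by_region[e["region"]].append(e["load_ms"])

# 3. Map onto the structure the receiving system expects.
payload = [
    {
        "regionCode": region,
        "avgLoadTimeMs": round(mean(samples), 1),
        "sampleCount": len(samples),
    }
    for region, samples in by_region.items()
]

print(payload)
```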

I think there are other key issues (realtime vs. retroactive data, for instance) that will have to be thought through, but these three are big enough problems to start with; the rest are bridges to be crossed later.

Oh, and I have said nothing about security, which (as it is for the IoT world) is a huge problem to be addressed in data flow.

Believe me, the age of enabling true data flow has just begun. There are many, many years to go in this journey.

Let me know what you think about my concept of data flow and the opportunities and issues I’ve raised. Is this something you’ve been thinking about, yourself? Do you see elements of this in your work today? Are there specific topics you’d like me to explore?

Let me know in the Messages section below, or at James Urquhart on Twitter.
