State of the data

Looking at the landscape of data processing pipelines currently around, it seems that everything converges towards two extremes: naive time-series datapoint filtering and counting exercises (e.g. IoT stuff, Splunk, dashboards), and heavy Big Data monsters meshing Hadoop, HANA, R, F# and liberal amounts of Python to arrive at the equivalent of 1GB of executable code required just to sum together two integers in a text file.
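
For contrast, the whole job of summing the integers in a text file fits in a handful of lines of plain Python. A minimal sketch (the file name is made up for illustration):

    # sum the integers in a text file, one per line
    # "numbers.txt" is a hypothetical example file
    with open("numbers.txt") as f:
        total = sum(int(line) for line in f if line.strip())
    print(total)

No cluster required.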

The first has the problem of the data being mostly useless. About the only useful metric I’ve seen applied in the wild is %Errors; everything else (especially historical trending) still mandates human interpretation and a reaction. Makes for nice colorful graphs, though. IoT is interesting (okay, okay, it’s an emerging field and everyone is still getting to grips with the state of play) as the state of the art starts to touch If->Then scenarios. Ingress wars will take a decade to settle down, though.
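
To pin down what “If->Then” means here, a minimal sketch in Python (the threshold, the reading shape, and the alert hook are all hypothetical stand-ins):

    # a minimal If->Then rule over incoming error counts
    ERROR_THRESHOLD = 0.05  # 5% errors, picked arbitrarily for illustration

    def notify(message: str) -> None:
        # stand-in for a real alerting channel
        print(f"ALERT: {message}")

    def on_reading(errors: int, total: int) -> None:
        # If the error rate crosses the threshold -> Then raise an alert
        if total and errors / total > ERROR_THRESHOLD:
            notify(f"error rate {errors / total:.1%} above threshold")

    on_reading(7, 100)  # ALERT: error rate 7.0% above threshold

The point being: the action is wired to the data, no human squinting at a graph in the loop.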

The second just saddens me. Although possibly useful in the unstructured ML data world dealing with actual map/reduce problems, the majority of applications I’ve seen treat time-series data as blobs in distributed storage to produce, you guessed it, bloody useless graphs again! And industry players are making (good) money from this! Mind-boggling.


“Real-Time” is slowly morphing to mean “Not A Once-Daily Batch Job”. I didn’t think anyone would associate the term with unpredictable latencies that average in the tens of seconds (and that is in a relatively well-designed system), but enterprises seem to be buying it.

A worrying side-effect of real-time label dilution is that people increasingly assume that store-and-forward systems like e-mail or mobile messaging (which isn’t just SMS anymore) are “Real-Time™” and can be relied upon. That IoT house-system notification of your dog being on fire? Arrived three hours later.


Data-hoarding is a fantastic driver for hardware and services sales (and costs), and nobody seems to question the sanity of it.

I itch to be in a position to ask a reasonably large sample of decision makers how often they have ended up re-interpreting hoarded data older than a second, an hour, a day, maybe a month.

Data is quite like money. Stationary data is not merely worthless, it’s a very real cost.
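
A back-of-envelope sketch of that cost (the price per GB-month is an assumed placeholder, not a quote; plug in your own figure):

    # rough monthly cost of keeping data at rest
    PRICE_PER_GB_MONTH = 0.02  # USD, an assumed illustrative rate

    def monthly_cost_usd(terabytes: float) -> float:
        # 1 TB = 1024 GB
        return terabytes * 1024 * PRICE_PER_GB_MONTH

    print(monthly_cost_usd(100))  # 100 TB at rest: 2048.0 USD/month, before egress and ops

And that meter runs every month whether anyone ever looks at the data or not.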

The crux of the problem: lots of options for acquisition and accumulation; few, custom, and expensive options for arriving at useful actions (not to mention the difficulty of defining what constitutes a useful signal or action in the first place).

If a business needs a multi-hour job on a 100TB dataset to highlight a trend that still needs an analyst and a business decision to act upon… something isn’t quite as rosy as promised.