How to make the Elephant sing!

“two brown elephants near body of water” by Archie Fantom on Unsplash

Recently I had an interesting discussion with some Asian banking thought leaders about how Big-Data projects are losing steam. Many projects are reported to create little or no value. Is the recent merger of the Hadoop distributors Cloudera and Hortonworks also a bellwether of this? Maybe.

But what happened to the prediction that Hadoop would generate strong growth? The Hadoop market was $7.69 billion in 2016 and was forecast to grow to $87.14 billion by 2022, a CAGR of about 50%.
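As a quick sanity check on that forecast, the growth rate can be recomputed from the two market-size figures. This is a minimal sketch that assumes six compounding years between 2016 and 2022; the function name `cagr` is only illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.5 means 50%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the forecast, in billions of dollars.
hadoop_cagr = cagr(7.69, 87.14, years=2022 - 2016)
print(f"Implied CAGR: {hadoop_cagr:.1%}")
```

The implied rate comes out just under 50%, so the quoted figures are consistent with each other.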

Something surely changed in 2017–2018. Early adopters of Big-Data technologies (2013–14) completed their large and expensive implementations by 2017. They built huge data lakes capable of reliably storing data from varied sources. The “Schema Later” approach made dumping the data easier, but what about making this data available for reporting? That is still expensive, demands highly skilled people, and relies on traditional approaches like BI tools. Easy “Plug-and-Play” data retrieval options are not feasible at this point.

And what about those scenarios where immediate is also too late? In situations where a machine, device or website failure needs to be predicted before it happens, storing your logs in a data lake without being able to use them for predictive modeling is worthless.

Clive Humby is widely credited with the phrase “Data is the new Oil”. But is this oil useful if it stays in the refinery and cannot be put to use running automobiles and factories?

So how can we make sure that Big-Data investments are turned around to generate value?

Easy Data Retrieval

PySpark and other libraries make it possible to read directly from HDFS using Python. But unless you are a data scientist with these skills, making data available for analysis can take weeks or months.
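To give a sense of what reading from HDFS with PySpark looks like, here is a minimal sketch. It assumes the logs are stored as JSON files and that PySpark is installed and configured for the cluster; the helper names `hdfs_uri` and `load_logs`, and any host, port or path passed to them, are hypothetical.

```python
def hdfs_uri(namenode_host: str, port: int, path: str) -> str:
    """Build an hdfs:// URI from a NameNode host, port and file path."""
    return f"hdfs://{namenode_host}:{port}/{path.lstrip('/')}"


def load_logs(uri: str):
    """Read JSON files from HDFS into a Spark DataFrame."""
    # Imported here so hdfs_uri() stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()
    # "Schema later": Spark infers the schema from the files at read time.
    return spark.read.json(uri)
```

Calling `load_logs(hdfs_uri("namenode.example.com", 8020, "/data/lake/device_logs/"))` with your own cluster details would return a DataFrame that can be inspected with `printSchema()` and `show()`. Even this short example presumes a working Spark setup and knowledge of where the data lives, which is exactly the skills barrier described above.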

There is definitely a market for a plug-and-play platform that extracts data from HDFS for latent analysis.

Real-Time Analytics

Stream processing technologies like Amazon Kinesis, combined with Kinesis Analytics to process, aggregate and visualize streaming data, can make immediately actionable intelligence possible.

Typical Real-Time Analytics Architecture
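The "process and aggregate" step in such a pipeline often boils down to counting events per fixed time window. The sketch below illustrates that idea in plain Python; it is a stand-in for the concept, not the Kinesis Analytics API, and the event shape, function names and threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (epoch seconds, device id) for one error event


def tumbling_error_counts(
    events: Iterable[Event], window_seconds: int = 60
) -> Dict[Tuple[int, str], int]:
    """Count error events per device in fixed, non-overlapping windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, device in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, device)] += 1
    return dict(counts)


def devices_over_threshold(
    counts: Dict[Tuple[int, str], int], threshold: int
) -> List[str]:
    """Return devices that reached the error threshold in any window."""
    return sorted({device for (_, device), n in counts.items() if n >= threshold})
```

In a real deployment the events would arrive continuously from a stream consumer rather than from an in-memory list, and the flagged devices would feed an alert or a dashboard.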

Apache Kafka is also a dominant force in stream processing, but companies may prefer Amazon Kinesis to Kafka because of productivity concerns with open-source technologies.

In the next post, we will take a deep dive into what it takes to build real-time streaming applications.