What’s next? Major trends in data

Sushant Rao
Prophecy.io
3 min read · Apr 7, 2022


At my previous company (Cloudera), I got a chance to see firsthand the impact data was having on companies. Whether it was better customer retention, fewer fraudulent credit card transactions, improved supply chains, or better patient health outcomes, companies in all industries were able to improve their operations with data. Some of the use cases were really amazing. For example, here’s one where data scientists (actually, physicists) used data to turn refrigerated warehouses into giant cold-storage batteries, so the company didn’t need to draw power during the afternoons, when electricity is most expensive.

As I was thinking about what’s next, I saw three trends in the data space.

Everybody needs data

More and more people need access to data, analytics, and AI/ML to make better, smarter decisions. Sometimes referred to as “data democratization”, this is the recognition that companies need to allow “non-technical” employees to use data in making decisions and doing their jobs. But the tools in the data space are, quite frankly, awful. Even the best tools require a fair amount of technical expertise to use. While giving regular people access to data is a worthy goal, we are a long way from making this happen.

For example, say an inventory manager at a grocery store chain wants to see if there is a correlation between weather and which items sell more (or less). To do this, they need to get the daily sales by item for each store and then marry those numbers to the daily weather data for the same stores. The tools to enable the inventory manager to do this kind of exploratory data analysis don’t exist. The inventory manager will need to go to the analytics team and get them to do the analysis.

Data pipelines are mission critical

There are entire companies, such as Uber and Airbnb, whose core asset is data (yes, a slight exaggeration :-). In fact, at many companies, data is the key to running the business. Data pipelines ingest, process, and make the data available for reports, dashboards, and ML models. These pipelines are just as important as the applications that power a company’s website. Sometimes referred to as Data as Code, this means treating pipelines with the same best practices that were developed in software engineering.

In the above example, once the company has proven the relationship between weather and a store’s sales, they will want to move this pipeline to production. Once it’s in production, the company needs to make sure that any changes to the pipeline don’t break it. But, the tools for developing, testing, deploying, and versioning data pipelines either don’t exist or are difficult to use.
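What Data as Code looks like in practice is simple in principle: each pipeline step is written as ordinary code that can be unit-tested, versioned, and checked in CI before any deploy. A minimal sketch, with an invented transformation and schema, assuming the sales-plus-weather pipeline from the earlier example:

```python
# A pipeline step written as a pure function, so it can be unit-tested
# and versioned like any application code. The record format is
# illustrative, not a real schema.

def enrich_sales_with_weather(sales_rows, weather_by_date):
    """Join each sales record to that day's weather reading,
    dropping days with no weather data rather than guessing."""
    return [
        {**row, "temp_f": weather_by_date[row["date"]]}
        for row in sales_rows
        if row["date"] in weather_by_date
    ]

# A unit test that runs before every deploy, so a change to the join
# logic can't silently break the production pipeline.
def test_drops_rows_without_weather():
    sales = [
        {"date": "2022-04-01", "units": 120},
        {"date": "2022-04-02", "units": 95},  # no weather for this day
    ]
    weather = {"2022-04-01": 70}
    out = enrich_sales_with_weather(sales, weather)
    assert out == [{"date": "2022-04-01", "units": 120, "temp_f": 70}]

test_drops_rows_without_weather()
```

The point is not the particular join logic but the workflow around it: a versioned function, a test that encodes the expected behavior, and a gate that stops a breaking change from reaching production.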

Excellence in operating data pipelines

Once data pipelines are set up, they need to be monitored end-to-end to ensure the health and state of the data meets the company’s standards. Sometimes known as data quality, the main objective is to ensure the data accurately represents the real world. Because bad data = bad analytics = bad decisions. Now combine this issue with ever longer and increasingly complex data pipelines, and figuring out data lineage, especially at the column level, becomes extremely difficult. In the example of store sales being impacted by weather, if the source of the weather data has bad data, how does the company know? If the sales for a particular item don’t look right, how do they follow all the transformations the data went through, back to the source? These issues plague data engineers and analysts alike.
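The first line of defense is usually a quality gate at ingestion: validate records as they enter the pipeline, so bad weather data is caught at the source rather than discovered in a dashboard. A minimal sketch, with invented field names and thresholds (real systems typically use a framework for this, but the idea is the same):

```python
# A simple data quality gate for incoming weather records. Field names
# and sanity ranges are illustrative assumptions.

def check_weather_record(record):
    """Return a list of rule violations; an empty list means healthy."""
    problems = []
    temp = record.get("temp_f")
    if temp is None:
        problems.append("missing temperature")
    elif not -60 <= temp <= 130:
        # A reading outside plausible bounds usually means a sensor
        # or unit-conversion error upstream.
        problems.append(f"temperature out of range: {temp}")
    if not record.get("store_id"):
        problems.append("missing store_id")
    return problems

good = {"store_id": "S-17", "date": "2022-04-01", "temp_f": 72}
bad = {"store_id": "", "date": "2022-04-02", "temp_f": 900}
print(check_weather_record(good))  # no violations
print(check_weather_record(bad))   # out-of-range temp, missing store_id
```

A check like this answers the “how does the company know?” question for the ingestion step; lineage tooling is what would answer it for every transformation after that.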

So, what’s the answer? I explain in this blog post.
