Putting the AI cart before the data horse — “DataOps” as part of a new enterprise stack
There has been an explosion of interesting enterprise AI applications over the past 3 years, with much of the hype well deserved. Enterprise executives are taking note of the compelling applications of AI to tackle anything from process automation to churn prediction to next-best-action recommendations for service reps.
While I empathize with executives’ interest in pursuing the promise of a competitive-advantage-in-a-black-box, it’s easy to miss a less obvious enabler consistent across the companies releasing the newsworthy innovations. In short, Google, Baidu, Microsoft, and kin can pursue these big AI initiatives because they have the clean, unified data to feed them.
Many other companies risk putting the “AI cart before the data horse” (a great phrase I can’t claim to have coined myself) because their data is inherently messy and siloed. It’s also important to note that the common disarray of enterprise data is more an organic outcome of a growing, multi-product business than an outcome of mismanagement.
Why is it worth addressing now?
Big data has been a hot topic for nearly 30 years now, so what has changed in the past 5 years? A few key drivers:
- In a word, variety: the number of sources, types of data, and schemas in use has expanded beyond the scope of what a reasonably resourced team could possibly wrangle. For example, the number of data sources feeding a single line-of-business CRM can run into the hundreds or thousands depending on the industry, compared to 1–10 sources just a few years ago.
- The torrent of AI/ML news showcasing tangible business value will help CIOs and CDOs command larger budgets and broader influence across the enterprise. Gartner, for one, estimates a ~20% year-over-year increase in CDO budgets. Overall, this yields a TAM of $4.1B, growing fast across two subsegments: (1) data integration: $2.7B, 6.7% CAGR; (2) data preparation: $1.4B in 2017, projected to reach $4.9B by 2022.
- Data volume and criticality are set to triple over the coming 5 years, per some interesting research from Seagate (source).
The solution — “DataOps”
Inspired by the core thesis behind Tamr (where I’m currently working for the summer), “DataOps” describes a new enterprise stack covering all aspects of an agile enterprise data pipeline. Capping the stack is Business Intelligence 4.0: predictive analytics powered by ML and fed by the newly cleaned, actionable data. Together, “Big Data + AI” is becoming the default stack on which many modern applications, particularly in the enterprise, are built.
Analogies to DevOps
The DataOps stack is inspired by parallels to DevOps, and I found the comparison useful for segmenting companies in the stack. Just as DevOps focuses on delivering feature velocity, DataOps emphasizes delivering data velocity. Similarly, the trend is toward a lot more people using a lot more data.
I believe that a similar “DataOps” stack (more accurately, a continuous feedback loop) can also be constructed. Although far from comprehensive, and with many companies bleeding over into adjacent segments, a high level DataOps stack could look like the following:
Just like in DevOps, there are plenty of free and open source solutions available. Apache Airflow, Kafka, Spark, and Cassandra are some of the more broadly adopted tools. Interoperability and REST API connections between each component in the stack are also just as important as they are in DevOps, for many of the same reasons. Finally, I think the analogy also extends to market structure — where we can expect to see many best-in-class companies playing within a narrow range of the stack, but with few fullstack solutions. I think this structure will continue simply because it’s difficult to be good at so many disparate tasks, and because of the adoption challenges of getting data engineers & data scientists to abandon their favorite tools.
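To make the feedback-loop idea concrete, here is a toy sketch in plain Python of a pipeline where a QA gate sits between data unification and publishing. It is not tied to Airflow, Kafka, or any product above; the stage names and checks are purely illustrative.

```python
# Toy sketch of a DataOps-style pipeline with a QA feedback gate.
# All stage names and rules here are illustrative, not any vendor's API.

Record = dict

def ingest() -> list[Record]:
    # In practice this stage would pull from many heterogeneous sources.
    return [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]

def unify(records: list[Record]) -> list[Record]:
    # Deduplicate / standardize; here we simply drop incomplete records.
    return [r for r in records if r["email"] is not None]

def qa_check(records: list[Record]) -> bool:
    # A governance gate: refuse to publish empty or malformed output.
    return len(records) > 0 and all("id" in r for r in records)

def run_pipeline() -> list[Record]:
    raw = ingest()
    clean = unify(raw)
    if not qa_check(clean):
        # The "feedback" half of the loop: halt and alert rather than
        # silently feeding bad data to downstream BI and ML consumers.
        raise ValueError("QA gate failed; halting before publishing")
    return clean
```

The design point is simply that QA and monitoring are first-class stages in the loop, not afterthoughts bolted onto the end of a one-way pipeline.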
Open Questions for emerging DataOps companies
Before digging into individual startups, it’s worth exploring some market-wide challenges:
- There is plenty of competition. While DataOps as a discipline may be new, some layers in the stack have entrenched and powerful incumbents. For example, visualization and databases/storage seem particularly crowded, with heavy hitters like Tableau and AWS/GCP, respectively.
- Free tools like those from Apache pose another substitution risk. Airflow, Cassandra, and kin offer a tough-to-match baseline set of features that many data engineers learn to love, and use for free, while they are still in school.
- Demonstrating proof of value to B2B customers. Winners will need to make a clear case for the hard dollars saved (or other business value) behind their particular breed of data integration tool. This is challenging when your startup is only 1 of 10 interdependent players in a data pipeline.
- Commoditization by Big Tech, specifically AWS. Are there any software verticals that Amazon won’t try to break into? Amazon already offers a range of tools to help with AI/ML and big data applications including products at almost every layer in the stack above. “Big Data on AWS” lists their full product offering, and it’s pages of scrolling. Based on past Amazon interests and releases, I expect the infrastructure and analytics layers to be the most affected.
- Timing — is the market ready yet? While impossible to answer, I found it interesting to overlay the founding dates of some notable DevOps companies with the Google Trends reports for search interest. Most of the notable companies were founded before DevOps registered even the slightest blip as a search trend. Similarly, Gartner positions DataOps at “5–10 years until plateau” as the newest addition to its data management landscape.
Which startups are positioned to benefit most from enterprise adoption of DataOps?
In a separate post I’ll be exploring the specific startups I think are the most compelling bets on the DataOps trend. In short, due to many of the risks described above, I think the greatest opportunity could lie in the bottom loop of the feedback cycle, focused on QA, monitoring, and governance for the overall pipeline. I’m bullish on Quilt Data, DataKitchen, Algorithmia, Unravel, and Pachyderm.io in particular, and I’ll be digging into how each stacks up.
P.S. For some super interesting examples of real-world challenges of DataOps, some Google authors have a great (but long) paper on “Data Management Challenges in Production Machine Learning”