At Redpoint, we’ve invested in more than 15 data companies and deployed $250M+ in capital over the last few years. We’re long-time believers in the data/ML infrastructure and analytics markets, which aren’t slowing down. According to IDC, the global Big Data and business analytics market reached approximately $189B in 2019 and is expected to expand dramatically to $274B by 2022, a ~13% CAGR over the period.
This is an incredibly dynamic category, and I’m passionate about analyzing and evaluating what’s coming next (like data security and synthetic data). My research seeks to unearth seminal insights that ultimately help move the field forward. Below are our thoughts on four key 2020 Big Data trends: 1) data quality; 2) data catalogs; 3) KPI observability; and 4) streaming.
1. Data Quality
Data quality management ensures data is fit for consumption and meets the needs of data consumers. To be of high quality, data must be consistent and unambiguous. You can measure data quality along dimensions including accuracy, completeness, consistency, integrity, reasonability, timeliness, uniqueness, validity, and accessibility. Data quality issues are often the result of database merges or systems/cloud integration processes in which data fields that should be compatible are not, due to schema or format inconsistencies. Data that is not high quality can undergo data cleansing to raise its quality.
Currently, most companies have no processes or technology to identify “dirty data.” Typically someone downstream spots the error, and then the data platform or engineering team must manually trace it and fix it. It is time-consuming, tedious work (taking up to 80% of data scientists’ time), and it’s the problem data scientists complain about most.
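To make the dimensions above concrete, here is a minimal sketch of rule-based data quality checks covering completeness, uniqueness, and validity. The record fields and rules are hypothetical, not drawn from any particular product:

```python
import re

def check_quality(records):
    """Return (row_index, issue) pairs for a list of dict records."""
    issues = []
    seen_ids = set()
    for i, row in enumerate(records):
        # Completeness: required fields must be present and non-empty.
        for field in ("id", "email"):
            if not row.get(field):
                issues.append((i, f"missing {field}"))
        # Uniqueness: the primary key must not repeat.
        if row.get("id") in seen_ids:
            issues.append((i, "duplicate id"))
        seen_ids.add(row.get("id"))
        # Validity: email must match a basic pattern.
        email = row.get("email", "")
        if email and not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", email):
            issues.append((i, "invalid email"))
    return issues

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "not-an-email"},  # duplicate id + invalid email
    {"id": 2, "email": ""},              # missing email
]
print(check_quality(rows))
# [(1, 'duplicate id'), (1, 'invalid email'), (2, 'missing email')]
```

In practice these rules would run continuously in a pipeline rather than on demand, which is exactly the automation gap the emerging data quality vendors aim to fill.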
High data quality is critical if companies are to depend on their data, and the perils of bad data are numerous. While the caustic observation “garbage in, garbage out” has plagued analytics and decision-making for generations, it carries a special warning for Machine Learning (ML), since developing a model takes significant time. If an ML engineer spends time training and serving an ML model built with bad data, the incorrect model will be ineffective in production and can have negative secondary implications for user experience and revenue. An O’Reilly survey found that respondents with mature AI practices (as measured by how long they’ve had models in production) cited a “lack of data or data quality issues” as the main bottleneck holding back further ML adoption.
Data quality is foundational to businesses’ human and machine decision-making. Dirty data can result in incorrect values in dashboards and executive briefings. Additionally, we’ve heard about bad data leading to product development decisions that cost corporations millions of dollars in engineering effort. And machine-made decisions based on bad data can lead to biased or inaccurate actions.
2. Data Catalogs
According to Alation, a data catalog is “a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate the fitness of data for intended uses.” Catalogs capture rich information about data, including its application context, behavior, and change. We are interested in data catalogs because they support self-service data access, empowering individuals and teams. With data catalogs, analysts avoid the slow process of working with IT to receive data and can discover relevant data for themselves, improving productivity. Additionally, data catalogs can help with compliance by collecting information about data usage, data access, and PII.
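A toy model makes the definition concrete: a catalog is essentially searchable metadata about datasets. The entry fields, dataset names, and compliance view below are all hypothetical, a sketch of the concept rather than any vendor’s schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Metadata about one dataset: context, ownership, sensitivity."""
    name: str
    owner: str
    description: str
    contains_pii: bool = False
    tags: list = field(default_factory=list)

class DataCatalog:
    def __init__(self):
        self.entries = []

    def register(self, entry):
        self.entries.append(entry)

    def search(self, keyword):
        """Self-service discovery: match a keyword against names,
        descriptions, and tags."""
        kw = keyword.lower()
        return [e for e in self.entries
                if kw in e.name.lower()
                or kw in e.description.lower()
                or any(kw in t.lower() for t in e.tags)]

    def pii_datasets(self):
        """Compliance view: every dataset flagged as containing PII."""
        return [e.name for e in self.entries if e.contains_pii]

catalog = DataCatalog()
catalog.register(CatalogEntry("orders", "data-eng", "Daily order facts", tags=["sales"]))
catalog.register(CatalogEntry("users", "data-eng", "User profiles", contains_pii=True))
print([e.name for e in catalog.search("order")])  # ['orders']
print(catalog.pii_datasets())                     # ['users']
```

Real catalogs add lineage, usage statistics, and automated metadata harvesting on top of this inventory-plus-search core.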
There are both commercial and open source data catalogs. Commercial data catalogs include Collibra, Waterline Data, Alation, Atlan, Ataccama, Zaloni, Azure Data Catalogue, Google Cloud’s Data Catalogue, IO-Tahoe, and Tamr. Collibra is furthest along its fundraising journey, having recently raised $112.5M at a post-money valuation of $2.3B. Many tech companies have open-sourced their data catalogs or spoken about them publicly, including Airbnb, LinkedIn, Lyft Netflix, Spotify, Uber, and WeWork.
3. KPI Observability
Most data-driven companies leverage business intelligence tools like Looker, Tableau, and Superset to track KPIs. While these operational systems can proactively send alerts when a metric crosses a certain threshold, the analyst still needs to drill down into the details to determine why the KPI changed. Diagnostics are still fairly manual.
We are seeing a new set of solutions that enable every business to understand what’s driving their key metrics. The operational analytics platforms help teams go beyond dashboards to unearth why their key metrics are changing. By leveraging machine learning, solutions can identify specific factors that are responsible for a KPI change. We believe there is an opportunity in this space because businesses want guidance around which underlying factors to focus on.
We divide the ecosystem into three categories: 1) anomaly detection/root cause analysis; 2) trend detection; and 3) data insights. Anomalies are typically sharp increases/decreases and operate at the single-metric level. Trend detection captures anomalies but more importantly captures drifts and changes in underlying makeup. Data insights uncover the unexpected from a myriad of data.
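The first category, single-metric anomaly detection, can be sketched with a simple rolling z-score: flag a point when it deviates from the trailing window by more than a threshold number of standard deviations. The window size, threshold, and metric below are illustrative; commercial systems use far more sophisticated models:

```python
import statistics

def detect_anomalies(series, window=7, z_threshold=3.0):
    """Flag indices where a value deviates more than z_threshold standard
    deviations from the trailing window's mean."""
    anomalies = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mean = statistics.mean(past)
        stdev = statistics.pstdev(past)
        if stdev > 0 and abs(series[i] - mean) / stdev > z_threshold:
            anomalies.append(i)
    return anomalies

# A hypothetical daily-signups metric with one sharp spike.
daily_signups = [100, 102, 98, 101, 99, 103, 100, 97, 250, 101]
print(detect_anomalies(daily_signups))  # [8]
```

Note what this sketch does not do: it says nothing about *why* index 8 spiked. Attributing the change to underlying factors (a region, a campaign, a device type) is the root-cause-analysis layer the vendors above differentiate on.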
There are a few businesses offering KPI observability. Anodot, Lightup, and Orbiter focus on anomaly detection and the underlying factors causing the change. Falkon and Sisu are focused on anomaly detection and trend detection. ThoughtSpot SpotIQ and Outlier try to produce the most important insights from massive amounts of data, without requiring human supervision/configuration. In the exhibit below, we’ve included vendors in all relevant categories.
4. Streaming
There is increased demand for businesses to make decisions and provide services in real-time, so businesses are moving to streaming communication, storage, and data processing systems. We believe that as teams continue to move from batch to streaming systems, there is a huge market opportunity.
A major player in the space is Kafka, which LinkedIn open-sourced in 2011. Kafka is a publish-subscribe system that delivers persistent, ordered, scalable messaging. Its architecture includes topics, publishers, and subscribers. Kafka can partition message topics and supports parallel consumption. Over the past decade, the technology evolved from a messaging queue to an event streaming platform.
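The core abstractions (topics, partitions, ordered offset-based consumption) can be illustrated with a toy in-memory model. Everything here is hypothetical and deliberately simplified; a real deployment would use a Kafka client library against a broker cluster:

```python
class Topic:
    """A toy Kafka-style topic: an append-only log split into partitions."""
    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Keyed messages hash to a fixed partition, so ordering is
        # preserved per key while partitions are consumed in parallel.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def consume(self, partition, offset=0):
        # A consumer reads one partition sequentially from an offset;
        # persistence of the log lets it re-read from any old offset.
        return self.partitions[partition][offset:]

clicks = Topic("clicks", num_partitions=2)
for i in range(4):
    clicks.publish(key="user-1", value=f"click-{i}")

p = hash("user-1") % 2
print([v for _, v in clicks.consume(p)])  # per-key order preserved
```

Partitioning is what makes the design scale: each partition is an independent ordered log, so adding partitions adds parallel consumption capacity without breaking per-key ordering.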
While Confluent, the company behind Kafka, is rumored to be raising at a $5B valuation, we’ve heard the solution is hard to implement and manage at scale. We’ve been told Zookeeper is particularly hard to manage, and although the team is replacing this component, user experience can be improved. Furthermore, we’ve heard maintenance can be challenging because the number of topics can grow large quickly, so teams have to consistently balance and upgrade instances.
There are new approaches to streaming like Apache Pulsar, which has a two-tier architecture in which serving and storage can be scaled separately. This is important for use cases with potentially infinite data retention, like logging, where events can live forever. Moreover, if you have to store all the messages, you don’t want everything on high-performance disks; with Pulsar you can tier older data out to S3, which Kafka can’t do. Pulsar also offers auto-rebalancing, which AWS Kinesis can’t do. We’ve also heard users express affinity for Pulsar’s lighter-weight client model compared with Kafka’s. Aside from Kafka and Pulsar, there are also other systems like NATS and Vectorized.
For real-time data processing, Apache Flink is the best known. Flink processes elements as they occur rather than processing them in micro-batches like Spark Streaming. A disadvantage of the micro-batch approach is that batches can be voluminous, requiring substantial resources to process; this can be particularly painful for inconsistent or bursty data streams. Another advantage of Flink is that you don’t need to discover, through trial and error, the appropriate micro-batch configuration. If a batch’s processing time exceeds its accumulation interval, batches start to queue up and eventually all processing comes to a halt. There are also newer streaming engines like Confluent KSQL and Timely Dataflow from the Materialize team.
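The contrast between the two models can be sketched in a few lines. The event stream and the per-event “work” below are made up; the point is only the shape of the two loops, one handling each element on arrival and one deferring work until a batch closes:

```python
def process_per_event(events):
    """Flink-style: handle each event as it arrives, so latency and
    resource usage are per-event and steady."""
    results = []
    for e in events:
        results.append(e * 2)  # processed immediately
    return results

def process_micro_batches(events, batch_size=3):
    """Spark-Streaming-style: accumulate fixed-size batches, then process
    each batch as a unit. A bursty stream can make a batch large and
    expensive, and slow batches queue up behind fast-arriving data."""
    results = []
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        results.extend(e * 2 for e in batch)  # deferred until batch closes
    return results

stream = [1, 2, 3, 4, 5]
print(process_per_event(stream))      # [2, 4, 6, 8, 10]
print(process_micro_batches(stream))  # same output, different latency profile
```

Both produce identical results on a finite stream; the difference that matters in production is when the work happens, which is exactly the tuning burden (batch size vs. accumulation interval) the paragraph above describes.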
ResearchAndMarkets predicts that the Global Event Stream Processing (ESP) market will grow from $690M in 2018 to $1.8B by 2023, a 22% CAGR over the period. We believe the market is growing faster than this based on our conversations with buyers.
Over the next year we’ll be watching the evolution of 1) data quality; 2) data catalogs; 3) KPI observability; and 4) streaming. If you or someone you know is working on a data/ML infrastructure and analytics project or startup, it would be great to hear from you. What trends are you seeing? Comment below or email me at firstname.lastname@example.org to let us know.