The Data Matrix: Perils of Data Network Effects
The strength of data network effects depends on two often incompatible factors: geographic scale and the rate of data decay
The growth in internet adoption, among both businesses and consumers, has led to an explosion in the volume of data in the world. This has naturally led to more entrepreneurial interest, resulting in numerous data-based (“big data” or “artificial intelligence”) startups and new business models claimed to be built on “data network effects”. It is usually believed that more data leads to stronger data network effects, but this is often not the case. Data network effects are more misunderstood, much rarer, more difficult to establish, and even harder to recognize than “traditional” network effects.
NFX and a16z have previously explained the value of real-time data and how rare it is to find strong data network effects. Both are must-reads for any entrepreneur building a data startup. As I have previously done for marketplaces and networks, I will build on some of their ideas and take a look at data network effects through the lens of defensibility and scalability. But first, let’s begin with an explanation of what data network effects actually are.
What is and isn’t a data network effect?
“Traditional” network effects exist when the addition of a user makes the product more valuable for all users (same-side), or when the addition of a supplier makes the product more useful for all users and vice versa (cross-side). This feedback loop (sometimes called a “flywheel”) organically attracts new users (or suppliers), increases product value while lowering customer acquisition costs, and creates a deeper moat against competition. This is a core feature of network effects and results in organic, exponential, and defensible growth.
Data network effects exist when the addition of a user generates more data for the product, which then makes the product more valuable for all users. They are almost always one-sided, for reasons covered under “Feedback loop” below. The most well-known example of a successful data network effect is Waze. Allowing drivers to alert others in real time about accidents and road conditions made the product more useful for all drivers. This seems simple enough, but a data-centric business model needs to meet several criteria for data network effects to be present. These include the following:
- Proprietary data: The data required to improve the product has to be proprietary, i.e. generated from the startup’s own users or customers. Many startup founders assume they can create data network effects by scraping vast volumes of public data to train their algorithms. But public data can be accessed by anyone, so it creates neither defensibility nor a feedback loop with customers. Mattermark is one example of this. Mattermark attempted to create a data product for startup investors by aggregating public data. Despite being backed by high-profile VCs, it shut down in 2017 because numerous established competitors had access to the same data. On the other hand, ZoomInfo has built a viable business by getting its customers to share their internal contact lists with the community.
- Feedback loop: The data must improve the value of the product for data producers (users) and not just for a third party. Take, for example, a market data company like Rakuten Intelligence (previously called Slice Intelligence). The company offers consumers the ability to save money on purchases through a free tool for tracking email purchase receipts. The company monetizes by selling aggregated and anonymized e-commerce data to businesses, i.e. a market intelligence product. While the business model is “asymmetric”, and scale helps bring down the per-unit cost of data acquisition, it does not have data network effects. More businesses buying intelligence data does not increase the value of the consumer product, i.e. there is no feedback loop from businesses back to consumers. While third parties may be a way to monetize a model built on data network effects, their presence does not create them. So many third-party market intelligence companies, including Sense360, Onavo, SimilarWeb, Second Measure, 23andMe and Ancestry, don’t actually have any data network effects.
- Data ownership: Startups need to own the data in question to develop data network effects. For example, analytics startups like Looker help customers manage their own data, but they don’t have data network effects — one customer feeding more data into their analytics tool does not improve the value of the product for other customers. These startups can sometimes benefit from “embedding” if they can become a “system of record” (very few can), but that does not result in a data network effect. Of course, analytics tools can form one part of a model with data network effects, if (anonymized and aggregated) data is fed into a crowdsourced intelligence product for the same customer. But, again, the startup needs to own that data.
- Link with core value proposition: The data needs to strengthen the core value proposition of the product and not just peripheral features. Take Netflix and TikTok, for example. They use engagement data to improve their recommendation algorithm. While this surfaces more relevant content and improves user engagement, it does not change their core value proposition (the quality and volume of content). So it is difficult to call this a real data network effect. Waze, on the other hand, has true data network effects because its core product improves as more users report accidents and traffic conditions.
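The four criteria above work as a checklist in which every item must hold. A minimal sketch of that logic follows. The class name, field names and function name are hypothetical, and the yes/no answers for the two counter-examples are illustrative assumptions: only the failing criterion comes from the text, while the remaining answers are set to `True` for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataModel:
    """Hypothetical yes/no answers to the four criteria listed above."""
    proprietary_data: bool      # data is generated by the startup's own users or customers
    feedback_loop: bool         # the data improves the product for the users who produce it
    owns_data: bool             # the startup, not the customer, owns the data
    improves_core_value: bool   # the data strengthens the core value proposition


def has_data_network_effect(model: DataModel) -> bool:
    """All four criteria must hold; failing any single one rules the effect out."""
    return all((
        model.proprietary_data,
        model.feedback_loop,
        model.owns_data,
        model.improves_core_value,
    ))


# Waze is the article's example of a model that meets every criterion.
waze = DataModel(True, True, True, True)
# Assumption: only the failing criterion is taken from the text; the rest are set to True.
market_intelligence = DataModel(True, False, True, True)  # no loop back to data producers
analytics_tool = DataModel(True, True, False, True)       # customers own the data

for name, model in [("Waze", waze),
                    ("market intelligence", market_intelligence),
                    ("analytics tool", analytics_tool)]:
    print(name, has_data_network_effect(model))
```

The check only says whether a data network effect can exist at all. It says nothing about how strong that effect is.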
Meeting these conditions merely confirms the existence of data network effects, but not their strength. We can assess the relative defensibility and scalability of data network effects by applying the same basic principles that we used to evaluate marketplaces and networks.
Scalability: Geographic Range of Data
Much like marketplaces or networks, the scalability of data network effects is a function of their “geographic range”, i.e. the geographic restrictions (if any) on the value of collected data. “Data networks” are normally digital (without any physical data collection or delivery), so they often do not face any geographic restrictions. Cross-border data network effects result in very scalable unit economics, as data collected from a customer in one geography can make the product more valuable in another. This is one reason why many VCs love data businesses like Mapbox. Mapbox offers an SDK for maps (both 2D and AR), points of interest and navigation for customers like Facebook and Snap. Their customers’ users are spread across the globe and Mapbox’s SDK collects data every time it is used. Data collection improves their mapping service and makes it more valuable for all customers, irrespective of geographic location.
Data network effects can also be localized (e.g. restricted to a city). For example, Moovit aggregates public transit data in its journey planning app and also has a community of users who update this data on an ongoing basis. In addition, Moovit collects anonymized data from all users to inform congestion and arrival times. While this meets all the criteria for a data network effect, the data reported by Moovit users is only useful to other users in the same city. So when Moovit launched in a new city, it had to start from scratch all over again. This allowed competitors like Citymapper and Transit to gain the first-mover advantage in regions where Moovit had not yet scaled, creating a regionally fragmented landscape.
Some data networks like Waze are also localized, but don’t face this problem to the same extent because of “network bridging”.
Most Waze users drive and report incidents within their local area. As a result, their data makes the product more useful for other Waze users in that area. However, some Waze users regularly drive longer distances. The data reported by these users makes Waze more useful for users in multiple areas. As we saw in the case of Facebook, these users act like “network bridges” and help Waze organically expand across multiple regions. But unlike Facebook, these network bridges are limited by the reach of road networks. As an extreme example, there are no road networks connecting Europe and the United States, or Southeast Asia and Australia. So Waze could only organically expand within a landmass connected by a network of roads (frequented by its users), and it had to start from scratch when it launched in a new, unconnected region. This made Waze more scalable than purely localized data networks like Moovit, but less scalable than cross-border ones like Mapbox.
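One way to picture network bridging is as a graph: regions are nodes, and two regions are linked when some users regularly travel between them. Organic expansion from a launch region can then reach only the regions connected to it. The sketch below is a simplified illustration, not a description of how Waze actually expanded. The function name `organic_reach` and the region labels are hypothetical.

```python
from collections import deque


def organic_reach(bridges, launch_region):
    """Return every region reachable from launch_region through bridging users.

    bridges: iterable of (region_a, region_b) pairs, one for each pair of
    regions that at least some users regularly travel between.
    """
    neighbours = {}
    for a, b in bridges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first search over the region graph.
    reached = {launch_region}
    queue = deque([launch_region])
    while queue:
        region = queue.popleft()
        for nxt in neighbours.get(region, ()):
            if nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


# Hypothetical regions on two landmasses with no road link between them.
bridges = [("A1", "A2"), ("A2", "A3"), ("B1", "B2")]
print(sorted(organic_reach(bridges, "A1")))  # ['A1', 'A2', 'A3']
```

Launching in `A1` never reaches `B1` or `B2`, which mirrors having to start from scratch in an unconnected region. A purely localized network corresponds to a graph with no bridges, and a cross-border one to a graph where every region is linked.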
Defensibility: Rate of Data Decay
As NFX and a16z have explained, data tends to have diminishing marginal utility, i.e. the first million data points are far more valuable than the next 10 million, and so on. While the exact point of diminishing utility varies by use case, the pattern is well established. So as a first mover accumulates more data, each new data point becomes less valuable. This means that a competitor does not need to match the first mover’s scale. All it needs to do is hit a critical mass of “data producers” to create a “good enough” product experience. This pattern is very similar to the challenges faced by identity-agnostic networks like Yik Yak and TikTok. The risk is especially acute for startups that rely on data that decays slowly, i.e. the product is built to aggregate data over time, with longer refresh cycles.
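A rough numerical sketch makes the “good enough” argument concrete. It assumes a saturating quality curve, quality = n / (n + k), which is one convenient shape for diminishing marginal utility and not a measured relationship. The value of `half_saturation`, the 11 million data points and the 90% target are all hypothetical.

```python
def product_quality(data_points: float, half_saturation: float = 250_000) -> float:
    """Assumed saturating curve: quality rises quickly at first, then flattens toward 1."""
    return data_points / (data_points + half_saturation)


def data_needed(target_quality: float, half_saturation: float = 250_000) -> float:
    """Invert the curve: q = n / (n + k)  =>  n = q * k / (1 - q)."""
    return target_quality * half_saturation / (1 - target_quality)


first_million = product_quality(1_000_000)
next_ten_million = product_quality(11_000_000) - first_million
print(f"gain from first 1M points: {first_million:.3f}")
print(f"gain from next 10M points: {next_ten_million:.3f}")

# A challenger aiming for 90% of the quality of an incumbent with 11M data points.
incumbent_quality = product_quality(11_000_000)
challenger_points = data_needed(0.9 * incumbent_quality)
print(f"challenger needs about {challenger_points:,.0f} points")
```

Under these assumptions the challenger needs well under a fifth of the incumbent’s data to get within 10% of its quality. That is the sense in which scale alone does not protect the first mover.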
XANT (previously called InsideSales.com) is a good example here. XANT is an AI-enabled sales engagement product for B2B companies. It collects data on sales interactions between all of their customers and their prospects in order to train its algorithm and expand its contact database. This algorithm then provides recommendations and intelligence to sales teams to improve performance. Algorithms are trained by recognizing patterns in data over time. Likewise, buyer contacts change, but only over time. So even though it is constantly fed a stream of new data, historical data in aggregate has a larger impact on recommendations. In other words, its data decays gradually. As a result, the rate of improvement in its algorithm and contact database should decline over time leading to increasing competition and commoditization.
XANT’s recent trajectory validates the impact of gradual data decay and should be a cautionary tale for similar startups. XANT became a unicorn in 2015 but has struggled to live up to its valuation since then. During this time, it has faced competition from numerous AI-powered sales enablement tools, including Chorus.ai, People.ai, Gong.io, SalesLoft, InsightSquared and Datahug. In general, this commoditization risk is consistent across artificial intelligence (AI) startups because their data decays gradually.
Based on this, data that decays quickly should result in the most defensible data network effects. The most extreme form of this is real-time data which decays almost as soon as it is collected. If historical data has no value and the product is purely driven by a real-time stream of data, diminishing marginal utility no longer matters. At this point, the scale of the user base becomes very important as it directly influences the volume of data (and quality of the product) at any given point in time. Waze and Moovit are the most obvious examples of this model.
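The effect of the decay rate can be sketched with a simple stock-and-flow model: each period a share of the existing data stays useful (`retention`) and new data arrives (`inflow_per_period`). This is an illustrative assumption and not a model taken from NFX or a16z. The threshold, inflow and retention values are hypothetical.

```python
def usable_data(inflow_per_period: float, retention: float, periods: int) -> float:
    """Usable data after `periods`, when a `retention` share of the stock survives each period."""
    stock = 0.0
    for _ in range(periods):
        stock = stock * retention + inflow_per_period
    return stock


GOOD_ENOUGH = 1_000       # hypothetical amount of usable data for a "good enough" product
challenger_inflow = 100   # hypothetical data contributed per period by a small user base

slow_decay = usable_data(challenger_inflow, retention=0.99, periods=200)
real_time = usable_data(challenger_inflow, retention=0.0, periods=200)

print(f"slow decay: {slow_decay:,.0f} usable points, good enough: {slow_decay >= GOOD_ENOUGH}")
print(f"real time:  {real_time:,.0f} usable points, good enough: {real_time >= GOOD_ENOUGH}")
```

With slowly decaying data, a small challenger can accumulate its way past the threshold given enough time. With real-time data (`retention=0.0`), usable data never exceeds one period of inflow, so the only way to reach the threshold is a larger active user base.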
Moovit tracks anonymized, real-time location data from users as they go about their daily commutes. This allows Moovit to provide data about congestion, arrival times, and real-time transit updates to commuters. Crucially, Moovit needs a constant stream of real-time data because the value of live location data decays shortly after it is reported, i.e. it is not just their data collection that is real-time; their use case is real-time too. Because of this, the scale of their user base directly affects the value of the product for users. In other words, the bar for reaching a critical mass of “data producers” is significantly higher than it is for AI startups. This creates stronger defensibility and lowers the risk of commoditization significantly.
Truecaller, a caller ID and spam blocker, is another example of a product built on data that decays relatively quickly (but not instantly). Its product automatically blocks spam calls and messages (a key problem in countries like India) via a spam list updated by its community of users. Truecaller claims to update its spam list every day to keep up with spammers changing their phone numbers. Even if this is exaggerated, Truecaller’s spam list data should decay at a faster rate than other contact databases. This makes it less defensible than Waze or Moovit, but more defensible than companies like XANT.
Data Matrix: The Defensibility-Scalability Trade-off
Now that we understand the drivers of defensibility and scalability for data network effects, we can plot them against each other on a “data matrix”. This shows a unique dynamic that we did not see on the network or marketplace matrix and gets to the crux of why data network effects are problematic.
Companies with data network effects seem to fall in one of two buckets:
- Defensible, but somewhat localized
- Scalable, but vulnerable to commoditization
The reason for this is that the most valuable use cases for crowdsourced real-time data are almost always local. Empirically, every single example I have come across so far faces a trade-off between defensibility and scalability. This trade-off can be managed, but only to an extent. For example, Waze and Nexar have leveraged limited network bridging to manage the impact of localized data network effects. Mapbox has attempted to manage commoditization risk by adding localized, real-time data (user location and traffic information) to their corpus of slow-decaying data (2D and AR maps). This adds to their defensibility, but its impact is restricted to a small handful of customers who value that data (navigation or location-based gaming).
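The two buckets can be expressed as a small lookup over the two axes of the matrix. The sketch below is only an illustration: the enum names and the `bucket` function are hypothetical, and the placements use only what the text says about Moovit, Waze and Mapbox.

```python
from enum import Enum


class Decay(Enum):
    SLOW = "slow"
    FAST = "fast"


class GeoRange(Enum):
    LOCAL = "local"
    BRIDGED = "local with network bridging"
    CROSS_BORDER = "cross-border"


def bucket(decay: Decay, geo: GeoRange) -> str:
    """Map a position on the data matrix to one of the two buckets described in the text."""
    if decay is Decay.FAST and geo is not GeoRange.CROSS_BORDER:
        return "defensible, but somewhat localized"
    if decay is Decay.SLOW and geo is GeoRange.CROSS_BORDER:
        return "scalable, but vulnerable to commoditization"
    return "outside the two buckets described in the text"


# Placements as the article characterizes each company.
companies = {
    "Moovit": (Decay.FAST, GeoRange.LOCAL),
    "Waze": (Decay.FAST, GeoRange.BRIDGED),
    "Mapbox": (Decay.SLOW, GeoRange.CROSS_BORDER),
}

for name, (decay, geo) in companies.items():
    print(f"{name}: {bucket(decay, geo)}")
```

The fast-decay, cross-border cell falls through to the last branch, which matches the observation that no example so far combines both.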
Most entrepreneurs building products with data network effects face a spectrum of choices between two extremes. One path is building a defensible product with the knowledge that there will be some natural limitations to your market potential. The other extreme is building a globally scalable product and, crucially, targeting an early exit before your product is commoditized. The path you choose will have a direct impact on your growth path and exit options.