The benefits of bridging data silos have recently gained significant attention from organizations of all sizes. In 2018, Google established the Data Transfer Project, and a few days ago Microsoft launched the Open Data Campaign, aiming to make large interoperable datasets production-ready by 2022.
On the receiving end of the data pipeline are machine learning models. Much like an opinion poll benefits from more responses, seeing more examples helps a machine learning model make more accurate predictions. The diversity of examples in a dataset is also key. Pollsters do not just survey a large sample; they survey a diverse sample of people. In the same way, if all our examples are similar, then our model will be stumped the first time it encounters an outlier. See this post for an illustration of this.
Both these goals, dataset size and diversity, pose challenges to organizations. Obtaining examples can be expensive: if you are a department store and you are learning from the dataset of purchases, you cannot easily get more examples, or else you would already have done so. Dataset diversity is often outright impossible, as customers of different department stores belong to different demographics.
We can address these problems by treating access to data as a commodity. If a company wishes to train a model but lacks the data, it can pay another organization for access to theirs. We feed two birds with one scone. First, the company now has access to a much larger dataset, by training on data from multiple organizations. Second, those examples are more diverse because they come from independent sources. The result is a model that is more accurate and robust to outliers.
At first glance, a marketplace for data should immediately raise eyebrows since datasets often contain personal or commercially sensitive information. The key is not to share the data itself, merely insights from it; not individuals’ store purchases or medical histories, but general trends. This has only become possible very recently, with rapid advances in privacy-preserving computing enabling learning from patterns in data while obscuring individual records.
It’s a win-win: The data consumer gets access to more and better data. The data producer captures its value without revealing secrets.
A secondary concern is engineering overhead. Every dataset is different, with incompatible features and its own idiosyncrasies. It is not feasible to re-tailor a model every time we wish to use a new dataset, as doing so wastes engineering effort. Further, this effort gets multiplied when we wish to train on multiple datasets, which may use different formats for the same information.
Manifold learning is a technique that permits us to automate the task of extracting usable information from disparate datasets. Engineers no longer need to worry about the feature encoding of the data. Instead, records are turned into vectors of numbers, whose meaning is consistent across datasets. Training on a new dataset is faster and cheaper.
In this post we will dive deeper into the main roadblocks holding back cross-organizational data exchange at a global scale and how Ntropy is building a platform that makes it effortless to train a model across multiple datasets and monetise data across organizational barriers, without compromising privacy or confidentiality.
Let’s look at some currently available solutions for accessing additional datasets.
One option is to find publicly available data. Alternatives range from mostly academic datasets, like the UCI machine learning repository, to commercial datasets in places like Kaggle and Google’s dataset search. Such repositories are meant for running models on individual datasets. There is little incentive to contribute datasets to those pools and the amount of data is correspondingly limited.
Another option is commercial solutions, ranging from incumbents like LiveRamp, LexisNexis and Experian, to more recent projects like Snowflake and AdSquare, as well as decentralized projects like Ocean. Unlike publicly available data pools, commercial data exchanges are two-sided, with some incentive for data providers to make their data available. However, the barriers to entry and overheads for both sides of the market are huge.
Data consumers need to allocate recurring resources to get consistent access to any useful data. This includes discovering, testing and integrating new datasets, navigating licenses from different data providers, cleaning data and figuring out feature mappings between datasets, dealing with liabilities associated with transacting raw data, and constantly validating data quality and authenticity.
Data providers need to enforce licenses, set optimal pricing, predict demand, ensure portability and anonymity, and deal with privacy risks and competitive sensitivity around their data.
Inefficiencies in the current state of the art arise largely because datasets are maintained and validated by humans, and the data is hence encoded in a human-readable format. However, a machine learning model only learns from the data distribution as a whole, independent of the encoding of each individual example.
To enable a reliable data layer on top of which machine learning models can be built with no additional overhead, datasets need to be treated as complementary, coupled streams of information, encoded in a “machine-optimal”, rather than human-readable format. As we will see below, this principle unlocks a substantial advantage over existing solutions.
Let’s consider two companies in the same market with respective datasets about their customers, A and B. Although their fundamental offerings are similar, the two companies will have different customer bases. Their respective product teams will measure different features about each customer. Furthermore, some, if not all, of these features will be highly complex and proprietary. Each data science team trains a model in-house to allow their product to predict customer behavior and respond accordingly.
As we already explored above, each of these two models could be significantly more powerful, if in addition to its own data, it also had access to data from the other company. So, how do we combine two datasets without revealing concrete information about each customer or involving any human-dependent “feature-mapping” in the process?
Let’s take a step back and remember one of the most fundamental theorems in software engineering:
“We can solve any problem by introducing an extra level of indirection” — David Wheeler
Although our datasets A and B are encoded using completely different, proprietary features, their distributions will likely have a lot of similarities between them.
A model trained on both datasets at once could therefore apply concepts it learned in one dataset to improve performance on the other, and vice versa. Hence, if the two datasets have similar distributions, a “latent space” must exist, where observations from both can be directly compared. Furthermore, unless one has access to the “encoders” which mapped each dataset to this latent space, it would be very hard to reverse-engineer the mapping and figure out the human-readable representation of each observation. Although the resulting level of privacy is hard to quantify rigorously, it is strong enough for even the most sensitive datasets to be shared publicly. This approach, also called manifold learning, is used for many public datasets, for example this dataset of credit card transactions from Worldline. Although each human-readable observation would be sensitive, as it contains user PII, device information and the transaction itself, once encoded in latent space it can be made publicly available without additional risk.
In practice, there are many ways to merge multiple datasets in latent space, with varying levels of accuracy and scalability. Perhaps the most straightforward approach, viable for a small number of datasets, is to use a standard autoencoder architecture of a neural network. As latent space is a compressed representation of each observation, the encoder is forced to use similarities between inputs to find an optimal representation. As we outlined above, datasets with similar distributions should share much of this latent representation.
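To make the autoencoder idea concrete, here is a minimal sketch in plain Python. It is not the architecture used for the benchmark in this post. It assumes a purely linear autoencoder and one naive way of feeding differently encoded records into a single network: zero-padding each record to a common width. The two datasets are synthetic, and the names `pad`, `train_autoencoder`, `WIDTH` and `LATENT_DIM` are hypothetical, chosen only for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def pad(record, offset, width):
    """Place a record's features at `offset` inside a zero vector of length `width`."""
    out = [0.0] * width
    out[offset:offset + len(record)] = record
    return out

def mse(x, x_hat):
    """Mean squared reconstruction error over all entries."""
    total = sum((p - t) ** 2 for row, row_hat in zip(x, x_hat)
                for t, p in zip(row, row_hat))
    return total / (len(x) * len(x[0]))

def train_autoencoder(x, latent_dim, steps=300, lr=0.3, seed=0):
    """Full-batch gradient descent on a linear autoencoder x -> z -> x_hat."""
    rng = random.Random(seed)
    n, d = len(x), len(x[0])
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(d)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(latent_dim)]
    losses = []
    for _ in range(steps):
        z = matmul(x, w_enc)          # encode: n x latent_dim
        x_hat = matmul(z, w_dec)      # decode: n x d
        losses.append(mse(x, x_hat))
        # Gradient of the mean squared error with respect to x_hat.
        g_out = [[2.0 * (p - t) / (n * d) for t, p in zip(row, row_hat)]
                 for row, row_hat in zip(x, x_hat)]
        g_dec = matmul(transpose(z), g_out)
        g_enc = matmul(transpose(x), matmul(g_out, transpose(w_dec)))
        w_enc = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(w_enc, g_enc)]
        w_dec = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(w_dec, g_dec)]
    return w_enc, w_dec, losses

# Synthetic stand-ins for datasets A and B: both are driven by the same two
# hidden factors, but each company records different (made-up) features.
rng = random.Random(42)
factors = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(60)]
dataset_a = [[f1, f2, f1 + f2] for f1, f2 in factors[:30]]
dataset_b = [[2 * f1, -f2, f1 - f2, 0.5 * f2] for f1, f2 in factors[30:]]

WIDTH, LATENT_DIM = 7, 2  # 3 columns for A, 4 for B; 2 latent dimensions
x = [pad(r, 0, WIDTH) for r in dataset_a] + [pad(r, 3, WIDTH) for r in dataset_b]

w_enc, w_dec, losses = train_autoencoder(x, LATENT_DIM)

# Records from either dataset are now encoded as vectors of the same length.
code_a = matmul([pad(dataset_a[0], 0, WIDTH)], w_enc)[0]
code_b = matmul([pad(dataset_b[0], 3, WIDTH)], w_enc)[0]
print("reconstruction loss: first step", losses[0], "last step", losses[-1])
print("latent code for a record from A:", code_a)
print("latent code for a record from B:", code_b)
```

The sketch only shows the mechanics: one encoder turns records from both datasets into vectors of the same length, and the bottleneck forces it to compress them. A linear model on disjoint, zero-padded feature blocks does not by itself align matching concepts across datasets, so a practical system would need nonlinear encoders and some way of aligning the two distributions.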
Let’s now apply this approach to a real-world example of a classic, immensely valuable commercial problem: fraud detection. For this benchmark we have 4 datasets of credit card transactions, representing 4 different distributions, from 3 different companies (FICO, Worldline and Vesta), with on average 125 variables and 250,000 transactions per dataset. Each transaction is labeled as fraudulent or legitimate. To capture how well our model spots fraudulent transactions, we will use the standard ROC area-under-curve (AUC) metric.
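For readers unfamiliar with the metric, the ROC area under the curve can be read as the probability that a randomly chosen fraudulent transaction gets a higher score from the model than a randomly chosen legitimate one. The short sketch below computes it directly from that definition. The function name `roc_auc` and the labels and scores are made up for illustration and are not taken from the benchmark.

```python
def roc_auc(labels, scores):
    """ROC area under the curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (fraudulent)
    example is scored higher than a randomly chosen negative (legitimate)
    one, with ties counted as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = fraudulent, 0 = legitimate; scores are model outputs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four fraud/legitimate pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one. The pairwise loop is quadratic in the number of examples, so for datasets of the size discussed here a sorting-based implementation from a standard library would be the practical choice.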
Taking one of the two datasets from FICO as a baseline, we can see a steady improvement in model performance with each additional dataset it is trained on, with over 5% better accuracy from only 3 additional datasets. As each query effectively accesses information from all datasets at once, we expect this accuracy to continue increasing with each new dataset we add to the network.
A similar performance improvement has very recently been shown in a medical study, to be presented at the SIIM20 annual meeting. A single machine-learning model was trained to predict breast density from mammogram datasets across four different institutions. Although the distributions were distinctly different, the combined model was significantly better than each of the models trained only on its respective dataset.
We have so far established that many of the problems with utilizing data across organizations lie in enforcing human-readability of each data point. By simply introducing an extra layer of indirection, and moving compute, storage and transfer of data to latent space, a significantly more efficient data network can be built than the status quo of today.
All participants in such a network have human-readable access only to data from their own distribution and latent-space access to the combined data of all distributions. A base level of privacy is hence maintained. Observations can be encoded using any proprietary feature set, without requiring any human input to translate it. Furthermore, as each dataset contributes information to the entire network and each query taps into this combined information pool, the cost of any competitive sensitivity, which is one of the fundamental issues with any data network, is far outweighed by the value of joining the network both as a data consumer and as a data producer.
Over the last few decades, progress in technological innovation has relied on democratizing access to some of its key ingredients: knowledge (open publishing platforms), algorithms (code repositories) and computing (cloud providers). A new component is becoming increasingly critical for both large and small players: data. The vast majority of valuable data today sits in silos, trapped behind barriers of regulation, privacy, schema standards and competitive risk.
Ntropy is building a network that allows data consumers to train machine learning models on multi-organization data with minimal engineering overhead, and data producers to seamlessly monetize their data pool at no additional privacy risk. This has only been made possible in the last few years, through advances in manifold learning algorithms and privacy-preserving computing. We will be launching the network with an initial set of partners in August 2020. Check out https://ntropy.network for more info.
We would like to thank Jakub Nabaglo, Alec Mocatta, Peter Goldsborough, Piotr Rosłaniec, Matko Bošnjak, Jonathan Baker, Lukas Köbis, Daniel Duma, Ahmed Medhat, Dimitrios Athanasakis, Sidhartha Roy and David Buchmann for comments, feedback and discussions.