Most startups have a relatively uniform strategy for developing a machine learning product: build a model based on a proprietary dataset that can enable a new product or service for a customer. One of the ways that startups have achieved the “proprietary” piece is by building walled gardens around their data assets. Walled gardens are closed data silos in which data is collected and used, but never willfully shared. This highly centralized approach to data collection and training has many positive attributes: the data resolution is very high, and architects have strong control over the model’s tuning. That said, there are drawbacks to this form of data collection and storage.
A startup deploying a centralized platform experiences slower growth because the platform must be built from scratch and there is usually some friction in migrating a customer to a new platform. A centralized data silo is also risky from a privacy and security perspective: it is a single point of failure, since all the data exists in one place. Data silos also make algorithms prone to over-indexing on features specific to that silo (overfitting), meaning that results can be great on that silo but not reproducible across more diverse datasets. Finally, data silos tend to inhibit collaboration: any third party with access to your database is a potential security threat, so the value of collaboration can seem not worth the risk.
There are new tools that enable some of the benefits of machine learning (ML) without the risks inherent in centralization. These tools will help increase ML adoption in previously infeasible applications.
Building a data silo is slow and costly
Startups that take the walled garden approach typically start by migrating a customer to a new, closed platform and collecting data to improve their product offering. There is a key challenge in this strategy: without the data, how can your service perform well enough to convince someone to migrate? Asking a user to migrate a non-trivial workflow to an unproven platform, with the promise of eventual value, is a very difficult thing to do. This is a dilemma that Owkin’s model sidesteps.
Owkin is a company building AI models for medical research. They use federated and transfer learning to accelerate drug discovery and development. The Owkin platform uses machine learning-based modeling to analyze molecular and medical imaging libraries as well as clinical patient datasets to uncover complex biomarker patterns associated with disease. Their aim is to improve drug discovery and development by enabling collective intelligence, thereby augmenting pharmaceutical processes and doctors’ capabilities.
Federated learning is an approach to machine learning where the goal is to train a high-quality central model with training data distributed over a large number of customers, without ever exposing the raw data to the network. The first step in federated learning is building a central model, a copy of which is shared with each individual node on the network (a node might be a particular hospital database or an individual mobile phone). The data generated on that specific node (e.g. medical information or mobile phone usage) improves the accuracy of the local copy over time, based on feedback from the user (manually correcting, editing or labeling something). Periodically, these unique “learnings” from each node, typically in the form of updated model parameters, are shared back to the network as a whole, without ever exposing any of the data that caused the machine to learn in the way that it did. These learnings are aggregated, a new central model is built and shared with the nodes, and the process repeats. See Otium’s Gabriel de Vinzelles’s post on the topic for more detail.
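The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Owkin's implementation: it assumes a toy linear model, simulated node data, and a size-weighted average of the node parameters as the aggregation step (in the spirit of federated averaging). All function names and hyperparameters here are hypothetical.

```python
import random

def local_update(weights, data, lr=0.1, epochs=50):
    """Train a copy of the central model on one node's private data.

    The model is a toy linear fit y = w * x + b, trained by gradient
    descent on mean squared error. Only the updated (w, b) leaves the node.
    """
    w, b = weights
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(data)
        b -= lr * grad_b / len(data)
    return (w, b)

def federated_average(updates, sizes):
    """Aggregate node parameters, weighting each node by its dataset size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def make_node_data(n, seed):
    """Simulate one node's private dataset (assumed: y = 2x + 1 plus noise)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)))
    return data

# Three hypothetical nodes (e.g. hospitals) with different amounts of data.
nodes = [make_node_data(n, seed) for seed, n in enumerate([20, 50, 30])]

central = (0.0, 0.0)  # initial central model
for _ in range(20):   # communication rounds
    # Each node trains locally; raw data never leaves the node.
    updates = [local_update(central, data) for data in nodes]
    # Only the parameters are sent back and aggregated.
    central = federated_average(updates, [len(d) for d in nodes])

print("central model (w, b):", central)
```

In this sketch the only values that cross the network are the `(w, b)` pairs. Real deployments typically add further safeguards on top of this, since shared parameters can themselves leak information.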
The federated learning approach has several technical challenges, which are beyond the scope of this article. A good summary exists in the second half of this blog post.
Why this matters
Centralized data has some obvious disadvantages. Facebook, Equifax, Orbitz, SunTrust, Saks Fifth Avenue, and Under Armour have all been hacked in 2018 alone. Within healthcare, LabCorp, Lifebridge Health, UnityPoint, Banner Health, and Carefirst have all had breaches this year; Allscripts was attacked by ransomware; and a report showcased massive security flaws in Army and Navy electronic health records (EHRs). Systems and companies are already struggling at the institutional level with the new reality of cyber-attacks, and most are very wary of anything that might increase this risk.
Instead of accessing personal data, Owkin allows insights to be shared (to make the whole system more intelligent) without ever exposing any of the raw data. Using a strategy like this can dramatically reduce the types of risks listed above.
Decentralized ML is particularly relevant for the healthcare industry
It’s helpful here to take a step back and consider how complicated the current data landscape is in the healthcare industry; data security and sharing are particularly difficult. A dataset focused on advanced personalized medicine is composed of an overwhelming set of variables. One patient will generate clinical data in a myriad of places: clinical notes in an EHR, labs, imaging, billing for those labs, and claims to the insurer, which pass through a clearinghouse. If a patient is suffering from a disease that leads her to seek out clinical trials, she then interacts with the entire pharmaceutical and hospital clinical trial management system as well. At every point along the way, there is a data gatekeeper (in fact, there is usually more than one). Each of these data points will have someone responsible for consent, for storage, and for access. Using a decentralized structure allows for collaboration in a controlled and secure environment.
Hospitals face seven-figure fines if they are deemed to have violated the Health Insurance Portability and Accountability Act (HIPAA). Advocate Health has paid $5.55 million in penalties, Memorial Healthcare has paid $5.5 million, and New York-Presbyterian has paid $4.8 million due to security breaches that led to patients’ private information being shared. New General Data Protection Regulation (GDPR) fines will potentially be even higher. While HIPAA stipulates a $1.5 million per year max fine, the GDPR maximum is whichever is higher of €20 million (roughly $24 million) or four percent of a violator’s annual global revenue. Institutions need to keep their data secure to avoid these fines in the first place, so there is no added burden on their end if they can collaborate via a federated mechanism.
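To make the GDPR ceiling concrete, here is a small worked example using the figures quoted above: the cap is the larger of a fixed amount or four percent of annual global revenue. The function name and the sample revenues are illustrative assumptions, and the dollar floor is the article's approximate figure rather than a legal constant.

```python
def gdpr_max_fine_usd(annual_global_revenue_usd, floor_usd=24_000_000):
    """Return the higher of the fixed floor or 4% of annual global revenue.

    Integer arithmetic (whole dollars) avoids floating-point rounding.
    """
    four_percent = annual_global_revenue_usd * 4 // 100
    return max(floor_usd, four_percent)

# A hypothetical institution with $100 million in revenue: 4% is $4 million,
# so the fixed floor of $24 million is the binding cap.
small = gdpr_max_fine_usd(100_000_000)

# A hypothetical institution with $1 billion in revenue: 4% is $40 million,
# which exceeds the floor.
large = gdpr_max_fine_usd(1_000_000_000)

print(small, large)
```

Under these assumptions the revenue-based cap only takes over once annual revenue exceeds $600 million, which is why the fixed amount is the relevant ceiling for most individual hospitals.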
Collaboration is not important just for collaboration’s sake. Particularly in the healthcare industry, the ability to train a central model over a network of datasets, for instance medical images, will often result in a more precise model, and bring considerable value back to all of the participants in the network. A model trained to detect malignant tumors in medical images is quite likely to be more accurate when trained across a large number of hospitals’ medical records, rather than just one hospital’s siloed dataset.
What Owkin’s platform allows
If two researchers from two different institutions that are working with Owkin want to collaborate, instead of needing to go through the tedious legal and contractual negotiations to put the data in one accessible place, they can do that through Owkin’s platform. They can then use the platform to build models on the siloed data. Owkin is also working on machine learning traceability and interpretability, so those researchers are able to identify which data points were key to the prediction and to the accuracy of the model. This aspect is essential when making discoveries and publishing accompanying papers.
If you can surmount all the hurdles, there are advantages to centralized data. Access (for the data owner) is easier, and there are fewer technical challenges of network maintenance across multiple institutions. These features come at the expense of willingness to collaborate, speed of onboarding new partners, and security. While each application varies in terms of how much centralization is “right”, recent regulations are disincentivizing insecure centralization. It is our belief that these tools for getting many of the benefits of machine learning without the risks of a centralized system will greatly expand the types of applications that ML techniques can impact.
Owkin is a company building AI models for medical research. They utilize federated and transfer learning techniques to accelerate drug discovery and development, with offices in Paris, Nantes, London and New York City. Their proprietary platform, Socrates, uses machine learning-based modeling to analyze molecular and imaging libraries as well as patient datasets to uncover complex biomarker patterns that explain the disease or the response to treatment.
Owkin aims to improve patient treatment, drug discovery and development using collective intelligence built on real-world patient data. They transform hospitals’ data into accurate and interpretable predictive models, augmenting physicians’ research capabilities. This real-world collective intelligence and access to data is transferable to the pharmaceutical industry and can empower predictive analytics, biomarker discovery and post-market analysis. Owkin is pioneering federated and transfer learning technologies in healthcare to overcome the data sharing problem, building collective intelligence from distributed data at scale while preserving data privacy and security.
They partner with top-tier academic medical institutions in Europe and in the US, leading pharmaceutical companies, and large cosmetics and dermatology brands. The team participated in a Google Developers Launchpad Scale event in New York City in July 2017 and in the first class of Google Developers Launchpad Studio, focused on the applications of ML in healthcare and biotech.