Confidential Computing approach for secure-by-design data collaboration in genomics
Genetic Testing Market: different business models and siloed datasets
Of at least 250 genetic testing companies (GTCs) worldwide, only a few have databases that exceed one million samples. The business model of the top direct-to-consumer (DTC) GTCs differs significantly from the one adopted by the majority in this industry.
The key factor is that leading GTCs occupy the best-selling, revenue-generating niche of Ancestry and DNA Matching services, which allows them to accumulate extensive genomics databases. As stated in our first article, our market research of 165 genetic companies showed that only about 15% of them provided both Ancestry and DNA Matching products in 2021.
Large genomics databases provide greater value to GTCs’ customers because GTCs can offer more accurate personalised reports and find more potential relatives. Further, GTCs leverage these comprehensive databases to enter clinical trial markets and participate in drug discovery, thereby diversifying and increasing their revenue streams.
Such strategic capabilities are rarely available to small and medium-sized GTCs because their databases are too small. From both scientific and business perspectives, data collaboration could close this gap, since the value of a genomics dataset grows non-linearly with its size. However, GTCs’ databases are locked into silos due to the highly sensitive nature of genetic information.
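One rough way to see the non-linear effect is to assume, purely for illustration, that the value of a DNA-matching database tracks the number of customer pairs that can be compared. Under that assumption, pooling two databases adds every cross-database pair on top of the pairs each company already had. The function name and the database sizes below are hypothetical.

```python
from math import comb


def comparable_pairs(*database_sizes: int) -> int:
    """Number of distinct customer pairs in the pooled databases."""
    return comb(sum(database_sizes), 2)


# Hypothetical database sizes for two small GTCs.
size_a, size_b = 100_000, 50_000

separate = comparable_pairs(size_a) + comparable_pairs(size_b)
pooled = comparable_pairs(size_a, size_b)

# The gain from pooling is exactly the number of cross-database pairs.
assert pooled - separate == size_a * size_b
print(f"separate: {separate:,}  pooled: {pooled:,}  gain: {pooled - separate:,}")
```

In this toy model the gain equals the product of the two database sizes, so it grows faster than either database does on its own.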
Protection of data-in-use as a bottleneck for data collaboration
Overcoming barriers to data collaboration has been a goal of the largest IT companies and public institutions for decades. Collaboration can be constrained by differences in storage standards, country-specific regulations, and stakeholders’ interests. However, as mentioned above, data privacy is the biggest concern.
To ensure privacy in a multi-party scenario, data has to be protected in three different states: at-rest, in-transit, and in-use. While there are several reliable approaches for protecting data-at-rest and data-in-transit, such as encrypted databases for storing information and the TLS protocol for secure data transfers, data-in-use has remained the most vulnerable state.
The issue is that data analysis requires decryption, at which point the data becomes exposed to disclosure, leakage, and modification.
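The gap can be illustrated with a minimal sketch. It uses a one-time pad built from Python's standard library as a stand-in for whatever encryption protects the data at rest or in transit; the record format and the function names are hypothetical. The point is only that the analysis step needs the plaintext in ordinary memory.

```python
import secrets


def encrypt(plaintext: bytes) -> tuple[bytes, bytes]:
    """One-time pad: XOR with a fresh random key of equal length."""
    key = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return ciphertext, key


def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    """XOR with the same key restores the plaintext."""
    return bytes(c ^ k for c, k in zip(ciphertext, key))


def count_variant(genotypes: bytes, allele: bytes) -> int:
    """A toy 'analysis': count occurrences of an allele in the plaintext."""
    return genotypes.count(allele)


record = b"AGTTAGCAAGT"            # hypothetical genotype string
stored, key = encrypt(record)      # protected at rest / in transit

# To analyse the record it must first be decrypted; from this point on
# the plaintext sits in regular memory, i.e. the data is "in use".
in_use = decrypt(stored, key)
print(count_variant(in_use, b"AG"))
```

Whoever controls the machine that runs the last two lines can read `in_use`, which is the exposure that protection of data-in-use has to address.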
Data protection by law and data protection by design
Data protection techniques fall under two broad and complementary concepts: protection by law and protection by design. Protection by law is built on the principles of responsible data sharing and refers to the establishment of protocols that stakeholders participating in a collaboration have to sign and follow. This approach can be time-consuming and stakeholder-selective, and it cannot guarantee security in all instances.
Data protection by design implies a built-in technology solution for every stage of data management, one that is designed to withstand both external and internal threats through its security controls. The problem is that, despite several generations of secure multi-party computation technologies, none has been efficient and fast enough to be adopted on a large scale. The existing privacy-enhancing solutions lacked comprehensive hardware-backed technologies that could prove the security-by-design concept.
Confidential Computing is a novel hardware-based technology that enables secure-by-design data collaboration
The initiative to advance the protection of data-in-use has been led by Intel, AMD, Microsoft, HP, and IBM since 2003. In 2019, these efforts were formalised as the Confidential Computing Consortium under the Linux Foundation, and the global IT leaders introduced a game-changing technology: Confidential Computing.
Confidential Computing is the protection of data-in-use by performing computations in a hardware-based Trusted Execution Environment (TEE). A TEE is a secure area of the main processor that essentially operates as a black box, the so-called secure ‘enclave’. Data and code can be transferred to the enclave, where computations are executed in hardware isolation.
Once inside the black box, the code can no longer be modified, and all computations occur only inside the black box. The results of code execution can take the form of analytical insights or trained models. After the code is executed, the enclave is destroyed, along with all the code and data it held. Hence, a TEE ensures data integrity, data confidentiality, and code integrity, and protects against internal and external threats, including those originating from cloud providers.
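The lifecycle described above can be mimicked in plain Python to make the steps concrete. This is only a simulation under stated assumptions: `ToyEnclave` and its methods are hypothetical names, nothing here provides hardware isolation, and the SHA-256 "measurement" merely stands in for the code identity that a real TEE would attest to.

```python
import hashlib
from typing import Callable, List, Optional


class ToyEnclave:
    """Software stand-in for a TEE; it offers no real isolation."""

    def __init__(self, source: str, entry_point: str) -> None:
        # Measure the code at load time: any change to the code
        # would produce a different measurement.
        self.measurement = hashlib.sha256(source.encode()).hexdigest()
        namespace: dict = {}
        exec(source, namespace)
        self._code: Optional[Callable] = namespace[entry_point]
        self._data: List[list] = []

    def load_data(self, rows: list) -> None:
        if self._code is None:
            raise RuntimeError("enclave has been destroyed")
        self._data.append(list(rows))

    def run(self):
        if self._code is None:
            raise RuntimeError("enclave has been destroyed")
        try:
            # Only the computed result leaves the enclave.
            return self._code(self._data)
        finally:
            # After execution, code and data are discarded.
            self._code = None
            self._data = []


ANALYSIS = """
def mean_of_all(datasets):
    values = [v for rows in datasets for v in rows]
    return sum(values) / len(values)
"""

enclave = ToyEnclave(ANALYSIS, "mean_of_all")
# A data owner would compare the measurement with the hash of the code it approved.
approved = hashlib.sha256(ANALYSIS.encode()).hexdigest()
assert enclave.measurement == approved

enclave.load_data([1.0, 2.0, 3.0])   # first data owner
enclave.load_data([5.0])             # second data owner
result = enclave.run()
print(result)
```

After `run()` returns, any further call to `load_data()` or `run()` raises an error, mirroring the idea that the enclave no longer exists once the computation is finished.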
The disruptive advantage of Confidential Computing is the TEE’s capability to combine data from several sources without disclosing any of it to the other data owners. This is a new paradigm that enables secure-by-design multi-party analytics. Confidential Computing opens up data collaboration scenarios that were previously impossible, with extraordinary opportunities for data owners and software developers.
As an early adopter of Confidential Computing, GenX developed a solution that implements new approaches for secure data collaboration in genomics
The solution is designed to connect GTCs, bioinformatics software developers, and research centres into a peer-to-peer ecosystem for secure-by-design multi-party analytics of genomic and clinical data.
The first application of the GenX platform is built to improve DNA relatives matching, which was previously available only on extensive, locally controlled databases. The platform connects GTCs with one another, allowing them to add value to their DNA relatives matching services by leveraging each other’s databases to find potential relative matches in a secure and privacy-preserving way. Confidential Computing and the network architecture ensure that personal genetic data stays undisclosed to other GTCs, GenX, and any third parties.
The concept therefore offers a new way for previously competing companies to collaborate. In the next article, we will provide more details about DNA relatives matching in a multi-party mode.
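In the meantime, a toy sketch can convey the intended data flow. It is not the GenX implementation: the data format, the function name, and the threshold are all hypothetical, and counting shared markers is a deliberately crude stand-in for real relatives matching, which relies on detecting shared DNA segments. The function is imagined to run inside an enclave, so the raw marker sets are read there but never returned.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Database = Dict[str, Set[str]]   # customer id -> marker set (hypothetical format)
Match = Tuple[str, str, int]     # (own customer, other customer, shared markers)


def match_inside_enclave(databases: Dict[str, Database],
                         min_shared: int) -> Dict[str, List[Match]]:
    """Compare customers across companies and return per-company match lists.

    Only identifiers and a similarity score appear in the output;
    the marker sets themselves stay inside the function.
    """
    results: Dict[str, List[Match]] = {name: [] for name in databases}
    for gtc_a, gtc_b in combinations(sorted(databases), 2):
        for id_a, markers_a in databases[gtc_a].items():
            for id_b, markers_b in databases[gtc_b].items():
                shared = len(markers_a & markers_b)
                if shared >= min_shared:
                    results[gtc_a].append((id_a, id_b, shared))
                    results[gtc_b].append((id_b, id_a, shared))
    return results


# Hypothetical inputs from three companies.
databases = {
    "gtc_a": {"a1": {"m1", "m2", "m3", "m4"}, "a2": {"m9"}},
    "gtc_b": {"b1": {"m2", "m3", "m4", "m7"}},
    "gtc_c": {"c1": {"m8"}},
}

matches = match_inside_enclave(databases, min_shared=3)
print(matches)
```

Each company receives only the list stored under its own name, so a company with no cross-database matches learns nothing about the other participants' customers. Which identifier of the matched customer to reveal is a separate design decision that this sketch does not address.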
Other use-case scenarios of the GenX platform will include a federated learning engine for training models on genetic data, predicting phenotypes, and identifying disease-gene associations; patient recruitment for clinical trials; and a marketplace for third-party bioinformatics software.