PySEAL: Homomorphic encryption in a user-friendly Python package
Written by Shashwat Kishore, Lab41
The recent growth of machine learning applications has been driven by innovation in algorithm design, low-cost storage for large training datasets, and powerful GPU-driven compute. However, many useful training datasets can never be shared. Consider a biomedical application where genomic data from a large cohort of patients needs to be compared to identify previously unknown genetic markers of disease. Of course, we need to safeguard patient data privacy and security and, therefore, cannot openly share genomic data within and between healthcare and research organizations. To date, there have been many successes in identifying genetic markers of some inherited diseases, mostly diseases that result from a single gene and can be identified with limited amounts of data. However, more complicated diseases will require advanced analytic approaches, such as machine learning, and likely a substantially greater amount of data. Gaining access to large sets of patient genomic data for a particular disease is challenging because of the legal agreements that must be in place to obtain that data. Is there another way to address this challenge?
One solution to this problem is to use fully homomorphic encryption. Fully homomorphic encryption refers to an encryption scheme in which computations and analyses (even complex, non-linear ones) can be performed directly on encrypted data. What this means is that if we homomorphically encrypt the DNA sequences of patients, we can then query homomorphically encrypted databases for genetic comparisons. We can then decrypt the final result and get the same answer as we would have gotten using unencrypted DNA sequences.
This approach is especially attractive because it allows a healthcare organization or research group to compare DNA sequences without ever exposing patients’ DNA sequences to unauthorized parties. It also allows genomic database holders to ensure the privacy of the DNA data they hold and potentially protect their intellectual property if the dataset is owned by a commercial company. It’s a neat trick, and one that has gained more attention in recent years.
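The core idea of computing on ciphertexts can be seen in a toy setting. The sketch below uses unpadded "textbook" RSA with tiny primes, which happens to be multiplicatively homomorphic. It is not fully homomorphic, not secure, and not the scheme used by SEAL or PySEAL; the primes, exponent, and function names are assumptions chosen only for illustration.

```python
# Toy illustration only: unpadded "textbook" RSA with tiny primes.
# Insecure, and only multiplicatively homomorphic (not fully homomorphic).
# All values and names below are assumptions made for this sketch.

p, q = 61, 53                  # small demo primes
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent (requires Python 3.8+)


def encrypt(m):
    """Encrypt an integer 0 <= m < n with the public key (e, n)."""
    return pow(m, e, n)


def decrypt(c):
    """Decrypt a ciphertext with the private key (d, n)."""
    return pow(c, d, n)


a, b = 7, 12

# The multiplication is carried out on ciphertexts only;
# whoever performs it never sees a or b in the clear.
c_product = (encrypt(a) * encrypt(b)) % n

# Decrypting the result gives the product of the original plaintexts,
# provided a * b is smaller than n.
result = decrypt(c_product)
print(result == a * b)
```

A fully homomorphic scheme extends this idea so that both addition and multiplication can be carried out on ciphertexts, which is what makes arbitrary computations possible.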
Unfortunately, the main difficulty in using homomorphic encryption schemes is speed: most of them are considerably slower than regular encryption methods. A key area of ongoing research in cryptography focuses on bridging this speed gap, and recent developments pioneered by Microsoft Research’s (MSR) cryptography group have made significant improvements. In December 2017, MSR released version 2.3 of its Simple Encrypted Arithmetic Library (SEAL), a fast C++ implementation of the homomorphic encryption system described by Fan and Vercauteren in their paper “Somewhat Practical Fully Homomorphic Encryption”. The encryption system proceeds in two separate stages. First, numerical data are converted into polynomials and embedded in a specified polynomial ring. Then, the ciphertexts (encryptions) of the polynomials are computed by applying noisy linear transformations involving the public key to each polynomial. Arithmetic operations are then performed on the encrypted data, and as long as the accumulated noise in our computations does not exceed a threshold amount, we can decrypt the computational output with a linear transformation involving the private key and obtain the correct result (see sections 4 and 5 of Fan and Vercauteren’s paper for details). The MSR implementation has a number of nice features in addition to the basic encryption apparatus, such as recommendation methods that provide optimal parameters for the initial encryption setup, and a noise budget that reflects the noise incurred when performing a given computational procedure.
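The role of noise can be illustrated with a much simpler construction than the polynomial-based system described above. The following is a minimal sketch of a toy symmetric scheme over plain integers: each ciphertext hides one bit behind a small random noise term and a large multiple of a secret key. It is not the Fan and Vercauteren scheme and does not use SEAL's or PySEAL's API; every name and parameter size here is an assumption picked to keep the example short, and the scheme as written is insecure.

```python
import random

# Toy symmetric scheme over the integers, for illustration only.
# NOT the FV scheme and NOT SEAL's API. Names and parameter sizes are
# assumptions chosen to keep the example small (and insecure).

SECRET_KEY = 2**61 - 1  # odd secret integer


def encrypt(bit, rng):
    """Hide a bit as bit + 2*r + key*q, with small r and large q."""
    r = rng.randrange(1, 2**8)         # small random noise term
    q = rng.randrange(2**20, 2**21)    # large random multiple of the key
    return bit + 2 * r + SECRET_KEY * q


def noise(c):
    """Noise carried by a ciphertext (computing it requires the key)."""
    return c % SECRET_KEY


def decrypt(c):
    """Strip the key multiple, then read the bit as the noise parity.

    Correct only while the accumulated noise stays below the key.
    """
    return noise(c) % 2


def noise_budget_bits(c):
    """Rough count of bits the noise can still grow before overflow."""
    return SECRET_KEY.bit_length() - noise(c).bit_length()


rng = random.Random(0)
c0 = encrypt(0, rng)
c1 = encrypt(1, rng)

c_xor = c0 + c1                  # adding ciphertexts XORs the bits
c_and = c1 * encrypt(1, rng)     # multiplying ciphertexts ANDs the bits

print(decrypt(c_xor), decrypt(c_and))
print(noise_budget_bits(c1), noise_budget_bits(c_and))
```

In this sketch, addition makes the noise grow slowly while multiplication roughly doubles its bit length, so the remaining budget shrinks fastest under repeated multiplication. Once the noise reaches the size of the key, decryption is no longer guaranteed to return the right bit, which is the same kind of limit that a noise budget is meant to expose.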
Lab41 and B.Next have been engaged in a collaborative effort to tackle the DNA sequence and database encryption problem described in the introduction, which is a first step for a host of biotechnology applications that require secure information sharing and comparative analyses of genomic data. Our recent work on fully homomorphic approaches includes a publicly available port of MSR’s C++ SEAL library to Python, which can easily be imported and used in Python REPLs and projects. The library, called PySEAL, provides Python access to the key classes and methods of MSR’s C++ implementation, examples of common homomorphic encryption use cases adapted from the original SEAL library, and a Dockerfile that takes care of setting up the right environment and building the required executables.
We hope that PySEAL will be a useful platform for data scientists and others who are more accustomed to using high-level programming languages for analysis and experimentation. Given the prevalence of Python in the data science community as well as in the general science and engineering community, we felt that it would be valuable to open source a convenient homomorphic encryption package, readily usable by scientists and engineers of all backgrounds. The readme and the examples included in the library explain the specifics of how the main classes fit together and show how to use these classes to implement various encryption schemes. These examples can be used as starting points by researchers who are implementing homomorphic encryption schemes, or who are just interested in exploring new possibilities for preserving the privacy of their data. Homomorphic encryption as a field is still in its early stages, and we look forward with great anticipation to the technologies and advances that the general scientific community will develop in this sphere in the coming years.
B.Next is designing a biodefense technology strategy, demonstrating the potential that innovative tools and techniques can provide, and supporting the investment strategies of these innovations.
Check out our work at www.bnext.org