
GEMstone Series: Secure Interrogation of Genomic Databases (SIG-DB) — Part 2

Stephanie Rogers
BioQuest

--

Paillier Homomorphic Encryption Proof-of-Concept

This article is co-authored with Alexander Titus (@alexandertitus) of B.Next.

In Part 1 of the SIG-DB blog series, we introduced a concept for comparing an encrypted genomic sequence against a privately held genomic database while maintaining the security and privacy of the data. Since then, we have completed the tuning and testing of the SIG-DB algorithm using Paillier Homomorphic Encryption, which is described in this post.

You can read more about the genesis of this project in our previous post.

Let’s revisit our protocol.

Figure 1: SIG-DB protocol for (1) hashing sequence into locality sensitive hash (LSH), (2) encrypting and passing LSH, (3) hashing database elements into LSHs, (4) comparing encrypted query LSH to database LSHs, (5) passing scores and decrypting, and (6) calculating IoU, IoD, and IoQ scores.

The SIG-DB protocol is executed as follows (steps below coincide with steps in Figure 1):

(1) The Querier converts the query sequence to a Locality Sensitive Hash (LSH).

(2) The Query LSH is encrypted and transported to the Database Owner, along with the scoring executable, hash function (same one used by Querier), and instructions.

(3) The Database Owner uses the hash function to create an LSH for each entry of the database. NOTE: the database LSHs do not need to be encrypted, as database records never leave the database firewall.

(4) The Database Owner uses the executable to compare the encrypted Query LSH to the database LSHs. An encrypted intersection score is generated for each database entry. The magnitude (the sum of all ‘1’s) of each database entry’s LSH is also computed; this value is unencrypted in the current implementation.

(5) The encrypted intersection scores and unencrypted computed magnitudes are returned to the Querier.

(6) The Querier decrypts the intersection scores and calculates the similarity scores (IoU, IoQ, and IoD) using the computed magnitudes and the following equations, where I is a decrypted intersection score, |Q| is the magnitude of the query LSH, and |D| is the magnitude of a database entry’s LSH:

IoU = I / (|Q| + |D| - I)

IoQ = I / |Q|

IoD = I / |D|
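
To make the protocol concrete, below is a minimal sketch of steps (2) through (6) using the phe Paillier library (introduced in the next section). It assumes the LSHs are fixed-length bit vectors; the function names and the toy 8-bit hashes are illustrative, not the actual SIG-DB code.

```python
from phe import paillier  # Paillier homomorphic encryption (the phe package)

def encrypt_lsh(lsh_bits, public_key):
    # Step (2), Querier side: encrypt each bit of the query LSH.
    return [public_key.encrypt(bit) for bit in lsh_bits]

def encrypted_intersection(enc_query, db_lsh):
    # Step (4), Database Owner side: homomorphically sum the encrypted
    # query bits at every position where the database LSH has a '1'.
    # The result is an *encrypted* count of shared '1' bits.
    score = enc_query[0].public_key.encrypt(0)  # encrypted zero accumulator
    for enc_bit, db_bit in zip(enc_query, db_lsh):
        if db_bit == 1:
            score = score + enc_bit  # ciphertext + ciphertext addition
    return score, sum(db_lsh)        # the magnitude |D| stays unencrypted

def similarity_scores(I, q_mag, d_mag):
    # Step (6), Querier side: plug the decrypted intersection I and the
    # magnitudes into the equations above.
    return {"IoU": I / (q_mag + d_mag - I),
            "IoQ": I / q_mag,
            "IoD": I / d_mag}

# Toy end-to-end run with 8-bit LSHs (illustrative values only).
public_key, private_key = paillier.generate_paillier_keypair()
query_lsh = [1, 0, 1, 1, 0, 0, 1, 0]
db_lsh = [1, 0, 1, 0, 0, 1, 1, 0]
enc_query = encrypt_lsh(query_lsh, public_key)                # step (2)
enc_score, d_mag = encrypted_intersection(enc_query, db_lsh)  # steps (3)-(4)
I = private_key.decrypt(enc_score)                            # steps (5)-(6)
print(similarity_scores(I, sum(query_lsh), d_mag))
```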

Setting up the algorithm

The SIG-DB algorithm was developed in Python and uses Paillier Homomorphic Encryption as implemented in the phe Python package. To test the effectiveness of querying a database with genomic sequences, we downloaded all E. coli and S. aureus genomes from the NCBI RefSeq database and selected random sequences from the overall data set to use as query sequences. Each query was then searched against the rest of the database, and similarity scores were returned.
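
For intuition, here is one plausible way a sequence could be shingled into a bit-vector LSH. This k-mer scheme is an illustrative stand-in, not necessarily the hash function SIG-DB uses (see the paper for the real one).

```python
import hashlib

def kmer_lsh(sequence, k=16, n_bits=1024):
    # Illustrative bit-vector LSH: set one bit per k-mer. Similar
    # sequences share many k-mers, so their LSHs share many '1' bits,
    # and the intersection of two LSHs tracks sequence similarity.
    lsh = [0] * n_bits
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k].upper()
        digest = hashlib.sha1(kmer.encode()).digest()
        lsh[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return lsh
```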

For a complete and detailed description of the algorithm, read our paper, which was released on arXiv under Quantitative Methods.

So, does this actually work?

We wanted to make sure that the pre-processing protocol and encryption process didn’t compromise the correctness of the returned results. To do that, we measured the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR), also known as a Receiver Operating Characteristic (ROC) curve (an explanation of ROC curves can be found here). To determine ground truth, we used Average Nucleotide Identity (ANI), since it is well recognized as a good measure of sequence similarity.

We had to set an ANI threshold to identify a ‘true positive’ hit, even though in reality there isn’t necessarily such a distinct line. Ultimately, we chose a threshold of 99%, and in a second set of tests a threshold of 95%. Why these thresholds? The literature suggests an ANI of 99% roughly marks isolate-level differentiation in prokaryotes, and 95% species-level differentiation (again, in reality it’s not so black and white, but a threshold must be set to perform the analysis).

[For real-life applications, the user would not necessarily need to set a threshold cut-off. However, they would need some procedures (or, minimally, expectations) going into the analysis that determine what scores would indicate a ‘value of significance’ for their particular application.]
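
To make the evaluation concrete, here is a minimal sketch of that ROC analysis, assuming the max similarity scores and ANI values for each query:DB entry pair are already in hand (the array values below are placeholders, not our results).

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Placeholder inputs: one max similarity score and one ANI value per
# query:DB entry pair (illustrative values only).
max_scores = np.array([0.91, 0.12, 0.88, 0.05, 0.40])
ani_values = np.array([0.995, 0.80, 0.99, 0.70, 0.93])

# The ANI threshold defines ground truth: pairs at or above it are
# 'true positive' matches.
labels = (ani_values >= 0.99).astype(int)

fpr, tpr, _ = roc_curve(labels, max_scores)
print("AUC =", auc(fpr, tpr))
```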

Using 3 different query:DB entry length ratios (1:1, 1:2, and 1:3) and 750,000 query:DB entry comparisons for each length combination (of which 1,500–2,000 are true positives, depending on the combination), we demonstrated that the algorithm performs very well across all scenarios (area under the curve = 0.97 or higher; see Figure 2 below). Using an ANI threshold of 95% (a slightly more ‘permissive’ gold standard), the algorithm was essentially a perfect classifier, assigning similarity scores to query:DB entry pairs in exactly the same order as their ANI score rankings.

Figure 2: Receiver Operating Characteristic (ROC) curves for Paillier-SIG-DB. Query length = 1000 bases; ANI threshold of 0.99 for A-C and 0.95 for D. (A) DB entry length = 1000 bp, with a total of 750,000 query to database entry comparisons, 1,515 of which had ANI scores above the threshold (true positives). (B) DB entry length = 2000 bp, with 750,000 comparisons and 1,549 true positives. (C) DB entry length = 3000 bp, with 750,000 comparisons and 1,708 true positives. (D) DB entry length = 1000 bp, with 750,000 comparisons and 1,877 true positives (representative of all ROC curves generated using an ANI threshold of 0.95).

An important factor in this success is the use of the Max Similarity Score out of all three possible similarity scores (IoD, IoU, and IoQ). The ROC results validated our original decision to include all three scores to account for the query and database entries potentially having different lengths. The classifier performance was poor if, for example, only IoD was used and the DB entry was longer than the query sequence (Figure 3A). One possible explanation is that the denominator is larger than the possible intersection space, and as such, we may be washing out the small differences between the intersections of positive and negative examples. In a dataset of only 0.2% positive samples, this leads to broad misclassification. This phenomenon is exactly why we included IoD, IoU, and IoQ as our scoring metrics. So, when using this algorithm, best practices include calculating all three similarity scores and using the max score for decision making.
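
In code, this best practice is a small wrapper around the three equations above (a sketch reusing the quantities from the protocol sketch).

```python
def max_similarity(I, q_mag, d_mag):
    # Take the max of IoU, IoQ, and IoD so that a length mismatch
    # between the query and a DB entry cannot wash out a true hit.
    iou = I / (q_mag + d_mag - I)
    return max(iou, I / q_mag, I / d_mag)
```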

Figure 3: Receiver Operating Characteristic (ROC) curves for Paillier-SIG-DB based on each individual similarity score: (A) IoD, (B) IoU, and (C) IoQ. Query length = 1000 bases; ANI threshold of 0.99. Top row: DB entry length = 1000 bp, with a total of 750,000 query to database entry comparisons and 1,515 true positives. Bottom row: DB entry length = 3000 bp, with 750,000 comparisons and 1,708 true positives.

Doesn’t homomorphic encryption require a lot of time to compute?

One major concern for using homomorphic encryption techniques for operational purposes is the computational expense. To assess the practicality of using SIG-DB, we performed a series of tests to understand:

· The amount of time required to run the full operation on a single CPU core

· The time savings that could be achieved using a parallelized approach

· The breakout of time requirements by step in the protocol (encryption time, query time, and scoring time)

We continued using the RefSeq database for these tests, but this time used the entire collection of bacterial sequences. Heat maps (Figure 4) were generated from test runs that evaluate 3 different variables: (1) number of cores used for computation (1–48 cores), (2) query length (100–20,000 bases), and (3) number of entries in the database (100, 1000, 5000, and 10000 FASTA files; or 10,145 sequences, 89,992 sequences, 425,772 sequences, and 860,984 sequences, respectively).
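As a sketch of the parallelized approach: the per-entry comparisons are independent, so they can be fanned out over worker processes with Python’s multiprocessing module, reusing encrypted_intersection from the protocol sketch above (the actual SIG-DB test harness may differ).

```python
from functools import partial
from multiprocessing import Pool

def score_entry(db_lsh, enc_query):
    # One independent unit of work: compare the encrypted query LSH to
    # a single database LSH (encrypted_intersection is defined in the
    # protocol sketch above).
    return encrypted_intersection(enc_query, db_lsh)

def score_database(enc_query, db_lshs, n_cores=16):
    # Fan the per-entry comparisons out across n_cores worker processes.
    with Pool(n_cores) as pool:
        return pool.map(partial(score_entry, enc_query=enc_query), db_lshs)
```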

[Of note: SIG-DB is written in Python, which is not optimized for computational performance. Execution time would be expected to improve if SIG-DB were re-developed and optimized in C++ or an equivalent language.]

Figure 4: Heat Map showing time of execution for SIG-DB, comparing a single query of ‘qlen’ length to a database containing 1000 entries (82,992 sequences). The trends in this heat map are representative of all scenarios tested.

From the results, we pulled out a number of observations:

Observation #1: Parallelization helps performance but has diminishing returns for more than 16 CPU cores. For example, with a 20,000 bp sequence length and a database of 82,992 sequences, the total execution time goes from 51 minutes with 1 core to 21 minutes with 16 cores — a reduction of more than 50%. However, the execution time is nearly the same for the test runs using 16 or more cores. The dwindling improvement beyond 16 cores likely indicates that 16 cores are sufficient infrastructure to carry out the query efficiently, and that the cost of splitting the query into smaller operations (across more cores) is greater than the performance gained — often referred to as over-parallelization. This also implies that as the dataset increases in size, the value of more CPUs could increase beyond 16 cores; however, that increasing value is not linear.
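
One simple way to reason about this plateau is Amdahl’s law, which we apply here purely for illustration (the paper does not fit such a model): if a fixed fraction of the work is serial, adding cores eventually stops paying off.

```python
def amdahl_speedup(p, n_cores):
    # Amdahl's law: the serial fraction (1 - p) caps the achievable
    # speedup no matter how many cores are added.
    return 1.0 / ((1.0 - p) + p / n_cores)

# A parallel fraction of p ~= 0.63 roughly reproduces the 51 -> 21
# minute observation above and predicts little gain beyond 16 cores.
for cores in (1, 16, 48):
    print(cores, "cores: ~%.0f minutes" % (51 / amdahl_speedup(0.63, cores)))
```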

Observation #2: Query Time and Scoring Time are the most time-intensive steps, and they appear to be roughly equal even as parameters change. This is to be expected, as both steps involve operations on encrypted data.

Observation #3: Scoring Time performs worse when parallelized. While the root cause is still unknown, it is likely related to the size of the query being scored and the overhead of dividing that work across several CPUs at once versus running the calculation on a single CPU without that overhead.

Observation #4: Execution Time is linearly correlated with database size, as illustrated in Figure 5 (factor of 8 difference between the two lines; y-axis is in log scale). This helps to extrapolate how long a run might take based on database size. Additionally, this is an important element to the security aspect of the algorithm, because a non-linear runtime could reveal information to the DB Owner about the query and its similarity to the database entries.

Figure 5: SIG-DB execution time for 100 and 1000 database entries compared to different query lengths; both runs performed with 48 cores. Note: y-axis is in log-scale.
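
Given that linear scaling, a back-of-the-envelope runtime estimate only needs one calibration run. The sketch below reuses the 16-core run above for calibration; the resulting estimate is ours for illustration, not a reported result.

```python
# Calibrate a per-sequence cost from one timed run, then scale linearly.
measured_minutes, measured_sequences = 21, 82_992   # 16-core run above
minutes_per_sequence = measured_minutes / measured_sequences

target_sequences = 860_984                          # largest RefSeq subset tested
print("~%.0f minutes" % (target_sequences * minutes_per_sequence))  # ~218
```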

So, is it practical? Well… it depends. If you have a small dataset (i.e., fewer than 100,000 sequences of 20,000 bp or smaller), then SIG-DB can be executed in minutes. However, it’s probably a safe bet that most use cases will involve substantially larger databases and longer sequences. We ran the practicality tests again, this time using a larger database (75,960 database entries containing 1,612,925,787 sequences total) with the same variations of sequence lengths as previously described. As you’ll see in Figure 6, the total execution time is substantial: 2,455 minutes for the largest dataset tested. That’s nearly 41 hours to complete. In some instances, that amount of time may be tolerable. However, this dataset is still not as large as some that will likely be encountered, and the time to compute will only continue to grow. Improvements to the algorithm’s efficiency (such as recoding in C++) should be explored.

Figure 6: SIG-DB execution time for 75,960 database entries (1,612,925,787 total sequences) compared to different query lengths; running with 1 CPU core. Full bar represents total execution time. Blue box = encryption time; orange box = query time; gray box = scoring time.

Conclusions

There is still much work to be done to develop an operationally ready tool for encrypted computation on genomic sequence comparisons, but our goal was to demonstrate that such an approach is worth the time, energy, and resources to develop, and we believe we have done just that. Our approach used PHE, an encryption scheme that is near-ready for some commercial applications. However, other encryption schemes have been developed and may be worth exploring as well. Fully Homomorphic Encryption (FHE) is one we are interested in; we have completed some initial tests of this algorithm with FHE and obtained the same results discussed here (with approximately 2.25X longer execution times). Microsoft SEAL is the most advanced FHE library, and it is written in C++. To enable more data scientists and bioinformaticians to utilize SEAL, we have created a Python wrapper. Check out our blog for more on PySEAL.

We are at an exciting time for biology, data science, and information science! By looking outside the norm and leveraging tools and approaches from other communities, we can create new ways of unlocking the potential of the data at our fingertips.

B.Next is designing a biodefense technology strategy, demonstrating the potential that innovative tools and techniques can provide, and supporting the investment strategies of these innovations.

Check out our work at www.bnext.org
