Part II: Analyze Antigen Epitope Data with BigQuery and Build ML Model

The Immune Epitope Database (IEDB) is a freely available resource funded by NIAID. In an effort to make the data easier to query and analyze, Google Cloud is making it publicly available in BigQuery. BigQuery has a sandbox through which you can try it out without having to sign up for Google Cloud (or having to provide a credit card).

The IEDB provides a rich collection of peptidic epitope data and antigen research assays. In this article, we will use BigQuery to explore HLA and peptide information, and build a machine learning model that predicts the binding classification of a peptide for a given HLA molecule. Such a model can help identify strong candidates for vaccine testing.

Understanding Key Data Entities

The epitope data set contains many tables with information about the assays that researchers have carried out. In this section, we will explore the key tables. The antigen_full table gives the number of assays and epitopes for a given antigen. Let's start exploring the data set with a few questions: which antigen has the most epitopes for an organism of interest, for example 'coronavirus'? Go to https://console.cloud.google.com/bigquery and type in:

SELECT
  Antigen_Name,
  SUM(number_of_epitopes) AS epitopes
FROM `bigquery-public-data.immune_epitope_db.antigen_full_v3`
WHERE organism_name LIKE '%coronavirus%'
GROUP BY Antigen_Name
ORDER BY epitopes DESC

The query returns the antigens ranked by their total number of epitopes.

The result suggests that, among coronaviruses, the Spike glycoprotein has the largest number of epitopes. Let's explore further: which MHC allele has the most references in the binding affinity data for coronavirus?

SELECT
  allele_name AS allele,
  COUNT(*) AS counts
FROM `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
WHERE
  object_type = 'Linear peptide'
  AND organism_name LIKE '%coronavirus%'
  AND antigen_name LIKE 'Spike%'
GROUP BY allele_name
ORDER BY counts DESC

The query result shows that, in this data set, the three alleles with the most peptide binding records are of the HLA-A and HLA-DRB types. But which peptides bind to these alleles, and which length of linear Spike protein peptide has the most ligand-peptide binding references?

SELECT
  LENGTH(Description) AS peptide_mers,
  COUNT(*) AS counts
FROM `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
WHERE
  object_type = 'Linear peptide'
  AND organism_name LIKE '%coronavirus%'
  AND antigen_name LIKE 'Spike%'
GROUP BY 1
ORDER BY counts DESC

The query result shows that, for coronavirus Spike protein peptides, most binding affinity data is available for peptide lengths of 9 and 10. This suggests that, when looking for a vaccine candidate, one may want to focus further study and testing on 9-mer and 10-mer peptides. Based on the two queries above, which 9-mer or 10-mer Spike protein peptides bind most strongly to an HLA allele? The following query lists the results in order from high positive binding to negative, based on the quantitative measurement:

SELECT
  antigen_name,
  Description AS peptide,
  Parent_protein,
  assay_group AS binding_type,
  allele_name AS allele,
  qualitative_measure,
  quantitative_measurement AS result_score
FROM `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
WHERE
  object_type = 'Linear peptide'
  AND organism_name LIKE '%coronavirus%'
  AND assay_group != 'qualitative binding'
  AND allele_name LIKE 'HLA-%'
  AND antigen_name LIKE 'Spike%'
  AND LENGTH(Description) IN (9, 10)
ORDER BY result_score

Build Machine Learning Model with BQ

Building ML models with BigQuery is as simple as writing SQL statements, which makes ML modeling accessible even to SQL developers and analysts. In this example, we will create a simple classification model that predicts whether a given peptide has a strong binding affinity with a certain HLA allele.

The following statement creates a logistic regression classification model. It uses the allele, the peptide, and the scaled quantitative measurement as feature columns to classify whether a peptide is a good candidate for vaccine testing. The data is filtered to peptides of length 9 or 10 only. Also, because the statement can be run multiple times on different samples, we randomly select about 80% of the rows for training. You need a project and a dataset to run the statement and hold the model; the statement below uses 'corona' as the dataset name:

CREATE OR REPLACE MODEL `corona.Classification_model_P2`
TRANSFORM (
  Qualitative_Measure, Description, Allele_Name,
  ML.MIN_MAX_SCALER(Quantitative_measurement) OVER() AS rs
)
OPTIONS
(
  model_type='logistic_reg',
  input_label_cols=['Qualitative_Measure']
)
AS
SELECT
  Qualitative_Measure, Description, Allele_Name, Quantitative_measurement
FROM
  `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
WHERE LENGTH(Description) IN (9, 10)
  AND organism_name LIKE '%coronavirus%'
  AND RAND() < 0.8
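Because the rand() filter draws a fresh sample on every run, the rows used for training change each time the statement is executed. One common alternative, not used in this article, is to assign each row to a training or evaluation bucket from a stable hash of a row key, so the split is reproducible. The sketch below illustrates the idea in plain Python, outside BigQuery. The function name split_bucket and the row keys are hypothetical and exist only for illustration.

```python
import hashlib

def split_bucket(key: str, train_fraction: float = 0.8) -> str:
    """Return 'train' or 'eval' for a row, based on a stable hash of its key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a position in [0, 1).
    position = int(digest[:8], 16) / 16 ** 8
    return "train" if position < train_fraction else "eval"

# Hypothetical row keys (peptide plus allele), for illustration only.
row_keys = [f"PEPTIDE{i:04d}|HLA-A*02:01" for i in range(10000)]
buckets = [split_bucket(k) for k in row_keys]
train_share = buckets.count("train") / len(buckets)
print(f"train share: {train_share:.3f}")
```

The same key always lands in the same bucket, so rows held out for evaluation are never seen during training, no matter how often the split is recomputed.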

As the statement above shows, you can not only create a machine learning model but also do feature engineering as part of the model input definition. Here, the TRANSFORM clause rescales Quantitative_measurement to the range [0, 1] with min-max scaling. Because the transformation is stored with the model, no extra data manipulation is needed at prediction time: the model applies the same feature formatting to the raw input rows. In the BigQuery web UI, you can examine the newly created model for its accuracy and confusion matrix.

You can also query the model statistics with the ML.EVALUATE function, using a SQL statement like the one below:
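To make the min-max scaling step concrete, here is a minimal Python sketch of the arithmetic: each value is shifted by the minimum and divided by the range, so the smallest value maps to 0 and the largest to 1. The function name min_max_scale and the sample measurements are hypothetical, and the handling of a constant column is an assumption of this sketch, not a statement about BigQuery's behavior.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant column maps to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical quantitative measurements, for illustration only.
measurements = [50.0, 500.0, 5000.0]
scaled = min_max_scale(measurements)
print(scaled)
```

Scaling matters here because raw measurements can span several orders of magnitude, while the other features are categorical.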

SELECT
  *
FROM ML.EVALUATE(MODEL `corona.Classification_model_P2`)

Once the model is ready, let's run a prediction statement to check how the model performs. Run the following SQL statement to predict on a random sample from our dataset:

SELECT
  predicted_Qualitative_Measure, predicted_Qualitative_Measure_probs, Qualitative_Measure AS original_result
FROM ML.PREDICT(MODEL `corona.Classification_model_P2`, (
  SELECT Qualitative_Measure, Description, Allele_Name, Quantitative_measurement
  FROM `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
  WHERE LENGTH(Description) IN (9, 10)
    AND organism_name LIKE '%coronavirus%'
    AND RAND() < 0.0009))

The query result shows the predicted qualitative classification with its probabilities, which you can compare against the original value. Such inferences are useful for narrowing down vaccine testing candidates and speeding up the overall research cycle. ML modeling and testing is an iterative process. AI Pipeline is Google AI Platform's pipeline framework, which helps you optimize and efficiently operationalize your ML process.
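If you export the prediction rows, the comparison of predicted and original labels can be summarized in a few lines of Python. This is a minimal sketch that assumes each row has been reduced to an (original, predicted) pair. The function names and the label values in the sample rows are hypothetical and are not results from the query above.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count how often each (original, predicted) label pair occurs."""
    return Counter(pairs)

def accuracy(pairs):
    """Fraction of rows where the predicted label equals the original label."""
    if not pairs:
        raise ValueError("no prediction rows to score")
    correct = sum(1 for original, predicted in pairs if original == predicted)
    return correct / len(pairs)

# Hypothetical (original, predicted) label pairs, for illustration only.
rows = [
    ("Positive", "Positive"),
    ("Negative", "Positive"),
    ("Negative", "Negative"),
    ("Positive-High", "Positive-High"),
]
counts = confusion_counts(rows)
score = accuracy(rows)
print(counts)
print(score)
```

The off-diagonal pairs in the counts, where the two labels differ, show which classes the model confuses most often.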

Next: we will discuss AutoML and building a pipeline to operationalize at scale.

Stay safe!

Closing comments and action items

  1. With BigQuery, one can create a petabyte-scale data warehouse with powerful query performance. Not only does BQ provide scalable storage and compute, it also provides an easy way to build and consume ML models using SQL statements. Learn more about BQML here.
  2. Building and consuming ML is just one small task within the overall data pipeline and research flow. To optimize your overall data and ML pipeline, you can leverage Google Cloud's AI Platform Pipeline. You can learn about AI Pipeline here. You can read how the vaccine research flow in the example above can be modeled with AI Pipeline, and try it yourself, with these GitHub resources.
  3. BigQuery is free without a credit card (within the free tier). If you add a credit card, make sure to set cost controls.

I don’t speak for my employer. This is not official Google work. Any errors that remain are mine, of course.

Jignesh Mehta
Google AI Platform for Predicting Vaccine Candidate

Google Data Analytics Specialist, Driving Digital Transformation and Solutions with Cloud Data Platform, Advanced Analytics and AI/ML