Part III: Building Machine Learning Models for Peptide Vaccines Using AutoML

AutoML can be used as a quick way to explore data and try to understand what models can be used to perform inference.

One of the advanced features of AutoML Tables is that it automatically performs common feature engineering tasks for you and produces state-of-the-art models (see features for more details) with one click.

Key advantages of AutoML:

  • Takes your datasets and starts training multiple model architectures at the same time.
  • Determines the best model architecture for your data: linear, feedforward deep neural network, gradient boosted decision tree, AdaNet, or ensembles of various model architectures.
  • Exports models as a Docker image.
  • Supports reverse engineering and model exploration in TensorBoard.

For this blog, we will continue to use the Immune Epitope Database (IEDB), which is publicly available in BigQuery, as described in the previous blog. We will demonstrate AutoML both through BigQuery ML and through the AutoML console.

Creating AutoML models with BigQuery is as simple as writing SQL statements. The following statement creates an AutoML model that classifies whether a peptide is a good candidate for vaccine testing, using the peptide sequence (Description), the allele name, and the quantitative measurement as feature columns. The data is filtered to peptides of 9 or 10 mers only; the reason was discussed in Parts I-II. The rand() < 0.8 condition draws a random sample of roughly 80% of the matching rows for training, so each run can train on a different sample. You need a project and a dataset to run the statement and hold the model; the following statement uses 'corona' as the dataset name.

CREATE OR REPLACE MODEL `corona.Classification_model_automl`
OPTIONS
(
  model_type = 'automl_classifier',
  input_label_cols = ['Qualitative_Measure'],  -- label column to predict
  budget_hours = 1.0                           -- training budget in hours
)
AS
SELECT
  Qualitative_Measure, Description, Allele_Name, Quantitative_measurement
FROM
  `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
WHERE LENGTH(Description) IN (9, 10)      -- 9- and 10-mers only
  AND organism_name LIKE '%coronavirus%'
  AND RAND() < 0.8                        -- random sample of about 80% of rows
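To make the row selection concrete, here is a minimal Python sketch of what the length filter and the rand() < 0.8 condition do. It is an illustration only, not how BigQuery executes the query: the function name select_training_rows, the seed parameter, and the toy rows are hypothetical, and the organism_name filter is left out for brevity.

```python
import random


def select_training_rows(rows, sample_fraction=0.8, seed=None):
    """Keep 9- and 10-mers, then keep each row with probability sample_fraction."""
    rng = random.Random(seed)
    selected = []
    for row in rows:
        # Mirrors: length(Description) IN (9,10)
        if len(row["Description"]) not in (9, 10):
            continue
        # Mirrors: rand() < 0.8 (an independent coin flip per row)
        if rng.random() < sample_fraction:
            selected.append(row)
    return selected


# Hypothetical toy rows; the real data comes from the IEDB table in BigQuery.
toy_rows = [
    {"Description": "A" * n, "Allele_Name": "HLA-A*02:01"}
    for n in (8, 9, 10, 11)
    for _ in range(1000)
]

train = select_training_rows(toy_rows, seed=42)
print(len(train), "of 2000 eligible rows selected")
```

Because the coin flip is made per row, the sample size is only approximately 80%, and in BigQuery a rerun of the statement selects a different set of rows.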

AutoML explores multiple models and trains them in parallel. Details of the models can be found in Operations Logging. From the Operations Logging web UI, you can examine the hyperparameters that AutoML explored.

Operations Logging for AutoML

After creating your model, you can evaluate the performance of the classifier using the ML.EVALUATE function. The following ML.EVALUATE query evaluates the model:

SELECT
*
FROM ML.EVALUATE(MODEL `corona.Classification_model_automl`)

You can also use the ML.ROC_CURVE function for ROC-specific metrics. A classifier predicts one of a set of enumerated target values for a label. For example, in this tutorial you are using a classification model that predicts one of the qualitative measure classes for peptide binding. The following query grades the model by the roc_auc value that ML.EVALUATE returns:

SELECT roc_auc,
CASE WHEN roc_auc > .8 THEN 'good'
WHEN roc_auc > .7 THEN 'fair'
WHEN roc_auc > .5 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM ML.EVALUATE(MODEL `corona.Classification_model_automl`)
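For readers who want intuition for the roc_auc number, the sketch below computes ROC AUC for a binary case as the fraction of (positive, negative) pairs in which the positive example gets the higher score, and then applies the same thresholds as the CASE expression above. The function names and the toy labels and scores are hypothetical, and the sketch covers the binary case only; it does not show how BigQuery ML computes or aggregates the metric.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def model_quality(auc):
    """Same strict thresholds as the CASE expression in the SQL query."""
    if auc > 0.8:
        return "good"
    if auc > 0.7:
        return "fair"
    if auc > 0.5:
        return "not great"
    return "poor"


# Hypothetical toy data: 1 = binder, 0 = non-binder.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(round(auc, 3), model_quality(auc))
```

Note that the comparisons are strict, so a roc_auc of exactly 0.8 is graded 'fair' and exactly 0.5 (no better than random ranking) is graded 'poor'.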

Now that you have evaluated your model, the next step is to use it to predict outcomes. The following example uses the BigQuery ML model to predict the qualitative measure for a small random sample of peptides. Optionally, you can export the model and publish it to Google AI Platform to serve predictions.

SELECT
  predicted_Qualitative_Measure,
  predicted_Qualitative_Measure_probs,
  Qualitative_Measure AS original_result
FROM ML.PREDICT(MODEL `corona.Classification_model_automl`, (
  SELECT Qualitative_Measure, Description, Allele_Name, Quantitative_measurement
  FROM `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
  WHERE LENGTH(Description) IN (9, 10)
    AND organism_name LIKE '%coronavirus%'
    AND RAND() < 0.0009))  -- small random sample of rows to score

The result shows the predicted class along with the probability for each class. You can compare the prediction with the original result.
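The comparison can be sketched in a few lines of Python once the query result is available as a list of rows. The sketch assumes each row is a dict with the three columns selected above, and that the probabilities column is a list of entries with label and prob fields, which is how ML.PREDICT reports class probabilities for classifiers. The helper names and the toy rows are hypothetical.

```python
def top_probability(probs):
    """Return the {label, prob} entry with the highest probability."""
    return max(probs, key=lambda entry: entry["prob"])


def agreement_rate(rows):
    """Fraction of rows where the predicted class equals the original result."""
    if not rows:
        raise ValueError("no rows to compare")
    matches = sum(
        1 for row in rows
        if row["predicted_Qualitative_Measure"] == row["original_result"]
    )
    return matches / len(rows)


# Hypothetical toy rows shaped like the query output; values are made up.
rows = [
    {
        "predicted_Qualitative_Measure": "Positive",
        "predicted_Qualitative_Measure_probs": [
            {"label": "Positive", "prob": 0.91},
            {"label": "Negative", "prob": 0.09},
        ],
        "original_result": "Positive",
    },
    {
        "predicted_Qualitative_Measure": "Negative",
        "predicted_Qualitative_Measure_probs": [
            {"label": "Positive", "prob": 0.35},
            {"label": "Negative", "prob": 0.65},
        ],
        "original_result": "Positive",
    },
    {
        "predicted_Qualitative_Measure": "Negative",
        "predicted_Qualitative_Measure_probs": [
            {"label": "Positive", "prob": 0.12},
            {"label": "Negative", "prob": 0.88},
        ],
        "original_result": "Negative",
    },
]

for row in rows:
    best = top_probability(row["predicted_Qualitative_Measure_probs"])
    print(best["label"], best["prob"], "original:", row["original_result"])

print("agreement:", agreement_rate(rows))
```

Keep in mind that rows scored this way may also have been part of the random training sample, so the agreement rate is an optimistic sanity check rather than a held-out evaluation.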

AutoML Console

You can use the AutoML console directly if you don't want to deal with SQL. You still might want to create a view in BigQuery that filters your data to peptides of 9 or 10 mers only, based on a query like this:

SELECT
  Qualitative_Measure, Description, Allele_Name, Quantitative_measurement
FROM
  `bigquery-public-data.immune_epitope_db.mhc_ligand_full`
WHERE LENGTH(Description) IN (9, 10)
  AND organism_name LIKE '%coronavirus%'
  AND RAND() < 0.8

Data can be loaded directly from BigQuery or Google Cloud Storage.

The target column for prediction has to be specified in the console. Model training can be started in the console once the number of hours allowed for training is set.

If an optimal model is found before the specified time is up, training stops early. Once the model is trained, we can explore model accuracy and perform batch or online predictions.

We can also export the resulting model. You'll find the export option under TEST & USE. (See the documentation for details on the export process.)

Model Exploration with TensorBoard

Finally, the AutoML model can be visualized and explored in TensorBoard. This step can provide more insight into model complexity and give bioinformaticians leads for model enhancement.

Visualizing the AutoML model in TensorBoard requires a conversion step. You will need to have TensorFlow 1.14 or 1.15 installed to run the conversion script.
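Before running the conversion, it can save time to confirm that the environment has one of the two TensorFlow versions named above. The helper below is a hypothetical convenience, not part of the conversion script; it only inspects a version string, so it runs even where TensorFlow is not installed.

```python
def is_supported_tf_version(version):
    """Return True if the version string is a TensorFlow 1.14.x or 1.15.x release."""
    parts = version.split(".")
    if len(parts) < 2:
        return False
    try:
        major, minor = int(parts[0]), int(parts[1])
    except ValueError:
        return False
    return (major, minor) in {(1, 14), (1, 15)}


# Example version strings; in a real environment, pass tf.__version__ instead:
#   import tensorflow as tf
#   is_supported_tf_version(tf.__version__)
for candidate in ("1.14.0", "1.15.2", "2.3.0"):
    print(candidate, is_supported_tf_version(candidate))
```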

Then download this script to the parent directory of model_export, e.g. via

curl -O https://raw.githubusercontent.com/amygdala/code-snippets/master/ml/automl/tables/model_export/convert_oss.py

Create a directory for the output (e.g. converted_export), then run the script as follows:

mkdir converted_export

python ./convert_oss.py --saved_model ./model-export/tbl/<your_renamed_directory>/saved_model.pb --output_dir converted_export

Then, point TensorBoard to the converted model graph:

# Load the TensorBoard notebook extension

%load_ext tensorboard

%tensorboard --logdir=converted_export

You will see a rendering of the model graph, and can pan and zoom to view model sub-graphs in more detail.

Exploring automatically generated models can give data scientists and data analysts new leads on how models can be enhanced.

Next: we will discuss building a pipeline to operationalize this at scale.

Stay safe!

Closing comments and action items

  1. With AutoML, one can quickly experiment and create a machine learning pipeline. BQML and AutoML provide an easy way to build and consume ML models using SQL statements or directly through the Google Cloud Console. Learn more about BQML here.
  2. Building and consuming ML models is just one small task within the overall data pipeline and research flow. To optimize your overall data and ML pipeline, you can leverage Google Cloud's AI Platform Pipelines. You can learn about AI Platform Pipelines here. To see how the vaccine research flow above can be modeled as an AI pipeline, see this GitHub resource.
  3. BigQuery is free without a credit card (within the free tier). If you add a credit card, make sure to set cost controls.

I don’t speak for my employer. This is not official Google work. Any errors that remain are mine, of course.
