Google Cloud Professional Machine Learning Engineer Certification Practice Questions

--

👉 Premium Questions + Detailed Solution Explanations + Reference Links

More sample questions with detailed solutions: https://drive.google.com/drive/folders/1Ky3fom7QCILjhNp_Nu7tmyT1g2cFnsTV?usp=sharing

Premium material: ExamsRocket.com

Q1. You are building an ML model to detect anomalies in real-time sensor data. You will use Pub/Sub to handle incoming requests. You want to store the results for analytics and visualization. How should you configure the pipeline?

A. 1 = Dataflow, 2 = AI Platform, 3 = BigQuery

B. 1 = DataProc, 2 = AutoML, 3 = Cloud Bigtable

C. 1 = BigQuery, 2 = AutoML, 3 = Cloud Functions

D. 1 = BigQuery, 2 = AI Platform, 3 = Cloud Storage

A is correct: data from the sensors is ingested into a Pub/Sub topic, pre-processed by a Dataflow streaming job, passed to a model hosted on AI Platform for anomaly predictions, and the results are stored in BigQuery for analysis and visualization with Data Studio or AI Platform Notebooks.

B is incorrect: the Apache Beam SDK used by Dataflow integrates natively with Pub/Sub streaming and is recommended over Dataproc here, and BigQuery is a better choice than Bigtable for analytics and visualization.

C is incorrect as Cloud Functions can’t be used for result analysis or visualization.

D is incorrect as Cloud Storage can’t be used for result analysis or visualization.

Note:

You can read JSON-formatted messages from a Pub/Sub topic and write them to a BigQuery table, but the results still need to land in BigQuery for analysis, and BigQuery cannot make API calls for model predictions; that step has to happen in the Dataflow job. Hence C and D cannot be correct.
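For illustration, a minimal Apache Beam (Dataflow) streaming sketch that reads Pub/Sub messages, tags them with a model prediction, and writes the results to BigQuery. The topic, table, schema, and the predict() stub are assumptions, not part of the question:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

TOPIC = "projects/my-project/topics/sensor-readings"  # hypothetical topic
TABLE = "my-project:analytics.anomaly_results"        # hypothetical table

def predict(record):
    # Placeholder for a call to the model served on AI Platform;
    # here the record is simply tagged with a dummy result.
    record["is_anomaly"] = False
    return record

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
     | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Predict" >> beam.Map(predict)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           TABLE,
           schema="sensor_id:STRING,value:FLOAT,is_anomaly:BOOLEAN",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))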

Links:

Similar problem statement: https://cloud.google.com/architecture/detecting-anomalies-in-financial-transactions

Q2. Your organization wants to make its internal shuttle service route more efficient. The shuttles currently stop at all pick-up points across the city every 30 minutes between 7 am and 10 am. The development team has already built an application on Google Kubernetes Engine that requires users to confirm their presence and shuttle station one day in advance. What approach should you take?

A. 1. Build a tree-based regression model that predicts how many passengers will be picked up at each shuttle station. 2. Dispatch an appropriately sized shuttle and provide the map with the required stops based on the prediction.

B. 1. Build a tree-based classification model that predicts whether the shuttle should pick up passengers at each shuttle station. 2. Dispatch an available shuttle and provide the map with the required stops based on the prediction.

C. 1. Define the optimal route as the shortest route that passes by all shuttle stations with confirmed attendance at the given time under capacity constraints. 2. Dispatch an appropriately sized shuttle and indicate the required stops on the map.

D. 1. Build a reinforcement learning model with tree-based classification models that predict the presence of passengers at shuttle stops as agents and a reward function around a distance-based metric. 2. Dispatch an appropriately sized shuttle and provide the map with the required stops based on the simulated outcome.

A is incorrect: confirmed attendance at each station is already collected one day in advance by the application, so there is no need for an ML model to predict passenger counts.

B is incorrect for the same reason: attendance at each station is already known in advance, so no classification model is needed.

C is correct: the stations with confirmed attendance are known one day in advance from the existing application, so the optimal route can be computed directly and an appropriately sized shuttle dispatched.

D is incorrect: reinforcement learning might be considered if confirmed attendance data were not available, but with the application already in place, option C is simpler and sufficient.

Q3. You were asked to investigate failures of a production line component based on sensor readings. After receiving the dataset, you discover that less than 1% of the readings are positive examples representing failure incidents. You have tried to train several classification models, but none of them converge. How should you resolve the class imbalance problem?

A. Use the class distribution to generate 10% positive examples.

B. Use a convolutional neural network with max-pooling and softmax activation.

C. Downsample the data with upweighting to create a sample with 10% positive examples.

D. Remove negative examples until the numbers of positive and negative examples are equal.

A is incorrect: generating synthetic positive examples (e.g., with SMOTE) might help, but with less than 1% positives there is too little data for upsampling to work well, and C is a better solution.

B is incorrect: a convolutional network with max-pooling can help with overfitting, but it does not resolve the data imbalance.

C is correct: downsample the majority (negative) class to reach roughly 10% positives, and upweight the downsampled examples when computing the loss. This boosts convergence while keeping the predictions meaningful; with downsampling alone, the prediction scores for the downsampled class would be off and training would take longer to converge. (Refer link, and see the sketch after these options.)

D is incorrect, as it would discard a large amount of data.
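A minimal sketch of downsampling with upweighting, assuming the readings are in a pandas DataFrame with a binary 'label' column; the column names and the 10% target are illustrative:

import pandas as pd

def downsample_with_upweighting(df, label_col="label", target_pos_fraction=0.10, seed=42):
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Keep just enough negatives so positives make up ~10% of the sample.
    n_neg_keep = int(len(pos) * (1 - target_pos_fraction) / target_pos_fraction)
    neg_sampled = neg.sample(n=min(n_neg_keep, len(neg)), random_state=seed)
    downsampling_factor = len(neg) / len(neg_sampled)
    sample = pd.concat([pos, neg_sampled]).sample(frac=1, random_state=seed)
    # Upweight the downsampled (negative) examples by the downsampling factor
    # so the loss still reflects the original class distribution.
    sample["weight"] = sample[label_col].map({1: 1.0, 0: downsampling_factor})
    return sample

The 'weight' column can then be passed as sample_weight to model.fit() or to the training framework's loss function.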

Links:

Handling unbalanced datasets: https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data

Q4. You want to rebuild your ML pipeline for structured data on Google Cloud. You are using PySpark to conduct data transformations at scale, but your pipelines are taking over 12 hours to run. To speed up development and pipeline run time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage. How should you build the pipeline on Google Cloud while meeting the speed and processing requirements?

A. Use Data Fusion's GUI to build the transformation pipelines, and then write the data into BigQuery.

B. Convert your PySpark into SparkSQL queries to transform the data and then run your pipeline on Dataproc to write the data into BigQuery.

C. Ingest your data into Cloud SQL, convert your PySpark commands into SQL queries to transform the data, and then use federated queries from BigQuery for machine learning.

D. Ingest your data into BigQuery using BigQuery Load, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.

A is incorrect, as Data Fusion's GUI-based pipelines do not use the SQL syntax required in the question.

B is correct, as Dataproc will significantly reduce the running time of the Spark SQL transformation jobs, and BigQuery can then be used to create the ML models (see the sketch below).

C is incorrect; here the transformation runs on Cloud SQL, which would not scale the process.

D is incorrect, as this would not scale the data transformation routine, and it is generally better to transform data during ingestion.
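A rough PySpark sketch of option B, assuming the raw data is Parquet in Cloud Storage and the spark-bigquery connector is available on the Dataproc cluster; the bucket, table, and column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform-to-bq").getOrCreate()

raw = spark.read.parquet("gs://my-bucket/raw/")  # hypothetical input path
raw.createOrReplaceTempView("raw_events")

transformed = spark.sql("""
    SELECT user_id,
           DATE(event_ts) AS event_date,
           SUM(amount)    AS total_amount
    FROM raw_events
    GROUP BY user_id, DATE(event_ts)
""")

(transformed.write
    .format("bigquery")
    .option("table", "my_dataset.transformed_events")  # hypothetical table
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save())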

Links:

GCP Doc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

Jupyter Notebook (Github): https://github.com/tfayyaz/cloud-dataproc/blob/master/notebooks/python/1.2.%20BigQuery%20Storage%20%26%20Spark%20SQL%20-%20Python.ipynb

Q5. You manage a team of data scientists who use a cloud-based backend system to submit training jobs. This system has become very difficult to administer, and you want to use a managed service instead. The data scientists you work with use many different frameworks, including Keras, PyTorch, Theano, Scikit-learn, and custom libraries. What should you do?

A. Use the AI Platform custom containers feature to receive training jobs using any framework.

B. Configure Kubeflow to run on Google Kubernetes Engine and receive training jobs through TF Job.

C. Create a library of VM images on Compute Engine and publish these images on a centralized repository.

D. Set up Slurm workload manager to receive jobs that can be scheduled to run on your cloud infrastructure.

A is correct, as AI Platform supports training jobs with custom containers, so the existing code in all of these frameworks can be containerized and run on AI Platform. AI Platform is a managed service that supports distributed training, hyperparameter tuning, monitoring, logging, and visualization (for certain frameworks).

B is incorrect, as Kubeflow is not an out-of-the-box managed service on GCP; it is a platform for managing and orchestrating complex Kubernetes ML workflows.

C is incorrect, as Compute Engine VMs are not a managed service and would not make administration any simpler.

D is incorrect; a self-managed Slurm setup is even further from a managed-service solution.

Links:

https://cloud.google.com/ai-platform/prediction/docs/use-custom-container

Q6. You work for an online retail company that is creating a visual search engine. You have set up an end-to-end ML pipeline on Google Cloud to classify whether an image contains your company's product. Expecting the release of new products in the near future, you configured a retraining functionality in the pipeline so that new data can be fed into your ML models. You also want to use AI Platform's continuous evaluation service to ensure that the models have high accuracy on your test dataset. What should you do?

A. Keep the original test dataset unchanged even if newer products are incorporated into retraining.

B. Extend your test dataset with images of the newer products when they are introduced to retraining.

C. Replace your test dataset with images of the newer products when they are introduced to retraining.

D. Update your test dataset with images of the newer products when your evaluation metrics drop below a pre-decided threshold.

A is incorrect, as it would give no information about model performance on the new data.

B is correct: because the model is retrained on both the original training data and the new data, evaluation must cover both the original test data and test images of the new products. As new data arrives, it drifts slightly from the original distribution, which is why retraining is done in the first place; whatever strategy is used to update the training data should be mirrored in the test data.

C is incorrect: the model is still trained on the original dataset, so testing only on new data would not correctly represent its performance.

D is incorrect; there is no reason to wait for the evaluation metric to drop below a threshold before adding images of the new products to the test data.

Note:

Even if the model were retrained only on the new data, the test set should not contain only new data; older data can instead be gradually dropped by a certain percentage during evaluation.

Q7. You need to build classification workflows over several structured datasets currently stored in BigQuery. Because you will be performing the classification several times, you want to complete the following steps without writing code: exploratory data analysis, feature selection, model building, training, and hyperparameter tuning and serving. What should you do?

A. Configure AutoML Tables to perform the classification task.

B. Run a BigQuery ML task to perform logistic regression for the classification.

C. Use AI Platform Notebooks to run the classification model with pandas library.

D. Use AI Platform to run the classification model job configured for hyperparameter tuning.

A is correct, as AutoML Tables supports exploratory data analysis, feature selection, model building, training, hyperparameter tuning, and serving without writing any code.

B is incorrect, as running a classification task with BigQuery ML requires writing SQL.

C is incorrect; AI Platform Notebooks are generally used for experimentation (EDA, training, tuning) rather than serving, and extensive coding would be required for these tasks.

D is incorrect; running a classification job on AI Platform requires writing the classification code, and EDA and feature selection would have to be done separately before the training job.

Links:

AutoML Table functionalities: https://cloud.google.com/automl-tables/docs/beginners-guide

Q8. You work for a public transportation company and need to build a model to estimate delay times for multiple transportation routes. Predictions are served directly to users in an app in real-time. Because different seasons and population increases impact the data relevance, you will retrain the model every month. You want to follow Google-recommended best practices. How should you configure the end-to-end architecture of the predictive model?

A. Configure Kubeflow Pipelines to schedule your multi-step workflow from training to deploying your model.

B. Use a model trained and deployed on BigQuery ML, and trigger retraining with the scheduled query feature in BigQuery.

C. Write a Cloud Functions script that launches training and deploying jobs on AI Platform that is triggered by Cloud Scheduler.

D. Use Cloud Composer to programmatically schedule a Dataflow job that executes the workflow from training to deploying your model.

A is correct; Kubeflow Pipelines can orchestrate end-to-end ML workflows as Kubernetes containers, and AI Platform can also be driven from the Kubeflow Pipelines SDK. This is Google's recommended way to run an end-to-end ML pipeline (a pipeline sketch follows below).

B is incorrect; this is also a viable solution, but the modeling capabilities are limited to what BigQuery ML supports, and the question gives no detail on whether the model can be expressed in BigQuery ML, so this option is discarded.

C is incorrect, as this is a very crude way to implement an end-to-end ML pipeline.

D is incorrect. Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow and is a Google-recommended way to schedule continuous training jobs, but Dataflow is not used to run training jobs; AI Platform is used for training and deployment.

Note:

All options are feasible here, but we have to select the best option.
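For illustration, a minimal Kubeflow Pipelines sketch (v1 SDK style) of a train-then-deploy workflow. The container images and the monthly schedule are assumptions, and in practice each step could call AI Platform for training and deployment:

import kfp
from kfp import dsl

def train_op():
    return dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",   # hypothetical image
        command=["python", "train.py"],
    )

def deploy_op():
    return dsl.ContainerOp(
        name="deploy",
        image="gcr.io/my-project/deploy:latest",  # hypothetical image
        command=["python", "deploy.py"],
    )

@dsl.pipeline(name="monthly-delay-model",
              description="Train and deploy the transit delay model")
def delay_pipeline():
    train = train_op()
    deploy = deploy_op()
    deploy.after(train)  # deploy only once training has finished

if __name__ == "__main__":
    # Compile to a package that can be uploaded and scheduled as a recurring
    # (e.g., monthly) run on Kubeflow Pipelines / AI Platform Pipelines.
    kfp.compiler.Compiler().compile(delay_pipeline, "delay_pipeline.yaml")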

Links:

Running ML pipelines on GCP (also refer internal links for Kubeflow):

https://cloud.google.com/ai-platform/pipelines/docs/run-pipeline

Cloud Composer Continuous Training jobs (for reference): https://www.coursera.org/lecture/ml-pipelines-google-cloud/what-is-cloud-composer-CuXTQ

Q9. You are developing ML models with an AI Platform for image segmentation on CT scans. You frequently update your model architectures based on the newest available research papers and have to rerun training on the same dataset to benchmark their performance. You want to minimize computation costs and manual intervention while having version control for your code. What should you do?

A. Use Cloud Functions to identify changes to your code in Cloud Storage and trigger a retraining job.

B. Use the gcloud command-line tool to submit training jobs on the AI Platform when you update your code.

C. Use Cloud Build linked with Cloud Source Repositories to trigger retraining when new code is pushed to the repository.

D. Create an automated workflow in Cloud Composer that runs daily and looks for changes in code in Cloud Storage using a sensor.

A is incorrect; a Cloud Function watching the code in Cloud Storage would trigger a training job on every change, whereas it is better to trigger training manually only when required.

B is correct; manual intervention is minimal because a single gcloud command submits the training job to AI Platform, you submit only when the model code is ready for retraining, and AI Platform charges only for the ML units consumed, which minimizes cost.

C is incorrect, as the build pipeline would be triggered on every commit to the repository, whether or not you want to run the build (here, the training). This is great for continuous-deployment pipelines where application availability is the priority, but not for ML model training.

D is incorrect; checking for code changes daily could likewise trigger training jobs that are not needed.

Note:

Tip: AI-Platform is generally recommended by GCP for custom training workflows.

Links:

AI Platform Submit Training jobs: https://cloud.google.com/ai-platform/training/docs/training-jobs

Q10. Your team needs to build a model that predicts whether images contain a driver’s license, passport, or credit card. The data engineering team already built the pipeline and generated a dataset composed of 10,000 images with driver’s licenses, 1,000 images with passports, and 1,000 images with credit cards. You now have to train a model with the following label map: [‘drivers_license’, ‘passport’, ‘credit_card’]. Which loss function should you use?

A. Categorical hinge

B. Binary cross-entropy

C. Categorical cross-entropy

D. Sparse categorical cross-entropy

A is incorrect; categorical hinge is a maximum-margin loss that only penalizes a prediction when the correct class does not beat the others by a set margin (refer to the 'How to use hinge loss' link), and it is not the standard choice for this problem.

B is incorrect; binary cross-entropy is used either for two-class classification or for multi-label problems (multiple correct outputs per example), where the targets are one-hot/multi-hot encoded and the output activation is sigmoid.

C is incorrect; categorical cross-entropy is used for multi-class classification when the labels are one-hot encoded and there is exactly one correct class, with a softmax output activation.

D is correct; sparse categorical cross-entropy is used for multi-class classification when the labels are integer (label) encoded and there is exactly one correct class, with a softmax output activation. That is exactly the setup described in the question (see the sketch below).
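A minimal Keras sketch contrasting the two cross-entropy variants; the model architecture and input shape are illustrative only:

import tensorflow as tf

num_classes = 3  # ['drivers_license', 'passport', 'credit_card']

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(128, 128, 3)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# Integer (label-encoded) targets such as 0, 1, 2 -> sparse categorical cross-entropy.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# If the targets were one-hot encoded, e.g. [0, 1, 0], the loss would instead be
# "categorical_crossentropy".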

Links:

How to use hinge loss:

https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/

Choosing loss function: https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other

Q11. You are designing an ML recommendation model for shoppers on your company’s e-commerce website. You will use Recommendations AI to build, test, and deploy your system. How should you develop recommendations that increase revenue while following best practices?

A. Use the ‘Other Products You May Like’ recommendation type to increase the click-through rate.

B. Use the ‘Frequently Bought Together’ recommendation type to increase the shopping cart size for each order.

C. Import your user events and then your product catalog to make sure you have the highest quality event stream.

D. Because it will take time to collect and record product data, use placeholder values for the product catalog to test the viability of the model.

A is incorrect; this would only produce 'Other Products You May Like' recommendations, and option C, importing user events and the product catalog, would yield better results.

B is incorrect; similarly, this would only produce 'Frequently Bought Together' recommendations, and option C would yield better results.

C is correct, as Google's recommended way to use Recommendations AI is to import your user events and product catalog so the model learns from the highest-quality event stream.

D is incorrect; recommendations are driven by real user behaviour against the real product catalog, so placeholder catalog values would not produce a meaningful test.

Links:

Recommendation AI(Refer How It works diagram): https://cloud.google.com/recommendations

Q12. You are designing architecture with a serverless ML system to enrich customer support tickets with informative metadata before they are routed to a support agent. You need a set of models to predict ticket priority, predict ticket resolution time, and perform sentiment analysis to help agents make strategic decisions when they process support requests. Tickets are not expected to have any domain-specific terms or jargon. The proposed architecture has the following flow:

[Architecture diagram: support tickets flow through Enrichment Cloud Functions that call three prediction endpoints, labeled 1, 2, and 3]

Which endpoints should the Enrichment Cloud Functions call?

A. 1 = AI Platform, 2 = AI Platform, 3 = AutoML Vision

B. 1 = AI Platform, 2 = AI Platform, 3 = AutoML Natural Language

C. 1 = AI Platform, 2 = AI Platform, 3 = Cloud Natural Language API

D. 1 = Cloud Natural Language API, 2 = AI Platform, 3 = Cloud Vision API

A is incorrect, as sentiment analysis is an NLP problem, not a computer vision problem; AutoML Vision trains image classification or object detection models on your own data.

B is incorrect; AutoML Natural Language trains a custom text-classification (here, sentiment) model on your own dataset without writing model code, but since the tickets contain no domain-specific jargon, the pretrained Cloud Natural Language API suffices.

C is correct; custom models deployed on AI Platform handle resolution-time and ticket-priority prediction, and the Cloud Natural Language API is Google's out-of-the-box NLP API for powerful text analysis. Because the tickets contain no jargon, its pretrained sentiment analysis can be used.

D is incorrect, as sentiment analysis is an NLP problem, not a computer vision problem; the Cloud Vision API is Google's image analysis API.

Links:

AutoML NLP features: https://cloud.google.com/natural-language/automl/docs/features

Cloud NLP API (refer features): https://cloud.google.com/natural-language/

Q13. You have trained a deep neural network model on Google Cloud. The model has a low loss on the training data but is performing worse on the validation data. You want the model to be resilient to overfitting. Which strategy should you use when retraining the model?

A. Apply a dropout parameter of 0.2 and decrease the learning rate by a factor of 10.

B. Apply an L2 regularization parameter of 0.4 and decrease the learning rate by a factor of 10.

C. Run a hyperparameter tuning job on the AI Platform to optimize for the L2 regularization and dropout parameters.

D. Run a hyperparameter tuning job on the AI Platform to optimize for the learning rate and increase the number of neurons by a factor of 2.

A is incorrect; there is not enough information to justify these specific parameter values.

B is incorrect; there is not enough information to justify these specific parameter values.

C is correct; L2 regularization and dropout are both used to reduce overfitting in neural networks.

When overfitting appears to come from having too little training data, L2 regularization is the usual remedy; when it comes from model complexity, dropout is used in neural networks (with excessive features, L1 is used in traditional algorithms). Since the cause of the overfitting isn't stated, run a hyperparameter tuning job on AI Platform to find appropriate values for both (a sketch follows below).

D is incorrect; increasing the number of neurons increases model complexity and would worsen overfitting. (Dropout reduces effective model complexity and makes the weights more robust; this is a rough explanation, refer to the link below for details.)
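A minimal Keras sketch combining L2 regularization and dropout; the layer sizes, L2 factor, and dropout rate are placeholders that would normally come out of the hyperparameter tuning job:

import tensorflow as tf
from tensorflow.keras import layers, regularizers

l2_factor = 0.01      # hypothetical value, to be tuned
dropout_rate = 0.3    # hypothetical value, to be tuned

model = tf.keras.Sequential([
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(l2_factor),
                 input_shape=(64,)),
    layers.Dropout(dropout_rate),   # randomly drops activations during training
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(l2_factor)),
    layers.Dropout(dropout_rate),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])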

Links:

Statquest L2 Regularization: https://www.youtube.com/watch?v=Q81RR3yKn30

Statquest L1 Regularization: https://www.youtube.com/watch?v=NGf0voTMlcs

Dropout: https://www.youtube.com/watch?v=D8PJAL-MZv8

Q14. You built and manage a production system that is responsible for predicting sales numbers. Model accuracy is crucial because the production model is required to keep up with market changes. Since being deployed to production, the model hasn’t changed; however, the accuracy of the model has steadily deteriorated.
What issue is most likely causing the steady decline in model accuracy?

A. Poor data quality

B. Lack of model retraining

C. Too few layers in the model for capturing information

D. Incorrect data split ratio during model training, evaluation, validation, and test

A is incorrect; the model performed well when deployed, so poor quality of the original data does not explain the steady deterioration. Retraining is what the model needs to keep up with market changes.

B is correct; without retraining, the model does not keep up with market changes (retraining on new data is what keeps it current).

C is incorrect, as this is not an underfitting problem: accuracy deteriorates over time, and nothing is said about training accuracy being poor.

D is incorrect, as an incorrect split ratio would not explain accuracy deteriorating over time.

Links:

Why retraining is important: https://neurospace.io/blog/2019/09/why-is-retraining-so-important/

Q15. You have been asked to develop an input pipeline for an ML training model that processes images from disparate sources at low latency. You discover that your input data does not fit in memory. How should you create a dataset following Google-recommended best practices?

A. Create a tf.data.Dataset.prefetch transformation.

B. Convert the images to tf.Tensor objects, and then run Dataset.from_tensor_slices().

C. Convert the images to tf.Tensor objects, and then run tf.data.Dataset.from_tensors().

D. Convert the images into TFRecords, store the images in Cloud Storage, and then use the tf.data API to read the images for training.

A is incorrect; prefetching alone does not address data that doesn't fit in memory, and prefetching is more efficient when combined with TFRecords.

B is incorrect; Dataset.from_tensor_slices() requires the tensors to fit in memory, and TFRecords are the recommended format.

C is incorrect; tf.data.Dataset.from_tensors() likewise builds the dataset from in-memory tensors, and TFRecords are the recommended format.

D is correct; TFRecords read with tf.data.TFRecordDataset are the recommended approach. The tf.data API is optimized for TFRecords, and prefetching works very efficiently with them, so the next batch of data is prepared while the current batch is being processed by the model during training.

Note:

TFRecords with tf.data are GCP's recommended approach when training a TensorFlow model on a large dataset.
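A minimal tf.data sketch for option D, assuming JPEG images serialized into TFRecords in Cloud Storage; the path and feature spec are placeholders:

import tensorflow as tf

filenames = tf.io.gfile.glob("gs://my-bucket/images/train-*.tfrecord")  # hypothetical path

feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed["label"]

dataset = (tf.data.TFRecordDataset(filenames, num_parallel_reads=tf.data.AUTOTUNE)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(1000)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))  # prepare the next batch while the current one trains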

Links:

Tensorflow official Doc: https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset

Kaggle Notebook for Tfrecords: https://www.kaggle.com/ryanholbrook/tfrecords-basics

Q16. You are an ML engineer at a large grocery retailer with stores in multiple regions. You have been asked to create an inventory prediction model. Your model’s features include region, location, historical demand, and seasonal popularity. You want the algorithm to learn from new inventory data on a daily basis. Which algorithms should you use to build the model?

A. Classification

B. Reinforcement Learning

C. Recurrent Neural Networks (RNN)

D. Convolutional Neural Networks (CNN)

Note:

From the question, this can be interpreted as a time-series problem, given terms like historical demand and seasonal popularity.

A is incorrect; 'classification' is too generic an option for this forecasting problem.

B is incorrect; reinforcement learning is suited to problems like game AI and industrial automation (refer link).

C is correct; RNNs are designed for sequential data such as time series and NLP (see the sketch after these options).

D is incorrect; CNNs are mainly used for image data, although 1D CNNs are sometimes combined with RNNs on sequential data.
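For illustration, a tiny Keras LSTM sketch for sequential demand data; the window length, feature count, and layer sizes are assumptions:

import tensorflow as tf

# 30 days of history per example, 4 features (e.g., demand, region/seasonality encodings).
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(30, 4)),
    tf.keras.layers.Dense(1),  # predicted demand for the next period
])
model.compile(optimizer="adam", loss="mse")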

Links:

RL: https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html, https://analyticsindiamag.com/top-10-free-resources-to-learn-reinforcement-learning/

Q17. You are building a real-time prediction engine that streams files that may contain Personally Identifiable Information (PII) to Google Cloud. You want to use the Cloud Data Loss Prevention (DLP) API to scan the files. How should you ensure that the PII is not accessible by unauthorized individuals?

A. Stream all files to Google Cloud, and then write the data to BigQuery. Periodically conduct a bulk scan of the table using the DLP API.

B. Stream all files to Google Cloud and write batches of the data to BigQuery. While the data is being written to BigQuery, conduct a bulk scan of the data using the DLP API.

C. Create two buckets of data: Sensitive and Non-sensitive. Write all data to the Non-sensitive bucket. Periodically conduct a bulk scan of that bucket using the DLP API, and move the sensitive data to the Sensitive bucket.

D. Create three buckets of data: Quarantine, Sensitive, and Non-sensitive. Write all data to the Quarantine bucket. Periodically conduct a bulk scan of that bucket using the DLP API, and move the data to either the Sensitive or Non-Sensitive bucket.

A is incorrect; the method works, but a periodic bulk scan does not satisfy the real-time requirement of the prediction engine.

B is correct; since the application is real-time, batches of data are scanned with the DLP API while they are being written to BigQuery, so PII is identified without delaying the pipeline (a DLP scan sketch follows below).

C is incorrect, as writing all data to the Non-sensitive bucket first would expose PII to unauthorized users until the periodic scan moves it.

D is incorrect; the quarantine method is sound, but the periodic bulk scan is not real-time.
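A minimal sketch of inspecting text with the Cloud DLP API using the google-cloud-dlp client library; the project ID, info types, and input text are placeholders:

from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # hypothetical project

inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
    "include_quote": True,
}
item = {"value": "Contact jane.doe@example.com or +1 555-0100"}

response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    # Each finding identifies a piece of PII and its likelihood.
    print(finding.info_type.name, finding.quote, finding.likelihood)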

Links:

DLP with GCS: https://cloud.google.com/architecture/automating-classification-of-data-uploaded-to-cloud-storage

DLP with BQ: https://cloud.google.com/bigquery/docs/scan-with-dlp

Q18. You work for a large hotel chain and have been asked to assist the marketing team in gathering predictions for a targeted marketing strategy. You need to make predictions about user lifetime value (LTV) over the next 20 days so that marketing can be adjusted accordingly. The customer dataset is in BigQuery, and you are preparing the tabular data for training with AutoML Tables. This data has a time signal that is spread across multiple columns. How should you ensure that AutoML fits the best model to your data?

A. Manually combine all columns that contain a time signal into an array. Allow AutoML to interpret this array appropriately. Choose an automatic data split across the training, validation, and testing sets.

B. Submit the data for training without performing any manual transformations. Allow AutoML to handle the appropriate transformations. Choose an automatic data split across the training, validation, and testing sets.

C. Submit the data for training without performing any manual transformations and indicate an appropriate column as the Time column. Allow AutoML to split your data based on the time signal provided, and reserve the more recent data for the validation and testing sets.

D. Submit the data for training without performing any manual transformations. Use the columns that have a time signal to manually split your data. Ensure that the data in your validation set is from 30 days after the data in your training set and that the data in your testing set is from 30 days after your validation set.

A is incorrect; there is no need to manually combine the time-signal columns into an array, and an automatic split should not be used because AutoML would treat all rows independently and split them randomly.

B is incorrect, as with an automatic split AutoML would treat all rows independently and split the data randomly (in the appropriate ratios).

C is incorrect; when a Time column is specified, AutoML Tables by default uses the earliest 80% of the rows for training, the next 10% for validation, and the latest 10% for testing, but that split is not guaranteed to satisfy the time horizon in the question (for example, the 10% validation slice might not even cover 20 days).

D is correct, as this satisfies the horizon in the question: each 30-day window is longer than the 20-day prediction horizon, so the model can be validated on the next 20 days within the validation and test sets (a split sketch follows below).
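A rough pandas sketch of the manual split, assuming AutoML Tables is given a data split column (here called ml_use) with TRAIN/VALIDATE/TEST values; the file and column names are placeholders:

import pandas as pd

df = pd.read_csv("customer_ltv.csv", parse_dates=["event_date"])  # hypothetical file

max_date = df["event_date"].max()
test_start = max_date - pd.Timedelta(days=30)     # most recent 30 days -> TEST
valid_start = test_start - pd.Timedelta(days=30)  # the 30 days before that -> VALIDATE

df["ml_use"] = "TRAIN"
df.loc[df["event_date"] > valid_start, "ml_use"] = "VALIDATE"
df.loc[df["event_date"] > test_start, "ml_use"] = "TEST"  # overrides VALIDATE for the newest rows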

Links:

AutoML Preparing data (refer section ‘The Time Column’): https://cloud.google.com/automl-tables/docs/prepare

Q19. You have written unit tests for a Kubeflow Pipeline that require custom libraries. You want to automate the execution of unit tests with each new push to your development branch in Cloud Source Repositories. What should you do?

A. Write a script that sequentially performs the push to your development branch and executes the unit tests on Cloud Run.

B. Using Cloud Build set an automated trigger to execute the unit tests when changes are pushed to your development branch.

C. Set up a Cloud Logging sink to a Pub/Sub topic that captures interactions with Cloud Source Repositories. Configure a Pub/Sub trigger for Cloud Run, and execute the unit tests on Cloud Run.

D. Set up a Cloud Logging sink to a Pub/Sub topic that captures interactions with Cloud Source Repositories. Execute the unit tests using a Cloud Function that is triggered when messages are sent to the Pub/Sub topic.

A is incorrect; pushing through a script bypasses the normal Git workflow and its rich CLI functionality.

B is incorrect, as Cloud Build is not the recommended place to run these unit tests.

C is correct; Cloud Run can run custom containers (including the required custom libraries) in a serverless way. Pub/Sub notifications can be generated for updates to the Cloud Source Repositories branch, and that topic can be used to trigger the Cloud Run service that executes the unit tests.

D is incorrect, as Cloud Functions can run unit tests with custom libraries packaged as local packages, but only when they are in the same language as the function runtime.

Links:

Unit testing with pub/sub and cloud functions: https://cloud.google.com/functions/docs/samples/functions-pubsub-unit-test#functions_pubsub_unit_test-python

Cloud Source repository with Pub/Sub: https://cloud.google.com/source-repositories/docs/pubsub-notifications

Cloud Run v/s Cloud Function: https://medium.com/google-cloud/cloud-run-and-cloud-function-what-i-use-and-why-12bb5d3798e1

Packaging custom libraries as local packages in cloud function (same language): https://cloud.google.com/functions/docs/writing/specifying-dependencies-python

Q20. You are training an LSTM-based model on AI Platform to summarize text using the following job submission script:

gcloud ai-platform jobs submit training $JOB_NAME \
  --package-path $TRAINER_PACKAGE_PATH \
  --module-name $MAIN_TRAINER_MODULE \
  --job-dir $JOB_DIR \
  --region $REGION \
  --scale-tier basic \
  -- \
  --epochs 20 \
  --batch_size=32 \
  --learning_rate=0.001

You want to ensure that training time is minimized without significantly compromising the accuracy of your model. What should you do?

A. Modify the ‘epochs’ parameter.

B. Modify the ‘scale-tier’ parameter.

C. Modify the ‘batch size’ parameter.

D. Modify the ‘learning rate’ parameter.

A is incorrect; fewer training epochs would hurt model performance.

B is correct; cost is not raised as a concern in the question, so the scale tier can be upgraded from basic to significantly reduce training time.

C is incorrect; on the same basic-tier machine, changing the batch size alone would not significantly reduce training time and could affect model performance.

D is incorrect; a higher learning rate might make the model converge faster, but it changes the training dynamics and can cause exploding gradients.

Links:

Running Training job on AI Platform: https://cloud.google.com/ai-platform/training/docs/training-jobs

Scale Tier AI Platform: https://cloud.google.com/ai-platform/training/docs/machine-types

--
