How to Build an ML Model and Get Predictions Using TensorFlow (1/3)

Code Heroku · Apr 22, 2019 · 8 min read

Exploring the Dataset: Predicting a Baby's Weight


Welcome! This is the first tutorial in the series How to Build an ML Model and Get Predictions Using TensorFlow.

In this series, we are going to use the tf.estimator API.
It's a high-level TensorFlow API that greatly simplifies machine learning programming. It can be used for training, evaluation, and prediction in a machine learning workflow.
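
To give a feel for the API before we dive into the data, here is a minimal sketch (TensorFlow 1.x style) of an Estimator being trained. The feature name and toy values are placeholders for illustration, not part of this tutorial's dataset:

import tensorflow as tf

# One hypothetical numeric feature; the real features come later in the series.
feature_cols = [tf.feature_column.numeric_column('mother_age')]
model = tf.estimator.LinearRegressor(feature_columns=feature_cols)

def train_input_fn():
    # Toy in-memory data: the Estimator just needs batches of (features, labels).
    features = {'mother_age': [25.0, 30.0, 35.0]}
    labels = [7.2, 7.5, 7.1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

model.train(train_input_fn, steps=100)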

On Google Cloud Platform (GCP), we can use Cloud ML Engine to train machine learning models in TensorFlow and other Python ML libraries (e.g., scikit-learn).

Cloud ML Engine gives us a service for training and deploying TensorFlow models. If your data fits in memory, pretty much any machine learning framework will work. But once your datasets grow larger than memory, you need a more sophisticated ML framework.

This is where TensorFlow comes in.

The first piece of technology we are going to use is Cloud Datalab.

With the help of Cloud Datalab, we can explore, transform, visualize, and process data. The data may reside in Google BigQuery, Compute Engine, or Cloud Storage.

BigQuery

Besides Cloud Datalab, the other piece of technology we will be using is BigQuery. Since Datalab runs on GCP, accessing BigQuery from it is a breeze.

Based on this, we can then use the data to build data pipelines for deployment to BigQuery or to create ML models.

The dataset of births that we need is stored in BigQuery.

BigQuery is a serverless data warehouse that operates at massive scale.

What's also noteworthy about Datalab is that it is open source, so developers can simply fork it and/or submit pull requests on GitHub.

Overview of the Series:

  1. In this tutorial, you explore the data, visualize the dataset, and pick features for your machine learning model.
  2. In the second tutorial, you create a sample dataset so that you can do local development of a TensorFlow model; we will then use it to develop a model with the TensorFlow Estimator API.
  3. Once we have the prototype in place, we will move on to productionization: the training of the model will be distributed and scaled out on Cloud ML Engine, and the trained model will be deployed as a service.

Prerequisites:

To get started, you are going to need a Google Cloud account. The free account is sufficient for this purpose.

Google Cloud Datalab:

You may ask: why do I need a Google Cloud account when I can use Jupyter, Python, and TensorFlow on my own resources?


The answer is that, from the notebook, you can easily access BigQuery-sized data collections directly, something your own laptop could never hold.

To get started go to the Datalab home page.

To say a few words about Cloud Datalab: it is a very powerful tool created to explore, analyze, transform, and visualize data, and to build machine learning models, on Google Cloud Platform.
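
As a quick taste of that, here is a minimal sketch of querying a large public table straight from a notebook cell, using the same google.datalab.bigquery module we rely on later; nothing is downloaded except the single result row:

import google.datalab.bigquery as bq

# Count the rows of the public natality table inside BigQuery itself.
bq.Query('SELECT COUNT(*) FROM publicdata.samples.natality').execute().result()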

Setting up:

(Note: we will be using this setup throughout the series.)

  • Sign up for Google Cloud Platform:
    It is basically free to start, as you'll be getting an initial $300 credit.
  • Create a bucket using your GCP Console:
    In your GCP Console, click on the Navigation menu (three bars, top-left) and select Storage.
  • Click on Create Bucket. Choose Regional bucket and set a unique name. Then, click Create.
  • Open Cloud Shell. The Cloud Shell icon is at the top right of the GCP web console.

Google Cloud Shell gives developers command-line access to their compute resources on Google Cloud Platform.

In Cloud Shell, list the available compute zones:

gcloud compute zones list
  • Pick a zone that is close to you and whose region currently supports Cloud ML Engine jobs.
    For example, if you are in the US, you can choose us-east1-c as your zone.
  • Create the Datalab VM, replacing <ZONE> with your chosen zone (here, us-east1-c):

datalab create mydatalabvm --zone us-east1-c

Be patient. Datalab will take about 5 minutes to start.
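
A small tip, assuming the datalab CLI's connect subcommand (part of the same tool used above): if your Cloud Shell session disconnects while you wait, the VM keeps running, and you can reattach to it instead of recreating it:

datalab connect mydatalabvm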

Cloud Console Prompt for the Web Preview
  • Click on the Web Preview icon at the top-right corner of the Cloud Shell toolbar. Click Change Port, enter port 8081, and click Change and Preview.

Cloning the Repo:

  • In the Cloud Datalab home page, navigate into notebooks and add a new notebook using the icon.
  • Rename this notebook to repocheckout.
  • In the new notebook, enter the following commands in a cell, and click on Run (on the top navigation bar) to run them:

%bash
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
rm -rf training-data-analyst/.git

  • Confirm that you have cloned the repo by going back to the Datalab browser and checking that a directory called training-data-analyst exists.
  • Let's head to the exploration!

Explore and Visualize the Dataset:

Now, you may be thinking: why are we exploring the data at all? Why shouldn't we just feed it into the machine learning model?

Isn't it the ML model's job to figure out which inputs matter and which don't?

Well, real life doesn’t work that way.

Many times the data, as it is recorded, isn’t what you expect.

Real-world data is surprisingly messy, and if we use it without understanding it first, we will end up using it in a way that makes productionization very hard.

That's because in production we have to deal with the data as it comes in, in real time.

The goal of this tutorial is to investigate which features influence the baby's weight, which is what we are going to predict.

For example, if you want to know whether the field plurality or the field is_male has some influence on the weight, you can generate a bar chart of the average weight of babies sharing the same plurality value, or the same is_male value.

However, it's not enough to just check whether the average weight of the baby differs across these conditions; you have to look at the whole distribution of values too.

Checking the number of babies for each value is also necessary, because if you don't have enough sample data for some range of input values, predictions using those values may turn out to be inaccurate.
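
In pandas terms, that sanity check is short. A minimal sketch, assuming a DataFrame df that has the is_male and weight_pounds columns we query below:

# How many examples back each value, and how does the average weight differ?
print(df['is_male'].value_counts())
print(df.groupby('is_male')['weight_pounds'].mean())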

Get started:

Start by running the following code in the Datalab notebook.

BUCKET = 'cloud-training-demos-ml'   # replace with the bucket you created earlier
PROJECT = 'cloud-training-demos'     # replace with your GCP project ID
REGION = 'us-central1'
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

Then, in a new cell (%%bash must be the first line of its own cell), create the bucket if it doesn't already exist:

%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

Explore Data:

The data is natality data: a record of births in the US. (Natality refers to the birth rate: the ratio of the number of births to the size of the population.)

We will explore this BigQuery dataset using Cloud Datalab.

Here, the goal is to predict the baby’s weight given a number of factors about the pregnancy and the baby’s mother.

A hash of the year and month will be used to split the data; this way, twins born on the same day won't end up in different cuts of the data.
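
To make that concrete, here is a sketch of how such a hash column can drive a repeatable split; the 80/20 threshold is an illustrative assumption, not something fixed by this article:

# Every record from the same year-month lands on the same side of the split.
train_df = df[df['hashmonth'] % 10 < 8]   # ~80% of year-months for training
eval_df = df[df['hashmonth'] % 10 >= 8]   # ~20% held out for evaluation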

1. Create the SQL query:

We use natality data from after the year 2000.

query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  ABS(FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING)))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

2. Call BigQuery:

Then examine the result in a dataframe.

import google.datalab.bigquery as bq
# LIMIT 100 keeps this exploratory pull small and fast.
df = bq.Query(query + " LIMIT 100").execute().result().to_dataframe()
df.head()
BigQuery DataFrame

Let’s write a query to find the unique values for each of the columns and the count of those values. This is important to ensure that we have enough examples of each data value, and to verify our hunch that the parameter has predictive value.

3. Create a function:

It finds the number of records and the average weight for each value of the chosen column.

def get_distinct_values(column_name):
  sql = """
  SELECT
    {0},
    COUNT(1) AS num_babies,
    AVG(weight_pounds) AS avg_wt
  FROM
    publicdata.samples.natality
  WHERE
    year > 2000
  GROUP BY
    {0}
  """.format(column_name)
  return bq.Query(sql).execute().result().to_dataframe()

4. Bar plot:

To see is_male with avg_wt on a linear scale and num_babies on a log scale.

df = get_distinct_values('is_male')
df.plot(x='is_male', y='num_babies', logy=True, kind='bar');
df.plot(x='is_male', y='avg_wt', kind='bar');

5. Line plot:

To see mother_age with avg_wt on a linear scale and num_babies on a log scale.

df = get_distinct_values('mother_age')
df = df.sort_values('mother_age')
df.plot(x='mother_age', y='num_babies', logy=True);
df.plot(x='mother_age', y='avg_wt');

6. Bar plot:

To see plurality (singleton, twins, etc.) with avg_wt on a linear scale and num_babies on a log scale.

df = get_distinct_values('plurality')
df = df.sort_values('plurality')
df.plot(x='plurality', y='num_babies', logy=True, kind='bar');
df.plot(x='plurality', y='avg_wt', kind='bar');

7. Bar plot:

To see gestation_weeks with avg_wt on a linear scale and num_babies on a log scale.

df = get_distinct_values('gestation_weeks')
df = df.sort_values('gestation_weeks')
df.plot(x='gestation_weeks', y='num_babies', logy=True, kind='bar');
df.plot(x='gestation_weeks', y='avg_wt', kind='bar');

All these factors seem to play a part in the baby's weight:

  • Male babies are heavier on average than female babies.
  • Teenage and older moms tend to have lower-weight babies.
  • Twins, triplets, etc. have lower weight than single births.
  • Premature babies weigh less, as do babies born to single moms.

In addition, it is important to check whether you have enough data (here, the number of babies) for each value. Otherwise, the model's predictions for values that don't have enough data may not be reliable.

In the next article, we are going to create a smaller dataset for model development and local training. We will then create a TensorFlow model using the high-level Estimator API and train it on the sampled dataset.

Thanks for the read!
