Implementing Multiclass Logistic Regression using BigQuery ML

Manoveg Saxena
Dec 15, 2018 · 7 min read

Data science is a very hot topic nowadays. As a result, industry and academia are investing huge amounts of time and money in the research and advancement of this field. Companies like Google, Amazon, and Microsoft are betting big on AI. This has led to the emergence of innumerable open source and proprietary tools and libraries that assist the overall development of data science projects and experiments.

Even with a large number of machine learning libraries and tools like Tensorflow, Keras, MxNet, Pytorch, etc., one of the major challenges is moving the model to production. A machine learning experiment only brings value to an organization when it is deployed in production. Additionally, scaling the prediction process dynamically requires a decent amount of planning, effort, and infrastructure consideration: for example, Docker containers on Kubernetes, or managed model deployment on AWS, Google, etc. BigQuery promises to simplify these challenges.

BigQuery is a popular data warehouse solution from Google. Its serverless architecture makes it highly scalable with very little effort. A serverless architecture allows the end user to focus purely on the code itself: services such as Google Cloud Functions and Microsoft Azure Functions take care of the physical hardware, the virtual machine operating system, and web server software management.

Google BigQuery is based on a similar principle: the end user does not have to worry about the underlying hardware, virtual machines, or the number of nodes or instances. The user simply writes an SQL query and executes it. BigQuery's execution engine analyzes the resources needed to execute the query as fast as possible, provisions them, performs the query execution, and releases them.
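In practice, running a query takes only a couple of lines of client code. A minimal sketch using the Python client (assuming application default credentials are already configured):

from google.cloud import bigquery

# Submit a standard SQL query; BigQuery provisions and releases
# the necessary resources behind the scenes.
client = bigquery.Client()
query_job = client.query("select 17 + 25 as answer")
for row in query_job.result():
    print(row.answer)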

Getting started with BigQuery is really simple. The user just has to create an account in Google Cloud and activate BigQuery from the console, which takes a few seconds. The user can then create a new data-set. A data-set is logically equivalent to a database, and tables are created inside a data-set. The schema for a table can be defined at creation time or dynamically at the time of data ingestion. Once data is loaded in a BigQuery table, its ML features can be used to prepare models.
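The same setup can also be done programmatically. A minimal sketch with the Python client (the data-set and table names here simply match the ones used in the hands-on section later; adjust as needed):

from google.cloud import bigquery

client = bigquery.Client()

# Create a data-set (logically equivalent to a database).
dataset = bigquery.Dataset(client.dataset('bigquery_demo'))
dataset.location = 'EU'
client.create_dataset(dataset)

# Create a table with an explicit schema; alternatively, the schema
# can be autodetected at data ingestion time.
schema = [
    bigquery.SchemaField('Product', 'STRING'),
    bigquery.SchemaField('Consumer_complaint_narrative', 'STRING'),
]
table_ref = client.dataset('bigquery_demo').table('consumer_complaint')
client.create_table(bigquery.Table(table_ref, schema=schema))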

Google provides 10 GB of storage and 1 TB of data analysis free per month. One of the recent capabilities that Google added to BigQuery is BigQuery ML.


Basics of BigQuery ML

BigQuery ML allows its users to develop machine learning models directly in the data warehouse using simple SQL. It supports the standard SQL dialect, which is ANSI 2011 compliant. The model is created and deployed automatically as part of the training job. One of the biggest advantages is that the data does not have to be moved out of the data warehouse, saving an extra step: traditionally, as part of the training process, the data is moved out of the data store to be pre-processed in a feature engineering step. BigQuery handles feature engineering and pre-processing automatically out of the box. Due to its serverless architecture, the training process is automatically scalable.
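The general shape of a training statement looks like the sketch below (schematic only: the data-set, model, and column names are placeholders, and client is a BigQuery Python client as in the snippets above):

train_sql = """
create or replace model `my_dataset.my_model`
options (model_type='logistic_reg',
         input_label_cols=['my_label_column']) as
select my_label_column, my_feature_column
from `my_dataset.my_table`
"""
# A single statement trains the model and deploys it inside the warehouse.
client.query(train_sql).result()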

As of now, BigQuery ML supports the following models:

  1. Linear regression
  2. Binary logistic regression
  3. Multiclass logistic regression for classification

Hands-on multiclass classification

In this post, I will demonstrate how to use BigQuery ML for multiclass classification. Since logistic regression is the only classification model BigQuery ML supports at the time of writing, we will use it to build our model.

I will use the consumer complaints database as the example data-set. The data-set can be found here. I will use Python in a Jupyter notebook to connect to BigQuery. Authentication will be performed using a service account JSON key obtained from Google access control.

Problem Description

The consumer complaints database consists of complaints received about various financial products and is published by the Bureau of Consumer Financial Protection. I will use logistic regression to train a model on historical consumer complaints, using the consumer complaint narrative as the feature and the financial product name as the label. The model should be able to predict the product correctly based on the consumer complaint narrative text.

Solution Approach

First, insert the data into BigQuery using the Python client library. Once the data is ingested in a table, use BigQuery ML to train the model and get predictions. I will use Pandas to load the data from CSV and then ingest the dataframe directly into BigQuery via the BigQuery Python client.


Load data in BigQuery

Step 1: Read the CSV into a Pandas dataframe.
Since we will be loading data into BigQuery directly from a Pandas dataframe, the first step is to read the CSV into a dataframe. The date columns should be parsed correctly so that BigQuery can automatically ingest them as timestamps.

import pandas as pd
from google.cloud import bigquery

# Parse the two date columns so that BigQuery can ingest them as timestamps.
consumer_complaints_df = pd.read_csv(
    "Consumer_Complaints.csv", parse_dates=[0, 13]
).set_index("Date received")
consumer_complaints_df.head()
Read CSV in a Pandas dataframe

Step 2: Remove rows with a null value in the “Consumer complaint narrative” column.
Since the purpose of this demo is to predict the product name based on the “Consumer complaint narrative” text, all records with a null value in that column are removed.

print(consumer_complaints_df.shape[0])
print(consumer_complaints_df.index.min())
print(consumer_complaints_df.index.max())

# Keep only the rows that have a complaint narrative.
consumer_complaints_without_nan_df = consumer_complaints_df[
    consumer_complaints_df["Consumer complaint narrative"].notnull()
]
print(consumer_complaints_without_nan_df.shape[0])
consumer_complaints_without_nan_df.head()
Selecting records with a non-null “Consumer complaint narrative”

Print the unique labels to get an idea of what labels are in the data-set.

consumer_complaints_without_nan_df["Product"].unique()
List of labels

Step 3: Create a new dataframe with the top 5 products.
Due to quota limitations in the sandbox environment and free tier, I will limit the experiment to the top 5 products only.

# Count complaints per product and keep the five most frequent products.
top_5_products_df = (
    consumer_complaints_without_nan_df.groupby(["Product"])
    .size()
    .reset_index(name="count")
    .sort_values(["count"], ascending=False)
    .head(5)
)
top_5_products_df.head(10)
Top 5 products

Step 4: Filter the records whose product is in the top 5 list.

# Keep only complaints that belong to one of the top 5 products.
all_records_top5_product = consumer_complaints_without_nan_df.loc[
    consumer_complaints_without_nan_df["Product"].isin(top_5_products_df["Product"])
]
all_records_top5_product.shape
Dataset with only top 5 products

Step 5: Ingest the data into BigQuery.
Now we will connect to BigQuery using the service account JSON file and ingest the filtered dataframe into the BigQuery data-set.

# Authenticate with the service account JSON key file.
client = bigquery.Client.from_service_account_json('<path to service account json>')

dataset_id = 'bigquery_demo'
table_id = 'consumer_complaint'
dataset_ref = client.dataset(dataset_id)
dataset = client.get_dataset(dataset_ref)
table_ref = dataset_ref.table(table_id)

print('Dataset ID: {}'.format(dataset_id))
print('Description: {}'.format(dataset.description))

# Load the dataframe into the table. The schema is inferred from
# the dataframe itself, so no explicit load job config is needed.
job = client.load_table_from_dataframe(all_records_top5_product, table_ref, location='EU')
job.result()  # Wait for the load job to complete.
Insert dataframe in BigQuery
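As an optional sanity check (a small sketch reusing the client objects defined above), the row count of the freshly loaded table can be printed:

table = client.get_table(table_ref)
print('Loaded {} rows into {}.{}'.format(table.num_rows, dataset_id, table_id))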

Build a logistic regression multi-class classifier

Step 1: Execute the training job.
Training in BigQuery is quite straightforward; syntactically it looks like creating a view in a database. The create or replace model statement defines a training job that will create a new model. The options clause takes the parameters that define the model, in this case logistic regression. If none of the columns in the select is named “label”, then input_label_cols must contain the name of the column to be treated as the label. Other algorithm-specific hyper-parameters are specified in this clause as well.

# Train a model
train_query = """
create or replace model
`bigquery_demo.consumer_complaint_model`
options
( model_type='logistic_reg',
auto_class_weights=true,
input_label_cols=['Product'],
max_iterations=10) AS
select
Product, Consumer_complaint_narrative
from `bigquery_demo.consumer_complaint`
where Consumer_complaint_narrative is not null
and Date_received <= "2017-09-30 00:00:00"
limit 100000
"""
training_job = client.query(train_query).to_dataframe()
print(training_job)
Train a logistic regression classifier

Step 2: Check the training summary.
Calling ML.TRAINING_INFO on the model trained in the previous step returns the training summary, including per-iteration statistics such as the training loss and duration.

training_info = """
select
*
from
ML.TRAINING_INFO(MODEL `bigquery_demo.consumer_complaint_model`)
"""

training_info_job = client.query(training_info).to_dataframe()
print(training_info_job)
Training summary

Step 3: Evaluate the model on the test set.
Calling ML.EVALUATE on the model, with a select query supplying the data, evaluates the model. In this case the select query provides all records in the table dated later than 30-Sep-2017 as input for evaluation.
The average accuracy is 0.28, which is not very good. It might improve with some hyper-parameter tuning, but that is beyond the scope of this article. One important thing to note is that the entire pre-processing of the data for training and evaluation is managed by BigQuery, which may also have some impact on the accuracy.

# perform evaluation
query_evaluate = """
select
*
from ML.EVALUATE (MODEL `bigquery_demo.consumer_complaint_model`,
(
select
Product, Consumer_complaint_narrative
from `bigquery_demo.consumer_complaint`
where Consumer_complaint_narrative is not null
and Date_received > "2017-09-30 00:00:00"
))
"""

evaluation_job = client.query(query_evaluate).to_dataframe()
print(evaluation_job)
Model evaluation

Step 4: Get predictions.
Calling ML.PREDICT on the model performs predictions on the data supplied by the select query.

# perform prediction
query_predict = """
select
*
from ML.PREDICT (MODEL `bigquery_demo.consumer_complaint_model`,
(
select
Consumer_complaint_narrative
from `bigquery_demo.consumer_complaint`
where Consumer_complaint_narrative is not null
limit 2))
"""
prediction_job = client.query(query_predict).to_dataframe()
prediction_job.head()
Predictions from the model

Step 5: Evaluate the response.
The prediction returns a dataframe with a probability distribution over all the labels.

print(prediction_job["predicted_Product_probs"][0])
print(prediction_job["predicted_Product_probs"][1])
Probability distribution among classes
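Each entry in predicted_Product_probs is a list of label/probability pairs. A small sketch of extracting the most likely product for the first row (assuming the Python client renders each pair as a dict with label and prob keys, which is its usual behavior):

probs = prediction_job["predicted_Product_probs"][0]
best = max(probs, key=lambda p: p["prob"])
print(best["label"], best["prob"])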

Limitations:

BigQuery ML is still in beta. It is a relatively new product with some very promising features, but it still has a long way to go. As of now, it supports only linear regression, binary logistic regression, and multiclass logistic regression. The maximum number of unique labels allowed in multiclass logistic regression is 50.

The source code for this exercise is on GitHub.
