Of Machine Learning And Prisons: Simulating User Ratings for Nigerian Prisons Using Google Cloud Platform

Akande Imisioluwa
Feb 20, 2018 · 11 min read

What is Machine learning?

Machine learning is the practice of training a system to perform a specific task without explicitly programming it.

What does cloud mean in computing?

The cloud is a virtual environment that provides ubiquitous access over the Internet to shared pools of configurable system resources and services with minimal management effort and cost. The basic idea behind cloud computing is that resources are available at any point in time and you don’t pay for them when you are not using them.

In this tutorial, we will apply machine learning to train a dataset on Google Cloud Platform (GCP). The sample data we will use is relatively small, so the best-suited tool is Cloud SQL.

  • The data set used in this tutorial was obtained from the Nigeria Prison Service. Stratified random sampling was performed on the data set, with ratings of 1 labeled ‘poor’, 2 labeled ‘fair’ and 3 labeled ‘good’.

Cloud SQL is essentially MySQL: MySQL is an open-source relational database, and Cloud SQL is Google-managed MySQL.

What does relational database mean?

A relational database (RDB) is a collection of multiple data sets organized into tables, which contain rows (records) and columns (fields). The standard way to access and manipulate data in a relational database is through Structured Query Language (SQL).

Why you may want to consider CloudSQL over your server’s on-premises MySQL

i. Security

ii. Backup

iii. Global accessibility

iv. Automated replication when the need arises

v. Fast connection

The world is moving away from managing its own machines towards fully elastic, automated and scalable services. This lets you, as a data engineer, focus on the business application while the cloud vendor manages the infrastructure.

To start with, you need to sign up for a Google Cloud account. Then create a project on the Google Cloud Platform console; you can follow this link.

I titled my project My Test Project. Give yours whatever name you like.

Next, you need to add your billing information. You will receive $300 in free credit, valid for 365 days, which will be more than enough to implement the features demonstrated in this tutorial.

Once you are done adding your billing information, open the menu on the left-hand side, scroll down to SQL and click it.

This brings up the Cloud SQL page. You will be creating an instance.

Click Create Instance, select MySQL as the database engine, then choose a Second Generation instance, which is the latest version of Google Cloud SQL for MySQL, and give it the instance ID relation. Then pick a location; you can select the central region. Keep the remaining defaults and go ahead and create it.

Typically, to stage data into GCP, you first get the data into Cloud Storage, which is responsible for staging any input into Cloud SQL. You can follow this link to create Cloud Storage. Google Cloud Storage (GCS) is made up of buckets, which are essentially containers that hold your data. Create a bucket in GCS and name it auto-sampletest, then add a folder to the bucket and name it cloudsql.
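If you prefer the command line, you can also create the bucket from the Cloud Shell; a minimal sketch (the region is an assumption, and bucket names are globally unique, so yours may need a different name):

gsutil mb -l us-central1 gs://auto-sampletest/

The cloudsql folder will appear in the bucket once the first object is copied under that prefix, so you can also skip creating it by hand.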

The next step is to run the Google Cloud Shell. The Google Cloud Shell is a shell environment for managing resources hosted on Google Cloud Platform and it’s available for the duration of the browser window. The Cloud Shell, among other things, provides command-line access to a virtual machine instance in a terminal window that opens in the Google Cloud Platform Console. You can follow this link to activate it.

Once your shell is up, you can run:

git clone https://github.com/imisi-akande/Cloud-BigData-machine-learning.git 

to copy the database files from GitHub into the virtual machine. You now have the files on your Google Cloud Shell. You can double-check by changing directory into the repository, as depicted below:

Google Cloud Shell area

On the Google Cloud Shell, run this command to stage the files into the Google Cloud Storage:
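The exact source paths depend on how the cloned repository is laid out; assuming the three files are in your current directory, a copy along these lines stages them into the cloudsql folder of the bucket:

gsutil cp create_table.sql nigeria_prisons.csv rating.csv gs://auto-sampletest/cloudsql/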

Your cloud storage should contain three files: create_table.sql, nigeria_prisons.csv and rating.csv as shown below:

Google Cloud Storage area

The create_table.sql file is a standard data SQL file. It creates a database grade_spark. It also checks for and drops the Grade, Rating and Prison tables if they already exist. It then creates tables Prison, Rating and Grade with their columns.

At this point, go to the Cloud SQL area so that you can import your database files from Google Cloud Storage. Click on relation, which is the Cloud SQL instance you created earlier, then click Import in the instance details. Import the create_table.sql file first.

create_table.sql import

Then import the CSV files in the proper order. The order matters because the files depend on one another. Import the nigeria_prisons.csv file first:

nigeria_prisons.csv import

Followed by the rating.csv file:

rating.csv import

The nigeria_prisons.csv data set consists of all states in Nigeria, the number of prisons, the total number of inmates (male and female) and their graded capacity. The rating.csv file consists of ratings, i.e. people rating the capacity and the prison condition in each state. The Grade table is currently empty because you are yet to run the Grade machine learning job.

The data that you have collected corresponds to the prisons in Nigerian states and to the ratings people have given them. This is a MySQL database, so you interact with it the way you would normally interact with MySQL.

Now, let’s go ahead and explore the database. You need to connect to it from the Cloud Shell, and to do this you need to know the IP address of the Cloud Shell instance. Change directory into the test1 folder and run:

bash ./find_my_ip.sh 

on the Cloud Shell.
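The contents of find_my_ip.sh aren’t shown here, but a script like this typically just asks an external service to echo back the Cloud Shell VM’s public IP; a one-line sketch (api.ipify.org is an arbitrary example of such a service):

curl -s https://api.ipify.org && echo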

IP address for Cloud shell Instance area

Go to the Authorization tab. Add your displayed Cloud Shell IP address to the authorized networks section of the Cloud SQL instance. Click on save.

Authorization area for Cloud SQL

In order to connect to Cloud SQL, you need the instance’s IP address, which you can find on the Overview tab of the Cloud SQL instance area.

Then run:
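A minimal connection command looks like the following (CLOUDSQL_IP is a placeholder for the address you copied from the Overview tab; root is the default Cloud SQL MySQL user):

mysql --host=CLOUDSQL_IP --user=root --password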

When you run that command, it prompts for a password. Go to the Users tab and change the password.

User area

Return to the Cloud Shell and enter the new password. Once the password is accepted, you have access to the MySQL shell.

mysql shell

Let’s run some queries.

Run:

use grade_spark 

to connect to the database.

Run:

show tables;

to display the tables.

Run:

select * from Prison;

to select all the columns from the Prison table.

Prison

Run:

select * from Rating;

to select all the columns from the Rating table.

The rating.csv file will serve as our training dataset later in this tutorial.

Rating

When you run:

select * from Grade;

you get an empty table.

Let’s try a slightly more complex query.

Run:

select * from Prison where total > capacity;

to find the states where the total number of inmates exceeds the graded capacity; about 16 states meet that criterion.

Now that you have Cloud SQL loaded with data, the next step is to use Hadoop (Spark in particular) to create grades for the prisons. Hadoop is a rich open-source ecosystem for big data.

In the early days, the primary way to use Hadoop was to write MapReduce programs in Java. However, this was too verbose, so people came up with the idea of writing MapReduce jobs in a scripting language and running them on the Hadoop cluster. That technology is called Pig.

The second technology is called Hive. Hive is used when you have structured data and a schema: you associate that schema with structured data stored on a distributed file system. In the case of Hadoop, the data is stored in the Hadoop Distributed File System (HDFS), and you can then query it using Hive.

Modern programs on the Hadoop ecosystem tend to be written in Spark. Spark is fast, and it has a collection of libraries that let you deal with SQL, streaming data and machine learning.

Dataproc is Google’s managed Hadoop, Pig, Hive and Spark service. It is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. A Dataproc cluster consists of a master node and worker nodes, and you can resize it by adding or removing worker nodes. The advantage of Dataproc’s integration with Google Cloud Platform is that it can use Google Cloud Storage for data storage. In this tutorial, we will use Dataproc to train our data and make predictions.

Let’s go back to our Cloud Shell. You can follow this link to create a Dataproc cluster. Name your cluster cluster-1. Ensure you select the same zone as your Cloud SQL instance to reduce latency. The machine type for the master node and worker nodes should be 2 vCPU (7.5 GB memory, n1-standard-2); our data set is small and will work well with this configuration. Leave the remaining options at their defaults, then create your cluster.
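If you would rather use the command line, a roughly equivalent cluster can be created from the Cloud Shell with gcloud; a sketch, with the zone as an assumption that you should match to your Cloud SQL instance:

gcloud dataproc clusters create cluster-1 \
    --zone us-central1-a \
    --master-machine-type n1-standard-2 \
    --worker-machine-type n1-standard-2 \
    --num-workers 2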

cluster area

Once your cluster has been created, you can submit jobs to it. To do this, you first have to copy the files to Cloud Storage using gsutil.

Add a folder to the auto-sampletest bucket in the Cloud Storage area, change directory to the test2 folder in the Cloud Shell and run:
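Assuming the PySpark code sits in a sparkml directory inside test2, a recursive copy along these lines does it:

gsutil cp -r sparkml gs://auto-sampletest/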

Now, your bucket should contain the sparkml directory and your storage area should appear like this:

You need to change the Cloud SQL instance IP address in

/sparkml/train_and_apply.py

to the current IP address of your Cloud SQL instance. Go to the Cloud SQL area to check the IP address.

You can use vim or the built-in Cloud Shell text editor to edit the address. The train_and_apply.py file uses PySpark, the Python API for Spark, and relies on the Spark MLlib package, which makes machine learning easier thanks to its simplicity, scalability and easy integration with other tools.

Run:

vi sparkml/train_and_apply.py
train_and_apply.py
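If you prefer a one-liner to an interactive editor, a sed substitution does the same edit (OLD_IP and NEW_IP are placeholders for the address currently in the file and your Cloud SQL instance’s address):

sed -i 's/OLD_IP/NEW_IP/g' sparkml/train_and_apply.py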

Next, on the Cloud Shell, run:

bash authorize_dataproc.sh 

This command fetches the address of each machine in the Dataproc cluster and then adds those machines to the Cloud SQL instance’s list of authorized networks so they can connect to it.

When you run the command, it prompts for some parameters, such as the name of the cluster, the zone and the number of workers. Go to the Dataproc cluster area to double-check these parameters:

Double-check for required parameters

Enter the required parameters alongside the command in the Cloud Shell. The script then patches the Cloud SQL instance with the three IP addresses, so the three machines that make up the Dataproc cluster will be able to connect to Cloud SQL.
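To give a feel for what such an authorization script boils down to, here is a rough sketch of the idea (an illustration, not the actual script contents; the name filter and the IPs are placeholders):

# List the external IPs of the cluster's master and worker VMs
gcloud compute instances list \
    --filter="name ~ ^cluster-1" \
    --format="value(networkInterfaces[0].accessConfigs[0].natIP)"

# Patch the Cloud SQL instance so those machines are authorized to connect
gcloud sql instances patch relation \
    --authorized-networks=IP1,IP2,IP3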

Authorize Dataproc

Now that the patch is successful, you can submit the job. Fill in the necessary fields on the submit job page: the region is global, the cluster name is cluster-1, and the job type is PySpark, with train_and_apply.py as the main Python file, since we are using PySpark to train the dataset.
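The same submission can be made from the Cloud Shell; a sketch, assuming train_and_apply.py was staged into the sparkml folder of the bucket:

gcloud dataproc jobs submit pyspark \
    gs://auto-sampletest/sparkml/train_and_apply.py \
    --cluster=cluster-1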

submit job area

At this point, the job is submitted. It will take a few minutes to complete. While it runs, you can click on the job ID to open the log area and trace the logs in case of any error.

Log message area

Here we get predictions for all persons: the job predicts grades for every person that rated state prisons and stores those grades in the Cloud SQL instance. We will check the Cloud SQL instance later to see which grades have been predicted for each specific person.

Now that the job is completed, we need to remove the authorization for Dataproc. A Dataproc cluster is a job-specific resource, so when its job is done you can safely de-authorize and delete it. Go into your test1 directory and run:

bash authorize_cloudshell.sh

to re-authorize your Cloud Shell VM. With this, you will be able to access the database and run MySQL in the Cloud Shell environment.
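Once you are completely done with the cluster, you can also delete it from the Cloud Shell so you are not paying for idle machines, for example:

gcloud dataproc clusters delete cluster-1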

Now reconnect to the database with the same mysql command as before, making sure the host is your Cloud SQL instance’s IP address. Enter your password and you are connected to the database from the Cloud Shell VM.

Let’s run some queries.

Run:

use grade_spark

to access the database.

use database

Remember, our Grade table was empty before we used PySpark to train on the data set and make predictions. Now run:

select * from Grade;

to display all the grades for the different persons.

Now let’s look at how each individual person graded the states’ prisons.

Run:

select g.personid, g.prisonid, g.prediction, p.states, p.prisons, p.male, p.female, p.total, p.capacity, p.rating from Grade as g, Prison as p where g.prisonid = p.id and g.personid = 13;

In this query, you are selecting the personid, prisonid and prediction, together with information about the states’ prisons, from Grade g and Prison p where the prison IDs match and the personid is 13.

Person A

These are the top five states and their prisons as rated by this particular person (Person A).

Run the same query again, but this time change the personid to 7 to see another person’s results:

select g.personid, g.prisonid, g.prediction, p.states, p.prisons, p.male, p.female, p.total, p.capacity, p.rating from Grade as g, Prison as p where g.prisonid = p.id and g.personid = 7;

So this time, these are Person B’s top five rated states and their prisons based on predictions.

Person B

I hope you found this tutorial helpful. Look out for more articles on different machine learning technologies. In the meantime, feel free to drop your comments.

Data Source: Nigeria Prison Service
