Machine Learning with Google Colab

A quick start guide to Google Colab

Nunzio Logallo
Google DSC PoliMi Journal
7 min read · Apr 4, 2021

--

Nowadays there are several ways to run code on a PC: you can use an IDE (Integrated Development Environment), a CLI (Command Line Interface) with the proper software installed, or you can use notebooks.
Let's focus on the last tool I mentioned. And no, I am not talking about paper notebooks; I'm talking about an environment like this one:

Jupyter Notebook — Welcome page

With a notebook, you can write small pieces of code and execute them step by step, figuring out which piece of code works better than another, or preparing your code for a presentation or explanation. In a notebook you can also attach plain or formatted text, images, or graphs. Different applications use the notebook "technology", but the most famous one is certainly Jupyter Notebook, an open-source tool with which you can create and run a notebook pretty easily.
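As a trivial, hypothetical illustration, the same computation could be split across two cells that you run one after the other, with the second cell reusing the state left behind by the first:
# Cell 1: prepare some data
numbers = [1, 2, 3, 4, 5]
# Cell 2: run the next step separately, reusing the variable defined above
print(sum(numbers))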
If you want to run simple code you can easily do it on a personal computer, but think about heavier tasks like machine learning algorithms: without the proper hardware they can take ages. That is why programmers and data scientists usually work in the cloud, to speed up their work and get to the point faster.

Google Colab

Google Colab — Welcome page

Google Colab is a platform that allows you to run code directly in the cloud. This means you can use very powerful hardware to run your code, and the only requirement is a Google account.
In Google Colab you can only use Python as a programming language, but that is more than enough for the features it offers. Every line of code you write is automatically saved to your Google Drive storage, so you can access your project notebook whenever you want, wherever you are.
Through the "Runtime" menu you can choose whether to use a GPU or a TPU for your project. Be smart, though: these resources are limited, so if your project runs perfectly well without them, leave them to someone who needs them. If you do need them, remember that they only help if you import libraries that can take advantage of them (you can verify which accelerator your runtime actually has with the short check after the list below). In the next section I will show you an easy example, but before that, I want to explain how to run your code on Google Colab:

  • Every cell of your code is executed by a runtime. You can connect to the Google-hosted runtime, or choose a local runtime, but in that case you have to specify its URL. You can easily restart your runtime from the "Runtime" menu on the top bar.
  • To run a cell, press Shift+Enter or click on the "Runtime" menu -> "Run the focused cell". After a cell is executed, Colab automatically creates a new one below it; you can also add a new code cell by clicking the "+ Code" button.
  • To add some text, click the "+ Text" button. This lets you add simple comments to your notebook, or formatted text and elements written in Markdown or HTML.
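As a minimal sanity check of your accelerator, assuming TensorFlow (which Colab currently ships preinstalled), you can list the GPUs the runtime can see:
import tensorflow as tf
# An empty list means no GPU accelerator is attached to this runtime
print(tf.config.list_physical_devices('GPU'))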

An example: Car price prediction

Now I'm going to show you an example of a machine learning algorithm running on Google Colab. To keep it simple, I'm going to use linear regression, an algorithm that predicts a continuous value as a linear combination of the input features. In this example, we will check whether there is a linear relation between car features and their price. For this purpose, I used the CarPricePrediction dataset; I cleaned it a bit, removing some useless columns and leaving others just for data understanding. You can find this dataset and the notebook I made on my GitHub. These are the steps to run your first machine learning algorithm on Google Colab:

  • First of all, we need to get our dataset into the platform, and to do this I suggest using Google Drive. You can easily mount your Drive space by clicking on the drive icon in the "Files" panel on the left (or from code, as shown below). After confirming, you will find your Drive folder at "/content/drive/MyDrive".
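If you prefer mounting Drive from code instead of the Files panel, Colab's standard helper does the same thing (the path below is the conventional mount point):
from google.colab import drive
# Opens an authorization prompt and mounts your Drive under /content/drive
drive.mount('/content/drive')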
  • Next, you need to import the libraries for the project: "matplotlib" and "seaborn" for the plots, "numpy" for numerical operations such as the log transform, "pandas" for dataset management, and "sklearn" for the Linear Regression algorithm. As you can see, from "sklearn" we import "train_test_split" to split the data into a training set and a test set, "LinearRegression" for the algorithm itself, "MinMaxScaler" for data normalization, and "r2_score" for the evaluation.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
  • The next step is to load our dataset. In this case, I previously removed some unnecessary columns from the table, but I left a few others, like the cars' names (completely useless for the algorithm but interesting for car connoisseurs). The data is not ready for the linear regression algorithm yet; at this point we only need to look at it and understand it. To do this we use pandas, a library that lets us load, manipulate, and display a dataset.
df = pd.read_csv("drive/MyDrive/Colab Notebooks/datasets/CarPrice_Dataset_cleaned.csv")
df.head()
The first 5 rows of our dataset
  • Now that we have our wonderful dataset, we have to focus on our goal: searching for correlations between the cars' features and their price. As you might have guessed, we first have to identify the columns the algorithm cannot use. Linear Regression, like many other ML algorithms, can only work with numerical features, so to make this dataset usable we have to drop all the others.
numerical_feature = [feature for feature in df.columns if df[feature].dtypes != "O"]
numerical_feature

OUTPUT:
['wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight', 'enginesize', 'boreratio', 'stroke', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg', 'price']
  • At this point, we can visualize the remaining features and their correlation with the price thanks to matplotlib and seaborn. Do not underestimate this step: data visualization is one of the most important parts of data analysis, and when done right it can save you a lot of time.
    As you may notice from the plots, there are clear linear correlations between some features and the price, such as "curbweight" or "enginesize", and there are also features with no correlation with the price, such as "compressionratio" and "stroke". Thanks to data visualization we can remove the useless features from the dataset that is going to become our training set.
for feature in numerical_feature:
    sns.scatterplot(x=df[feature], y=df['price'])
    plt.show()
Plots of car features vs price
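If you prefer a more compact view of the same relationships, a correlation heatmap is a common complement to these scatter plots (an optional extra, not part of the original notebook):
# Pairwise correlations between the numerical features, including price
sns.heatmap(df[numerical_feature].corr(), annot=True, cmap='coolwarm')
plt.show()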
  • At this point, we drop the columns that are not useful for the Linear Regression algorithm and normalize all the remaining numerical features. This step is important because it brings every feature onto a common scale, which makes them directly comparable when we fit a multiple linear regression model. Note that we also apply a log transform to "enginesize" to make its relation with the price more linear.
df = df.drop(['CarName', 'fueltype', 'aspiration', 'carbody', 'drivewheel', 'enginelocation', 'enginetype', 'fuelsystem', 'doornumber', 'cylindernumber'], axis=1)
df = df.drop(['peakrpm', 'compressionratio', 'stroke', 'carheight', 'wheelbase'], axis=1)
df['enginesize'] = np.log(df['enginesize'])
scaler = MinMaxScaler()
scaler.fit(df)
dataset=pd.DataFrame(scaler.transform(df),columns=df.columns)
dataset.head()
The first 5 rows of our normalized dataset
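For reference, MinMaxScaler rescales each column to the [0, 1] range using (x - min) / (max - min). A quick way to convince yourself is to check the minimum and maximum of the scaled columns:
# Every column of the scaled dataset should now span exactly [0, 1]
dataset.describe().loc[['min', 'max']]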
  • Now we have to split the dataset into a training set and a test set. To do this, we first separate the value the algorithm should predict (y) from the values given to the algorithm as input (X). Then we use the "train_test_split" function from sklearn to create the training and test sets.
# split the normalized dataset, not the raw df
X = dataset.drop(['price'], axis=1)
y = dataset['price']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=42)

As you can see, inside the "train_test_split" function I wrote 0.1, which means the function will automatically reserve 10% of the dataset as a test set. The smaller this value, the more data we have for training, but on the other hand the less data we have to verify the effectiveness of the algorithm. Because of this, you always have to adjust this value according to the size of your dataset.
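If you want to double-check the split, you can print the shapes of the resulting sets (just a quick sanity check, not part of the original steps):
# With test_size=0.1, roughly 90% of the rows end up in the training set
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)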

  • Now we have the step that all of you have been waiting for: model training! Let's create the linear regression model and fit it with our training set.
lr=LinearRegression()
lr.fit(X_train,y_train)
OUTPUT:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
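As an optional extra step, once the model is fitted you can peek at what it learned: the coefficients tell you how much each (normalized) feature contributes to the predicted price.
# One coefficient per input feature, plus the intercept
print(dict(zip(X.columns, lr.coef_)))
print(lr.intercept_)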
  • At this point our model is ready and we can make predictions on our test set. This gives us a set of predicted prices that we can compare with the true values in the test set. To measure the accuracy of our model we can use the most popular metric for linear regression, called "R-squared": the closer the value is to 1, the more accurate the model is.
y_predLR = lr.predict(X_test)
r2_score(y_test, y_predLR)

OUTPUT:
0.8453707785571698
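To complement the single R-squared number, a quick scatter plot of predicted versus actual prices (an optional check, not in the original notebook) makes it easy to see where the model struggles:
# Points close to the diagonal correspond to accurate predictions
plt.scatter(y_test, y_predLR)
plt.xlabel('Actual price (normalized)')
plt.ylabel('Predicted price (normalized)')
plt.show()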

This is an extremely simple example of how to run a machine learning algorithm on Google Colab. A task this light could run on any PC or server, but Colab can also handle much heavier workloads, like deep learning algorithms!

If you want to learn more about linear regression please check out my article on Towards AI!

Thanks for reading.

Nunzio Logallo
