Dev steps to Google Colab

Manish Verma
5 min read · Jan 30, 2018


I will start by thanking Google, without which this would not be possible. Google has certainly eased the path into machine learning for a motivated newbie. We all know the pain of staring at our screen (htop) when we do not have a powerful GPU to test our machine learning intuition. Well, not a hundred percent, but Google has at least given us some relief, and we can all test our intuition on Colab notebooks.

In this short article I intend to help other machine learning enthusiasts set up Colab notebooks and start executing their ideas. I will follow the workflow of creating and testing a model, which I have divided into multiple steps.

First we need to access our data from the Colab notebook. There are several ways to do that: one can read data from Google Drive, upload/download it to the server directly, or read it from Google Cloud Storage. I assume that our input data (train, test, etc.) are uploaded to a folder in Google Drive; a quick sketch of the direct-upload alternative is shown below.
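
For small files you can also skip Drive entirely and upload them straight into the Colab VM. This is only a minimal sketch of that alternative, using the google.colab.files helper; the file names come from whatever you pick in the upload dialog.

# Alternative: upload files directly from your machine into the Colab VM.
# files.upload() opens a file picker and returns a dict of {filename: bytes}.
# Suitable only for small files.
from google.colab import files

uploaded = files.upload()
for name, data in uploaded.items():
    with open(name, "wb") as f:  # persist to the VM's local disk
        f.write(data)
    print("saved %s (%d bytes)" % (name, len(data)))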

Now we will start a new Colab notebook and change the runtime to GPU. For this, go to the Runtime menu, then Change runtime type; there you can select your Python version and set the hardware accelerator to GPU. The notebook will take a moment and restart with these changes.

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

Running the above code will show an added device of type “GPU” with the name “/device:GPU:0”. As of now the GPU is a Tesla K80, which is a boon for those who do not have one.
If you do not see any GPU attached to your machine, try again after some time; this happens because GPUs are in high demand. A shorter one-line check is sketched below.
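
As an aside, here is a shorter check that should work in the same environment, assuming TensorFlow is installed (as it is in Colab):

import tensorflow as tf

# tf.test.gpu_device_name() returns '/device:GPU:0' when a GPU is attached,
# or an empty string when it is not.
gpu_name = tf.test.gpu_device_name()
if gpu_name:
    print("GPU found: %s" % gpu_name)
else:
    print("No GPU attached, try again later")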

Now we need to add some code to read files from Drive. We will add the dependencies first.

from google.colab import auth
from googleapiclient.discovery import build
import io, requests, os
import sys

auth.authenticate_user()
drive_service = build('drive', 'v3')

auth.authenticate_user() will print a link and ask for a token. Clicking the link and authenticating your account will give you a token, which needs to be entered in the box. This authentication process gives the notebook permission to access your files in Google Drive.

Now, using googleapiclient, we need to search for the folder where your input data is kept. This function will do the job.

def get_parent_folder(folder_name):
    page_token = None
    folder_array = []
    query = "name='%s' and mimeType='application/vnd.google-apps.folder'" % folder_name
    while True:
        response = drive_service.files().list(q=query,
                                              spaces='drive',
                                              fields='nextPageToken, files(id, name)',
                                              pageToken=page_token).execute()
        for file in response.get('files', []):
            # Collect the name and id of every folder matching folder_name.
            folder_array.append({"name": file.get('name'), "id": file.get('id')})
        page_token = response.get('nextPageToken', None)
        if page_token is None:
            break
    return folder_array

Now we need to get metadata for all the input files in the input folder. This code block returns a dictionary mapping the name of each file in the parent folder to its id; parent_id is the id of the parent folder.

def get_files_from_parent(parent_id):
    page_token = None
    folder_array = dict()
    query = "'%s' in parents" % parent_id
    while True:
        response = drive_service.files().list(q=query,
                                              spaces='drive',
                                              fields='nextPageToken, files(id, name)',
                                              pageToken=page_token).execute()
        for file in response.get('files', []):
            # Map each file name in the folder to its Drive id.
            folder_array.update({file.get('name'): file.get('id')})
        page_token = response.get('nextPageToken', None)
        if page_token is None:
            break
    return folder_array

This function will download a file from Drive and return its buffer.

def get_file_buffer(file_id, verbose=0):
    from googleapiclient.http import MediaIoBaseDownload
    request = drive_service.files().get_media(fileId=file_id)
    downloaded = io.BytesIO()
    downloader = MediaIoBaseDownload(downloaded, request)
    done = False
    while done is False:
        # next_chunk() returns a progress object we can use for reporting.
        progress, done = downloader.next_chunk()
        if verbose:
            sys.stdout.flush()
            sys.stdout.write('\r')
            percentage_done = progress.resumable_progress * 100 / progress.total_size
            sys.stdout.write("[%-100s] %d%%" % ('=' * int(percentage_done), int(percentage_done)))
    downloaded.seek(0)
    return downloaded

Now we need to save the file buffers to the local disk. The snippet below takes the name of the input folder and, with the help of the functions above, downloads each file to the local disk under /content/datalab/. Running ls -ltrh /content/datalab/ verifies whether all the input files were downloaded (a notebook cell for this check is shown after the snippet).

SOURCE_FOLDER = '/content/datalab/'
parent_folder = get_parent_folder('INPUT_FOLDER')
print(parent_folder)

input_file_meta = get_files_from_parent(parent_folder[0]["id"])
print(input_file_meta)
for file, id in input_file_meta.items():
    downloaded = get_file_buffer(id, verbose=1)
    dest_file = os.path.join(SOURCE_FOLDER, file)
    print("processing %s data" % file)
    with open(dest_file, "wb") as out:
        out.write(downloaded.read())
    print("Done %s" % dest_file)

Now we can access our input files locally and build our dataframes. Here I am using data from a Kaggle competition on toxic comment classification and trying to solve its problem statement.
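
The path constants used in the snippets that follow (TRAIN_DATA_FILE, TEST_DATA_FILE, EMBEDDING_FILE, SUBMISSION_SAMPLE_FILE) are not defined elsewhere, so here is one possible setup. The file names are assumptions (the standard Kaggle competition files plus a GloVe embedding file); rename them to match whatever you placed in your Drive input folder. SOURCE_FOLDER and os come from the earlier snippets.

# Assumed file names -- adjust to the files in your own input folder.
TRAIN_DATA_FILE = os.path.join(SOURCE_FOLDER, 'train.csv')
TEST_DATA_FILE = os.path.join(SOURCE_FOLDER, 'test.csv')
EMBEDDING_FILE = os.path.join(SOURCE_FOLDER, 'glove.6B.50d.txt')
SUBMISSION_SAMPLE_FILE = os.path.join(SOURCE_FOLDER, 'sample_submission.csv')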

We will now import our machine learning dependencies.

import sys, os, re, csv, codecs, numpy as np, pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

Now it is time to prepare our data for model input. This block of code does exactly that.

embed_size = 50       # size of each word vector
max_features = 20000  # number of unique words to keep
maxlen = 100          # maximum number of tokens per comment

train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)
list_sentences_train = train["comment_text"].fillna("_na_").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("_na_").values

# Tokenize the comments and pad them to a fixed length.
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)

# Load the pre-trained word embeddings into a dictionary.
def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.strip().split()) for o in open(EMBEDDING_FILE))
all_embs = np.stack(embeddings_index.values())
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# Build the embedding matrix; words without a pre-trained vector keep a
# random initialization drawn from the same distribution.
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

Next, we create our model.

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary() will give this output.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 100)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 50)           1000000
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 100)          40400
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)               0
_________________________________________________________________
dense_1 (Dense)              (None, 50)                5050
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 306
=================================================================
Total params: 1,045,756
Trainable params: 1,045,756
Non-trainable params: 0

Now we will fit our model for a single epoch, save it, and make predictions on the test data.

model.fit(X_t, y, batch_size=1024, epochs=1)
model.save(os.path.join(SOURCE_FOLDER,'lstm.h5'))
y_test = model.predict([X_te], batch_size=1024, verbose=1)
sample_submission = pd.read_csv(SUBMISSION_SAMPLE_FILE)
my_submission = os.path.join(SOURCE_FOLDER, "my_submission.csv")
sample_submission[list_classes] = y_test
sample_submission.to_csv(my_submission, index=False)

Now we need to download our submission CSV file so it can be uploaded to Kaggle for evaluation.

from google.colab import files
files.download(my_submission)

So that’s it. This way, any machine learning enthusiast can train an LSTM model on data stored in Google Drive.

The entire code can be found in this GitHub repository as a Python notebook.

Thanks.
