The 4 Problems with Training Models in Colab

Ritesh Kanjee
Published in Augmented AI
7 min read · Aug 24, 2021

There are four problems that you will encounter while training a Computer Vision Model in Google Colab. Now, these problems apply not just to YOLOR but also to YOLOX, YOLOv4, YOLOv5, YOLOv279!!, etc.

In this article, I’m not only going to delve into what these problems are but also give you a solution for overcoming each of them.

The First Problem with training in Colab is that if you are training large models like YOLOR, YOLOX, YOLOv5, etc., your run-time will, with absolute certainty, disconnect during training. This is super frustrating. Imagine that you are training a large model that requires a couple of thousand epochs; let’s say you are aiming for 3000 epochs. You spend the weekend just getting halfway when all of a sudden Google decides that it wants to disconnect your run-time.

When this happened to me I just looked blankly at my screen and cried.

Now, the reason why it disconnects is that Google has a policy that notebooks on the Free plan have an idle timeout of 90 minutes and an absolute timeout of 12 hours.

This means that if a user does not interact with a Google Colab notebook for more than 90 minutes, its instance is automatically terminated. On top of that, a Colab instance has a maximum lifetime of 12 hours. This sort of makes sense: they want to fully utilize their GPUs and TPUs and ensure that they are always in use and not idling. But it does not help you and me if we want to train our models overnight or over many days.

After much searching, I came across many creative solutions on Stack Overflow that make it appear as if there is activity on the notebook, using some clever JS scripts to simulate user activity.

There is also a Chrome extension called Colab Alive which does something similar. While this is great for overcoming the idle timeout of 90 minutes, it still does not conquer the 12-hour absolute limit that they impose.

So after much trial and error, and many weeks later, I settled upon the solution of just paying for Colab PRO. Now, if you are hesitant to pay for anything, just hear me out, because I was in the same boat. I mean, why should I pay for using a free cloud GPU, right? Well, it actually makes a lot of sense to go PRO, and no, I’m not affiliated with any Google products. The main reason, especially if you are using this for commercial purposes, is that with PRO you get a faster GPU, longer and more reliable runtimes, and more RAM.

Looking at their pricing options, I went for the $9.99 per month plan, which you can cancel anytime. Relatively speaking, the PRO+ plan is just way too expensive for me.

But I mean, if you just want to train a model for the month, it is still way cheaper than buying a brand new RTX 3090 Ti. Ya.. and we all know how scarce those things are.

Okay, so the Second Problem that you will encounter is this: sure, PRO and PRO+ give you longer runtimes, but there is still a risk of your run-time disconnecting. Now tell me, do you want to risk your run-time disconnecting when you are at 2999 epochs of 3000... [cough]

…Ya, I thought so. So in the Colab Notebook, you are given some disk space in which you can train your models.

This is in the region of 100GB, which is more than enough for training models like YOLOR. However, when our run-time disconnects, all files in the run-time get deleted. This leads us to the second solution: saving our files to Google Drive.

So during training, we will clone our GitHub repository and save our training checkpoints to Google Drive. And if, God forbid, your run-time crashes, then you can just pick up from the latest saved weight checkpoint.
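To make the "pick up from the latest checkpoint" idea concrete, here is a minimal sketch of how you might find the newest checkpoint on Drive after a disconnect. The folder name `yolor_checkpoints` and the `epoch_<N>.pt` file naming are assumptions for illustration, not what any particular YOLOR training script writes, so adjust them to match your own setup.

```python
import re
from pathlib import Path

# Inside a Colab runtime you would mount Drive first. The google.colab
# package only exists in Colab, so it is left as a comment here:
#   from google.colab import drive
#   drive.mount("/content/drive")

# Hypothetical folder and file naming, chosen only for this sketch.
CKPT_DIR = Path("/content/drive/MyDrive/yolor_checkpoints")
EPOCH_PATTERN = re.compile(r"epoch_(\d+)\.pt$")


def latest_checkpoint(ckpt_dir):
    """Return the checkpoint path with the highest epoch number, or None."""
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        return None
    best = None  # (epoch, path) of the newest checkpoint seen so far
    for path in ckpt_dir.iterdir():
        match = EPOCH_PATTERN.search(path.name)
        if match is None:
            continue  # skip files that are not checkpoints
        epoch = int(match.group(1))
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return best[1] if best else None


ckpt = latest_checkpoint(CKPT_DIR)
if ckpt is None:
    print("No checkpoint found, starting from scratch")
else:
    print(f"Resuming from {ckpt}")
```

The epoch number is parsed as an integer on purpose: sorting file names as plain strings would put `epoch_9.pt` after `epoch_10.pt`.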

This sounds really sweet until you realize that you will quickly run out of Google Drive space, which is limited to 15GB on a brand new account. This is Problem number 3.

When I ran YOLOR training, it easily went over that threshold. Now, you can save weight checkpoints less often, but it will mean that if your run-time disconnects, you will have larger gaps to cover in terms of your training progress. The alternative would be to delete older weights, but you won’t really know if maybe epoch 50 has a lower loss than epoch 5000.
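If your training run does log a loss for each saved checkpoint (an assumption; it depends on your training script), one middle ground is to keep the best checkpoint plus the few most recent ones and delete the rest. The sketch below only decides which epochs are safe to delete; the function name and the `keep_last` parameter are made up for illustration.

```python
def checkpoints_to_delete(losses, keep_last=3):
    """Pick checkpoints that can be removed.

    losses: dict mapping epoch number -> loss recorded for that checkpoint.
    Keeps the lowest-loss epoch and the `keep_last` most recent epochs.
    Returns a sorted list of epochs whose files can be deleted.
    """
    if not losses:
        return []
    epochs = sorted(losses)
    # Most recent checkpoints, needed to resume after a disconnect.
    keep = set(epochs[-keep_last:]) if keep_last > 0 else set()
    # Best checkpoint so far, so a good early epoch is never lost.
    keep.add(min(epochs, key=lambda e: losses[e]))
    return [e for e in epochs if e not in keep]


# Example with made-up losses: epoch 20 is the best, 40 and 50 are the newest.
example = {10: 0.9, 20: 0.4, 30: 0.6, 40: 0.7, 50: 0.8}
print(checkpoints_to_delete(example, keep_last=2))  # [10, 30]
```

This way an early epoch with a lower loss survives even after thousands of later epochs, while Drive usage stays roughly constant.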

Now, the way I solved this was to make another purchase, and I swear this is the last purchase that you will have to make: a Google One subscription, which gives you 100 GB of space that you can use as you please.

From my tests, I have found that in training my YOLOR model, I only reached around 35 GB of space for around 300 epochs on my playing cards dataset. From the demo below, you can see it worked insanely well!
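As a rough back-of-the-envelope check, you can turn those numbers into a save interval for a longer run. The sketch below assumes that the 35 GB came from saving one checkpoint per epoch, all roughly the same size, which the figures above do not confirm, so treat the result as an estimate only. The function name is made up for illustration.

```python
import math


def save_interval_for_quota(gb_per_checkpoint, total_epochs, quota_gb):
    """Smallest save interval (in epochs) so all checkpoints fit in quota_gb."""
    max_checkpoints = math.floor(quota_gb / gb_per_checkpoint)
    if max_checkpoints < 1:
        raise ValueError("quota is smaller than a single checkpoint")
    return math.ceil(total_epochs / max_checkpoints)


# Assumption: 35 GB over 300 epochs means about 0.117 GB per checkpoint.
gb_per_checkpoint = 35 / 300

# A 3000-epoch run on the 100 GB plan: save every 4th epoch.
print(save_interval_for_quota(gb_per_checkpoint, 3000, 100))  # 4

# The same run on the free 15 GB tier: save every 24th epoch.
print(save_interval_for_quota(gb_per_checkpoint, 3000, 15))  # 24
```

Under these assumptions, the free tier would force you to lose up to 24 epochs of progress on a disconnect, compared with 4 on the 100 GB plan.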

Great! So now we have all of the building blocks that we need to train a Production-Ready Computer Vision Architecture like YOLOR. Now, how do we go about training it?

Great question.

So right down below you will find a link to my FREE YOLOR Course. You just need to sign up, and then you can instantly start learning how to run YOLOR Object Detection in Google Colab. After you are signed up, you will receive your course via email. However, to gain access to the full modified training Colab which will enable you to resume training on YOLOR you can enroll in YOLOR PRO.

In this comprehensive course, we will go deep into the training methodology of collecting, annotating, augmenting, and deploying your YOLOR models for custom applications.

I’ll also show how you would use YOLOR with DeepSORT for Robust Multi-Object Tracking.

In later modules, you will learn how to combine YOLOR with Streamlit user interfaces to build beautiful web apps.

And finally, we will be building 18+ real-world projects that include the code, datasets, and models that you can make your own.

YOLOR PRO is currently available at early-bird discounts so ensure that you enroll soon before prices go up.

If you would like to try out YOLOR natively, then check out this video right here.

Enroll HERE for YOLOR PRO Course: https://bit.ly/YOLORPROCourse
