Create an Image Data Set from YouTube Using Google Colab

Aditya Kaki
3 min read · Apr 23, 2020

Google Colab

Google Colab is a boon for machine learning and deep learning students. Google offers its users a free cloud GPU for up to 12 hours at a time, which is quite sufficient for student-level projects.

The GPU runtime can be selected from the menu as follows:

Google Colab > Runtime > Change Runtime type > GPU
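
Once the runtime has been switched, one way to confirm that a GPU is actually visible is to query nvidia-smi from Python. This is only a minimal sketch: it assumes the NVIDIA command-line tools are on the PATH, as they normally are on a Colab GPU runtime, and gpu_visible is a hypothetical helper name, not part of Colab.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No NVIDIA tools found, so this is most likely a CPU-only runtime
        return False
    # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: ..."
    result = subprocess.run([exe, "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime active:", gpu_visible())
```

If this prints False, the runtime type is probably still set to CPU.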

YouTube

YouTube needs no introduction: it is one of the best places to find a huge amount of data, in the form of videos of every kind. It can serve as a source for all sorts of image and video data sets.

I have used YouTube videos to create celebrity, sports, and animal data sets for my projects.

How is the data set made?

Google Colab, a little Python, and the OpenCV library are used to create the data set on Google Drive. All computation and data stay in the cloud, so there is no overhead on one's personal PC.

Code:

from google.colab import drive
drive.mount('/content/drive')

The code above mounts Google Drive in Google Colab.

%cd /content/drive/My\ Drive/
!mkdir -p SportsDataset
%cd SportsDataset
!pip install imutils pafy youtube-dl

The code above creates a data set directory and changes into it. Because pafy is not pre-installed on Colab, it needs to be installed first, together with youtube-dl, which pafy relies on. pafy is a Python library that helps in getting a handle to YouTube video streams.

import imutils
import cv2
import pafy
url = 'https://youtu.be/xY3wc-aFEYE'
video = pafy.new(url)
videoStream = video.getbestvideo()
vidcap = cv2.VideoCapture(videoStream.url)

The code above imports the required libraries and gets hold of the YouTube video stream. The variable url holds the YouTube URL of the video we want to convert to frames. The remaining lines get a handle to the pafy video object and pass its best video stream to the OpenCV video capture.

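def getFrame(time, frameCount):
    # Seek to the requested position (in milliseconds) and read one frame
    vidcap.set(cv2.CAP_PROP_POS_MSEC, time)
    frameDet, frame = vidcap.read()
    if frameDet:
        # Resize to a width of 128 pixels, preserving the aspect ratio
        frame = imutils.resize(frame, width=128)
        # Save the frame as <frameCount>.jpg in the current directory
        cv2.imwrite(str(frameCount) + ".jpg", frame)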

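The Python function above takes a timestamp (in milliseconds) and a frame count as inputs, extracts the image at that timestamp from the video, and saves it as a .jpg file named after the frame count in the current working directory, which lies inside the data set directory created earlier.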

!mkdir sport1
%cd sport1
!mkdir vid1
%cd vid1

To save the frames under a specific label, a separate folder is created for each label (sport1) and for each video (vid1).
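
The same label/video folder layout could also be built from Python instead of shell commands. The sketch below is only an illustration: frame_path is a hypothetical helper, and a temporary directory stands in for the data set folder on Google Drive.

```python
import tempfile
from pathlib import Path


def frame_path(root, label, video_id, frame_count):
    """Create <root>/<label>/<video_id>/ if needed and return the frame's file path."""
    folder = Path(root) / label / video_id
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{frame_count}.jpg"


# A temporary directory stands in for the SportsDataset folder here
dataset_root = tempfile.mkdtemp()
print(frame_path(dataset_root, "sport1", "vid1", 7))
```

The returned path can be converted with str() and passed to cv2.imwrite, so frames from a second video only need a different video_id.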

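startTime = 0       # start of the capture window, in milliseconds
frameRate = 500     # interval between captured frames, in milliseconds
frameCount = 1
# video.duration is a string of the form HH:MM:SS
h, m, s = video.duration.split(sep=':')
endTime = (int(h)*3600 + int(m)*60 + int(s)) * 1000
for time in range(startTime, endTime, frameRate):
    frameCount = frameCount + 1
    getFrame(time, frameCount)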

In the code above, startTime is the timestamp at which capturing starts, and frameRate is the interval between captured frames. Both are in milliseconds, so a frameRate of 500 captures two frames per second of video. The loop runs to the end of the video, capturing and saving a frame at each step.
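
To make the arithmetic concrete, here is a small worked sketch of the duration conversion and the timestamps the loop visits. The duration string "00:03:25" is a made-up example, and duration_to_ms and sample_times are hypothetical helper names used only for illustration.

```python
def duration_to_ms(duration: str) -> int:
    """Convert an 'HH:MM:SS' string to milliseconds."""
    h, m, s = (int(part) for part in duration.split(":"))
    return (h * 3600 + m * 60 + s) * 1000


def sample_times(start_ms: int, end_ms: int, step_ms: int) -> list:
    """Timestamps (in ms) at which a frame would be captured."""
    return list(range(start_ms, end_ms, step_ms))


end_ms = duration_to_ms("00:03:25")    # hypothetical 3 min 25 s video
times = sample_times(0, end_ms, 500)   # one frame every 500 ms

print(end_ms)       # 205000
print(len(times))   # 410
print(times[:3])    # [0, 500, 1000]
```

So a video of 3 minutes 25 seconds sampled every 500 ms yields at most 410 frames.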

Depending on the needs of the project, one can adjust the startTime, endTime, and frameRate variables, for example to extract frames only from a certain time interval.
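
As a sketch of how such a window might be chosen, the helper below turns a start and end time in seconds and a desired number of frames per second into the three millisecond values the loop expects. window_params is a hypothetical name, and the 30 s to 90 s window is just an example.

```python
def window_params(start_s, end_s, frames_per_second):
    """Return (startTime, endTime, frameRate) in milliseconds for a time window."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    start_ms = int(start_s * 1000)
    end_ms = int(end_s * 1000)
    # Interval between frames, in ms; never less than 1 ms
    step_ms = max(1, round(1000 / frames_per_second))
    return start_ms, end_ms, step_ms


# Capture 4 frames per second between 0:30 and 1:30 of the video
startTime, endTime, frameRate = window_params(30, 90, 4)
print(startTime, endTime, frameRate)               # 30000 90000 250
print(len(range(startTime, endTime, frameRate)))   # 240
```

The end time should not exceed the video's duration; timestamps past the end simply fail to return a frame.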

As part of the getFrame function, some image preprocessing is done: each image is resized to a width of 128 pixels with its aspect ratio preserved. This can be changed to suit the project's needs.
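
To see what preserving the aspect ratio means for the output size, the following sketch computes the dimensions of a resized frame. It illustrates the arithmetic only and is not imutils' own code; resized_shape is a hypothetical helper, and the input resolutions are examples.

```python
def resized_shape(width, height, target_width=128):
    """Return (new_width, new_height) after scaling to target_width."""
    # Scale the height by the same factor as the width
    new_height = round(height * target_width / width)
    return target_width, new_height


print(resized_shape(1920, 1080))   # (128, 72)
print(resized_shape(640, 480))     # (128, 96)
```

Videos with different aspect ratios therefore produce images of different heights; if a model needs a fixed input size, an extra crop or pad step would be required.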

I have attached a Google Drive link below to a Jupyter notebook and to the sports data set I created from several videos.

https://drive.google.com/open?id=1huTV8Oik2-cvtMXDSkyUvw7q95YCfycn

Happy learning :)
