Creating a Dataset: Scraping Marathon Images Using Selenium and BeautifulSoup
Creating a dataset of marathon images to use for bib recognition (Part 1)
This post continues my earlier post on the Marathon Bib Recognition project. If you haven’t read the introductory blog, please go over it first to get better context.
Where to get the dataset
Okay, I read the intro blog. It sounds great! But, how do you get started? You need data, right?
Right. To try any of the bib recognition methods described in the intro blog, we need data: marathon images, and many of them.
One obvious way was to go to a running event myself and take photographs (at least 1000, I planned) using my DSLR and phone camera. But on none of the Sundays could I get myself to wake up at 4 am, ride my bike for an hour in the cold Bangalore morning and reach a running event to cover it. On one of the days that I did, it was to run myself.
The other way, I figured, was to get the images from the race organizers. I contacted the team of the Kaveri Trail Marathon (KTM), a beautiful run I had participated in recently. The KTM team was happy to help and shared with me the contact details of MyRace, their tech partner for the race. I tried many times to reach the concerned person at MyRace to discuss this project and ask for help, but in vain. That’s when I thought of trying to scrape the images from their website. Jugaad! It did take a while to figure out the right hacks to do so, but I got it working eventually.
Exploring the website to scrape
Nice, but how do you figure out how to scrape the MyRace website for the images? I mean, a website is a big jungle of HTML/CSS/JS. How do you navigate through it, identify the relevant parts and extract the images?
With patience and persistence, my friend!
To get started, there are a few prerequisites:
- An understanding of the basics of HTML and how a website is structured
- Experience with BeautifulSoup, the Python library for web scraping
- Familiarity with Selenium WebDriver and its Python API
- A bit of smartness will help as well ;-)
With the prerequisites covered, I started to explore the website. The end goal of this exploration is to find the link to the image on the image display page. An <img> tag whose src attribute has a value ending in an image extension such as .jpeg should give you a clue. This is where the actual image resides and can be downloaded from. I navigated to one of the image display pages: MyRace -> Event Photos -> Cult10K -> image (https://www.myracephotos.in/Event-Photos/CULT-10K/i-czZjhrK). On this page (in the Chrome browser), right-click and choose ‘Inspect’ (or press CTRL+SHIFT+I) to inspect the elements of the page. This will open up an ‘Elements’ panel (either on the side or at the bottom) in the browser with the code for the page. Again, right-click -> Inspect on the image and it will highlight the code corresponding to that image in the Elements panel. This highlighted code will be the image tag with the link to the source of the image.
<img class="sm-lightbox-image" src="https://photos.smugmug.com/Event-Photos/CULT-10K/i-czZjhrK/2/80c2df4c/X3/IMG_0812-X3.jpg" alt="" style="height: 867px; width: 1300px; position: absolute;" id="yui_3_8_0_1_1550650077711_799">
Copy the link to the image and paste it in a new browser window, and it will display the image. Use the wget command to download the image from the terminal and verify that the link works.
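The same check can also be done from Python. The snippet below is a minimal sketch using only the standard library, not the code from the repository; the helper names filename_from_url and download_image, the images folder and the User-Agent header are my own assumptions for illustration.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of the URL, to use as the file name."""
    return os.path.basename(urlparse(url).path)


def download_image(url, dest_dir="."):
    """Download one image and return the path it was saved to."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, filename_from_url(url))
    # Assumption: a browser-like User-Agent is set as a precaution; it is
    # not known whether the server actually requires one.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        with open(dest_path, "wb") as out:
            out.write(response.read())
    return dest_path


# The image source link found in the Elements panel above.
IMAGE_URL = ("https://photos.smugmug.com/Event-Photos/CULT-10K/"
             "i-czZjhrK/2/80c2df4c/X3/IMG_0812-X3.jpg")

print(filename_from_url(IMAGE_URL))  # IMG_0812-X3.jpg
# download_image(IMAGE_URL, dest_dir="images")  # uncomment to fetch the file
```

The download call is left commented out so that the snippet can be run without touching the network.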
Great! So, you just need to repeat this for every image?
Yes, but not manually. How do you do this in a more automated way? (There are about 5000 images for each race.)
Hmm… You can go to the previous page (the album page), which has thumbnails of all the images with links to the respective display pages. Using BeautifulSoup, you should be able to parse the HTML source code of the page, identify the specific tag that contains the link to each image display page, extract the links, and add them to a list. Then, iterating over that list of links, parse the source code of each image display page, find the link to the image source and download the image. In practice, though, this approach runs into two problems:
- If you observe the album page (https://www.myracephotos.in/Event-Photos/CULT-10K/), thumbnails are populated only as far as you scroll down the page. When the page is parsed with BeautifulSoup, none of the source code for these thumbnails is present.
- Similarly, when the source code of the image display page is parsed with BeautifulSoup, the image source link that we found manually earlier is also missing.
I identified these problems by parsing these webpages with BeautifulSoup, carefully reading through the output and comparing it with what I could see in the Elements panel in the browser.
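To make the first problem concrete, here is a minimal sketch of the link extraction with BeautifulSoup. It is not the code from the repository: the function name extract_display_links and both HTML snippets are hypothetical, and the rule that display pages end in a path segment starting with "i-" is an assumption based only on the one display-page URL shown earlier.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

ALBUM_URL = "https://www.myracephotos.in/Event-Photos/CULT-10K/"


def extract_display_links(html, base_url=ALBUM_URL):
    """Collect absolute links to image display pages from album-page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        # Assumption: display pages have a last path segment starting with "i-".
        if href.rstrip("/").split("/")[-1].startswith("i-"):
            full_url = urljoin(base_url, href)
            if full_url not in links:  # skip duplicates, keep page order
                links.append(full_url)
    return links


# Hypothetical markup, as it might look once the browser has rendered thumbnails.
rendered_html = """
<div class="album">
  <a href="/Event-Photos/CULT-10K/i-czZjhrK"><img src="thumb.jpg"></a>
  <a href="/Event-Photos/CULT-10K/i-czZjhrK">Same photo, linked twice</a>
  <a href="/Event-Photos">Back to events</a>
</div>
"""

# Hypothetical markup for the same page fetched without running JavaScript.
static_html = '<div class="album"></div>'

print(extract_display_links(rendered_html))
print(extract_display_links(static_html))  # empty: no thumbnails to parse
```

The parsing logic itself is fine; the issue is only that the HTML handed to BeautifulSoup does not yet contain the thumbnails.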
The solution is in two parts:
- Use the Selenium WebDriver Python API to emulate a human browsing the website in a web browser (Chrome) and scroll down to the end of the album page. Once all the thumbnails are loaded on the page, we will parse the source code, identify the specific HTML tags with the links to the image display pages and save them in a CSV file.
- Load the CSV file and iterate over the links to the image display pages. Parse the source code of each page, which will be incomplete. Here, I discovered that although the exact link is not available, one of the tags that is available has a link which, with a minor modification, forms the actual link to the image source.
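As a rough preview of the first part, the scrolling and saving steps could look something like the sketch below. This is an assumption-laden outline, not the code from the repository: the function names, the 2-second pause, the cap of 200 scroll rounds and the display_page_url column name are all made up for illustration, and webdriver.Chrome() assumes a working Chrome and driver setup on your machine.

```python
import csv
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=200):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded thumbnails time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so this is the end
        last_height = new_height
    return max_rounds


def save_links(links, csv_path):
    """Write one display-page link per row, under a header row."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["display_page_url"])
        writer.writerows([link] for link in links)


def fetch_rendered_album(url):
    """Open the album in Chrome, scroll to the end, return the rendered HTML."""
    # Imported here so the two helpers above can be used without Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        scroll_to_bottom(driver)
        return driver.page_source
    finally:
        driver.quit()
```

The HTML returned by fetch_rendered_album can then be handed to BeautifulSoup, and the extracted links passed to save_links. Comparing page heights is only one way to detect the end of the page; if thumbnails load slowly, the pause may need to be longer.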
Because this post is already getting too long, I’ll cover the two-part solution in two separate posts:
Check out the GitHub repository, here.