Web Scraping Using Python and BeautifulSoup!

Shreyans Jain
Jul 1, 2019 · 6 min read

In this short post I’ll be showing you how to get started with scraping web pages using Python and BeautifulSoup. If you are still unfamiliar with Python, you should definitely check out my other post, where I show how robust and flexible Python can be depending on the task at hand.

Things we’ll be needing:

Getting BeautifulSoup4 for Python 3.x
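
If you don’t already have BeautifulSoup4, it can be installed with pip (shown here for a Python 3 setup; the exact command may vary with your environment):

    pip install beautifulsoup4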

What we’ll be scraping…

We’ll be scraping the popular image hosting platform Imgur. We will write a Python script that takes a search string as input, searches for that query on www.imgur.com and downloads the photos relevant to the search query into a new folder.

So let’s get started…

urlopen() allows us to access a webpage and read its HTML

urlretrieve() will be used to retrieve (download) the images that we find

bs4 (BeautifulSoup4) is imported as bs

os module is needed to make new folders and get directory paths (as we’ll see)
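
Putting those together, the imports look something like this (a minimal sketch; the module names below match the descriptions above):

    from urllib.request import urlopen, urlretrieve  # access web pages and download files
    import bs4 as bs                                 # BeautifulSoup4, imported as bs
    import os                                        # folder creation and path handling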

Getting the Search String and preparing the Folder

In the first line we take the search string as input (for example, the user could enter Tokyo if he/she wants pictures of Tokyo).

In the second line we derive a folder name from the search string itself (for example, if the search string was “Tokyo City”, our folder name would be “Tokyo_City”). This is the folder where the images will be downloaded.

In the third line we store the current path in the variable currPath using the os module (this is the path where our Python script is saved, for example Desktop/imgurScraper).

In the fourth line we build the path to the folder where we will be saving the images (so if our currPath is “Desktop/imgurScraper” and folder is “Tokyo_City”, then reqPath is “Desktop/imgurScraper/Tokyo_City”). Again we use the os module for this.

Finally, in the fifth and sixth lines we check, using the os module, whether this folder already exists. If it does not, we create a new folder with this name.
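
Here is a sketch of those six lines (the names searchString, folder, currPath and reqPath are the ones used in this walkthrough; os.getcwd() assumes the script is run from its own folder):

    searchString = input('Enter the search string: ')  # e.g. "Tokyo City"
    folder = searchString.replace(' ', '_')            # e.g. "Tokyo_City"
    currPath = os.getcwd()                             # e.g. Desktop/imgurScraper
    reqPath = os.path.join(currPath, folder)           # e.g. Desktop/imgurScraper/Tokyo_City
    if not os.path.exists(reqPath):                    # fifth line: does the folder exist?
        os.makedirs(reqPath)                           # sixth line: if not, create it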

Finding the Search Query from the Search String

Let’s say we want to search for pictures of Tokyo City on Imgur. We can do this in two ways:

  1. We can go to www.imgur.com, type “Tokyo City” into the search bar and press ENTER.
  2. We can directly open the url www.imgur.com/search?q=Tokyo+City

The fact that Imgur has a fixed format for converting search strings to search queries (replacing spaces by + signs) is very helpful in our attempt to scrape this website.

Google, by contrast, generates its result pages dynamically and obfuscates their markup, so Google cannot be scraped effectively using bs4 alone.

So now that we have the user’s search string, we can build the link where we will find our images.
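
In code, that conversion is just a matter of replacing spaces with + signs (a sketch, reusing searchString from above):

    query = searchString.replace(' ', '+')           # "Tokyo City" -> "Tokyo+City"
    url = 'https://www.imgur.com/search?q=' + query  # the results page we will scrape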

Creating the Soup…
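
The results page is fetched with urlopen() and parsed into a soup object. A minimal sketch, assuming Python’s built-in 'html.parser' (the original may have used a different parser):

    page = urlopen(url)                           # fetch the results page
    soup = bs.BeautifulSoup(page, 'html.parser')  # parse its HTML into a soup object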

A Small Observation…

  1. Search for “Tokyo City”
  2. On the results page copy the image address of the first image
  3. Paste this image address in notepad
  4. Click this image
  5. On the newly loaded page you will find the same enlarged image
  6. Copy the image address of this enlarged image and compare it with the previous address that you pasted in notepad
  7. You’ll find that they differ by just an extra ‘b’ before the extension name in the previously pasted address.

So why did we do all this? Well, this means that if we have the address of the smaller image (which we have for every image on the results page), then we can find the address of the larger image by removing the extra ‘b’ before the extension name. For this we can write the following method:
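
A sketch of that method (alterSrc is the name used later in this post; it assumes the extra ‘b’ sits immediately before the last dot in the address):

    def alterSrc(src):
        # '//i.imgur.com/abcdb.jpg' -> '//i.imgur.com/abcd.jpg'
        dot = src.rfind('.')              # position of the extension's dot
        return src[:dot - 1] + src[dot:]  # drop the extra 'b' just before it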

So now our task is to find the address of every candidate image on the results page. By saying “candidate image” I am trying to differentiate between the images that are actually to be downloaded and other images that are present on the page like the company logo etc.

Getting the address of Every Candidate Image

If we inspect the results page and hover over one of the candidate images, we can view the corresponding HTML in the inspect window. To inspect the page, right click and choose Inspect. Then click on the icon at the top left of the inspect window (the one with a mouse pointer over a rectangular screen).

Now click on the first image. The html code corresponding to that image will get highlighted in the inspect window.

You can see that the “img” tag does not have an id, name or class attribute. However, it is wrapped in an “a” tag whose class is “image-list-link”. This is true for every candidate image and false for every other image on the page, which you can verify. So now we know that our candidate images will always be wrapped in an “a” tag whose class attribute’s value is “image-list-link”.

We have to store the “src” attributes of all these images, remove the extra ‘b’ before the extension name, and prepend “https:” to convert them into valid, accessible links for the urlretrieve() method (which we will use to fetch and save the images).
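
A minimal sketch of that step, reusing the soup object and the alterSrc() method from above:

    atags = soup.findAll('a', {'class': 'image-list-link'})      # wrappers around candidate images
    imgtags = [a.find('img') for a in atags]                     # the img tag inside each wrapper
    srcs = ['https:' + alterSrc(img['src']) for img in imgtags]  # links to the enlarged images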

Here atags is a list which stores all the “a” tags having class attribute set to “image-list-link”. For creating this list we use the findAll() method on our previously created soup object.

imgtags is a list which stores all the “img” tags that were within all the “a” tags stored in atags. For creating this list we use the find() method on the “a” tag objects stored in atags.

Finally srcs is a list which stores the addresses of all the images to be downloaded. To create this list we iterate over each “img” tag object stored in imgtags. We access the “src” attribute of this “img” tag by using img['src']. Then we remove the extra ‘b’ by using the alterSrc() method that we created previously. This method returns the address of the enlarged image which is then concatenated with “https:” to get the final link of the enlarged image that we will feed to the urlretrieve() method.

Saving the Images by using their links
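
Below is a sketch of that loop (it matches the explanation that follows):

    for l in srcs:
        filename = os.path.join(reqPath, os.path.basename(l))  # e.g. Tokyo_City/abcd.jpg
        urlretrieve(l, filename)                               # fetch the image and save it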

In the above code we are iterating over each link in the srcs list as ‘l’.

The urlretrieve() method takes two arguments: first, the link from which the item is to be fetched, and second, the filename, which also serves as the path where it will be stored.

Suppose our image address is “i.imgur.com/abcd.jpg” and our folder name (the folder where the images are to be saved) is “Tokyo_City”. Then our filename should be “Desktop/imgurScraper/Tokyo_City/abcd.jpg”.

Remember that “Desktop/imgurScraper/Tokyo_City” is stored in reqPath, and os.path.basename(l) would return “abcd.jpg” here. We can then use os.path.join() to get “Desktop/imgurScraper/Tokyo_City/abcd.jpg”.

Finally the link along with the evaluated filename is passed to the urlretrieve() method which saves the image at that link with the passed filename. This process is repeated for every ‘l’ in srcs.

Here’s the script in action. I tried downloading some pictures of the Lamborghini Aventador using it.

In the end I would like to say just one thing. As most of you must have noticed, scraping Imgur for images was only 20% about coding and 80% about observing how they store images in their backend. With that said, I would like to point out that this post is intended only for ethical purposes.

Hope you guys learned something new today… Please leave your feedback in the comments section.

Happy Scraping! ✌️
