In this short post I’ll show you how to get started with scraping web pages using Python and BeautifulSoup. If you are still unfamiliar with Python, you should definitely check out my other post, where I show how Python as a programming language can be highly robust and flexible depending on the task at hand.
Things we’ll be needing:
Python 3.x and BeautifulSoup4 (Python module)
Getting BeautifulSoup4 for Python 3.x
Run the following command in your terminal.
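The bs4 package is published on PyPI under the name beautifulsoup4, so the install command should look like this (use pip3 if your default pip points at Python 2):

```shell
pip install beautifulsoup4
```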
What we’ll be scraping…
Ok so first of all, not every website can be scraped using bs4. For example, Google Images does not lend itself to scraping with bs4. Often you can only find out whether a website can be scraped with bs4 by trying and failing.
That said, we’ll be scraping the popular image hosting platform Imgur. We will write a Python script that takes a search string as input from the user, searches for this query on www.imgur.com and downloads photos relevant to the search query into a new folder.
So let’s get started…
Starting off with the necessary module imports
urlopen() allows us to access the webpage and read its HTML
urlretrieve() will be used to retrieve (download) the images that we find
bs4 (BeautifulSoup4) is imported as bs
os module is needed to make new folders and get directory paths (as we’ll see)
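Putting the four points above together, the import block can be sketched as follows (assuming Python 3, where urlopen() and urlretrieve() live in urllib.request):

```python
# urlopen() lets us access a webpage and read its HTML;
# urlretrieve() fetches (downloads) files from a link
from urllib.request import urlopen, urlretrieve
# BeautifulSoup4, imported as bs
from bs4 import BeautifulSoup as bs
# os is needed to make new folders and get directory paths
import os
```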
Getting the Search String and preparing the Folder
In the first line we take the search string as input (for example the user could input Tokyo if he/she wants pictures of Tokyo).
In the second line we derive a folder name from the search string itself (for example if the search string was “Tokyo City” then our folder name would be “Tokyo_City”). This is the folder where the images will be downloaded.
In the third line we store the current path in the variable currPath by using the os module (this is the path to where our Python script is saved, for example Desktop/imgurScraper).
In the fourth line we build the path to the folder where we will be saving the images (so if our currPath is “Desktop/imgurScraper” and folder is “Tokyo_City” then reqPath is “Desktop/imgurScraper/Tokyo_City”). Again we use the os module for this.
Finally, in the fifth and sixth lines we check with the os module whether this folder already exists or not. If it does not already exist, we create a new folder with this name.
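The six lines described above can be sketched like this (prepareFolder is just my wrapper name for the sake of a runnable example; searchString, folder, currPath and reqPath follow the walkthrough):

```python
import os

def prepareFolder(searchString, currPath):
    # derive the folder name from the search string: "Tokyo City" -> "Tokyo_City"
    folder = searchString.replace(" ", "_")
    # path where the images will be saved, e.g. Desktop/imgurScraper/Tokyo_City
    reqPath = os.path.join(currPath, folder)
    # create the folder only if it does not already exist
    if not os.path.exists(reqPath):
        os.mkdir(reqPath)
    return reqPath

# in the full script the two inputs come from the user and the os module:
# searchString = input("Search: ")
# currPath = os.path.dirname(os.path.abspath(__file__))
```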
Finding the Search Query from the Search String
When we type search keywords into the search bar of a website and press enter, the search string is encoded into a “query” that is passed on in the URL of the next page so that it can be populated accordingly.
Let’s say we want to search for pictures of Tokyo City on imgur. We can do this in two ways:
- We can go to www.imgur.com , type into the search bar “Tokyo City” and press ENTER.
- We can directly open the url www.imgur.com/search?q=Tokyo+City
The fact that Imgur has a fixed format for converting search strings to search queries (replacing spaces by + signs) is very helpful in our attempt to scrape this website.
Google, on the other hand, does not use such a simple, fixed format for its search URLs, which is one reason Google cannot be scraped effectively using bs4 alone.
So now that we have taken the search string as input from the user, we can build the link where we will find our images.
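Given Imgur’s fixed query format (spaces become + signs), building the link is one line; a small sketch, with searchUrl as my own helper name:

```python
def searchUrl(searchString):
    # Imgur's fixed query format: spaces in the search string become '+' signs
    return "https://www.imgur.com/search?q=" + searchString.replace(" ", "+")
```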
Creating the Soup…
Once we have the link of the page where all our images (or rather, the links to all the images) are available, we can create the soup object.
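Creating the soup is a matter of reading the page’s HTML and handing it to bs4; a minimal sketch, assuming the imports shown earlier and using bs4’s built-in html.parser:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

def createSoup(url):
    # read the raw HTML of the results page and parse it into a soup object
    return bs(urlopen(url).read(), "html.parser")

# e.g. soup = createSoup("https://www.imgur.com/search?q=Tokyo+City")
```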
A Small Observation…
- Open www.imgur.com
- Search for “Tokyo City”
- On the results page copy the image address of the first image
- Paste this image address in notepad
- Click this image
- On the newly loaded page you will find the same enlarged image
- Copy the image address of this enlarged image and compare it with the previous address that you pasted in notepad
- You’ll find that they differ by just an extra ‘b’ before the extension name in the previously pasted address.
So why did we do all this? Well, this means that if we have the address of the smaller image (which we have for all the images on the results page) then we can find the address of the larger image by removing the extra ‘b’ before the extension name. For this we can write the following method:
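A sketch of that method (alterSrc is the name the rest of this post uses for it):

```python
def alterSrc(src):
    # the small image's address has an extra 'b' right before the extension;
    # dropping it yields the enlarged image's address:
    # "//i.imgur.com/abcdb.jpg" -> "//i.imgur.com/abcd.jpg"
    dot = src.rfind(".")              # index of the extension's dot
    return src[:dot - 1] + src[dot:]  # drop the character just before it
```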
So now our task is to find the address of every candidate image on the results page. By saying “candidate image” I am trying to differentiate between the images that are actually to be downloaded and other images that are present on the page like the company logo etc.
Getting the address of Every Candidate Image
So how will we differentiate between images? One observation here is that all the candidate images on the results page have the same styling. This means that they must have the same “value” for the “class” attribute or all of them may be inside divs which belong to the same class for styling.
If we inspect the results page and hover over one of the candidate images we can view the corresponding html in the inspect window. To inspect the page right click and inspect. Then click on the icon at the top left of the inspect window (the one with a mouse pointer over a rectangle screen).
Now click on the first image. The html code corresponding to that image will get highlighted in the inspect window.
Clearly you can see that the “img” tag does not have an id, name or class attribute. However, it is wrapped in an “a” tag whose class is “image-list-link”. This is true for every candidate image and false for every other image on the page, which you can verify. So now we know that our candidate images will always be wrapped in an “a” tag whose class attribute’s value is “image-list-link”.
We have to store the “src” attributes of all these images, remove the extra ‘b’ before the extension name and place “https:” in front of them to convert them into valid accessible links for the urlretrieve() method (which we will use to fetch and save images by using their links).
Here atags is a list which stores all the “a” tags having class attribute set to “image-list-link”. For creating this list we use the findAll() method on our previously created soup object.
imgtags is a list which stores all the “img” tags that were within all the “a” tags stored in atags. For creating this list we use the find() method on the “a” tag objects stored in atags.
Finally, srcs is a list which stores the addresses of all the images to be downloaded. To create this list we iterate over each “img” tag object stored in imgtags and access its “src” attribute using img['src']. Then we remove the extra ‘b’ using the alterSrc() method that we created previously. This returns the address of the enlarged image, which is then concatenated with “https:” to get the final link that we will feed to the urlretrieve() method.
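The three lists described above can be sketched like this (collectSrcs is just my wrapper name so the example is self-contained; atags, imgtags and srcs follow the walkthrough):

```python
from bs4 import BeautifulSoup as bs

def alterSrc(src):
    # drop the extra 'b' before the extension: "abcdb.jpg" -> "abcd.jpg"
    dot = src.rfind(".")
    return src[:dot - 1] + src[dot:]

def collectSrcs(soup):
    # every candidate image is wrapped in an <a class="image-list-link">
    atags = soup.findAll("a", {"class": "image-list-link"})
    # the <img> tag inside each of those <a> tags
    imgtags = [a.find("img") for a in atags]
    # enlarged-image links, prefixed with "https:" to make them valid URLs
    srcs = ["https:" + alterSrc(img["src"]) for img in imgtags]
    return srcs
```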
Saving the Images by using their links
Once we have all the links, we can start the download procedure.
In the above code we are iterating over each link in the srcs list as ‘l’.
The urlretrieve() method takes two arguments. First is the link from which the item is to be fetched and second the filename which also serves as the path where it will be stored.
Suppose our image address is “i.imgur.com/abcd.jpg” and our folder name (the folder where the images are to be saved) is “Tokyo_City”. Then our filename should be “Desktop/imgurScraper/Tokyo_City/abcd.jpg”.
Remember that “Desktop/imgurScraper/Tokyo_City” is stored in reqPath. And os.path.basename(l) would return here “abcd.jpg”. Then we can use os.path.join() to get “Desktop/imgurScraper/Tokyo_City/abcd.jpg”.
Finally the link along with the evaluated filename is passed to the urlretrieve() method which saves the image at that link with the passed filename. This process is repeated for every ‘l’ in srcs.
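The download loop can be sketched like this (downloadAll is my wrapper name; l, srcs and reqPath follow the walkthrough above):

```python
import os
from urllib.request import urlretrieve

def downloadAll(srcs, reqPath):
    # save every image link in srcs inside the reqPath folder
    for l in srcs:
        # basename("https://i.imgur.com/abcd.jpg") -> "abcd.jpg",
        # joined with reqPath to get the full filename
        filename = os.path.join(reqPath, os.path.basename(l))
        urlretrieve(l, filename)
```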
Here’s the script in action. I tried downloading some pictures of the Lamborghini Aventador using it.
In the end I would like to say just one thing. As most of you must have noticed, scraping Imgur for images was only 20% about coding and 80% about observing how the site stores images in its backend. With that said, I would like to point out that this post is intended only for ethical purposes.
Hope you guys learned something new today…Please leave your feedback in the comments section.
Happy Scraping! ✌️