
Web Scraping (images) with Python

Saurabh Suman
Sep 8, 2018 · 3 min read

Has it ever happened that you see a bunch of images on a web page and want to download them all, but simply can't because clicking that many links is impossible? I was in a similar situation while working on a use case where I had to train a Convolutional Neural Network, which required hundreds or thousands of images. Doing it manually was not even a remote possibility. This is when I decided to write a Python program using the Beautiful Soup library and let my computer do all the clicking and downloading.
I used a website called celebmafia.com (ad blocker recommended), which I found hosts a lot of high-quality celebrity images. Why celebrity images? Because the problem I was working on involved training the neural network to identify a celebrity just by looking at an image.


A basic familiarity with Python and HTML is required. Now let's start.

The first thing to do is examine the page structure and figure out how you reach the destination page that holds the images to be downloaded.
You will first see the URL of the celebrity page:

https://celebmafia.com/mila-kunis/page/2/

The pattern here signifies that the page belongs to "mila kunis" and the page currently being viewed is number 2. So the two parameters we can change as needed are the celebrity name and the page number. The next thing is the Read More button, which actually takes you to the destination page. We need to inspect the HTML source code to find the component's class.
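Those two parameters can be turned into a small URL-building helper. This is a sketch, not code from the original script, and `gallery_url` is a name I have invented for illustration:

```python
# Hypothetical helper: build the listing-page URL from the two parameters
# identified above (celebrity name and page number).
def gallery_url(celebrity, page):
    # celebmafia.com uses lowercase, hyphen-separated slugs
    slug = celebrity.lower().replace(" ", "-")
    return "https://celebmafia.com/{}/page/{}/".format(slug, page)

print(gallery_url("Mila Kunis", 2))
# https://celebmafia.com/mila-kunis/page/2/
```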

The section on web page to pay attention to
HTML source of the section shown above

We find it to have class="more-link" and it sits inside an <a> tag, which is expected because it contains a hyperlink to another page. On analyzing the destination page's source code, we see that it contains the hyperlink to the actual image location.

Now that we are done with the preliminary analysis of the page structure, let's jump into coding.

Firstly we need to get all our libraries into place.

import urllib.request as url
from bs4 import BeautifulSoup
import requests
import shutil
import os, sys

The next important thing to take care of stems from the fact that we are not using an actual browser to access the web page.

url.Request(wiki, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

This request needs to carry a User-Agent header, which in layman's terms means we are passing information about the browser and operating system to the server, so that the HTTPS-secured site treats us like a regular browser.
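In context, building that request might look like the sketch below. The variable `wiki` is assumed to hold the listing-page URL; the actual fetch is left commented out so the snippet stays offline:

```python
import urllib.request as url

# 'wiki' is assumed to hold the listing-page URL we want to scrape.
wiki = "https://celebmafia.com/mila-kunis/page/2/"

# Browser-like User-Agent header, as in the line above.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

req = url.Request(wiki, headers=headers)
# html = url.urlopen(req).read()  # uncomment to actually fetch the page
```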

The next important step is putting our preliminary analysis to work in code:

nextlinks = soup.findAll("a", {"class": "more-link"})
for nextlink in nextlinks:
    nextlink.get('href')

This basically looks for all elements with the more-link class and gets the href for each (note that findAll returns a list, so we loop over it). The same steps are repeated on the child page to finally get the image URLs. Once we have all the relevant URLs, we use requests again to get the raw form of the image and finally write it out as a file object.
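The link-extraction step can be demonstrated on a tiny HTML sample. The markup below is invented for illustration, not taken from celebmafia.com:

```python
from bs4 import BeautifulSoup

# Invented sample markup mimicking the listing page's "Read More" links.
html = '''
<div class="post">
  <a class="more-link" href="https://celebmafia.com/mila-kunis/post-1/">Read More</a>
  <a class="more-link" href="https://celebmafia.com/mila-kunis/post-2/">Read More</a>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
# Collect the href of every anchor with class "more-link".
hrefs = [link.get("href") for link in soup.findAll("a", {"class": "more-link"})]
print(hrefs)
```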

r.raw.decode_content = True
shutil.copyfileobj(r.raw, f)
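Putting those two lines in context, the download step might be sketched as the helper below; `save_image` is a hypothetical name, not from the original script, and the usage line needs network access, so it is commented out:

```python
import shutil
import requests

# Hypothetical helper wrapping the two lines above.
def save_image(img_url, dest_path):
    r = requests.get(img_url, stream=True)   # stream=True keeps the body as a raw stream
    r.raw.decode_content = True              # undo gzip/deflate transfer encoding
    with open(dest_path, 'wb') as f:
        shutil.copyfileobj(r.raw, f)         # copy the raw stream straight to disk

# Usage (network required):
# save_image("https://celebmafia.com/.../photo_1.jpg", "photo_1.jpg")
```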

The working code is available on the following Git repo.

Once the code executes successfully, you will get a nicely structured directory for each of the child pages, containing the image files that have been written to it.
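Creating one folder per child page can be sketched as follows; `page_dir` and the slug below are invented names for illustration, and a temporary directory is used so the example doesn't clutter your disk:

```python
import os
import tempfile

# Hypothetical helper: make (or reuse) one folder per child page.
def page_dir(base, page_slug):
    path = os.path.join(base, page_slug)
    os.makedirs(path, exist_ok=True)   # no error if the folder already exists
    return path

base = tempfile.mkdtemp()              # throwaway base directory for the demo
d = page_dir(base, "mila-kunis-street-style")
print(os.path.isdir(d))
# True
```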

Written by Saurabh Suman

An aspiring data scientist. Here for learning and sharing more about this "fourth paradigm" of science.
