Data Acquisition Using Web Scraping, Web Crawlers and APIs (Part 2)

Aryan Chugh · Published in Analytics Vidhya · 7 min read · Jun 17, 2020

Introduction

This article covers what APIs are, the basics of forming valid queries, fetching data from APIs, and structuring the returned data into formats such as JSON and CSV.

All the code is provided in my GitHub repository; please click here to jump directly to the code.

This article is a continuation of the introductory article “Data Acquisition Using Web Scraping, Web Crawlers and APIs (Part 1)”. I recommend reading that article first, as it covers the basics of web scraping and BeautifulSoup.

What is an API?

API stands for Application Programming Interface. An API is a software intermediary that allows two applications to talk to each other. In layman’s terms, an API sits between an application and a data source: we retrieve data by building a query URL with the required parameters, the API forwards our request to the provider, and the provider’s response is delivered back to us.
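As a minimal sketch of this idea, a query URL is usually just a base address plus URL-encoded key-value parameters. The endpoint and parameter names below are placeholders for illustration, not a real service:

from urllib.parse import urlencode

# Hypothetical endpoint and parameters, only to show how a query URL is assembled
base_url = "https://api.example.com/v1/data?"
parameters = {
    "city": "London",
    "units": "metric",
    "appid": "YOUR_API_KEY"  # placeholder key
}

# urlencode turns the dictionary into "city=London&units=metric&appid=YOUR_API_KEY"
query_url = base_url + urlencode(parameters)
print(query_url)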

We will cover the following APIs:

  1. Weather API — Provided by OpenWeatherMap
  2. Graph API — Provided by Facebook
  3. Google Geocoding API

A bonus section on image scraping is also included at the end of this article.

OpenWeatherMap API

First, we will have to go to the official website of OpenWeatherMap and sign up to generate a unique API key, which we will then use in our queries to retrieve data.

Query Format:

api.openweathermap.org/data/2.5/weather?q={city name},{state code},{country code}&appid={your api key}

In the ‘q’ parameter we can supply all three values (city name, state code, and country code) or just the city name to get weather information from this API.
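For example, both of the following queries are valid ways to ask for London’s weather (the API key below is a placeholder, use your own):

api_key = "YOUR_API_KEY"  # placeholder key

# city name only
url_city_only = "https://api.openweathermap.org/data/2.5/weather?q=London&appid=" + api_key

# city name plus country code
url_city_country = "https://api.openweathermap.org/data/2.5/weather?q=London,uk&appid=" + api_key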

from urllib.request import urlopen

api_url = "https://samples.openweathermap.org/data/2.5/weather?q=London,uk&appid=####################"

# get a response from the url
url_result = urlopen(api_url)

# read the content of the response
# it can contain xml, json, images, etc.
data = url_result.read()
print(type(data))
print(data)

Output:

<class 'bytes'>
b'{"coord":{"lon":-0.13,"lat":51.51},"weather":[{"id":300,"main":"Drizzle","description":"light intensity drizzle","icon":"09d"}],"base":"stations","main":{"temp":280.32,"pressure":1012,"humidity":81,"temp_min":279.15,"temp_max":281.15},"visibility":10000,"wind":{"speed":4.1,"deg":80},"clouds":{"all":90},"dt":1485789600,"sys":{"type":1,"id":5091,"message":0.0103,"country":"GB","sunrise":1485762037,"sunset":1485794875},"id":2643743,"name":"London","cod":200}'

As we can see, the API returns the data we asked for in JSON format. However, the data arrives as bytes rather than a string, so we need to convert it into a valid Python object before we can parse it.

# We will convert this JSON data into a dictionary using the json library
import json

json_data = json.loads(data)
print(type(json_data))
print(json_data)

Output:

<class 'dict'>
{'coord': {'lon': -0.13, 'lat': 51.51}, 'weather': [{'id': 300, 'main': 'Drizzle', 'description': 'light intensity drizzle', 'icon': '09d'}], 'base': 'stations', 'main': {'temp': 280.32, 'pressure': 1012, 'humidity': 81, 'temp_min': 279.15, 'temp_max': 281.15}, 'visibility': 10000, 'wind': {'speed': 4.1, 'deg': 80}, 'clouds': {'all': 90}, 'dt': 1485789600, 'sys': {'type': 1, 'id': 5091, 'message': 0.0103, 'country': 'GB', 'sunrise': 1485762037, 'sunset': 1485794875}, 'id': 2643743, 'name': 'London', 'cod': 200}

Now we can easily parse the data, as it is presented to us as a Python dictionary. We can pull out specific information by indexing with the appropriate keys, such as [“coord”] for the coordinates or [“weather”][0][“description”] for the weather type (the “weather” key holds a list, so we index its first element).

print("Coordinates: {}".format(json_data['coord']))
print("Weather Description: {}".format(json_data['weather'][0]['description']))
print("Temperature: {}".format(json_data['main']['temp']))

Output:

Coordinates: {'lon': -0.13, 'lat': 51.51}
Weather Description: light intensity drizzle
Temperature: 280.32
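The introduction mentioned structuring data into formats like CSV. As a minimal sketch (the chosen field names and output path are my own, not from the original post), we could flatten the fields we just printed into a one-row CSV file:

import csv

# pick a few fields from the parsed dictionary and write them as one CSV row
row = {
    "city": json_data["name"],
    "lon": json_data["coord"]["lon"],
    "lat": json_data["coord"]["lat"],
    "description": json_data["weather"][0]["description"],
    "temp": json_data["main"]["temp"]
}

with open("weather.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row.keys()))
    writer.writeheader()
    writer.writerow(row)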

Facebook Graph API

We will use this API to retrieve the profile pictures of Facebook users. It can also be used to get basic information about a user’s profile and bio (if added).

Query Format:

URL = “http://graph.facebook.com/<user_id(numeric)>/picture?type=large"

The first and most important parameter that we have to integrate into our URL is the user id. There is a unique numeric user id that is assigned to every user on Facebook.

In order to get this id, we first have to go to that person’s profile and then copy the URL (as shown in the image) and use this website to get the user id for that profile.

We will be using the requests library for scraping images using the Graph API. The requests library is one of the most popular Python libraries for making HTTP requests; it is built on top of urllib3, offers a much friendlier interface than the standard urllib module, and works well with Python 3.x.

import requests

url = "http://graph.facebook.com/4/picture?type=large"
r = requests.get(url)
print(r)
print(type(r.content))

We used user_id = 4, which is the official profile of Mark Zuckerberg.

Output:

<Response [200]>
<class 'bytes'>

As we can see, we got a response with status code 200, which means the API request was successful; to read more about HTTP response codes, please visit this website. The image we received is in byte format, so we will first save it to disk and then visualize it using OpenCV and matplotlib.
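Before saving the content, it is worth checking the status code programmatically rather than by eye. A small sketch of such a check (not part of the original code):

# Proceed only if the request succeeded
if r.status_code == 200:
    print("Request OK")
else:
    # raise_for_status() raises an HTTPError for 4xx/5xx responses
    r.raise_for_status()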

# This will create a jpeg file from the binary content we scraped
with open("./Data/sample_pic.jpeg", 'wb') as f:  # 'wb' stands for write-binary
    f.write(r.content)

# We will visualize the image using opencv and matplotlib
import cv2
import matplotlib.pyplot as plt

img = cv2.imread("./Data/sample_pic.jpeg")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img)
plt.axis("off")
plt.show()

Output:

We have successfully scraped the profile picture using the requests library and some basic code.

Google Geocoding API

Google provides many different APIs with varied functionality, and in this example we will use Google’s Geocoding API, which converts an address into geographic coordinates. In order to get data from this API, we have to sign in and generate a unique API key, just as we did for the weather API. You can follow the steps given on this website to acquire one.

One advantage of using Google APIs is that we can monitor our data usage through the interactive dashboard that Google provides.

url = "https://maps.googleapis.com/maps/api/geocode/json?"
parameters = {
    "address": "coding blocks pitampura",
    "key": "######################"  # Your API key
}

# Another way of packing the parameters into the base query url
r = requests.get(url, params=parameters)
print(r.url)

Output:

'https://maps.googleapis.com/maps/api/geocode/json?address=coding+blocks+pitampura&key=#################'

We can now see the response data using the .content attribute of the response object. We can convert the data from bytes to a string (JSON format) as shown in the last example; since the acquired data is quite large, I will only display part of it.

# Decode the content into a readable string using "utf-8"
print(r.content.decode('UTF-8'))

Output:

  "address_components" : [
{
"long_name" : "Metro Pillar Number 337",
"short_name" : "Metro Pillar Number 337",
"types" : [ "subpremise" ]
},
{
"long_name" : "Main Road",
"short_name" : "Main Road",
"types" : [ "route" ]
},
{
"long_name" : "Nishant Kunj",
"short_name" : "Nishant Kunj",
"types" : [ "political", "sublocality", "sublocality_level_3" ]
},
{
"long_name" : "Pitam Pura",
"short_name" : "Pitam Pura",
"types" : [ "political", "sublocality", "sublocality_level_1" ]
},
{
"long_name" : "New Delhi",
"short_name" : "New Delhi",
"types" : [ "locality", "political" ]
} ..........
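To extract just the coordinates from this response, we can parse it the same way we parsed the weather data. A minimal sketch, assuming the standard results → geometry → location layout of the Geocoding response:

geo_data = r.json()  # parse the JSON body into a dictionary

# the first result is usually the best match for the address
result = geo_data["results"][0]
print("Address: {}".format(result["formatted_address"]))
print("Location: {}".format(result["geometry"]["location"]))  # {'lat': ..., 'lng': ...}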

A good practice is to disable the API key from the Google console after using it, since traffic is metered and we only get a limited free usage quota (unless we pay for more).

Bonus Section

In this section, we will scrape beautiful quotes in the form of images from this website using the BeautifulSoup and requests libraries. The understanding and basics of the following code were covered above and in Part 1 of this article, so only minimal explanation is given here.

import bs4
import requests
url = "https://www.passiton.com/inspirational-quotes?page=1"
response = requests.get(url)
# Now we create a BeautifulSoup object, which makes it easier to parse the html content
soup = bs4.BeautifulSoup(response.content, 'html.parser')
# We can specify which parser to use; 'html.parser' is Python's built-in html parser

We will use inspect element to study the DOM of the page and then extract the image URLs using the BeautifulSoup parser.

divs = soup.findAll('div', {'class': 'row', 'id': 'all_quotes'})
anchor = divs[0].findAll('a')
print(anchor[2])  # inspect one of the anchor tags that wraps a quote image
print("\n")
print("Number of images on our page: {}".format(len(anchor)))

Output:

<a href="/inspirational-quotes/8081-one-of-the-most-sincere-forms-of-respect-is"><img alt="One of the most sincere forms of respect is actually listening to what another has to say. #&lt;Author:0x00007f1c5362d1f0&gt;" class="margin-10px-bottom shadow" height="310" src="https://assets.passiton.com/quotes/quote_artwork/8081/medium/20200609_tuesday_quote.jpg?1591393705" width="310"/></a>

Number of images on our page: 64

We have successfully parsed our data to get the anchor elements that contain the images. Next, we will use the image src attribute to download the images to our computer.

for i in range(0, 64, 2):
    with open('Data/Scraped Images/Inspirational quote no.{}.jpg'.format(i), 'wb') as f:
        img_url = anchor[i].img.attrs['src']
        response = requests.get(img_url)
        f.write(response.content)

We can visualize these images using matplotlib and OpenCV, as explained in the Graph API example.

Visualizing our result

import matplotlib.pyplot as plt
import cv2

for i in range(0, 7, 2):
    img = cv2.imread('Data/Scraped Images/Inspirational quote no.{}.jpg'.format(i))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.axis("off")
    plt.show()

Output:

Conclusion

After completing this article, you should be able to parse HTML and scrape useful data from websites and APIs with ease. Please feel free to browse through my GitHub repository for more interesting projects and explanatory code.

I have also posted explanatory code in this repository for scraping quote images using a web crawler that spans multiple pages and gathers images automatically.
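A minimal sketch of what such a crawler loop might look like, reusing the ?page=N pattern and parsing logic from the bonus section (the page range and file naming here are illustrative, not taken from the repository):

import bs4
import requests

for page in range(1, 4):  # crawl the first few pages; adjust the range as needed
    page_url = "https://www.passiton.com/inspirational-quotes?page={}".format(page)
    response = requests.get(page_url)
    soup = bs4.BeautifulSoup(response.content, 'html.parser')

    # same DOM structure as before: one div holds all the quote anchors
    divs = soup.findAll('div', {'class': 'row', 'id': 'all_quotes'})
    anchor = divs[0].findAll('a')

    for i in range(0, len(anchor), 2):
        img_url = anchor[i].img.attrs['src']
        img_data = requests.get(img_url).content
        file_name = 'Data/Scraped Images/page{}_quote{}.jpg'.format(page, i)
        with open(file_name, 'wb') as f:
            f.write(img_data)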
