The Great Data Hunt: Scraping the Web and Unlocking APIs

Sergio David
4 min read · Aug 11, 2023


In the world of big data, collecting relevant information is a fundamental step. This process can often be complex and requires precise tools and methods. In this article, we’ll delve into two core techniques: web scraping with BeautifulSoup and using APIs with the requests library. By breaking down examples and sharing snippets of code, we’ll understand the logic behind these powerful data collection tools, including extracting city data from Wikipedia and collecting weather information through OpenWeatherMap API.

Whether you’re new to data collection or seeking to expand your skills, these techniques offer a robust way to gather vast arrays of information, laying a solid foundation for data analysis and insights.

Section 1: Web Scraping with BeautifulSoup

Web scraping is a method used to extract data from websites. BeautifulSoup is a popular library in Python that makes this task more accessible.
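Before the full example, here is a minimal sketch of the idea: BeautifulSoup parses a string of HTML into a searchable tree of tags. The HTML snippet below is made up purely for illustration.

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet to illustrate parsing
html = "<html><body><p class='intro'>Hello, <b>world</b>!</p></body></html>"

# Parse the markup with Python's built-in html.parser
soup = BeautifulSoup(html, 'html.parser')

# Find the first <p> tag with class 'intro' and read its text
paragraph = soup.find('p', class_='intro')
print(paragraph.get_text())  # Hello, world!

The real pages we scrape are larger, but the workflow is the same: fetch the HTML, parse it, then search the resulting tree for the tags we care about.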

Example: Extracting City Data

In this example, we will be extracting the metro population data of Berlin from its Wikipedia page. We will employ BeautifulSoup for web scraping and pandas to store the extracted data in a structured way.

1. Importing Necessary Libraries:

We need to import BeautifulSoup for parsing HTML content, requests for fetching the web page, and pandas for handling data.

from bs4 import BeautifulSoup
import requests
import pandas as pd

2. Defining the URL and DataFrame:

We specify the Wikipedia page URL of Berlin, and create a DataFrame with columns ‘City’ and ‘Population’ to store the scraped data.

url = "https://en.wikipedia.org/wiki/Berlin"
df = pd.DataFrame(columns=['City', 'Population'])

3. Sending a Request to the URL:

Using the requests.get method, we send an HTTP request to the specified URL to get the page content.

response = requests.get(url)

4. Parsing HTML Content:

We create a BeautifulSoup object and pass the content from the response object along with the parser 'html.parser'.

soup = BeautifulSoup(response.content, 'html.parser')

5. Locating the Metro Population Data:

We look for the anchor tag <a> whose text is 'Metro' and, if it is found, grab the next table data cell <td> with the class infobox-data, which holds the metro population figure.

metro_row = soup.find('a', string='Metro')
if metro_row is not None:
    metro = metro_row.find_next('td', class_="infobox-data").get_text(strip=True)
else:
    metro = None

6. Adding Data to the DataFrame:

We create a new DataFrame with the scraped data and concatenate it with the original DataFrame to store the information.

data_to_add = pd.DataFrame({'City': ['Berlin'], 'Population': [metro]})
df = pd.concat([df, data_to_add])

7. Printing the DataFrame:

Finally, we print the DataFrame to view the extracted information.

print(df)
     City Population
0  Berlin  6,144,600
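The same pattern generalizes to more than one city. The sketch below is one way it could look, assuming each listed city's Wikipedia infobox contains a 'Metro' row like Berlin's; the city list here is purely illustrative.

from bs4 import BeautifulSoup
import requests
import pandas as pd

# Illustrative list; any city whose infobox has a 'Metro' row should work
cities = ['Berlin', 'Hamburg', 'Munich']
rows = []

for city in cities:
    url = f"https://en.wikipedia.org/wiki/{city}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Same lookup as above: the 'Metro' link, then the following infobox data cell
    metro_row = soup.find('a', string='Metro')
    if metro_row is not None:
        population = metro_row.find_next('td', class_="infobox-data").get_text(strip=True)
        rows.append({'City': city, 'Population': population})

df = pd.DataFrame(rows, columns=['City', 'Population'])
print(df)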

Section 2: API Calls with Requests and Python

APIs (Application Programming Interfaces) provide a systematic way to interact with web services and obtain data. They can offer detailed information on various subjects like weather, stocks, or social media interactions.
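To see the request and response cycle in isolation, here is a small sketch against httpbin.org, a public testing service that simply echoes the request back as JSON; the 'demo' parameter is just a placeholder.

import requests

# Send a GET request with a query parameter and parse the JSON body into a Python dict
response = requests.get("https://httpbin.org/get", params={'demo': 'hello'})
data = response.json()
print(data['args'])  # {'demo': 'hello'}

Real APIs follow the same pattern: build a URL with parameters, send the request, and parse the JSON response, as the weather example below shows.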

Example: Collecting Weather Data through an API

In this example, we’ll explore how to collect weather data from Berlin using the OpenWeatherMap API.

1. Define the URL and API Key:

You’ll need an API key from OpenWeatherMap. The URL will contain the city name and API key to fetch weather data.

city = 'Berlin'
openweather_key = 'YOUR_API_KEY_HERE'
url = f"http://api.openweathermap.org/data/2.5/forecast?q={city}&appid={openweather_key}&units=metric"

2. Send a GET Request to the API URL:

We use the requests.get method to send a request to the specified URL.

response = requests.get(url)
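As a side note, requests can also assemble the query string from a dictionary via its params argument, which handles URL encoding automatically. Here is an equivalent sketch:

# Equivalent request: let requests build and encode the query string for us
base_url = "http://api.openweathermap.org/data/2.5/forecast"
params = {'q': city, 'appid': openweather_key, 'units': 'metric'}
response = requests.get(base_url, params=params)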

3. Parse the JSON Response and Check for Errors:

The response object will contain the weather data in JSON format. We’ll check the status code to make sure the request was successful before converting it into a Python object.

if response.status_code == 200:
    weather_data = response.json()
else:
    print(f"Failed to fetch weather data. Status code: {response.status_code}")

4. Accessing Specific Data:

You can now navigate through the parsed response to access specific weather details. Here’s how you could access the temperature from the first forecast entry (the forecast endpoint returns data in three-hour steps):

temperature_forecast = weather_data['list'][0]['main']['temp']
print(f"The temperature forecast for Berlin is {temperature_forecast}°C")

For more information on the structure of the JSON object and how to access different weather parameters, you can refer to the OpenWeatherMap API documentation.
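As a rough illustration of that structure, the sketch below flattens the forecast list from the weather_data object parsed above into a pandas DataFrame; it assumes each entry carries the documented 'dt_txt', 'main' and 'weather' fields.

import pandas as pd

# Each entry in 'list' is one three-hour forecast slot
forecast_rows = []
for entry in weather_data['list']:
    forecast_rows.append({
        'time': entry['dt_txt'],                           # timestamp of the forecast slot
        'temperature_C': entry['main']['temp'],            # temperature in °C (units=metric)
        'description': entry['weather'][0]['description']  # e.g. 'clear sky'
    })

forecast_df = pd.DataFrame(forecast_rows)
print(forecast_df.head())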

Conclusion:

Data collection is an essential step in any data-driven project. By understanding the logic behind web scraping with BeautifulSoup and API calls using requests or specialized wrappers, we can efficiently gather the information we need.

These techniques offer flexibility and power, enabling us to tap into a vast array of data sources. Whether it’s collecting city information from a webpage or fetching weather data through an API, the examples provided in this article serve as a practical guide to these essential skills in data science and analytics.

Combining both web scraping and API calls, we can create a versatile data collection pipeline that integrates various sources. With these tools in our toolkit, the possibilities for data analysis, visualization, and decision-making are virtually limitless.
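As a closing sketch, here is one way the two techniques could be combined into a small pipeline. It reuses the scraping and API steps from above, with the same assumptions about the Wikipedia infobox layout and the OpenWeatherMap response format, and the city list is again illustrative.

from bs4 import BeautifulSoup
import requests
import pandas as pd

def get_metro_population(city):
    # Scrape the metro population from the city's Wikipedia infobox (layout assumed as above)
    soup = BeautifulSoup(requests.get(f"https://en.wikipedia.org/wiki/{city}").content, 'html.parser')
    metro_row = soup.find('a', string='Metro')
    if metro_row is None:
        return None
    return metro_row.find_next('td', class_="infobox-data").get_text(strip=True)

def get_first_forecast(city, api_key):
    # Fetch the first forecast temperature (°C) from the OpenWeatherMap forecast endpoint
    url = "http://api.openweathermap.org/data/2.5/forecast"
    response = requests.get(url, params={'q': city, 'appid': api_key, 'units': 'metric'})
    if response.status_code != 200:
        return None
    return response.json()['list'][0]['main']['temp']

cities = ['Berlin', 'Hamburg']  # illustrative list
openweather_key = 'YOUR_API_KEY_HERE'

records = [{'City': c,
            'Population': get_metro_population(c),
            'Temperature_C': get_first_forecast(c, openweather_key)} for c in cities]
print(pd.DataFrame(records))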
