Using Python to scrape and perform ETL (Extraction, Transformation, Loading) on BBC’s Chocolate Cake Recipes
This article is based on a project we undertook during an introduction to data engineering. We used Python as our language of choice, as it is one of the most widely used languages for web scraping. We hope this article helps you learn a bit more about ETL and Python. Let's get started!
Before we begin, we need to import the necessary libraries. We use BeautifulSoup as our scraper, along with the requests module, which lets us send HTTP requests and returns a Response object containing the response data.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.bbcgoodfood.com/recipes/collection/chocolate-cake"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)
response
If all is good, you should see the following output -
<Response [200]>
The status code 200 signifies that the request was successfully received, understood and accepted.
Next, the variable data stores the text of the site's HTML document, and the variable soup holds the result of parsing that text with BeautifulSoup.
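More generally, HTTP status codes fall into classes, and it is worth checking the code before parsing anything (a 403, for instance, often means the scraper was blocked). A minimal sketch, using a hypothetical helper classify_status that is not part of the original project:

```python
def classify_status(code):
    """Classify an HTTP status code into its response class (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"        # request received, understood and accepted
    if 300 <= code < 400:
        return "redirect"       # further action is needed to complete the request
    if 400 <= code < 500:
        return "client error"   # e.g. 404 Not Found, 403 Forbidden
    if 500 <= code < 600:
        return "server error"   # the server failed to fulfil a valid request
    return "informational or unknown"

print(classify_status(200))  # prints "success"
```

In practice, calling response.raise_for_status() from requests achieves a similar guard by raising an exception on 4xx/5xx codes.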
data = response.text
soup = BeautifulSoup(data, 'html.parser')
PHASE 1: EXTRACTION
The source of our dataset is a public website, open for everyone to view, and there is no restriction on scraping it. The site lists recipes for various types of chocolate cake, along with ratings, level of difficulty, the time taken to bake each cake and so on. As the first step of attribute extraction, we extract the title of each chocolate cake recipe.
Attribute 1: Title
titles = soup.find_all("h3", {"class": "teaser-item__title"})
for title in titles:
    print(title.text)
The output is as follows -
Easy chocolate cake
Ultimate chocolate cake
Chocolate fridge cake
Chocolate caramel cake
Double chocolate loaf cake
Salted dark chocolate, rye & courgette cake
Dark chocolate & orange cake
Chocolate & sesame loaf cake
Dark chocolate, coconut & passion fruit mousse cake
Easy vegan chocolate cake
Gravity-defying sweetie cake
Chocolate brownie cake
Cookies & cream party cake
Brooklyn blackout cake
Malted chocolate drip cake
Chocolate & lime cake
Lighter Chocolate cake with chocolate icing
Hazelnut latte cake
Seriously rich chocolate cake
Espresso, chocolate & chilli cake with coffee cream
Chocolate meringue Mont Blanc cake
Flourless chocolate & pear cake
Pistachio & milk chocolate squares
Chocolate fudge cake
White & dark chocolate cake
Attribute 2: Description
descriptions = soup.find_all("div", {"class": "teaser-item__text-content"})
for description in descriptions:
    print(description.text)
    print("\n")
The output is as follows -
Master the chocolate cake with an airy, light sponge and rich buttercream filling. It's simple enough for an afternoon tea but special enough for a party too...
Indulge yourself with this ultimate chocolate cake recipe that is beautifully moist, rich and fudgy. Perfect for a celebration or an afternoon tea
Sophisticated and intensely chocolatey with a hit of juicy raisins and crunchy biscuit pieces, serve this decadent chocolate fridge cake as a teatime treat
This rich cake makes an impressive centrepiece for any summer tea party
Chocolate and cake are two of our favourite things, so what's not to love about this indulgent cake?
Not only are these chocolate squares a great way to use up a glut of courgettes, they're topped with rye crumbs and sea salt for a chocolate cake with a twist...
A dense, dark and devilishly delicious cake, this will be gone before you know it
Tahini gives this cake a subtle nutty flavour that partners perfectly with rich chocolate. This recipe makes 2 cakes, so why not freeze one as a treat for another day!
Sandwich our best ever chocolate sponge with a fruity coconut mousse and smother in a rich ganache for a stunning party centrepiece
This indulgent, fudgy vegan bake is topped with a rich frosting – you'd never guess that it's free from dairy, eggs, wheat and nuts...
Anti-gravity cakes are this year's must-have trend for birthdays or parties - this stunning Smartie cake will delight kids and grown-ups alike...
The perfect cake for brownie lovers, try a big slice warm with some ice cream
Stack up our best ever chocolate sponges and smother in an Oreo biscuit icing to make this towering celebration cake
This rich, dark sponge is filled and coated with a thick chocolate custard, then finished with crumbled cake - best eaten chilled
This layered drip cake is smothered in cream cheese icing and drizzled with a dark chocolate ganache - a stunning centrepiece for a big occasion
Give chocolate cake a zesty lift with lime buttercream filling, chocolate and lime icing and candied zest to decorate – an irresistible flavour pairing...
You'd never guess that this rich-tasting, light-textured sponge is lower in fat, calories and sugar thanks to ground almonds and yogurt
Give coffee cake the ultimate upgrade - layer with chocolate hazelnut frosting, add a creamy topping and sprinkle with crushed nuts
Dark, rich and delicious - the perfect dessert
The warmth of coffee and chilli combine beautifully with chocolate in this impressive dinner party dessert, finish with double cream
This towering cake has layers of chocolate sponge, meringue and chestnut cream - a stunning centrepiece for a festive dinner party
A light, but luscious cake made with hazelnuts - serve a slice for dessert with a dollop of crème fraîche
Try something different with Sarah Cook's Pistachio & milk chocolate squares - you won't be able to stop at just one piece!...
You just can't beat a good chocolate cake, and this one is guaranteed to disappear as soon as you cut the first slice...
For big celebrations this cake is a must, four layers of moist sponge, lashings of chocolate ganache and the crunch of Maltesers
Attribute 3: Duration
durations = soup.find_all('li', {'class': 'teaser-item__info-item--total-time'})
for duration in durations:
    print(duration.text)
    print("\n")
The output -
55 mins
2 hours and 10 mins
20 mins
50 mins
1 hour and 20 mins
1 hour and 30 mins
2 hours and 10 mins
1 hour and 20 mins
1 hour and 30 mins
55 mins
1 hour and 35 mins
1 hour and 5 mins
1 hour and 50 mins
1 hour and 10 mins
3 hours
1 hour and 50 mins
50 mins
1 hour and 55 mins
1 hour and 5 mins
1 hour and 5 mins
2 hours and 10 mins
1 hour
1 hour and 20 mins
1 hour and 20 mins
1 hour
Attribute 4: Ratings
ratings = soup.find_all('div', {'class': 'star'})
for rating in ratings:
    print(rating.text)
The output -
4.663045
4.705535
4
4.37931
4.74312
4
4.528735
3.666665
4.666665
4.357145
4.80645
4.62234
4.8125
4.8
4.916665
5
3.555555
3.72222
4.05
5
5
4.70513
4.51852
4.223075
4.41818
Attribute 5: Difficulty
difficulties = soup.find_all('li', {'class': 'teaser-item__info-item teaser-item__info-item--skill-level'})
for difficulty in difficulties:
    print(difficulty.text)
And, the output -
Easy
Easy
Easy
Easy
Easy
Easy
Easy
Easy
A challenge
Easy
More effort
Easy
More effort
More effort
A challenge
Easy
Easy
A challenge
More effort
Easy
More effort
More effort
Easy
Easy
More effort
Attribute 6: Links
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
The output of this is pretty lengthy as there are A LOT of links in the site. Hence, we have not included it.
Attribute 7: Food Type
vegetarians = soup.find_all('li', {'class': 'teaser-item__info-item--vegetarian'})
for vegetarian in vegetarians:
    print(vegetarian.text)
Note that find_all only returns the tags it actually finds, so this loop can only ever print "Vegetarian" — an if/else fallback here would never reach its else branch. Non-vegetarian recipes are marked "N/A" later, during the per-recipe extraction.
The output is as follows -
Vegetarian
Vegetarian
Vegetarian
Vegetarian
Vegetarian
Vegetarian
For the final step of attribute extraction, we combine all the extracted attributes into a pandas dataframe.
recipes = soup.find_all("article", {"class": "node-recipe"})
rep_no = 0
recipes_list = {}
for recipe in recipes:
    title = recipe.find("h3", {"class": "teaser-item__title"}).text
    link = recipe.find("a").get('href')
    abs_url = "https://www.bbcgoodfood.com" + link
    description = recipe.find("div", {"class": "teaser-item__text-content"}).text
    ratings = recipe.find('div', {'class': 'star'}).text
    duration = recipe.find('li', {'class': 'teaser-item__info-item--total-time'}).text
    vegetarian_tag = recipe.find('li', {'class': 'teaser-item__info-item--vegetarian'})
    vegetarian = vegetarian_tag.text if vegetarian_tag else "N/A"
    difficulties = recipe.find('li', {'class': 'teaser-item__info-item teaser-item__info-item--skill-level'}).text
    rep_no += 1
    recipes_list[rep_no] = [title, description, abs_url, ratings, vegetarian, duration, difficulties]

recipes_df = pd.DataFrame.from_dict(recipes_list, orient="index", columns=['Recipe Title', 'Description', 'Link', 'Ratings', 'Food Type', 'Duration', 'Difficulty'])
recipes_df.head()
The output is a nice-looking dataframe that consists of the recipes with their respective extracted attributes.
PHASE 2: TRANSFORMATION
There are a few data processing techniques that we used for this data, generally focused on data formatting and handling missing values. This phase ensures that the data is reliable enough for further analysis and visualization. Before we start the pre-processing steps, we need to answer these questions:
- What are the features of the dataset?
- What are the expected types (int, float, string, boolean)?
- Is there any missing data?
We can identify these points by looking at our data. Let’s take a look:
From this, we can answer our first question which is the features of the dataset. These are the features we got:
- Recipe Title (Title of the recipe)
- Description (Recipe Description)
- Link (Absolute URL link to each recipe)
- Ratings (Recipe ratings from 0 to 5)
- Food Type (whether it is a Vegetarian food or not)
- Duration (Recipe duration)
- Difficulty (Skill level: Easy, More Effort, A Challenge)
Next, what are the expected data types?
- Recipe Title: String
- Description: String
- Link: String
- Ratings: float or double
- Food Type: String
- Duration: String
- Difficulty: String
Besides, there is only one column with missing data:
- Food Type
After observing these points, it is much clearer which cleaning processes are relevant for this data. So, let's start cleaning!
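The observations above can also be checked programmatically. A minimal sketch, using a hypothetical two-row stand-in for the scraped dataframe (the real recipes_df would be inspected the same way):

```python
import pandas as pd

# Hypothetical miniature of the scraped dataframe, for illustration only
sample = pd.DataFrame({
    'Recipe Title': ['Easy chocolate cake', 'Ultimate chocolate cake'],
    'Ratings': ['4.663045', '4.705535'],   # scraped as text, not float
    'Food Type': ['Vegetarian', 'N/A'],
})

# Dtypes reveal that everything arrived as object (string), including Ratings
print(sample.dtypes)

# 'N/A' strings are not real NaNs, so pandas' isna() will not see them;
# count the placeholder values explicitly instead
print((sample == 'N/A').sum())
```

This kind of check is why the "N/A" placeholders in "Food Type" have to be replaced by hand rather than with dropna() or fillna().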
Cleaning Task 1: Handling Missing Values
Based on the data that we scraped, the “Food Type” column contains missing values, which we need to fill with meaningful values. The column has only one category, Vegetarian, and the rest are null values, so we decided to replace all the null values with “Non Vegetarian”.
Before:
recipes_df['Food Type'].head()
The output -
1 Vegetarian
2 N/A
3 N/A
4 N/A
5 N/A
Name: Food Type, dtype: object
Visualization:
The missing values category is stated as N/A.
After:
Replace null values with “Non Vegetarian”.
recipes_df['Food Type'] = recipes_df['Food Type'].replace({'N/A': 'Non Vegetarian'})
recipes_df['Food Type'].head()
And, the output -
1 Vegetarian
2 Non Vegetarian
3 Non Vegetarian
4 Non Vegetarian
5 Non Vegetarian
Name: Food Type, dtype: object
Visualization:
Create a bar chart of the value counts to see the column's distribution.
%matplotlib inline
import matplotlib.pyplot as plt

recipes_df['Food Type'].value_counts().plot(kind='bar', figsize=(10, 6), color='lightgreen')
plt.title('Histogram of Recipes Food Type')
plt.xlabel('Food Type')
plt.ylabel('Number of recipes')
Cleaning Task 2: String types to numerical categories
In the “Difficulty” column, the categorical values are strings, so we decided to convert them into numerical data. This makes the visualization step much easier.
Before:
recipes_df['Difficulty'].head(15)
The output of this was -
1 Easy
2 Easy
3 Easy
4 Easy
5 Easy
6 Easy
7 Easy
8 Easy
9 A challenge
10 Easy
11 More effort
12 Easy
13 More effort
14 More effort
15 A challenge
Name: Difficulty, dtype: object
Visualization:
After:
recipes_df['Difficulty'] = recipes_df['Difficulty'].replace(to_replace=['Easy', 'More effort', 'A challenge'], value=[1, 2, 3], regex=True)
recipes_df['Difficulty'].head(9)
The changed output is as follows -
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 3
Visualization:
%matplotlib inline
import matplotlib.pyplot as plt

recipes_df.plot(kind='bar', x='Recipe Title', y='Difficulty', figsize=(10, 6), color='lightblue')
plt.title('Level Of Difficulties For Each Of The Recipe')
plt.xlabel('Recipe Title')
plt.ylabel('Level of Difficulty')
plt.show()
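As a side note, an explicit mapping is a common alternative to replace with regex=True: it avoids accidental partial matches, and any unexpected label shows up as NaN instead of slipping through. A minimal sketch on a hypothetical sample of the column:

```python
import pandas as pd

# Hypothetical sample of the scraped Difficulty column
difficulty = pd.Series(['Easy', 'More effort', 'A challenge', 'Easy'])

# Explicit label-to-level mapping; labels missing from the dict become NaN
levels = {'Easy': 1, 'More effort': 2, 'A challenge': 3}
print(difficulty.map(levels))  # 1, 2, 3, 1
```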
Cleaning Task 3: Change Letters to Lowercase
This is done to ensure that the naming convention of the data is standardized, without mixed-case values in the dataset. We chose to lowercase the values of the “Recipe Title” and “Description” columns.
Before:
Column — Recipe Title
recipes_df['Recipe Title'].head()
The first five rows of data look like this -
1 Easy chocolate cake
2 Ultimate chocolate cake
3 Chocolate fridge cake
4 Chocolate caramel cake
5 Double chocolate loaf cake
Name: Recipe Title, dtype: object
Column — Description
recipes_df['Description'].head()
The output -
1 Master the chocolate cake with an airy, light ...
2 Indulge yourself with this ultimate chocolate ...
3 Sophisticated and intensely chocolatey with a ...
4 This rich cake makes an impressive centrepiece...
5 Chocolate and cake are two of our favourite th...
Name: Description, dtype: object
After:
Column — Recipe Title
recipes_df['Recipe Title'] = recipes_df['Recipe Title'].str.lower()
recipes_df['Recipe Title'].head()
The changed output-
1 easy chocolate cake
2 ultimate chocolate cake
3 chocolate fridge cake
4 chocolate caramel cake
5 double chocolate loaf cake
Name: Recipe Title, dtype: object
Column — Description
recipes_df['Description'] = recipes_df['Description'].str.lower()
recipes_df['Description'].head()
The changed output-
1 master the chocolate cake with an airy, light ...
2 indulge yourself with this ultimate chocolate ...
3 sophisticated and intensely chocolatey with a ...
4 this rich cake makes an impressive centrepiece...
5 chocolate and cake are two of our favourite th...
Name: Description, dtype: object
Cleaning Task 4: String time frame into numeric types
We decided to convert the “Duration” column from strings into numeric types. The time will be formatted into minutes in order to standardize it and ease the process of analysing how long each recipe takes to complete.
Before:
Column — Duration
recipes_df['Duration'].head()
The output —
1 55 mins
2 2 hours and 10 mins
3 20 mins
4 50 mins
5 1 hour and 20 mins
Name: Duration, dtype: object
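Incidentally, these “Duration” strings could also be converted directly with a regular expression, without touching the site again. A minimal sketch; the helper duration_to_minutes is ours, not part of the original project, which re-scrapes the hour and minute spans instead:

```python
import re

def duration_to_minutes(text):
    """Convert strings like '2 hours and 10 mins' or '55 mins' to total minutes."""
    hours = re.search(r'(\d+)\s*hour', text)   # matches 'hour' and 'hours'
    mins = re.search(r'(\d+)\s*min', text)     # matches 'min' and 'mins'
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if mins:
        total += int(mins.group(1))
    return total

print(duration_to_minutes('2 hours and 10 mins'))  # prints 130
```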
Now, we scrape the hour and minute values from the site.
hours = soup.find_all('span',{'class':'hours'})
mins = soup.find_all('span',{'class':'mins'})
After:
Create a new column that contains durations in minutes. We named the column “Duration (in Mins)”.
import re

recipes = soup.find_all("article", {"class": "node-recipe"})
rep_no = 0
recipes_list = {}
for recipe in recipes:
    # hours and mins
    hours = recipe.find('span', {'class': 'hours'})
    temp_hours = re.findall(r'\d+', str(hours))
    h = ", ".join(temp_hours)
    h = int(h) if h else 0
    h = h * 60
    mins = recipe.find('span', {'class': 'mins'})
    temp_mins = re.findall(r'\d+', str(mins))
    m = ", ".join(temp_mins)
    m = int(m) if m else 0
    final_duration = h + m
    rep_no += 1
    recipes_list[rep_no] = [final_duration]

recipes_duration_df = pd.DataFrame.from_dict(recipes_list, orient="index", columns=['Duration (in Mins)'])
Add the new column to the dataframe.
recipes_df['Duration (in Mins)'] = recipes_duration_df
recipes_df['Duration (in Mins)'].head()
The expected output -
1     55
2    130
3     20
4     50
5     80
Name: Duration (in Mins), dtype: int64
As for the column named “Duration”, we decided to drop it, since it is now redundant given the new “Duration (in Mins)” column. Hence, the updated recipes_df looks like this -
recipes_df.drop("Duration", axis=1, inplace=True)
recipes_df.head()
PHASE 3: LOAD
For the final process of this scraping and data cleaning session, we store (load) the data in a CSV file and download it. The following line of code achieves this -
recipes_df.to_csv("cleaned_recipes_data.csv")
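A quick round-trip check confirms the file loaded back cleanly. A minimal sketch, using a hypothetical two-row stand-in for recipes_df rather than the real dataframe:

```python
import pandas as pd

# Hypothetical two-row stand-in for recipes_df, just to show the round trip
df = pd.DataFrame({'Recipe Title': ['easy chocolate cake', 'ultimate chocolate cake'],
                   'Duration (in Mins)': [55, 130]}, index=[1, 2])
df.to_csv("cleaned_recipes_data.csv")

# Read it back; index_col=0 restores the saved index column
loaded = pd.read_csv("cleaned_recipes_data.csv", index_col=0)
print(loaded.shape)  # prints (2, 2)
```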
This was it! This wasn’t so bad, was it?
Hope this was informative for you and helped you learn more about doing data analysis with pandas.
The complete code can be found here.
Special thanks to our professor Dr Sabrina, who guided us throughout the subject and my friends Azzan, Hadi and Faris, who were behind this project alongside me.