Using Python to scrape and perform ETL (Extraction, Transformation, Loading) on BBC’s Chocolate Cake Recipes
This article is based on a project we undertook during an introduction to data engineering. We used Python as our language of choice, as it is one of the most widely used languages for web scraping. We hope this article helps you learn a bit more about ETL and Python. Let's get started!
Before we begin, we need to import the necessary libraries. We use BeautifulSoup as our scraper, along with the requests module, which lets us send HTTP requests and returns a Response object containing the response data.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.bbcgoodfood.com/recipes/collection/chocolate-cake"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
response = requests.get(url, headers=headers)
response
If all is good, you should see the following output -
<Response [200]>
The status code 200 signifies that the request was successfully received, understood and accepted.
Next, the variable data stores the text of the site's HTML document, and the variable soup holds the result of parsing that text with BeautifulSoup.
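More generally, HTTP status codes fall into classes, and it is worth checking the code before parsing anything (a 403, for instance, often means the scraper was blocked). A minimal sketch, using a hypothetical helper classify_status that is not part of the original project:

```python
def classify_status(code):
    """Classify an HTTP status code into its response class (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"        # request received, understood and accepted
    if 300 <= code < 400:
        return "redirect"       # further action is needed to complete the request
    if 400 <= code < 500:
        return "client error"   # e.g. 404 Not Found, 403 Forbidden
    if 500 <= code < 600:
        return "server error"   # the server failed to fulfil a valid request
    return "informational or unknown"

print(classify_status(200))  # prints "success"
```

In practice, calling response.raise_for_status() from requests achieves a similar guard by raising an exception on 4xx/5xx codes.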
data = response.text
soup = BeautifulSoup(data, 'html.parser')
PHASE 1: EXTRACTION
The source of our dataset is a public website, open for everyone to view, and there is no restriction on scraping it. The site lists recipes for various types of chocolate cake, along with ratings, level of difficulty, the time taken to bake each cake and so on. As the first step of attribute extraction, we extract the title of each chocolate cake recipe.
Attribute 1: Title
titles = soup.find_all("h3", {"class": "teaser-item__title"})
for title in titles:
    print(title.text)
The output is as follows -
Easy chocolate cake
Ultimate chocolate cake
Chocolate fridge cake
Chocolate caramel cake
Double chocolate loaf cake
Salted dark chocolate, rye & courgette cake
Dark chocolate & orange cake
Chocolate & sesame loaf cake
Dark chocolate, coconut & passion fruit mousse cake
Easy vegan chocolate cake
Gravity-defying sweetie cake
Chocolate brownie cake
Cookies & cream party cake
Brooklyn blackout cake
Malted chocolate drip cake
Chocolate & lime cake
Lighter Chocolate cake with chocolate icing
Hazelnut latte cake
Seriously rich chocolate cake
Espresso, chocolate & chilli cake with coffee cream
Chocolate meringue Mont Blanc cake
Flourless chocolate & pear cake
Pistachio & milk chocolate squares
Chocolate fudge cake
White & dark chocolate cake
Attribute 2: Description
descriptions = soup.find_all("div", {"class": "teaser-item__text-content"})
for description in descriptions:
    print(description.text)
    print("\n")
The output is as follows -
Master the chocolate cake with an airy, light sponge and rich buttercream filling. It's simple enough for an afternoon tea but special enough for a party too...
Indulge yourself with this ultimate chocolate cake recipe that is beautifully moist, rich and fudgy. Perfect for a celebration or an afternoon tea
Sophisticated and intensely chocolatey with a hit of juicy raisins and crunchy biscuit pieces, serve this decadent chocolate fridge cake as a teatime treat
This rich cake makes an impressive centrepiece for any summer tea party
Chocolate and cake are two of our favourite things, so what's not to love about this indulgent cake?
Not only are these chocolate squares a great way to use up a glut of courgettes, they're topped with rye crumbs and sea salt for a chocolate cake with a twist...
A dense, dark and devilishly delicious cake, this will be gone before you know it
Tahini gives this cake a subtle nutty flavour that partners perfectly with rich chocolate. This recipe makes 2 cakes, so why not freeze one as a treat for another day!
Sandwich our best ever chocolate sponge with a fruity coconut mousse and smother in a rich ganache for a stunning party centrepiece
This indulgent, fudgy vegan bake is topped with a rich frosting – you'd never guess that it's free from dairy, eggs, wheat and nuts...
Anti-gravity cakes are this year's must-have trend for birthdays or parties - this stunning Smartie cake will delight kids and grown-ups alike...
The perfect cake for brownie lovers, try a big slice warm with some ice cream
Stack up our best ever chocolate sponges and smother in an Oreo biscuit icing to make this towering celebration cake
This rich, dark sponge is filled and coated with a thick chocolate custard, then finished with crumbled cake - best eaten chilled
This layered drip cake is smothered in cream cheese icing and drizzled with a dark chocolate ganache - a stunning centrepiece for a big occasion
Give chocolate cake a zesty lift with lime buttercream filling, chocolate and lime icing and candied zest to decorate – an irresistible flavour pairing...
You'd never guess that this rich-tasting, light-textured sponge is lower in fat, calories and sugar thanks to ground almonds and yogurt
Give coffee cake the ultimate upgrade - layer with chocolate hazelnut frosting, add a creamy topping and sprinkle with crushed nuts
Dark, rich and delicious - the perfect dessert
The warmth of coffee and chilli combine beautifully with chocolate in this impressive dinner party dessert, finish with double cream
This towering cake has layers of chocolate sponge, meringue and chestnut cream - a stunning centrepiece for a festive dinner party
A light, but luscious cake made with hazelnuts - serve a slice for dessert with a dollop of crème fraîche
Try something different with Sarah Cook's Pistachio & milk chocolate squares - you won't be able to stop at just one piece!...
You just can't beat a good chocolate cake, and this one is guaranteed to disappear as soon as you cut the first slice...
For big celebrations this cake is a must, four layers of moist sponge, lashings of chocolate ganache and the crunch of Maltesers
Attribute 3: Duration
durations = soup.find_all('li', {'class': 'teaser-item__info-item--total-time'})
for duration in durations:
    print(duration.text)
    print("\n")
The output -
55 mins
2 hours and 10 mins
20 mins
50 mins
1 hour and 20 mins
1 hour and 30 mins
2 hours and 10 mins
1 hour and 20 mins
1 hour and 30 mins
55 mins
1 hour and 35 mins
1 hour and 5 mins
1 hour and 50 mins
1 hour and 10 mins
3 hours
1 hour and 50 mins
50 mins
1 hour and 55 mins
1 hour and 5 mins
1 hour and 5 mins
2 hours and 10 mins
1 hour
1 hour and 20 mins
1 hour and 20 mins
1 hour
Attribute 4: Ratings
ratings = soup.find_all('div', {'class': 'star'})
for rating in ratings:
    print(rating.text)
The output -
4.663045
4.705535
4
4.37931
4.74312
4
4.528735
3.666665
4.666665
4.357145
4.80645
4.62234
4.8125
4.8
4.916665
5
3.555555
3.72222
4.05
5
5
4.70513
4.51852
4.223075
4.41818
Attribute 5: Difficulty
difficulties = soup.find_all('li', {'class': 'teaser-item__info-item teaser-item__info-item--skill-level'})
for difficulty in difficulties:
    print(difficulty.text)
And, the output -
Easy
Easy
Easy
Easy
Easy
Easy
Easy
Easy
A challenge
Easy
More effort
Easy
More effort
More effort
A challenge
Easy
Easy
A challenge
More effort
Easy
More effort
More effort
Easy
Easy
More effort
Attribute 6: Links
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
The output of this is pretty lengthy as there are A LOT of links in the site. Hence, we have not included it.
Attribute 7: Food Type
vegetarians = soup.find_all('li', {'class': 'teaser-item__info-item--vegetarian'})
for vegetarian in vegetarians:
    print(vegetarian.text)
Note that find_all only returns the tags it actually finds, so this loop can only ever print "Vegetarian" — an if/else fallback here would never reach its else branch. Non-vegetarian recipes are marked "N/A" later, during the per-recipe extraction.
The output is as follows -
Vegetarian
Vegetarian
Vegetarian
Vegetarian
Vegetarian
Vegetarian
For the final step of attribute extraction, we combine all the extracted attributes into a pandas dataframe.
recipes = soup.find_all("article", {"class": "node-recipe"})
rep_no = 0
recipes_list = {}
for recipe in recipes:
    title = recipe.find("h3", {"class": "teaser-item__title"}).text
    link = recipe.find("a").get('href')
    abs_url = "https://www.bbcgoodfood.com" + link
    description = recipe.find("div", {"class": "teaser-item__text-content"}).text
    ratings = recipe.find('div', {'class': 'star'}).text
    duration = recipe.find('li', {'class': 'teaser-item__info-item--total-time'}).text
    vegetarian_tag = recipe.find('li', {'class': 'teaser-item__info-item--vegetarian'})
    vegetarian = vegetarian_tag.text if vegetarian_tag else "N/A"
    difficulties = recipe.find('li', {'class': 'teaser-item__info-item teaser-item__info-item--skill-level'}).text
    rep_no += 1
    recipes_list[rep_no] = [title, description, abs_url, ratings, vegetarian, duration, difficulties]

recipes_df = pd.DataFrame.from_dict(recipes_list, orient="index", columns=['Recipe Title', 'Description', 'Link', 'Ratings', 'Food Type', 'Duration', 'Difficulty'])
recipes_df.head()
The output is a nice-looking dataframe that consists of the recipes with their respective extracted attributes.
PHASE 2: TRANSFORMATION
There are a few data processing techniques that we used for this data, generally focused on data formatting and handling missing values. This phase ensures that the data is reliable enough for further analysis and visualization. Before we start the pre-processing steps, we need to answer these questions:
- What are the features of the dataset?
- What are the expected types (int, float, string, boolean)?
- Is there any missing data?
We can identify these points by looking at our data. Let’s take a look:
From this, we can answer our first question which is the features of the dataset. These are the features we got:
- Recipe Title (Title of the recipe)
- Description (Recipe Description)
- Link (Absolute URL link to each recipe)
- Ratings (Recipe ratings from 0 to 5)
- Food Type (whether it is a Vegetarian food or not)
- Duration (Recipe duration)
- Difficulty (Skill level: Easy, More Effort, A Challenge)
Next, what are the expected data types?
- Recipe Title: String
- Description: String
- Link: String
- Ratings: float or double
- Food Type: String
- Duration: String
- Difficulty: String
Besides, there is only one column with missing data:
- Food Type
After observing these points, it is much clearer which cleaning processes are relevant for this data. So, let's start cleaning!
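The observations above can also be checked programmatically. A minimal sketch, using a hypothetical two-row stand-in for the scraped dataframe (the real recipes_df would be inspected the same way):

```python
import pandas as pd

# Hypothetical miniature of the scraped dataframe, for illustration only
sample = pd.DataFrame({
    'Recipe Title': ['Easy chocolate cake', 'Ultimate chocolate cake'],
    'Ratings': ['4.663045', '4.705535'],   # scraped as text, not float
    'Food Type': ['Vegetarian', 'N/A'],
})

# Dtypes reveal that everything arrived as object (string), including Ratings
print(sample.dtypes)

# 'N/A' strings are not real NaNs, so pandas' isna() will not see them;
# count the placeholder values explicitly instead
print((sample == 'N/A').sum())
```

This kind of check is why the "N/A" placeholders in "Food Type" have to be replaced by hand rather than with dropna() or fillna().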
Cleaning Task 1: Handling Missing Values
Based on the data that we scraped, the “Food Type” column contains missing values, which we need to fill with meaningful values. The column has only one category, Vegetarian, and the rest are null values, so we decided to replace all the null values with “Non Vegetarian”.
Before:
recipes_df['Food Type'].head()
The output -
1 Vegetarian
2 N/A
3 N/A
4 N/A
5 N/A
Name: Food Type, dtype: object
Visualization:
The missing values category is stated as N/A.
After:
Replace null values with “Non Vegetarian”.
recipes_df['Food Type'] = recipes_df['Food Type'].replace({'N/A': 'Non Vegetarian'})
recipes_df['Food Type'].head()
And, the output -
1 Vegetarian
2 Non Vegetarian
3 Non Vegetarian
4 Non Vegetarian
5 Non Vegetarian
Name: Food Type, dtype: object
Visualization:
Create a bar chart of the value counts to see the column's distribution.
%matplotlib inline
import matplotlib.pyplot as plt

recipes_df['Food Type'].value_counts().plot(kind='bar', figsize=(10, 6), color='lightgreen')
plt.title('Histogram of Recipes Food Type')
plt.xlabel('Food Type')
plt.ylabel('Number of recipes')
Cleaning Task 2: String types to numerical categories
In the “Difficulty” column, the categorical values are strings, so we decided to convert them into numerical data. This makes the visualization step much easier.
Before:
recipes_df['Difficulty'].head(15)
The output of this was -
1 Easy
2 Easy
3 Easy
4 Easy
5 Easy
6 Easy
7 Easy
8 Easy
9 A challenge
10 Easy
11 More effort
12 Easy
13 More effort
14 More effort
15 A challenge
Name: Difficulty, dtype: object
Visualization:
After:
recipes_df['Difficulty'] = recipes_df['Difficulty'].replace(to_replace=['Easy', 'More effort', 'A challenge'], value=[1, 2, 3], regex=True)
recipes_df['Difficulty'].head(9)
The changed output is as follows -
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 3
Visualization:
%matplotlib inline
import matplotlib.pyplot as plt

recipes_df.plot(kind='bar', x='Recipe Title', y='Difficulty', figsize=(10, 6), color='lightblue')
plt.title('Level Of Difficulties For Each Of The Recipe')
plt.xlabel('Recipe Title')
plt.ylabel('Level of Difficulty')
plt.show()
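As a side note, an explicit mapping is a common alternative to replace with regex=True: it avoids accidental partial matches, and any unexpected label shows up as NaN instead of slipping through. A minimal sketch on a hypothetical sample of the column:

```python
import pandas as pd

# Hypothetical sample of the scraped Difficulty column
difficulty = pd.Series(['Easy', 'More effort', 'A challenge', 'Easy'])

# Explicit label-to-level mapping; labels missing from the dict become NaN
levels = {'Easy': 1, 'More effort': 2, 'A challenge': 3}
print(difficulty.map(levels))  # 1, 2, 3, 1
```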
Cleaning Task 3: Change Letters to Lowercase
This is done to ensure that the naming convention of the data is standardized, without mixed-case values in the dataset. We chose to lowercase the values of the “Recipe Title” and “Description” columns.
Before:
Column — Recipe Title
recipes_df['Recipe Title'].head()
The first five rows of data look like this -
1 Easy chocolate cake
2 Ultimate chocolate cake
3 Chocolate fridge cake
4 Chocolate caramel cake
5 Double chocolate loaf cake
Name: Recipe Title, dtype: object
Column — Description
recipes_df['Description'].head()
The output -
1 Master the chocolate cake with an airy, light ...
2 Indulge yourself with this ultimate chocolate ...
3 Sophisticated and intensely chocolatey with a ...
4 This rich cake makes an impressive centrepiece...
5 Chocolate and cake are two of our favourite th...
Name: Description, dtype: object
After:
Column — Recipe Title
recipes_df['Recipe Title'] = recipes_df['Recipe Title'].str.lower()
recipes_df['Recipe Title'].head()
The changed output-
1 easy chocolate cake
2 ultimate chocolate cake
3 chocolate fridge cake
4 chocolate caramel cake
5 double chocolate loaf cake
Name: Recipe Title, dtype: object
Column — Description
recipes_df['Description'] = recipes_df['Description'].str.lower()
recipes_df['Description'].head()
The changed output-
1 master the chocolate cake with an airy, light ...
2 indulge yourself with this ultimate chocolate ...
3 sophisticated and intensely chocolatey with a ...
4 this rich cake makes an impressive centrepiece...
5 chocolate and cake are two of our favourite th...
Name: Description, dtype: object
Cleaning Task 4: String time frame into numeric types
We decided to convert the “Duration” column from strings into numeric types. The time will be formatted into minutes in order to standardize it and ease the process of analysing how long each recipe takes to complete.
Before:
Column — Duration
recipes_df['Duration'].head()
The output —
1 55 mins
2 2 hours and 10 mins
3 20 mins
4 50 mins
5 1 hour and 20 mins
Name: Duration, dtype: object
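Incidentally, these “Duration” strings could also be converted directly with a regular expression, without touching the site again. A minimal sketch; the helper duration_to_minutes is ours, not part of the original project, which re-scrapes the hour and minute spans instead:

```python
import re

def duration_to_minutes(text):
    """Convert strings like '2 hours and 10 mins' or '55 mins' to total minutes."""
    hours = re.search(r'(\d+)\s*hour', text)   # matches 'hour' and 'hours'
    mins = re.search(r'(\d+)\s*min', text)     # matches 'min' and 'mins'
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if mins:
        total += int(mins.group(1))
    return total

print(duration_to_minutes('2 hours and 10 mins'))  # prints 130
```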
Now, we scrape the hour and minute values from the site.
hours = soup.find_all('span',{'class':'hours'})
mins = soup.find_all('span',{'class':'mins'})
After:
Create a new column that contains durations in minutes. We named the column “Duration (in Mins)”.
import re

recipes = soup.find_all("article", {"class": "node-recipe"})
rep_no = 0
recipes_list = {}
for recipe in recipes:
    # hours and mins
    hours = recipe.find('span', {'class': 'hours'})
    temp_hours = re.findall(r'\d+', str(hours))
    h = ", ".join(temp_hours)
    h = int(h) if h else 0
    h = h * 60
    mins = recipe.find('span', {'class': 'mins'})
    temp_mins = re.findall(r'\d+', str(mins))
    m = ", ".join(temp_mins)
    m = int(m) if m else 0
    final_duration = h + m
    rep_no += 1
    recipes_list[rep_no] = [final_duration]

recipes_duration_df = pd.DataFrame.from_dict(recipes_list, orient="index", columns=['Duration (in Mins)'])
Add the new column to the dataframe.
recipes_df['Duration (in Mins)'] = recipes_duration_df
recipes_df['Duration (in Mins)'].head()
The expected output -
1     55
2    130
3     20
4     50
5     80
Name: Duration (in Mins), dtype: int64
As for the column named “Duration”, we decided to drop it, since it is now redundant given the new “Duration (in Mins)” column. Hence, the updated recipes_df looks like this -
recipes_df.drop("Duration", axis=1, inplace=True)
recipes_df.head()
PHASE 3: LOAD
For the final process of this scraping and data cleaning session, we store (load) the data in a CSV file and download it. The following line of code achieves this -
recipes_df.to_csv("cleaned_recipes_data.csv")
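A quick round-trip check confirms the file loaded back cleanly. A minimal sketch, using a hypothetical two-row stand-in for recipes_df rather than the real dataframe:

```python
import pandas as pd

# Hypothetical two-row stand-in for recipes_df, just to show the round trip
df = pd.DataFrame({'Recipe Title': ['easy chocolate cake', 'ultimate chocolate cake'],
                   'Duration (in Mins)': [55, 130]}, index=[1, 2])
df.to_csv("cleaned_recipes_data.csv")

# Read it back; index_col=0 restores the saved index column
loaded = pd.read_csv("cleaned_recipes_data.csv", index_col=0)
print(loaded.shape)  # prints (2, 2)
```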
This was it! This wasn’t so bad, was it?
Hope this was informative for you and helped you learn more about doing data analysis with pandas.
The complete code can be found here.
Special thanks to our professor Dr Sabrina, who guided us throughout the subject and my friends Azzan, Hadi and Faris, who were behind this project alongside me.