Getting and curating data: the CraftCans experiment

You see, in this world, there are two kinds of data, my friend: the kind that is easy to obtain because you already have it as a CSV file, and the kind that has to be scraped from a website and curated. In this story, I present the latter with the CraftCans list. We only use Python, BeautifulSoup and Pandas.

This story is mainly inspired by Jean-Nicholas Hould's post (http://www.jeannicholashould.com/python-web-scraping-tutorial-for-craft-beers.html).

Usually, when we need data, especially in Data Science, there are two cases: either the data are already available (on sites such as https://datahub.io/ or https://www.data.gouv.fr/fr/, to name a few), or we need to scrape them from one or more websites. Obviously, the first case is easy and the only thing left to do is to curate the data.

Here, we will scrape the website to retrieve the data, then curate them to improve their quality for further use.

The website is the one proposed by CraftCans (http://www.craftcans.com/), which lists craft beers sold in cans.

CraftCans Website

And if we continue the parallel with The Good, the Bad and the Ugly, the database behind this website can be dubbed the Good. Going to the database, it is possible to see all the entries and, as icing on the cake, the data are stored in a table (http://www.craftcans.com/db.php?search=all&sort=beerid&ord=desc&view=text)

An excerpt of the database can be found below:

Excerpt of the CraftCans database

The different steps to scrape the database and obtain a CSV file are presented below; if you are in a hurry, you can git clone this repository: https://github.com/mphuget/DataScience

Acquiring data

First of all, we do some imports

Import urlopen to access the webpage

In [22]:

from urllib.request import urlopen

Use BeautifulSoup to parse the content of the webpage

In [1]:

from bs4 import BeautifulSoup

Use Pandas to transform the data into a DataFrame; it is then easier to modify the content and, above all, to save the result as CSV (or Excel)

In [3]:

import pandas as pd

We first fetch the webpage based on the URL provided on the website

In [5]:

html = urlopen("http://craftcans.com/db.php?search=all&sort=beerid&ord=desc&view=text")

We transform the webpage into something usable with BeautifulSoup

In [6]:

html_soup = BeautifulSoup(html, 'html.parser')

As mentioned before, all the entries are stored between <tr> and </tr> tags. Thanks to BeautifulSoup, we can gather all the tr elements in the webpage

In [7]:

rows = html_soup.findAll("tr")

Let us check how many entries we have

In [8]:

print(len(rows))
2425

But hey! The website says there are 2162 entries (actually there are 2410 entries…). Did I do something wrong? Well, no: do not forget the page contains more than just the database table

Just remember, each beer entry has 8 cells, so we will use this piece of information to filter the rows

In [10]:

nb = 0
for row in rows:
    cells = row.findAll("td")
    if (len(cells) == 8):
        nb = nb + 1
print(nb)
2411

Mmm, much better: we only have 2411 entries whereas there are supposed to be 2410. How can we remove the extra entry without inspecting them all one by one? The entries are not necessarily clean; some cells contain NaN, so we cannot simply require every cell to be a number. Well, except for the beer id. Maybe we can try with that one

A beer id is composed of a number followed by a ‘.’, so we can check whether any entry fails to be a number once the final character is removed

In [18]:

nb = 0
for row in rows:
    cells = row.findAll("td")
    if (len(cells) == 8):
        id = cells[0].text
        id = id[0:len(id)-1]
        try:
            id = int(id)
            nb = nb + 1
        except ValueError:
            print("Ouch!")
print(nb)
Ouch!
2410

YES! 2410 entries and 1 error, which is exactly what we expected. What does the code do?

Initialize a counter for the number of entries

nb = 0

Iterate over the different tr rows, each one stored successively in row

for row in rows:

Find all the td elements in row and save them into cells

cells = row.findAll("td")

Check whether we have 8 cells, meaning this is a beer entry

if (len(cells) == 8):

Store the text of the first cell in the variable id

id = cells[0].text

Remove the last character (1. -> 1)

id = id[0:len(id)-1]

Try to convert it into an integer and, if it succeeds, increment the counter

try:
    id = int(id)
    nb = nb + 1

If this is not a number, we print out a message

except ValueError:
    print("Ouch!")

Then print the number of entries

print(nb)

Now, it is time to store the data in a list named beers

In [19]:

beers = []
for row in rows:
    cells = row.findAll("td")
    if (len(cells) == 8):
        id = cells[0].text
        id = id[0:len(id)-1]
        try:
            id = int(id)
            beer_entry = {
                "id": id,
                "name": cells[1].text,
                "brewery_name": cells[2].text,
                "brewery_location": cells[3].text,
                "style": cells[4].text,
                "size": cells[5].text,
                "abv": cells[6].text,
                "ibu": cells[7].text
            }
            beers.append(beer_entry)
        except ValueError:
            print("Ouch!")
Ouch!

We check that the list indeed has 2410 entries

In [20]:

print(len(beers))
2410

Cool, we have the correct number of entries

Curating data

The next step is to curate the data: remove errors, typos and missing values. And who can help us with that? Yes, Pandas

In [23]:

df = pd.DataFrame(beers)
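Before fixing anything, it is worth checking how Pandas typed the columns. A small sketch (not a cell from the original notebook):

# Everything scraped from the HTML is a string, so apart from id (already
# converted to int above) the columns are stored with the generic object dtype
print(df.dtypes)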

We print out the first lines to check

In [24]:

df.head(5)
    abv brewery_location       brewery_name ibu    id           name    size  \
0  4.5%  Minneapolis, MN  NorthGate Brewing  50  2692   Get Together  16 oz.
1  4.9%  Minneapolis, MN  NorthGate Brewing  26  2691  Maggie's Leap  16 oz.
2  4.8%  Minneapolis, MN  NorthGate Brewing  19  2690     Wall's End  16 oz.
3  6.0%  Minneapolis, MN  NorthGate Brewing  38  2689        Pumpion  16 oz.
4  6.0%  Minneapolis, MN  NorthGate Brewing  25  2688     Stronghold  16 oz.
                style
0        American IPA
1  Milk / Sweet Stout
2   English Brown Ale
3         Pumpkin Ale
4     American Porter

Everything seems to work. Great!

First of all, we need to modify the size content. Each cell contains a value and a measurement unit; a plain number would be enough, provided all sizes are expressed in ounces. So we first check that:

In [25]:

units = df["size"]
units = list(set(units))
print(units)
['12 OZ.', '32 oz.', '12 oz', '19.2 oz.', '24 oz. "Silo Can"', '16 oz. Alumi-Tek®', '16 oz.', '12 oz.', '12 ounce', '8.4 oz.', '16 oz. Alumi-Tek®\t', '12 oz. Slimline', '12 & 16 oz.', '16.9 oz.', '12 oz. ', '16 oz', '24 oz.', '19.2']

Oh my God! It is worse than that: there are several different variants, from the absence of a measurement unit (19.2) to different spellings of oz (OZ., oz., oz, ounce)

In [27]:

unique_unit = []
for unit in units:
    try:
        value = float(unit)
        unique_unit.append(value)
    except ValueError:
        pos = unit.index(" ")
        unique_unit.append(unit[0:pos])
print(list(set(unique_unit)))
['32', '24', '12', 19.2, '8.4', '16.9', '19.2', '16']

So, OK, I am able to retrieve the numerical part of the size; now we create a function and use apply on the DataFrame

In [31]:

def without_ounces(entry):
    try:
        return float(entry)
    except ValueError:
        pos = entry.index(" ")
        return float(entry[0:pos])

In [32]:

df["volume"] = df["size"].apply(without_ounces)

Let us check the first lines

In [33]:

df.head(5)
    abv brewery_location       brewery_name ibu    id           name    size  \
0  4.5%  Minneapolis, MN  NorthGate Brewing  50  2692   Get Together  16 oz.
1  4.9%  Minneapolis, MN  NorthGate Brewing  26  2691  Maggie's Leap  16 oz.
2  4.8%  Minneapolis, MN  NorthGate Brewing  19  2690     Wall's End  16 oz.
3  6.0%  Minneapolis, MN  NorthGate Brewing  38  2689        Pumpion  16 oz.
4  6.0%  Minneapolis, MN  NorthGate Brewing  25  2688     Stronghold  16 oz.
                style  volume
0        American IPA    16.0
1  Milk / Sweet Stout    16.0
2   English Brown Ale    16.0
3         Pumpkin Ale    16.0
4     American Porter    16.0

We removed the ounces notation and the column now holds float numbers, which are easier to work with for analysis
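If the size strings were even messier, a more defensive variant (a sketch, not the function used above) could extract the first number with a regular expression:

import re

def size_to_ounces(entry):
    # Hypothetical alternative to without_ounces: grab the first number
    # appearing in the string, or return None when there is none
    match = re.search(r"\d+(?:\.\d+)?", entry)
    return float(match.group()) if match else None

It would be used exactly the same way, with df["size"].apply(size_to_ounces).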

Time to remove the now-redundant size column from the DataFrame

In [34]:

del df["size"]

In [35]:

df.head(5)
    abv brewery_location       brewery_name ibu    id           name  \
0  4.5%  Minneapolis, MN  NorthGate Brewing  50  2692   Get Together
1  4.9%  Minneapolis, MN  NorthGate Brewing  26  2691  Maggie's Leap
2  4.8%  Minneapolis, MN  NorthGate Brewing  19  2690     Wall's End
3  6.0%  Minneapolis, MN  NorthGate Brewing  38  2689        Pumpion
4  6.0%  Minneapolis, MN  NorthGate Brewing  25  2688     Stronghold
                style  volume
0        American IPA    16.0
1  Milk / Sweet Stout    16.0
2   English Brown Ale    16.0
3         Pumpkin Ale    16.0
4     American Porter    16.0

Another source of problems is the % sign in the abv column

First, we check if % is used on every line

In [36]:

nb = 0
abvs = df["abv"]
for abv in abvs:
    if ('%' in abv):
        nb = nb + 1
print(nb)
2348

2348? Well, that means some entries have no ‘%’. One more reason to curate the data
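Before converting, we can peek at the offending values. A quick sketch (not a cell from the original notebook):

# Hypothetical check: list the distinct abv values that do not contain '%'
print(set(abv for abv in df["abv"] if '%' not in abv))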

We proceed as we did for the size column

In [38]:

def without_percent(entry):
    entry = entry.strip("%")
    try:
        return float(entry)
    except ValueError:
        return None

In [40]:

df["abv"] = df["abv"].apply(without_percent)

In [41]:

df.head(5)
   abv brewery_location       brewery_name ibu    id           name  \
0  4.5  Minneapolis, MN  NorthGate Brewing  50  2692   Get Together
1  4.9  Minneapolis, MN  NorthGate Brewing  26  2691  Maggie's Leap
2  4.8  Minneapolis, MN  NorthGate Brewing  19  2690     Wall's End
3  6.0  Minneapolis, MN  NorthGate Brewing  38  2689        Pumpion
4  6.0  Minneapolis, MN  NorthGate Brewing  25  2688     Stronghold
                style  volume
0        American IPA    16.0
1  Milk / Sweet Stout    16.0
2   English Brown Ale    16.0
3         Pumpkin Ale    16.0
4     American Porter    16.0
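As a side check (again a sketch, output not shown), we can count how many abv values ended up missing after the conversion:

# Rows where without_percent returned None are now NaN in the abv column
print(df["abv"].isnull().sum())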

The last column that needs to be treated as a number is IBU, the International Bitterness Units

In [42]:

def to_int(entry):
    try:
        return int(entry)
    except ValueError:
        return None

In [43]:

df["ibu"] = df["ibu"].apply(to_int)

In [44]:

df.head(5)
   abv brewery_location       brewery_name   ibu    id           name  \
0  4.5  Minneapolis, MN  NorthGate Brewing  50.0  2692   Get Together
1  4.9  Minneapolis, MN  NorthGate Brewing  26.0  2691  Maggie's Leap
2  4.8  Minneapolis, MN  NorthGate Brewing  19.0  2690     Wall's End
3  6.0  Minneapolis, MN  NorthGate Brewing  38.0  2689        Pumpion
4  6.0  Minneapolis, MN  NorthGate Brewing  25.0  2688     Stronghold
                style  volume
0        American IPA    16.0
1  Milk / Sweet Stout    16.0
2   English Brown Ale    16.0
3         Pumpkin Ale    16.0
4     American Porter    16.0

Note that ibu is now displayed as 50.0 rather than 50: because some values are missing (the None returned by to_int becomes NaN), Pandas stores the column as floats. We did well, so maybe it is time to save this DataFrame so as to analyse it later

In [45]:

df.to_csv("beers.csv", sept='\t')

Voila! We now have time for some analysis

Analysing data

Since this dataset does not have a lot to reveal, we just give a few ideas of what could be done

First of all, we load the dataset and store it into a DataFrame

In [50]:

import pandas as pd
craftcans = pd.read_csv("beers.csv", sep=',', encoding="latin1")

We check the first lines

In [51]:

craftcans.head(5)
   Unnamed: 0  abv brewery_location       brewery_name   ibu    id  \
0           0  4.5  Minneapolis, MN  NorthGate Brewing  50.0  2692
1           1  4.9  Minneapolis, MN  NorthGate Brewing  26.0  2691
2           2  4.8  Minneapolis, MN  NorthGate Brewing  19.0  2690
3           3  6.0  Minneapolis, MN  NorthGate Brewing  38.0  2689
4           4  6.0  Minneapolis, MN  NorthGate Brewing  25.0  2688
            name               style  volume
0   Get Together        American IPA    16.0
1  Maggie's Leap  Milk / Sweet Stout    16.0
2     Wall's End   English Brown Ale    16.0
3        Pumpion         Pumpkin Ale    16.0
4     Stronghold     American Porter    16.0

Oops, we have one column without a name, so we need to fix that; while renaming, we also avoid ending up with two columns both called name (one for the brewery, one for the beer)

In [53]:

craftcans.columns = ["id", "abv", "location", "brewery", "ibu", "beer_id", "name", "style", "volume"]

In [54]:

craftcans.head(5)
   id  abv         location            brewery   ibu  beer_id           name  \
0   0  4.5  Minneapolis, MN  NorthGate Brewing  50.0     2692   Get Together
1   1  4.9  Minneapolis, MN  NorthGate Brewing  26.0     2691  Maggie's Leap
2   2  4.8  Minneapolis, MN  NorthGate Brewing  19.0     2690     Wall's End
3   3  6.0  Minneapolis, MN  NorthGate Brewing  38.0     2689        Pumpion
4   4  6.0  Minneapolis, MN  NorthGate Brewing  25.0     2688     Stronghold
                style  volume
0        American IPA    16.0
1  Milk / Sweet Stout    16.0
2   English Brown Ale    16.0
3         Pumpkin Ale    16.0
4     American Porter    16.0
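As an aside, a way to avoid the unnamed column altogether (a sketch, not what the notebook does) is to tell read_csv that the first column of the CSV file is the index:

# Hypothetical alternative: reuse the first CSV column as the DataFrame index,
# so no "Unnamed: 0" column appears and no renaming is needed
craftcans = pd.read_csv("beers.csv", index_col=0, encoding="latin1")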

Let’s play with the IBU column and its range of values; the Series min() and max() methods conveniently skip missing entries

In [60]:

ibus = craftcans["ibu"]
print(ibus.min())
print(ibus.max())
4.0
138.0

Based on Wikipedia’s article on IBU, 4 is quite low and 138 is quite high
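Going a bit further, one possible direction (a sketch using the craftcans DataFrame built above) is to compare the average bitterness and alcohol content of the most common styles:

# Hypothetical further analysis: mean IBU and ABV for the ten most frequent styles
top_styles = craftcans["style"].value_counts().head(10).index
print(craftcans[craftcans["style"].isin(top_styles)]
      .groupby("style")[["ibu", "abv"]].mean())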

I am sure many, many other analyses could be done on this dataset, but the objective was essentially to describe how to scrape a website to fetch data.
