Scrap data with Pandas in Python
Pandas is a fantastic tool to manipulate data. Did you know that it could also help you to scrap some quick data? Indeed, Pandas is not meant to scrap data off the web, but it can help you to get some simple structured data which could be really useful in some cases.
Our question: How many volumes of Dragon Ball have been sold?
Thanks to Pandas’ read_html function we can answer the above question pretty quickly. Pandas’ read_html function is straightforward, it loads a given html input from a given data source, which may be a file or even an URL. The function looks for table tags within the structured HTML data and returns the found tables as an array of DataFrame objects.
If you are not able to visualise the content until the end, I invite you to take a look here to catch-up!
Wikipedia is full of well structured tables for us to use. We will just look for “Dragon Ball” on Wikipedia to get this page : https://en.wikipedia.org/wiki/Dragon_Ball
If we scroll a bit more we can finally find what we need.
So, let’s get coding. First we need to scrap the data.
import pandas
found_data = pandas.read_html("https://en.wikipedia.org/wiki/Dragon_Ball")
print("Number of found tables: " + str(len(found_data)))
Here we find a total of 12 tables. Looking at them with the help of print, we can see that we want to use is at index 1, but the data are not clean. We now need to clean a little bit our data in order to make them useable and pertinent.
First we will remove “Overseas” and “Worldwide” rows as we don’t want to count them in our total. Next we will make a simple function to transform the text in integer.
import re
# Here is the function we are going to use to clean our number.
def formatNumberFromString(search_string):
output = 0
found_string = re.findall(r'[0-9,\,]+', search_string)
if len(found_string) > 0:
output = int("".join(found_string[0].split(",")))
return output# Getting only the two required columns.
db_sales = found_data[1][["Market(s)", "Volume sales/circulation"]]# Cleaning the data
db_sales.drop([1,18], inplace=True)
db_sales.loc[:,"Volume sales/circulation"] = db_sales["Volume sales/circulation"].apply(lambda x: formatNumberFromString(x))
Now that we got our data cleaned, what is left is to begin our analysis to answer the question above.
total_of_volume_sold = db_sales["Volume sales/circulation"].sum()
average_volume_sold_by_region = db_sales["Volume sales/circulation"].mean()
count_of_regions = len(db_sales.index))print("Dragon ball manga have been sold {} times. That's an average of {} by region for a total of {} regions.".format(total_of_volume_sold, average_volume_sold_by_region, count_of_regions)
As a result we get (updated up to 2022–06–14):
Dragon ball manga have been sold 289667054 times. That's an average of 17039238.470588237 by region for a total of 17 regions.
We can conclude that Pandas may not be the best scrapping tool ever, but it gets the job done for simple tasks to get some useful data like ISO country codes, and it is directly useable in our program.
If you enjoyed the article or found it useful, it would be kind of you to support me by following me here (Jonathan Mondaut). More articles are coming very soon!
Below is the full code for the program:
import pandas
import redef formatNumberFromString(search_string):
output = 0
found_string = re.findall(r'[0-9,\,]+', search_string)
if len(found_string) > 0:
output = int("".join(found_string[0].split(",")))
return outputfound_data = pandas.read_html("https://en.wikipedia.org/wiki/Dragon_Ball")
print("Number of found tables: " + str(len(found_data)))db_sales = found_data[1][["Market(s)", "Volume sales/circulation"]]db_sales.drop([1,18], inplace=True)
db_sales.loc[:,"Volume sales/circulation"] = db_sales["Volume sales/circulation"].apply(lambda x: formatNumberFromString(x))total_of_volume_sold = db_sales["Volume sales/circulation"].sum()
average_volume_sold_by_region = db_sales["Volume sales/circulation"].mean()
count_of_regions = len(db_sales.index)print("Dragon ball manga have been sold {} times. That's an average of {} by region for a total of {} regions.".format(total_of_volume_sold, average_volume_sold_by_region, count_of_regions))