Web Scraping With Python: Scrapy, SQL, Matplotlib To Gain Web Data Insights
I’m here today to give you an example of how you too can use web scraping to gain insights from data that’s available online. I’m gonna be using Scrapy, the web scraping framework, to scrape football stats from the web. Then I will process the scraped data and, finally, make it useful by creating some basic, simple analysis with matplotlib and pandas. Let’s jump right into it!

Now I’m going to show you a comprehensive example of how you can make raw web data useful and interesting using Scrapy, SQL and Matplotlib. It’s really meant to be just an example: there are so many types of data out there and so many ways to analyze them that it really comes down to what is best for you and your business.
Scraping And Analyzing Soccer Data
Briefly, this is the process I’m going to be using now to create this example project:
- Task Zero: Requirements Of Reports
Figuring out what really needs to be done. What are our (business) goals and what reports should we create? What would a proper analysis look like?
- Task One: Data Fields And Source Of Data
Planning ahead what data fields and attributes we’ll need to satisfy the requirements. Also, looking for websites we can get data from.
- Task Two: Scrapy Spiders
Creating scrapers for the website(s) that we’ve chosen in the previous task.
- Task Three: Process Data
Cleaning, standardizing, normalizing, structuring and storing data into a database.
- Task Four: Analyze Data
Creating reports that help you make decisions or help you understand data more.
- Task Five: Conclusions
Drawing conclusions based on the analysis to understand the data better.
Storytime is over. Start working!
Task Zero: Requirements Of Reports
As I said, we’re gonna analyze soccer data. I think, though, that “soccer data” is too big of a scope for us right now, so let’s say we wanna analyze only goals and match results. We consider matches that have been played in the latest soccer season (2016–17). Regarding soccer leagues, it really depends on how good a data source we can find. If we could find a website covering roughly all the European soccer leagues, not just the major ones (Italy, England, Germany, Spain, etc.), that would be the best.
What kind of reports to produce?
This is a key question and must be answered before moving on. We have to know what kind of reports we’re gonna create to know exactly what to scrape. So let’s figure it out!
We’re focusing on goals and results. It would be interesting to see some basic overall reports like:
- Average number of goals
- Average number of 1st half/2nd half goals
- Number of home wins/away wins/draws
- The biggest goal difference between two teams
- The biggest comeback in the 2nd half
- Distribution of average goals along the season
So these are just some ad hoc ideas; we could think of thousands of other ways to analyze football data. We’re gonna scrape data only to be able to produce these reports. Again, without knowing exactly what you want as the end result, writing the scraper is just a waste of time. This task is done. Let’s move on to Task One.
Task One: Source Of Data And Data Fields
We want to find a website which has all the data fields we need. What are our data fields? We can figure them out by having a close look at the report requirements we just wrote. Data fields are the data points our future scraper will extract. Putting the data fields together, we get a Scrapy item, or a record in the database.
Going through the report requirements we will need at least these fields:
- Home goals 1st half
- Home goals 2nd half
- Away goals 1st half
- Away goals 2nd half
- Match date
- League
- Country
Scraping only these fields will be enough to generate some interesting reports about soccer matches in the latest season.
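To make this a bit more concrete, here is roughly how these fields could be laid out as a Scrapy item in items.py. This is just a sketch; the class name MatchItem is my choice (it matches the item we’ll reference in the spider code later), and I’ve also included the team name fields we’ll end up scraping:

import scrapy

class MatchItem(scrapy.Item):
    home_team = scrapy.Field()
    away_team = scrapy.Field()
    home_goals1 = scrapy.Field()  # home goals, 1st half
    home_goals2 = scrapy.Field()  # home goals, 2nd half
    away_goals1 = scrapy.Field()  # away goals, 1st half
    away_goals2 = scrapy.Field()  # away goals, 2nd half
    date = scrapy.Field()         # match date
    league = scrapy.Field()
    country = scrapy.Field()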
Looking for a data source
We are getting closer and closer to being able to start writing our scraper. The next step is to find the source of our raw data, aka a website we can scrape. So let’s do some research!
I simply start off by googling “soccer stats”.

There are a bunch of websites providing soccer stats. We should keep in mind that we’re NOT looking for the fanciest and most modern website. Google usually shows the best sites for the given keywords on its first page, but right now we aren’t searching for the best one, just one that has the data we need and is simple. Simple means that the HTML of the site is nicely structured and relatively clean, and that the page still displays the data fields we need when we visit it with JavaScript disabled, because we don’t wanna do unnecessary JS handling in our scraper. Our goal is to find a website which has all the data, preferably on one page or a few pages. We are focusing on getting our data as simply and as fast as possible.
After about 20–25 minutes of research on Google, I found a website that fully meets the mentioned criteria: http://www.stat-football.com/en/
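As a quick check of the JavaScript criterion mentioned above, we can fetch the raw HTML and look for a value that is visible in the browser. A minimal sketch using the requests library (the search string is just an example taken from the site’s sidebar):

import requests

# If a value visible in the browser also appears in the raw HTML,
# the page is server-rendered and we won't need any JS handling.
html = requests.get("http://www.stat-football.com/en/").text
print "Argentina" in html  # True if the sidebar is in the raw HTML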
Next, have a look at the website’s robots.txt. This file tells our scraper or bot how it should behave on the site. It usually defines rules like disallowing some bots from visiting specific pages, or the minimum delay between two requests. To be ethical, we have to follow the rules defined there. So let’s check our chosen website’s robots.txt (it should be located in the root directory of the server):
User-agent: *
Disallow: /pismo.php
Disallow: /poisk.php
Disallow: /rassylki.php
Disallow: /en/pismo.php
Disallow: /en/poisk.php
Disallow: /pda/pismo.php
Disallow: /pda/rassylki.php
Disallow: /___x/
Disallow: /awstats/
Disallow: /webalizer/

It mentions a few pages that should be avoided. I’m okay with that; we don’t wanna mess with those pages anyway.
This is the perfect kind of website to scrape: consistent HTML, relatively fast to scrape through the pages, no Terms of Use that explicitly prohibit web scraping, a reasonable robots.txt, and no JavaScript/AJAX in the background. That’s what I like to see!
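By the way, Scrapy can enforce the robots.txt rules and throttle our requests for us. Two relevant settings we could put in the project’s settings.py (the one-second delay is just an illustrative choice):

ROBOTSTXT_OBEY = True  # skip URLs disallowed by robots.txt
DOWNLOAD_DELAY = 1     # wait a second between requests to stay polite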
As a side note, in a real-world project you rarely get to choose which website to scrape; usually you have no choice at all. This time we got lucky and found a website which is easy to scrape.
So far, we have formed a clear picture of what we really want as an outcome, and we found a great site to scrape. It took a little time to find it, but it was worth the effort. We can move forward and design our Scrapy spider. Finally!
Task Two: Create Scrapy Spiders
This is what we’re gonna be doing in this task:
- Setting up Scrapy
- Planning the path of our scraper
- Inspecting the website
- Playing around with selectors in Scrapy Shell
- Writing the spider
- Writing cleaning pipelines
- Exporting, saving to SQL database
- Deploying to Scrapy cloud
Setting Up Scrapy
We need to install Scrapy, and we’re gonna set it up inside a virtual environment to make sure our project is somewhat isolated. Let’s create a project directory and set up a virtual environment:
mkdir soccer
cd soccer
virtualenv env
source env/bin/activate

Before installing Scrapy, we have to install its dependencies:
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Finally, install the latest Scrapy (no sudo needed, since we are inside the virtualenv):

pip install scrapy

If you are on another operating system or run into any problems installing Scrapy, see the installation docs here.
Now we have Scrapy installed in our environment. Every Scrapy project has the same file structure, and we can create one with the scrapy startproject command.
scrapy startproject soccer

This way, we created an empty Scrapy project. The file structure looks like this:
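Roughly this, assuming a Scrapy 1.x project template:

soccer/
    scrapy.cfg            # deploy configuration
    soccer/               # the project's Python module
        __init__.py
        items.py          # item definitions (our MatchItem will go here)
        pipelines.py      # item pipelines (cleaning, saving to SQL)
        settings.py       # project settings
        spiders/          # our spiders will live here
            __init__.py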

Inspecting the website
This is the part where we plan the path of our scraper: we figure out the route our scraper should follow and identify the easiest way to get our data fields. As we open the web page, make sure JS is disabled in the browser so we can interact with the page the same way Scrapy does. This is what we can see right on the home page:

There are a bunch of leagues in the sidebar, and we definitely don’t wanna copy-paste the URL of each one. Let’s play around on the site a little bit and see how we can get to the page where our data is laid out.
We could open a league page to see what data is there:

On the league page there’s not much data we are interested in. But at the end of the page you can see there are several other pages containing data about the given league. Let’s try them out and see which page is the best one to fetch data from.

The results page would be great for us, but it doesn’t contain 1st half/2nd half goals data, so it doesn’t fit our needs. The fixtures page shows nothing, probably because it uses JS to load data and I have JS disabled. Never mind, the next page, fixtures-2, is perfect for us. It contains all the matches that have been played in the latest season, along with detailed information about the goals of each match. All of this on one page, so we don’t even have to request too many pages to scrape everything we need.
So far we have figured out which page contains the data we can access most easily. The next step is to inspect the HTML of the pages to write the proper selectors. We’re gonna need multiple selectors:
- League URLs on the sidebar
- Fixtures-2 URL at the bottom of the page
- Data fields in the fixtures table
The first two are for link following purposes. The last one is to fetch actual data.
Playing around with selectors in Scrapy Shell
Scrapy shell is an awesome command line tool for testing your selectors quickly without running your whole spider. We just have to specify a URL, and Scrapy shell gives us some objects we can work with, like response, request, etc. So first of all, let’s figure out the selector for the league page URLs in the sidebar.

Having a look at the inspector of our browser, we could try “ul#menuA > li” as a selector and see if it selects the proper elements. In order to test it, we have to launch scrapy shell and use the response object:
scrapy shell http://www.stat-football.com/en/
...
In [1]: response.css("ul#menuA > li > a").extract()
Out[1]:
[u'<a href="/en/t/arg.php" title="ARGENTINA \u2014 current season, clubs"><b>Argentina</b>\xa0 \xa0</a>',
u'<a href="/en/t/arg10.php" title="Argentina D1 \u2014 current statistics">\xa0 \xa0 \xb7 Argentina D1</a>',
u'<a href="/en/t/aus.php" title="AUSTRIA \u2014 current season, archive, clubs"><b>Austria</b>\xa0 \xa0</a>',
u'<a href="/en/t/aus10.php" title="Austria D1 \u2014 current statistics">\xa0 \xa0 \xb7 Austria D1</a>',
u'<a href="/en/t/aus20.php" title="Austria D2 \u2014 current statistics">\xa0 \xa0 \xb7 Austria D2</a>',
u'<a href="/en/t/blr.php" title="BELARUS \u2014 current season, archive, clubs"><b>Belarus</b>\xa0 \xa0</a>',
[...]

It selects all the URLs in the sidebar, which is not what we want. We don’t need the country URLs, we only need URLs for actual soccer leagues. We have to slightly modify the previous selector to exclude the elements we don’t need:
In [2]: response.css("ul#menuA > li:not(.f11.t0.tar) > a").extract()
Out[2]:
[u'<a href="/en/t/arg10.php" title="Argentina D1 \u2014 current statistics">\xa0 \xa0 \xb7 Argentina D1</a>',
u'<a href="/en/t/aus10.php" title="Austria D1 \u2014 current statistics">\xa0 \xa0 \xb7 Austria D1</a>',
u'<a href="/en/t/aus20.php" title="Austria D2 \u2014 current statistics">\xa0 \xa0 \xb7 Austria D2</a>',
u'<a href="/en/t/blr10.php" title="Belarus D1 \u2014 current statistics">\xa0 \xa0 \xb7 Belarus D1</a>',
u'<a href="/en/t/blr20.php" title="Belarus D2 \u2014 current statistics">\xa0 \xa0 \xb7 Belarus D2</a>',
u'<a href="/en/t/bel10.php" title="Belgium D1 \u2014 current statistics">\xa0 \xa0 \xb7 Belgium D1</a>',
u'<a href="/en/t/bra10.php" title="Brazil D1 \u2014 current statistics">\xa0 \xa0 \xb7 Brazil D1</a>',
[...]

I use the :not CSS selector to exclude the sidebar elements that have the “f11”, “t0” and “tar” classes; those are the country URLs. So this is the full selector for the league URLs: “ul#menuA > li:not(.f11.t0.tar) > a”. I save it to a text file because we will need it in our spider.
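One more note for the spider: there we’ll want the href attributes rather than the whole anchor elements, which Scrapy’s ::attr() pseudo-element handles, something like this (output trimmed to the URLs we saw above):

In [3]: response.css("ul#menuA > li:not(.f11.t0.tar) > a::attr(href)").extract()
Out[3]:
[u'/en/t/arg10.php',
 u'/en/t/aus10.php',
 u'/en/t/aus20.php',
 [...]

The relative URLs can then be made absolute with response.urljoin() before following them.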
Next, we have to figure out a selector for the “fixtures-2” page.

This is exactly the element we need; let’s see the HTML.

To be honest, I had never used or even heard of the nobr HTML tag before, but that doesn’t matter, we can write a selector anyway. As I see it, the element we want to select is consistently the 4th nobr element. So let’s try this selector in Scrapy shell:
scrapy shell http://www.stat-football.com/en/t/eng10.php
[...]
In [1]: response.css("td.im34.p0612 > nobr:nth-of-type(4)").extract_first()
Out[1]: u'<nobr><a href="/en/t/eng10.php?c=29" title="English Premier League \u2014 fixtures results">fixtures-2</a>\xa0 \xb7\xa0</nobr>'

First, it selects the td tag that has the “im34” and “p0612” classes, then the 4th nobr tag inside that td. Also, I use extract_first() and not extract() because I want to return only one element, not a list of elements like previously. As a side note, when you write selectors based on classes and an element has multiple classes, you have to separate them with a dot (.) in the selector, like I just did.
Well, we now have all the selectors needed to follow links to our end destination. Next, let’s figure out how to select the actual data fields we want to scrape. This is an example page our scraper will visit:

First of all, we are going to write the selectors for the easily selectable fields: league and country. We can select the league field using the title right at the top.

Test in scrapy shell:
scrapy shell http://www.stat-football.com/en/t/eng10.php?c=29
[...]
In [1]: response.css("h1.f6.nw::text").extract_first()
Out[1]: u'2016\u20132017 English Premier League (D1)'

It selects the correct title tag. That weird \u2013 is just an escaped en dash character; the selector works properly.
Now let’s write a selector for the country field. I inspected the page a little bit, and probably the easiest way to fetch it is at the bottom of the page. It’s laid out in a separate tag; the selector:
In [2]: response.css("td.im34.p0612.t0.vat > a > strong::text").extract_first()
Out[2]: u'England'

Okay, let’s move on to the fields that are left: home_team, away_team, home_goals1 (first half), home_goals2 (second half), away_goals1, away_goals2.
At first sight, I hoped all the matches would be in separate divs or somehow separated, because in that case it would be easy to scrape each field. I hit up the inspector:

All the stuff is inside one pre tag, which is not cool. Another problem is that the position of the date field is not consistent: sometimes it’s inside a span, sometimes it’s not. Here we need some trickery; simple selectors like the ones we just wrote are not going to be enough now. A possible solution is to iterate over the elements one by one and recognize which item we want to scrape. Let’s hit up Scrapy shell once more. Keep in mind that we want to select all the elements, and one selector is not enough for that. We have to write multiple selectors and put them together to select every element of the table, something like this:
scrapy shell http://www.stat-football.com/en/t/eng30.php?c=29
[...]
In [2]: response.css("td.s14 > pre::text, td.s14 > pre > a::text, td.s14 > pre > span::text").extract()
Out[2]:
[u' 1 06.08.2016 ',
u'Bolton \u2014 Sheffield U ',
u' 1:0 (1:0)\n',
u' 1 06.08.2016 ',
u'Bradford \u2014 Port Vale ',
u' 0:0 (0:0)\n',
u' 1 06.08.2016 ',
u'Bury \u2014 Charlton ',
u' 2:0 (0:0)\n',
u' 1 06.08.2016 ',
u'Millwall \u2014 Oldham ',
[...]

This way we have all the elements of the table in a list. Now we just need to figure out a way to recognize when we should populate a new item. Let’s have a look at the table again:

There’s one thing which is certain: each match has exactly one date. After the date, we’ve got the teams and then the result in the HTML structure. So we should iterate over the list of elements and check whether the element is a date. If it is, it’s time to start populating a new item. This means we will need further data processing, because we cannot scrape each field separately. So this is what I’m trying to do:
elements = response.css("td.s14 > pre::text, td.s14 > pre > a::text, td.s14 > pre > span::text").extract()
for i in range(0, len(elements)-2):
    if ".201" in elements[i]:
        match = MatchItem()
        match["date"] = elements[i]
        # the next element holds both team names; a later processing
        # step will split them apart
        match["home_team"] = elements[i+1]
        match["away_team"] = elements[i+1]
        # the element after that holds the full result string; the
        # per-half goals will be extracted from it later
        match["home_goals1"] = elements[i+2]
        match["home_goals2"] = elements[i+2]
        match["away_goals1"] = elements[i+2]
        match["away_goals2"] = elements[i+2]
        yield match

The elements variable contains all the text shown in the HTML table as a list. The loop goes through the elements (date, teams, result), and I start assigning the fields to a match item when I recognize that the element is a date, by searching for “.201” in it (like 07.12(.201)6, for example). I could use a regex to recognize the date format, but I guess there’s no soccer team in the world with “.201” in its name. When I find the date, I know that the next field contains the teams and the one after that the result.
We can test it in scrapy shell:
scrapy shell http://www.stat-football.com/en/t/eng10.php?c=29
[...]
In [4]: for i in range(0, len(elements)-2):
   ...:     if ".201" in elements[i]:
   ...:         match = {}
   ...:         match["date"] = elements[i]
   ...:         match["home_team"] = elements[i+1]
   ...:         match["away_team"] = elements[i+1]
   ...:         match["home_goals1"] = elements[i+2]
   ...:         match["home_goals2"] = elements[i+2]
   ...:         match["away_goals1"] = elements[i+2]
   ...:         match["away_goals2"] = elements[i+2]
   ...:         print match
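You can see that home_team and away_team (and the four goal fields) temporarily hold the same combined strings. As a preview of the further processing this implies, here is a minimal sketch of how a pipeline could split them apart, assuming the formats we saw in the shell output (u'Bolton \u2014 Sheffield U ' for the teams and u' 2:0 (0:0)\n' for the result, where the parentheses hold the 1st-half score, per the usual convention):

def parse_teams(teams):
    # split on the em dash between the two team names
    home, away = teams.split(u"\u2014")
    return home.strip(), away.strip()

def parse_result(result):
    # u' 2:0 (0:0)\n' -> full-time u'2:0' and 1st-half u'(0:0)'
    full, half = result.split()
    ft_home, ft_away = [int(g) for g in full.split(":")]
    h1_home, h1_away = [int(g) for g in half.strip("()").split(":")]
    return {
        "home_goals1": h1_home,            # 1st-half home goals
        "home_goals2": ft_home - h1_home,  # 2nd-half home goals
        "away_goals1": h1_away,
        "away_goals2": ft_away - h1_away,
    }

print parse_teams(u'Bolton \u2014 Sheffield U ')  # (u'Bolton', u'Sheffield U')
print parse_result(u' 2:0 (0:0)\n')               # both goals fell in the 2nd half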
Writing the spider
We’ve played around with the selectors and figured out the path our scraper will follow. But how will we actually fetch the data? In this section, we’re gonna put all these things together and create a working spider…
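As a sneak peek, here’s a minimal sketch of how the pieces we’ve collected could fit together. The class name and file layout are my choices, and the parsing loop is the one we just tested in the shell, so treat it as a starting point rather than the finished spider:

# spiders/soccer_spider.py -- sketch of the full crawl path
import scrapy

from soccer.items import MatchItem

class SoccerSpider(scrapy.Spider):
    name = "soccer"
    start_urls = ["http://www.stat-football.com/en/"]

    def parse(self, response):
        # follow each league URL in the sidebar (country URLs excluded)
        league_links = "ul#menuA > li:not(.f11.t0.tar) > a::attr(href)"
        for href in response.css(league_links).extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_league)

    def parse_league(self, response):
        # follow the fixtures-2 link at the bottom of the league page
        href = response.css("td.im34.p0612 > nobr:nth-of-type(4) > a::attr(href)").extract_first()
        if href:
            yield scrapy.Request(response.urljoin(href), callback=self.parse_fixtures)

    def parse_fixtures(self, response):
        league = response.css("h1.f6.nw::text").extract_first()
        country = response.css("td.im34.p0612.t0.vat > a > strong::text").extract_first()
        elements = response.css("td.s14 > pre::text, td.s14 > pre > a::text, td.s14 > pre > span::text").extract()
        for i in range(0, len(elements)-2):
            if ".201" in elements[i]:
                match = MatchItem()
                match["date"] = elements[i]
                match["league"] = league
                match["country"] = country
                # teams and per-half goals get split apart in a pipeline
                match["home_team"] = elements[i+1]
                match["away_team"] = elements[i+1]
                match["home_goals1"] = elements[i+2]
                match["home_goals2"] = elements[i+2]
                match["away_goals1"] = elements[i+2]
                match["away_goals2"] = elements[i+2]
                yield match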
