How I created a League of Legends High-Elo database using scrapy

Marcus Ferreira Teixeira
7 min read · Nov 11, 2018


During one particularly boring machine learning class, I came up with the idea of predicting League games based only on the team composition. For that I would need a fairly big database for training, so I started looking into the Riot Games API, but it seemed that I would have to do a lot of work to get what I wanted, not to mention that my number of requests would be limited. So I realized:

If it is on the internet, I can simply crawl it

Given that I already had considerable experience working with scrapy, creating a spider was nothing more than my natural response to this.

So this will be a beginner-friendly guide to the thoughts I had during this (simple) implementation. Let's begin with one question:

Which site should I crawl?

There are a lot of LoL statistics-focused websites with full coverage of the available servers, such as lolskill, lolprofile and opgg. Of those, the one that stood out as the most appropriate was opgg, due to its URL structure (which I will get into in more detail later) and because it has GG in its name. Ohh, I miss the Good Ol' times…

Ok, so first of all, let's install scrapy. Simply go:

$ pip install scrapy

pip will do everything for you, and you’ll be ready to crank that code.

Now let's start the project:

$ scrapy startproject lolgames

Scrapy will automatically generate the files needed for the project in the following structure:
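If everything went fine, the generated layout should look roughly like this (it can vary slightly between scrapy versions):

lolgames/
    scrapy.cfg
    lolgames/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py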

  • On /../spiders we have the spider files; that's where we are going to code the crawling steps and parse the HTML data.
  • items.py is where the items are defined, in this case the games. They are useful because we can crawl any website and still get our games in the same structure.
  • middlewares.py is interesting: here you can define how the HTTP response will be processed and sent to the spider. I won't be covering changes to those middlewares in this article; I'll probably post something exclusively on that later on.
  • pipelines.py is another really useful file that won't be covered here. You can define what scrapy is going to do with your item after scraping it, which includes validation, noise reduction, duplicate checking and even sending it to a database like MongoDB.
  • settings.py is where you define your settings (really unexpected, right?). You can define your pipeline priority, the middlewares that are going to be used, and even custom settings.
  • scrapy.cfg has the project variables; the most boring file of the project.

You should check the documentation, it's really well written, intuitive and complete.

Now let’s go to the fun part:

The item

First of all, we need items!!!
They are declared using a simple class definition syntax and Field objects. We are going to need the team compositions, the MMR of each game, the server, the timestamp (to remove duplicates) and, of course, the result. That will be:
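A minimal sketch of what that items.py could look like (the field names here are only illustrative, pick whatever reads best to you):

# items.py
import scrapy

class LolgamesItem(scrapy.Item):
    blue_team = scrapy.Field()   # champions on one side of the game
    red_team = scrapy.Field()    # champions on the other side
    mmr = scrapy.Field()         # MMR shown on the parsed player's profile
    server = scrapy.Field()      # br, na, euw, www, ...
    timestamp = scrapy.Field()   # when the game was played, used to deduplicate
    result = scrapy.Field()      # Victory or Defeat, from that player's perspective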

The spider

Really simple. Now we can start dem spiders; let's do:

$ cd lolgames 
$ scrapy genspider opgg op.gg

That will create the basic structure for a spider, which is:
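The generated opgg.py should look more or less like this (the exact template depends on your scrapy version):

# spiders/opgg.py
import scrapy

class OpggSpider(scrapy.Spider):
    name = 'opgg'
    allowed_domains = ['op.gg']
    start_urls = ['http://op.gg/']

    def parse(self, response):
        pass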

We have our start_urls, which consists of a list of URLs from which the spider starts crawling, and parse, which is, by default, the first function to be called by the framework.

The ladders

Since the goal is to create a high-elo database, we can simply parse the top players' profiles and get their games. For that we would change our start URLs to:

http://www.op.gg/ranking/ladder/

But if we analyze opgg’s URL, we can see that the first part is always the server, as in:

br.op.gg
jp.op.gg
euw.op.gg
na.op.gg
www.op.gg

and so on. With some Python magic we would have:
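Something along these lines should do it (the server list below is not exhaustive; add whichever servers you care about):

# spiders/opgg.py
import scrapy

SERVERS = ['www', 'br', 'jp', 'euw', 'na', 'eune', 'oce', 'las', 'lan', 'ru', 'tr']

class OpggSpider(scrapy.Spider):
    name = 'opgg'
    allowed_domains = ['op.gg']
    # one ladder page per server
    start_urls = ['http://{}.op.gg/ranking/ladder/'.format(server) for server in SERVERS]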

Scrapy is going to create a request for each of those and then parse them using some nice async stuff in the background.

The profiles

Now we should start to think about our parse method.

Just by looking at the website we can guess it is structured like a list, so we can loop through the names and get to their profiles. For that, we have to create an XPath expression that returns the profile links, so we can go to them and do the parsing.

By inspecting the first player's HTML element with Chrome's dev tools (just press F12), you might think that you only have to look for a elements with the ranking-highest__name class, but that would only work for the top 5 players on that page. By taking some time to analyze the page's HTML, or just by inspecting a player icon, you'll see that you can match any a element and still get the link to the profile.

Knowing that, we can just look for links with userName in them, which gives us the expression:

//a//@href[contains(.,"userName")]

But, just to remove the duplicates from the top 5 players, we can exclude the ranking-highest__name class with the following expression:

//a[not(contains(@class, "ranking-highest__name"))]//@href[contains(.,"userName")]

Let’s create a file called constants.py and put it in there. Always remember rule number 1 from the Zen of Python:

1 — Beautiful is better than ugly.

So we’ll have:
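So constants.py starts out with just the profile expression (PROFILE_URLS_XPATH is only my choice of name here):

# constants.py
# Matches every profile link on the ladder page, skipping the duplicated
# "ranking highest" block at the top
PROFILE_URLS_XPATH = '//a[not(contains(@class, "ranking-highest__name"))]//@href[contains(.,"userName")]'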

Don’t forget to import the constants and the item.

The simplest way to visit the profiles would be to yield a request for each one of them, and only then would we have to worry about actually parsing the games. Note that the callback function parse_games is not declared yet, but we are getting to that later on.
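Inside the spider class, a sketch of that parse method (assuming the PROFILE_URLS_XPATH constant from above has been imported with from lolgames import constants):

    def parse(self, response):
        # follow every profile link found on the ladder page
        for href in response.xpath(constants.PROFILE_URLS_XPATH).extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_games)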

Pretty and simple.

The MMR is the easiest one: you could ask your little cousin to point it out to you on the screen, it is right below the player's name:
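Inside the parse_games callback, something like this does it, though the MMR class name below is only a placeholder; grab the real selector from the dev tools:

    # still a sketch: "MMR" is a placeholder class name, inspect the profile page
    mmr = response.xpath('//div[contains(@class, "MMR")]//text()').extract_first()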

The games

The page is structured well enough that it is possible to iterate through the matches and easily extract the timestamp, the game type (so we can filter Ranked games) and the result of the match, from the player's perspective.

There is a picture for each of the champions, from which we can extract their names, and we also have the profile of each player in that game, which we can parse to get more high-elo games.

They are also really well structured, so you’ll only need to iterate through the summoners to get all the champion data.

Let’s put all the xpath expressions in the constants.py file:
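They could end up looking something like this; every selector below is a placeholder built from the same contains(@class, ...) pattern, so double check each class name against the actual match-history HTML:

# constants.py (continued): placeholder class names, verify them with the dev tools
MMR_XPATH = '//div[contains(@class, "MMR")]//text()'
GAMES_XPATH = '//div[contains(@class, "GameItem")]'
GAME_TYPE_XPATH = './/div[contains(@class, "GameType")]/text()'
TIMESTAMP_XPATH = './/div[contains(@class, "TimeStamp")]//span/@data-datetime'
RESULT_XPATH = './/div[contains(@class, "GameResult")]/text()'
CHAMPIONS_XPATH = './/div[contains(@class, "ChampionImage")]//img/@alt'
SUMMONERS_XPATH = './/div[contains(@class, "SummonerName")]//a/@href'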

Keeping it simple, we need to iterate through the champions:
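For each game node, that could look like the snippet below; whether the first five entries really belong to one team and the last five to the other is something to confirm on the page:

    # inside the loop over each game node
    champions = game.xpath(constants.CHAMPIONS_XPATH).extract()
    blue_team, red_team = champions[:5], champions[5:]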

And through the summoners:
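A sketch, using a related flag in the request's meta (explained right below):

    # still inside the loop over each game node
    if not response.meta.get('related'):
        # only go one level deep into other summoners' histories
        for href in game.xpath(constants.SUMMONERS_XPATH).extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_games,
                                 meta={'related': True})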

Note that I chose to add a related field to the response's meta dictionary, so that the program doesn't iterate forever through the summoners of the server. There could also be a numerical variable passed when you call the spider, so that you could control the depth of the parser. But I'll leave that one as “homework” for you.

So now, iterating through the matches to extract the data we need, while only going into the related games once, would require code similar to:
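Putting the pieces together, a sketch of the whole parse_games method, using the placeholder constants and the illustrative LolgamesItem from earlier, might look like this:

# spiders/opgg.py (sketch)
import scrapy

from lolgames import constants
from lolgames.items import LolgamesItem

class OpggSpider(scrapy.Spider):
    # ... name, allowed_domains, start_urls and parse as before ...

    def parse_games(self, response):
        # e.g. http://euw.op.gg/summoner/userName=... -> "euw"
        server = response.url.split('//')[1].split('.')[0]
        mmr = response.xpath(constants.MMR_XPATH).extract_first()

        for game in response.xpath(constants.GAMES_XPATH):
            game_type = game.xpath(constants.GAME_TYPE_XPATH).extract_first(default='')
            if 'Ranked' not in game_type:
                continue  # keep only Ranked games

            champions = game.xpath(constants.CHAMPIONS_XPATH).extract()
            yield LolgamesItem(
                blue_team=champions[:5],
                red_team=champions[5:],
                mmr=mmr,
                server=server,
                timestamp=game.xpath(constants.TIMESTAMP_XPATH).extract_first(),
                result=game.xpath(constants.RESULT_XPATH).extract_first(default='').strip(),
            )

            # visit related summoners only once, so the spider doesn't loop forever
            if not response.meta.get('related'):
                for href in game.xpath(constants.SUMMONERS_XPATH).extract():
                    yield scrapy.Request(response.urljoin(href),
                                         callback=self.parse_games,
                                         meta={'related': True})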

And that’s it. The spider is finished!

Now we have to choose an output for the spider. Since scrapy can already output a CSV file natively, we are going to use that with the -o flag.

$ scrapy crawl opgg -o games.csv

And there you have it!

It takes only about five minutes to parse 8000 games.

Your homework will be removing the duplicates (you can just use the timestamp lol).

Final words

Scrapy makes it really simple to asynchronously scrape data from websites; it is really easy to use and definitely worth the little amount of time you have to put in to learn it. I'm working on other tutorials covering things I used in this project, such as MongoDB pipelines, Celery tasks, OpenCV and Docker. Stay tuned!

The full project can be found in the repository
https://github.com/MarcusDEFGH/loldraft
