Building a League data scraper

Jonathan M
10 min read · Aug 19, 2017


We’ve seen in the previous post that Riot Games’ third-party APIs are a treasure trove of data, with a lot of untapped potential. So, we have a need, and the data to fulfill that need exists. How do we get our hands on it? In this post, I’ll describe the monstrosity I’ve created to fetch and store this sweet match data.

General infrastructure

Our overall service composition

The league data extractor is not a single service but a composition of multiple services, each running inside its own container and orchestrated through a rather crude docker-compose file. In this post, I’ll describe what each service does and how they all work together. But first, what is Docker, and why did I choose to use it?

Docker

To explain what Docker is, we first need to explain what a container is. A container is essentially a piece of software bundled with all the libraries and tools needed to run it. Containers are not VMs, however, as they don’t bundle a full operating system, only the required libraries and settings. This makes them much more lightweight and lets you run many containers on the same machine. A major advantage of containers is that they abstract away all the possible issues related to setting up and running a piece of software in different environments, making the development and production environments much more similar. This allows for greater productivity and maintainability. Containers are also particularly well suited to microservices. With that in mind, Docker is simply a very good container manager and platform.
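To make that composition concrete, here is a minimal sketch of what the docker-compose file could look like. The services are the ones described in the rest of this post, but the image versions, build contexts, and ports are illustrative, not the actual file:

```yaml
version: "2"

services:
  maokai:                    # the Postgres database
    image: postgres:9.6
    environment:
      POSTGRES_DB: lda
  redis:                     # command/report channels for Elise
    image: redis:3
  ezreal:                    # the rate limiter
    build: ./ezreal
  elise:                     # the API crawler
    build: ./elise
    depends_on: [maokai, redis, ezreal]
  rakan:                     # reporting backend
    build: ./rakan
    depends_on: [maokai, redis]
  xayah:                     # reporting frontend
    build: ./xayah
    ports:
      - "3000:3000"
```

The whole thing then comes up with a single docker-compose up.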

Now, onto our services!

Maokai, the sturdy DB that absorbs everything you send to it, and its little saplings.

Maokai

Maokai is just a Postgres DB with 8 tables (summoner, match, participant, bans, teams, frame, participant_frame, and event). The most boring DB, really. It does not even store all of the match data; we’re missing out on runes, masteries, and other global stats associated with participants in a match object, but that’s in the works for after the rune rework.

LDA database schema (made with dbdesigner.net)

There’s a lot that could be improved in this DB: quite a few values are enums that could be optimized, most column types could be tightened, the event table could/should be split into different tables depending on the event type, etc. The mess that is the event table spawned so many partial indexes that it’s almost comical.
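To give an idea of what I mean by partial indexes, here is the flavor of thing the event table accumulates; the column and event-type names below are stand-ins, not necessarily the real schema:

```sql
-- One narrow index per event type a query cares about, instead of a
-- single bloated index over the whole event table.
CREATE INDEX event_kills_idx
  ON event (match_id, timestamp)
  WHERE type = 'CHAMPION_KILL';

CREATE INDEX event_wards_idx
  ON event (match_id, participant_id)
  WHERE type = 'WARD_PLACED';
```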

However, the most interesting part of this schema is not how shoddily it is designed, but rather the lack of relationships between the summoner table and the other tables; it stands alone, disconnected from the rest. The reason is simple: privacy. The statistics I planned on computing were relatively intrusive (e.g. jungle pathing), and I did not feel comfortable knowing whose account was associated with the data. Note that if you plan on doing something similar but want the relationship, it is as simple as adding an account_id column to the participant table; the one-liner is sketched below.
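Assuming the participant table described above, the change would look something like this; the foreign key part is an assumption on my part, since it only works if summoner has a unique account_id column:

```sql
-- Re-link match participants to summoners; this deliberately undoes
-- the privacy property described above.
ALTER TABLE participant ADD COLUMN account_id bigint;
-- Optionally enforce the relationship (assumes summoner.account_id is unique):
-- ALTER TABLE participant
--   ADD CONSTRAINT participant_account_fk
--   FOREIGN KEY (account_id) REFERENCES summoner (account_id);
```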

Finally, Maokai can be initialized with partial data from a CSV file called seeds.csv, which populates the summoner table with summonerIds and their associated platformIds (aka EUW1, NA1, LA1, LA2, KR, etc.). The rationale was to simplify the logic of the service in charge of fetching data from the API by standardizing its initialization: with summoners already present in the database, the data-fetching service can behave as if it is resuming an import instead of having a special case for an empty table. It also lets the user arbitrarily define which summoners the service should get its initial match data from. I chose to use a snapshot of all challenger players in all regions, mostly because it is the only reliable list of users that one can get without prior knowledge of the identity of said users.
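For illustration, seeds.csv could be as small as this (the IDs are made-up placeholders):

```csv
summonerId,platformId
12345678,EUW1
23456789,NA1
34567890,KR
```

Loading it at init time is then a single Postgres COPY; the column names and the file path (the postgres image’s init directory) are assumptions:

```sql
COPY summoner (summoner_id, platform_id)
FROM '/docker-entrypoint-initdb.d/seeds.csv'
WITH (FORMAT csv, HEADER true);
```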

Tech Stack

  • Docker
  • Postgres 9.6
  • CSV?
Ezreal, the Edge Service Rate Limiter

Ezreal

The APIs that Riot Games exposes are rate-limited, meaning that you can only send so many requests within so much time (e.g. 100 API calls per 2 minutes). Going above that limit will cause the endpoints to return 429 errors, and too many of these will get your project banned from using the APIs. So we need a way to respect these limits before we hit them. This is the role of Ezreal, this project’s counterpart to the ESRL (Edge Service Rate Limiter) that Riot uses internally for its own rate limiting.

The rate limiting of the Riot API is actually not so simple: you have both an app rate limit, which applies to every single API call, and a method rate limit, whose value differs depending on which resource you want to get. For instance, the rate limit for match data is 500 requests per 10 seconds, whereas the rate limit for summoner data is 20,000 requests per 10 seconds. Also, these limits are applied per endpoint, which means that you could hit 40k summoners in a 10s window if 20k are in EUW1 and the other 20k are in NA1.

The algorithm used by Riot for ESRL is described as similar to a leaky bucket algorithm, but its intricacies are not public. Not knowing the exact behavior, I settled for a set of hierarchical leaky buckets, which should resemble the original algorithm. I also reduced the maximum of each limit by 5–10% to account for the potential lag/jitter between the moment a token is requested from Ezreal and the moment it is used to make a request to the APIs.
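Here is a minimal sketch of that idea in Node; it is illustrative, not the actual Ezreal code. Each bucket drains on its own window, a token is granted only if both the app-level and the method-level bucket have room, and capacities are scaled down for the safety margin. The limit values are the ones quoted above:

```javascript
// leakybucket.js -- an illustrative sketch, not the actual Ezreal code.
const SAFETY = 0.9; // keep 5-10% headroom for lag/jitter between grant and API call

class LeakyBucket {
  constructor(limit, windowMs) {
    this.capacity = Math.floor(limit * SAFETY);
    this.used = 0;
    // Crude leak: drain the bucket once per window. A real leaky bucket
    // drains continuously, but this is enough to stay under the cap.
    setInterval(() => { this.used = 0; }, windowMs).unref();
  }
  hasRoom() { return this.used < this.capacity; }
  take() { this.used += 1; }
}

// Hierarchical composition: a token is granted only if the app-level
// bucket AND the relevant method-level bucket both have room.
class RateLimiter {
  constructor() {
    this.app = new LeakyBucket(100, 120000); // e.g. 100 calls / 2 min
    this.methods = {
      match: new LeakyBucket(500, 10000),      // 500 calls / 10 s
      summoner: new LeakyBucket(20000, 10000), // 20,000 calls / 10 s
    };
  }
  tryTake(method) {
    const bucket = this.methods[method];
    if (!bucket || !bucket.hasRoom() || !this.app.hasRoom()) return false;
    bucket.take();
    this.app.take();
    return true;
  }
}

module.exports = { LeakyBucket, RateLimiter };
```

Since the limits are per endpoint, in practice you would keep one such hierarchy per region, as the next sketch does.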

Since this limiter has a small set of clients (the crawlers) making frequent queries, I chose to implement Ezreal as a WebSocket server with long-lived sockets on which clients request tokens, the server replying on the socket only once a token is available for the request.
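A sketch of that server using the ws package; again illustrative, with a made-up port and message shape:

```javascript
// ezreal.js -- illustrative sketch of the token server, not the actual code.
const WebSocket = require('ws');
const { RateLimiter } = require('./leakybucket'); // the sketch above

const limiters = {}; // one hierarchy of buckets per region/endpoint
const pending = [];  // queued token requests

const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (socket) => {
  // Clients keep this socket open and send one message per token request,
  // e.g. { id: 42, region: 'EUW1', method: 'match' }.
  socket.on('message', (raw) => {
    const { id, region, method } = JSON.parse(raw);
    pending.push({ socket, id, region, method });
  });
});

// Try to satisfy queued requests; the reply is only sent once a token is
// available, so a client can fire its API call the moment it hears back.
setInterval(() => {
  for (let i = 0; i < pending.length; ) {
    const { socket, id, region, method } = pending[i];
    limiters[region] = limiters[region] || new RateLimiter();
    if (limiters[region].tryTake(method)) {
      pending.splice(i, 1);
      if (socket.readyState === WebSocket.OPEN) {
        socket.send(JSON.stringify({ id, granted: true }));
      }
    } else {
      i += 1;
    }
  }
}, 50);
```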

Tech Stack

Elise, our friendly API crawler

Elise

So we have our database service, ready to weather everything (well, almost) that we throw at it, and our rate-limiting service. Now we need a way to import the data from Riot into our DB. And this is where Elise comes into play.

In this project, all we really care about is match data. We don’t care about summoners or any of their related objects (e.g. rune pages), nor do we care about tournaments, etc. So if we could import only match objects, it would be perfect. However, the only way to import a match is to know its matchId beforehand.

How do we get matchIds?

A first solution is to simply know one matchId, import the data for that match, then increment the matchId by 1 and check whether it also yields a match. A major advantage of this solution is that it is theoretically extremely efficient in terms of API calls (we are only using our API tokens for match data). Additionally, the matches we get should follow the same distribution as the player base; so, if 10% of players are plat+, they should be present in about 10% of matches. This would be the holy grail for data science, as there would be no selection bias. However, this method is unreliable at best. There is no guarantee that the incremented matchId will actually yield a match, and a short practice run put the error rate at around 50%. As one might imagine, this error rate would be frowned upon by Riot and could result in your API key being revoked, so don’t try this.

The second solution is to know the accountId of a player, fetch the matchIds from that player’s recent match history, and then import all the corresponding matches. But now the question becomes: how do we get accountIds? Well, from the matches themselves. Whenever we import a match, we can also save its 10 accountIds, some of which we may already have seen, but the vast majority should be newly discovered summoners. So here we have our simple import strategy, sketched in code after the list:

  1. import player data if needed using the summoner API
  2. import matchIds from the recent match history using the match list API
  3. for each new matchId, import the corresponding match
  4. for each match, save the account data of each participant
  5. start over with one of the new accounts.
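In (heavily simplified) code, one iteration of that strategy might look like this. It is a sketch: riotApi and db are hypothetical wrappers around the Riot endpoints and Maokai, and the field names only approximate the v3 match object:

```javascript
// One crawl iteration -- an illustrative sketch, not the actual spiderling code.
async function crawlOnce(riotApi, db, region) {
  // 1-2. claim a summoner we haven't processed yet, import their account
  // data, and list the matchIds from their recent match history
  const summoner = await db.claimUncrawledSummoner(region);
  const account = await riotApi.getAccount(region, summoner.summonerId);
  const matchIds = await riotApi.getRecentMatchIds(region, account.accountId);

  // 3-4. import every match we haven't seen, saving each participant's
  // account data along the way
  for (const matchId of matchIds) {
    if (await db.hasMatch(region, matchId)) continue;
    const match = await riotApi.getMatch(region, matchId);
    await db.insertMatch(region, match);
    for (const identity of match.participantIdentities) {
      await db.upsertSummoner(region, identity.player); // mostly new summoners
    }
  }
  // 5. the caller loops, starting over with one of the new summoners
}
```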
Elise and her spiderlings. (images taken from universe.leagueoflegends.com)

Regions and Spiderlings

We have now defined our import strategy and can get to work. In theory, this is enough to build our crawler. In practice, there’s an additional factor to take into account: Riot’s API data is split across regions. For instance, (nearly) all the data associated with the EU West server is stored behind the EUW1 endpoint. You can’t get KR data from the EUW1 endpoint, and vice versa. So if our initial data is from, say, the KR challenger queue, we will (almost) only ever get KR data. Additionally, we may only want to import data from some endpoints rather than all of them, or we may want to stop crawling a region while its servers have stability issues.

This is where spiderlings come in. A spiderling is basically an implementation of our little crawler logic for a single region. Elise is then simply in charge of spawning/killing these little spiderlings, one per supported region, based on whatever commands she receives.
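Elise’s supervisor role then boils down to a start/stop map, something like this sketch (crawlOnce is the loop above; reportError is shown in the next section; all names are illustrative):

```javascript
// elise.js -- supervisor sketch, not the actual code.
const spiderlings = new Map(); // region -> handle

function startSpiderling(region) {
  let alive = true;
  (async () => {
    while (alive) {
      try {
        await crawlOnce(riotApi, db, region);
      } catch (err) {
        reportError(region, err); // e.g. publish on Redis
      }
    }
  })();
  return { stop() { alive = false; } };
}

function handleCommand(cmd) {
  // e.g. { type: 'start', region: 'KR' } or { type: 'stop', region: 'KR' }
  if (cmd.type === 'start' && !spiderlings.has(cmd.region)) {
    spiderlings.set(cmd.region, startSpiderling(cmd.region));
  } else if (cmd.type === 'stop' && spiderlings.has(cmd.region)) {
    spiderlings.get(cmd.region).stop();
    spiderlings.delete(cmd.region);
  }
}
```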

Wtf, 2 Elises?

Elise’s code allows concurrency! You can spawn multiple Elise containers, each with their ten or so spiderlings, and there should not be any concurrency issues when interacting with Maokai. Spiderlings have been designed to be stateless, so there is no risk of conflicting state between n spiderlings all working on the same endpoint. Whenever a spiderling obtains data from the DB that it will use to fetch something (like a summoner or a match), it also updates that row to remove it from the pool of available summoners/matches. So, if you want to scale™, you can.
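The row-claiming trick is worth spelling out: in Postgres it can be done in one atomic statement, so two spiderlings can never grab the same row. This is a sketch of what claimUncrawledSummoner could do, with an assumed crawled flag on the summoner table:

```sql
-- Claim one uncrawled summoner and remove it from the pool in a
-- single statement.
UPDATE summoner
SET    crawled = true
WHERE  summoner_id = (
  SELECT summoner_id
  FROM   summoner
  WHERE  crawled = false
  AND    platform_id = 'EUW1'
  LIMIT  1
  FOR UPDATE SKIP LOCKED  -- available since Postgres 9.5
)
RETURNING summoner_id, platform_id;
```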

Sending commands to Elise

To make things more interesting, Elise subscribes to a Redis instance on a command channel, which acts as a mask for the possibly many Elise instances. Elise instances can also publish the issues they encounter on a separate channel. To be honest, the Redis instance is absurdly overkill for this task; it was initially planned for sharing the rate limiter’s state between multiple Ezreal instances (because scale™), but the result wasn’t really satisfactory, so I went back to a single-instance design.
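The wiring for this is only a few lines with node_redis. This is a sketch: the channel names are made up, and handleCommand is the supervisor function sketched earlier:

```javascript
// commands.js -- illustrative pub/sub wiring, not the actual Elise code.
const redis = require('redis');

// A client in subscriber mode can't issue regular commands,
// so we use one connection per direction.
const sub = redis.createClient({ host: 'redis' });
const pub = redis.createClient({ host: 'redis' });

sub.subscribe('elise:commands');
sub.on('message', (channel, message) => {
  handleCommand(JSON.parse(message)); // e.g. { type: 'stop', region: 'KR' }
});

function reportError(region, err) {
  pub.publish('elise:errors', JSON.stringify({ region, error: err.message }));
}
```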

Tech Stack

Xayah and Rakan, our reporting duo

Xayah and Rakan

We have our DB, our rate limiter, and our crawler waiting patiently for orders from a Redis instance. We now need something to issue commands, gather reports from everywhere, and present them nicely. This is where our duo comes in. Rakan is in charge of supporting Xayah in presenting the data to the admin of the crawl. Basically, Rakan is the reporting backend, while Xayah is the reporting frontend.

Rakan intermittently polls Maokai to check the progress on a few key metrics (like the number of imported matches in the DB), gathers error reports from Elise instances through Redis, combines everything, and sends the report to Xayah for display every few seconds (again through a WebSocket server). Rakan could also get health reports from Ezreal, but this was not implemented.
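Condensed, Rakan’s loop could look like the sketch below; the table names, channels, and ports are assumptions, and error handling is omitted:

```javascript
// rakan.js -- illustrative reporting loop, not the actual code.
const { Client } = require('pg');
const redis = require('redis');
const WebSocket = require('ws');

const db = new Client({ host: 'maokai', database: 'lda' });
const sub = redis.createClient({ host: 'redis' });
const wss = new WebSocket.Server({ port: 9090 }); // Xayah connects here

// Collect the error reports pushed by Elise instances between two polls.
const errors = [];
sub.subscribe('elise:errors');
sub.on('message', (channel, message) => errors.push(JSON.parse(message)));

async function report() {
  // Poll Maokai for a key metric, merge in the Redis error reports...
  const { rows } = await db.query('SELECT count(*) AS matches FROM match');
  const payload = JSON.stringify({ matches: rows[0].matches, errors });
  errors.length = 0; // each error is reported once, then dropped
  // ...and push the combined report to every connected Xayah client.
  wss.clients.forEach((client) => {
    if (client.readyState === WebSocket.OPEN) client.send(payload);
  });
}

db.connect().then(() => setInterval(report, 5000));
```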

Xayah v0.1 — not the prettiest dashboard, but it still gets the job done

Xayah is a simplistic React dashboard that displays the data from the reports.
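With React 15 (no hooks yet), that boils down to a class component listening on Rakan’s socket. This is a sketch with a placeholder URL, not the actual component; the payload shape matches the Rakan sketch above:

```jsx
// Dashboard.jsx -- a sketch of the report display, not the actual component.
import React from 'react';

class Dashboard extends React.Component {
  constructor(props) {
    super(props);
    this.state = { matches: 0, errors: [] };
  }
  componentDidMount() {
    // Rakan's WebSocket endpoint; the URL is a placeholder.
    this.socket = new WebSocket('ws://localhost:9090');
    this.socket.onmessage = (event) => this.setState(JSON.parse(event.data));
  }
  componentWillUnmount() {
    this.socket.close();
  }
  render() {
    const { matches, errors } = this.state;
    return (
      <div>
        <h1>Imported matches: {matches}</h1>
        <ul>
          {errors.map((e, i) => <li key={i}>{e.region}: {e.error}</li>)}
        </ul>
      </div>
    );
  }
}

export default Dashboard;
```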

Tech Stack (Rakan)

  • Docker
  • node 7
  • ws + pg + redis

Tech Stack (Xayah)

  • Docker
  • React 15
  • create-react-app (yes a dockerized create-react-app project, because why not?)
Zac, the shapeshifting GraphQL server

Zac (Coming Soon™)

A future step in this project would be to add a GraphQL server to act as a wrapper around Maokai for data science queries. This project was a way for me to experiment with the new and not-so-new shiny toys that have caught my attention in the past few months, and I am definitely not letting this one pass without a try.

Putting it all together, and actually running it

So most of the building blocks have been put together; we can now run the whole thing, and thanks to Docker, it basically works! Using a development API key, I pulled a few GB of data over a couple of hours: over 20k matches, which translates into 200k participants, 600k frames, 6M participant_frames, and 16.8M events to do analysis on, which should be sufficient for a first deep dive. Once I’ve cleaned this up a bit, and if it gets the thumbs up from Riot, I’ll publish the repositories on GitHub and Docker Hub for everyone to play with.

In the next article, we’ll analyze these 20k matches and see if we can learn anything about the meta-game and more specifically about the differences in gold per minute and xp per minute for each role, and the influence of these differences on the early game dynamics. This should be interesting!
