Web scraping character stats from the Arknights wiki using Python and Selenium
Arknights is a tower defense mobile game that I have been playing for over a year now. The fans have created multiple tools and portals to aid in each other’s progress. Inspired by them, I decided to start working on a Data Science personal project for the game.
The result I am sharing today is the Python script to web scrape a dataset of character stats. This won’t be a tutorial; rather, it will be a show and tell of sorts, where I explain my thought process and how the code progressed to arrive at the final solution.
If you just want the code, please visit my GitHub repository here.
Goals and challenges to solve
After deciding on my goals for the project, I had to find a dataset. In my case, I was looking for character stats i.e., their Attack, Defense, Resistance, etc. as they level up. I did not find this dataset online, but I did know of a place that recorded this data: the Arknights wiki over on GamePress. So, I came up with a script in Python that goes through the individual character pages, reads their stats and combines everything into a single CSV file.
The trick in scraping the dataset was in interacting with the page. Upon loading a character page (let’s take Phantom’s as an example), you see that by default his stats are shown for level 1 of the Non-Elite promotion level. But I was looking to get all stats, for all levels. In other words, I needed the script to programmatically interact with the arrows next to the character level to level them up, read the stats each time, and then repeat this process for each promotion level (Non-Elite, Elite 1 and Elite 2).
Choosing the libraries
The programming language was already chosen: Python. It is the language I am most comfortable writing web scraping code in, and it is also the language I would use for the remainder of this project. The question was which library (or libraries) to use. I have used Beautiful Soup in the past, but I was aware that it cannot interact with web pages. Whatever it reads at first from the page is all you can get. Don’t get me wrong, Beautiful Soup is incredibly useful, but the lack of interactivity was a deal breaker here.
So I (re)turned to Selenium. I have used it in the past, albeit less than Beautiful Soup, but I had a good grasp of its functionalities and how it would be able to achieve what I wanted. Unlike Beautiful Soup, which reads the HTML, CSS and JavaScript of a page, Selenium can open a new browser window and essentially have a bot run your commands, i.e. interact with the page. It can still read the same contents that Beautiful Soup can, of course, but you can also point to a text box and input your credentials to authenticate into a website, or click buttons as you would manually.
This was exactly what I was looking for. Now that I had the programming language and the libraries chosen, I could start writing code.
The code: step by step
Now I will take you through each challenge I had to solve in order to complete the script. I will assume you are familiar with basic web scraping techniques, such as using the browser inspector to locate HTML elements, but as long as you understand HTML and Python you should be able to follow what’s going on in the code.
We will start at the most basic part: reading the stats of the default character level (Non-Elite level 1), and build from there until we arrive at a script that finds all the existing character pages and extracts all the data in each character page.
Oh, and please take into consideration that Selenium will be running with the default window size, which was about half the size of my screen.
Closing the sticky ad
While this might not have been an issue if I had maximized the browser window or run Selenium in headless mode (without opening the browser window), I think it’s a good idea to share how I solved it.
As Selenium scrolled down the page, a sticky ad popped up on the bottom right corner. The issue was that the ad overlapped with the level up buttons. When Selenium tried to click those buttons it would throw an error because the sticky ad “would receive the click”. Think of it as layers: the ad was on a layer above the level up buttons, and Selenium would click whichever was on top…
You might be thinking “he just told Selenium to close the sticky ad” and that’s absolutely right. The magic was in getting the code to do that.
To that effect, I needed to complete three steps in the code:
- Locate an HTML element to scroll down to
- Scroll down to that element to make the sticky ad appear
- Click the close sticky ad button
Here’s a sketch of how to perform this, assuming Selenium 4 with Chrome; the class names for the scroll target and the ad’s close button are illustrative placeholders rather than the wiki’s exact markup:
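```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://gamepress.gg/arknights/operator/phantom")

# 1. Locate an HTML element further down the page to scroll to
stats_block = driver.find_element(By.CLASS_NAME, "operator-stats-block")

# 2. Scroll down to that element so the sticky ad appears
driver.execute_script("arguments[0].scrollIntoView();", stats_block)

# 3. Wait for the ad's close button to be clickable, then click it
close_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CLASS_NAME, "sticky-ad-close"))
)
close_button.click()
```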
Read data for a single level
With the sticky ad out of the way, I could achieve the most basic task of the process: read the stats for the default character level. This had quite a bit of back and forth because of my own requirements on which data to collect and playing around with the Selenium syntax, but in the final version I am scraping the following:
- Name: character name
- Rarity: character rarity (1–6 stars)
- Class: character class (guard, sniper, defender, etc.)
- Promotion Level: Non-Elite, Elite 1 or Elite 2
- Level
- HP
- Attack
- Defense
- Resistance
- Redeployment Time
- DP Cost
- Block Count
- Attack Interval
- CN Release Date: release date of character in the Chinese server
- Global Release Date: release date of character in the global servers
- Is Limited: is the character available on limited events only?
(those not explained are in-game stats)
Most of this data was straightforward to read. Aside from a lack of IDs on the HTML elements, which meant I had to locate elements by their HTML class, it was mostly a matter of “pick element with class X and read its text”. There were a few trickier fields, like the rarity, which involved counting the number of rarity stars (img elements), and the Is Limited field, which involved looking for the word “limited” in a text box… I also had to read multiple stats as a single string and then split the text, but I’ll get to that in the code snippet.
Again, setting the interactivity aside, it was a straightforward web scraping endeavour. It could have been easier if more HTML elements had IDs, but thankfully the person/people who wrote these pages used class names as if they were IDs, since they weren’t repeated much within each page. In hindsight, it was also good Selenium practice.
Here’s a sketch of this part; the class names are illustrative stand-ins for the wiki’s actual markup. Note it doesn’t include the character name, because that is not read from the page in the final solution, but from the loop that runs the scraping:
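```python
from selenium.webdriver.common.by import By

# Rarity: count the star images (characters have 1 to 6 of them)
stars = driver.find_elements(By.CSS_SELECTOR, ".rarity-cell img")
rarity = len(stars)

# Class (guard, sniper, defender, etc.)
char_class = driver.find_element(By.CLASS_NAME, "profession-title").text

# Availability and release dates: two tables sharing the same class
tables = driver.find_elements(By.CLASS_NAME, "obtain-approach-table")

# Release dates live in the second table, one row per server
release_rows = tables[1].find_elements(By.TAG_NAME, "tr")
cn_release_date = release_rows[0].text
global_release_date = release_rows[1].text

# Is Limited: look for the word "limited" in the Headhunting text
headhunting_text = tables[0].text
is_limited = 1 if "limited" in headhunting_text.lower() else 0
```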
The snippet seems long, but in reality it’s just the same core action repeated multiple times: locate an element by class name (or id where possible) and read its text.
But there are some parts I want to call out. To find the operator rarity, the only way I had was to find all the img elements that exist for the rarity stars and count them. Character rarity ranges from 1 to 6, so characters will have that many star images for rarity.
Character availability and release dates were found in two tables sharing the same HTML class.
I read the release dates first; those are in the second table, with one row for each release (Chinese and Global servers, respectively). The code was identical for each value, but the data was read from a different row, hence the different index. To find out if the character was limited, I had to go to the first HTML table and read the Headhunting text information. If it included the word “limited”, then Is Limited was recorded as 1, else as 0. This condition was written as a one-liner with a ternary operator.
The other character stats (HP, Attack, Block Count, etc.) required string manipulation. These stats were also split into two tables, but the data was deeply nested.
(please ignore the second set of stats; those are for an ability of Phantom, and other characters don’t have this second set)
My solution was to read the text of each table at once, which returned the following for the first table:
HP
769
ATK
215
DEF
144
Each line contains the stat name, followed by its value on the next line. I called the splitlines method on this string to automatically split it at the new line characters. So the string above would generate the list ["HP", "769", "ATK", "215", "DEF", "144"]. This was repeated for both tables of stats.
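Here’s a minimal sketch of that parsing, assuming both stat tables share a class name (a placeholder below) and that their text alternates stat names and values:

```python
from selenium.webdriver.common.by import By

# Both stat tables share a class name (placeholder for the real markup)
stat_tables = driver.find_elements(By.CLASS_NAME, "stat-cell-table")

stats = {}
for table in stat_tables:
    tokens = table.text.splitlines()  # e.g. ["HP", "769", "ATK", "215", ...]
    # Even indexes hold the stat names, odd indexes hold the values
    for stat_name, value in zip(tokens[::2], tokens[1::2]):
        stats[stat_name] = value
```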
Get stats for all levels of a single character
Now that I was able to read stats for a single level, I had to:
- Read stats for all levels of the current promotion level
- Repeat the process for the remaining promotion levels
This was actually quite simple. The idea was to read the stats, click the level up button, read the stats, click the level up button, …, and then repeat this for each promotion level. This was a perfect target for writing a custom function that did that.
Levelling up requires clicking a right arrow button, but changing the promotion level requires clicking one of three buttons. So what I did was:
- Locate the three promotion level buttons
- Read the stats for each level of that promotion level (the read stats, level up repetition)
- Promote the character by clicking the next promotion level button
- Repeat step 2.
- Repeat step 3. (if needed, some characters have three promotion levels, others two and others just one)
Additionally, because the number of levels in a promotion level varies depending on the promotion level and character rarity, this custom function required a parameter: the number of levels, i.e., the number of times to read stats and click the level up button. Here’s a sketch of that part, again with placeholder class names and a hypothetical read_current_level_stats helper standing in for the scraping shown earlier:
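```python
from selenium.webdriver.common.by import By

def get_stats_per_level(driver, n_levels):
    # Read the stats, then click the level up arrow, n_levels times
    for _ in range(n_levels):
        read_current_level_stats(driver)  # the scraping from the previous snippet
        driver.find_element(By.CLASS_NAME, "level-next-arrow").click()

# One button per promotion level available on the page (one to three)
rank_buttons = driver.find_elements(By.CLASS_NAME, "rank-button")
# Level caps per promotion level, based on 6 star characters
levels_per_rank = [50, 80, 90]

# zip stops at the shorter sequence, so characters with fewer
# promotion levels are handled automatically
for button, n_levels in zip(rank_buttons, levels_per_rank):
    button.click()
    get_stats_per_level(driver, n_levels)
```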
Most of the code was already shown in the previous code snippet (here condensed into the read_current_level_stats helper). The relevant part is the logic to read stats, level up, and promote characters as necessary. Note I have also implemented some logic for the varying promotion levels: it adapts to however many promotion levels are available in the character page.
The rank_buttons variable is a list of buttons. Each button is a promotion level, hence why I could access an element of that list and call the click method. Then the get_stats_per_level function repeats that read stats, level up loop X times, according to the number of levels available.
If you’re an Arknights player, you may notice I am scraping too many levels for the characters with an Elite 2 promotion, a group that includes every character of 4 star or higher rarity. For example, a 4 star character has only 45 levels for Elite 0/Non-Elite, yet I am trying to get 50. This results in the stats for level 45 being recorded for those extra levels (duplicated values).
I made this choice when considering the variety of level caps for different promotion levels and rarities, and performance drawbacks in the code. The level caps used were based on 6 star characters (the ones with highest level caps). Then I remove the duplicate rows at the end of the script before writing to the output CSV file (not included in this code snippet).
Lastly, note how the data is not being stored at all. I will explain how I handled this with pandas DataFrames later. Hint: the get_stats_per_level function needs modifications.
Get all the characters available
At this point, the logic for scraping data is complete. But there are two missing components: knowing which pages to scrape, i.e., having a list of character pages, and storing the data in a running pandas DataFrame for output at the end.
This part was very easy for me because this is not the first time I have scraped this Arknights wiki. Last time I also had to scrape a list of the characters available, but instead of stats I was looking at the URLs of the character art to create a wallpaper generator app with Streamlit! Enough with the self-promotion, here’s the code to get the list of available characters and the URL to their pages as a dictionary.
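A condensed sketch of that code follows; the list-page path, the table class and the Amiya (Guard) path are assumptions for illustration:

```python
import pickle
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://gamepress.gg"

# Load the page listing every character (path is illustrative)
response = requests.get(BASE_URL + "/arknights/tools/interactive-operator-list")
soup = BeautifulSoup(response.text, "html.parser")

# Each character name in the table links to their page via a relative path
char_pages = {}
for link in soup.select("table.operators-table a"):
    char_pages[link.text.strip()] = BASE_URL + link["href"]

# Hard-coded entry for Amiya (Guard), which is missing from the table
char_pages["Amiya (Guard)"] = BASE_URL + "/arknights/operator/amiya-guard"

# Checkpoint the dictionary as a pickle file
with open("operator_pages.pkl", "wb") as file:
    pickle.dump(char_pages, file)
```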
And look, I have actually used Beautiful Soup (and requests) for this part. To be honest I should’ve rewritten the code to use Selenium instead, but I already had this code that I knew worked, so I reused it. There’s no interactivity required: simply load the page and scrape the information from a table of characters.
Note that I don’t actually scrape URLs directly. The name of each character is a hyperlink to their page, using a relative path. Again using Phantom as an example, the value of his link’s href attribute is appended to the base “https://gamepress.gg” URL to form the full URL of his page. This information is stored as a dictionary that maps each character to the URL of their page.
Line 21 shows a hard-coded value. Delving into the game a bit, Amiya is the protagonist and recently she received an alternate form of a different class. Other alternate characters are their own separate characters, but Amiya’s two variants are the same character in-game. So I think that’s why the wiki doesn’t even show this Amiya (Guard) alternate in the table of characters. However, there are separate pages for Amiya and Amiya (Guard), hence me hard-coding the page for the latter. To avoid going through the trouble of writing more code, I decided it was okay to have this one hard-coded, as another Amiya shouldn’t be added to the game for a long time and alternates for other characters are added as new characters.
Finally, the dictionary is output as a pickle file. I did it in that previous project and I did it again here because it works as a nice checkpoint in case something breaks while scraping the character pages. The dictionary variable is used in the rest of the script, but it is good practice to have that checkpoint in case something goes wrong afterwards.
Here’s a sketch of the refactored flow, looping over that dictionary of character pages; the helper names follow the earlier snippets. As before, I am omitting code that hasn’t changed since the last time it was shown:
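```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# Visit each character page and scrape every promotion level
for name, url in char_pages.items():
    driver.get(url)
    close_sticky_ad(driver)  # hypothetical wrapper around the earlier ad logic
    rank_buttons = driver.find_elements(By.CLASS_NAME, "rank-button")
    for button, n_levels in zip(rank_buttons, [50, 80, 90]):
        button.click()
        get_stats_per_level(driver, n_levels)

driver.quit()
```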
Recording the data in a DataFrame for CSV output
With everything else in place, the last piece of the puzzle was to save the data somewhere as the script goes through the pages to output a CSV at the end.
We need to revisit the get_stats_per_level function. Now that I had to save the scraped data in a DataFrame, the function needed access to that DataFrame. The DataFrame is initially created as an empty DataFrame outside the function. However, it is passed as an argument to the function, so a new row of data is written to it in each iteration of the loop that levels up characters. The DataFrame with the latest data is returned at the end of the function. Take a look at the following code snippet to get a better grasp of this new code.
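A sketch of the modified function, assuming the hypothetical read_current_level_stats helper returns the scraped values as a dict; I use pd.concat here since DataFrame.append has been deprecated in recent pandas versions, but the idea is the same:

```python
import pandas as pd
from selenium.webdriver.common.by import By

def get_stats_per_level(driver, n_levels, name, df):
    for _ in range(n_levels):
        # Dictionary with the latest stats scraped from the page
        stats = read_current_level_stats(driver)
        stats["Name"] = name
        # Append the new row to the running DataFrame
        df = pd.concat([df, pd.DataFrame([stats])], ignore_index=True)
        driver.find_element(By.CLASS_NAME, "level-next-arrow").click()
    return df

# Created empty outside the function and reassigned with each call
df = pd.DataFrame()
```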
I think the comments in the code are self-explanatory but, in essence, to add a new row to the DataFrame I create a new dictionary with the latest stats scraped and append it to the DataFrame.
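And, as mentioned earlier, the duplicated rows from the over-scraped level caps are dropped at the very end, before writing the CSV. A one-line sketch (the output file name is illustrative):

```python
# Drop duplicated rows from over-scraped level caps, then save the dataset
df.drop_duplicates().to_csv("arknights_character_stats.csv", index=False)
```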
If you want to take a look at the complete script, you can find it on my GitHub here. And if you’d like the resulting dataset instead, you can find it here.
Closing thoughts
This was the start to a very fun personal project. Not only did I have a chance to work with data from a game I spend a lot of time on, but I got to work on Data Science while at it… well, Data Engineering at this stage. It was cool to revisit Selenium after using it a couple of years ago and learn more about web scraping.
But, given the nature of the work, this is a shaky script. It works today, but if GamePress decides to change these pages, they can easily break the script. APIs are always a better alternative, that is, when they are an alternative… When APIs are not an option, web scraping can still get the job done if your requirements are flexible enough and/or you have people to support the code in the long run.
Last but not least, feedback is very welcome, not just for the code but also for the article itself. I usually write tutorial-style articles, but this time I decided to do more of a discussion of what I did and why I did it. Please let me know if you enjoyed reading it and what can be improved. Was I too verbose? Did I strike a good balance between telling about the code and actually showing it? Should I have elaborated more on the code explanations? At any rate, thank you for reading the article and I hope this code can help you in the future :)
P.S. if you try to run this script today or in the next few days, it will break because the wiki is already adding the newly-announced characters and their pages are still a work in progress. So if you try to run the script make sure to ignore the URLs for these characters. The code for this is in the final script (lines 132–135). Uncomment these lines if needed.