Data Mining Typeracer (Part 1)

Jarry Xiao
8 min read · Apr 24, 2020


About a month ago, I left my job at a large quantitative hedge fund, where I worked as a software engineer for a little under two years. Since then, I have been on garden leave, which has been unfortunate timing given the ongoing pandemic. After sitting around for a few weeks, I’ve realized that having few productive outlets outside of work leads to a dull and rather unfulfilling lifestyle. Instead of spending my days binging Netflix, gaming, and playing online poker, I’ve decided to start writing about topics that interest me and to work on side projects that will keep my skills sharp for when I re-enter the working world. In this first series of posts, I will document my analysis of data collected from Typeracer, the popular online competitive typing game.

Approach

One of the most valuable things I gained from my first job out of college was an appreciation for messy datasets. I find it rewarding to extract new, quantifiable insights from places that people might not bother to look. Knowing where to find data, and how to collect and organize it, is a crucial part of building systems that improve both decision making and product design.

I’m usually not a fan of buzzwords or oversimplifying complex topics, but I think the Data Science Hierarchy of Needs reasonably portrays the components needed to leverage data for impactful decision making.

Figure 1: Useful abstraction for people with business degrees

I hope to make my way up this pyramid with each subsequent post. The primary focus of this article is on data collection, but I want to first discuss how I came up with this idea.

Ideation

I want to preface this section by saying that I think I am a mediocre typist. According to Typeracer, I type at an average rate of 74 WPM over the course of 67 races (though it’s possible that this number is more like an exponential moving average).

Figure 2: Badge of mediocrity

You should ignore the fact that the badge says I’m in the top 12% of typists. Among my friends, I am undeniably on the slow end. After taking loss after loss to people who consistently achieve impressive typing speeds of over 100 WPM, I decided to search through the site to see if there was a way to review my races in order to step up my typing game. That is when I stumbled on the following feature:

Figure 3: Race playback (this might be painful for some to watch)

As an engineer, I asked myself how a feature like this might even be implemented in the first place. To support playback, the client must somehow have access to the keystrokes made by the player, as well as the latencies between each keystroke. The frontend must then parse this data format and modify the DOM based on the contents of that payload.

With this hypothesis in mind, I was quickly able to confirm my suspicions by looking at the source code.

Figure 4: Yikes, but also jackpot

This might look ugly (because it is), but this is exactly the kind of messy dataset that I’ve come to appreciate. Typeracer stores this string for every race completed by a registered user on the site, as well as some standard analytics related to the race. This includes a list of mistakes made during the race and the average WPM throughout different sections of the text.

This variable, however, contains much more information than what the site displays.

I think this is an untapped dataset that could spark a lot of innovative ideas for people interested in exploratory data analysis. Before getting to that, though, I needed to figure out 1) how to decipher this string, 2) how to store the cleaned data, and 3) how to come up with a good initial schema for the purposes of analytics.

Collection

For the remainder of this article, I’ll discuss my approach to deciphering and parsing this monster of a string.

Loading this string into a Python variable is straightforward. You can use the requests module to fetch the raw HTML, and then use BeautifulSoup to locate the correct section of the document. The difficult part is actually interpreting the string content.
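For example, the fetch might look something like this (a sketch: the URL below is a placeholder, and I’m assuming the payload sits in an inline script variable, visible in the source shown in Figure 4, which I refer to here as typingLog):

import re
import requests
from bs4 import BeautifulSoup

# Placeholder URL for a single race result page; substitute a real
# username and race number.
url = "https://data.typeracer.com/pit/result?id=|tr:some_user|1"

html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# The replay payload is assigned to a variable inside an inline
# <script> tag; the name typingLog is my assumption here.
raw_log = None
for script in soup.find_all("script"):
    if script.string and "typingLog" in script.string:
        match = re.search(r"typingLog\s*=\s*(['\"])(.*?)\1", script.string)
        if match:
            raw_log = match.group(2)
            break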

To solve a puzzle like this, it’s useful to think about the context of how this data is being used. The goal is to provide the client an exact (or close to exact) playback of a particular user’s race. Therefore, the contents of the actual text should be stored somewhere in this variable. Here is the text:

We all know that something is eternal. And it ain’t houses and it ain’t names, and it ain’t earth, and it ain’t even the stars… everybody knows in their bones that something is eternal, and that something has to do with human beings. All the greatest people ever lived have been telling us that for five thousand years and yet you’d be surprised how people are always losing hold of it. There’s something way down deep that’s eternal about every human being.

Our Town by Thornton Wilder

Looking at the variable we’re trying to parse, we see that there is a formatting change somewhere in the middle.

Figure 5: Data segmentation

Specifically, it looks like there is a pipe (|) that separates the two halves. Let’s closely examine the beginning of each half in turn.

In the first half, we see that the fourth element of the comma-separated list starts with

W543e109 128a100l132l129 71k104n160o977w59 53t119h35a57t68

If we remove the numbers, we are left with

We all know that

which is the beginning of the target text! The numbers likely represent the number of milliseconds it took to type the preceding character.
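A quick way to sanity-check this in Python (naively assuming the target text contains no digits, which is exactly the kind of edge case discussed later):

import re

first_half = "W543e109 128a100l132l129 71k104n160o977w59 53t119h35a57t68"

# Each character is followed by its latency in milliseconds, so pairing
# every non-digit with the digit run after it recovers text and timings.
pairs = re.findall(r"(\D)(\d+)", first_half)
print("".join(char for char, _ in pairs))  # We all know that
print(pairs[:3])  # [('W', '543'), ('e', '109'), (' ', '128')]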

We can confirm this in the second half.

0,3,543,0+W,109,1+e,128,2+ ,3,4,100,0+a,132,1+l,129,1+l,71,3+ ,7,11,104,0+k,160,1+n,148,2+w,19,3+o,202,4+ ,197,4- ,134,3-o,110,2-w,167,2+o,59,3+w,53,4+ ,12,5,119,0+t,35,1+h,57,2+a,68,3+t,43,4+

This is more difficult to interpret, but it should be clear that it represents the same underlying data as the first half, just at a finer level of granularity. Notice that 543 corresponds to the first character ‘W’ in both halves of the variable.

The other thing to identify is the presence of the symbols ‘+’ and ‘-’. If you examine this segment closely, you will find that it contains more characters in total than the corresponding section of the first half. This is because I made a mistake typing the word “know” (take a look at the replay if you want to confirm this). A ‘+’ represents the addition of a character, and a ‘-’ represents the deletion of one.

Of course, we could represent the entire race as one long sequence of additions and deletions, but that would make it harder to figure out where words begin and end. This is where the leading numbers come in. For the first word, the 0 is the character index at which the word begins in the text, and the 3 is the number of keystrokes pressed to complete that word (‘W’, ‘e’, and ‘ ’). Likewise, for the third word, “know”, these numbers are 7 and 11 respectively (bolded in the quote above): the word begins at character index 7 of the text and took me a total of 11 keystrokes to type, including the deletions.
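To make that concrete, here is the “know” segment again, split into its word header and 11 actions (the annotations are mine; ␣ marks a typed space):

7,11                                        <- word starts at character 7; 11 actions follow
104,0+k  160,1+n  148,2+w  19,3+o  202,4+␣  <- typed "knwo "
197,4-␣  134,3-o  110,2-w                   <- backspaced to erase " ow"
167,2+o  59,3+w  53,4+␣                     <- retyped "ow "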

There are a few more nitty-gritty details in the schema of this second half, but I won’t go into them here. One example: the schema has a specific format for when a user highlights a section of text and replaces the highlighted section with a new character.

Disregarding edge cases, we can parse the schema with the following algorithm (sketched in code below):

1. For each word, read two tokens: the character index where the word begins and the number of actions (N) taken to complete it.

2. For the next N steps, pop off the next two tokens.

  • The first is the time in milliseconds taken to perform the action.
  • The second is a string containing the index of the character within the word, a ‘+’ or ‘-’, and the character that was typed or deleted.

3. Load this data into a standardized schema.

4. Repeat until all of the words have been parsed.
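Translated into Python, the core of the parser might look something like this (a sketch that disregards the same edge cases; the function and field names are my own):

import re
from dataclasses import dataclass

@dataclass
class Action:
    latency_ms: int  # milliseconds since the previous keystroke
    index: int       # character position within the word
    op: str          # '+' for an addition, '-' for a deletion
    char: str        # the character that was typed or deleted

def parse_typing_log(second_half):
    # Split on commas NOT followed by another comma, so that commas
    # typed by the player (which appear doubled in the stream) survive.
    tokens = re.split(r",(?!,)", second_half)
    words, i = [], 0
    while i < len(tokens):
        start, n_actions = int(tokens[i]), int(tokens[i + 1])
        i += 2
        actions = []
        for _ in range(n_actions):
            latency, action = int(tokens[i]), tokens[i + 1]
            i += 2
            # An action token like "3-o" or "4+ " is a character index,
            # an operator, and the character itself.
            m = re.match(r"(\d+)([+-])(.*)", action, re.DOTALL)
            actions.append(Action(latency, int(m.group(1)), m.group(2), m.group(3)))
        words.append({"start": start, "actions": actions})
    return words

Running this on the snippet above yields three actions for the first word and eleven for “know”, matching the counts we worked out by hand.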

Before concluding, I think it is important to bring up some of the pain points related to working with textual data. In the context of scraping this variable, I really only cared about that second half because it contained all of the necessary details to reconstruct the race. In order to accurately fetch that data (for any text on the site), I had to utilize regular expressions in every step of the tokenization.

I first needed to locate the pipe character (|), but I had to avoid pipes that were displayed in the text or typed by the player. Likewise, I needed to tokenize the second half by splitting on the commas (,), but I also had to ensure that all of the commas typed by the user were preserved.

To resolve this issue, I used a negative lookahead so that the split patterns skip over the separators that were actually part of the typed text. This covers almost all of the edge cases, which goes to show how meticulous the process of acquiring and parsing text can be…

\|(?!,)
,(?!,)
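To see why the lookahead works, consider a player who actually types a comma: the action token carrying it is immediately followed by the field separator, so the comma appears doubled in the stream, and the pattern refuses to split there. A toy demonstration (the fragment is fabricated):

import re

fragment = "120,5+,,131,6+x"  # the player typed a comma ("5+,")
print(re.split(r",(?!,)", fragment))  # ['120', '5+,', '131', '6+x']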

Conclusion

Some readers might not care for the process of scraping and transforming data. Admittedly, I often find it frustrating and tedious. However, I think that those who are genuinely interested in the intersection of computation and statistics will appreciate the level of detail that goes into every component of a cohesive system for data analysis.

In the follow-ups to this article, I hope to discuss the schema I used to store the data I pulled from Typeracer, as well as some ideas I have on how to analyze that data. Most of my ideas revolve around analyzing the latencies of character transitions. I would like to explore whether these latencies correlate with factors like how far each key is from the home row, which fingers are used to press each key, and what keyboard layout the typist uses. Read more about this analysis in Part 2!

I also foresee some huge engineering challenges if I want to generate a truly comprehensive dataset. My MacBook Pro does not have the capacity to store and process even a fraction of the data that exists on the site, and the initial data loading process will undoubtedly take a ridiculous amount of compute.

To close off, I’ll leave you with some personal stats that I’ve uncovered from analyzing my own races: my 10 slowest and 10 fastest character transitions. These stats probably aren’t very meaningful given the low sample size, but you might find them interesting nonetheless.

Figure 6: Slowest transitions (left), fastest transitions (right)

You can find all the code (scrapers, database schema, and analytics) at https://github.com/jarry-xiao/typescraper.
