Data Mining Typeracer (Part 2)

May 3, 2020

Last week, I introduced you to my side project where I scrape data from Typeracer and dig into the precise measurements from race data. You can find Part 1 here.

This week, I’ll first discuss my data model and then I’ll present some analysis and visualizations.

Data Model (Storage)

Last time, I left off by describing the steps I took to scrape raw data from Typeracer’s website. In order to organize the data more effectively, I created a PostgreSQL database. The goal was to develop a schema that would be conducive to performing filters, joins, and aggregations with relative ease.

Here is what I came up with:

Figure 1: keystrokes (top left), qwerty (top right), users (bottom left), texts (bottom right)

The most important tables here are keystrokes and qwerty.

keystrokes contains all of the character transitions pulled from the site along with metadata tying it back to a specific user and text. ch_prev and ch contain the previous and current character respectively. The latency in milliseconds between ch_prev and ch is stored in ms as an integer. forward_prev and forward are booleans that indicate whether or not the character in question was deleted. They are False if the Backspace key was pressed.

qwerty contains metadata about each character in relation to the traditional QWERTY keyboard layout. hand and shifted are both essentially boolean columns. hand is either ‘L’ or ‘R’ depending on which hand is used to type the character stored in ch. shifted is true when the corresponding character requires the user to press the Shift key and false otherwise. digit is an integer between 0 and 4 that maps each character to the digit most commonly used to press its corresponding key. Note that this map might not perfectly match your typing style; it’s just a modeling assumption I made about the general typing population.

0: thumb
1: index
2: middle
3: ring
4: pinky

Lastly, row corresponds to the keyboard row where each character is found. I chose index 0 to map to the home row, and mapped the other rows to their relative distance from the home row. This is displayed visually in the diagram below.

Given this setup, here is how some characters are represented in the qwerty table:

     ch, hand, digit, shifted, row
f: ('f',  'L',     1,       f,   0)
$: ('$',  'L',     1,       t,   2)
.: ('.',  'R',     3,       f,  -1)

I manually generated the data for qwerty in a spreadsheet and then wrote a Python script to populate the table.
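As a rough illustration of that loading step, here is a minimal sketch that reads spreadsheet-style rows and inserts them into a qwerty table. It is not the original script: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, the CSV text and the load_qwerty helper are hypothetical, and only the three rows shown above come from the article.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export; these three rows are the ones shown in the article.
QWERTY_CSV = """ch,hand,digit,shifted,row
f,L,1,0,0
$,L,1,1,2
.,R,3,0,-1
"""


def load_qwerty(conn, csv_text):
    """Create the qwerty table and insert one row per character."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS qwerty (
            ch      TEXT PRIMARY KEY,
            hand    TEXT    CHECK (hand IN ('L', 'R')),
            digit   INTEGER CHECK (digit BETWEEN 0 AND 4),
            shifted INTEGER CHECK (shifted IN (0, 1)),
            "row"   INTEGER
        )
    """)
    rows = [
        (r["ch"], r["hand"], int(r["digit"]), int(r["shifted"]), int(r["row"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO qwerty VALUES (?, ?, ?, ?, ?)", rows)
    return len(rows)


# sqlite3 in-memory database used here only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
inserted = load_qwerty(conn, QWERTY_CSV)
dollar = conn.execute(
    'SELECT hand, digit, shifted, "row" FROM qwerty WHERE ch = ?', ("$",)
).fetchone()
print(inserted, dollar)
```

The CHECK constraints encode the value ranges described above (two hands, digits 0 through 4), so a typo in the spreadsheet fails at insert time rather than skewing later aggregations.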

Exploration and Analytics

Because the latency data is incredibly noisy, I decided to focus my exploration on the users for whom I had the most data.

For these power users, I was able to collect over half a million character transitions per user. (I probably could have collected more, but I figured a sample of over 4000 races would be sufficient.) Most of these users publicly list their keyboard layout as QWERTY, so I was confident that the metadata assumptions I made were mostly accurate.

There are so many topics to explore here, and there’s no way that I could fit all of them into a single article. I’ll touch upon a few of these topics and leave you with a preview of what I plan to look at next week.

Measuring improvement

The most interesting datasets are often multi-dimensional. Typeracer’s data is both temporal (each character transition was logged at a specific point in time) and spatial (the physical location of the keys likely plays a role in the transition latency). One interesting metric in the temporal domain is how racers improved over time.

There are only two ways to improve at Typeracer (and typing in general).

Make fewer mistakes.
Make faster character transitions.

Improvement occurs whenever the positive change in one category outweighs the negative change in the other.

During this section, I will examine both latency and mistake data from power users and analyze how they correlate to their overall WPM.

Latency data is easy to define. Each character transition is stored in keystrokes so I can just query the average latency for each given day filtered on user.

SELECT DATE(race_date) AS race_date, AVG(ms) AS latency
FROM keystrokes k
JOIN users u ON u.user_id = k.user_id
WHERE username = %USERNAME%
-- Group by the expression: PostgreSQL resolves a bare race_date to the input column first.
GROUP BY DATE(race_date);

Mistakes are trickier to compute with the base schema. Originally I thought a comparable metric was the ratio of the amount of time spent pressing the Backspace key in a day to the total time spent typing for that day. This isn’t a perfect metric because it doesn’t account for the forward portion of the mistakes, but adding that would require a much more sophisticated query (or an updated data model).

SELECT DATE(race_date) AS race_date,
       -- Cast to float: ms is an integer, so plain division would truncate.
       SUM(ms * (1 - forward::int))::float / SUM(ms) AS mistake_score
FROM keystrokes k
JOIN users u ON k.user_id = u.user_id
WHERE username = %USERNAME%
GROUP BY DATE(race_date);
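To make the two daily metrics concrete, here is a small pure-Python sketch of the same aggregation. The rows and the daily_metrics helper are made up for illustration; only the column meanings (ms, forward) come from the schema described earlier.

```python
from collections import defaultdict
from datetime import date

# Hypothetical keystroke rows: (race_date, ms, forward). The values are made up.
rows = [
    (date(2020, 5, 1), 100, True),
    (date(2020, 5, 1), 300, False),  # Backspace transition
    (date(2020, 5, 1), 200, True),
    (date(2020, 5, 2), 150, True),
    (date(2020, 5, 2), 50, True),
]


def daily_metrics(rows):
    """Return {day: {"latency": mean ms, "mistake_score": Backspace ms / total ms}}."""
    total_ms = defaultdict(int)
    backspace_ms = defaultdict(int)
    count = defaultdict(int)
    for day, ms, forward in rows:
        total_ms[day] += ms
        count[day] += 1
        if not forward:
            backspace_ms[day] += ms
    return {
        day: {
            "latency": total_ms[day] / count[day],
            # True division, so the score is a fraction between 0 and 1.
            "mistake_score": backspace_ms[day] / total_ms[day],
        }
        for day in total_ms
    }


metrics = daily_metrics(rows)
for day, values in sorted(metrics.items()):
    print(day, values)
```

On this toy data, half of the first day's typing time is spent on Backspace transitions, while the second day has none.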

I realized that I also needed to write a new scraper to pull WPM data from the site, so I decided to pull the site’s accuracy numbers as well. I used that data to build a table with the following schema. I do not know the exact methodology the site uses to calculate WPM or accuracy, but these are the site’s only openly displayed metrics.

Because I wasn’t really in the mood to write a huge SQL query, I pulled all the necessary metrics into separate pandas DataFrames and joined them together in memory.
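The in-memory join itself was done with pandas. As a dependency-free sketch of the same idea, the snippet below inner-joins per-day metric tables on their shared dates. The join_on_date helper and all of the sample numbers are hypothetical.

```python
def join_on_date(*tables):
    """Inner-join dicts of {race_date: {metric: value}} on their shared dates."""
    shared = set(tables[0]).intersection(*tables[1:])
    joined = {}
    for day in sorted(shared):
        merged = {}
        for table in tables:
            merged.update(table[day])
        joined[day] = merged
    return joined


# Made-up per-day metrics keyed by ISO date strings.
latency = {
    "2020-05-01": {"latency": 200.0},
    "2020-05-02": {"latency": 100.0},
}
site_stats = {
    "2020-05-01": {"wpm": 95.0, "accuracy": 0.97},
    "2020-05-03": {"wpm": 99.0, "accuracy": 0.98},
}

joined = join_on_date(latency, site_stats)
print(joined)
```

Only days present in every table survive the join, which matters here because the latency sample and the scraped WPM history do not cover exactly the same races.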

Here are scatter plots displaying how four (anonymized) users improved over time:

Figure 5: Scatter plots for different users

For each user, the graphs represent WPM (top left), accuracy (top right), latency in milliseconds (bottom left) and mistake score (bottom right) all plotted against time.

Because I only scraped a limited sample of races, the latency data does not correspond perfectly to WPM even though all of the site’s metrics are derived from the raw latencies. If a user averages over 20 races a day, I may have only sampled latencies from two of those races on a particular day whereas I aggregated all of the recorded WPM numbers from the user’s races on that day.

All of the users did show significant improvement over the course of their race history. It’s evident that this improvement is present for both latency and accuracy. However, for almost all users, it appears that the accuracy measure saw only modest improvement whereas many of the above users were able to shave off 20 to 30 percent from their initial latency measurements.

User 19 (bottom left quadrant) showed the most impressive growth, bringing their WPM from under 40 to over 100. Their corresponding latency stats were reduced from the initial measurements by around 60%. This particular user lists their keyboard layout as “Other” on the website, so my theory is that he or she used Typeracer to learn that new layout from scratch.

It was also interesting to see similarities between the correlation structures of each user despite the vast differences in the data’s shape in the temporal domain. These similarities should not come as too much of a surprise, though, because the correlations between these metrics do not depend on race date. This means that if you were to identically shuffle these metrics across time, the correlations would still be the same. Correlation is also invariant under linear transformations of random variables with a positive scale factor (a negative scale factor only flips its sign). In the first series of scatter plots, I standardized the accuracy and latency data (a linear transformation) to make the visualization easier to interpret.
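Both invariance properties are easy to check numerically. The sketch below uses made-up WPM and latency values, with hand-written pearson and standardize helpers (hypothetical names, standard library only), and compares the correlation before and after shuffling the days identically and after standardizing one variable.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def standardize(xs):
    """Subtract the mean and divide by the standard deviation (a linear map)."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]


# Made-up daily values, ordered by race date.
wpm = [62.0, 70.0, 75.0, 81.0, 90.0, 96.0]
latency = [190.0, 171.0, 160.0, 150.0, 131.0, 125.0]

base = pearson(wpm, latency)

# Apply the same permutation of days to both metrics.
order = list(range(len(wpm)))
random.Random(0).shuffle(order)
shuffled = pearson([wpm[i] for i in order], [latency[i] for i in order])

# Standardizing latency is a linear transformation with positive scale.
rescaled = pearson(wpm, standardize(latency))

print(base, shuffled, rescaled)
```

All three values agree up to floating-point rounding.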

Figure 7: Scatter plots of (normalized) latency and accuracy vs. WPM.

Figure 8: Correlation matrices for each user (each table maps to the data from the scatter plots in the same quadrant in Figures 5 and 7)

As expected, typing speed (WPM) is positively correlated with accuracy and negatively correlated with transition latency in each of these cases.

Accuracy and latency also generally appear negatively correlated with each other to varying degrees. My interpretation as to why this happens is that as typists improve, they tend to both type faster (minimizing latency) and make fewer mistakes. If you were to see a data point with a relatively low latency, it’s more likely for the corresponding accuracy to be relatively high, though for many users this relationship isn’t particularly strong.

Physical constraints on typing speed

The spatial aspect of Typeracer data sheds light on the physical limitations of typing. Using the qwerty table, we can form new groups based on the metadata related to each character in the context of the QWERTY keyboard layout.

Shifted Characters

It would be natural to compare the latency difference between characters that require the use of the Shift key and those that do not. Intuitively, you probably think that transitioning from characters that require the Shift key should be on average slower than transitions from non-shifted characters. We can visualize and quantify the extent to which this statement is true by plotting and aggregating the data. The following histogram was taken from the latency stats of a QWERTY user (User 5) with an average WPM of over 200.

Figure 9: Normalized latency distribution

On the histogram, transitions that fall into the Shift category are bigrams where the first character is shifted and the second character is not. The no Shift category contains bigrams where both characters are not shifted. Here are the complete aggregated stats related to this same user’s data:

Figure 10: Aggregated latency stats based on Shift category
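The two categories can be expressed as a small classifier over bigrams. In this sketch the SHIFTED set is an assumed subset of shifted characters on a US QWERTY layout (in the database this would come from the shifted column of qwerty), shift_category is a hypothetical helper, and the latencies are made up.

```python
from collections import defaultdict
from statistics import mean

# Assumed set of shifted characters on a US QWERTY layout.
SHIFTED = set('ABCDEFGHIJKLMNOPQRSTUVWXYZ~!@#$%^&*()_+{}|:"<>?')


def shift_category(ch_prev, ch):
    """Classify a bigram using the two categories described in the article."""
    prev_shifted = ch_prev in SHIFTED
    cur_shifted = ch in SHIFTED
    if prev_shifted and not cur_shifted:
        return "Shift"
    if not prev_shifted and not cur_shifted:
        return "no Shift"
    return None  # transitions into a shifted character belong to neither group


# Made-up (ch_prev, ch, ms) transitions.
transitions = [
    ("T", "h", 140),
    ("h", "e", 80),
    ("e", " ", 70),
    (" ", "T", 210),
    ("I", "t", 160),
]

by_category = defaultdict(list)
for ch_prev, ch, ms in transitions:
    category = shift_category(ch_prev, ch)
    if category is not None:
        by_category[category].append(ms)

averages = {category: mean(values) for category, values in by_category.items()}
print(averages)
```

Bigrams that end on a shifted character are excluded from both groups, so the comparison isolates the cost of releasing Shift rather than the cost of reaching for it.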

We can also visualize which characters have the longest transition times on the QWERTY layout.

For shifted characters (that occurred more than 20 times in the data), here is a heat map where the magnitude is a measure of the average time it takes to transition from any other character to the current character. Note that if a key is highlighted in this diagram, it refers to the character that is mapped to by Shift → key.

Figure 11: Heat map of average transition times to highlighted Shifted character

Figure 12: Aggregated stats based on keyboard row location (top) and finger map (bottom)

Based on the above results, we see that the user in question is slowest at typing shifted characters on the top row of the keyboard, and is slowest at transitioning to his or her right pinky.

We can repeat this exercise with the non-shifted characters.

Figure 13: Heat map of average transition times to highlighted non-Shifted character

Figure 14: Aggregated stats based on keyboard row location (top) and finger map (bottom)

Once again, it seems like transitions to keys in the top row and to keys pressed by the right pinky are the slowest.

Keys on the top row require a greater reach for the user, and they also appear less frequently in text. I think both of these reasons contribute to this latency difference.

A similar line of reasoning can explain the difference between the left and right pinky. The left pinky is mapped to the ‘a’ key, which is very commonly used in text and located on the home row in the default position. The other keys that it maps to, ‘q’ and ‘z’, are much less frequently used. The right pinky, on the other hand, is used for characters like ‘'’, ‘-’, ‘"’, and ‘?’, which all occur relatively frequently but require more of a reach.

What Causes Slow Transitions?

Before concluding, I want to take a look at some specific character transitions and give some insight on what the data has to say. For simplicity, I only looked at non-shifted character transitions.

Figure 15: Character transition average latency histograms

These histograms correspond to the same users I looked at in the previous section. It appears that the majority of transitions are relatively fast, but for each of these users, there is a small group of transitions that are significantly slower.

We can look at the sample skewness of these distributions to further quantify this imbalance. A positive skewness means that the distribution’s mass is concentrated on the left, and the distribution has a long right tail.
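For reference, here is one common way to compute sample skewness: the moment estimator g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. This is a sketch with made-up latencies; the article does not say which estimator was used, so treat the choice of formula as an assumption.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (biased estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up average transition latencies (ms): mostly fast, with a few slow outliers.
latencies = [60, 65, 70, 70, 75, 80, 85, 150, 220]

skew = sample_skewness(latencies)
symmetric = sample_skewness([1, 2, 3, 4, 5])
print(skew, symmetric)
```

The long right tail of slow transitions makes the first value positive, while the symmetric sample has zero skewness.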

If we look at User 5 (the fastest user in this dataset), we find a similar distribution to the others.

Figure 17: Character transition average latency histogram (User 5)

I’m interested in this user’s data because there is a physical limit to how quickly someone can type, and User 5 is definitely close to it.

Even for the fastest typist, certain transitions are more difficult to make than others. Here is a sorted list of transitions for User 5 that take greater than 100 milliseconds on average.

Figure 18: Slowest character transitions for User 5
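A list like this can be produced by averaging the latency of each (ch_prev, ch) pair and keeping the pairs above a threshold. The sketch below is illustrative only: slow_transitions is a hypothetical helper, the rows are made up, and the min_count parameter is an added assumption for dropping rarely observed pairs.

```python
from collections import defaultdict


def slow_transitions(rows, threshold_ms=100, min_count=1):
    """Return ((ch_prev, ch), average ms) pairs above the threshold, slowest first."""
    totals = defaultdict(lambda: [0, 0])  # (ch_prev, ch) -> [sum of ms, count]
    for ch_prev, ch, ms in rows:
        entry = totals[(ch_prev, ch)]
        entry[0] += ms
        entry[1] += 1
    averages = {
        pair: total / count
        for pair, (total, count) in totals.items()
        if count >= min_count
    }
    slow = [(pair, avg) for pair, avg in averages.items() if avg > threshold_ms]
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Made-up (ch_prev, ch, ms) rows.
rows = [
    ("c", "t", 150),
    ("c", "t", 130),
    ("t", "h", 60),
    ("t", "h", 50),
    ("c", "r", 120),
]

result = slow_transitions(rows)
print(result)
```

Raising min_count is one way to keep a single unlucky keystroke from putting a rare bigram at the top of the list.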

I selected a couple of these transitions to visualize on the keyboard.

Figure 19: Transitions are represented by blue → orange

When you feel out these transitions on the keyboard, they are all a little uncomfortable and clumsy, so it makes sense that it’s statistically more difficult to make these transitions quickly. You might also notice that all of them are transitions on the same hand and most of them use the same digit. By joining the data with the qwerty table, we can confirm this.

Figure 20: Slowest transitions mapped to fingers

In the above table, row_span is defined as the difference between the target row and the source row, so the transition c → t would have a row_span of 2. We could also ignore the row_span and just look at the finger transitions.

Figure 21: Slowest transitions ignoring keyboard row
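The row_span definition and the same-hand, same-digit checks can be sketched directly from the qwerty metadata. The QWERTY dictionary below holds (hand, digit, row) for a handful of keys under the conventions described earlier (index finger is 1, home row is 0); the describe helper is a hypothetical name.

```python
# (hand, digit, row) for a few unshifted keys, following the article's conventions.
QWERTY = {
    "c": ("L", 2, -1),
    "d": ("L", 2, 0),
    "e": ("L", 2, 1),
    "r": ("L", 1, 1),
    "t": ("L", 1, 1),
    "h": ("R", 1, 0),
}


def describe(ch_prev, ch, layout=QWERTY):
    """Summarize a transition: same hand, same digit, and target row minus source row."""
    hand_a, digit_a, row_a = layout[ch_prev]
    hand_b, digit_b, row_b = layout[ch]
    same_hand = hand_a == hand_b
    return {
        "same_hand": same_hand,
        "same_digit": same_hand and digit_a == digit_b,
        "row_span": row_b - row_a,
    }


print(describe("c", "t"))  # bottom row to top row on the left hand
print(describe("e", "d"))  # same finger, one row down
print(describe("t", "h"))  # alternating hands
```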

Because of the high variance in the data, it’s certainly possible that the ranking indicated by either of these tables could be shuffled around. It’s also important to realize that different users might struggle with different character transitions. Many of these transitions will be universally slow on QWERTY, but others might be more narrowly tailored to specific users. By analyzing the non-universally slow character transitions, you could potentially figure out how a particular user can improve at typing.

While you might expect close symmetry between the right and the left hand, certain character transitions happen so infrequently that it’s unlikely that there’s enough data from Typeracer to compare the two. For example, c → r and c → t are both incredibly slow transitions on the left hand. It’s probably also true that , → u and , → y are slow because they correspond to the same rows of the keyboard and use the same digits on the right hand. However, these transitions will almost never be found in text without a space in between them, so we don’t observe them in the data.

Conclusion and Next Steps

There are many more things I could have explored or discussed based on the data that was collected. A few of the things I looked at or wanted to look at but didn’t get a chance to discuss included:

Transition latency differences between the two hands and whether that has a statistical relationship to handedness.
More aggregations from additional partitions of the data.
Using the best typists as a reference point for what I can do to improve at typing.

In the next and final installment of this series, I plan to compare data from users using a Dvorak keyboard layout vs. a QWERTY layout. Below are heat maps of character frequencies in relation to their physical locations on QWERTY and Dvorak keyboards respectively. (I’ve ignored the Space key because it has the largest frequency by a significant margin.)

Figure 22: QWERTY character frequency heat map

Figure 23: Dvorak character frequency heat map

Immediately, you notice that in the Dvorak layout, the most commonly used keys are found in the home row.

There are 3 primary questions I want to answer:

Are there any character transition latencies that are drastically different between these two keyboard layouts? For example, using (hand, digit, row) notation, the transition c → r maps to (L, 2, -1) → (L, 1, 1) on QWERTY and (R, 2, 1) → (R, 3, 1) on Dvorak. Intuitively, the Dvorak transition appears much more seamless. Is this reflected in the latency data of self-reported Dvorak users?
Given the digit transition speeds recorded for words typed on a QWERTY keyboard, what kind of improvement, if any, might that same user experience on Dvorak? (This is assuming that the user could reach a similar level of familiarity with a Dvorak keyboard.) How does the Colemak layout theoretically compare?
Based on bigram frequencies and transition speeds, is it possible to design an “optimal” keyboard layout that would theoretically outperform QWERTY, Dvorak, and/or other popular layouts? There are many hurdles to overcome with this particular problem (I think it’s probably intractable), so I would love to discuss ideas with any interested readers.
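To compare the same bigram across layouts, one option is to derive the (hand, digit, row) tuple from each key's physical position. The sketch below assumes a standard touch-typing column-to-finger map and covers only the ten main keys in each of the three letter rows; FINGERS, build_layout, and transition are hypothetical names.

```python
ROWS = (1, 0, -1)  # top, home, bottom, matching the article's row convention

# Assumed touch-typing map from key column to (hand, digit); index = 1, pinky = 4.
FINGERS = [
    ("L", 4), ("L", 3), ("L", 2), ("L", 1), ("L", 1),
    ("R", 1), ("R", 1), ("R", 2), ("R", 3), ("R", 4),
]

# First ten unshifted keys of each letter row on the two layouts.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
DVORAK_ROWS = ["',.pyfgcrl", "aoeuidhtns", ";qjkxbmwvz"]


def build_layout(rows):
    """Map each character to (hand, digit, row) based on its physical position."""
    layout = {}
    for row_index, keys in zip(ROWS, rows):
        for column, ch in enumerate(keys):
            hand, digit = FINGERS[column]
            layout[ch] = (hand, digit, row_index)
    return layout


qwerty = build_layout(QWERTY_ROWS)
dvorak = build_layout(DVORAK_ROWS)


def transition(layout, ch_prev, ch):
    """Return the (hand, digit, row) tuples for both ends of a bigram."""
    return layout[ch_prev], layout[ch]


print("QWERTY c -> r:", transition(qwerty, "c", "r"))
print("Dvorak c -> r:", transition(dvorak, "c", "r"))
```

With both layouts expressed in the same coordinates, latencies measured per finger transition on one layout could be used to estimate bigram costs on the other, under the strong assumption that finger-level speeds carry over between layouts.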

I discuss this problem and my approach in Part 3.

Acknowledgements

All of the keyboard visualizations in this article were custom-generated by code that I wrote, but the skeleton code was derived from this repository. The keyboard images are pulled from here (QWERTY) and here (Dvorak) respectively.

Figure 24: QWERTY base layout (left), Dvorak base layout (right)

You can find all the code (scrapers, database schema, and analytics) at https://github.com/jarry-xiao/typescraper.