Age of DataFrames 2: Polars Edition

Wouter Gins
Published in datamindedbe
May 23, 2024
Age of Empires II: Definitive Edition banner (copyright Microsoft) and Polars logo (copyright Ritchie Vink)

As one of the newer kids on the block, Polars offers an exciting alternative to PySpark for small- to medium-sized datasets, as already evidenced in other blogposts. Here, we’ll dip our toes in the water with a more focused topic to showcase how powerful and expressive Polars can be: e-sports prediction. We’ll have a look at a tournament of the video game Age of Empires II: Definitive Edition (AoE), and try to predict how the first round will go. Aside from a similar age (Polars had its first commit in June 2020, AoE was released in November 2019), the AoE community has spawned many community projects that maintain vast databases, making it ideal for a small project like this. I hope that by the end of this blogpost, you will share my joy in the features of Polars that just work 😄

In order to make a prediction, I’ll first have a look at some of the data that has been gathered by the community, then make a simple model to calculate a win chance, and then end by applying this model to the tournament.

If you want to play around with these concepts, here is a link to the GitHub repo you can use to download the data and get started.

📖 Data source

For our data, we scraped aoestats.io, which maintains a database of online games played. I’m currently only interested in matches that are officially ranked, which corresponds to raw_match_type being between 6 and 9 (inclusive), so I will add this as a filter when reading the data.

I’ve already pre-processed the data, having done a bit of cleaning. After filtering on the match type, the data looks like this:

df_m = (
    pl.scan_parquet("data/aoe2/matches/*/*.parquet")
    .filter(pl.col("raw_match_type").is_between(6, 9))
    .select(pl.all().shrink_dtype())
)
df_m_c = df_m.collect()
print(f"Match data is about {df_m_c.estimated_size('gb'):.2f} GB")

A neat trick that Polars offers is the .shrink_dtype() expression, which shrinks numeric types to the smallest version that can hold the values currently in the dataframe. Quite handy for minimizing memory usage!
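As a toy illustration (this little frame is not part of the dataset), shrinking picks the narrowest type that still fits the values:

import polars as pl

toy = pl.DataFrame({"raw_match_type": [6, 7, 8, 9]})  # stored as Int64 by default
print(toy.schema)                                     # {'raw_match_type': Int64}
print(toy.select(pl.all().shrink_dtype()).schema)     # {'raw_match_type': Int8}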

Why filter: Eagle-eyed readers will also know of the existence of the semi-join, which I could have used to filter the data by joining against a small dataframe with a raw_match_type column. In testing, I found only a minute difference in performance between the two (0.4 s versus 0.6 s). My takeaway is that the semi-join probably has some overhead. For filtering on a larger number of values, or on values that are not known beforehand, I would recommend the semi-join.
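For reference, a minimal sketch of that semi-join variant (same filter as above, expressed as a join against a small frame of allowed match types; cast the key column if the dtypes of the two frames differ):

allowed_types = pl.DataFrame({"raw_match_type": [6, 7, 8, 9]})
df_m_semi = (
    pl.scan_parquet("data/aoe2/matches/*/*.parquet")
    .join(other=allowed_types.lazy(), on="raw_match_type", how="semi")
    .select(pl.all().shrink_dtype())
)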

Now, aside from the matches, I also need the data of the players involved in these games. This is an ideal case for the semi-join!

df_p = pl.scan_parquet("data/aoe2/player/*/*.parquet")
df_p_c = (
    df_p.join(other=df_m_c.lazy(), on="game_id", how="semi")
    .collect()
    .select(pl.all().shrink_dtype())
)
print(f"Player data is about {df_p_c.estimated_size('gb'):.2f} GB")

The dataset contains almost 12 million played games and 42 million player records for those games: large enough to play around with, and just small enough to fit in memory on my modest laptop (once I close Chrome 😉).

🔮 Win chance prediction: making a lookup table

Of particular interest in this dataset are the rating or elo columns, which contain a number that represents how strong a player is. This concept originated in chess and is widely used across different e-sports to rank players. As a naive estimator for a player’s chance to win a match, we can use the rating difference. First, let’s transform the data so it can be fed into a classifier, and then make a lookup table for the rating difference:

rating_diffs = (
    df_p_c.join(
        other=df_m_c.filter(
            raw_match_type=6  # For filtering a column on equality, Polars offers some syntactic sugar
        ),  # Restrict the data to 1-vs-1 games to get a more accurate image
        on="game_id",
        how="inner",
    )
    .select(
        "match_rating_diff", "winner"
    )  # We only want the rating difference and the winner flag
    .drop_nulls()  # Placement matches have players with no rating calculated yet, can be removed
)
rating_diffs

For the classifier, I simply took the Gaussian Naive Bayes classifier available in scikit-learn. This is not necessarily the most suitable choice, and I will also not split the data into a training and testing set. For building good models, these steps are absolutely essential, but I want to focus on how to use Polars, not on machine learning 😉 Just for fun, I’ve also scored the model to see how it performs:

from sklearn.naive_bayes import GaussianNB
import numpy as np

gnb = GaussianNB()
gnb.fit(
    X=rating_diffs.select("match_rating_diff"),
    y=rating_diffs.select("winner"),
)
gnb.score(
    X=rating_diffs.select("match_rating_diff"),
    y=rating_diffs.select("winner"),
)
>> 0.5476460560275515

Our classifier scores about 55%, so slightly better than a straight-up coin toss! The classification simply selects whoever has the higher win percentage, which is somewhat boring. However, the model can also give the win percentage itself, which is not so boring! Let’s evaluate it in steps of 5 ELO points over a decent range, so we can see the win chance change smoothly. Since we need to interface with the classifier, the Polars version of a User Defined Function is required, which can be either map_elements(...) or map_batches(...). The exact difference is quite nuanced and has a lengthy entry in the documentation. In this situation, map_batches can be used to evaluate all the data at once, rather than record per record.

winning_rating_diffs = pl.DataFrame(
    pl.int_range(start=-1000, end=1000 + 5, step=5, eager=True)
    .cast(pl.Float64)
    .alias("Rating diff")
).with_columns(
    pl.col("Rating diff")
    .map_batches(lambda x: gnb.predict_proba(X=np.atleast_2d(x.to_numpy()).T)[:, 1])
    .alias("Win chance")
)

Polars recently also added support for plots, so let’s use it!

winning_rating_diffs.plot.line(
    x="Rating diff",
    y="Win chance",
    width=800,
    height=800,
)

What a beautiful plot of the lookup table!

🪄 The magic join_asof function

The main point of a lookup table is to speed up calculations by fetching pre-calculated results. However, normal (equi-)joins as implemented in SQL match on exact equality, which works for lookup tables in certain use cases, but not for this one. This can be circumvented by using non-equi joins and then filtering down the matches, but that approach is heavy on the logic, and the resulting code can be opaque and difficult to maintain. Enter join_asof.
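To make that concrete, here is a rough sketch of such a manual nearest-match lookup (with a hypothetical queries frame): cross join every query row against the full lookup table, compute the distance, and keep only the closest entry per row. It works, but it materializes the entire cross product before filtering.

queries = pl.DataFrame({"Rating diff": [0.0, 123.0]})  # hypothetical query points
manual_lookup = (
    queries.join(other=winning_rating_diffs, how="cross", suffix="_lookup")
    .with_columns(
        (pl.col("Rating diff") - pl.col("Rating diff_lookup")).abs().alias("distance")
    )
    # keep the closest lookup entry per query row (ties would keep both rows)
    .filter(pl.col("distance") == pl.col("distance").min().over("Rating diff"))
    .drop("Rating diff_lookup", "distance")
)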

The join_asof function in Polars is meant precisely for cases such as this, where the matching isn’t exact but can be either the closest lower value (strategy="backward"), closest higher value (strategy="forward") or closest value (strategy="nearest"). With this function, retrieving the entry in the lookup table with the closest match in rating difference is very simple:

test = pl.DataFrame({"Rating diff": [0.0]}).sort("Rating diff")
test.join_asof(
    other=winning_rating_diffs.sort("Rating diff"),
    on="Rating diff",
    strategy="nearest",
).select("Win chance").item()

>> 0.49997980412308385

With no rating difference, the chance of winning is 50%. Using this tool, let’s predict the course of a tournament!

🇧🇪 BELCup 2024

The tournament I chose was BELCup 2024, the first tournament for only Belgian players (I might have a small bias 😉). By creating the list of players along with their player ID, we can verify the seeding used and calculate each player’s rating. Unfortunately, one of the players could not be found in either the leaderboard or any other database, so the list of 34 players has been reduced to 33.
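The players_belcup frame used below is nothing more than a small table mapping each participant’s in-game name to their profile_id. A minimal sketch of its shape, with placeholder IDs rather than the real ones:

players_belcup = pl.DataFrame(
    {
        "name": ["NeutralWouter", "Pairu"],  # ... plus the other 31 participants
        "profile_id": [1234567, 2345678],    # placeholder IDs, not the real ones
    }
)  # cast profile_id to the same dtype as df_p_c if the join complains about key types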

With these profile IDs, the games played by these players can be extracted from the dataset:

df_belcup = (
    df_p_c.join(
        other=players_belcup,
        on="profile_id",
        how="inner",
    )
    .drop("profile_id")
    .join(
        other=df_m_c.filter(raw_match_type=6).select("game_id", "started_timestamp"),
        on="game_id",
        how="inner",
    )
    .select("name", "new_rating", "started_timestamp")
)

The seeding of the players was done based on the mean of the maximum and current rating, which is easy to calculate since Polars supports several statistical calculations that combine columns:

belcup_ratings = (
    df_belcup.group_by("name")
    .agg(
        pl.max("new_rating").alias("Maximum ELO"),
        pl.col("new_rating")
        .sort_by("started_timestamp")
        .last()
        .alias("Current ELO"),
        pl.col("started_timestamp")
        .sort()
        .last()
        .alias("Current ELO timestamp"),
    )
    .with_columns(
        pl.mean_horizontal("Maximum ELO", "Current ELO").alias("Seeding ELO")
    )
    .sort("Seeding ELO", descending=True)
    .with_row_index(name="Seeding spot", offset=1)
)
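mean_horizontal is one of a family of row-wise reductions (alongside min_horizontal, max_horizontal and sum_horizontal); a toy example of what it computes per row:

# Toy illustration of a horizontal (row-wise) reduction
pl.DataFrame({"Maximum ELO": [1200, 1500], "Current ELO": [1000, 1400]}).with_columns(
    pl.mean_horizontal("Maximum ELO", "Current ELO").alias("Seeding ELO")
)
# Seeding ELO becomes 1100.0 and 1450.0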

The differences with the official seeding are due to the database used here only going back to September 2022, and some players’ ratings having changed since they registered 😄

We can make predictions for the first round by calculating the rating difference between the players and using our lookup table. First, we make a table that captures the match ups:

import polars.selectors as cs  # selectors are needed for cs.numeric() below

df_first_round = pl.DataFrame(
    {
        "Player 1": ["[BraBros] Koen", "KnightofSaintJohn", "SmellyLeopard", "Caféglacé", "TrCL.Welcometorapture", "the_belgian"],
        "Player 2": ["Pairu", "Squatting Slav in Tracksuit", "PAP.KingBug", "ADCuitA", "NeutralWouter", "KEVAIN_25"],
    }
)
predict_first_round = (
    df_first_round.join(
        other=belcup_ratings.select(
            pl.col("name").alias("Player 1"),
            pl.col("Maximum ELO").alias("ELO 1 Maximum"),
        ),
        on="Player 1",
        how="left",
    )
    .join(
        other=belcup_ratings.select(
            pl.col("name").alias("Player 2"),
            pl.col("Maximum ELO").alias("ELO 2 Maximum"),
        ),
        on="Player 2",
        how="left",
    )
    .with_columns(
        cs.numeric().fill_null(1000)  # No rating available for some players, so fill any null in a numeric column with a default ELO of 1000
    )
    .with_columns(
        (pl.col("ELO 1 Maximum") - pl.col("ELO 2 Maximum")).alias("Rating diff"),
    )
    .with_columns(pl.col(pl.UInt16).cast(pl.Float64))
    .sort("Rating diff")
)

With this table ready, join_asof adds in the winning chances of Player 1:

predict_first_round.join_asof(
    other=winning_rating_diffs.sort("Rating diff"),
    on="Rating diff",
    strategy="nearest",
)

As a first estimate, the calculated win chances predict very balanced match ups, except possibly the last few matches.

Now, the first round is a Best Of 5 (BO5), meaning the first person to win 3 games advances. I’m going to use the binomial distribution to calculate how likely every possible outcome of the series is, and present both the most likely outcome and the overall win chance of the predicted winner.
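The reasoning: for a player to win the series 3–k, they must win exactly 2 of the first 2 + k games and then take the final game. With a per-game win chance p, that probability is binom.pmf(2, 2 + k, p) * p, which is the term computed in the UDF below. A quick sanity check at p = 0.5:

import scipy.stats as stats

p = 0.5
for k in range(3):  # the loser wins k games, k = 0, 1, 2
    print(f"3-{k}: {stats.binom.pmf(2, 2 + k, p) * p:.4f}")
# 3-0: 0.1250, 3-1: 0.1875, 3-2: 0.1875 -> 0.5 per player, 1.0 over both players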

import scipy.stats as stats
from numpy.typing import NDArray


# Prepares a LazyFrame listing all possible winning end results for
# a given Best-Of series.
def create_bestof_records(BO: int) -> pl.LazyFrame:
    N = (BO // 2) + 1
    df = pl.DataFrame({"Wins": list(range(N + 1))}).lazy()
    return (
        df.join(other=df, how="cross")
        .rename({"Wins": "Wins Player 1", "Wins_right": "Wins Player 2"})
        .filter(
            ((pl.col("Wins Player 1") >= N) | (pl.col("Wins Player 2") >= N))
            & (pl.col("Wins Player 1") != pl.col("Wins Player 2"))
        )
    )


# Another UDF, where the entire column is propagated and can be calculated in a vectorized way
def calculate_win_chance(df: pl.Series) -> NDArray:
    p1wins = df.struct.field("Wins Player 1").to_numpy()
    p2wins = df.struct.field("Wins Player 2").to_numpy()
    p = df.struct.field("Win chance").to_numpy()
    # "Win chance" is the win chance of Player 1; flip it when Player 2 wins the series
    p = np.where(p2wins > p1wins, 1 - p, p)
    # The series winner always takes the last game, so the binomial part covers
    # only the games before it: (wins - 1) successes in (total games - 1) attempts
    wins = np.where(p2wins > p1wins, p2wins, p1wins) - 1
    total_games = (p1wins + p2wins) - 1
    return stats.binom.pmf(wins, total_games, p) * p


BO = 5
N = (BO // 2) + 1

prediction = (
    predict_first_round.lazy()
    .join(other=create_bestof_records(BO=BO), how="cross")
    .with_columns(
        pl.struct(
            "Wins Player 1", "Wins Player 2", "Win chance"
        )  # Passing multiple columns at once to a UDF can be done by creating a struct column
        .map_batches(calculate_win_chance)
        .alias("Score probability"),
        pl.when(pl.col("Wins Player 1") == N)
        .then("Player 1")  # In general, strings are interpreted as column names, not literals
        .otherwise("Player 2")
        .alias("Winner"),
    )
    .with_columns(
        pl.col("Score probability").sum().over("Winner").alias("Winner probability")
    )
    .filter(
        pl.col("Score probability")
        == pl.col("Score probability").max().over("Player 1", "Player 2")
    )  # The combination of Player 1 and Player 2 uniquely determines each match-up
)
prediction.collect()

Ok, the predictions have been made, let’s see how much reality is wrong!

Only the last two predictions do not match the results, but the win of ADCuitA is due to a forfeit (try modeling the chance of that happening!), and the Koen-Pairu result only differs by a single game! Funnily enough, these are the matches involving players for whom the database did not have any rating information. Overall, I’m happy with the result of this simple model.

Speaking as NeutralWouter: way less happy about the actual match result…

🤓 Conclusion

With this small example, I hope I’ve given you enough of a taste of Polars to make you want to tinker with it. As a reminder, I’ve made the code available in this GitHub repo. Feel free to replicate my results 📋, or deepen the examples further with more advanced techniques 🚀. After all, estimating how an entire match will go based purely on rating is not a good idea, as fans of the Belgian football team, still without a world title, will know. Let me know in the comments what you think!
