Forecasting Player Activity for Apex Legends using Time Series Data 📈

Salman Hossain
7 min read · Aug 28, 2022

Decomposition of player activity from time series data for Apex Legends from https://steamdb.info

Learning how to forecast can be challenging. A key aspect of forecasting with univariate time series data is that the observations are no longer independent: previous values influence future ones. This makes it tricky to apply standard machine learning practices such as random train/test splits and ordinary cross-validation.

Project Overview 🌐

We want to forecast player activity in Apex Legends, identify its patterns, and predict growth or decline. In this project, I looked at Apex Legends' player activity. The data was collected from https://steamdb.info as time series CSV files.

Problem Statement 🔩

How do we build a performant predictive model from univariate time series data? To approach this problem, we first determine the time series characteristics of our dataset, such as trend, seasonality, noise, and stationarity. We then apply some common time series models, such as Autoregression and FB Prophet, and evaluate their performance with suitable metrics.

Metrics 📏

To evaluate the models, we will use Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). The RMSE gives us an idea of the standard deviation of the residuals, i.e. how far the predictions typically fall from the observed values, and it penalizes large errors more heavily. The MAE is the average absolute difference between the prediction and the observed value, which gives a direct measure of how accurate the models are.
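As a minimal sketch of how these two metrics could be computed: the helper names `rmse` and `mae` and the sample numbers below are made up for illustration and are not taken from the project's notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up player counts, for illustration only.
observed = [180_000, 175_000, 190_000, 200_000]
predicted = [178_000, 180_000, 185_000, 205_000]

print(f"RMSE: {rmse(observed, predicted):.0f}")
print(f"MAE:  {mae(observed, predicted):.0f}")
```

Because the errors are squared before averaging, the RMSE is never smaller than the MAE, and the gap between the two widens when a few predictions are far off.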

Part 1: Data Exploration ❕

The main goal of exploratory data analysis on time series data is to discover these characteristics:

  • Seasonality
  • Trend
  • Stationarity

Along with those we can compute various averages:

  • Simple Moving Average
  • Exponential Moving Average
Summary Statistics for Apex Legends since its initial launch in Feb 2019 (Left), Correlation Matrix of Players and Twitch Viewers (Right)

A summary of the player count shows that Apex Legends averages ~179,000 players, which is quite a lot, with a standard deviation of ~62,000. The highest peak ever was ~344,000 players on a given day!

Number of players in the month of June with a simple moving average trendline (Left), Same plot but with exponential moving average (Right)

The simple moving average indicates a slight downward trend in player activity in June. The slope of the trendline is 11, which is negligible given that player counts fluctuate between roughly 100k and 300k. Thus we can say that player activity is roughly constant for the month of June.
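Here is a rough sketch of how the two moving averages and a trendline slope could be computed with pandas. The series is synthetic (a daily sine wave plus noise), and the 24-hour window is an assumption, not necessarily the setting used for the plots above.

```python
import numpy as np
import pandas as pd

# Synthetic hourly player counts for June (illustrative, not the SteamDB data).
rng = np.random.default_rng(0)
index = pd.date_range("2022-06-01", periods=30 * 24, freq=pd.Timedelta(hours=1))
hours = np.arange(len(index))
players = pd.Series(
    180_000 + 40_000 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5_000, len(index)),
    index=index,
)

# 24-hour simple moving average: every point in the window has equal weight.
sma = players.rolling(window=24).mean()

# Exponential moving average with the same span: recent hours weigh more.
ema = players.ewm(span=24, adjust=False).mean()

# Slope of a least-squares line through the SMA, in players per hour.
valid = sma.dropna()
slope = np.polyfit(np.arange(len(valid)), valid.to_numpy(), deg=1)[0]
print(f"SMA trendline slope: {slope:.2f} players/hour")
```

The first 23 values of the simple moving average are empty because the window is not yet full, while the exponential version produces a value from the first observation onward.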

Violin Plot of Player Activity for the days (Left), Number of players at a given hour of the day (Right)

Violin plots 🎻 are underrated graphs that show not only summary statistics like max, min, and median but also the shape of the distribution, unlike a box and whiskers plot 😸.

This allows us to observe, for example, that player counts on the weekend, particularly Saturday and Sunday, are more uniformly distributed than on weekdays.

Overview of the cyclical pattern of players and twitch viewers throughout the day for the month of June.

We can see a definite pattern in the univariate time series: a predictable rise and fall throughout the day. This will be important later when discussing forecasting models.

Decomposition of the time series data into components of trend, seasonality, residuals

Observed = Trend + Seasonality + Residuals

One assumption that is important for the autoregression model is that our time series is stationary, meaning its mean and variance are constant over time.

Judging visually whether the mean and variance are changing can be difficult. The simple moving average suggests that, for the most part, the average isn’t changing much, but a statistical test can tell us whether that change is statistically significant.

The Dickey-Fuller test is a hypothesis test for whether a time series is stationary. Its null hypothesis is that the time series is non-stationary, so we need a p-value below 0.05 to reject it. In eda.ipynb in the GitHub repo, the p-value is less than 0.05, so we can indeed reject the null hypothesis of non-stationarity and treat our data as stationary.

Part 2: Models 📉

In this project, 3 different models are tested: Naive, FB Prophet, and AutoRegression.

To compare the performance of these models, we write a small evaluation routine. As mentioned before, creating train/test splits with univariate time series data is different from typical tabular data. The evaluation code therefore performs cross-validation with 4 splits of the univariate data with the help of TimeSeriesSplit from sklearn. After each split, the model is fitted and the evaluation metrics (RMSE and MAPE, the mean absolute percentage error) are computed. The RMSE and MAPE are then averaged over the splits and printed.
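A minimal sketch of such an evaluation loop is shown here. The function names `cross_validate` and `naive_forecast`, and the synthetic data, are hypothetical stand-ins rather than the code from the repo. Only `TimeSeriesSplit` with 4 splits comes from the description above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def naive_forecast(train, horizon):
    """Baseline: repeat the last observed value for every future step."""
    return np.full(horizon, train[-1], dtype=float)

def cross_validate(series, forecast_fn, n_splits=4):
    """Average RMSE and MAPE over expanding-window splits."""
    series = np.asarray(series, dtype=float)
    rmses, mapes = [], []
    # Each split trains on an initial segment and tests on the block right
    # after it, so the model never sees values from the future.
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(series):
        train, test = series[train_idx], series[test_idx]
        pred = forecast_fn(train, len(test))
        rmses.append(np.sqrt(np.mean((test - pred) ** 2)))
        mapes.append(np.mean(np.abs((test - pred) / test)))
    return float(np.mean(rmses)), float(np.mean(mapes))

# Synthetic hourly counts (illustrative only).
rng = np.random.default_rng(0)
hours = np.arange(30 * 24)
players = 180_000 + 40_000 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5_000, hours.size)

avg_rmse, avg_mape = cross_validate(players, naive_forecast)
print(f"RMSE: {avg_rmse:.0f}, MAPE: {avg_mape:.2%}")
```

Any other model can be plugged in by passing a different function with the same `(train, horizon)` signature, which keeps the comparison between models fair.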

Naive Model:

The naive model predictions compared to the observed test values.

The naive model assumes that the next value will be the same as the previous value. It serves as a baseline against which the other models are compared.

FB Prophet Model:

The FB Prophet model uses an additive model (AM) to make predictions compared to the observed test values.
Additive Model Equation

The FB Prophet model is a tool built by Facebook to automate high-quality forecasts for business applications. It takes an additive model approach and lets you adjust how the trend is modeled.
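The additive form is usually written as follows in the Prophet documentation and paper. The symbols here follow that convention and are not taken from this project's code.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

Here g(t) is the trend, s(t) the periodic (seasonal) component, h(t) the effect of holidays or special events, and ε_t the error term.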

Benefits:

  • No hyperparameter tuning required; the defaults work out of the box.
  • Automated, so it scales well.
  • Not necessary to know much about the dataset before feeding it to the model.

Drawbacks:

  • Additive models assume seasonal variation to be constant over time.
  • Harder to explain than the AR model.

AutoRegression Model:

The autoregression model (AR) predictions compared to the observed test values.
AR Model Equation
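In its standard textbook form, an autoregressive model of order p can be written as below. The notation is the common convention, assumed here because the original equation image is not available.

```latex
y_t = c + \sum_{i=1}^{p} \varphi_i \, y_{t-i} + \varepsilon_t
```

Each value y_t is a constant c plus a weighted sum of the previous p values, with weights φ_i learned from the data and ε_t as the noise term.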

Benefits:

  • Flexible handling of a wide range of time series patterns.
  • Easier to explain what the model is doing.

Drawbacks:

  • Requires hyperparameter tuning (e.g. the number of lags).
  • Prior information is needed on the time series data (e.g. whether it is stationary).

Part 3: Conclusion 📃

Three models were tested to forecast player activity. The naive model served as the baseline for evaluating the other models. The FB Prophet model was developed by Facebook for real world business applications. Last but not least the autoregression model was developed for univariate data with known characteristics such as period, trend, and seasonality.

Throughout creating these models, the most difficult part was the autoregression model, since it required extensive exploratory data analysis beforehand. Another challenge is that in univariate time series data the values depend on each other, unlike in many other types of datasets. This means that normal cross-validation techniques cannot be used.

The FB Prophet model does rather well considering there is no hyperparameter tuning involved and no prior information about the characteristics of the univariate data is needed. The accuracy of the model is 87% with RMSE ~19120. One noticeable trend in the FB Prophet model is the exaggeration of the downtrends.

While the FB Prophet model does well, it does not beat the autoregression model. The autoregression model's downside is that it requires a solid understanding of the underlying data, whereas the FB Prophet model did not require us to understand the data at all. This can also be a drawback of FB Prophet: it is more difficult to explain, as it uses the additive model, and it is not as flexible as the AR model.

The autoregression model is more intuitive and easier to understand. Similar to linear regression, each value is modeled as a linear combination of previous values, which makes the model more explainable. The disadvantage is that this model, while more performant than the others, does require some knowledge of the data.

Both the autoregression and FB Prophet models seem to indicate that, while there is a slight downward trend, it is not statistically significant, as shown by the hypothesis test. We also found that our time series data is stationary, which is a requirement for the AR model.

In the future, different models could be tried, perhaps an LSTM, or the AR model could be extended with a moving average component to build toward a full-fledged ARIMA model. However, FB Prophet and the AR model both perform extremely well as they are.

Check out the GitHub Repo 👈
