Git Gud — Chapter 8

matteia
Jun 26, 2024


Goodness of Fit — Applying the Idea

We can use the relative ratios of critical counts obtained earlier in a chi-square goodness-of-fit test. The frequencies and relative ratios from the PerfectMM Group vs PerfectMM Population results will serve as our (quasi-)true values: they determine the expected frequencies for every sample we test, letting us measure how close a matchmaking algorithm comes to the PerfectMM benchmark. This can be seen in the table below.

The critical count numbers, ranging from 0 to 197, will be treated as categories. The ‘true’ ratios will be calculated by summing the frequencies from critical count = 0 up to critical count = N, where N is a cutoff of our choosing. Truncating at N is necessary because of one of the conventional conditions of the chi-square goodness-of-fit test: every expected frequency must be at least five.

For our initial tests, all frequencies from critical count = 0 to critical count = 22 were added (0 ~ 22). This covered 66,177 cases from our PerfectMM Group vs PerfectMM Pop. critical count results. The ratios calculated from these cases are shown in the table below.
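The truncate-and-normalize step can be sketched as follows. Here `benchmark_freq` is a hypothetical stand-in for the PerfectMM Group vs PerfectMM Population frequency table, not the article's actual data:

```python
import numpy as np

def true_ratios(benchmark_freq, n_max):
    """Keep categories 0..n_max and normalize their frequencies to ratios."""
    kept = np.asarray(benchmark_freq[: n_max + 1], dtype=float)
    return kept / kept.sum()

# Toy frequencies for critical counts 0..4 (illustrative only)
freq = [10, 40, 30, 15, 5]
ratios = true_ratios(freq, 4)   # e.g. ratios[0] == 0.1
```

In the article's setting, `n_max` would be 22 and the ratios would be taken over the 66,177 retained cases.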

The purpose of the following tests will be to determine whether the distribution of the critical counts from a group of games under a specific matchmaking algorithm follows the distribution of ‘PerfectMM Group vs PerfectMM Population Critical Counts’, which we will now treat as the ‘True Critical Count Distribution’. Given a significance level of 0.01, the null and alternative hypotheses are as below.

H0: The distribution of critical counts of a sample follows the True Critical Count Distribution
Ha: The distribution of critical counts of a sample does NOT follow the True Critical Count Distribution
( α = 0.01 )

The degrees of freedom will be 22 (df = 22), as there are 23 categories, so our threshold will be 40.289, according to widely used chi-square distribution tables. In other words, if our test statistic exceeds the threshold, we reject the null hypothesis.
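The tabled threshold of 40.289 can be reproduced from the chi-square quantile function, assuming SciPy is available:

```python
from scipy.stats import chi2

alpha = 0.01
df = 22
# Upper-tail critical value: reject H0 when the statistic exceeds this
threshold = chi2.ppf(1 - alpha, df)
print(round(threshold, 3))  # 40.289
```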

The table above shows the results of the three tests performed. The first row is trivially a pass, since the test was performed on the same sample the ratios were derived from; a proper test would use a new sample generated with a different random seed. However, the result in the second row, where the NearPerfectMM group was tested, makes that extra effort unnecessary: the NearPerfectMM group has already been shown to be highly similar to the PerfectMM group, so it can stand in for a PerfectMM group generated with a new seed.

The third row requires some explanation. As its critical count histogram shows, there are no records with zero critical counts: the frequency at critical count = 0 is zero. This pattern extends far enough that performing the test becomes infeasible, because too many expected values become zero.

Application to Real Data (Live Data)

So far, all the comparisons have dealt with generated records by hypothetical players. This naturally makes us wonder how real records of games played by actual human players would fare in our procedures.

Records of approximately sixty thousand players who had played more than 300 games were gathered across ranks: Challenger, Grand Master, Master, Diamond, Emerald, Gold, Silver, and Bronze. Because of their small numbers, Challenger, Grand Master, and Master players were combined into one group, while the other ranks were kept separate.

To test these records of real players, a similar process was performed. Only the number of categories (critical counts) was widened, because the size of each group allowed more critical counts to have expected frequencies above five. As a result, all frequencies from critical count = 0 to critical count = 50 were added (0 ~ 50).

The purpose of the following tests will be to determine whether the distribution of the critical counts from a group of games played by actual players follows the distribution of ‘PerfectMM Group vs PerfectMM Population Critical Counts’, which we will now treat as the ‘True Critical Count Distribution’. Given a significance level of 0.01, the null and alternative hypotheses are as below.

H0: The distribution of critical counts of a sample of real data
follows the True Critical Count Distribution
Ha: The distribution of critical counts of a sample of real data
does NOT follow the True Critical Count Distribution
( α = 0.01 )

The degrees of freedom will be 50 (df = 50), as there are 51 categories, so our threshold will be 76.154, according to widely used chi-square distribution tables. In other words, if our test statistic exceeds 76.154, we reject the null hypothesis.
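The full decision procedure for one rank group can be sketched as below, assuming SciPy. `sample_counts` would be a group's observed critical-count frequencies (0 ~ 50) and `true_ratio` the benchmark ratios derived earlier; both are placeholders here, not the article's data:

```python
import numpy as np
from scipy.stats import chisquare, chi2

def gof_decision(sample_counts, true_ratio, alpha=0.01):
    """Chi-square goodness-of-fit decision against the benchmark ratios."""
    observed = np.asarray(sample_counts, dtype=float)
    expected = np.asarray(true_ratio) * observed.sum()  # scale to sample size
    stat, _ = chisquare(observed, f_exp=expected)
    threshold = chi2.ppf(1 - alpha, len(observed) - 1)
    return stat, threshold, stat > threshold  # True -> reject H0
```

With a sample whose proportions match the benchmark exactly, the statistic is zero and H0 is retained; the real rank groups all exceeded the threshold.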

Despite the narrow margin exhibited by the Bronze group, the table above shows that no sample retained its null hypothesis. This is somewhat intriguing, as none of the histograms resembled that of the StreakMM Group vs PerfectMM Population, where the distribution was heavily left-skewed. It suggests that, although invisible to simple visual inspection, the distributions of critical counts from actual game records differ in their relative ratios of critical counts.

We have observed that it is possible to distinguish which matchmaking algorithm may have produced a group of records. The premise is that the group is the result of a single matchmaking scheme, not a composite of two or more (e.g. only PerfectMM, only StreakMM, or only Live Data). By using their collective statistics, subtle differences may be found that suggest the use of one algorithm over another.

Next: Chapter 9

Previous: Chapter 7
