The NAF has recently started using the Glicko tournament ranking system developed by Mark Glickman and implemented by Nick Harding (mubo). This rating method has become very popular over the last few years as an upgrade to the Elo ranking system, which is the one that the NAF has been using to date. It was originally designed to improve the rating system for chess, but has since seen use in games such as CS:GO, Guild Wars 2, and Dominion Online. This article hopes to explain just what it is, why it is an improvement over the Elo system, and why in particular it is a great fit for Blood Bowl.
As I hope there is a wide audience for this article, I am writing it so that people will enjoy it whether they are looking for a technical or a non-technical explanation.
The Elo system of rating was invented by physics professor Arpad Elo as a way of grading the skill levels of chess players. It feeds the difference between the ratings of the two players into an exponential formula to get an expected result, and then compares that with the actual result of the game to adjust the ranking of each of the players. The idea is that over time all players will have played enough games for their rating to match their “true” skill level. To understand this we need to define the difference between an observable and a hidden value for a characteristic.
Let’s take someone’s height as an example. This is an observable characteristic. You simply go and measure it. You need no further information, because the value of someone’s height can be measured directly. This is not the case for someone’s skill at a game. There is no way to directly measure how good someone is at a game in the same way we can measure how tall they are. Therefore this is a hidden characteristic. We need to infer what that value is by looking indirectly at other things and then calculating it. We do this by looking at the results of the games that they have played, and who they have played against, and then trying to come up with a score that reflects their ability. This means that we know that their ranking is not their true skill level, but an approximation of it.
When we were taught about distributions at school it would have been about an observable characteristic within a population. Taking height as an example again, we can create a chart that shows how height is distributed across a population, so that if we were to pick someone at random from that population we’d have an idea of what their height might be. In this example we know that height follows a normal distribution, and the average height is 177cm. This is the mean average (as opposed to the median or mode), which is written with the symbol μ (called mu). Knowing the average is all well and good but it doesn’t tell us how spread out the heights are. For that we need the standard deviation, which measures how far heights typically deviate from the average. In this example it is 4cm, and it is written with the symbol σ (called sigma).
From this we know that if we take the mean (μ) height and we look at 1 standard deviation (σ) in each direction then we get a range of 173cm – 181cm. If we look at 2σ the range is 169cm – 185cm. And so on. This is important because when we know these values we can work out the probability of someone picked at random being in that range because the normal distribution follows a specific pattern.
The area under the line shows the percentage chance of the person having a height within that range. So there is a 34.1% chance that if we picked someone at random they would have a height between 177cm and 181cm. What is the chance that someone has a height of at least 169cm? Well we can also calculate that as 13.6% + 34.1% + 34.1% + 13.6% + 2.1% + 0.1% = 97.6% (the exact figure is closer to 97.7%; the small gap comes from rounding each band). So just by knowing the μ and the σ we will know what the chance of someone falling into a certain height range will be.
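For those who like to check the numbers, here is a minimal sketch of the same calculation in Python. It assumes the height example above (μ = 177cm, σ = 4cm) and uses the standard normal curve via `math.erf`; the function names are my own and are not part of the NAF code.

```python
import math

def normal_cdf(x, mu, sigma):
    """Probability that a normally distributed value is at most x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def chance_between(low, high, mu, sigma):
    """Area under the normal curve between low and high."""
    return normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)

MU_HEIGHT = 177.0    # mean height in cm, from the example above
SIGMA_HEIGHT = 4.0   # standard deviation in cm, from the example above

between_177_and_181 = chance_between(177, 181, MU_HEIGHT, SIGMA_HEIGHT)
at_least_169 = 1.0 - normal_cdf(169, MU_HEIGHT, SIGMA_HEIGHT)

print(f"177cm to 181cm: {between_177_and_181:.1%}")
print(f"169cm or taller: {at_least_169:.1%}")
```

Because this works from the exact curve rather than the rounded bands, the second figure can differ from the hand-added total by a tenth of a percent or so.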
So why is this important? Well what you are usually not taught in school is that you can do this same thing in reverse for a hidden characteristic. If the ranking that someone has is distributed around their true hidden skill, then their true hidden skill is distributed around their ranking – with the exact same distribution! That is to say that the chance of having a ranking of 170 when your true skill is 180 is the same as the chance of having a true skill of 180 when your ranking is 170. That makes sense because all I am saying is that if A = B then B = A, which is a pretty important aspect of maths! And this is crucial because it means that we can use your calculated ranking to find what your true skill is. Your calculated ranking will be the distribution (μ and σ), and the true skill will lie somewhere within that distribution, with the chance being the same as randomly selecting someone from it.
So let’s say that your calculated ranking right now was 180 (this is your μ score), and the standard deviation for you was 15 (this is your σ score), then your true skill level would have a 34.1% chance of being between 180 and 195.
This is critical to understand as it is the entire basis for the Glicko rating. When we want to assess how good someone is – that is we want to discover what their true skill is even though it is hidden – we want information about the distribution as well as the score that they hold at that moment in time. Their true score will be somewhere within that distribution. However there is one slight difference. Although it works in the same way we cannot use the term “Standard Deviation” as that means something quite specific with a specific way of calculating it. Glicko uses a different calculation method, and because of that it needs a new symbol, which in this case is φ (called phi, which is pronounced “fi”, rhyming with “high”). But it works exactly the same.
So now we know how all that applies to distributions we can discuss why it is an improvement on Elo.
The problem with Elo 1 – Distribution
As you have probably already worked out, the biggest problem with Elo is that it uses only a μ score. In Elo, when two players face each other all that matters is their respective rankings. This means that every 180-rated player is considered the same, whether they got there in 10 games or 10,000 games. But we know that someone who got there in 10 games is likely to have a true skill that is well wide of that 180 score. They could be on the first steps to a 300 ranking! Or they could have fluked two tournaments before ending up with a 140 rating for the rest of time. We just don’t know, but Elo thinks it does.
For example (and refer to the above distribution charts), when we want to find a likely true skill range for a player we might say we want to know the range one standard deviation each side of the average. Let’s take the following players:
- (A) μ = 180, φ = 15 ; true skill 68.2% between 165 – 195
- (B) μ = 180, φ = 25 ; true skill 68.2% between 155 – 205
- (C) μ = 180, φ = 50 ; true skill 68.2% between 130 – 230
Now let’s say that you are a player with a ranking of 165. There is a good chance (84.1%) that player A’s true skill is above your ranking. But what about players B and C? We can see that even at just 1 deviation player B could be under your ranking, and player C has such a wide range that it is hard to judge how good they are at all. But for Elo all these players are the same: they are all ranked 180.
To cope with this, chess rating systems wait until a new player has played a sufficient number of games to get the width of the distribution down. They then calculate that player’s score for the first time from all of those results, and retrospectively go back and adjust all of their opponents using this calculated score. It’s a fudge to work around a fundamental flaw in a rating system that does not care about distributions.
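To put numbers on players A, B and C, here is a small sketch that treats each player’s true skill as normally distributed around μ with width φ, as described earlier. The function name and the dictionary layout are assumptions of mine for illustration only.

```python
import math

def chance_true_skill_above(threshold, mu, phi):
    """Chance that true skill exceeds threshold, treating it as normal(mu, phi)."""
    z = (threshold - mu) / (phi * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))

# The three players from the list above: name -> (mu, phi)
players = {"A": (180, 15), "B": (180, 25), "C": (180, 50)}
my_ranking = 165

chances = {
    name: chance_true_skill_above(my_ranking, mu, phi)
    for name, (mu, phi) in players.items()
}
for name, chance in chances.items():
    print(f"Player {name}: {chance:.1%} chance of a true skill above {my_ranking}")
```

All three share the same μ, yet the chance of being better than a 165 player falls steadily as φ widens, which is exactly the information Elo throws away.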
The problem with Elo 2 – Sensitivity
The next biggest problem with Elo is that we have to manually set how sensitive the change in ranking should be to the result of a game. To understand this we need to think about the information that each game gives us. We want game results that are more telling to have a bigger impact on the ratings of those players, as this gets us closer to knowing what their true skill is. But games that give us no new information should not have a big impact. Chess and the NAF Elo system both do this by introducing a ‘k’ factor. The higher this number, the more sensitive the adjustment to your score is. If this number is too big then your rating will just bounce around depending on what your last result was. Too small and it will lag behind the changes that it should be making (imagine improving your game to compete with the best but still having a 155 rating because each win only gives you +0.02 rating!).
The problem is that this sensitivity should not be just a single value. Not all games give us the same amount of information. If you are playing someone new to the game then they are bound to have a low ranking because it takes time to build that ranking up. But what if they are a genius at the game who has only just started playing tournaments? Playing this person is a low information value game, because the ranking system has no idea how good they are. On the other hand you could play someone who has put 250 tournament games into a single team and plays it all the time. The ranking system should have a great idea of how good that player is, so this is a high information value game. A good ranking system will adjust more to playing the latter player than the former. Elo does not do this: you get a single k-value for all games.
Chess does not have this, so instead they fudge it again by giving a high k-value to lower ranked players (who tend to be inexperienced), and a low k-value to high ranked players. For NAF tournaments we only have the one number, except that we double it for majors. This is done to reward people for doing well at a big event, but the problem is that the more people go to an event the more likely you are to play people who are either inexperienced or well above or well below your skill level. These are low information games (they tell us nothing new), so really the k-value should be lower. But it is impossible, on a practical level, to manually adjust the k-value for every single match-up. The Glicko system does this.
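To see what the ‘k’ factor does in practice, here is a rough sketch of an Elo-style update. It assumes the standard chess form of the formula (base 10 with a 400-point scale), which is not necessarily the scaling the NAF used, and the k-values are made up purely to show the effect.

```python
def elo_expected(rating, other_rating, scale=400.0):
    """Expected score (0 to 1) for the first player under standard chess Elo."""
    return 1.0 / (1.0 + 10 ** ((other_rating - rating) / scale))

def elo_update(rating, other_rating, actual, k):
    """New rating after one game; actual is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k * (actual - elo_expected(rating, other_rating))

# The same win between two equally rated players, at three sensitivities
# (hypothetical k-values, chosen only for illustration)
for k in (0.5, 16, 64):
    print(f"k = {k}: 1500 becomes {elo_update(1500, 1500, 1.0, k)}")
```

The result of the game is identical each time; only k changes how far the rating moves, and nothing in the formula knows how much either player’s rating can be trusted.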
The question most people are asking is: why is my Glicko score the way it is? The answer comes from the distribution described above, and from how confident we are about where your true skill lies. The Glicko system assigns you a minimum ranking: a value that we are very confident your true skill level is at or above. In this case “very confident” is about 99%. So when we say that your score is 1725, what we are really saying is that your true skill is 1725 or more, with 99% confidence. And it just so happens that about 99% of the area under the curve lies above the point 2.5 deviations below the average.
Score = μ – ( 2.5 * φ )
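As a quick illustration of the formula, here is a tiny sketch. The two players and their μ and φ values are hypothetical, chosen so that one of them lands on the 1725 used in the text.

```python
def displayed_score(mu, phi, deviations=2.5):
    """Conservative score: the level we are about 99% sure true skill exceeds."""
    return mu - deviations * phi

# Hypothetical players with the same mu but different certainty
veteran = displayed_score(mu=1850, phi=50)    # many games played, narrow distribution
newcomer = displayed_score(mu=1850, phi=120)  # few games played, wide distribution

print(f"Veteran: {veteran}, newcomer: {newcomer}")
```

Two coaches with the same μ can therefore sit far apart on the list until the less certain one has played enough games to bring their φ down.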
Step 1 – Monthly Update
One important difference for the Glicko method over the Elo method is that Glicko is calculated monthly. This does not affect the games that are actually played – these are evaluated sequentially just like Elo. Rather it is required for the decay calculation, which is in Step 2. Before this can be done we need to create a cut-off point to know which games should be considered. All games up to and including the last day of each calendar month are included in that update. For example this article was published in early April 2018, about the same time as the March 2018 Glicko ranking update which would be all games up to and including those played on the 31st March. There is one exception and that is where a tournament spans two calendar months; in this case 31st March was a Saturday and 1st April a Sunday. In these months the cutoff point is pushed back a day so that the entire tournament is included. So the March 2018 cut off is actually the 1st April rather than the 31st March.
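The cut-off rule can be sketched in a few lines. This assumes the only exception is a month ending on a Saturday (so that a weekend tournament would otherwise be split across two updates), which is my reading of the example rather than a statement of the actual NAF process; `monthly_cutoff` is a hypothetical helper.

```python
import calendar
from datetime import date, timedelta

SATURDAY = 5  # date.weekday() counts Monday as 0, so Saturday is 5

def monthly_cutoff(year, month):
    """Last date whose games count towards the update for the given month."""
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    # Assumption: only a Saturday month-end splits a weekend tournament
    if last_day.weekday() == SATURDAY:
        return last_day + timedelta(days=1)
    return last_day

print(monthly_cutoff(2018, 3))
print(monthly_cutoff(2018, 4))
```

A longer tournament straddling a month-end would need a rule based on the tournament dates themselves, which this sketch does not attempt.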
If submissions are slow in coming the NAF will release anyway in order to prevent long delays holding up the ranking update. The aim is that each new update will occur around 7 days into the following month. This gives time to get the data entered and processed. If tournaments are missing because of a slow submission they are simply added into the next processing month, so games will never be missed.
Step 2 – Who played?
The second step is to work out who played in that month, and when we say “who” this is always a combination of coach-team. So whilst Wulfyn-Dark Elves may have played, Wulfyn-Lizardmen might not have. Those who did not play are classed as inactive and will experience a decay to their ranking driven by an increase in their φ. This is slow at first (nobody gets punished for taking a month off), but if it starts to linger into years then your rankings will start to take a big hit and, if your φ goes over 100, will drop from the ranking list altogether.
phi_star = math.sqrt(min(self.glicko.phi, dnp.phi ** 2 + dnp.sigma ** 2))
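To show why the decay is slow at first, here is a sketch of the inactivity step as it appears in Glickman’s Glicko-2 paper, where each idle period grows φ to sqrt(φ² + σ²). The starting values are hypothetical and on Glicko-2’s internal scale, and the optional ceiling is an assumption of mine that may not match how the NAF code caps φ.

```python
import math

def decay_phi(phi, sigma, months_inactive, phi_cap=None):
    """Grow phi once per inactive month: phi becomes sqrt(phi^2 + sigma^2)."""
    for _ in range(months_inactive):
        phi = math.sqrt(phi ** 2 + sigma ** 2)
        if phi_cap is not None:
            phi = min(phi, phi_cap)  # assumed ceiling, not taken from the NAF code
    return phi

start_phi = 0.5  # hypothetical starting uncertainty
sigma = 0.06     # hypothetical volatility

for months in (1, 6, 24):
    print(f"{months} months off: phi = {decay_phi(start_phi, sigma, months):.4f}")
```

Because the variances add rather than the deviations, a single month off barely moves φ, while a couple of years of inactivity adds up to a noticeable widening.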
Also in this step we work out if anyone has played with a new race. This is important because this combination won’t have a score yet. In Elo this team would just get a default starting score of 150, but in Glicko we take the median average of all other teams that player has to generate their μ. This is because there is an understanding that a good coach is unlikely to have a beginner score just because it is a new team, as the ability to play the game, whilst preferential to some teams, is not independent across teams. The φ score is given as the maximum of all teams that person has previously played. Whilst we have little confidence in how good this person is with this team we know that playing a different team is not totally independent so we can use some information to help derive this value.
elif method == "median":
    # Seed a new coach-team pairing from the coach's existing teams
    mu_vals = [v.mu for v in self.rankings.values()]
    phi_vals = [v.phi for v in self.rankings.values()]
    _mu = np.median(mu_vals)
    _phi = np.max(phi_vals)
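A self-contained version of the same idea might look like the sketch below. It uses the standard library’s `statistics.median` instead of NumPy, and the function name, the input layout and the example ratings are all assumptions for illustration.

```python
from statistics import median

def seed_new_team(existing):
    """Starting (mu, phi) for a new team.

    existing: list of (mu, phi) pairs for the coach's other teams.
    """
    if not existing:
        # A brand new coach has nothing to borrow from; handling that case
        # is outside this sketch.
        raise ValueError("coach has no rated teams to seed from")
    mu = median(m for m, _ in existing)   # typical skill across other teams
    phi = max(p for _, p in existing)     # least certain existing rating
    return mu, phi

# Hypothetical coach with three rated teams
other_teams = [(1600, 40), (1500, 90), (1700, 60)]
new_mu, new_phi = seed_new_team(other_teams)
print(new_mu, new_phi)
```

Taking the largest φ keeps the new rating honest: we borrow the coach’s typical skill level but admit we are no more certain about it than about their least-played team.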
Step 3 – Calculate Score
The first part of this step is to calculate what we think the expected score should be. This is a fancy way of saying “what percentage do we think a player has of winning?”. If we think that chance is 80% then it has an expected score of 0.8.
def expect_score(self, rating, other_rating, impact):
    return 1. / (1 + math.exp(-impact * (rating.mu - other_rating.mu)))
This takes the difference in the μ ratings of the two players and uses it in an exponent. We also multiply that difference by the impact, which is the weighting applied to that game based on the confidence that we have in the information being provided. Where Elo relies on a single fixed k-value for its sensitivity, remember that in Glicko the weighting varies based upon how certain we are about the true skill of the players involved: the φ scores.
def reduce_impact(self, rating):
    return 1 / math.sqrt(1 + (3 * rating.phi ** 2) / (math.pi ** 2))
The higher the φ the lower the value of the impact function (for large φ it essentially boils down to a constant divided by φ). When this impact value is then multiplied by the difference in the ratings it has the effect of suppressing that difference. This means that players with a high φ score are treated as closer in ability than their μ scores alone would suggest.
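Putting the two functions together, here is a standalone sketch that can be run without the rest of the rating code. It assumes the conversion from Glickman’s paper, where ratings and deviations are divided by 173.7178 to reach the internal scale, and that the impact is taken from the opponent’s φ as in that paper. The ratings used are hypothetical.

```python
import math

GLICKO2_SCALE = 173.7178  # conversion constant from Glickman's Glicko-2 paper

def reduce_impact(phi):
    """Weighting for a game: shrinks towards 0 as phi (uncertainty) grows."""
    return 1.0 / math.sqrt(1.0 + (3.0 * phi ** 2) / (math.pi ** 2))

def expect_score(mu, other_mu, impact):
    """Expected score (0 to 1) for the player rated mu."""
    return 1.0 / (1.0 + math.exp(-impact * (mu - other_mu)))

# Hypothetical match-up: a 1600 player against a 1500 player
mu = (1600 - 1500) / GLICKO2_SCALE
other_mu = (1500 - 1500) / GLICKO2_SCALE

for other_deviation in (30, 150, 350):
    impact = reduce_impact(other_deviation / GLICKO2_SCALE)
    expected = expect_score(mu, other_mu, impact)
    print(f"opponent deviation {other_deviation}: impact {impact:.3f}, expected {expected:.3f}")
```

The less we know about the opponent, the closer the expected score drifts towards 0.5, even though the gap in μ never changes.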
Consider the following scenarios:
- Scenario A
- Impact = 0.80 // Player 1 μ = 1600 // Player 2 μ = 1500
- Difference in μ = 1600 – 1500 = 100
- Weighted Difference = 100 * 0.8 = 80
- Scenario B
- Impact = 0.64 // Player 1 μ = 1900 // Player 2 μ = 1775
- Difference in μ = 1900 – 1775 = 125
- Weighted Difference = 125 * 0.64 = 80
These 2 scenarios are considered equivalent in terms of the difference in player ability, because the higher φ scores that produce the lower impact of 0.64 in Scenario B dampen the effect of the larger gap in μ.
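The two scenarios can be checked with a short sketch. The impact values are the ones given above; dividing by 173.7178 to convert the rating gap to Glicko-2’s internal scale is an assumption taken from Glickman’s paper, and `expected_from_gap` is a hypothetical helper.

```python
import math

GLICKO2_SCALE = 173.7178  # assumed rating-to-internal-scale conversion

def expected_from_gap(rating_gap, impact):
    """Return (weighted gap, expected score) for the higher rated player."""
    weighted_gap = impact * rating_gap
    expected = 1.0 / (1.0 + math.exp(-weighted_gap / GLICKO2_SCALE))
    return weighted_gap, expected

scenario_a = expected_from_gap(1600 - 1500, 0.80)
scenario_b = expected_from_gap(1900 - 1775, 0.64)

print(f"Scenario A: weighted gap {scenario_a[0]:.1f}, expected score {scenario_a[1]:.3f}")
print(f"Scenario B: weighted gap {scenario_b[0]:.1f}, expected score {scenario_b[1]:.3f}")
```

Since the weighted gaps match, the expected scores match too, so a win or loss in either scenario is treated as equally surprising.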
Now we have the expected score we need the actual score, which is a simple tally of the result:
- Win = 1.0
- Draw = 0.5
- Loss = 0.0
Finally we compare the expected score with the actual score and then use the difference to adjust the μ, with the φ also reducing as an additional game has been played.
def rate(self, rating, series):
    # Excerpt only: the loop over the games in series and the φ update are not shown
    variance_inv += impact ** 2 * expected_score * (1 - expected_score)
    mu = rating.mu + phi ** 2 * (difference / variance)
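For readers who want to see the whole step in one place, here is a simplified sketch of a single rating update following Glickman’s Glicko-2 paper. It holds the volatility σ fixed instead of re-estimating it, so it is not the full algorithm and not the NAF implementation. The player and opponents are the worked example from that paper, and the function names are my own.

```python
import math

GLICKO2_SCALE = 173.7178  # conversion constant from Glickman's paper

def impact_of(phi):
    """Weighting of a game given the opponent's uncertainty phi."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi ** 2 / math.pi ** 2)

def rate(mu, phi, sigma, series):
    """One update with sigma held fixed (a simplification of Glicko-2).

    series: list of (actual_score, opponent_mu, opponent_phi).
    """
    phi_star = math.sqrt(phi ** 2 + sigma ** 2)  # uncertainty grows before games count
    if not series:
        return mu, phi_star
    variance_inv = 0.0
    difference = 0.0
    for actual, other_mu, other_phi in series:
        impact = impact_of(other_phi)
        expected = 1.0 / (1.0 + math.exp(-impact * (mu - other_mu)))
        variance_inv += impact ** 2 * expected * (1 - expected)
        difference += impact * (actual - expected)
    new_phi = 1.0 / math.sqrt(1.0 / phi_star ** 2 + variance_inv)
    new_mu = mu + new_phi ** 2 * difference
    return new_mu, new_phi

def to_internal(rating, deviation):
    return (rating - 1500) / GLICKO2_SCALE, deviation / GLICKO2_SCALE

# Worked example from Glickman's paper: a 1500 player (deviation 200)
# beats a 1400 player, then loses to a 1550 player and a 1700 player.
mu, phi = to_internal(1500, 200)
games = [
    (1.0, *to_internal(1400, 30)),
    (0.0, *to_internal(1550, 100)),
    (0.0, *to_internal(1700, 300)),
]
new_mu, new_phi = rate(mu, phi, 0.06, games)
new_rating = new_mu * GLICKO2_SCALE + 1500
new_deviation = new_phi * GLICKO2_SCALE
print(f"New rating {new_rating:.1f}, new deviation {new_deviation:.1f}")
```

The rating drops after one win and two losses, and the deviation shrinks because three more games of evidence are now on record.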
And that’s basically it! The games are all processed, the scores saved, and then uploaded to the NAF website. There’s more detail behind the scenes (I’ve just given the highlights in the code that I have copied into this article), but that should give everyone an idea of how it is calculated. If you have any more questions please do visit the forums and ask there! Or for those that are brave, you can go directly to Mark Glickman’s paper on the subject: http://www.glicko.net/glicko/glicko2.pdf