Heretic's Corner - Dealing with the Rating System

Magic has used the ELO* rating system for some time now, for reasons I have yet to understand. I don’t hold that opinion out of some deep-seated frustration born of not understanding ELO. In fact, I understand it about as well as anyone, as will be made evident throughout this article. It is precisely because I do understand it that I think it is inappropriate for Magic.

Ratings are adjusted after each match based on a formula that compares how well you did versus how well the formula predicted you would do based on the difference between your rating and that of your opponent. If you exceeded expectation, your rating is assumed to be too low and gets adjusted up, whereas if you did worse than expected, your rating is assumed to be too high and gets adjusted down. This is eventually supposed to settle down to the point where you hover around your “true rating,” which is supposed to be a measure of your skill level.

Even if we assume that a person’s skill level doesn’t change that much, the above is ludicrous. Take three players, Alan, Barry, and Chris. They begin, as every new DCI player does, with a rating of 1600. Let’s assume that somehow the only sanctioned matches Alan and Chris have are against Barry, and that they play enough matches so that all three players’ ratings stabilize. Barry wins about 76% of his matches against Alan, and Chris wins about 76% of his matches against Barry. Over time, the ratings will stabilize to around 1400 for Alan, 1600 for Barry and 1800 for Chris. So far, so good.
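For readers who want to check the percentages, here is a minimal sketch. It assumes the standard Elo expectancy curve (a logistic function on a 400-point scale); the article does not print the DCI's exact formula, so treat `expected_score` as an illustrative name and the curve as an assumption.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo win expectancy for `rating` against `opponent` (assumed form)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


# Stabilized ratings from the Alan/Barry/Chris example.
ratings = {"Alan": 1400, "Barry": 1600, "Chris": 1800}

barry_vs_alan = expected_score(ratings["Barry"], ratings["Alan"])
chris_vs_barry = expected_score(ratings["Chris"], ratings["Barry"])
chris_vs_alan = expected_score(ratings["Chris"], ratings["Alan"])

print(f"Barry vs Alan:  {barry_vs_alan:.2f}")
print(f"Chris vs Barry: {chris_vs_barry:.2f}")
print(f"Chris vs Alan:  {chris_vs_alan:.2f}")
```

A 200-point gap works out to roughly 76%, and the 400-point gap between Chris and Alan, two players who never met, is what the formula turns into a prediction of its own.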

The problem is that the system carries several implications, none of which are justified. The first is that Chris will win, according to the formula, 91% of his matches against Alan. This assumption is dubious at best. Next, once another person enters the mix, the chances of the formula being able to predict performances against every other player go down dramatically. With such a vast database of players (over 180,000 people in the DCI global rankings), the chances that the active players’ ratings will actually converge at all are virtually nonexistent, even if we assume that only 1% (that is, 1,800 players) actually play on a regular basis and that we only care about the system converging for these players. Therefore, the very assumptions that the ELO system is built around fail to hold even under ideal conditions.

The factor that determines the rate of adjustment is called the K-value. This is the greatest amount by which a single match can change a rating. When the two ratings are near one another, the change will be close to half the K-value; changes equal to the full K-value only occur in cases of a complete upset (in a 32K event, that happens when an underdog who is 720 points or so lower than his or her opponent wins). However, where are people most likely to have ratings that are vastly different from their “true rating”? Friday Night Magic events. But those are also the events that have the lowest K-values, so a person who plays only in FNM events and is head and shoulders above the competition will take forever to climb to something close to a true rating. How long? Assuming every match is against a 1600 player, it would take a 73-0 match record to reach 1799. By comparison, a person who wins a 40K event (Grand Prix or the like) against all 1600-rated players with a 17-1-1 record will reach about 1799.** But aren’t events like that usually populated by people who have been around for so long that their ratings should be established by now? So the events where the ratings can change rapidly are populated by the people whose ratings are likely to be close to their true rating, whereas the events that don’t change ratings much have the people who are the most likely not to be near their true rating.
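The slow climb can be sketched in a few lines. This assumes the standard Elo update (rating change = K times actual result minus expected result), a K-value of 8 for Friday Night Magic, and no rounding between matches; none of those details are spelled out here, so the count it prints may differ by a match or two from the figure quoted above. `wins_needed` is a hypothetical helper name.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo win expectancy (assumed form)."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def wins_needed(target: float, k: float,
                start: float = 1600.0, opponent: float = 1600.0) -> int:
    """Consecutive wins against a fixed-rating opponent needed to reach `target`."""
    rating, wins = start, 0
    while rating < target:
        # A win scores 1, so the gain is K * (1 - expected score).
        rating += k * (1.0 - expected_score(rating, opponent))
        wins += 1
    return wins


fnm_wins = wins_needed(1799, k=8)    # assumed FNM K-value
gp_wins = wins_needed(1799, k=40)    # 40K event, as in the Grand Prix example

print(f"K=8:  {fnm_wins} straight wins to reach 1799")
print(f"K=40: {gp_wins} straight wins to reach 1799")
```

The first win at K=8 is worth only 4 points, and each later win is worth less as the rating rises, which is why the low-K climb takes so long.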

And just what is a person’s true rating, especially in Constructed? Suppose a person plays well enough to earn a rating of 1840. This is about an 80% win percentage against an “average field” of 1600. Now assume that the format rotates and this person’s killer deck rotates out. Should this person’s rating drop if the new state of the format is such that no deck and no player can post better than a 70% win percentage against the field? Even a 70% win percentage against an average field drops the “true rating” around which this person’s rating will hover by nearly 100 points (1747 to be exact).
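The conversion from win percentage to rating can be checked by inverting the expectancy curve. This sketch again assumes the standard Elo logistic form on a 400-point scale; `rating_gap` is an illustrative name, not anything the DCI publishes.

```python
import math


def rating_gap(win_rate: float) -> float:
    """Rating difference at which the assumed Elo curve predicts `win_rate`."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


field = 1600  # the "average field" used in the example
for p in (0.80, 0.70):
    print(f"{p:.0%} against {field}: settles near {field + rating_gap(p):.0f}")
```

Under these assumptions an 80% win rate sits about 240 points above the field and a 70% win rate about 147 points above it, so the same player loses roughly 90 to 100 points without playing any worse.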

Another consideration is that ratings only change when there are new matches to put in them. For reasons that my memory is too frail to remember, the qualifier tournaments for PT Paris in 1997 counted in the Eternal ratings. I did well enough there to post an above average rating during that season. But since I haven’t played in an Eternal event before or since, my rating will never change. Now take the person who builds up a high enough rating to get invites he or she never uses. You look at the ratings and see “1950” or whatever year after year. Is the person sitting on the rating or maintaining it through new matches? Unless you happen to know the person in question and can ask directly, how would you be able to tell? Meanwhile, if you have a situation where only the top X players get invites, having someone sit on ratings makes it harder for everyone else.

Besides, if the ratings were useful, wouldn’t Pro Player awards be based on them? Instead, we track lifetime and annual points based on performance in certain events. Lifetime points are used to determine some things, such as Hall of Fame eligibility, whereas annual points are used for Player of the Year awards. This system seems to work well enough there, and I am convinced that a modified version of that system can work in general.

In this grand plan, instead of a person beginning at 1600 and being adjusted up or down on a per-match basis, you would begin at 0 and move up based on performance. (I will refer to this as an additive system, because any adjustment can only add points to your rating.) Ratings adjustments can be calculated on the spot, with no knowledge of any other player’s ratings needed. There is never a time when your lifetime ratings drop.

To handle the problem of people who earned points eons ago, it would be easy enough to base rating invites on how many points were earned in the last year. But how do you “discount” older performances in an ELO system? Chess awards bonus points if recent performances were far in excess of predictions, and Arimaa assigns a rating uncertainty (K-value) based on how recent your latest activity has been, but neither one solves the issue of removing the older ratings.

In an additive rating system, it makes sense to assign bonus multipliers for high-profile events. Whereas moving 200 points in ELO rating might be impressive for randomly winning a Grand Prix, it doesn’t reflect the true value of the win, and that rating boost will get demolished if a losing streak hits. Of course, you still get Pro Points for it, but having a 980 (or 990) point addition to a rating that starts at 0 says a lot more, and can never be taken away.

The possibilities for people to find reasonable goals based on rating multiply. What if we switch over and, by applying the new formula to the matches already in the database, we find some big-name player at a “low” rating of 2850? Now imagine someone seeing that and making it a goal to earn that many rating points in a year. That could prompt someone into entering a late-year Grand Prix just to get the remaining 30-50 points to achieve this goal, or it could spark a race between two people on opposite sides of Europe trying to compete for the Continental Amateur of the Year. Awards such as Pro Player of the Year are intelligible because the Pro Points are additive, and doing the same for everyone can only improve matters. If it’s done right, Pro Player Points could become just a subset of standard Rating Points. We can even keep the distinction between Limited and Constructed (and Eternal, if we want) and even break it down by format.

(For example, “Joe Gamerdude is a father of three kids and racked up 2,500 rating points in Kamigawa Booster Draft. His hobbies include swimming in Cloudcrest Lake, throwing a sink into Takenuma and beating people up.”)

In a world of additive ratings, you can never be fully assured that you will remain on the top of any lifetime ranking list for any decent period of time. Perhaps I play everything I can to get to 75,000 points (and a supposedly insurmountable points lead in my state) by the end of the decade. What’s to stop another person from doing the same thing to beat me? Let a high additive rating sit and it becomes a target.

So how would I award points? I’m so glad you asked. I created a system based on the original ranking formula used by Wizards in 1994 before ELO was adopted. Since this is more ancient than 99.9% of the current Magic community, I will explain the old system.

Back in those days, all sanctioned Magic tournaments were single elimination. (On another note, lobbying for other formats, such as Swiss events, was my first entry into Magic heresy.) Everyone who played in the first round earned 10 rating points. Those who survived and played in the second round earned an additional 20, for a total of 30. Likewise, those who played in the third round earned another 30 points, and so on. The winner of the event was treated as playing in an extra round alone, so a person who won a 16-person single elimination event (4 rounds to eliminate everyone) would have earned 150 points (10 + 20 + 30 + 40 + 50).

The first thing to note about this system is that every player who entered got points, even those eliminated in the first round. The second thing to note (if you’re a math geek like me) is that the points any person earns in the event are 10 times a number you would get by adding successive integers, so the formula is well-known. So taking the successive integer formula and multiplying the result by 2 is exactly what my formula says people in a “perfect” single elimination (i.e., one with exactly 8 players, or 16, or 32, etc. so that no one gets a bye) will receive, if we plug in the number of rounds the player plays. It is also 1/5 of the number of points the original DCI system would have awarded.
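Written out, with n standing for the number of rounds a player is credited with (a symbol introduced here for illustration; the winner is credited with one extra round), the relationship between the two awards is:

```latex
\text{old award} = 10 \sum_{i=1}^{n} i = 5\,n(n+1),
\qquad
\text{new award} = 2 \sum_{i=1}^{n} i = n(n+1) = \tfrac{1}{5}\,(\text{old award}).
```

For the winner of a 16-person event, n = 5, which gives 150 points under the old system and 30 under the new one.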

Rather than bore you with the details of how that formula morphed into what I have now, I will give you my formula and show you how the values coincide. I would derive for each player two numbers: W, which is the number of matches the person won plus 1, and V, which is a value which starts at 2 and goes up by one each time you advance into a new bracket. I would then award W x V rating points. This produces for the 16-person single elimination event:

Place   Wins   W   V   My award   Old award
9-16    0      1   2   2          10
5-8     1      2   3   6          30
3-4     2      3   4   12         60
2       3      4   5   20         100
1       4      5   6   30         150

As you can see, each award I give is exactly 1/5 of that from the old system.
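The table can be regenerated for any "perfect" bracket size with a short script. This is only a sketch of the rules as stated (W is wins plus 1, V starts at 2 and rises by one per bracket); the function names are made up for illustration.

```python
def place_label(rounds: int, wins: int) -> str:
    """Finishing places for a player who leaves the bracket with `wins` wins."""
    if wins == rounds:
        return "1"
    low = 2 ** (rounds - wins - 1) + 1
    high = 2 ** (rounds - wins)
    return str(low) if low == high else f"{low}-{high}"


def perfect_elimination_awards(rounds: int):
    """Rows of (place, wins, W, V, new award, old award) for 2**rounds players."""
    rows = []
    for wins in range(rounds + 1):
        w = wins + 1                       # W: matches won plus 1
        v = wins + 2                       # V: starts at 2, +1 per bracket reached
        old = 10 * sum(range(1, w + 1))    # old system: 10 + 20 + ... per round
        rows.append((place_label(rounds, wins), wins, w, v, w * v, old))
    return rows


for row in perfect_elimination_awards(4):  # 16-person event
    print(*row, sep="\t")
```

Calling `perfect_elimination_awards(5)` gives the corresponding rows for a 32-person event.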

For imperfect single elimination, the lowest V is based on which “perfect” elimination value the number of players is closer to. So an 11-person event is closer to 8 (the lower bound) than 16 (the upper bound), but a 13-person event is closer to the upper bound. Those values that are exactly in between (12, for example) are treated as being closer to the upper bound. The “bottom” V for upper bounded events is again 2, whereas it drops to 1 for lower bounded events. I’ll show you a 20-person single elimination in the same format as above. As there are byes, I’ll give two lines where a bye is involved.

Place          Wins   W   V   My award   Old award
17-20          0      1   1   1          10
9-16 (bye)     0      1   2   2          20
9-16 (no bye)  1      2   2   4          30
5-8 (bye)      1      2   3   6          50
5-8 (no bye)   2      3   3   9          60
3-4 (bye)      2      3   4   12         90
3-4 (no bye)   3      4   4   16         100
2 (bye)        3      4   5   20         140
2 (no bye)     4      5   5   25         150
1 (bye)        4      5   6   30         200
1 (no bye)     5      6   6   36         210

This abandons the 1/5 rule above in favor of trying to balance the event against the ones we have a good formula for. When the number of players is close to a lower bound, there are many people who get a first-round bye. This system treats those people exactly the same as if they had played in the “perfect” elimination with the lower number of people (16 in this case). Those that don’t have byes are treated slightly worse off than would be the case if they had been in the event with more players, but then again, they are likely to have faced one or more people who did have byes, so this balances. By similar argument, setting the lowest value of V at 2 for those events closer to the upper bound balances most people (who in this case are the ones without a bye) against those in the “perfect” event where people play the same number of rounds they did, at the “expense” of giving the players with byes a small points bonus.
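One way to mechanize the "closer to which bound" rule is sketched below. It is an interpretation of the rules as described, not a reference implementation: `bottom_v`, `tier`, and `v_for_place` are hypothetical helpers, and ties between the bounds go to the upper one as stated.

```python
def bottom_v(players: int) -> int:
    """Lowest V in the event: 1 if closer to the lower power of two, else 2."""
    lower = 1 << (players.bit_length() - 1)   # largest power of two <= players
    upper = 2 * lower
    if players == lower:
        return 2                               # "perfect" bracket
    return 1 if players - lower < upper - players else 2


def tier(place: int) -> int:
    """0 for 1st, 1 for 2nd, 2 for 3rd-4th, 3 for 5th-8th, and so on."""
    return (place - 1).bit_length()


def v_for_place(place: int, players: int) -> int:
    """V rises by one for each bracket above the bottom tier."""
    return bottom_v(players) + tier(players) - tier(place)


def award(place: int, wins: int, players: int) -> int:
    return (wins + 1) * v_for_place(place, players)   # W x V


# 20-person event: a 9th-16th finisher with and without a first-round bye.
print(award(place=16, wins=0, players=20))   # bye, lost first real match
print(award(place=16, wins=1, players=20))   # no bye, won round 1 then lost
```

Byes need no special handling here: a bye simply means one fewer win for the same finishing tier, which is exactly how the two rows per place differ in the table.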

Now we have the tools to expand this system to every tournament format. We can use the basis of the elimination charts generated above to determine each player’s V for the tournament based on their finish. In the rare event that two people are legitimately tied for a position, we average the V those finishes would receive. We know how many matches the person won, so W is easy to find as well. One simple multiplication and we’re done.

There are only three real adjustments that I make at this point. The first adjustment is for big events. Big events should be worth a lot more than small events, even if the small event has the same number of people. A 32-person Qualifier Tournament should be more impressive than a 32-person Friday Night Magic. As a simple means of accomplishing this, I would take an event’s current K-value and divide by 8. This makes the formula for any player in any place in any event K/8 x W x V. If this were to be adopted, the inflated K-values would no longer be needed and could be refigured to deal with the new system, so that Worlds could count as a “6K” event, meaning that the awards would be 6 x W x V.

The second concerns drops. In the elimination format, drops are not an issue, as the format automatically drops people when they lose. But in a Swiss, a person who loses the first round can go on to win the event, and people can drop after any round. Notice, too, that our W was figured as one plus the number of matches won. In terms of ratings, the fact that each match win increases your W, and may raise your standing to the point where your V increases, might lead more people to remain in events past their viability for prizes, especially if the K is big. This is something I personally like, so I offer the further incentive to stay by giving the +1 in W only if the player stays until his or her natural end in the event. If you drop, you lose part of the W component of the rating award you would have received had you stayed to the end. I know this is the opposite of the position held by most Tournament Organizers I know, but that discussion is for a future Heretic’s Corner, if I choose to go into it at all.
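Putting the first two adjustments together, an award calculation might look like the sketch below. The function and parameter names are illustrative, and the K-values used in the example (8 for Friday Night Magic, 32 for a Qualifier) are assumptions rather than figures taken from a DCI table.

```python
def rating_award(wins: int, v: int, k_value: int = 8, dropped: bool = False) -> float:
    """K/8 x W x V, where the +1 in W is forfeited by dropping early."""
    w = wins + (0 if dropped else 1)
    return (k_value / 8) * w * v


# Same record (3 wins) and same finishing tier (V = 4) in three situations.
print(rating_award(wins=3, v=4, k_value=8))                # assumed FNM K-value
print(rating_award(wins=3, v=4, k_value=32))               # assumed Qualifier K-value
print(rating_award(wins=3, v=4, k_value=8, dropped=True))  # dropped before the end
```

Note that a player who drops would usually also finish lower in the standings, so in practice V may shrink as well; this sketch holds V fixed to isolate the effect on W.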

The third issue is that of draws. Most rating systems count draws at half the rate of wins. For standing purposes, draws only count as 1/3 of a win at DCI events. So the question that needs to be asked is, do we want to extend this to the rating awards?

The problem is the degree to which fractions enter the system. Count draws as losses and you still get random half-points from tied V scores. Count them as 1/2 of a win and quarter-points enter. But count them as 1/3 of a win, and you have the real possibility that someone might end up with an award of 5.8333 rating points or some such nonsense. This can be eliminated by multiplying the awards by a certain factor (no more than 6), but whether such an adjustment is worth making is a topic I am completely ambivalent about.
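Exact fractions make the problem easy to see. The sketch below uses Python's `fractions` module and a made-up player (1 win, 1 draw, finishing in a two-way tie whose averaged V is 2.5); the multiplier K/8 is taken as 1, and the helper name is hypothetical.

```python
from fractions import Fraction


def award(wins: int, draws: int, v: Fraction, draw_weight: Fraction) -> Fraction:
    """W x V with draws counted as `draw_weight` of a win (W still gets its +1)."""
    w = wins + draws * draw_weight + 1
    return w * v


tied_v = Fraction(5, 2)  # average of V = 2 and V = 3 for a tied finish

for label, weight in (("loss", Fraction(0)), ("1/2 win", Fraction(1, 2)), ("1/3 win", Fraction(1, 3))):
    points = award(wins=1, draws=1, v=tied_v, draw_weight=weight)
    # The denominator is the smallest factor that makes this award a whole number.
    print(f"draw as {label}: {points} points, scale by {points.denominator} to clear the fraction")
```

With draws at 1/3 of a win this player earns 35/6 points, which is the 5.8333 mentioned above, and a factor of 6 is what it takes to clear it.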

So a 16-person Friday Night Magic running 4 rounds of Swiss with no Top 8 will award the first-place finisher 5 (4 wins plus 1 makes a W of 5) times 6 (9th-16th is the lowest tier, which gets 2; 1st place is 4 tiers higher, hence V for this person is 6) for a total of 30 rating points. By the same token, a person who goes 16-1-1 and wins a 400-person Grand Prix will get 980 or 990 points***, assuming a K/8 of 5.

This system is fully adjustable based on what people would want to emphasize. If the consensus opinion doesn’t like the idea of everyone who stays in an event getting points, the system could be adjusted to only award non-zero V values to people finishing in certain positions, such as Top 8 or “1/5 of those who entered, rounded up” or whatever. Maybe the starting V is too low and should be raised to 3 or 4. Maybe some event should offer 10 times normal points. Whatever ends up being decided is fine with me, as long as everyone follows the final plan. But to me, sticking with an ELO system that flies in the face of Magic reality is worse than any additive system we can devise.

*The ELO rating system is named for Arpad Elo, a mathematician who designed the basic formula for chess. (By convention, ELO [in all caps] is the name of the rating system and “Elo” refers to the mathematician, not that the Magic Floor Rules care.) The formula used is one of a class of functions that have certain mathematical properties: 1) Every rating difference must produce a Win Expectancy (WE) number between (but not equal to) 0 and 1. 2) My WE versus you plus your WE versus me must add up to exactly 1. 3) Every WE between 0 and 1 must have a rating difference associated with it, if you allow fractional ratings. 4) The function is always increasing (i.e., the more I am above your rating, the higher my WE will be).

**The spreadsheet I am sending along with this confirms this, given the assumptions listed.

***980 if draws count as 1/3, or 990 if they are counted as 1/2 of a win. The reader is encouraged to figure out what W and V are for this example.