Thursday, March 10, 2016

Under the Hood: Formula One Elo Ratings

Over the last several years, Elo ratings have expanded beyond chess to become a popular framework for calculating power ratings for teams in the NBA and NFL, as well as international baseball, soccer, and hockey teams. Elo provides an excellent framework for initial analysis of teams or players, especially when the contest involves plays, moves, or strategy that are difficult to quantify at a fine granularity. Some sports for which there are Elo ratings -- such as basketball and baseball -- have been analyzed in more detail, and we now have a better understanding of how to evaluate both teams and individuals.

And some websites just like to churn out Elo rating
systems because they say "Elo" around the office
the same way Kristian Nairn says "Hodor" at work.
However, there are many sports -- notably hockey, soccer, and football -- which are more difficult to pick apart. Football and soccer have more players than basketball or baseball. Soccer and hockey don't really have solidly defined possessions in the same way as baseball or basketball. To a first approximation, the outcome of a game -- and its score -- is what we have to go on.

This is where Elo comes in.
Elo is well suited for these situations for a handful of reasons. Elo rating systems
  • only need to know who won and who lost;
  • can provide win probabilities prior to a game; and
  • work well in a (relatively) closed system in which there are a fixed number of teams or players.

Formula One

In this regard, Elo ratings are a good framework for approaching Formula One. The question is how to structure the system and choose certain parameters to handle the nature of Formula One. Elo rating systems have a few parameters to suss out, including
  • the number of initial points each participant gets;
  • the magnitude of points transferred from loser to winner (the K-factor); and
  • the potential separation of teams or players into divisions.
Formula One is also potentially vulnerable to Elo inflation. This occurs when marginal players or teams overperform for long enough to be promoted into the professional tier, lose repeatedly (thereby dumping their points into the pool at the top tier) and then retire without having claimed any points. Over time this leads to a situation in which top players or teams have significantly higher ratings than their equally-talented counterparts from previous years. It becomes difficult -- if not impossible -- to compare players or teams between eras.

The specific application of Elo to Formula One is this: every race is a round-robin tournament in which you compete with every other driver. Therefore if you finish third in a field of 18 cars, you are said to have tallied 15 wins (the cars behind you) and two losses (the cars in front of you). Furthermore, because the data set does not indicate whether a DNF is due to a crash (presumably partially the driver's fault) or a mechanical failure (presumably the car's fault), drivers who do not finish the race are treated as if they did not compete at all.
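
As a rough illustration of that bookkeeping, here is a minimal Python sketch. The function name and the input format (a list of classified finishers, first to last) are my own assumptions for the example, not the code behind the ratings. DNFs are handled simply by leaving those drivers out of the list.

```python
def round_robin_scores(finish_order):
    """Map each classified driver to (wins, losses) for a single race.

    finish_order lists drivers from first to last. Drivers who did not
    finish are left out entirely, so they neither win nor lose contests.
    """
    n = len(finish_order)
    # A driver in position `pos` (0-based) beat everyone behind them
    # and lost to everyone in front of them.
    return {driver: (n - 1 - pos, pos) for pos, driver in enumerate(finish_order)}


# Hypothetical 18-car field in which every car finished.
field = [f"car{i}" for i in range(1, 19)]
scores = round_robin_scores(field)
wins, losses = scores["car3"]  # third place: 15 wins, 2 losses
```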

K-Factors

In the Elo system, the K-factor controls the magnitude of the rating swing between winners and losers. A large K-factor incorporates new information rapidly, while a small K-factor resists overreacting to new results. The World Chess Federation (FIDE) uses a range of factors depending on the experience of the players; new players use a factor of 40, while experienced players can use a factor as small as 10. The NBA Elo model at FiveThirtyEight uses a fixed K-factor of 20, which incorporates new results at a relatively high rate.

Ultimately we settled on a modified version of a K-factor scale formerly used by the U.S. Chess Federation:

K = AdjType * max((800 / Ne), 16)

In this formulation, Ne is the total number of races in which a driver has participated. The K-factor has a floor of 16, which a driver reaches after 50 races (800 / 50 = 16). In practice this means that during the first two-and-a-half seasons of a driver's Formula One career they will have a relatively large K-factor.

On top of the experience adjustment, there is a multiplier based on the type of result: was this a qualifying session, or a full race? The race adjustment is simply 1.0, whereas the multiplier for qualifying is 0.1. Qualifying involves only a handful of laps, but has the advantage of reliably including every car in the field. This is especially important when we go back more than twenty years, when anywhere from 20-40% of the cars would fail to finish due to mechanical failures.
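
The formula above is short enough to sketch directly. The names below are illustrative, and the guard against a driver with zero races is my own assumption, since the formula is undefined there. This sketch covers only the experience scale and the result-type multiplier, not the separate rookie rule discussed under Elo Inflation.

```python
RACE = 1.0        # multiplier for a full race (from the text)
QUALIFYING = 0.1  # multiplier for a qualifying session (from the text)


def k_factor(races_entered, adj_type=RACE):
    """K = AdjType * max(800 / Ne, 16), with Ne = races_entered."""
    if races_entered < 1:
        # Assumption: count the current race, so Ne is at least 1.
        raise ValueError("races_entered must be at least 1")
    return adj_type * max(800 / races_entered, 16)


k_new = k_factor(25)                  # 800 / 25 = 32, still above the floor
k_veteran = k_factor(100)             # 800 / 100 = 8, so the floor of 16 applies
k_quali = k_factor(100, QUALIFYING)   # one tenth of the race value
```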

Adjustments

Once we have a K-factor for each driver, we can adjust the ratings of each driver based on the results of a race or full qualifying session. For each pair of drivers A and B we can calculate the expected probability that A would defeat (or finish in front of) B. (The details of this model can be found on the Elo wiki page.) For each driver we can calculate these odds and then sum them up. This gives us the expected score, while the result gives us the actual score. A driver's rating is adjusted as follows, where Ra is the current rating, Sa the actual score, Ea the expected score, and R'a the new rating:

R'a = Ra + K(Sa - Ea)

Let's say that a driver with a rating of 1200 races against drivers with ratings of 1000, 800, and 600. The sum of the individual win probabilities would be 2.638. If that driver has a K-factor of 16 and finishes first (i.e., wins 3.0 contests), their new rating would be

1200 + 16(3.0 - 2.638) = ~1206

However if the driver with a rating of 600 (and similar experience) wins, their new score would be

600 + 16(3.0 - 0.362) = ~642
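
If you want to check those numbers yourself, here is a small Python sketch. It assumes the standard Elo logistic curve with a 400-point scale, which is consistent with the 2.638 figure above; the function names are mine and purely illustrative.

```python
def win_probability(rating_a, rating_b):
    """Standard Elo estimate of the chance that A finishes ahead of B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def expected_score(rating, opponents):
    """Sum of pairwise win probabilities against every other driver."""
    return sum(win_probability(rating, r) for r in opponents)


def updated_rating(rating, k, actual, expected):
    """R'a = Ra + K(Sa - Ea)."""
    return rating + k * (actual - expected)


# The four-driver example from the text, K-factor of 16 for both drivers.
e_top = expected_score(1200, [1000, 800, 600])     # about 2.638
e_bottom = expected_score(600, [1200, 1000, 800])  # about 0.362
top_wins = updated_rating(1200, 16, 3.0, e_top)        # about 1206
bottom_wins = updated_rating(600, 16, 3.0, e_bottom)   # about 642
```

Note that the two expected scores sum to 3.0, the number of contests each driver takes part in, which is why the upset is worth so much more than the favorite's win.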

Sharp-eyed readers may have noticed that this rebalancing is only point-neutral if all drivers have the same K-factor. However this is effectively never the case, as there will always be drivers in the point-swing-heavy early portions of their careers. In order to get everything to balance out, we calculate an adjustment ratio for each race. That is, we take the sum of the total points of all drivers before the race and divide it by the sum of the total points of all drivers after the race. We then multiply each driver's post-race rating by this ratio, preserving the total number of points in the pool after each race.
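
A minimal sketch of that rebalancing step, assuming ratings are held in dictionaries keyed by driver and that the same drivers appear before and after the race (the names here are hypothetical):

```python
def rebalance(pre_race, post_race):
    """Scale post-race ratings so the pool total matches the pre-race total."""
    ratio = sum(pre_race.values()) / sum(post_race.values())
    return {driver: rating * ratio for driver, rating in post_race.items()}


# Two drivers with different K-factors: the raw update created 5 extra points.
pre = {"A": 1200.0, "B": 1000.0}
post = {"A": 1210.0, "B": 995.0}
balanced = rebalance(pre, post)  # totals 2200 again, up to rounding
```

Because every rating is multiplied by the same ratio, the finishing order of the ratings is unchanged; only the size of the pool is corrected.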

Elo Inflation

However we still run into the situation where young drivers are brought into Formula One for a season (or even a few races), fail miserably, dump all their points into the pool, and exit. There are two main ways we fight inflation:

  1. Revert each driver back to the mean slightly at the end of each year; and
  2. Use a smaller, fixed K-factor for new drivers for a number of races.

The first approach I blatantly stole (er, borrowed) from the FiveThirtyEight NBA Elo ratings.
Love you guys!
This forces everyone back to the mean slightly each season, preventing long-term inflation.
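
For the curious, reversion to the mean looks roughly like the sketch below. The post does not say how strong the pull is, so the `fraction` value is a placeholder I made up for illustration, and the mean of 1000 is an assumption based on the "Average" reference point at the end of this post.

```python
def revert_to_mean(ratings, mean=1000.0, fraction=0.1):
    """Pull every rating part of the way back toward the mean.

    fraction=0.1 is a hypothetical value, not the one used for the
    actual ratings; mean=1000.0 is likewise an assumption.
    """
    return {driver: rating + fraction * (mean - rating)
            for driver, rating in ratings.items()}


end_of_season = {"A": 1500.0, "B": 500.0}
next_season = revert_to_mean(end_of_season)  # A drops to about 1450, B rises to about 550
```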

The second approach limits the rate at which new drivers can bleed points into the pool until they've been around for a little while. Currently both the K-factor and the minimum number of races are set at 16, meaning that for a driver's first season they're somewhat capped in how many points they can lose. Note that because of how Elo works, though, excellent rookie drivers such as Villeneuve or Hamilton can still climb the ranks at the same rate as experienced drivers.

Putting it all together

In the end, after crunching 36 years of Formula One data ... we'll get to some real fun stuff tomorrow. But for now here are some numbers you can use as a reference point when looking at long-term Elo scores:

1700  Legendary season
1600  Title contender
1500  Regular podiums
1250  Regular points
1000  Average

With that in mind, let the ratings begin!