Tuesday, November 17, 2009

Introducing Regression-Based Analysis

I'd like to thank Justin for the opportunity to share my competing algorithm with the TFG readers. I have taken a different path deliberately to explore different ways of predicting college football games. I'm a big fan of the tempo-free statistics, but I thought of several components that Justin's algorithm implicitly neglects. In particular, the TFG method focuses solely on teams' average performance rather than their variances. My regression-based algorithm (RBA) explicitly models a team's offensive and defensive performance as a function of opponent strength and places a premium on turnovers and penalties, which can turn the tide of a single-game contest. (See Arkansas-Florida 2009, where Arkansas forced four turnovers to get within striking distance and got absolutely boned by penalties late in the 4th.)
I agree with Justin that all statistics need to be scaled by the number of plays to generate a tempo-free statistic. Unless otherwise specified, all the statistics in this description are expressed in points per 100 plays. The last step of the algorithm scales the tempo-free point total by the expected number of plays.
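To make the scaling concrete, here is a minimal sketch of converting a raw score to points per 100 plays and back again. The function names and the numbers are made up for illustration; they are not taken from the actual RBA code.

```python
def per_100_plays(points, plays):
    """Convert a raw point total into points per 100 plays."""
    if plays <= 0:
        raise ValueError("plays must be positive")
    return 100.0 * points / plays


def rescale_to_game(points_per_100, expected_plays):
    """Scale a tempo-free rate back to a point total for one game."""
    return points_per_100 * expected_plays / 100.0


# Hypothetical game: 28 points on 70 plays.
rate = per_100_plays(points=28, plays=70)
# Project that rate onto a game expected to last 65 plays.
projected = rescale_to_game(rate, expected_plays=65)
```

With these made-up inputs the rate works out to 40 points per 100 plays, which projects to 26 points over 65 plays.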
Estimating Offensive and Defensive Efficiency
My algorithm begins by computing offensive and defensive statistics as a function of opponent strength. For simplicity, I will only discuss the offensive efficiency because the same computations are used to compute the defensive efficiency. I approximate the offensive efficiency using least-squares approximation because it minimizes mean-square error. We model the offensive efficiency as the linear equation Y = m*X + b, where X is a random variable representing the opponent's strength and Y is the offensive efficiency. We use least squares to estimate the parameters (m,b).
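For readers who want to see the fit spelled out, the following is a rough sketch of an ordinary least-squares fit of Y = m*X + b in plain Python. The helper name fit_line, the 0-to-1 opponent-strength scale, and the data points are all assumptions for illustration; the real RBA inputs are not shown in this post.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x and the cross-deviation with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: opponent strength (x) and points per 100 plays (y).
opponent_strength = [0.2, 0.4, 0.6, 0.8]
offensive_efficiency = [44.0, 38.0, 32.0, 26.0]

m, b = fit_line(opponent_strength, offensive_efficiency)
# Expected efficiency against an opponent of strength 0.5.
predicted = m * 0.5 + b
```

A negative slope m means the offense scores less as the opponent gets stronger. A slope near zero would mean the team's output barely depends on who it plays.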
The following figure illustrates Auburn's offensive and defensive efficiency. The horizontal axis indicates opponent strength. The vertical axis shows efficiency per 100 plays. Each cross represents the measured offensive efficiency in a game. Each circle represents measured defensive efficiency. The blue and red lines show the least-squares approximation of the offensive and defensive efficiencies, respectively. From this plot, we observe the expected behavior: Auburn's offense scores fewer points against stronger opponents, and its defense allows more points against stronger opponents.
Let's contrast this performance with Florida. Florida's performance is relatively constant across opponents of all strengths, implying that they impose their offensive and defensive will against all their opponents.
Impact of Turnovers, Takeaways, and Penalties
Turnovers and penalties are symptoms of sloppy play. Conversely, takeaways indicate exceptional play. In general, we expect the former to reduce a team's efficiency and the latter to improve it. For simplicity, we will restrict our discussion to turnovers, as the same analysis is used for takeaways and penalties.
To determine the impact of an event on a team's score, we compute the difference between the team's score and its mean score. For example, if a team scores 21 points in a game but averages 27, we record an impact of -6 points. I started out simply using the points per turnover as a metric: points/turnover * turnovers/play * plays = points. However, this approach is full of mathematical FAIL. In fact, we observe an increase in efficiency when a team has only one turnover because even great teams tend to commit at least one turnover a game. To accommodate this behavior, we model efficiency as a joint probability distribution of turnovers and points. The following figure illustrates the points per 100 plays a team loses for committing a given number of turnovers.
The solid line indicates the mean number of points. We assume a Gaussian distribution for a given number of turnovers. The error bars indicate one standard deviation from the mean (68% of outcomes lie within this range).
An interesting observation is that turnovers are not a good estimator when few turnovers are committed. This makes sense because if we have few turnovers, the game's outcome must have been determined by other factors. We also observe a general downward trend as the number of turnovers increases because, as expected, turnovers negatively impact a team's performance. A small rise occurs at seven turnovers because there are few samples in this range.
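One way to build a curve like this is to group games by turnover count and compute the mean and standard deviation of the impact in each group. The sketch below does that with Python's statistics module. The function name, the tuple layout, and the sample games are hypothetical; this is not the actual RBA code or data.

```python
from collections import defaultdict
from statistics import mean, pstdev


def turnover_impact(games):
    """Group per-game impacts by turnover count.

    games: list of (turnovers, efficiency, team_mean_efficiency) tuples,
    with both efficiencies in points per 100 plays.
    Returns {turnovers: (mean_impact, std_impact, sample_count)}.
    """
    buckets = defaultdict(list)
    for turnovers, efficiency, team_mean in games:
        # Impact = how far this game landed from the team's average.
        buckets[turnovers].append(efficiency - team_mean)
    return {
        count: (mean(impacts), pstdev(impacts), len(impacts))
        for count, impacts in sorted(buckets.items())
    }


# Made-up games: (turnovers, efficiency, team's mean efficiency).
games = [
    (0, 45.0, 40.0), (0, 41.0, 40.0),
    (1, 42.0, 40.0), (1, 40.0, 40.0),
    (3, 30.0, 40.0), (3, 34.0, 40.0),
]
impact_by_turnovers = turnover_impact(games)
```

Keeping the sample count next to each mean makes it easy to spot buckets, like the seven-turnover case, where the estimate rests on very few games.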
Generating Final Scores
The final score may be computed by summing all the random variables in the RBA:
score = ((Po + Pd)/2 + Ph + Pto + Pta + Pn) * N
where Po = offensive points/play, Pd = opponent's defensive points/play, Ph = home field advantage (or away field penalty), Pto = turnover derating, Pta = takeaway bonus, Pn = penalty derating, and N = number of plays.
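Plugging numbers into the formula looks like this. The function is a direct transcription of the equation above; the input values are invented purely to show the arithmetic and do not describe any real team.

```python
def predicted_score(po, pd, ph, pto, pta, pn, n_plays):
    """score = ((Po + Pd)/2 + Ph + Pto + Pta + Pn) * N.

    All P terms are in points per play; n_plays is the expected play count.
    """
    return ((po + pd) / 2 + ph + pto + pta + pn) * n_plays


# Hypothetical inputs, in points per play.
score = predicted_score(
    po=0.40,    # offensive efficiency
    pd=0.36,    # opponent's defensive efficiency
    ph=0.02,    # home field advantage
    pto=-0.03,  # turnover derating
    pta=0.02,   # takeaway bonus
    pn=-0.01,   # penalty derating
    n_plays=70,
)
```

With these made-up inputs the per-play total is 0.38, so the predicted score over 70 plays is about 26.6 points.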
By the central limit theorem, the sum of many independent random variables tends toward a Gaussian distribution. Furthermore, the weighted sum a*X1+b*X2 of two independent Gaussian variables X1 ~ N(m1, v1) and X2 ~ N(m2, v2) is itself Gaussian: N(a*m1+b*m2, a^2*v1+b^2*v2). From these two facts, we may compute a closed-form solution for the final efficiency. For simplicity, we only use the mean of the number of plays N instead of its probability distribution. (Frankly, the full derivation takes over 14 pages of dense mathematics to explain.)
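The Gaussian combination rule is easy to apply in code. The sketch below assumes the terms are independent and treats the number of plays as a fixed constant, as described above. The function names and the means and variances are hypothetical, and this is far simpler than the full derivation.

```python
from math import sqrt


def combine(terms):
    """Mean and variance of sum(w * X) for independent Gaussians.

    terms: list of (weight, mean, variance) tuples.
    """
    total_mean = sum(w * m for w, m, _ in terms)
    total_var = sum(w * w * v for w, _, v in terms)
    return total_mean, total_var


# Hypothetical per-play terms: (weight, mean, variance).
terms = [
    (0.5, 0.40, 0.04),  # offense, weighted by 1/2
    (0.5, 0.36, 0.02),  # opponent's defense, weighted by 1/2
    (1.0, 0.02, 0.00),  # home field advantage, treated as fixed
]
eff_mean, eff_var = combine(terms)

# Scaling by a constant play count n multiplies the mean by n
# and the variance by n squared.
n_plays = 70
score_mean = n_plays * eff_mean
score_std = sqrt(n_plays ** 2 * eff_var)
```

The payoff of carrying the variance along is that the prediction becomes a distribution rather than a single number, so two teams with the same expected score can still be told apart by how consistent they are.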
Conclusion
That's how the RBA algorithm works. I'm still relatively new to the practice of teaching my computer to pick college football games, so we can expect a few tweaks here and there throughout the remainder of the season and during the offseason. Your suggestions are welcome!