Thursday, August 30, 2012

Behind the Scenes

As our regular readers may have noticed, during the regular season we churn out a decent amount of content each week: a weekly top 25, full rankings, conference projections, plus our new in-game win probabilities, and none of that includes the Twitter feed. Given that we have day jobs and other things holding our attention, how exactly do things work behind the scenes?

This blog post is here to give you a peek behind the curtain. A look under the hood, as it were.

Ready?

Pull back the curtain after the jump ....



Ta-da!

Actually that's not the whole picture. Here's the whole picture:


All in all, the blog includes about 125 different scripts and applications, written in C++, Python, Perl, Bash, and Ruby. The core component of the Tempo-Free Gridiron predictor is approximately 1,000 lines of C++ and has remained effectively untouched since 2010. Eddie's Regression-Based Analysis system is approximately 3,200 lines of Python and has been tweaked somewhat over the last few years in an attempt to catch me, but generally has remained static.

Both the TFG and RBA predictors read a standardized CSV format and output to similarly standard CSV files. This enables us to write scripts that fetch from our data sources and dump the output into a common format shared by our systems; once predictions are made, other scripts parse the output of our programs and perform useful, time-saving operations. Given that this blog is created and maintained in our Copious Free Time, once we're done with work and our families are sick of spending time with us, we wouldn't be able to generate this much content -- nearly two posts per day during the season -- week in and week out without a large amount of automation.
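To make that concrete, here's a rough Python sketch of the idea. The actual column names and layout are internal to our systems; the fields below are stand-ins.

import csv

def read_games(path):
    """Load the common CSV of games into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def write_predictions(path, predictions):
    """Dump predictions back out in the same standardized CSV shape."""
    fields = ["date", "home_id", "away_id", "home_score", "away_score",
              "predicted_winner", "win_probability"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(predictions)

Because everything upstream and downstream speaks this one format, any predictor that reads and writes it can slot into the pipeline unchanged.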

It all starts Sunday morning when new data gets posted to NCAA.org with statistics from the previous week's games. We screen-scrape that data and turn the 100+ HTML and CSV pages into a single input CSV that reflects all games going back to the 2000-01 season. That file is then fed to our respective prediction algorithms, which spit out our predictions for the remainder of the season. My system finishes in about 15 seconds, while Eddie's clocks in at just under seven minutes (apparently repeated linear regression and full round-robin simulation are expensive).
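The scraping step is about as unglamorous as it sounds. Here's a rough Python sketch of the idea; the URL and the page markup below are placeholders, not the actual NCAA pages.

import csv
import requests
from bs4 import BeautifulSoup

STATS_URL = "http://web1.ncaa.org/stats/scoreboard"  # placeholder URL

def scrape_week(week, out_csv):
    """Pull one week's results page and append the rows to the master CSV."""
    html = requests.get(STATS_URL, params={"week": week}).text
    soup = BeautifulSoup(html, "html.parser")
    with open(out_csv, "a", newline="") as f:
        writer = csv.writer(f)
        for row in soup.select("table.scores tr"):   # assumed markup
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) >= 4:                      # away, score, home, score
                writer.writerow(cells[:4])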

The output files then get churned through the heart of our blog: dozens of shell and Perl scripts that generate all the content you see on our blog, plus supplementary output that lets us be part of the Massey College Football Ranking Comparison page, the Prediction Tracker, and CFBPredictions, and that runs our multiple pick'em contests. At the heart of this pile o' scripts is a 1,200+ line Perl module that parses our library of team IDs, conference mappings, and various team names (abbreviated, school name, and full school-plus-mascot), not to mention the results of existing games and our predictions for the future. These 100+ scripts run in under a minute and produce a collection of HTML, CSV, and text files.
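That module is Perl, but the gist translates easily. Here's a small Python sketch of the kind of lookups it provides; the file name and column names are hypothetical.

import csv

class TeamLibrary:
    """Maps every known form of a team's name back to one canonical ID."""

    def __init__(self, path="teams.csv"):      # hypothetical data file
        self.by_id = {}
        self.id_by_alias = {}
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                tid = row["team_id"]
                self.by_id[tid] = row
                # Every name variant points back at the canonical ID.
                for key in ("abbreviation", "school", "full_name"):
                    self.id_by_alias[row[key].lower()] = tid

    def conference(self, team_id):
        return self.by_id[team_id]["conference"]

    def resolve(self, name):
        """Turn any name variant ('OSU', 'Ohio State', ...) into an ID."""
        return self.id_by_alias[name.lower()]

Centralizing the name mess in one place is what lets a hundred little scripts agree on which team is which.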

Once we have our HTML files for the week, our scripts automatically upload them to our blog and tag them as drafts. What follows is one of the few manual steps in the process, and one that I would automate if I could: I have to go through each of the uploaded blog posts and set when they should go live. Depending on the schedule of games and how far along we are in the season, this could be as few as seven posts (top 25 and full rankings for each of TFG and RBA, a weekly recap, a feature on Saturday's matchups, and Saturday's predictions) or as many as eighteen (all of the above plus conference projections, the status of undefeated teams, and a few other days' worth of predictions). This step takes by far the most time.
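For the curious, the draft upload looks something like this sketch, which assumes the Blogger GData Atom API; the blog ID and auth token are placeholders.

import requests

BLOG_ID = "1234567890"  # placeholder
POST_URL = "https://www.blogger.com/feeds/" + BLOG_ID + "/posts/default"

ENTRY = """<entry xmlns='http://www.w3.org/2005/Atom'
       xmlns:app='http://www.w3.org/2007/app'>
  <title type='text'>{title}</title>
  <content type='html'>{body}</content>
  <app:control><app:draft>yes</app:draft></app:control>
</entry>"""

def upload_draft(title, html_body, auth_token):
    """POST one generated post to the blog, flagged as a draft."""
    entry = ENTRY.format(title=title, body=html_body)
    resp = requests.post(
        POST_URL,
        data=entry.encode("utf-8"),
        headers={"Content-Type": "application/atom+xml",
                 "Authorization": "GoogleLogin auth=" + auth_token},
    )
    resp.raise_for_status()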

The next manual step is updating our pick'em pools on FunOfficePools.com. This year we're running three pools -- one open to the public, and one each for a pair of websites we frequent -- and this also takes a while. FunOfficePools gives us the most flexibility in setting up and running our pools, but the UI can be a little repetitive to use at times. I have to select the 12 or 15 games the computer identified as "interesting", and then make my own picks. Together these two manual steps can easily take 45-90 minutes.

At this point, note what hasn't happened yet: we haven't written any content. The discussion of the top 25, any trivia posts, and the projected status of the remaining undefeated teams all come after these other automated and manual steps. It shouldn't be surprising that a pair of computer geeks puts off the writing until the very end, but there you go.

This large batch of processing and writing happens on Sunday, but we have two processes that are constantly running in the background. The first is our Twitter auto-updater; this program runs a few times each hour and checks the blog to see if any new posts have gone live. If one has, it crafts a tweet based on the title of the post and sends it out through the Twitter API.
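The loop is simple enough to fit in a few lines. This Python sketch uses feedparser and tweepy as stand-ins for whatever plumbing you prefer; the feed URL and state file are placeholders.

import feedparser
import tweepy

FEED_URL = "http://example.blogspot.com/feeds/posts/default"  # placeholder
SEEN_FILE = "tweeted_posts.txt"                               # placeholder

def check_and_tweet(api):
    """Tweet any blog post we haven't announced yet."""
    try:
        seen = set(open(SEEN_FILE).read().split())
    except IOError:
        seen = set()
    for entry in feedparser.parse(FEED_URL).entries:
        if entry.id not in seen:
            api.update_status("New post: %s %s" % (entry.title, entry.link))
            seen.add(entry.id)
    with open(SEEN_FILE, "w") as f:
        f.write("\n".join(seen))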

The second background process is a bit more complex and involves our in-game win probabilities. If a game is happening at the moment, we grab the current status of that game and run it through our in-game prediction model. This model uses both pre-game predictions as well as a library of game status probabilities (e.g., a team that is up by N points with T minutes left is X% likely to win) to predict which team will win. For each game we will tweet in-game probabilities at the end of the first, second, and third quarters. Also, if a sizable underdog has a shot at winning, we will tweet an #UpsetWatch to let people know they should start following the game; if it looks like they're about to wrap up the upset, we'll tweet an #UpsetWarning.
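The gist, in Python -- the lookup table values and the blending weight below are made-up illustrations, not our actual numbers:

HISTORICAL = {
    # (lead, ~minutes left) -> historical win rate for the leading team.
    # These values are invented for illustration.
    (3, 5): 0.74,
    (7, 10): 0.82,
    (14, 10): 0.95,
}

def in_game_probability(pregame_home_prob, margin, minutes_left):
    """Blend the pre-game prediction with the historical lookup.

    margin > 0 means the home team leads; the weight shifts from the
    pre-game number toward the scoreboard as the clock runs down.
    """
    key = (abs(margin), 5 * round(minutes_left / 5.0))
    table_prob = HISTORICAL.get(key, 0.5)
    if margin < 0:                       # away team leads; flip it
        table_prob = 1.0 - table_prob
    weight = 1.0 - minutes_left / 60.0   # trust the table more late
    return (1.0 - weight) * pregame_home_prob + weight * table_prob

From there, the #UpsetWatch and #UpsetWarning tweets are essentially thresholds on that output.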

There you have it. Nearly two years and 15,000 lines of code later, we've created the self-writing blog. Now if only we could automate the content creation itself. Until then we'll keep plugging away as long as you keep reading.


Follow us on Twitter @TFGridiron