Why Arsenal wins the title in my model

When you’re into football analytics, you’ve got to stick your neck out from time to time, and come up with some predictions. It’s what probably drew most people to analytics in the first place, yet at times it is a disappointing affair.

Random

Football is to a large degree a random sport, and we’re just going to have to accept (and appreciate!) the variability that is undeniably present in the beautiful game. When enjoying this sport as a fan I welcome the surprises and unpredictability; when trying to construct and improve my predictive model I usually despise them.

In this post, we’ll dive straight into the model’s predictions for the final standing of the English Premier League. So, without further ado, here it is.

[Figure: 20140110 Boxplot projected league table English Premier League 2013-14]

The colored boxes are the predicted points, with the black line indicating the mean predicted number of points. Outliers are indicated by the dots, and the line for each team spans the 95% confidence interval. Colors code the league winners, CL qualification, CL qualifiers qualification, EL qualification and relegation.

Yes, it’s out there. My model rates Arsenal. It even put them on top, but please don’t leave it at that and carry on reading…

I intended to go over the teams from top to bottom, but ended up writing so many words on Arsenal that we’ll have to save the other teams for a follow-up post. This post will use Arsenal as a case study to explain the workings of the model on the fly. Once we reach the end of the piece, you’ll probably have a good feel for the rationale behind the model, which is essential if you want to appreciate what I say here.


The model

Even a casual reader will quickly note the color gold on Arsenal’s bar, indicating that they are predicted to have the highest points total after 38 matches. However, the gap with City is less than a tenth of a point, and you can see the overlap between the ranges of predicted points. Knowing that City are now a point behind in the league table, this indicates that the model rates City as the stronger team, but only by an insignificant margin.

The range of predictions stems from repeated simulation, in this case 10,000 runs, of the remaining matches. For each match, the odds of a home win, draw or away win are estimated on the basis of my Expected Goals (ExpG) model. Each team’s ExpG for and against are based on this season’s shot info, which is obtained via Squawka and is driven by Opta data.
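For readers who like to see the mechanics, here is a minimal sketch of such a repeated simulation. It is not my actual code: it assumes that goals in a match follow independent Poisson distributions around each side's ExpG, which is only one common way to turn ExpG into win, draw and loss odds. The names `simulate_season` and `summarize`, and the per-match ExpG inputs, are hypothetical.

```python
import math
import random
import statistics

def poisson_sample(rng, lam):
    """Draw one Poisson-distributed goal count (Knuth's method, fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_season(current_points, fixtures, expg, n_runs=10_000, seed=0):
    """Return {team: [final points total in each simulated run]}.

    current_points: {team: points so far}
    fixtures: list of (home, away) tuples for the remaining matches
    expg: {(home, away): (home_expg, away_expg)}, hypothetical per-match inputs
    """
    rng = random.Random(seed)
    results = {team: [] for team in current_points}
    for _ in range(n_runs):
        points = dict(current_points)
        for home, away in fixtures:
            lam_home, lam_away = expg[(home, away)]
            goals_home = poisson_sample(rng, lam_home)  # assumption: Poisson goals
            goals_away = poisson_sample(rng, lam_away)
            if goals_home > goals_away:
                points[home] += 3
            elif goals_home < goals_away:
                points[away] += 3
            else:
                points[home] += 1
                points[away] += 1
        for team, total in points.items():
            results[team].append(total)
    return results

def summarize(final_points):
    """Mean plus the 2.5th and 97.5th percentiles of one team's simulated totals."""
    cuts = statistics.quantiles(final_points, n=40)  # cut points every 2.5%
    return statistics.mean(final_points), cuts[0], cuts[-1]
```

Calling `summarize` on one team's list of simulated totals gives the kind of mean and 95% range that the figure draws per team.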

The ExpG is obviously influenced by raw shot numbers, as in general shooting more is a good thing, but it also takes into account shot location, shot type and some other elements that (behind the scenes) I’ve shown to drive shot quality.
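To make the idea concrete, here is a toy sketch of how shot quality can feed into ExpG: each shot gets a scoring probability, and a team's ExpG is the sum over its shots. The logistic form, the two inputs (distance and header or not) and every coefficient below are made up for illustration; they are not the fitted values or the full feature set of my model.

```python
import math

# Illustrative coefficients only; NOT fitted values from the actual model.
INTERCEPT = -1.0
PER_METRE = -0.12       # assumed: scoring chance falls with distance from goal
HEADER_PENALTY = -0.8   # assumed: headers are harder to convert than foot shots

def shot_expg(distance_m, is_header):
    """Scoring probability of a single shot under a toy logistic model."""
    z = INTERCEPT + PER_METRE * distance_m + (HEADER_PENALTY if is_header else 0.0)
    return 1.0 / (1.0 + math.exp(-z))

def team_expg(shots):
    """Sum per-shot probabilities; shots is a list of (distance_m, is_header)."""
    return sum(shot_expg(distance, header) for distance, header in shots)
```

This shows why raw shot numbers and ExpG can disagree: many long-range efforts add little, while a few close-range chances add a lot.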


The projections

Both Arsenal and City are estimated around 78 points, with hardly anything to separate the teams. I do realize that this contrasts a bit with other models around, so it’s probably worth some words. The bookies, as well as other respected predictive models, give City a better shot, which may well be true. ExpG based models you’ll want to check here are by @ColinTrainor and @Cchappas and by @MCofA.

It’ll stay an unsettled argument which of the predictions is the better one: Arsenal winning the league would not prove my model right, and City winning it would not prove the others right. The randomness of this sport simply dictates that both teams have a shot, but we’ll never find out the all-knowing underlying truths.

There are reasons why my model likes Arsenal so much. But it’s not that Arsenal have the most shots, or even the best ExpG for (4th) or against (6th). In that respect, City dominate them, pairing the best offense with the second-best defense.

I’ll pause for a second and let you wonder. How can an ExpG-based model rate City significantly higher, both offensively and defensively, and still predict City to take just a single point more from the remaining matches?

 

Dynamics

To put it simply, Arsenal have this season been better where it matters most: at even scores. Performance at even scores determines who takes the lead, and who will be a goal behind. Subsequently, being a goal up or down influences your performance characteristics, and there’s your flywheel effect.

Over all game states, Arsenal may not have been the best defense out there, but at even scores they’ve been tighter than City and Chelsea. For the sake of accessibility of this piece, I won’t throw all the detailed even game state numbers at you, but Arsenal’s defense at even game state conceded just 0.75 ExpG per 90 minutes, which easily beats their rivals. The model recognizes that Arsenal conceding just six goals at even score this season, two of them in the 6-3 loss at City, could well be the result of the underlying performance at that Game State.
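As a rough illustration of what a figure like "ExpG conceded per 90 minutes at even game state" means, the sketch below filters shots by game state, sums their ExpG and scales by the minutes played in that state. The data layout and the function name `expg_per_90` are assumptions for the example, not the structure of my own data.

```python
def expg_per_90(shots, minutes_by_state, state="even"):
    """ExpG per 90 minutes spent in one game state.

    shots: list of (game_state, expg) tuples, e.g. shots conceded by a team
    minutes_by_state: {game_state: minutes played in that state}
    """
    minutes = minutes_by_state[state]
    if minutes <= 0:
        raise ValueError("no minutes played in this game state")
    total_expg = sum(expg for shot_state, expg in shots if shot_state == state)
    return total_expg / minutes * 90
```

Scaling by minutes matters because teams spend very different amounts of time level, ahead or behind, so raw totals per game state are not comparable between teams.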

In this sense, Arsenal’s numbers resemble those of Ajax in a post I wrote for Volkskrant blog ‘De Zestien’, when PSV posted better overall shot numbers, but Ajax was still the preferred team for the 2012/13 title, which they eventually went on to claim.

 

Self doubt

Now, here’s a paragraph I will always cherish. I love my self doubt, and I think any sensible predictor can hardly have enough of it. So, why might the above section be untrue, and why might Arsenal still need to be rated lower than their rivals?

For one, the effect at even Game State may not be a repeatable thing. Arsenal saw Aaron Ramsey convert at a rate he’s never managed before, and he will never do so again. Still, three of his eight goals gave Arsenal a lead in games they went on to win.

Then there are striker issues. Giroud shows more and more evidence that his disappointing conversion rates are his natural skill level, rather than the result of us not having enough shots to study (ref: Colin!). And Arsenal have serious injury issues up front, with Walcott missing the remainder of the season. In contrast to Giroud, Walcott has consistently shown excellent finishing ability, and his absence may well reflect in a dip in conversion for Arsenal. My team based model does not (yet) recognize individual player absences.

Yes, it’s mathematically possible to enter historical conversion rates into the model. So far, I’ve refrained from doing that, since it’s dangerous to assume that these historical rates hold any predictive value. In the case of Giroud and Walcott, yes, we now have a reasonable sample of shots to make a statement on their conversion rates, but for most players, we just don’t know. Shot sample sizes are too low, shots are too heterogeneous in nature, and opponents differ too. For all you know, aiming to catch all the signal around may allow a lot of noise to enter the model and worsen the predictions.
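To show how little a small shot sample says about a player's true conversion rate, here is a minimal sketch using the Wilson score interval for a proportion. This is a standard statistical tool that I'm bringing in for illustration, not part of my model, and it treats every shot as equally difficult, which real shots are not.

```python
import math

def wilson_interval(goals, shots, z=1.96):
    """Approximate 95% Wilson score interval for a conversion rate (goals / shots)."""
    if shots <= 0:
        raise ValueError("need at least one shot")
    p = goals / shots
    denom = 1 + z * z / shots
    centre = (p + z * z / (2 * shots)) / denom
    half_width = z * math.sqrt(p * (1 - p) / shots + z * z / (4 * shots * shots)) / denom
    return centre - half_width, centre + half_width
```

Comparing `wilson_interval(3, 20)` with `wilson_interval(30, 200)` gives the same observed rate of 15%, but a much wider range for the smaller sample, which is the reason for being careful with player-level conversion rates.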


In the end

So, here’s a piece that is about Arsenal, but it goes into detail about the model too. I felt this was needed, since I intend to use the model more and more, in order to benchmark teams, make predictions and keep understanding what happens.

And for the other teams, I guess we’ll walk by them one by one after the weekend. As the model has new information by then, the predictions may be a bit different…

2 thoughts on “Why Arsenal wins the title in my model”

  1. Tom

    Another great post! So the whiskers are 95% CI and the boxes are 1 SD? Interesting how the 95 % CIs are noticeably enormous. (This is not meant as a criticism of your model, btw—it’s actually really nice to see someone reporting error margins!)

    Is the same data available for previous seasons? You could take the data from the halfway point of last season and see how your model’s prediction compares to the final standings…

  2. 11tegen11 Post author

    Thanks, Tom!
    Indeed, the spread is huge, and even just that small bit of information is worth a piece in itself, isn’t it?
    It nicely illustrates the uncertainty in predicting outcome in football.

    And I will (at some point) show more validation stuff, like holding the performances historically against eventual outcomes. But I hope to do that on a multiple leagues sample, and I’ll first put out some more outcomes generated by the model…
