Expected goals are the poster boy of football analytics. Anyone who is not living under a rock and is even remotely interested in football and stats will have been confronted with expected goals in one form or another. On 11tegen11, we’ve made the case for expected goals, usually shortened to xG or ExpG, being the single best predictor of future match outcomes, better than points, goals, shots or shots on target. Even so, in predictive modelling the best results are obtained by combining expected goals with these other metrics.
Many of these predictions I share via Twitter, but full explanations and Twitter don’t combine that well. On the other hand, making blog posts out of all of these predictions is a hassle and it doesn’t seem necessary, since the images are quite easy to read. Still, in order to evaluate the quality of different predictions around, and to allow fair comparisons with other predictors, a detailed overview of the ‘behind the scenes’ is needed. Here we go!
Step 1: Expected Goals
At the heart of the predictions lies the concept of Expected Goals. In that sense, the predictive model is in fact a chain of several models. First, Expected Goals is computed to turn each goal scoring attempt into a number between 0 and 1, representing the probability of that attempt producing a goal. All details about the how and what of my expected goals model have been described before.
Step 2: Composite Team Rating
The next step is to generate an estimate of team strength. Going from a team’s xG for and against to team strength can be done in a variety of ways. Earlier, I’ve used xG ratio, then moved to net xG, but [as explained before] my money is now on a thing called a Composite Team Rating.
xG ratio = xG for / (xG for + xG against)
net xG = xG for – xG against
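A minimal Python sketch of the two formulas above. The function names and the season totals are made up for illustration only.

```python
def xg_ratio(xg_for: float, xg_against: float) -> float:
    """Share of all xG in a team's matches that the team itself created."""
    total = xg_for + xg_against
    if total == 0:
        raise ValueError("xG ratio is undefined when both totals are zero")
    return xg_for / total


def net_xg(xg_for: float, xg_against: float) -> float:
    """xG difference: positive means the team out-created its opponents."""
    return xg_for - xg_against


# Hypothetical season totals, not real data.
print(round(xg_ratio(30.0, 20.0), 2))  # 0.6
print(net_xg(30.0, 20.0))              # 10.0
```

An xG ratio of 0.5 and a net xG of 0 both describe a team that creates exactly as much as it concedes.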
The Composite Team Rating, or CTR, is a combined number that integrates xG, points, goals, shots, shots on target and last season’s points and goals. For the sake of readability of this post, I’ve moved the detailed explanation of computing CTR to a footnote (A) below.
CTR is presented as two numbers: the expected number of goals scored and conceded against a hypothetical league average team. The average CTR in a league is therefore always equal to zero, and CTRs can’t be directly compared between leagues.
Step 3: Predicting goals scored
In this step we need to translate CTR into match odds. Suppose Team A plays Team B. We then have four numbers: Team A’s CTR for and against, and Team B’s CTR for and against.
Using a big database of historical matches, we compute each team’s CTR for and against going into each match, and check the number of goals scored by the home team and by the away team in that particular match separately. So, when Team A plays Team B on match day 15, we use Team A and B’s CTR for and against based on matches 1 to 14, and we look at the goals scored by Team A and B on match day 15.
I’ve disregarded the first 9 matches of each season, since CTR values (like all in-season metrics) are more stable after a certain number of games. This leaves around 12,000 matches to work with, so we get a decent model to estimate goals scored by the home and away team based on both teams’ CTR values. Again, see the footnote below (B) for more technical details regarding this model.
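The bookkeeping described above, ratings based only on earlier matches and early match days skipped, could be sketched in Python as follows. Everything here is an assumption made for illustration: the data layout, the function names, and especially `mean_goals_rating`, which is a simple placeholder and not the CTR itself.

```python
def mean_goals_rating(history, team):
    """Placeholder rating (NOT the CTR): goals scored and conceded per match so far."""
    scored = conceded = played = 0
    for home, away, home_goals, away_goals in history:
        if team == home:
            scored, conceded, played = scored + home_goals, conceded + away_goals, played + 1
        elif team == away:
            scored, conceded, played = scored + away_goals, conceded + home_goals, played + 1
    if played == 0:
        return 0.0, 0.0
    return scored / played, conceded / played


def build_training_rows(match_days, rating_fn, skip_first=9):
    """Pair pre-match ratings with the goals scored in each match.

    match_days: one season as a list of match days in playing order; each
    match day is a list of (home, away, home_goals, away_goals) tuples.
    """
    rows = []
    history = []  # only matches from earlier match days
    for day_index, matches in enumerate(match_days):
        if day_index >= skip_first:
            for home, away, home_goals, away_goals in matches:
                home_for, home_against = rating_fn(history, home)
                away_for, away_against = rating_fn(history, away)
                predictors = (home_for, home_against, away_for, away_against)
                rows.append((predictors, home_goals, away_goals))
        history.extend(matches)  # results become visible only after the match day
    return rows


# Tiny made-up season with two teams; skip the first two match days.
toy_season = [[("A", "B", 2, 0)], [("B", "A", 1, 1)], [("A", "B", 3, 1)]]
print(build_training_rows(toy_season, mean_goals_rating, skip_first=2))
```

Appending a match day to `history` only after its rows are built keeps a match's own result out of the ratings used to predict it.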
Step 4: Predicting match outcome
Now that we have an estimated number of goals scored by Team A and Team B, we can use these numbers to generate the odds of either team winning the match, or it finishing in a draw. Obviously, this process is done for each match still to be played, and for matches already played the actual outcome gets odds 1, and the other outcomes get odds 0. This answers one of the most frequently asked questions: future schedule is incorporated in league predictions.
But how do we move from ‘Team A will score around 1.8 goals and Team B will score around 1.2 goals’ to estimated odds for Team A winning, Team B winning, or a draw?
Goals in football matches closely follow what’s called a Poisson distribution. So, if we assume Team A will score an average of 1.8 goals, this distribution tells us the odds of Team A scoring exactly 0 goals, or 1 goal, or 2 goals, etc.
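As a small illustration in Python, the Poisson probability of scoring exactly k goals with an average of 1.8 can be computed with the standard formula exp(-mean) * mean^k / k!. The function name is my own choice for this sketch.

```python
import math


def poisson_pmf(k: int, mean_goals: float) -> float:
    """Probability of exactly k goals when the expected number is mean_goals."""
    return math.exp(-mean_goals) * mean_goals ** k / math.factorial(k)


# Probabilities of 0 to 5 goals for a team expected to score 1.8.
for goals in range(6):
    print(goals, round(poisson_pmf(goals, 1.8), 3))
```

The probabilities over all possible goal counts add up to 1, and the most likely single outcome for an average of 1.8 is exactly one goal.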
From these values, we can derive the odds of Team A scoring more goals than Team B, or vice versa, or both teams scoring an equal number of goals. Bingo! In fact, I compute the odds of Team A outscoring Team B by one goal, two goals, three goals, etc. separately, in order to use goal difference as a factor too.
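One way to sketch this in Python is to loop over all scorelines up to some maximum and add up the probabilities per goal difference. Two assumptions to flag: the two teams’ goal counts are treated as independent, and scorelines are cut off at `max_goals` (the outcome odds are renormalised afterwards). The function names are illustrative.

```python
import math


def poisson_pmf(k, mean_goals):
    return math.exp(-mean_goals) * mean_goals ** k / math.factorial(k)


def goal_difference_probs(mean_a, mean_b, max_goals=10):
    """Map goal difference (A minus B) to its probability.

    Assumes independent Poisson scores, truncated at max_goals per team.
    """
    probs = {}
    for goals_a in range(max_goals + 1):
        for goals_b in range(max_goals + 1):
            p = poisson_pmf(goals_a, mean_a) * poisson_pmf(goals_b, mean_b)
            diff = goals_a - goals_b
            probs[diff] = probs.get(diff, 0.0) + p
    return probs


def outcome_probs(mean_a, mean_b, max_goals=10):
    """Return (P(A wins), P(draw), P(B wins)), renormalised to sum to 1."""
    diffs = goal_difference_probs(mean_a, mean_b, max_goals)
    win_a = sum(p for d, p in diffs.items() if d > 0)
    draw = diffs.get(0, 0.0)
    win_b = sum(p for d, p in diffs.items() if d < 0)
    total = win_a + draw + win_b  # marginally below 1 because of the cut-off
    return win_a / total, draw / total, win_b / total


win_a, draw, win_b = outcome_probs(1.8, 1.2)
print(round(win_a, 3), round(draw, 3), round(win_b, 3))
```

The dictionary returned by `goal_difference_probs` holds the separate odds of winning by one goal, two goals and so on, which is what is needed to carry goal difference through to the league table.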
One caveat here: I stated that goals in football matches closely follow Poisson distributions, but they don’t follow that distribution exactly. Poisson distributions tend to underestimate the odds of draws, probably because of game state effects, i.e. two teams playing more cautiously when the score in a match is level. Pragmatically, I correct this by raising the odds for draws at the expense of wins and losses by the amount needed to bring predicted outcomes in line with historical draw percentages.
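A correction of this kind could look like the Python sketch below. The exact recipe is not spelled out above, so this is one possible reading: a single multiplication factor for draws, calibrated so that the average predicted draw odds match a historical draw rate, with wins and losses scaled down proportionally. The function names and all numbers are hypothetical.

```python
def draw_factor(predicted_draws, historical_draw_rate):
    """Factor that lifts the mean predicted draw odds to the historical rate."""
    mean_predicted = sum(predicted_draws) / len(predicted_draws)
    return historical_draw_rate / mean_predicted


def inflate_draw(win_a, draw, win_b, factor):
    """Raise the draw odds by `factor`; shrink win and loss odds proportionally."""
    new_draw = min(draw * factor, 1.0)
    non_draw = win_a + win_b
    if non_draw == 0:
        return 0.0, 1.0, 0.0
    scale = (1.0 - new_draw) / non_draw
    return win_a * scale, new_draw, win_b * scale


# Hypothetical numbers: Poisson-based draw odds for three matches and an
# assumed historical draw rate of 26.4%.
factor = draw_factor([0.24, 0.26, 0.22], 0.264)
print(inflate_draw(0.50, 0.24, 0.26, factor))
```

Scaling wins and losses by the same amount keeps their ratio intact, so the favourite stays the favourite and the three odds still add up to 1.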
Step 5: Computing league outcome
Now that we’ve got estimated odds for wins, draws and losses, the hardest part is done. Note that I’ve called this step ‘computing’, rather than ‘predicting’. Using a Monte Carlo simulation with random numbers, we simulate the remainder of the season many times. I use 1,000,000 runs, but the exact number isn’t so important, as long as repeating the whole simulation doesn’t produce wildly different outcomes.
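A stripped-down Python sketch of such a simulation is shown below. It assumes each remaining fixture comes with win, draw and loss odds, starts from the points already earned, and draws one random number per match. To stay short it tracks points only and ignores goal difference; the team names, odds and data layout are made up.

```python
import random


def simulate_season(current_points, fixtures, n_sims=10_000, seed=0):
    """Simulate the remaining fixtures n_sims times.

    current_points: {team: points from matches already played}
    fixtures: list of (home, away, p_home_win, p_draw, p_away_win)
    Returns a list with one final {team: points} dict per simulation.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        points = dict(current_points)
        for home, away, p_home, p_draw, _p_away in fixtures:
            r = rng.random()  # uniform in [0, 1)
            if r < p_home:
                points[home] += 3
            elif r < p_home + p_draw:
                points[home] += 1
                points[away] += 1
            else:
                points[away] += 3
        results.append(points)
    return results


# Hypothetical two-team example with two fixtures left to play.
standings = {"Team A": 50, "Team B": 49}
remaining = [("Team A", "Team B", 0.45, 0.27, 0.28),
             ("Team B", "Team A", 0.40, 0.28, 0.32)]
sims = simulate_season(standings, remaining)
share_a_ahead = sum(s["Team A"] > s["Team B"] for s in sims) / len(sims)
print(round(share_a_ahead, 3))
```

Running the function again with a different `seed` is a quick check on whether the number of simulations is large enough: the shares should barely move.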
The outcomes of this simulation can be presented in various ways. We get 1,000,000 estimates of the points total for each team, which we can show as medians, distributions and outliers.
We can translate the estimated points into league positions for each team, and show the distribution of these.
We can also do the reverse and look at how often each team finishes in a certain position, most notably as league winners. Or we could use the combined odds of finishing in a number of positions, like top-4 or bottom-3.
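Turning simulated points totals into position odds is mostly counting. The Python sketch below uses four hand-made simulations of a three-team league; the function names are illustrative, and ties on points are broken arbitrarily here rather than by goal difference.

```python
from collections import Counter


def position_odds(simulated_points):
    """Return {team: {position: share of simulations finishing there}}."""
    n_sims = len(simulated_points)
    tallies = {}
    for points in simulated_points:
        ranking = sorted(points, key=lambda team: points[team], reverse=True)
        for position, team in enumerate(ranking, start=1):
            tallies.setdefault(team, Counter())[position] += 1
    return {team: {pos: count / n_sims for pos, count in tally.items()}
            for team, tally in tallies.items()}


def odds_of_finishing_in(odds, team, positions):
    """Combined odds of ending in any of the given positions, e.g. top-4."""
    return sum(odds[team].get(position, 0.0) for position in positions)


# Four made-up simulated final tables for a three-team league.
toy_sims = [{"A": 10, "B": 8, "C": 5},
            {"A": 7, "B": 9, "C": 5},
            {"A": 10, "B": 8, "C": 5},
            {"A": 6, "B": 9, "C": 7}]
odds = position_odds(toy_sims)
print(odds["A"][1])                             # 0.5
print(odds_of_finishing_in(odds, "A", [1, 2]))  # 0.75
```

Title odds are simply the share for position 1, and relegation odds are the combined share for the bottom positions.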
In the end
The process is the result of several years of development, thinking, fine-tuning, and learning how to script what’s in your head in a way that, at the very least, works. It’s been fun to create this, it’s even more fun to share the outcomes with you, and it would be fun if you’ve got ideas to improve it further!
If you’ve still got any questions, feel free to shoot, as I could always update this reference post, which would be preferable to explaining the same stuff [in far too few words] on Twitter over and over again.
Feel free to check out the footnotes, at your own risk of being overloaded with statsy stuff.
(A) CTR consists of two separate linear models [lm in R]: one for goals scored and one for goals conceded.
Its input is a series of variables computed over the present season. This list is the same for the CTR for model and the CTR against model.
Goals for & against, shots for & against, xG for & against, passes for & against, completed passes for & against, final third passes for & against, passes in the deep zone for & against, completed passes in the deep zone for & against, passes in the very deep zone for & against, completed passes in the very deep zone for & against, distance of passes into the deep zone for & against, distance of passes into the very deep zone for & against.
Also added are a few variables that mix the present season and the previous season. The mix depends on the number of matches played in the present season. For example, after 80% of the present season, these variables consist of 20% last season’s value and 80% the present season’s value.
Regressed goals for & against, Regressed points per game
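The mixing rule can be sketched in Python as a weighted average. The 80/20 example above fits a weight that grows linearly with the share of the season played; that this holds at every other point of the season is an assumption of this sketch, and the function name and numbers are made up.

```python
def regressed_value(previous_season, present_season, matches_played, season_length):
    """Blend last season's value with this season's, weighted by season progress."""
    weight_present = min(matches_played / season_length, 1.0)
    return (1.0 - weight_present) * previous_season + weight_present * present_season


# Hypothetical: 1.5 goals per game last season, 2.0 so far this season,
# after 24 of 30 matches (80% of the season).
print(round(regressed_value(1.5, 2.0, 24, 30), 2))  # 1.9
```

Before the first match the value is last season’s number, and by the final match day it is based on the present season only.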
All of these variables, 27 in total, enter a linear model with outcome variable ‘goals scored in the next match’. A backward elimination is applied to preserve only those variables that independently correlate with the outcome.
Presently, for goals scored these are: shots for, xG for & against, completed passes in the final third, passes in the deep zone for & against, distance of passes into the very deep zone for & against, regressed goals for & against. Sometimes the direction of the correlation seems weird (e.g. xG against has a positive relation with the outcome goals scored), but this has to do with the input variables showing a lot of correlation among themselves (multicollinearity).
For goals conceded, these are: goals against, xG for & against, passes for, completed passes in the final third against, passes in the deep zone against, completed passes in the deep zone for and against, passes in the very deep zone against, regressed goals against and regressed points per game.
Offensively, these are the most important predictors according to their ‘lmg’ value in the relaimpo package of R. I’m happy to see these are all the factors that make intuitive sense to people watching football matches.
- Regressed goals for
- xG For
- completed passes in the final third
- completed passes in the deep zone
- shots for
- distance of passes in the very deep zone
Defensively, we get this top-5.
- Regressed goals against
- Regressed points per game
- xG against
- goals against
- xG for
(B) This is a quite straightforward model (lm in R), also consisting of two separate models.
For home goals scored and for away goals scored, we use the same four variables as predictors, with either home goals or away goals as the outcome. The predictors are home team CTR for & against and away team CTR for & against.
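For readers without R at hand, the structure of these two models can be sketched in plain Python. This is a stand-in for lm, not the actual script: a small ordinary least squares fit via the normal equations, run twice on the same four predictors. The training data is synthetic and noise-free, built from made-up coefficients purely so the fit can be checked.

```python
import random


def fit_linear_model(rows, targets):
    """Ordinary least squares; returns [intercept, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in rows]
    n = len(design[0])
    # Augmented normal equations [X'X | X'y].
    aug = [[sum(x[i] * x[j] for x in design) for j in range(n)]
           + [sum(x[i] * y for x, y in zip(design, targets))]
           for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict_goals(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


rng = random.Random(42)
# Synthetic predictors: (home CTR for, home CTR against, away CTR for, away CTR against).
rows = [tuple(rng.uniform(-1.0, 1.0) for _ in range(4)) for _ in range(50)]
# Made-up coefficients, used only to generate targets for this demonstration.
home_goals = [1.5 + 0.8 * hf - 0.2 * ha - 0.1 * af + 0.6 * aa for hf, ha, af, aa in rows]
away_goals = [1.1 - 0.1 * hf + 0.5 * ha + 0.7 * af - 0.2 * aa for hf, ha, af, aa in rows]

home_model = fit_linear_model(rows, home_goals)  # outcome: home goals
away_model = fit_linear_model(rows, away_goals)  # outcome: away goals
print([round(c, 3) for c in home_model])
print(round(predict_goals(home_model, (0.3, -0.1, 0.0, 0.2)), 3))
```

The two fitted models each get their own intercept, so an average home team and an average away team can be predicted to score different numbers of goals.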
For home goals scored we get this top-4, which makes intuitive sense.
- Home team CTR for
- Away team CTR against
- Home team CTR against
- Away team CTR for
And this is the top-4 for away goals scored, which is the reverse of the other top-4.
- Away team CTR for
- Home team CTR against
- Away team CTR against
- Home team CTR for
Since home and away goals are estimated in separate models, this also takes care of home advantage. I use a model based on all leagues, so this [wrongly] assumes equal home advantages in each league. However, splitting this model per league reduces the sample to work with. Choices. Something to think about for the future.