If football analytics were a Hollywood movie, Expected Goals would definitely be the poster boy. The influx of attention for football analytics during the recent World Cup meant a lot of attention for the concept of Expected Goals, or ExpG as it’s mostly referred to. With that attention came two very important questions that I’ll try to address in this post. What is ExpG? And how do you compute it?
What is ExpG?
Expected Goals assigns each goal scoring attempt a number between 0 and 1, representing the chance that the attempt results in an actual goal.
I use a model that I have revised completely over the summer, so this makes for a perfect time to explain the full workings of it. Expected Goals 2.0, here we go…
Suppose I tell you that a football match has just finished and I ask you to estimate the number of goals for each team. You know nothing. Not the teams, not the occasion, not the shot numbers, and nothing that happened on the pitch.
You’d probably say both teams have scored around 1.4 goals, since 2.8 is a good estimate for the average number of goals per football match. Since you have absolutely no information about the match at hand, estimating this average of 1.4 goals per team should lead to the smallest difference between your estimate and the actual goals by each team.
In building a model, the difference between your estimate and the actual outcome is called the error, and you should be aiming to keep the error as small as possible.
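To make the idea of "smallest error" concrete, here is a minimal sketch. It assumes the error is measured as the mean squared difference between estimate and outcome, which is my choice for illustration, and the goal tallies are invented so that they average 1.4. They are not real match data.

```python
# Hypothetical goal tallies for ten team performances (invented for illustration).
goals = [0, 1, 1, 2, 3, 1, 2, 0, 2, 2]


def mean_squared_error(estimate, outcomes):
    """Average squared difference between one fixed estimate and the actual outcomes."""
    return sum((estimate - g) ** 2 for g in outcomes) / len(outcomes)


average = sum(goals) / len(goals)  # 1.4 for this made-up sample

# Try a few fixed guesses, including the sample average, and score each one.
candidates = [0.5, 1.0, average, 2.0, 2.5]
errors = {c: mean_squared_error(c, goals) for c in candidates}

# The guess with the smallest error among the candidates.
best_guess = min(errors, key=errors.get)
```

Among these candidates, the average comes out with the smallest squared error, which is why it is the sensible guess when you know nothing else.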
(don’t look down at the .gif yet)
Now, I tell you that the match at hand had 10 shots by team A and 14 shots by team B. Would this change your estimate of 1.4 goals for each team?
Since we know that on average 1 in 9 shots, or 11%, results in a goal, it would make most sense to estimate 1.1 (10 * 0.11) goals for team A and 1.54 (14 * 0.11) goals for team B.
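Written out as code, this shots-only estimate is a one-liner. The function and variable names below are just illustrative; the 11% conversion rate is the figure quoted above.

```python
CONVERSION_RATE = 0.11  # roughly 1 goal per 9 shots, as quoted in the text


def expected_goals_from_shots(shots, conversion_rate=CONVERSION_RATE):
    """Shots-only estimate: every shot is treated as equally likely to go in."""
    return shots * conversion_rate


team_a = expected_goals_from_shots(10)  # about 1.1
team_b = expected_goals_from_shots(14)  # about 1.54
```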
This is your most basic expected goals model at work. In fact, it is what we’ve been doing for years, with Total Shots Rate. The total number of shots is a nice, but far from perfect, indication of the number of goals you can expect.
Let’s add some more information to our model, and for the sake of readability of this piece, I’ll give you all visual information on a single goal scoring attempt that we’ll use as an example of the current ExpG model that I use on 11tegen11.
Here’s what the ExpG model sees.
- The match situation is open play
The model discriminates between seven match situations: open play, corners, direct free kicks, indirect free kicks, penalties, rebounds and first time attempts.
- A non-league match
This fragment, in case you hadn’t noticed, originates from the Spain vs. Netherlands match at the past World Cup. For each league, different conversion rates are computed for each match situation.
- Game State
The score line during this attempt is 0-0, so the odds of scoring are slightly reduced. Shots at even game state are converted a bit less often than shots at GS +1, or even GS -1.
- Shot location
The angle to the goal is 22 degrees and the distance is almost 15. Note the absence of units for distance: I don’t compute yards or meters, just an abstract number based on coordinates. In terms of modelling, it’s all about the relative difference between different goal scoring attempts, and not about getting the distance correct in absolute terms.
To compute the angle to the goal, I compute angles to both goal posts and take the difference between those two numbers. The number you get represents the view a player has on the goal. It represents how much of a 360-degree circle around the player is represented by the goal. For more lateral positions and more distance from the goal, the number goes down. I prefer this method over a simple angle to the middle of the goal, since it works better for close ranges, where most shots are taken.
- Shot type
This is a shot, rather than a header. Given the location, this makes a huge difference in terms of ExpG.
- Through ball
The shot has been assisted with a through ball. This is a big plus for ExpG, since through balls generally reduce the number of defenders able to contest or block the shot.
- Cross
The shot has not been assisted with a cross. Crosses are bad. They have a negative influence on ExpG. It’s easy to get loads of crosses in, so in terms of trying to score goals they may be good for some teams at some times, but it’s harder to score when the goal scoring attempt comes off a cross than when the same goal scoring attempt does not come off a cross.
- Number of touches
The attacking team has taken three touches. More touches taken reduces ExpG, since (generally) defenders have more time to get in position to defend.
- Vertical speed
In the build-up of play, the attacking team has moved the ball forward at 2.87 per sec. Note the absence of units for distance, since this is again an abstract number based on coordinates. More important point: quicker vertical movement leads to higher ExpG.
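Two of the factors above are pure geometry, so they lend themselves to a quick sketch. The coordinate setup below is my assumption for illustration: the shooter is described by a perpendicular distance to the goal line and a lateral offset from the centre of the goal, and the goal width is set to 7.32 (the real width in metres), whereas the model itself works with abstract coordinate units. The `vertical_speed` helper is likewise a guess at the simplest reading of "forward movement per second".

```python
import math

GOAL_WIDTH = 7.32  # assumption: metres between the posts; the model itself is unitless


def view_angle_degrees(distance_to_goal_line, lateral_offset, goal_width=GOAL_WIDTH):
    """Angle (in degrees) that the goal mouth occupies as seen from the shot location.

    distance_to_goal_line: perpendicular distance from shooter to goal line (> 0)
    lateral_offset: sideways distance from the centre of the goal (0 = straight in front)
    """
    half = goal_width / 2.0
    # Angle to each post, measured from the line running straight towards the goal line.
    angle_to_left_post = math.atan2(half - lateral_offset, distance_to_goal_line)
    angle_to_right_post = math.atan2(-half - lateral_offset, distance_to_goal_line)
    # The difference between the two is the "view" the player has on the goal.
    return math.degrees(angle_to_left_post - angle_to_right_post)


def vertical_speed(start_x, end_x, seconds):
    """Forward progress per second during the build-up, in coordinate units per second."""
    return (end_x - start_x) / seconds


central_close = view_angle_degrees(3.66, 0.0)   # straight in front, very close
central_far = view_angle_degrees(20.0, 0.0)     # straight in front, further out
wide_close = view_angle_degrees(3.66, 15.0)     # same depth, but far out wide
```

As described above, the angle shrinks both with distance (`central_far`) and with more lateral positions (`wide_close`).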
None of the above items are used because I personally think they are important for ExpG measurement. They all show up as significant factors in a multivariate regression analysis that I’ve run on some 160,000 goal scoring attempts in various match situations and various leagues. Just like we tried to minimize the error in our initial two estimations in the early stages of this article, a complex regression model tries to minimize those errors for large numbers of shots and large numbers of potentially important factors for ExpG.
In the end, for open play shots, the above mentioned factors prove to be important. For different match situations, different factors are important. You can imagine that vertical speed is not important to score from corners, or that for indirect free kicks the number of touches is not important (the defense is set to defend anyway). The joy of a multivariate regression model is that it’s not up to you to decide which factors to use (and then having to defend your choice on blogs and Twitter), it’s the model that advises you which factors to use and how to weigh them.
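For readers who want to see the principle in action, here is a toy sketch of a regression that learns weights from shot outcomes. To be clear about the assumptions: it uses logistic regression fitted by plain gradient descent, which is one common choice for goal/no-goal outcomes but not necessarily the method behind the actual model; it uses only two factors (header, through ball); and the 80 shots are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Estimated scoring probability for one attempt."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit_logistic(rows, labels, learning_rate=0.5, epochs=2000):
    """Fit weights by gradient descent on the average log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(rows, labels):
            error = predict(weights, bias, features) - label
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


# Invented data: (is_header, is_through_ball) -> (attempts, goals).
made_up_shots = {
    (0, 0): (20, 2),
    (0, 1): (20, 6),
    (1, 0): (20, 1),
    (1, 1): (20, 3),
}

rows, labels = [], []
for features, (attempts, goals_scored) in made_up_shots.items():
    for i in range(attempts):
        rows.append(list(features))
        labels.append(1 if i < goals_scored else 0)

weights, bias = fit_logistic(rows, labels)
header_weight, through_ball_weight = weights

expg_foot_through_ball = predict(weights, bias, [0, 1])
expg_foot_plain = predict(weights, bias, [0, 0])
expg_header_plain = predict(weights, bias, [1, 0])
```

On this made-up data the fit ends up with a positive weight for through balls and a negative weight for headers, without anyone telling it so beforehand. The real model does the same kind of thing across far more factors and attempts.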
In the future, we may discover new items to measure. If the multivariate regression model then suggests them to be of significant influence on ExpG in certain match situations, they will be added for those match situations. The model is a living thing. If I can improve it, I will.
The most frequently heard comment on any ExpG model is probably the fact that defensive pressure is not incorporated. That’s both true, and not true, depending on how you define defensive pressure.
Since all data is based on ‘on-ball events’, we don’t have any direct information on the position of defenders and goalkeepers. In isolated cases, this can be quite frustrating. Sometimes a goalkeeper is stranded way out of position, and your model ends up underestimating the ExpG of that goal scoring attempt.
While the model may not have direct information on defender and goalkeeper positioning, it does have a lot of indirect information on it. Game State, vertical speed, crosses, through balls and number of touches all carry some information about the amount of defensive pressure that is present for a goal scoring attempt. Obviously, direct information would be preferable, but even with this indirect information, for 99% of attempts we get a good sense of defensive pressure.
In the end
With this piece, I’ve opened up about as much as I can on the workings of my ExpG model. There is no single formula that I can give. It’s not as simple as ‘shots from this zone get 0.12, headers from that zone get 0.07’.
Each goal scoring attempt is judged on the basis of its relevant contextual information. The result is the best estimate I can create for each goal scoring attempt. Using the best contextual information can teach you so much about football. Let’s have a lot of fun with it this coming season!