# Expected Goals 2.0 – Some light in the black box

If football analytics were a Hollywood movie, Expected Goals would definitely be the poster boy. The influx of attention for football analytics during the recent World Cup meant a lot of attention for the concept of Expected Goals, or ExpG as it’s mostly referred to. With that attention came two very important questions that I’ll try to address in this post. What is ExpG? And how do you compute it?

## What is ExpG?

Expected Goals assigns each goal scoring attempt a number between 0 and 1, representing the chance that the attempt results in an actual goal.

I use a model that I have revised completely over the summer, so this makes for a perfect time to explain the full workings of it. Expected Goals 2.0, here we go…

## Modelling

Suppose I tell you that a football match has just finished and I ask you to estimate the number of goals for each team. You know nothing. Not the teams, not the occasion, not the shot numbers, and nothing that happened on the pitch.

You’d probably say both teams have scored around 1.4 goals, since 2.8 is a good estimate for the average number of goals per football match. Since you have absolutely no information about the match at hand, estimating this average of 1.4 goals per team should lead to the smallest difference between your estimate and the actual goals by each team.

In building a model, the difference between your estimate and the actual outcome is called the error, and you should be aiming to keep the error as small as possible.
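A minimal sketch of why the average is the safest guess when you know nothing else. The goal counts below are made up for illustration, and squared error is assumed as the way to measure "error"; the post does not say which error measure is meant.

```python
# Hypothetical goal counts for one team across ten matches (illustrative only).
goals = [0, 1, 1, 2, 3, 1, 0, 2, 4, 1]

def mean_squared_error(estimate, outcomes):
    """Average squared difference between one fixed estimate and each outcome."""
    return sum((estimate - g) ** 2 for g in outcomes) / len(outcomes)

average = sum(goals) / len(goals)

# Try candidate estimates 0.0, 0.1, ..., 4.0 and keep the one with the smallest error.
candidates = [i / 10 for i in range(0, 41)]
best = min(candidates, key=lambda c: mean_squared_error(c, goals))

print(f"average = {average:.2f}, best single estimate = {best:.2f}")
```

Under that assumption, the candidate with the smallest error coincides with the plain average of the outcomes.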

(don’t look down at the .gif yet)

## Shots

Now, I tell you that the match at hand had 10 shots by team A and 14 shots by team B. Would this change your estimate of 1.4 goals for each team?

Since we know that on average 1 in 9 shots, or 11%, results in a goal, it would make most sense to estimate 1.1 (10 * 0.11) goals for team A and 1.54 (14 * 0.11) goals for team B.
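The same arithmetic as a tiny sketch. The function name is hypothetical; the 11% conversion rate and the shot counts are the ones quoted above.

```python
CONVERSION_RATE = 0.11  # roughly 1 goal per 9 shots, as quoted in the text

def expected_goals_from_shots(shots, conversion_rate=CONVERSION_RATE):
    """Shots-only estimate: every shot is treated as equally likely to go in."""
    return shots * conversion_rate

team_a = expected_goals_from_shots(10)
team_b = expected_goals_from_shots(14)
print(f"Team A: {team_a:.2f}, Team B: {team_b:.2f}")  # Team A: 1.10, Team B: 1.54
```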

This is your most basic expected goals model at work. In fact, it is what we’ve been doing for years, with Total Shots Rate. The total number of shots is a nice, but far from perfect, indication of the number of goals you can expect.

## Attempts

Let’s add some more information to our model. For the sake of readability, I’ll give you all the visual information on a single goal scoring attempt, which we’ll use as an example of the current ExpG model that I use on 11tegen11.

Here’s what the ExpG model sees.

1. The match situation is open play

The model discriminates between seven match situations: open play, corners, direct free kicks, indirect free kicks, penalties, rebounds and first time attempts.

1. A non-league match

This fragment, in case you hadn’t noticed, originates from the Spain vs. Netherlands match at the past World Cup. For each league, different conversion rates are computed for each match situation.

1. Game State

The score line during this attempt is 0-0, so the odds of scoring are slightly reduced. Shots at even game state are converted a bit less often than shots at GS +1, or even GS -1.

1. Shot location

The angle to the goal is 22 degrees and the distance is almost 15. Note the absence of units for distance, I don’t compute yards or meters, just an abstract number based on coordinates. In terms of modelling, it’s all about the relative difference between different goal scoring attempts, and not about getting the distance correct in absolute terms.

To compute the angle to the goal, I compute angles to both goal posts and take the difference between those two numbers. The number you get represents the view a player has on the goal: how much of a 360 degree circle around the player is taken up by the goal. For more lateral positions and more distance from the goal, the number goes down. I prefer this method over a simple angle to the middle of the goal, since it works better for close ranges, where most shots are taken.
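A rough sketch of that angle computation, under stated assumptions: the coordinate system (goal line at x = 0, goal centred on y = 0), the post positions and all function names are hypothetical, and metres are used only to make the example concrete, whereas the model itself works with unitless coordinates. Each post angle is measured against the line through the shot that runs perpendicular to the goal line, so with signed angles the difference gives the view on goal for both central and lateral positions.

```python
import math

# Hypothetical coordinates: goal line is x = 0, goal centred on y = 0.
# Metres are for illustration only; the model in the post is unitless.
HALF_GOAL_WIDTH = 3.66
LEFT_POST = (0.0, -HALF_GOAL_WIDTH)
RIGHT_POST = (0.0, HALF_GOAL_WIDTH)

def angle_to_point(shot, point):
    """Signed angle (radians) from the shot to a point, measured against the
    line through the shot that is perpendicular to the goal line."""
    return math.atan2(point[1] - shot[1], shot[0] - point[0])

def goal_view_angle(shot):
    """Difference between the angles to both posts, in degrees."""
    return math.degrees(angle_to_point(shot, RIGHT_POST) - angle_to_point(shot, LEFT_POST))

def distance_to_goal(shot):
    """Straight-line distance from the shot to the centre of the goal."""
    return math.hypot(shot[0], shot[1])

central = goal_view_angle((3.66, 0.0))   # close and central
far = goal_view_angle((20.0, 0.0))       # central but further out
wide = goal_view_angle((3.66, 15.0))     # same depth, far out wide
print(round(central, 1), round(far, 1), round(wide, 1))
```

The close central shot sees the goal under a 90 degree angle in this setup, and the number shrinks for both the longer and the more lateral attempt, which matches the behaviour described above.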

1. Shot type

This is a shot, rather than a header. Given the location, this makes a huge difference in terms of ExpG.

1. Through ball

The shot has been assisted with a through ball. This is a big plus for ExpG, since through balls generally reduce the number of defenders able to contest or block the shot.

1. Cross

The shot has not been assisted with a cross. Crosses are bad. They have a negative influence on ExpG. It’s easy to get loads of crosses in, so in terms of trying to score goals they may be good for some teams at some times, but it’s harder to score when the goal scoring attempt comes off a cross than when the same goal scoring attempt does not come off a cross.

1. Touches

The attacking team has taken three touches. More touches taken reduces ExpG, since (generally) defenders have more time to get in position to defend.

1. Vertical speed

In the build-up of play, the attacking team has moved the ball forward at 2.87 per sec. Note the absence of units for distance, since this is again an abstract number based on coordinates. More important point: quicker vertical movement leads to higher ExpG.

## Regression

None of the above items is included simply because I personally think it is important for ExpG measurement. They all show up as significant factors in a multivariate regression analysis that I’ve run on some 160,000 goal scoring attempts in various match situations and various leagues. Just like we tried to minimize the error in our initial two estimations in the early stages of this article, a complex regression model tries to minimize those errors for large numbers of shots and large numbers of potentially important factors for ExpG.

In the end, for open play shots, the above-mentioned factors prove to be important. For different match situations, different factors are important. You can imagine that vertical speed is not important for scoring from corners, or that for indirect free kicks the number of touches is not important (the defense is set to defend anyway). The joy of a multivariate regression model is that it’s not up to you to decide which factors to use (and then have to defend your choice on blogs and Twitter); it’s the model that advises you which factors to use and how to weigh them.
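To make the idea of "the model weighs the factors" concrete, here is a toy sketch. It assumes logistic regression, which the post does not confirm, and fits it with plain gradient descent on synthetic shots. The data, the two factors chosen and the "true" weights used to generate goals are all invented for this example; they are not the coefficients of the actual ExpG model, and no variable selection is performed.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic shots, generated only to have something to fit (invented weights).
rng = random.Random(0)
shots, goals = [], []
for _ in range(2000):
    distance = rng.uniform(5.0, 30.0)                   # unitless distance to goal
    through_ball = 1.0 if rng.random() < 0.2 else 0.0   # assisted by a through ball?
    p_goal = sigmoid(0.5 - 0.15 * distance + 1.0 * through_ball)
    shots.append([distance, through_ball])
    goals.append(1.0 if rng.random() < p_goal else 0.0)

# Standardise each factor (mean 0, spread 1) so gradient descent converges quickly.
n, k = len(shots), len(shots[0])
means = [sum(s[j] for s in shots) / n for j in range(k)]
stds = [math.sqrt(sum((s[j] - means[j]) ** 2 for s in shots) / n) for j in range(k)]
X = [[(s[j] - means[j]) / stds[j] for j in range(k)] for s in shots]

# Fit an intercept plus one weight per factor by minimising the average log-loss
# (batch gradient descent with step size 1.0).
bias, weights = 0.0, [0.0] * k
for _ in range(200):
    grad_b, grad_w = 0.0, [0.0] * k
    for x, y in zip(X, goals):
        err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
        grad_b += err
        for j in range(k):
            grad_w[j] += err * x[j]
    bias -= grad_b / n
    weights = [w - g / n for w, g in zip(weights, grad_w)]

def expg(distance, through_ball):
    """Estimated scoring probability for one shot under the fitted toy model."""
    x = [(distance - means[0]) / stds[0], (through_ball - means[1]) / stds[1]]
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

print("weights (distance, through ball):", [round(w, 2) for w in weights])
print("close through-ball shot:", round(expg(8.0, 1.0), 2))
print("long shot, no through ball:", round(expg(28.0, 0.0), 2))
```

On this synthetic data the fit should give distance a negative weight and through balls a positive one, mirroring the direction of the effects described for the real model. A real analysis would fit separate models per match situation and test which factors earn their place.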

In the future, we may discover new items to measure. If the multivariate regression model then suggests them to be of significant influence on ExpG in certain match situations, they will be added for those match situations. The model is a living thing. If I can improve it, I will.

## Defensive pressure

The most frequently heard comment on any ExpG model is probably the fact that defensive pressure is not incorporated. That’s both true, and not true, depending on how you define defensive pressure.

Since all data is based on ‘on-ball events’, we don’t have any direct information on the position of defenders and goalkeepers. In isolated cases, this can be quite frustrating. Sometimes a goalkeeper is stranded way out of position, and your model ends up underestimating the ExpG of that goal scoring attempt.

While the model may not have direct information on defender and goalkeeper positioning, it does have a lot of indirect information on it. Game State, vertical speed, crosses, through balls and number of touches all carry some information about the amount of defensive pressure that is present for a goal scoring attempt. Obviously, direct information would be preferable, but even with this indirect information, for 99% of attempts we get a good sense of defensive pressure.

## In the end

With this piece, I’ve opened up about as much as I can on the workings of my ExpG model. There is no single formula that I can give. It’s not as simple as ‘shots from this zone get 0.12, headers from that zone get 0.07’.

Each goal scoring attempt is judged on the basis of its relevant contextual information. The result is the best estimate I can create for each goal scoring attempt. Using the best contextual information can teach you so much about football. Let’s have a lot of fun with it this coming season!

## 21 thoughts on “Expected Goals 2.0 – Some light in the black box”

1. Ross Taylor

“They all show up as significant factors in a multivariate regression analysis that I’ve run on some 160,000 goal scoring attempts in various match situations and various leagues.”

With 160,000 observations, nearly any variable is significant at the 1/5/10% level :P.

1. 11tegen11 Post author

That depends totally on your multivariate model. Most of the multivariate regressions use stepwise elimination (or addition) by study variable and present only those variables that improve the model as significant.
In univariate analysis, yes, most variables would be of significance at that N.

2. Shawn Spence

Great stuff. How did you get the raw data for your analyses, particularly the goal angle? I’m guessing that doesn’t show up in too many places.

I have seen a few other aspects of play suggested as factors in expected goals, like the speed of the attack, but this is very good. I would also be interested in whether any of these factors are markedly more correlated than the majority.

3. staty

“To compute the angle to the goal, I compute angles to both goal posts and take the difference between those two numbers.”

I am not sure if “difference” makes sense here. As far as I understood the further explanation of what angle you compute, I’d assume that you use the “angle of view” described by kickdex once: http://blog.kickdex.com/post/52303980749/angle-of-view

If this was the case, “sum” would be the correct word instead of “difference”, I suppose.

Anyway, great work and good explanation.

1. Matthias Kullowatz

I would assume he measures the angle to each post against a line running perpendicular to the endline and through the shot location. If that line passes through the near post, then that particular degree measure would be 0. The difference of the angles would then represent just how wide the goal posts were relative to where the player was.

4. Gigi

Being late, I can only second what others said above and asked:

Great stuff, very interesting reading – but where do you get the detailed data from?

I’d just assume you use Opta data as provided by Statszone and/or Squawka (for location/coordinates of shots) and Whoscored (for information on )?

If so, I wonder how you deal with

(a) discrepancies between the information and the images (e.g. shots that are missing, added, or characterized differently, such as saved instead of blocked)
(b) different degrees of accuracy of shot locations as displayed by the above-mentioned sites.

Thanks a bunch!

5. Mr10B

Great article! To add more stats on defensive pressure in the ExpG 3.0 model, one could think of the number of defenders within x units squared of the shooter, whether or not a defender is between the shooter and the goal, the angle the defender has in comparison to the shooter, etc. A mixture of these factors combined should probably give you a “defense position” statistic that can be useful. I think there are some possibilities here.

And yes, where do you get your data?

If you ever think of expanding this one man job into a team job, e-mail me ;).

1. 11tegen11 Post author

Your best bets for data are Squawka, WhoScored and ESPN.
Combining information from various sources, as one would imagine, offers the best insight that is currently available in terms of data.

If you can find any information on positions of players that don’t have the ball, like you suggest for ExpG 3.0, I’d be very interested. In the meantime, though, I’m fairly sure that such information does not exist within the public domain. If it did, it would already have been part of ExpG 2.0, since I’m forced to use proxies for defensive pressure, rather than direct information on that parameter.

Cheers, and thanks for the offer!

6. Rianne

I am wondering, what are the effect sizes of your models? Because that seems very relevant to me, because you would want to know how much of the variance in the prediction is explained by the model. Using ExpG as if that variable would actually be the Expected Goals (as I think is done many times in the other articles) seems not fair to me.

7. Joe Mulberry (@joemulberryTID)

Prozone are currently working on passing options using tracking data. They presume each defensive outfield player has a 2m reach either side of them, and using this they calculate how much free ‘passing angle’ there is to each free player. This allows them to calculate passing risk, and this, correlated to the location of the pass (pitch zone) and success rates, gives them a calculation of a player’s ability to play high risk passes as a number 10.

*The 2m reach needs some refining as the distance from the passer is a variable that is not currently taken into account.

I would suggest that a goalkeeper could be given an average ‘reach’ (although with more complexity this would include variables such as GK height and GK distance from shot). This reach and the positions of the shot and the GK could give a VisGA (Visible Goal Area) for each shot. This could be used as a key variable to improve the calculations of ExpG within version 3.0.

A weakness would be the GK’s ‘Chip Susceptibility’ but this could be calculated and factored in.

Thoughts, author? Anyone?

1. Mr10B

Sounds great, but I think it will be hard to find the necessary data to calculate this. Any suggestions?
