It’s July 5th, 2014. One hundred and twenty-one minutes of football in the World Cup quarter final between favorites Holland and outsiders Costa Rica have not resulted in a goal, and a penalty series is imminent. Dutch manager Louis van Gaal makes the infamous move to remove goal keeper Jasper Cillessen in favour of Tim Krul, and Holland wins the penalty series to reach the semi-final of the World Cup.
I vividly remember seeing this very unusual move and immediately thinking: “how on earth could they have found any evidence to support this decision?” Competitive penalties in football are such rare events that goal keeping skill when it comes to stopping penalties should be near impossible to evaluate. In fact, Van Gaal may well have used this move to trick the Costa Ricans into thinking Krul was quite special at stopping penalties, thereby influencing the odds in his and Holland’s favour.1
Low hanging fruit
It turns out that I was in the minority in ignoring the historical penalty stopping data. Football and data were becoming hot in those days, and the low hanging fruit was of course Cillessen’s horrible track record of 18 competitive penalties faced, 18 goals conceded.
Strong opinions sell a lot better than fine nuance, particularly at the high point of emotion that is a football World Cup. Ever since the summer of 2014, Cillessen has been associated with the words ‘penalty trauma’ and ‘syndrome’. And I’m sticking to professional media outlets here; on Twitter Cillessen was ridiculed in every possible way, bringing the tweeters their desired high number of retweets. Feed the people sweet low hanging fruit, and they will happily swallow.
How can a goal keeper who fails to stop 18 consecutive penalties still be average? Even a simple statistical test called Chi-square would tell you that a record this poor has just around a 1% chance of turning up if Cillessen were average at stopping penalties.2
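A minimal sketch of that test, for anyone who wants to check the number. It uses only the counts from footnote 2 and the Python standard library. The continuity correction (Yates) is an assumption here, as is the helper name `chi_square_2x2`; without the correction the p-value comes out lower still.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    observed = ((a, b), (c, d))
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    total = a + b + c + d
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            diff = abs(observed[i][j] - expected)
            if yates:
                # Yates' continuity correction (assumed, not confirmed by the post)
                diff = max(0.0, diff - 0.5)
            stat += diff * diff / expected
    # For 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Footnote 2: Cillessen stopped 0 of 18; the other keepers stopped 1020 of 3229.
stat, p = chi_square_2x2(0, 18, 1020, 3229 - 1020)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

With an expected count of fewer than six stops for Cillessen, the chi-square approximation is rough at best, which is one more reason to treat the result as an indication rather than a verdict.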
There are several reasons to assume that Cillessen, despite his poor track record, is an average penalty stopper.
- Cillessen may have around 1% chance of being average, but he is far from the only goal keeper being studied in this respect. There are over a hundred goal keepers who faced at least 18 penalties and this influences how we should view the fact that one of them, Cillessen, produced this odd series. If you try often enough, you’ll sometimes just toss a long series of heads with a perfectly normal coin. As the wiki page on the statistical phenomenon called ‘Bonferroni correction’ puts it: “as we increase the number of hypotheses being tested, we also increase the likelihood of a rare event.”
Simply put, study enough goal keepers and one will indeed fail to stop 18 penalties in a row, despite being a completely average goal keeper.
- To study binary outcomes (yes or no, goal or no goal), 18 observations is a very small set. The idea of statistics is to use numbers to make statements that are as reliable as possible. Imagine three of the next five penalties not being scored. Does this make you completely turn around on the claim that Cillessen can’t stop penalties? If so, you shouldn’t have made the claim in the first place.
This problem, calling an effect while in fact there is no effect, happens much more often than we think. Studies finding ‘significant’ effects sell better than negative studies. If this is true in the scientific world, imagine how it works in the world of journalism where people fight each other for clicks and reads.
- It’s easy to assume that a goal keeper with a 0/18 record is a very poor penalty stopper, while in fact the only thing that we test with our “1% chance” is whether Cillessen is different from the average goal keeper at stopping penalties. This test does not tell you anything about the effect size. It doesn’t mean that stopping 0/18 is Cillessen’s actual level, but rather that based on this small set of 18 observations Cillessen may well be below average. Whether this means that Cillessen stops 75.9% instead of 76% or 50% instead of 76%, we can’t say. But any reasonable thinking will lead you to assume that a goal keeper who is generally above the level of rival professional goal keepers in many other aspects can’t be tens of percentage points behind on stopping penalties. So, if any effect exists at all, it is probably a few percentage points at most.
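The first two points above can be put into numbers with a rough sketch. It assumes exactly 100 independent goal keepers (the text only says "over a hundred"), takes the 1% figure at face value, and reuses the counts from footnote 2. The continuity-corrected test and all names in the code are illustrative choices, not a definitive analysis.

```python
import math


def chi_square_p(stops_a, shots_a, stops_b, shots_b):
    """Yates-corrected chi-square p-value (1 df) comparing two stop rates."""
    observed = ((stops_a, shots_a - stops_a), (stops_b, shots_b - stops_b))
    rows = (shots_a, shots_b)
    cols = (stops_a + stops_b, shots_a + shots_b - stops_a - stops_b)
    total = shots_a + shots_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            diff = max(0.0, abs(observed[i][j] - expected) - 0.5)
            stat += diff ** 2 / expected
    return math.erfc(math.sqrt(stat / 2.0))


# Point 1: many keepers are being looked at, not just one.
P_SINGLE = 0.01    # the "around 1%" quoted in the text
N_KEEPERS = 100    # assumption: the text says "over a hundred"
family_wise = 1.0 - (1.0 - P_SINGLE) ** N_KEEPERS
bonferroni_threshold = 0.05 / N_KEEPERS

# Point 2: how fragile is the result? Hypothetical next five penalties,
# three of them not scored, turning 0/18 into 3/23.
p_now = chi_square_p(0, 18, 1020, 3229)
p_later = chi_square_p(3, 23, 1020, 3229)

print(f"chance of at least one 1%-oddity among {N_KEEPERS} keepers: {family_wise:.2f}")
print(f"Bonferroni-corrected threshold: {bonferroni_threshold:.4f}")
print(f"p-value at 0/18: {p_now:.4f}")
print(f"p-value at 3/23 (hypothetical): {p_later:.4f}")
```

Under these assumptions, a 1-in-100 oddity somewhere among a hundred keepers is more likely than not, and three hypothetical misses are enough to push the test back above the usual 5% threshold.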
So, should we just ignore the whole 0/18 thing? Probably not.
At best, this point in time, July 5th, 2014, serves as the moment we recognize that Cillessen may have a penalty issue, though odds are he doesn’t. Can you imagine a newspaper or Twitter feed scoring any points with such a balanced point of view?
The best thing to do would probably be to start tracking his numbers prospectively and see if a new set of observations confirms what we found in the pilot study. Only then can we reliably say if there’s a true issue with Cillessen and penalties, or just a coincidence.
From penalty 19 onwards, Cillessen saw the first three penalties being converted (Sep 30 2014 APOEL – Oct 13 2014 Iceland – Nov 9 2014 Cambuur), a fourth one hitting the post (Aug 27 2015 Jablonec) and the fifth one being converted (Sep 3 2015 Iceland). This combines for a 1/5 record.
Obviously, it is debatable whether a penalty that hits the woodwork should be credited to Cillessen’s penalty stopping skills, but changing the definitions of our study halfway because the outcome doesn’t really suit our desired statement is fraud with numbers, however tempting it is. Think of the headline ‘Cillessen still hasn’t stopped a penalty in his career’.
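To put the new 1/5 in perspective, this sketch asks how often a perfectly average goal keeper would stop at most one of five penalties. The average stop rate is taken from footnote 2, the penalty against the post is counted as a stop in line with the definition above, and penalties are assumed to be independent. All of that is a simplification.

```python
import math


def prob_at_most(k, n, rate):
    """Binomial probability of at most k stops in n penalties at a given stop rate."""
    return sum(
        math.comb(n, i) * rate ** i * (1.0 - rate) ** (n - i)
        for i in range(k + 1)
    )


# Footnote 2: the other keepers in the dataset stopped 1020 of 3229 penalties.
AVERAGE_STOP_RATE = 1020 / 3229

# New prospective set: 1 penalty not scored out of 5.
p_one_or_fewer = prob_at_most(1, 5, AVERAGE_STOP_RATE)
print(f"P(at most 1 stop in 5 | average keeper) = {p_one_or_fewer:.2f}")
```

Under these assumptions the answer is close to a coin flip, so the new set says very little on its own yet.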
In the end
We will follow Cillessen’s new 1/5 set with close attention until the set has another 18 observations and see where we stand. Chances are that we could publish our revolutionary findings and claim that new research shows that Cillessen doesn’t have a penalty syndrome. Chances also are that no one will be interested in such a headline, and mass media will have found new situations where numbers can be abused in order to get clicks and sell papers.
1 For a nuanced view on events surrounding the Cillessen-Krul substitution, read this post.
2 Based on Cillessen stopping 0 from 18 penalties versus other keepers in my dataset stopping 1020 from 3229 penalties.