Building an Expected Goals Model in Python

Introduction

Expected goals remains a controversial measurement tool. Misunderstood by many, misused by more; like it or not, expected goals is here to stay. It has gained a lot of traction in the mainstream over the last year, to the delight and dismay of those in the analytics community.

I’m not going to get into what expected goals are. Far too many articles have been written about the metric, including by me in some dark corner of the internet. But if you want an in-depth read, be sure to go here, here, and here.

Essentially, we are going to take distance from goal, angle, and shot context to calculate the percentage chance of a shot becoming a goal. For today’s article we will look at some data which was -REDACTED- from -REDACTED-. I will use data from the top 5 leagues (Premier League, Bundesliga, La Liga, Serie A, and Ligue 1) from the 10/11 season to the 17/18 season to build an expected goals model, see how it performs on the test data, and then see how it holds up on this season’s data.

The Model

I have chosen to use logistic regression. This isn’t for any special reason, but it is faster to run than SVMs or random forests and it doesn’t require fiddling around with hyperparameters or weights on a first run. Also, we’re essentially dealing with a classification problem: does a shot result in a goal or not (1 or 0)?
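To make that a bit more concrete, here is a minimal sketch of what a logistic regression does with a single shot: it takes a weighted sum of the features and squashes it into a probability between 0 and 1. The intercept and coefficients below are made up purely for illustration; a fitted model would estimate them from the data.

```python
import math

# Hypothetical intercept and coefficients, for illustration only.
# A fitted model would estimate these from the shot data.
INTERCEPT = -1.0
COEFFICIENTS = {
    "metersToGoal": -0.12,  # assumed: farther from goal lowers the chance
    "angle": 1.2,           # assumed: a wider view of goal (radians) raises it
    "isHead": -0.8,         # assumed: headers are harder to convert
}

def goal_probability(shot):
    """Map a shot's features to a probability with the logistic function."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * shot[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-z))

close_shot = {"metersToGoal": 8.0, "angle": 0.8, "isHead": 0}
long_shot = {"metersToGoal": 30.0, "angle": 0.2, "isHead": 0}

print(round(goal_probability(close_shot), 3))
print(round(goal_probability(long_shot), 3))
```

Whatever the coefficients turn out to be, the output is always a probability, which is exactly what we want an xG value to be.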

The data I’m using isn’t ready out of the box. I have a lot of code to parse this data and run through events and qualifiers to get extra data such as isThrowin, isThroughball, etc. Here are the columns I have loaded from my dataset.

Of course we won’t use all of these columns, but let’s see what some of them are:

  • xM & yM: the x and y coordinates of the shot in metres.
  • metersToGoal: how far away the chance was from the opposition goal.
  • angle: the player’s view of goal when the shot was taken.
  • minute: the minute of the match in which the shot took place.
  • isShot: we’re building a shots model so this should be 1, this is only to filter the original dataset
  • isGoal: this is what we are trying to model -> this will be our y later on
  • isOwnGoal: these are poison to xG models. These will be filtered out.
  • isBC, isPen, isOpenPlay, isHead, isFreeKick, isFromCorner, isFromCross, isFromThroughball: 1 or 0 – shot context
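For anyone wondering how metersToGoal and angle can be derived from xM and yM, here is a rough sketch. It is not my actual parsing code. It assumes a 105 x 68 metre pitch, that the attacking goal sits at x = 105, and that "angle" means the angle between the two posts as seen from the shot location; the function names are made up for the example.

```python
import math

PITCH_LENGTH = 105.0  # assumed pitch length in metres
PITCH_WIDTH = 68.0    # assumed pitch width in metres
GOAL_WIDTH = 7.32     # distance between the posts in metres

def meters_to_goal(x_m, y_m):
    """Straight-line distance from the shot to the centre of the goal."""
    return math.hypot(PITCH_LENGTH - x_m, PITCH_WIDTH / 2 - y_m)

def shot_angle(x_m, y_m):
    """Angle in radians between the two posts as seen from the shot."""
    dx = PITCH_LENGTH - x_m
    near_post = PITCH_WIDTH / 2 - GOAL_WIDTH / 2 - y_m
    far_post = PITCH_WIDTH / 2 + GOAL_WIDTH / 2 - y_m
    return abs(math.atan2(far_post, dx) - math.atan2(near_post, dx))

# Penalty spot: 11 metres out, dead centre.
print(round(meters_to_goal(94.0, 34.0), 2))
print(round(math.degrees(shot_angle(94.0, 34.0)), 1))

# Same distance from the goal line, but out wide: a much tighter angle.
print(round(math.degrees(shot_angle(94.0, 10.0)), 1))
```

The wide shot has a smaller angle than the central one even though both are the same distance from the goal line, which is why angle carries information that distance alone does not.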

For this run-through I have dropped xM and yM, as that information is already factored into the distance from goal and I don’t want to double count. Let’s throw some of these variables into a logit and see how things look:

All of our features should be fine, as their p-values are all below the 0.05 threshold. Next, let’s shuffle our data, split it into training and testing sets, and see how accurate the model is on the test data. A common convention is a 70/30 split, so I’ll go with that.
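Here is a sketch of what that step can look like with scikit-learn. The data is synthetic (random shots generated from an assumed relationship), so the numbers it prints mean nothing; the point is the shape of the workflow: a shuffled 70/30 split, a fit on the training set, and an accuracy score on the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_shots = 5000

# Synthetic stand-in features; the real X would hold the shot columns.
meters_to_goal = rng.uniform(2, 35, n_shots)
angle = rng.uniform(0.05, 1.5, n_shots)
is_head = rng.integers(0, 2, n_shots)
X = np.column_stack([meters_to_goal, angle, is_head])

# Synthetic outcomes drawn from an assumed relationship, for illustration.
z = -1.0 - 0.12 * meters_to_goal + 1.2 * angle - 0.8 * is_head
y = rng.binomial(1, 1 / (1 + np.exp(-z)))

# Shuffle and hold out 30% of the shots for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)

log_r = LogisticRegression(max_iter=1000)
log_r.fit(X_train, y_train)

# score() returns plain accuracy: the share of test shots classified correctly.
accuracy = log_r.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Fixing random_state makes the split reproducible, which helps when comparing different feature sets against each other.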

91% on the test set sounds good, but the model is classifying whether a shot is a goal or not, and most shots are not goals, so most of that score comes from correctly classifying shots as NOT A GOAL. We can check this by looking at a confusion matrix:
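To see why accuracy flatters a model on this kind of data, consider a "model" that never predicts a goal. The counts below are invented for the example (100 goals in 1,000 shots), not taken from my dataset.

```python
# Hypothetical test set: 1 = goal, 0 = no goal (counts are made up).
y_true = [1] * 100 + [0] * 900

# A "model" that never predicts a goal.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
baseline_accuracy = correct / len(y_true)
goals_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(baseline_accuracy)  # 0.9
print(goals_found)        # 0
```

It scores 90% accuracy without identifying a single goal, so accuracy on its own says very little about how useful the model is.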

The confusion matrix shows us which values the model correctly and incorrectly classified.

The value at the top left is the number of shots that were correctly predicted as not goals. To the right of this is the number of shots that were predicted as goals but were not goals. The second row starts with the shots that were classified as not goals but were in fact goals. To the right of that are the shots that were correctly classified as goals.
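If the layout is hard to picture, this small sketch builds the same kind of 2x2 matrix by hand, with rows for what actually happened and columns for what was predicted. The labels are a toy example, not real shots.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels where 1 = goal, 0 = no goal."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Row = what actually happened, column = what the model predicted.
        matrix[actual][predicted] += 1
    return matrix

# Toy labels for ten shots.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]

matrix = confusion_matrix_2x2(y_true, y_pred)
print(matrix)  # [[6, 1], [2, 1]]
```

Here six non-goals were correctly called, one non-goal was wrongly called a goal, two goals were missed, and one goal was correctly called.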

Alright, just one more test to see how the model is looking. This time we’ll use an ROC curve, which shows how well the model separates goals from non-goals across different classification thresholds. We got a score (the area under the curve) of 0.61, which isn’t spectacular but it will do the job. This score could be improved by adding in variables such as game state and the league (as scoring rates are not uniform across competitions).
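Assuming the score here is the area under the ROC curve, one way to read it is as the probability that a randomly chosen goal is given a higher predicted value than a randomly chosen non-goal, so 0.5 is no better than guessing and 1.0 is a perfect ranking. The sketch below computes it that way on toy numbers, counting ties as half.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of goals vs non-goals."""
    goals = [s for t, s in zip(y_true, scores) if t == 1]
    non_goals = [s for t, s in zip(y_true, scores) if t == 0]
    if not goals or not non_goals:
        raise ValueError("need at least one goal and one non-goal")
    wins = 0.0
    for g in goals:
        for n in non_goals:
            if g > n:
                wins += 1.0   # the goal was ranked above the non-goal
            elif g == n:
                wins += 0.5   # ties count as half
    return wins / (len(goals) * len(non_goals))

# Toy example: two goals and four non-goals with made-up predicted values.
y_true = [0, 0, 1, 0, 1, 0]
scores = [0.05, 0.10, 0.40, 0.30, 0.08, 0.02]

auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

The pairwise loop is fine for a toy example but slow on hundreds of thousands of shots; a library routine is the better choice for real data.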

Expected Goals 2018/19

Now that the model is built, let’s take a look at how it performs totally out of sample on this season’s data. We can do this by using log_r.predict_proba(X), with X being the variables from our 2018/19 dataset.
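A quick note on what predict_proba returns, since it trips people up. In scikit-learn it gives one column per class, ordered as in log_r.classes_, so with 0/1 labels the second column is the probability of a goal, i.e. the shot’s xG. The sketch below uses synthetic training data and three made-up shots to show the shape of the output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic training shots standing in for the real training data.
distance = rng.uniform(2, 35, 2000)
angle = rng.uniform(0.05, 1.5, 2000)
X_train = np.column_stack([distance, angle])
z = -1.0 - 0.12 * distance + 1.2 * angle  # assumed relationship
y_train = rng.binomial(1, 1 / (1 + np.exp(-z)))

log_r = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Hypothetical out-of-sample shots: [metersToGoal, angle in radians].
X = np.array([[11.0, 0.64], [25.0, 0.25], [6.0, 1.0]])

proba = log_r.predict_proba(X)  # shape (n_shots, 2): [P(no goal), P(goal)]
shot_xg = proba[:, 1]           # keep the goal column only
total_xg = shot_xg.sum()        # summing shot values gives total xG

print(log_r.classes_)
print(shot_xg.round(3))
print(round(float(total_xg), 3))
```

Summing shot_xg over a player’s or team’s shots gives their expected goals total.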

Below is expected goals vs observed goals per 90 minutes for outfield players with more than 900 minutes played from 13 leagues.
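For reference, the per-90 figures and the minutes filter can be worked out along these lines. The players and totals are invented, and I am assuming r^2 means the squared Pearson correlation between xG per 90 and goals per 90.

```python
# Hypothetical season totals, for illustration only.
players = [
    {"name": "Player A", "minutes": 2700, "xg": 15.0, "goals": 18},
    {"name": "Player B", "minutes": 1800, "xg": 6.0, "goals": 5},
    {"name": "Player C", "minutes": 450, "xg": 3.0, "goals": 4},
    {"name": "Player D", "minutes": 3000, "xg": 9.0, "goals": 8},
]

def per_90(total, minutes):
    """Scale a season total to a per-90-minutes rate."""
    return total / minutes * 90

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)

# Keep only players with more than 900 minutes.
qualified = [p for p in players if p["minutes"] > 900]
xg90 = [per_90(p["xg"], p["minutes"]) for p in qualified]
goals90 = [per_90(p["goals"], p["minutes"]) for p in qualified]

for p, x, g in zip(qualified, xg90, goals90):
    print(p["name"], round(x, 2), round(g, 2))
print(round(r_squared(xg90, goals90), 3))
```

The minutes cut-off matters because per-90 rates for players with very few minutes swing wildly on one or two shots.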

For a quick logistic regression with minimal optimisation, I’m quite happy with the r^2 here. I’ve included the top 20 players by expected goals for you to look at. One stands out right away, as I posted late last night on Twitter:

Before I wrap up, here are the 20 players with the biggest difference (+/-) between xG and observed goals.

KEVIN LASAGNA!

Round Up

Expected goals is a neat little metric. But remember that it’s just one of many tools in the analyst’s toolbox. The process is more important than the result, and there are far cooler things we can do with this data than just calculate expected goals. However, it does have its uses in analysing long-term trends in football at player and team level, as well as in creating betting models (which I will be writing about soon).

As always, if you have questions or complaints, you can reach me on Twitter here.
