Predicting the English Premier League

The pre-match warm-up – Intro.

This post’s about predicting (football) match results in the English Premier League using a few simple statistics over the 2016/17 season.

The English Premier League – how it works

Teams and their matches

Teams are fielded by football clubs and in the English Premier League (EPL) there are twenty of these, so we end up with:

  • 20 teams, who play every other team in the league twice; once at Home and once at the opposition’s, or Away, ground.
  • 38 games in a season for each team 1.
  • 380 matches in the league each season 2.
Points

Points are awarded to each team according to their result in the match:

  • 3 Points for a win.
  • 1 Point for a Draw.
  • 0 Points for a loss.
League position and Goal Difference

Teams are ranked in the league table according to accumulated points over the course of the season.

Where teams are tied on points, the superior Goal Difference, i.e. goals scored minus goals conceded, is used as a tiebreaker to decide the higher position.

Final position impact

At the end of the season the teams occupying the top ranks in the league table gain kudos, monetary rewards and entry to additional competitions within Europe and beyond, whilst the bottom three teams are relegated to the division below and their places taken by newly promoted teams.

The final league positions have enormous, some would argue ridiculous, financial implications for each club, and the 2016/17 season we're looking at was no exception.

The combination of high stakes and a points system skewed toward encouraging teams to win can, on occasion, provide a highly entertaining spectacle, as well as a league where a team that was almost relegated one season wins it the next.

2016/17 – the EPL by the numbers

It is somewhat of a cliché, but football really is a game of fine margins. The impact of this, along with Home Advantage and a motivational points system, can be clearly seen in a bar chart breakdown of results for the 2016/17 season.

Bar chart showing the relative percentages of Home wins, Draws and Away wins

Of the 380 matches, a Home win is the most likely outcome, with a Draw being the least:

  • 187 (49.21%) Home wins.
  • 84 (22.11%) Draws.
  • 109 (28.68%) Away wins.

Like anything involving gambling, EPL results are well studied, and a little Googling suggests that these statistics of roughly 50% Home wins, 20% Draws and 30% Away wins are fairly consistent year on year 3.

Methodology Overview

Libraries and scripts were developed in Python 3.6 to:

  • Download the season results from BBC sports website into a SQLite DB.
  • Generate statistics based on the results.
  • Create Models from these statistics.
  • Use the Python testing framework to run tests of the Models’ predictive abilities and to log the results into a SQLite DB, as sketched below.
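To make that flow a little more concrete, here is a minimal sketch of the season replay used for the evaluations that follow. It is illustrative only: the table name, column names and the predict callback are assumptions of mine, not the repository’s actual schema or API.

    import sqlite3

    def evaluate(db_path, predict):
        """Replay a season in date order, asking `predict` for each match's
        outcome ('H', 'D' or 'A') given only the results seen so far.
        (The real tests also skip the whole first weekend, for which there
        is no prior data at all.)"""
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT date, home_team, away_team, home_goals, away_goals "
            "FROM results ORDER BY date"
        ).fetchall()
        conn.close()

        seen, correct, predicted = [], 0, 0
        for date, home, away, hg, ag in rows:
            if seen:
                guess = predict(seen, home, away)
                actual = 'H' if hg > ag else 'A' if ag > hg else 'D'
                correct += guess == actual
                predicted += 1
            seen.append((home, away, hg, ag))
        return correct / predicted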

The source code can be found in the repository here:

https://github.com/shufflingB/prem_league_prediction

All of the tests reported in this document (and a few others beside) are contained in the file

./tests/test_predictor_algorithms.py.

Each of these tests has a unique test case ID that can be used to cross reference between this document and source code.

Benchmarks

Random selection

If we pick one of the three possible outcomes uniformly at random then, perhaps surprisingly, we will get the result right on average exactly 1/3 of the time regardless of how the actual results are distributed: whatever the real mix of outcomes, each match has a 1/3 chance of matching our pick.

The skew in the real distribution does still matter though, because it shows up as a per-outcome estimation error.

With random selection against the 2016/17 statistics we would get a:

  • 15.9% under estimation error for Home wins (49.21% – 33.33%).
  • 11.2% over estimation error for Draws (33.33% – 22.11%).
  • 4.7% over estimation error for Away wins (33.33% – 28.68%).

So for the 2016/17 season, random selection of Home win, Draw or Away win would enable us to get the result right about 33.3% of the time.

Picking the Home team

Instead of randomly picking a result, we could always just pick the Home team. In this case we would have got the result right, as previously mentioned, 49.21% of the time.
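As a quick, throwaway sanity check (not part of the repository) that both benchmark figures behave as described, using the 2016/17 outcome counts from the bar chart above:

    import random

    # 2016/17 outcome counts: 187 Home wins, 84 Draws, 109 Away wins.
    outcomes = ['H'] * 187 + ['D'] * 84 + ['A'] * 109

    # Benchmark 1: always picking the Home team.
    home_acc = sum(o == 'H' for o in outcomes) / len(outcomes)

    # Benchmark 2: uniform random guessing, averaged over many trials.
    trials = 100_000
    hits = sum(random.choice('HDA') == random.choice(outcomes) for _ in range(trials))

    print(f"Always Home:  {home_acc:.2%}")       # ~49.21%
    print(f"Random guess: {hits / trials:.2%}")  # ~33.3%, whatever the true mix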

Phil Jones, classic grimace

First Half – Trying the obvious

Test Case ID 10 – Premier League Rank

The most obvious performance ‘Feature’ that can be used to predict results is the one that folks will tend to use by default: the team’s Premier League rank, i.e. a team that is closer to the top of the table will tend to beat those nearer the bottom.

In this experiment we see how effective this would have been as a tactic.

Method

  1. The goodness, or strength, of a team is modelled by an approximation of that team’s Premier League position. (An approximation is used simply because it’s a bit easier to compute 4).
  2. Once the first week’s matches are over, then prior to each subsequent match, a cumulative Model based on the results to that date is computed for each team.
  3. For the match in question, the two teams’ Models are then used to predict the match’s outcome: the Model with the larger value is selected and its team declared the predicted winner (a sketch of this is given after the list).
  4. The predictions are compared to the actual outcomes of the matches and the results are logged.
  5. This is repeated until predictions have been made for all of the 370 predictable matches in the season 5.
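As a concrete illustration, here is a minimal sketch of the Feature and comparison described above, using the approximation from footnote 4. The function names and the (home, away, home_goals, away_goals) result tuples are my own shorthand, not the repository’s actual code.

    def rank_feature(results, team):
        """Footnote 4's approximation of league rank for one team:
        Points + Goal Difference / 1,000,000, accumulated from the
        results seen so far."""
        points = gd = 0
        for home, away, hg, ag in results:
            if team == home:
                points += 3 if hg > ag else 1 if hg == ag else 0
                gd += hg - ag
            elif team == away:
                points += 3 if ag > hg else 1 if hg == ag else 0
                gd += ag - hg
        return points + gd / 1_000_000

    def predict(results, home, away):
        """Larger Feature value wins; exact equality is called a Draw."""
        h, a = rank_feature(results, home), rank_feature(results, away)
        return 'H' if h > a else 'A' if a > h else 'D'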

Results

Using the Premier League Rank over the course of the 2016/17 season correctly predicted outcomes:

  • 65.57% of the time for Home wins (120 correct/183 predicted)
  • 25.00% of the time for Draws (1 correct/4 predicted)
  • 41.53% of the time for Away wins (76 correct/183 predicted)

On average, its accuracy was 53.24%.

Thoughts

  1. Performs better than always Home win (53.24% vs 49.21% from backing the Home team)
  2. System is not very good at predicting Draws or Away wins.
  3. Over predicts Away wins and under predicts Draws.
  4. As a Feature it has several obvious issues:
    • Loses information about the performance differences between teams, i.e. a team that wins one nil gets the same number of points as a team that beats the opposition five nil.
    • Performs poorly at predicting Draws.
    • Biased against teams that have played fewer Premier League matches because of their involvement in non-Premier League competitions, e.g. FA Cup, Champions League, Europa League, and the ensuing fixture rescheduling.

Test Case ID 20 – Normalised Premier League Rank

In 10 we observed that using a simple approximation of Premier League Rank resulted in predictions that were biased against teams that had not played as many matches because of fixture rescheduling.

In this test case we repeat the experiment from 10, but normalise the rank approximation by the number of matches played by the team in order to remove this bias and observe the effect.

Method

Identical to 10, except the Feature is normalised by the number of matches played during construction. 6

Results

Using normalised Premier League Rank over the course of the 2016/17 season correctly predicted outcomes:

  • 65.93% of the time for Home wins (120 correct/182 predicted)
  • 25.00% of the time for Draws (1 correct/4 predicted)
  • 41.85% of the time for Away wins (77 correct/184 predicted)

On average, its accuracy was 53.51%.

Thoughts

  1. Tiny accuracy improvement but best result so far (53.51% vs 53.24%).
  2. Same problems with the number of Draws and Away wins predicted.
  3. Does not degrade performance, so it seems likely that implicit knowledge of the number of matches played is not useful information to retain 7.

Test Case ID 30 – Normalised Premier League Points

If one of the main sources of inaccuracy is the inability to predict Draws, perhaps using a Feature that does less to separate teams might prove interesting. Possibly the simplest option here is to just use the normalised Premier League Points for a team, i.e. the same as Premier League Rank but without the splitting of tied positions by Goal Difference, so that’s what we’ll try …

Method

As for 10, except the Feature is now the normalised Premier League Points for the team 8.

Results

Using normalised Premier League Points over the course of the 2016/17 season correctly predicted outcomes:

  • 66.10% of the time for Home wins (117 correct/177 predicted)
  • 18.75% of the time for Draws (3 correct/16 predicted)
  • 41.81% of the time for Away wins (74 correct/177 predicted)

On average, its accuracy was 52.43%.

Thoughts

  1. The system actually ends up predicting an order of magnitude more Draws, but this is still well short of the number that happen in real life.
  2. It’s made things worse than the previous best achieved in 20 (52.43% vs 53.51%), as too many of those additional Draw predictions are incorrect.
  3. Dropping the small Goal Difference contribution from the Premier League Rank feature to just use Premier League Points hints that Goal Difference might be an interesting Feature in its own right.

Test Case ID 40 – Normalised Goal Difference

As has been observed, a Feature based on points for wins, Draws etc. may discard information about how ‘good’ a team is. Specifically, a team that wins five nil gets the same points as a team that wins one nil.

In 30, we saw that omitting this type of information caused performance to deteriorate slightly. In this test we’ll use Goal Difference as a Feature in its own right to see quite how good it is.

Method

As for 10, except the Feature is now the normalised Goal Difference for the team 9, as sketched below.
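A sketch of the replacement Feature (footnote 9), in the same shorthand as before; the prediction step is the same larger-value-wins comparison as in 10:

    def gd_feature(results, team):
        """Normalised Goal Difference: (GoalsFor - GoalsAgainst) / MatchesPlayed."""
        goals_for = goals_against = played = 0
        for home, away, hg, ag in results:
            if team == home:
                goals_for, goals_against, played = goals_for + hg, goals_against + ag, played + 1
            elif team == away:
                goals_for, goals_against, played = goals_for + ag, goals_against + hg, played + 1
        return (goals_for - goals_against) / played if played else 0.0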

Results

Using Goal Difference over the course of the 2016/17 season correctly predicted outcomes:

  • 68.00% of the time for Home wins (119 correct/175 predicted)
  • 30.00% of the time for Draws (3 correct/10 predicted)
  • 41.62% of the time for Away wins (77 correct/185 predicted)

On average, its accuracy was 53.78%.

Thoughts

  1. Goal Difference produces, by a small margin, the best accuracy yet (53.78% vs 53.51% for normalised Premier League Rank).
  2. If the evolution of Goal Difference is known then it is simple to deduce match results and Points for a team, so it is no particular surprise that the performance levels are very similar: the two Features are correlated.
  3. Whilst deriving Points from Goal Difference is possible, deriving Goal Difference from Points is not. The Points derivation process is lossy and is losing at least some useful information, therefore we’ll use Goal Difference as the ‘standard’ Feature henceforth.

Half Time – What’s wrong with our ‘best’ Goal Difference Model?

At this point we’ve got something that works, but it’s not exactly working great. What’s wrong with it, and how might we improve it?

Prediction performance

If we look at Goal Difference we can see the following.

Outcome     % of actual outcomes   % of GD predictions   % of GD predictions incorrect
Home win    49.21%                 47.30%                32.00%
Draws       22.11%                 2.70%                 70.00%
Away win    28.68%                 50.00%                58.38%

From this it is clear that the main problems are:

  1. The system is not making enough predictions of Draws.
  2. Over predicting the number of Away wins.
  3. When it is making Draw or Away win predictions, they are usually wrong, particularly for Draws.

Model Convergence

In terms of convergence, what we should see is that the system’s performance improves with time. And indeed this small positive trend can be seen in the linear trend line for Goal Difference performance plotted against month, as shown here:

Goal Difference by month, showing slightly improving performance

However, the underlying data is quite noisy.

What’s happening can, I think, be better understood by looking at a frequency plot of the distances produced by the Home vs Away model comparison for each prediction, grouped by the actual outcome to which they relate.

When this is done and trend lines are added it’s possible to see that all three outcomes have a lot of overlap, with the curve corresponding to Drawn results looking particularly squashed between the other two.

Frequency graph of distances between the GD model and the actual outcomes, shows large amount of overlap

What to optimise

Taken together at this interlude, these observations suggested to me that:

  1. Relatively stronger Home models are needed in order to increase the number of predicted Home wins.
  2. It might be better to model a team’s Home and Away performances separately if performance asymmetries are endemic and/or variable (it’s well known that some teams can have extremely asymmetric Home and Away performance, e.g. Burnley, who in the season won ten Home games but only a single Away game).
  3. A team’s performance can vary dramatically over time, e.g. with managerial comings and goings; because of this it is possible that the model should adapt more quickly by discarding historical information.
  4. To predict Draws, a reasonable approach might be to attempt to predict wins but apply some form of threshold, i.e. distances below some value, positive or negative, are predicted as Draws rather than requiring absolute equality.

Second Half – Exploring and optimising

Test Case ID 71 – Determining the optimum Home Advantage for Normalised Goal Difference

In order to see if boosting the Home model might improve prediction accuracy, the simplest option is to add a positive offset to the Home team’s Goal Difference when they play at Home; in effect this creates separate Home and Away models for the team. This test determines what the optimum value for this offset would have been for 2016/17.

Method

Two models, a Home and an Away model, are constructed for each team. The Away model for the team is identical to that constructed in 40, whilst the Home model is the team’s Goal Difference increased by a fixed boost relative to it.

Comparison is as before, with the team with the larger Goal Difference being predicted to be the winners, but in this case the appropriate Home or Away model is used for each side when making the comparison.

The amount that the Home performance model is boosted by is iteratively increased from 0 to 4 in 0.01 increments, as sketched below.
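A sketch of the sweep, reusing the evaluate() replay loop and gd_feature() from the earlier sketches; the database filename and the predict_boosted() helper are assumptions of mine, not the repository’s API.

    def predict_boosted(boost):
        """Return a predictor that adds `boost` goals to the Home team's
        normalised Goal Difference before the usual comparison."""
        def predict(results, home, away):
            h = gd_feature(results, home) + boost
            a = gd_feature(results, away)
            return 'H' if h > a else 'A' if a > h else 'D'
        return predict

    # Sweep the Home boost from 0 to 4 in 0.01 increments and keep the best.
    best_acc, best_boost = max(
        (evaluate('epl_2016_17.db', predict_boosted(i * 0.01)), i * 0.01)
        for i in range(401)
    )
    print(best_boost, best_acc)   # for 2016/17 the optimum lands around 0.6 to 0.9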

Results

Based on the results of the 2016/17 season, the optimal Home Advantage looks to have been worth something like an additional 0.6 to 0.9 goal head start for the Home team.

Prediction performance versus the amount that the Home model is boosted

With a Home boost value of 0.72 goals, we get the following results for normalised Goal Difference based prediction:

  • 60.82% of the time for Home wins (163 correct/268 predicted)
  • No predicted Draws
  • 55.88% of the time for Away wins (57 correct/102 predicted)

On average, its accuracy was 59.46%.

Thoughts

  1. Prediction of Draws is worse than previous attempts.
  2. The accuracy for Home wins has been slightly reduced because the quantity of them being predicted has increased.
  3. There are no guarantees that the boost figure is reliable season on season, as it was calculated a posteriori.
  4. Even with this caveat, if we had used models with this boost then we could have predicted matches with an accuracy of close to 60% (59.5%), the best yet.

Test Case ID 80 – Independent Home and Away models

In 71, the implicit assumption was that the performance of a team at Home was related to its Away performance plus some form of boost to take account of Home Advantage. Logically this seems sensible, but then there are always the Burnleys of this world.

What if we had completely independent models? That would have the advantage of not requiring a Home boost value to be set by hand …

Method

Two models, a Home and an Away model, are constructed for each team. The Home model is based only on the team’s Home match Goal Difference results, whilst the Away model is based entirely on their Away results.

Comparison is as for 71, with the Home model being used for the Home team and the Away model for the Away team; a sketch of these independent models is given below.
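In the same shorthand as the earlier sketches (again, the function names are mine, not the repository’s):

    def home_gd(results, team):
        """Normalised GD from the team's Home matches only."""
        gd = played = 0
        for home, away, hg, ag in results:
            if team == home:
                gd, played = gd + (hg - ag), played + 1
        return gd / played if played else 0.0

    def away_gd(results, team):
        """Normalised GD from the team's Away matches only."""
        gd = played = 0
        for home, away, hg, ag in results:
            if team == away:
                gd, played = gd + (ag - hg), played + 1
        return gd / played if played else 0.0

    def predict_independent(results, home, away):
        h, a = home_gd(results, home), away_gd(results, away)
        return 'H' if h > a else 'A' if a > h else 'D'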

Results

Using Goal Difference and independent models over the course of the 2016/17 season correctly predicted outcomes:

  • 60.43% of the time for Home wins (139 correct/230 predicted)
  • 26.67% of the time for Draws (4 correct/15 predicted)
  • 47.42% of the time for Away wins (46 correct/97 predicted)

On average, its accuracy was 55.26%.

NB: Predictions were made for 342 matches instead of the previous 370. This was to ensure that all teams had at least one Home and one Away result from which to derive the initial models.

Thoughts

Outright performance is not as good as the hand-optimised boosted models in 71; however it:

  1. Has the advantage of needing no magic numbers.
  2. Looks like it might have produced a better performance level with more data (each model only having half the available data), something that is suggested when the performance is plotted on a monthly basis, as shown below.

Independent Goal Difference models by month

Test Case ID 100 – Moving analysis window

All of the models so far have used all of the data available to them. However, in real life the performance of a team can take step changes when circumstances change, such as a new manager arriving, star players getting injured, etc.

Is it possible that a smaller window of results should be used for calculating the models in order to allow the models to adapt to their circumstances more readily?

Method

Normalised Goal Difference is calculated over a predetermined number of matches in a window preceding those that are to be predicted. The number of matches included in the window is varied from 1 through to all available matches, and the prediction performance is determined across the season for each window size, as sketched below.
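A sketch of the windowed Feature (names mine, as before):

    def windowed_gd(results, team, window):
        """Normalised GD over only the team's last `window` matches
        (window=None means use everything, as in the earlier tests)."""
        gds = []
        for home, away, hg, ag in results:
            if team == home:
                gds.append(hg - ag)
            elif team == away:
                gds.append(ag - hg)
        recent = gds[-window:] if window else gds
        return sum(recent) / len(recent) if recent else 0.0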

Results

The results for the different window sizes are shown below:

Goal Difference with various moving windows

Thoughts

There appears to be no advantage to the routine use of windowed data; at least over the course of a single season, the best model is obtained by using all available data. Possibly a better way to handle changes would be an ad hoc “on significant event” model invalidation process.

Test Case ID 200 – Normalised Goal Difference with thresholds

If we look at the frequency diagram for Goal Difference with a 0.72 Home boost, i.e. close to the previously observed optimum, then in addition to it being a shifted version of the diagram discussed in the interlude, it’s possible to see a range of distances that we might classify as Draws.

Frequency graph of distances between the GD model with a 0.72 boost against actual outcomes

Specifically, it suggests matches that produce a distance of somewhere between 0.3 and 0.9 should be classified as Draws. This test is to see what happens when we try this.

Method

Home and Away Goal difference models are constructed for each team using all available data prior to the matches that they are to predict. The Home model is identical to the Away model except it has a 0.72 positive offset added to it in order to model Home Advantage.

Results are classified by distance (a sketch of this classification follows the list):
– Greater than 0.9 is predicted as a Home win.
– Lower than 0.3 is predicted as an Away win.
– Between 0.3 and 0.9 is predicted as a Draw.
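A sketch of the classification, reusing gd_feature() from the earlier sketches. The distance here already includes the 0.72 Home boost, which is why the Draw band sits above zero; 210 and 220 below simply re-run this with `lower` or `upper` swept over a range.

    def classify(results, home, away, boost=0.72, lower=0.3, upper=0.9):
        """Predict from the boosted Home model vs Away model distance,
        with the band between `lower` and `upper` treated as a Draw."""
        distance = (gd_feature(results, home) + boost) - gd_feature(results, away)
        if distance > upper:
            return 'H'
        if distance < lower:
            return 'A'
        return 'D'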

Results

Using Goal Difference with a 0.72 boost and thresholds over the course of the 2016/17 season correctly predicted outcomes:

  • 67.39% of the time for Home wins (124 correct/184 predicted)
  • 27.38% of the time for Draws (23 correct/84 predicted)
  • 55.88% of the time for Away wins (57 correct/102 predicted)

On average, its accuracy was 55.14%.

Thoughts

  1. Adding the thresholds does result in significantly more predictions of Draws, close to the number that actually happened.
  2. Accuracy for Draws is close to the best result previously achieved.
  3. However, the thresholds introduce error into the system’s overall performance relative to the previous best (55.14% vs 59.46%).

Test Case ID 210 and 220 – Adjusting the Draw thresholds to increase the Win prediction accuracy

Can we engineer a tradeoff in the system such that we increase the reliability of predicting Wins, Home or Away, at the expense of Draws?

In 200, we saw that introducing upper and lower thresholds into the system ended up predicting more Draws. If we increase the range between these two then we will tend to increase the number of Draws predicted. If we do this, then it seems likely that we may be able to trade poorer overall accuracy for all outcomes, for greater accuracy when predicting Home or Away Wins.

This test is to examine what happens when we try this.

Method

Home and Away Goal difference models are constructed for each team using all available data prior to the matches that they are to predict.

The Home model is identical to the Away model except it has a 0.72 positive offset added to it in order to model Home Advantage.

The experiment is run twice:

  1. In 210, classification of results is done such that distances:
    • Greater than 0.9 are predicted as Home wins.
    • The lower threshold, below which a result is predicted to be an Away win, is varied from -2.0 to 0.9.
    • Distances between the lower threshold and 0.9 are predicted as Draws.
  2. In 220, classification of results is done such that distances:
    • Lower than 0.3 are predicted as Away wins.
    • The upper threshold, above which a result is predicted to be a Home win, is varied from 0.3 to 5.0.
    • Distances between 0.3 and the upper threshold are predicted as Draws.

The results are plotted showing the effects of adjusting the threshold in question on the percentage of errors when making predictions of Away and Home wins respectively.

Results

210 – The results for incorrect Away predictions when varying the lower threshold value.

Results for incorrect Away predictions when varying the lower threshold value, showing that lowering the threshold only gets you so far before things get worse

220 – The results for incorrect Home predictions when varying the upper threshold value.

Results for incorrect Home predictions when varying the upper threshold value, showing that raising the threshold only gets you so far before things get worse

Thoughts

At first glance these results might appear slightly counter-intuitive. Surely, if we lower/raise the threshold for predicting Away/Home wins, then the number of false predictions made will tend to decrease? However, what we get instead are graphs that show this happening only up to a local minimum, before the error starts to increase again.

What is actually happening is that as the thresholds grow, all of the normal results are eventually discounted until the only remaining predictions are dominated by the statistical outliers, or fluke results, that are left. These are the type of matches where a team like Crystal Palace, who after 29 matches were fifth from bottom with 31 points and a goal difference of -9, go to Chelsea, the league leaders with 69 points and a goal difference of 37, and win.

Test Case ID 205 – The final attack, normalised Goal Difference with hand optimised Home Advantage and thresholds

In this test we bring together everything we know based upon our data and see how well we might have been able to predict Home and Away wins if we had this knowledge at the start of the season, i.e. a bit impractical without a time machine, but still interesting.

Method

Home and Away Goal difference models are constructed for each team using all available data prior to the matches that they are to predict. The Home model is identical to the Away model except it has a 0.72 positive offset added to it in order to model Home Advantage.

Results are classified such that distances:

  • Greater than 1.945 will be predicted as Home wins
  • Lower than -0.792 will be predicted as Away wins
  • Distances between -0.792 and 1.945 will be predicted as Draws.

These threshold values were previously determined in 210 and 220 as likely being close to the optimum for minimising Away and Home win false predictions.

Results

Using Goal Difference with a 0.72 boost and these thresholds over the course of the 2016/17 season correctly predicted outcomes:

  • 77.19% of the time for Home wins (44 correct/57 predicted)
  • 24.63% of the time for Draws (67 correct/272 predicted)
  • 68.29% of the time for Away wins (28 correct/41 predicted)

On average, its accuracy for all outcomes was 37.57%.

However, if the predictions for Draws are discarded, then the accuracy on the remaining Win predictions improves (as hoped for) to 73.47% ((44 + 28)/(57 + 41)), at the expense of only being able to make predictions for 26.49% ((57 + 41)/370) of all predictable matches.

Thoughts

It is possible to set threshold values that potentially boost the prediction accuracy for Away and Home wins at the cost of Draws. However, setting such thresholds drastically restricts the number of predictions that the system can make, even if it were possible to determine suitable thresholds beforehand.

Summary – The post match interview – a game of two halves.

What we’ve found out along the way.

EPL Rank – Using the EPL rank, or league position, over the course of the 2016/17 season does predict match outcomes more reliably than either randomly selecting outcomes or always picking the Home team (53.24% vs 33.3% vs 49.21% respectively).

Goal Difference – Using Goal Difference normalised by the number of matches played is an incrementally better basic Feature than Rank for predicting results (53.78% vs 53.24%).

Modelling a team’s Home and Away performance separately accounts for Home advantage and produces better performance. If this is done by:

  • Boosting a team’s Home model, then an observed Home advantage worth about three quarters of a goal per match produces the best prediction accuracy observed, of about 59.46%.
  • Deriving the team’s Home and Away models independently, using only the corresponding data, produces a model that works and does not require a magic value to be set. However, it looks like it would benefit from more data to improve its prediction accuracy (55.26%).

Model convergence – at least within the context of a season, most teams’ performances are fairly consistent and therefore prediction benefits from using all available data, i.e. the routine discarding of past results via windowing does not improve accuracy.

Predicting Draws – it is really, really difficult to predict Draws in similar numbers to those that occur in reality. With Goal Difference, thresholds have to be used, and adding these simple, static thresholds decreases the system’s overall prediction accuracy even in the unrealistic situation where they are hand-picked based on observed season data.

Increasing the accuracy of Away and Home win prediction – It is possible, up to a point, to increase the accuracy of Win prediction by increasing the range of values for which Draws are predicted. However, beyond certain values the residual statistical outliers dominate and are likely to cause accuracy to decrease again. Neither how to determine these threshold values a priori nor their long-term stability is known.

Roadmap – Better next season?

At the moment, I’m going to stop playing with this stuff; I’ve already spent way, way, way too much time :-/ However, in terms of what I would look at next should I get a chance, my list would be:

More data

Seek to obtain more programmatically accessible data – I think the easiest way to do this would be to update my data fetching scripts to make use of the interesting resources available on http://football-data.org/. Specifically it would be interesting to see whether the:

  • Approximate three-quarters of a goal Home advantage stands season on season.
  • Independent Home and Away models could converge to the same, or better, performance than the hand-boosted ones.

Logarithmic feature

Compressing match results to 3 points for a win and 1 for a Draw throws too much data away. But it also seems unlikely that a team that beats another team in the same league five one is actually five times better than the other, which is effectively what we are saying with Goal Difference now.

More likely it’s a one-off, and the result should be somewhat de-emphasised in the grand scheme of things. A simple way to do this might be to use a log based metric instead of a linear one, e.g. Ln(GD + a_positive_constant_to_stop_it_going_below_zero).
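One possible realisation of that idea, as a sketch only (the offset value here is arbitrary and untested):

    import math

    def log_gd(gd_per_match, offset=4.0):
        """Log-compress normalised Goal Difference so that big wins count
        for less. `offset` is an arbitrary constant that must be larger
        than the most negative per-match GD the feature can see, so that
        the log's argument stays positive."""
        return math.log(gd_per_match + offset)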

Multivariate

At the moment the models are based on a single value. We already know that compressing match results to simple points loses information, but so too does Goal Difference. With Goal Difference it does not matter if a match finishes 1-0 or 11-10, it’s all the same, but from the point of view of the teams concerned there is a lot that may be inferred about both teams’ attacking prowess and defensive frailties.

Similarly, there are many other readily accessible variables that might help predict a match’s outcome which are, or are not, directly derivable from match results. For instance, in test case 74 I’ve used the size of a team’s Home Ground to predict match outcomes and this achieves an accuracy of 50%, i.e. better than chance and better than picking the Home team, so there is obviously useful independent information there that we might like to use.

The question in both cases is how do we combine and use it? If we break Goal Difference down into Goals For and Goals Against, which is the more important, if either, and how do we combine them to factor this in? Whilst we might come up with something that works for Goals For or Against, things rapidly become impossible when we bring in factors such as Ground Size, club revenue and the like.

Fortunately there are statistical techniques that can be used for joining these datasets together, things such as Principal Component Analysis and Linear Discriminant Analysis, and there are now libraries in Python that look like they might enable mere mortals to have a go at using them 🙂
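For instance, a toy Linear Discriminant Analysis sketch using scikit-learn (not something that is in the repository); the feature columns and numbers below are made up purely to show the shape of the API:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Made-up illustrative data: one row per match, columns are hypothetical
    # features, e.g. [home GF/match, home GA/match, away GF/match, away GA/match].
    X = np.array([
        [2.1, 0.9, 1.0, 1.6],
        [1.2, 1.3, 2.0, 0.8],
        [1.5, 1.1, 1.4, 1.2],
        [0.8, 1.9, 2.3, 0.7],
        [1.9, 1.0, 0.9, 1.8],
        [1.1, 1.4, 1.2, 1.3],
    ])
    y = np.array(['H', 'A', 'D', 'A', 'H', 'D'])   # corresponding actual outcomes

    lda = LinearDiscriminantAnalysis()
    lda.fit(X, y)                                  # learns how to weight and combine features
    print(lda.predict([[1.8, 1.0, 1.1, 1.4]]))     # outcome prediction for a new match, e.g. 'H'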

AI

Oh yes, Google, Amazon et al. now also have some very shiny AI systems that are open for anyone with a small amount of money and an interest to have a play with, for instance https://cloud.google.com/blog/big-data/2017/01/learn-tensorflow-and-deep-learning-without-a-phd and https://aws.amazon.com/amazon-ai/ … must stop … have other things … must do ….

We lost because we didn't win - Cristiano Ronaldo, good at football, rest ...hmm

and finally

The predictOmatic makes the following predictions for the 2017/18 opening weekend fixtures:

  • Arsenal vs Leicester City – strong home win
  • Watford vs Liverpool – away win
  • Chelsea vs Burnley – strong home win
  • Crystal Palace vs Huddersfield – weak away win
  • Everton vs Stoke City – strong home win
  • Southampton vs Swansea City – home win
  • West Bromwich Albion vs Bournemouth – home win
  • Brighton vs Manchester City – away win
  • Newcastle vs Tottenham Hotspur – away win
  • Manchester United vs West Ham – home win

Will this be the edge needed to beat the Oracle …


  1. 38 = (20 - 1) * 2, the -1 coming about because, without time travel, most teams are unable to play themselves :-) 
  2. 380 = (38 matches × 20 teams) / 2 teams per match. 
  3. https://www.gambling.com/online-betting/strategy/8-essential-premier-league-stats-10400 – Retrieved 2017/06/20 
  4. Points + (Goal Difference/1000000), since it’s easier to calculate and gives the same predictions, if not distances. 
  5. 380 minus the 10 matches from the first weekend, which are skipped as it would be harsh to mark down a model’s performance when it has not had any data to base its predictions on. 
  6. (Points + (Goal Difference/1000000))/MatchesPlayed. 
  7. Teams with fewer matches might be involved in another competition, which might imply that they were a ‘good’ team. However, they might just as easily be a team that was scheduled to play a ‘good’ team. 
  8. (3 * MatchesWon + MatchesDrawn)/MatchesPlayed. 
  9. (GoalsFor – GoalsAgainst)/MatchesPlayed.