## Monday, 1 October 2012

### Ordinal football

I've had a quick look at this article on R-bloggers $-$ I don't think I've followed the whole exchange, but I believe they have discussed what models should/could be applied to estimate football scores (specifically, in this case they are using the Dutch league).

The main point of the post is that using ordinal regression models can improve the performance (I suppose in terms of prediction or validation of the probability associated with the observed frequency of the results).

At a very superficial level (since I've just read the article and have not thought about this a great deal), I think that treating the observed number of goals as an ordinal variable, much as you would do for a Likert scale, is not quite the best option.

This assumption might not have a huge impact on the actual results of this model: just as for an ordinal variable, the distance between successive categories need not be equal (moving from scoring 0 to scoring 1 goal does not necessarily take the same effort as moving from scoring 3 to scoring 4 goals), and ordinal regression can accommodate this. But I think this formulation is unnecessarily complicated and a bit confusing.
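To make the point about unequal distances concrete, here is a minimal sketch of a cumulative-logit (proportional odds) model for goals, which is one common form of ordinal regression. I am assuming this form for illustration; the cutpoint values, the top category "4 or more goals" and the function name `ordinal_probs` are all hypothetical and not taken from the R-bloggers post.

```python
import math

def ordinal_probs(eta, cutpoints):
    """Category probabilities under a cumulative-logit model.

    P(Y <= k) = logistic(c_k - eta), so the gaps between successive
    cutpoints c_k play the role of the 'distance' between categories.
    """
    cdf = [1.0 / (1.0 + math.exp(-(c - eta))) for c in cutpoints] + [1.0]
    # Differences of the cumulative probabilities give P(Y = k).
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]

# Hypothetical, unequally spaced cutpoints for 0, 1, 2, 3 and 4+ goals.
cutpoints = [-0.8, 0.6, 1.7, 3.1]

probs_average = ordinal_probs(0.0, cutpoints)  # linear predictor of an 'average' team
probs_strong = ordinal_probs(0.5, cutpoints)   # larger predictor shifts mass to more goals

for goals, (p_avg, p_str) in enumerate(zip(probs_average, probs_strong)):
    print(f"goals={goals}  average={p_avg:.3f}  strong={p_str:.3f}")
```

Note that with five categories the model already needs four cutpoints on top of the regression coefficients, which is part of why the formulation feels heavier than it needs to be.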

Moreover (and far more importantly, I think), if I understand it correctly, both the original models and those discussed in the post I'm considering seem to assume independence between the goals scored by the two teams competing in a single game. This is not realistic, I think, as we proved in our paper (of course drawing on other good examples in the literature).

In particular, we were considering a hierarchical structure in which the goals scored by the two competing teams are conditionally independent given a set of parameters (accounting for defence and attack, and home advantage); but because these were given exchangeable priors, correlation would be implied in the responses.
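A toy simulation can illustrate how this works. It is only a sketch under my own assumptions, not the model from the paper: I assume Poisson goals with a log-linear scoring rate, attack and defence effects drawn from common normal distributions, and made-up numerical values throughout (the function name `simulate_fixture` is hypothetical too). Given the hyperparameters, the two scores are independent; once the hyperparameters are integrated out, the scores become correlated.

```python
import numpy as np

def simulate_fixture(n_reps, rng, fixed_hyper=None):
    """Simulate home and away goals for n_reps replications of one fixture.

    If fixed_hyper is None, the hyperparameters (mu_att, mu_def) are drawn
    afresh in every replication, i.e. they are integrated out.
    Otherwise they are held fixed at the given (mu_att, mu_def) pair.
    """
    if fixed_hyper is None:
        mu_att = rng.normal(0.0, 0.5, n_reps)
        mu_def = rng.normal(0.0, 0.5, n_reps)
    else:
        mu_att = np.full(n_reps, fixed_hyper[0])
        mu_def = np.full(n_reps, fixed_hyper[1])

    sigma = 0.2  # assumed spread of team effects around the common means
    home = 0.3   # assumed home advantage on the log scale

    # Exchangeable team effects: conditionally iid given the hyperparameters.
    att_home = rng.normal(mu_att, sigma)
    att_away = rng.normal(mu_att, sigma)
    def_home = rng.normal(mu_def, sigma)
    def_away = rng.normal(mu_def, sigma)

    # Goals are conditionally independent Poisson counts given the effects.
    y_home = rng.poisson(np.exp(home + att_home + def_away))
    y_away = rng.poisson(np.exp(att_away + def_home))
    return y_home, y_away

rng = np.random.default_rng(2012)
n_reps = 20000

y1, y2 = simulate_fixture(n_reps, rng)
corr_marginal = np.corrcoef(y1, y2)[0, 1]

y1, y2 = simulate_fixture(n_reps, rng, fixed_hyper=(0.0, 0.0))
corr_conditional = np.corrcoef(y1, y2)[0, 1]

print(f"hyperparameters integrated out: corr = {corr_marginal:.3f}")
print(f"hyperparameters held fixed:     corr = {corr_conditional:.3f}")
```

The first correlation should come out clearly positive and the second close to zero, which is the sense in which the hierarchical structure implies dependence without having to model it directly.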

The Bayesian machinery was very good at prediction, especially after we considered a slightly more complex structure in which we included information on each team's propensity to be "good", "average", or "poor". This helped avoid overshrinkage in the estimations and we did quite well.

An interesting point of the models discussed in the posts at R-bloggers is the introduction of a time effect (in this particular case to account for winter breaks in the Dutch league). We have only considered the Italian, Spanish and English leagues, which (as far as I am aware) do not have breaks.

But including external information is always good: for example, teams involved in European football (eg the Champions League or the Europa League) may do worse in the league games immediately before (and/or immediately after) their European fixture. This would be easy enough to include and could perhaps increase the precision of the estimates.
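As a sketch of how such information could enter, suppose (this is an assumption for illustration; the post does not spell out the model) that the scoring rates $\theta_{g1}$ and $\theta_{g2}$ of the home team $h(g)$ and the away team $a(g)$ in game $g$ are modelled on the log scale. An indicator $x_{gj}$, equal to 1 if team $j$ in game $g$ has a European fixture immediately before or after the game and 0 otherwise, could then be added with a coefficient $\beta$:

$$
\log \theta_{g1} = \text{home} + \text{att}_{h(g)} + \text{def}_{a(g)} + \beta\, x_{g1}, \qquad
\log \theta_{g2} = \text{att}_{a(g)} + \text{def}_{h(g)} + \beta\, x_{g2}
$$

A negative $\beta$ would correspond to teams doing worse around their European games. All the symbols here are illustrative.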

1. I'm not sure, but it's still in-sample prediction, right? How good is the prediction when you use your model for future games and not hypothetical ones?

2. I suppose what is "tricky" in this case is to define the "out-of-sample" predictions. What we did was to replay the whole season at once, which I know may not be the objective, eg if you're a bookie.

In this sense, the games are hypothetical, I think. What we could have done (and didn't in the end, although we played with the idea of actually doing it) was to take for example the first two/three weeks of observed data and based on those predict the next round of games. Would that count as "future games"?

Also, the model was relatively simple and crucially didn't include any observed covariate; I think this would be fundamental to do real prediction (as opposed to showing that your team were good and deserved better in that particular season, which was my main goal $-$ Marta didn't really care about this one, though).

As I was mentioning in my comment to Kees's post, information on the current form, eg in terms of having just played (or being about to play) a European fixture, or injuries/suspensions etc., would make the model much more robust and better in predictions, especially for "future" (vs "hypothetical", in the sense I was hinting at above) games.

1. This is exactly what I meant. I tried to do that (playing around for 2 days) with a Poisson and also with an ordinal model. While the in-sample prediction was quite good, the prediction of match day t based on match days 1 to t-1 was not (compared to bookies ;-) ). I'm convinced that a Bayesian approach is better than mine. Maybe your model performs better.

3. Well, if you only consider a limited number of games, you are probably going to have a very large uncertainty associated with your predictions.

Especially in this case, the Bayesian approach is helpful in including extra information (eg team form, etc).

In fact that's exactly what the bookies do...
