Comparing statistical models and machine learning algorithms in predicting football outcomes

Egidi, Leonardo; Torelli, Nicola

Modelling football outcomes is now widespread, and there is a strong incentive to include relevant predictors and to capture possible new correlations. From a statistical point of view, two approaches address this task: goals-based (direct) models (Dixon and Coles, 1997; Karlis and Ntzoufras, 2003), which model the number of goals scored by the two competing teams; and results-based (indirect) models, which model the probability of the categorical outcome of a win, a draw, or a loss, the so-called three-way process. Both frameworks have pros and cons. There has been a long debate about which approach is better, and many authors agree that any direct comparison between the forecasting abilities of the two types of models must be based on forecasts of match results (Goddard, 2005). Machine Learning tools such as Classification and Regression Trees (CART; Breiman et al., 1984) and Random Forests offer alternatives for predicting new match results (Groll et al., 2019), and in some cases have proved successful. In this paper we develop a broad comparison between some statistical results-based models and some results-based Machine Learning algorithms, to explore their predictive performance on future matches. Although not conclusive, we believe our comparative review may help future scholars discern between goals-based and results-based models.
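
As an illustration of why forecasts of match results provide a common basis for comparison, the following minimal Python sketch converts the output of a goals-based model into three-way probabilities. It assumes two independent Poisson goal counts, a simplification that is not the Dixon and Coles or Karlis and Ntzoufras specification; the function names, the scoring rates, and the truncation point `max_goals` are hypothetical choices made for this example only.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def three_way_probabilities(home_rate, away_rate, max_goals=15):
    """Map two scoring rates to (home win, draw, away win) probabilities.

    Assumption: home and away goals are independent Poisson counts.
    The score grid is truncated at max_goals and then renormalised.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        p_home = poisson_pmf(h, home_rate)
        for a in range(max_goals + 1):
            p_score = p_home * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p_score
            elif h == a:
                draw += p_score
            else:
                away_win += p_score
    total = home_win + draw + away_win
    return home_win / total, draw / total, away_win / total


# Hypothetical scoring rates, chosen only for illustration.
p_home, p_draw, p_away = three_way_probabilities(1.5, 1.1)
```

The resulting triple has the same form as the output of a results-based model or a classifier such as a Random Forest, so the same scoring rule can in principle be applied to all of them.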