Prediction of soccer outcome (2009-2010 EPL) by combining game statistics and rating values

In this example we use rating values from the “Colley method” as machine learning features. Additionally, we consider two team statistics as features: Total Goals and Total Shots on Target. The dataset consists of the English Premier League matches of the 2009-2010 season. We predict the soccer outcome with the Naive Bayes classifier of the scikit-learn library, applying the walk-forward procedure.
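For readers unfamiliar with the Colley method, the following is a minimal sketch of the textbook formulation: solve `C r = b`, where `C` has `2 + games played` on the diagonal and minus the number of head-to-head games off the diagonal, and `b` holds `1 + (wins - losses) / 2`. It is not the ratingslib implementation. The function name `colley_ratings`, the toy fixtures, and the choice to count a draw as a game played that leaves `b` unchanged are assumptions made for illustration.

```python
import numpy as np


def colley_ratings(games, teams):
    """Textbook Colley ratings (illustrative helper, not part of ratingslib).

    games: iterable of (home, away, home_goals, away_goals)
    teams: list of team names
    """
    index = {team: k for k, team in enumerate(teams)}
    n = len(teams)
    C = 2.0 * np.eye(n)   # Colley matrix starts as 2 * identity
    b = np.ones(n)        # right-hand side starts at 1 for every team
    for home, away, home_goals, away_goals in games:
        i, j = index[home], index[away]
        # every game played increases both diagonals and
        # decreases the two off-diagonal entries of the pair
        C[i, i] += 1
        C[j, j] += 1
        C[i, j] -= 1
        C[j, i] -= 1
        # assumption: a draw leaves b unchanged
        if home_goals > away_goals:
            b[i] += 0.5
            b[j] -= 0.5
        elif away_goals > home_goals:
            b[i] -= 0.5
            b[j] += 0.5
    r = np.linalg.solve(C, b)
    return {team: float(value) for team, value in zip(teams, r)}


# Toy fixtures (made up): A beats B, B draws with C, A beats C away
toy_games = [("A", "B", 2, 0), ("B", "C", 1, 1), ("C", "A", 0, 3)]
ratings = colley_ratings(toy_games, ["A", "B", "C"])
print(ratings)
```

In this formulation the ratings always average 0.5, which gives a quick sanity check on any implementation.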

Load packages

[1]:
from ratingslib.app_sports.methods import Predictions, prepare_sports_seasons
from ratingslib.application import SoccerOutcome
from ratingslib.datasets.filenames import get_seasons_dict_footballdata_online
from ratingslib.datasets.parameters import championships
from ratingslib.ratings.colley import Colley
from sklearn.naive_bayes import GaussianNB

Set target outcome

[2]:
outcome = SoccerOutcome()

We get the filename from football-data.co.uk for the 2009-2010 season (English Premier League). Then we create a Colley rating system instance and add it to a dictionary. Next, we define the statistics. Finally, we prepare the dataset. The ratings in the dataset start from the second match-week.

[3]:
filenames_dict = get_seasons_dict_footballdata_online(
    season_start=2009, season_end=2010, championship=championships.PREMIERLEAGUE)
ratings_dict = {'Colley': Colley()}
stats_attributes = {
    'TG': {'ITEM_I': 'FTHG', 'ITEM_J': 'FTAG', 'TYPE': 'POINTS'},
    'TST': {'ITEM_I': 'HST', 'ITEM_J': 'AST', 'TYPE': 'POINTS'},
}
data_ml = prepare_sports_seasons(filenames_dict,
                                 outcome,
                                 rating_systems=ratings_dict,
                                 stats_attributes=stats_attributes,
                                 start_week=2)
Load season: 2009 - 2010
100.0%

We show the columns of the 2009 season dataframe.

[4]:
data_ml[2009].columns
[4]:
Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
       'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD',
       'BWA', 'GBH', 'GBD', 'GBA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA',
       'SBH', 'SBD', 'SBA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA', 'VCH',
       'VCD', 'VCA', 'BSH', 'BSD', 'BSA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD',
       'BbAvD', 'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5',
       'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA',
       'Period', 'Week_Number', 'FT', 'HColley', 'AColley',
       'HratingnormColley', 'AratingnormColley', 'HTG', 'ATG', 'HTST', 'ATST'],
      dtype='object')

We will use the normalized ratings from the Colley method and two team statistics, total goals and total shots on target, as features for the machine learning classifier.

[5]:
features_names = ['HratingnormColley',
                  'AratingnormColley',
                  'HTG', 'ATG', 'HTST', 'ATST']
data_ml[2009][features_names].head()
[5]:
   HratingnormColley  AratingnormColley  HTG  ATG  HTST  ATST
0           0.777294           0.000000  6.0  0.0   9.0   4.0
1           0.377459           0.546966  0.5  1.0   4.5   4.0
2           0.226129           0.174768  1.0  0.0   5.0   3.0
3           0.777294           0.589776  2.0  0.5   5.0   7.0
4           0.544873           0.252242  1.0  0.0   8.0   9.0

We have selected the Naive Bayes classifier and we start making predictions from the 4th week. We apply the anchored walk-forward procedure with window size = 1, which means that every week we make predictions using the data of all previous weeks as the training set. For example, for the 4th week the training set consists of the 1st, 2nd and 3rd weeks.
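To make the splitting scheme concrete, here is a minimal sketch of an anchored walk-forward split in plain Python. It is not the code used inside `Predictions`. The generator name `anchored_walk_forward`, the toy week numbers, and the reading of "window size = 1" as "the test window covers one week and advances one week at a time" are assumptions made for illustration.

```python
def anchored_walk_forward(weeks, start_from_week, window_size=1):
    """Yield (test_weeks, train_idx, test_idx) for an anchored walk-forward.

    weeks: week number of every match, in dataset order
    The training set is anchored at the first week and grows each step;
    the test set is the next `window_size` week(s).
    """
    last_week = max(weeks)
    week = start_from_week
    while week <= last_week:
        test_weeks = list(range(week, min(week + window_size, last_week + 1)))
        train_idx = [k for k, w in enumerate(weeks) if w < week]
        test_idx = [k for k, w in enumerate(weeks) if w in test_weeks]
        if train_idx and test_idx:
            yield test_weeks, train_idx, test_idx
        week += window_size


# Toy schedule (made up): two matches per week, weeks 1 to 5
toy_weeks = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
splits = list(anchored_walk_forward(toy_weeks, start_from_week=4))
for test_weeks, train_idx, test_idx in splits:
    print("test weeks", test_weeks, "train rows", train_idx, "test rows", test_idx)
```

At each step a classifier would be fitted on the rows in `train_idx` and asked to predict the rows in `test_idx`; the predictions of all steps are then pooled for evaluation.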

[6]:
test_y, pred = Predictions(data_ml,
                           outcome,
                           start_from_week=4,
                           print_classification_report=True).ml_pred(
                            clf=GaussianNB(),
                            features_names=features_names)


=====GaussianNB()-[features: HratingnormColley AratingnormColley HTG ATG HTST ATST]=====
              precision    recall  f1-score   support

         1.0     0.6815    0.5257    0.5935       175
         2.0     0.4062    0.3421    0.3714        76
         3.0     0.3448    0.5376    0.4202        93

    accuracy                         0.4884       344
   macro avg     0.4775    0.4685    0.4617       344
weighted avg     0.5297    0.4884    0.4976       344

confusion matrix:
[[92 21 62]
 [17 26 33]
 [26 17 50]]
Correct games: 168
Wrong games: 176
Total predicted Games: 344



=====GaussianNB()-[features: HratingnormColley AratingnormColley HTG ATG HTST ATST]=====


=====Accuracy results=====

                                                                                Accuracy  \
GaussianNB()-[features: HratingnormColley AratingnormColley HTG ATG HTST ATST]  0.488372

                                                                                Correct Games  \
GaussianNB()-[features: HratingnormColley AratingnormColley HTG ATG HTST ATST]            168

                                                                                Wrong Games  \
GaussianNB()-[features: HratingnormColley AratingnormColley HTG ATG HTST ATST]          176

                                                                                Total Games
GaussianNB()-[features: HratingnormColley AratingnormColley HTG ATG HTST ATST]          344
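The figures in the classification report can be cross-checked by hand from the confusion matrix printed above. The short sketch below does so in plain Python, assuming the usual scikit-learn convention that rows are true classes and columns are predicted classes (the row sums match the reported supports of 175, 76 and 93). The variable names are illustrative.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
confusion = [[92, 21, 62],
             [17, 26, 33],
             [26, 17, 50]]
labels = [1.0, 2.0, 3.0]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[k][k] for k in range(len(labels)))
accuracy = correct / total

precision, recall, support = {}, {}, {}
for k, label in enumerate(labels):
    support[label] = sum(confusion[k])                 # true instances of the class
    predicted = sum(row[k] for row in confusion)       # times the class was predicted
    precision[label] = confusion[k][k] / predicted
    recall[label] = confusion[k][k] / support[label]

print(f"accuracy = {correct}/{total} = {accuracy:.6f}")
for label in labels:
    print(f"{label}: precision={precision[label]:.4f} "
          f"recall={recall[label]:.4f} support={support[label]}")
```

For example, class 1.0 was predicted 92 + 17 + 26 = 135 times and 92 of those were right, giving the reported precision of 0.6815.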