Prediction of soccer outcome (2009-2010 EPL)

In this example we use the rating values from the “AccuRATE method” as machine learning features to predict soccer outcomes. The dataset consists of the English Premier League matches of the 2009-2010 season. Predictions are made with the Naive Bayes classifier from the scikit-learn library, using the walk-forward procedure.

Load packages

[1]:
from ratingslib.app_sports.methods import Predictions, prepare_sports_seasons
from ratingslib.application import SoccerOutcome
from ratingslib.datasets.filenames import get_seasons_dict_footballdata_online
from ratingslib.datasets.soccer import championships
from ratingslib.ratings.accurate import AccuRate
from sklearn.naive_bayes import GaussianNB

Set target outcome

[2]:
outcome = SoccerOutcome()

First, we get the filename from football-data.co.uk for the 2009-2010 season of the English Premier League. Then we create the rating system, add it to a dictionary, and finally prepare the dataset. The ratings in the dataset start from the second match week.

[3]:
filenames_dict = get_seasons_dict_footballdata_online(
    season_start=2009, season_end=2010, championship=championships.PREMIERLEAGUE)
ratings_dict = {'AccuRATE': AccuRate()}
data_ml = prepare_sports_seasons(filenames_dict,
                                 outcome,
                                 rating_systems=ratings_dict,
                                 start_week=2)
Load season: 2009 - 2010

We show the columns of the 2009 season dataframe.

[4]:
data_ml[2009].columns
[4]:
Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
       'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
       'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD',
       'BWA', 'GBH', 'GBD', 'GBA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA',
       'SBH', 'SBD', 'SBA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA', 'VCH',
       'VCD', 'VCA', 'BSH', 'BSD', 'BSA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD',
       'BbAvD', 'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5',
       'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA',
       'Period', 'Week_Number', 'FT', 'HAccuRATE', 'AAccuRATE',
       'HratingnormAccuRATE', 'AratingnormAccuRATE'],
      dtype='object')

We will use only the normalized AccuRATE ratings as features for the machine learning classifier.

[5]:
features_names = ['HratingnormAccuRATE',
                  'AratingnormAccuRATE']
data_ml[2009][features_names].head()
[5]:
   HratingnormAccuRATE  AratingnormAccuRATE
0             0.848092             0.061352
1             0.401452             0.288341
2             0.000000             0.231402
3             0.663704             0.343821
4             0.369277             0.165818

We have selected the Naive Bayes classifier, and we start making predictions from the 4th week. We apply the anchored walk-forward procedure with window size = 1, which means that every week we make predictions using the data of all previous weeks as the training set. For example, for the 4th week the training set consists of the 1st, 2nd and 3rd weeks.
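
The splitting logic described above can be sketched without any library. The helper below is hypothetical and is not part of ratingslib; it only illustrates, under the assumption that match weeks are numbered consecutively from 1, how an anchored walk-forward with a window of one week pairs a growing training set with the next week to predict. The week range 1 to 7 is chosen purely for illustration.

```python
from typing import Iterator, List, Sequence, Tuple


def anchored_walk_forward(
    weeks: Sequence[int], start_from_week: int, window_size: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_weeks, test_weeks) pairs for an anchored walk-forward.

    Hypothetical helper, not the ratingslib implementation. The training
    set always begins at the first available week (the anchor) and grows
    as the test window moves forward by `window_size` weeks.
    """
    ordered = sorted(weeks)
    # Weeks for which predictions are made.
    test_candidates = [w for w in ordered if w >= start_from_week]
    for i in range(0, len(test_candidates), window_size):
        test_weeks = test_candidates[i:i + window_size]
        # Train on every week that precedes the first week being predicted.
        train_weeks = [w for w in ordered if w < test_weeks[0]]
        yield train_weeks, test_weeks


# Illustrative week numbers only; a real season has more match weeks.
splits = list(anchored_walk_forward(range(1, 8), start_from_week=4))
for train_weeks, test_weeks in splits:
    print(f"train on weeks {train_weeks} -> predict weeks {test_weeks}")
```

In this sketch the first split trains on weeks 1 to 3 and predicts week 4, and each later split adds the previously predicted week to the training set. A classifier would be refitted on every split, and the per-week predictions concatenated before scoring.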

[6]:
test_y, pred = Predictions(data_ml,
                           outcome,
                           start_from_week=4,
                           print_classification_report=True).ml_pred(
                            clf=GaussianNB(),
                            features_names=features_names)


=====GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]=====
              precision    recall  f1-score   support

         1.0     0.5630    0.8686    0.6831       175
         2.0     0.4821    0.3553    0.4091        76
         3.0     0.2222    0.0430    0.0721        93

    accuracy                         0.5320       344
   macro avg     0.4224    0.4223    0.3881       344
weighted avg     0.4530    0.5320    0.4574       344

confusion matrix:
[[152  14   9]
 [ 44  27   5]
 [ 74  15   4]]
Correct games: 183
Wrong games: 161
Total predicted Games: 344



=====GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]=====


=====Accuracy results=====

                                                                  Accuracy  \
GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]  0.531977

                                                                  Correct Games  \
GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]            183

                                                                  Wrong Games  \
GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]          161

                                                                  Total Games
GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]          344
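
As a cross-check, the figures in the classification report can be recomputed by hand from the confusion matrix printed in the output. The sketch below copies that matrix and assumes the usual scikit-learn layout, in which rows are true classes and columns are predicted classes; the row sums matching the reported support values (175, 76, 93) are consistent with that assumption. The variable names are illustrative.

```python
# Confusion matrix copied from the printed output.
# Assumed layout: rows = true class, columns = predicted class.
labels = [1.0, 2.0, 3.0]
cm = [
    [152, 14, 9],
    [44, 27, 5],
    [74, 15, 4],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct games
accuracy = correct / total

report = {}
for i, label in enumerate(labels):
    support = sum(cm[i])                    # true instances of this class
    predicted = sum(row[i] for row in cm)   # times this class was predicted
    precision = cm[i][i] / predicted
    recall = cm[i][i] / support
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, support)

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
for label, (p, r, f1, s) in report.items():
    print(f"{label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f} support={s}")
```

Reading the matrix this way also shows where the errors concentrate: most matches whose true class is 2.0 or 3.0 are predicted as class 1.0, which explains the high recall for class 1.0 and the very low recall for class 3.0.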