Prediction of soccer outcome (2009-2010 EPL)
In this example we use the rating values produced by the "AccuRATE method" as machine-learning features to predict soccer match outcomes. The dataset consists of English Premier League matches from the 2009-2010 season. Predictions are made with the Gaussian Naive Bayes classifier from the scikit-learn library, applied in a walk-forward procedure.
Load packages
[1]:
from ratingslib.app_sports.methods import Predictions, prepare_sports_seasons
from ratingslib.application import SoccerOutcome
from ratingslib.datasets.filenames import get_seasons_dict_footballdata_online
from ratingslib.datasets.soccer import championships
from ratingslib.ratings.accurate import AccuRate
from sklearn.naive_bayes import GaussianNB
Set target outcome
[2]:
outcome = SoccerOutcome()
First, we get the filename from football-data.co.uk for the 2009-2010 season of the English Premier League. Then we create the rating system, add it to a dictionary, and finally prepare the dataset. The ratings in the dataset start from the second match week.
[3]:
filenames_dict = get_seasons_dict_footballdata_online(
    season_start=2009, season_end=2010, championship=championships.PREMIERLEAGUE)
ratings_dict = {'AccuRATE': AccuRate()}
data_ml = prepare_sports_seasons(filenames_dict,
                                 outcome,
                                 rating_systems=ratings_dict,
                                 start_week=2)
Load season: 2009 - 2010
We show the columns of the 2009 season dataframe:
[4]:
data_ml[2009].columns
[4]:
Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',
'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',
'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD',
'BWA', 'GBH', 'GBD', 'GBA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA',
'SBH', 'SBD', 'SBA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA', 'VCH',
'VCD', 'VCA', 'BSH', 'BSD', 'BSA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD',
'BbAvD', 'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5',
'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA',
'Period', 'Week_Number', 'FT', 'HAccuRATE', 'AAccuRATE',
'HratingnormAccuRATE', 'AratingnormAccuRATE'],
dtype='object')
We will use only the normalized ratings from AccuRATE as features for the ML classifier.
[5]:
features_names = ['HratingnormAccuRATE',
                  'AratingnormAccuRATE']
data_ml[2009][features_names].head()
[5]:
|   | HratingnormAccuRATE | AratingnormAccuRATE |
|---|---------------------|---------------------|
| 0 | 0.848092            | 0.061352            |
| 1 | 0.401452            | 0.288341            |
| 2 | 0.000000            | 0.231402            |
| 3 | 0.663704            | 0.343821            |
| 4 | 0.369277            | 0.165818            |
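The `Hratingnorm*`/`Aratingnorm*` columns hold the AccuRATE values rescaled to the [0, 1] interval (note the 0.000000 entry in row 2). A minimal sketch of such a rescaling, assuming simple min-max normalization (the helper name is illustrative, not part of ratingslib):

```python
import numpy as np

def min_max_normalize(ratings):
    """Scale rating values to [0, 1]: the lowest-rated team maps to 0,
    the highest-rated team maps to 1."""
    r = np.asarray(ratings, dtype=float)
    lo, hi = r.min(), r.max()
    if hi == lo:  # all ratings equal: avoid division by zero
        return np.zeros_like(r)
    return (r - lo) / (hi - lo)

# Example with three raw rating values
print(min_max_normalize([1.2, 0.4, 2.0]))
```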
We have selected the Gaussian Naive Bayes classifier and start making predictions from the 4th week. We apply the anchored walk-forward procedure with window size = 1, which means that each week we make predictions using the data of all previous weeks as the training set. For example, for the 4th week the training set consists of the 1st, 2nd, and 3rd weeks.
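Outside the library, the anchored walk-forward loop described above can be sketched with plain scikit-learn as follows (function and variable names are illustrative; ratingslib performs this internally):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def anchored_walk_forward(X, y, weeks, start_week=4):
    """Anchored walk-forward evaluation: for each test week w >= start_week,
    train on all weeks < w (the training window grows each week) and
    predict the matches of week w."""
    clf = GaussianNB()
    y_true, y_pred = [], []
    for w in range(start_week, int(weeks.max()) + 1):
        train = weeks < w   # anchored: always starts at week 1
        test = weeks == w
        if not test.any():
            continue
        clf.fit(X[train], y[train])
        y_true.extend(y[test])
        y_pred.extend(clf.predict(X[test]))
    return np.array(y_true), np.array(y_pred)
```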
[6]:
test_y, pred = Predictions(data_ml,
                           outcome,
                           start_from_week=4,
                           print_classification_report=True).ml_pred(
    clf=GaussianNB(),
    features_names=features_names)
=====GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]=====
              precision    recall  f1-score   support

         1.0     0.5630    0.8686    0.6831       175
         2.0     0.4821    0.3553    0.4091        76
         3.0     0.2222    0.0430    0.0721        93

    accuracy                         0.5320       344
   macro avg     0.4224    0.4223    0.3881       344
weighted avg     0.4530    0.5320    0.4574       344
confusion matrix:
[[152 14 9]
[ 44 27 5]
[ 74 15 4]]
Correct games: 183
Wrong games: 161
Total predicted Games: 344
=====GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]=====
=====Accuracy results=====

| Model | Accuracy | Correct Games | Wrong Games | Total Games |
|-------|----------|---------------|-------------|-------------|
| GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE] | 0.531977 | 183 | 161 | 344 |
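As a sanity check, the reported accuracy can be recomputed from the confusion matrix above: the diagonal holds the correctly predicted games for each outcome class.

```python
import numpy as np

# Confusion matrix as printed in the output above (rows: true class,
# columns: predicted class, for outcomes 1.0, 2.0, 3.0)
cm = np.array([[152, 14, 9],
               [44, 27, 5],
               [74, 15, 4]])

correct = np.trace(cm)   # correctly predicted games: 152 + 27 + 4 = 183
total = cm.sum()         # all predicted games: 344
print(correct, total, correct / total)  # accuracy matches 0.531977
```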