{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Prediction of soccer outcome (2009-2010 EPL)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we use the rating values from \"AccuRATE method\" as machine learning features to predict soccer outcome.\n", "The dataset is composed of soccer matches of the English Premier League (season 2009-2010).\n", "The predictions are performed through Naive Bayes classifier of scikit-learn library and we apply the walk-forward procedure." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from ratingslib.app_sports.methods import Predictions, prepare_sports_seasons\n", "from ratingslib.application import SoccerOutcome\n", "from ratingslib.datasets.filenames import get_seasons_dict_footballdata_online\n", "from ratingslib.datasets.soccer import championships\n", "from ratingslib.ratings.accurate import AccuRate\n", "from sklearn.naive_bayes import GaussianNB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set target outcome" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "outcome = SoccerOutcome()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Firstly, we get the filename from football-data.co.uk for season 2009-2010 (English Premier League).\n", "Then, we create rating system and we add it to a dictionary and finally we prepare the dataset.\n", "The ratings in the dataset start from the second match week." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Load season: 2009 - 2010\n", "2.9%5.7%8.6%11.4%14.3%17.1%20.0%22.9%25.7%28.6%31.4%34.3%37.1%40.0%42.9%45.7%48.6%51.4%54.3%57.1%60.0%62.9%65.7%68.6%71.4%74.3%77.1%80.0%82.9%85.7%88.6%91.4%94.3%97.1%100.0%\n" ] } ], "source": [ "filenames_dict = get_seasons_dict_footballdata_online(\n", " season_start=2009, season_end=2010, championship=championships.PREMIERLEAGUE)\n", "ratings_dict = {'AccuRATE': AccuRate()}\n", "data_ml = prepare_sports_seasons(filenames_dict,\n", " outcome,\n", " rating_systems=ratings_dict,\n", " start_week=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We show the columns of 2009 season dataframe" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Div', 'Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HTHG',\n", " 'HTAG', 'HTR', 'Referee', 'HS', 'AS', 'HST', 'AST', 'HF', 'AF', 'HC',\n", " 'AC', 'HY', 'AY', 'HR', 'AR', 'B365H', 'B365D', 'B365A', 'BWH', 'BWD',\n", " 'BWA', 'GBH', 'GBD', 'GBA', 'IWH', 'IWD', 'IWA', 'LBH', 'LBD', 'LBA',\n", " 'SBH', 'SBD', 'SBA', 'WHH', 'WHD', 'WHA', 'SJH', 'SJD', 'SJA', 'VCH',\n", " 'VCD', 'VCA', 'BSH', 'BSD', 'BSA', 'Bb1X2', 'BbMxH', 'BbAvH', 'BbMxD',\n", " 'BbAvD', 'BbMxA', 'BbAvA', 'BbOU', 'BbMx>2.5', 'BbAv>2.5', 'BbMx<2.5',\n", " 'BbAv<2.5', 'BbAH', 'BbAHh', 'BbMxAHH', 'BbAvAHH', 'BbMxAHA', 'BbAvAHA',\n", " 'Period', 'Week_Number', 'FT', 'HAccuRATE', 'AAccuRATE',\n", " 'HratingnormAccuRATE', 'AratingnormAccuRATE'],\n", " dtype='object')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_ml[2009].columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will only use the normalized ratings from AccuRATE as features for ml classifier." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
HratingnormAccuRATEAratingnormAccuRATE
00.8480920.061352
10.4014520.288341
20.0000000.231402
30.6637040.343821
40.3692770.165818
\n", "
" ], "text/plain": [ " HratingnormAccuRATE AratingnormAccuRATE\n", "0 0.848092 0.061352\n", "1 0.401452 0.288341\n", "2 0.000000 0.231402\n", "3 0.663704 0.343821\n", "4 0.369277 0.165818" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features_names = ['HratingnormAccuRATE',\n", " 'AratingnormAccuRATE']\n", "data_ml[2009][features_names].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have selected the Naive Bayes classifier and we start making predictions from the 4th week.\n", "We apply the anchored walk-farward procedure with window size = 1 which means that every week we make predictions\n", "by using previous weeks data for training set. For example for the 4th week, the training set is consisted of the 1st, 2nd and 3rd week. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "=====GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]=====\n", " precision recall f1-score support\n", "\n", " 1.0 0.5630 0.8686 0.6831 175\n", " 2.0 0.4821 0.3553 0.4091 76\n", " 3.0 0.2222 0.0430 0.0721 93\n", "\n", " accuracy 0.5320 344\n", " macro avg 0.4224 0.4223 0.3881 344\n", "weighted avg 0.4530 0.5320 0.4574 344\n", "\n", "confusion matrix:\n", "[[152 14 9]\n", " [ 44 27 5]\n", " [ 74 15 4]]\n", "Correct games: 183\n", "Wrong games: 161\n", "Total predicted Games: 344\n", "\n", "\n", "\n", "=====GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE]=====\n", "\n", "\n", "=====Accuracy results=====\n", "\n", " Accuracy \\\n", "GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE] 0.531977 \n", "\n", " Correct Games \\\n", "GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE] 183 \n", "\n", " Wrong Games \\\n", "GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE] 161 \n", "\n", " Total Games \n", "GaussianNB()-[features: HratingnormAccuRATE AratingnormAccuRATE] 344 \n" ] } ], "source": [ "test_y, pred = Predictions(data_ml,\n", " outcome,\n", " start_from_week=4,\n", " print_classification_report=True).ml_pred(\n", " clf=GaussianNB(),\n", " features_names=features_names)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.13 ('py38')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "23ceb7112fbf9d0e38ecbf60d6e6d5e2dcebcc82200eeb1e5a5d5f9ffb9e27ca" } } }, "nbformat": 4, "nbformat_minor": 2 }