ratingslib.datasets.parse module

Module for parsing a dataset

_parse(pairs_data_df, col_names, parse_dates=None, frequency=None, start_period_from: int = 1, outcome: Optional[Outcome] = None) → DataFrame

parse_pairs_data(filename_or_data: Union[str, DataFrame], columns_dict: Optional[dict] = None, parse_dates=None, date_parser: Optional[Callable] = None, dayfirst: bool = True, frequency: Optional[str] = None, start_period_from: int = 1, outcome: Optional[Outcome] = None) → Tuple[DataFrame, DataFrame]

Parse data from a filename or from a pandas DataFrame. The data must be in pairs form. For example, in the case of soccer teams:

| HomeTeam | AwayTeam | FTHG | FTAG |
|----------|----------|------|------|
| Team1    | Team2    | 1    | 5    |
| Team2    | Team3    | 5    | 4    |

The purpose of this function is to read a .csv file and store it in a pandas.DataFrame structure. If the .csv contains dates, then the parse_dates and date_parser parameters must be set. For competitions with weekly games, the frequency parameter must be defined.
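As a rough illustration of the reading step described above, the following sketch loads a small pairs file with plain pandas. It does not call ratingslib; the `Date` column, the sample rows, and the variable names are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample in the pairs layout, with a DD/MM/YYYY date column.
CSV_TEXT = """Date,HomeTeam,AwayTeam,FTHG,FTAG
05/08/2018,Team1,Team2,1,5
12/08/2018,Team2,Team3,5,4
"""

# parse_dates names the date column; dayfirst=True reads 05/08 as 5 August.
pairs_df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["Date"], dayfirst=True)

# Collect the distinct items (teams) that appear on either side of a pair.
items = sorted(set(pairs_df["HomeTeam"]) | set(pairs_df["AwayTeam"]))

print(pairs_df)
print(items)
```

The two results loosely mirror the pair of DataFrames the function returns: one with the pairwise rows and one with the set of items.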

Parameters

filename_or_data (Union[str, pd.DataFrame]) – Filename to parse or a dataframe that contains pairwise scores
columns_dict (Optional[dict], default=None) – A dictionary mapping the column names of the dataset. If None is given, COLUMNS_DICT will be used. See the module ratingslib.datasets.parameters for more details.
parse_dates (Optional[Union[bool, List[int], List[str], List[list], dict]], default = None) –
Which column(s) contain the dates. The behavior is as follows:
- boolean. If True -> try parsing the index.
- list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
- list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
- dict. e.g. {'foo': [1, 3]} -> parse columns 1, 3 as date and call the result 'foo'.
date_parser (Callable, default = None) –
Function to use for converting a sequence of string columns to an array of datetime instances. For example, for football-data.co.uk data the function is:
```
from datetime import datetime
def parser_date_func(x): return datetime.strptime(x, '%d/%m/%y')
```
dayfirst (bool, default=True) – Parse dates in DD/MM format (international and European format).
frequency (Optional[str], default = None) – If not None, specifies the frequency of the PeriodIndex. For example, in soccer competitions the value "W-THU" groups games into weekly match-weeks; in pandas, "W-THU" denotes weekly periods that end on Thursday, so each match-week runs from Friday to Thursday. For more information, see the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
start_period_from (int, default = 1) – The number from which period numbering starts. Only used if frequency is set. For example, if start_period_from=3, the first period is numbered 3.
outcome (Optional[Outcome], default = None) – The outcome parameter depends on the application type. For sports applications it must be an instance of a subclass of the SportOutcome class, e.g. for soccer the outcome type is ratingslib.application.SoccerOutcome. For more details see the ratingslib.application module.
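To make the interplay of date_parser, frequency and start_period_from more concrete, here is a minimal sketch in plain pandas. The sample dates and the numbering logic are assumptions for illustration only; the library's internal implementation may differ.

```python
from datetime import datetime

import pandas as pd


def parser_date_func(x):
    # Same format as the football-data.co.uk example: DD/MM/YY.
    return datetime.strptime(x, '%d/%m/%y')


# Hypothetical match dates: a Friday, a Saturday, and two later Saturdays.
raw_dates = ["10/08/18", "11/08/18", "18/08/18", "25/08/18"]
games = pd.DataFrame({"Date": pd.to_datetime([parser_date_func(d) for d in raw_dates])})

# In pandas, "W-THU" weekly periods end on Thursday (they start on Friday).
periods = games["Date"].dt.to_period("W-THU")

# Assumed numbering: consecutive integers per distinct period,
# beginning at start_period_from.
start_period_from = 3
ordered = sorted(periods.unique())
numbering = {p: i + start_period_from for i, p in enumerate(ordered)}
games["Period"] = [numbering[p] for p in periods]

print(games)
```

The first two games fall into the same weekly period, so they share a period number, while each later game starts a new one.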

Returns

pairs_data_df (pandas.DataFrame) – A DataFrame of pairwise data after parsing.
items_df (pandas.DataFrame) – Set of items (e.g. teams)

parse_data(filename_or_data: Union[str, DataFrame], reverse_attributes_cols: Optional[List[str]] = None, columns_dict: Optional[dict] = None) → Tuple[DataFrame, DataFrame]

Parse data (not in pairs form) from a filename or from a pandas.DataFrame.

Parameters

filename_or_data (Union[str, pd.DataFrame]) – Filename to parse or a dataframe
reverse_attributes_cols (Optional[List[str]], optional) – Names of the columns to multiply by -1. If None (the default), no column is reversed.
columns_dict (dict, optional) – The column names of the data file. See ratingslib.datasets.parameters.COLUMNS_DICT for more details.

Returns

pairs_data_df (pandas.DataFrame) – A DataFrame of data after parsing.
items_df (pandas.DataFrame) – Set of items (e.g. teams)
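The effect of reverse_attributes_cols can be sketched with plain pandas as follows. The item table and its column names are hypothetical, and the stated motivation (attributes where a lower value is better, such as a price) is an assumption rather than something the source specifies.

```python
import pandas as pd

# Hypothetical item table (not in pairs form): one row per item.
data_df = pd.DataFrame(
    {
        "Item": ["A", "B", "C"],
        "Price": [10.0, 7.5, 12.0],
        "Quality": [3, 5, 4],
    }
)

# Columns whose sign should be flipped (assumed: lower raw value is better).
reverse_attributes_cols = ["Price"]

# Work on a copy so the input frame is left untouched.
reversed_df = data_df.copy()
reversed_df[reverse_attributes_cols] = -reversed_df[reverse_attributes_cols]

print(reversed_df)
```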

create_pairs_data(data_df: DataFrame, columns_dict: Optional[Dict[str, Any]] = None)

Convert dataset to pairs. For example from User-Movie (Rating-Item):

| User | Movie1 | Movie2 | Movie3 |
|------|--------|--------|--------|
| u1   | 1      | 5      | 4      |
| u2   | 1      | 3      | 0      |

to MovieI-MovieJ (Item-Item):

| MovieI | MovieJ | point_i | point_j |
|--------|--------|---------|---------|
| Movie1 | Movie2 | 1       | 5       |
| Movie2 | Movie3 | 5       | 4       |
| Movie1 | Movie3 | 1       | 4       |
| Movie1 | Movie2 | 1       | 3       |

Parameters

data_df (pd.DataFrame) – DataFrame of rating-items form
columns_dict (Optional[Dict[str, Any]], default=None) – The column names of the data file. See ratingslib.datasets.parameters.COLUMNS_DICT for more details.

Returns

item_item_df – DataFrame of item-item form

Return type

pd.DataFrame
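The conversion from rating-item form to item-item form can be sketched as below. This is not the library's implementation: `to_item_pairs` is a hypothetical helper, and treating a rating of 0 as "not rated" is an assumption inferred from the example, where no pair involving u2's Movie3 appears. Row order may differ from the table in the example.

```python
from itertools import combinations

import pandas as pd


def to_item_pairs(data_df, user_col="User"):
    """Hypothetical helper: build item-item rows from a user-item rating table."""
    item_cols = [c for c in data_df.columns if c != user_col]
    rows = []
    for _, ratings in data_df.iterrows():
        # Every unordered pair of items rated by the same user gives one row.
        for item_i, item_j in combinations(item_cols, 2):
            point_i, point_j = int(ratings[item_i]), int(ratings[item_j])
            # Assumption: a rating of 0 means "not rated", so the pair is skipped.
            if point_i == 0 or point_j == 0:
                continue
            rows.append(
                {"MovieI": item_i, "MovieJ": item_j,
                 "point_i": point_i, "point_j": point_j}
            )
    return pd.DataFrame(rows, columns=["MovieI", "MovieJ", "point_i", "point_j"])


ratings_df = pd.DataFrame(
    {"User": ["u1", "u2"], "Movie1": [1, 1], "Movie2": [5, 3], "Movie3": [4, 0]}
)
item_item_df = to_item_pairs(ratings_df)
print(item_item_df)
```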

create_data_from(path: str, year_min: int = 2005, year_max: int = 2018)

Create csv files for the given seasons and then write a single csv file that concatenates them. This function assumes the column names of the data files from football-data.co.uk.

Parameters

path (str) – Path of files
year_min (int, default = 2005) – Starting season
year_max (int, default = 2018) – Ending season
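The concatenation step might look roughly like the sketch below. Everything specific here is an assumption: the helper name `concat_seasons`, the `<year>.csv` file naming, the added `Season` column, the output file name, and treating year_max as inclusive are all illustrative and not taken from the library.

```python
import os
import tempfile

import pandas as pd


def concat_seasons(path, year_min, year_max, out_name="all_seasons.csv"):
    """Hypothetical helper: merge per-season CSV files named <year>.csv."""
    frames = []
    for year in range(year_min, year_max + 1):  # assumption: year_max is inclusive
        season_file = os.path.join(path, f"{year}.csv")
        if not os.path.exists(season_file):
            continue  # skip seasons that have no file
        season_df = pd.read_csv(season_file)
        season_df["Season"] = year  # remember which season each row came from
        frames.append(season_df)
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(os.path.join(path, out_name), index=False)
    return merged


# Demonstration with two tiny, made-up season files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year, goals in [(2005, 1), (2006, 2)]:
        pd.DataFrame(
            {"HomeTeam": ["Team1"], "AwayTeam": ["Team2"], "FTHG": [goals], "FTAG": [0]}
        ).to_csv(os.path.join(tmp, f"{year}.csv"), index=False)
    merged_df = concat_seasons(tmp, 2005, 2006)
    out_exists = os.path.exists(os.path.join(tmp, "all_seasons.csv"))

print(merged_df)
```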