ratingslib.datasets.parse module
Module for parsing a dataset
- _parse(pairs_data_df, col_names, parse_dates=None, frequency=None, start_period_from: int = 1, outcome: Optional[Outcome] = None) DataFrame
- parse_pairs_data(filename_or_data: Union[str, DataFrame], columns_dict: Optional[dict] = None, parse_dates=None, date_parser: Optional[Callable] = None, dayfirst: bool = True, frequency: Optional[str] = None, start_period_from: int = 1, outcome: Optional[Outcome] = None) Tuple[DataFrame, DataFrame]
Parse data from a filename or from a pandas DataFrame. The data must be in pairs. For example, in the case of soccer teams:
========  ========  ====  ====
HomeTeam  AwayTeam  FTHG  FTAG
========  ========  ====  ====
Team1     Team2     1     5
Team2     Team3     5     4
========  ========  ====  ====
The purpose of this function is to read a .csv file and store it in a pandas.DataFrame structure. If the .csv contains dates, then the parse_dates and date_parser parameters must be set. For competitions with weekly games, the frequency parameter must be defined.
- Parameters
filename_or_data (Union[str, pd.DataFrame]) – Filename to parse or a dataframe that contains pairwise scores
columns_dict (dict, default=None) – A dictionary mapping the column names of the dataset. If None is given, COLUMNS_DICT will be used. See the module ratingslib.datasets.parameters for more details.
parse_dates (Optional[Union[bool, List[int], List[str], List[list], dict]], default = None) – Which column has the dates. The behavior is as follows:
boolean. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict. e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'.
date_parser (Callable, default = None) –
Function to use for converting a sequence of string columns to an array of datetime instances. For example in football-data.co.uk data the function is
def parser_date_func(x): return datetime.strptime(x, '%d/%m/%y')
dayfirst (bool, default=True) – DD/MM format dates, international and European format.
frequency (Optional[str], default = None) – If not None, specifies the frequency of the PeriodIndex. For example, in soccer competitions, if we want each match-week to run from Thursday to Wednesday, we set the value "W-THU". For more information, see the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
start_period_from (int, default = 1) – Defines the number from which period numbering starts. Only valid if frequency is set. For example, if start_period_from=3 then the period starts from 3.
outcome (SportOutcome = None) – The outcome parameter is related to the application type. For sports applications it must be an instance of a subclass of the SportOutcome class, e.g. for soccer the type of outcome is ratingslib.application.SoccerOutcome. For more details see the ratingslib.application module.
- Returns
pairs_data_df (pandas.DataFrame) – A DataFrame of pairwise data after parsing.
items_df (pandas.DataFrame) – Set of items (e.g. teams)
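The interplay of parse_dates, frequency and start_period_from described above can be illustrated with plain pandas. The following is only a sketch under stated assumptions, not the library's implementation: the helper names add_periods and unique_items, the "Date" and "Period" column names, and the sample rows are all hypothetical. The exact week boundaries follow the pandas anchoring rules for the chosen frequency string and are not restated here.

```python
import pandas as pd

# Hypothetical pairwise data in a football-data.co.uk-like layout (values invented).
raw = pd.DataFrame({
    "Date": ["11/08/17", "12/08/17", "13/08/17", "19/08/17", "20/08/17"],
    "HomeTeam": ["Team1", "Team2", "Team3", "Team2", "Team1"],
    "AwayTeam": ["Team2", "Team3", "Team1", "Team1", "Team3"],
    "FTHG": [1, 5, 0, 2, 3],
    "FTAG": [5, 4, 0, 1, 1],
})


def add_periods(df, frequency="W-THU", start_period_from=1):
    """Return a copy with parsed dates and a consecutive period number.

    Hypothetical helper: it mimics the documented parameters only.
    """
    out = df.copy()
    # Same format string as the date_parser example: day/month/two-digit year.
    out["Date"] = pd.to_datetime(out["Date"], format="%d/%m/%y")
    # Map each date to its weekly period, then to that period's start timestamp.
    week_start = out["Date"].dt.to_period(frequency).dt.start_time
    # Number the distinct periods chronologically, starting at start_period_from.
    codes, _ = pd.factorize(week_start, sort=True)
    out["Period"] = codes + start_period_from
    return out


def unique_items(df, cols=("HomeTeam", "AwayTeam")):
    """Collect the sorted set of items that appear in either column."""
    return sorted(set(df[list(cols)].to_numpy().ravel()))


parsed = add_periods(raw)
shifted = add_periods(raw, start_period_from=3)
items = unique_items(raw)
```

Here the first three matches fall into one weekly period and the last two into the next, so parsed["Period"] counts from 1 while shifted["Period"] counts from 3. The items list plays the role of the items_df return value in spirit only.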
- parse_data(filename_or_data: Union[str, DataFrame], reverse_attributes_cols: Optional[List[str]] = None, columns_dict: Optional[dict] = None) Tuple[DataFrame, DataFrame]
Parse data (not in pairs form) from filename or from pandas.DataFrame
- Parameters
filename_or_data (Union[str, pd.DataFrame]) – Filename to parse or a dataframe
reverse_attributes_cols (Optional[List[str]], optional) – If not None, the listed columns will be multiplied by -1. By default None.
columns_dict (dict, optional) – The column names of the data file. See ratingslib.datasets.parameters.COLUMNS_DICT for more details.
- Returns
pairs_data_df (pandas.DataFrame) – A DataFrame of data after parsing.
items_df (pandas.DataFrame) – Set of items (e.g. teams)
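The effect of reverse_attributes_cols (multiplying the listed columns by -1) can be sketched in plain pandas as follows. The helper name reverse_columns, the column names and the values are hypothetical, and the "lower is better" motivation in the comment is an assumption about why one might reverse a column, not something stated by the library.

```python
import pandas as pd

# Hypothetical non-pairs data: one row per item, one column per attribute.
data = pd.DataFrame({
    "Item": ["Team1", "Team2"],
    "GoalsFor": [30, 25],
    "GoalsAgainst": [12, 20],  # assumed example of a "lower is better" attribute
})


def reverse_columns(df, reverse_attributes_cols=None):
    """Return a copy in which the listed columns are multiplied by -1."""
    out = df.copy()
    if reverse_attributes_cols is not None:
        out[reverse_attributes_cols] = out[reverse_attributes_cols] * -1
    return out


reversed_df = reverse_columns(data, ["GoalsAgainst"])
```

Only the listed column changes sign; the other columns and the original DataFrame are left untouched.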
- create_pairs_data(data_df: DataFrame, columns_dict: Optional[Dict[str, Any]] = None)
Convert dataset to pairs. For example from User-Movie (Rating-Item):
====  ======  ======  ======
User  Movie1  Movie2  Movie3
====  ======  ======  ======
u1    1       5       4
u2    1       3       0
====  ======  ======  ======
to MovieI-MovieJ (Item-Item):
======  ======  =======  =======
MovieI  MovieJ  point_i  point_j
======  ======  =======  =======
Movie1  Movie2  1        5
Movie2  Movie3  5        4
Movie1  Movie3  1        4
Movie1  Movie2  1        3
======  ======  =======  =======
- Parameters
data_df (pd.DataFrame) – DataFrame of rating-items form
columns_dict (Optional[Dict[str, str]], default=None) – The column names of the data file. See ratingslib.datasets.parameters.COLUMNS_DICT for more details.
- Returns
item_item_df – DataFrame of item-item form
- Return type
pd.DataFrame
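The User-Movie to Item-Item conversion shown above can be sketched with itertools.combinations. This is a minimal illustration, not the library's code: the helper name to_item_pairs is hypothetical, and treating a rating of 0 as "not rated" is an assumption inferred from the example, where u2 contributes only the Movie1-Movie2 pair. Row order may differ from the table above.

```python
from itertools import combinations

import pandas as pd

# The rating-item table from the example above.
ratings = pd.DataFrame({
    "User": ["u1", "u2"],
    "Movie1": [1, 1],
    "Movie2": [5, 3],
    "Movie3": [4, 0],
})


def to_item_pairs(df, user_col="User"):
    """Build one row per user and per pair of items that the user rated."""
    item_cols = [c for c in df.columns if c != user_col]
    rows = []
    for _, rec in df.iterrows():
        for item_i, item_j in combinations(item_cols, 2):
            point_i, point_j = rec[item_i], rec[item_j]
            # Assumption: 0 marks a missing rating, so the pair is skipped.
            if point_i == 0 or point_j == 0:
                continue
            rows.append((item_i, item_j, int(point_i), int(point_j)))
    return pd.DataFrame(rows, columns=["MovieI", "MovieJ", "point_i", "point_j"])


pairs = to_item_pairs(ratings)
```

Under these assumptions the result has the same four rows as the item-item table: three pairs from u1 and one from u2.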
- create_data_from(path: str, year_min: int = 2005, year_max: int = 2018)
Create CSV files for the given seasons and then write a single CSV file that concatenates them. This function uses the column names of football-data.co.uk data files.
- Parameters
path (str) – Path of files
year_min (int, default = 2005) – Starting season
year_max (int, default = 2018) – Ending season
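The concatenation step can be sketched as below. This covers only the "read one file per season and write a combined file" idea; it does not reproduce how the library creates the per-season files. The helper name concat_seasons, the <year>.csv file naming, the added "Season" column, the output file name and the inclusive treatment of year_max are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def concat_seasons(path, year_min, year_max, out_name="all_seasons.csv"):
    """Read <path>/<year>.csv for each season and write one combined CSV.

    Hypothetical helper; file naming and the 'Season' column are assumptions.
    """
    folder = Path(path)
    frames = []
    for year in range(year_min, year_max + 1):  # year_max assumed inclusive
        season = pd.read_csv(folder / f"{year}.csv")
        season["Season"] = year  # remember which file each row came from
        frames.append(season)
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(folder / out_name, index=False)
    return combined


# Demonstration on two tiny, invented season files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year in (2005, 2006):
        pd.DataFrame(
            {"HomeTeam": ["Team1"], "AwayTeam": ["Team2"], "FTHG": [1], "FTAG": [5]}
        ).to_csv(Path(tmp) / f"{year}.csv", index=False)
    combined = concat_seasons(tmp, 2005, 2006)
    written = pd.read_csv(Path(tmp) / "all_seasons.csv")
```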