ratingslib.datasets.parse module
Module for parsing a dataset
- _parse(pairs_data_df, col_names, parse_dates=None, frequency=None, start_period_from: int = 1, outcome: Optional[Outcome] = None) DataFrame
- parse_pairs_data(filename_or_data: Union[str, DataFrame], columns_dict: Optional[dict] = None, parse_dates=None, date_parser: Optional[Callable] = None, dayfirst: bool = True, frequency: Optional[str] = None, start_period_from: int = 1, outcome: Optional[Outcome] = None) Tuple[DataFrame, DataFrame]
Parse data from filename or from pandas DataFrame. The data must be in pairs. For example in the case of soccer teams:
HomeTeam  AwayTeam  FTHG  FTAG
Team1     Team2     1     5
Team2     Team3     5     4
The purpose of this function is to read a .csv file and store it in a pandas.DataFrame structure. If the .csv contains dates, then the parse_dates and date_parser parameters must be set. For competitions with weekly games, the frequency parameter must be defined.
- Parameters
filename_or_data (Union[str, pd.DataFrame]) – Filename to parse or a dataframe that contains pairwise scores
columns_dict (dict, default=None) – A dictionary mapping the column names of the dataset. If None is given, COLUMNS_DICT will be used. See the module ratingslib.datasets.parameters for more details.
parse_dates (Optional[Union[bool, List[int], List[str], List[list], dict]], default = None) –
Which column has the dates. The behavior is as follows:
boolean. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> then try parsing columns 1, 2, 3 each as a separate date column.
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict. e.g. If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result 'foo'.
date_parser (Callable, default = None) –
Function to use for converting a sequence of string columns to an array of datetime instances. For example in football-data.co.uk data the function is
def parser_date_func(x): return datetime.strptime(x, '%d/%m/%y')
dayfirst (bool, default=True) – Whether dates are in DD/MM format (international and European format).
frequency (Optional[str], default = None) – If not None, specifies the frequency of the PeriodIndex. For example, in soccer competitions, if we want each match-week to run from Thursday to Wednesday, we set the value "W-THU". For more information, see the pandas time series documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
start_period_from (int, default = 1) – The number from which period numbering starts. Only valid if frequency is set. For example, if start_period_from=3 then the periods start from 3.
outcome (SportOutcome, default = None) – The outcome parameter is related to the application type. For sports applications it must be an instance of a subclass of the SportOutcome class, e.g. for soccer the type of outcome is ratingslib.application.SoccerOutcome. For more details see the ratingslib.application module.
- Returns
pairs_data_df (pandas.DataFrame) – A DataFrame of pairwise data after parsing.
items_df (pandas.DataFrame) – Set of items (e.g. teams)
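The interplay of date_parser, frequency and start_period_from can be illustrated with plain pandas. The sketch below is not the library's implementation: the helper name weekly_period_numbers and the sample dates are assumptions made only for illustration. It parses day-first strings with the parser function shown above, assigns each date to a weekly pandas period, and numbers those periods from a chosen start value.

```python
from datetime import datetime

import pandas as pd


def parser_date_func(x):
    # Same format as the football-data.co.uk example: day/month/two-digit year.
    return datetime.strptime(x, '%d/%m/%y')


def weekly_period_numbers(dates, frequency="W-THU", start_period_from=1):
    """Hypothetical helper: number each date by the weekly period it falls in."""
    parsed = pd.to_datetime(pd.Series(dates).map(parser_date_func))
    # In pandas itself, a "W-THU" period is a week anchored to end on a Thursday.
    periods = parsed.dt.to_period(frequency)
    # Number the distinct periods in chronological order.
    ordered = sorted(periods.unique())
    numbering = {p: i + start_period_from for i, p in enumerate(ordered)}
    return [numbering[p] for p in periods]


# Assumed sample dates (not taken from any real dataset).
dates = ["09/08/19", "10/08/19", "17/08/19", "19/08/19"]
print(weekly_period_numbers(dates))
print(weekly_period_numbers(dates, start_period_from=3))
```

The first two dates fall in one weekly period and the last two in the next, so changing start_period_from only shifts the numbering. How the library itself maps a frequency string to match-weeks is not shown here.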
- parse_data(filename_or_data: Union[str, DataFrame], reverse_attributes_cols: Optional[List[str]] = None, columns_dict: Optional[dict] = None) Tuple[DataFrame, DataFrame]
Parse data (not in pairs form) from filename or from pandas.DataFrame
- Parameters
filename_or_data (Union[str, pd.DataFrame]) – Filename to parse or a dataframe
reverse_attributes_cols (Optional[List[str]], optional) – If not None, the listed columns will be multiplied by -1; by default None
columns_dict (dict, optional) – The column names of data file. See
ratingslib.datasets.parameters.COLUMNS_DICT
for more details.
- Returns
pairs_data_df (pandas.DataFrame) – A DataFrame of data after parsing.
items_df (pandas.DataFrame) – Set of items (e.g. teams)
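The effect described for reverse_attributes_cols (multiplying the listed columns by -1) can be sketched with plain pandas. The helper reverse_columns and the column names Item, Goals and Fouls are hypothetical and only illustrate the operation; they are not part of the library.

```python
import pandas as pd


def reverse_columns(data_df, reverse_attributes_cols=None):
    """Hypothetical helper: return a copy with the listed columns negated."""
    result = data_df.copy()
    if reverse_attributes_cols is not None:
        # Multiply only the requested columns by -1; all others stay unchanged.
        result[reverse_attributes_cols] = result[reverse_attributes_cols] * -1
    return result


# Assumed sample data with made-up column names.
data = pd.DataFrame({"Item": ["A", "B"], "Goals": [10, 7], "Fouls": [3, 8]})
reversed_df = reverse_columns(data, reverse_attributes_cols=["Fouls"])
print(reversed_df)
```

Working on a copy keeps the input frame untouched, which makes the sign flip easy to compare against the original values.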
- create_pairs_data(data_df: DataFrame, columns_dict: Optional[Dict[str, Any]] = None)
Convert dataset to pairs. For example from User-Movie (Rating-Item):
User  Movie1  Movie2  Movie3
u1    1       5       4
u2    1       3       0
to MovieI-MovieJ (Item-Item):
MovieI  MovieJ  point_i  point_j
Movie1  Movie2  1        5
Movie2  Movie3  5        4
Movie1  Movie3  1        4
Movie1  Movie2  1        3
- Parameters
data_df (pd.DataFrame) – DataFrame of rating-items form
columns_dict (Optional[Dict[str, str]], default=None) – The column names of data file. See
ratingslib.datasets.parameters.COLUMNS_DICT
for more details.
- Returns
item_item_df – DataFrame of item-item form
- Return type
pd.DataFrame
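The conversion from rating-item form to item-item form can be sketched as follows. This is a minimal illustration, not the library's code: the helper to_item_pairs is hypothetical, and treating a rating of 0 as "no rating" is an assumption inferred from the example above, where user u2 contributes only the Movie1-Movie2 pair. The row order produced here may differ from the order shown in the example.

```python
from itertools import combinations

import pandas as pd


def to_item_pairs(data_df, user_col="User", missing=0):
    """Hypothetical helper: one row per pair of items rated by the same user."""
    item_cols = [c for c in data_df.columns if c != user_col]
    rows = []
    for _, row in data_df.iterrows():
        for item_i, item_j in combinations(item_cols, 2):
            # Assumption: a value equal to `missing` means the item was not rated.
            if row[item_i] == missing or row[item_j] == missing:
                continue
            rows.append((item_i, item_j, row[item_i], row[item_j]))
    return pd.DataFrame(rows, columns=["MovieI", "MovieJ", "point_i", "point_j"])


ratings = pd.DataFrame({
    "User": ["u1", "u2"],
    "Movie1": [1, 1],
    "Movie2": [5, 3],
    "Movie3": [4, 0],
})
pairs = to_item_pairs(ratings)
print(pairs)
```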
- create_data_from(path: str, year_min: int = 2005, year_max: int = 2018)
Create csv files for the given seasons and then write a single csv file that concatenates them. This function uses the column names of the data files from football-data.co.uk
- Parameters
path (str) – Path of files
year_min (int, default = 2005) – Starting season
year_max (int, default = 2018) – Ending season
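The concatenation step can be sketched with pandas. Everything specific here is an assumption: the helper concat_season_files, the season_<year>.csv file naming, the output file name, and the inclusive treatment of year_max are hypothetical and do not describe how create_data_from names or locates its files.

```python
import os
import tempfile

import pandas as pd


def concat_season_files(path, year_min, year_max, out_name="all_seasons.csv"):
    """Hypothetical helper: concatenate season_<year>.csv files found in path."""
    frames = []
    for year in range(year_min, year_max + 1):  # assumed: year_max is inclusive
        season_file = os.path.join(path, f"season_{year}.csv")  # assumed naming
        frames.append(pd.read_csv(season_file))
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(os.path.join(path, out_name), index=False)
    return combined


# Demonstration on two tiny made-up season files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year, goals in [(2005, 1), (2006, 2)]:
        season = pd.DataFrame({"HomeTeam": ["Team1"], "AwayTeam": ["Team2"],
                               "FTHG": [goals], "FTAG": [0]})
        season.to_csv(os.path.join(tmp, f"season_{year}.csv"), index=False)
    combined = concat_season_files(tmp, 2005, 2006)
    output_written = os.path.exists(os.path.join(tmp, "all_seasons.csv"))

print(combined)
```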