streamsight.datasets.Dataset
- class streamsight.datasets.Dataset(filename: str | None = None, base_path: str | None = None, use_default_filters=False)
Bases: ABC
Represents a collaborative filtering dataset.
Dataset must minimally contain user, item and timestamp columns for the other modules to work.
Assumption
User and item IDs increase in the order of time. The dataset splitting, and later the data passed to the model, rely on this assumption. Because IDs increase with time, the shape of the currently known user and item matrix can be set directly, which makes it easier for the evaluator to manipulate the data.
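The assumption can be illustrated with a minimal sketch that does not use streamsight itself: interactions are sorted by timestamp, and each raw ID is replaced by a counter that grows on first appearance. The function name `remap_ids_by_time` and the tuple layout are illustrative assumptions, not part of the library.

```python
from typing import Dict, Hashable, List, Tuple

Interaction = Tuple[Hashable, Hashable, int]  # (user, item, timestamp)


def remap_ids_by_time(interactions: List[Interaction]) -> List[Tuple[int, int, int]]:
    """Assign integer IDs to users and items in order of first appearance in time."""
    user_map: Dict[Hashable, int] = {}
    item_map: Dict[Hashable, int] = {}
    remapped = []
    # sorted() is stable, so rows with equal timestamps keep their input order
    for user, item, ts in sorted(interactions, key=lambda row: row[2]):
        uid = user_map.setdefault(user, len(user_map))
        iid = item_map.setdefault(item, len(item_map))
        remapped.append((uid, iid, ts))
    return remapped


raw = [("bob", "x9", 30), ("amy", "k2", 10), ("bob", "k2", 20)]
remapped = remap_ids_by_time(raw)
```

With IDs assigned this way, the users and items seen up to any point in time occupy a contiguous block starting at 0, so the known matrix shape is simply (largest user ID + 1, largest item ID + 1).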
- param filename:
Name of the file. If no name is provided, the dataset's default filename is used, if one is known. If the dataset has no default filename, a ValueError is raised.
- type filename:
str, optional
- param base_path:
Base path to the data directory. Defaults to 'data'.
- type base_path:
str, optional
- param use_default_filters:
If True, the default filters will be applied to the dataset. Defaults to False.
- type use_default_filters:
bool, optional
- __init__(filename: str | None = None, base_path: str | None = None, use_default_filters=False)
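The documented defaults for filename and base_path can be sketched as follows. This is a stand-in class written for illustration, not the streamsight implementation: `DatasetPathSketch`, `WithDefault`, the `ratings.csv` default, and the rule that the file path is the base path joined with the filename are all assumptions.

```python
import os
from typing import Optional


class DatasetPathSketch:
    """Stand-in that mimics only the documented filename/base_path defaults."""

    DEFAULT_BASE_PATH = "data"
    DEFAULT_FILENAME: Optional[str] = None

    def __init__(self, filename: Optional[str] = None, base_path: Optional[str] = None):
        # Fall back to the class-level default when no base path is given
        self.base_path = base_path if base_path else self.DEFAULT_BASE_PATH
        if filename:
            self.filename = filename
        elif self.DEFAULT_FILENAME is not None:
            self.filename = self.DEFAULT_FILENAME
        else:
            raise ValueError("No filename given and the dataset has no default filename.")

    @property
    def file_path(self) -> str:
        # Assumption: the file path is the base path joined with the filename
        return os.path.join(self.base_path, self.filename)


class WithDefault(DatasetPathSketch):
    DEFAULT_FILENAME = "ratings.csv"  # hypothetical default filename


default_path = WithDefault().file_path
custom_path = WithDefault(filename="a.csv", base_path="cache").file_path
```

Instantiating `DatasetPathSketch()` directly raises a ValueError in this sketch, because it has no default filename to fall back on.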
Methods
__init__([filename, base_path, ...])
add_filter(filter): Add a filter to be applied when loading the data.
fetch_dataset([force]): Check whether the dataset is present; if not, download it.
load([apply_filters]): Loads data into an InteractionMatrix object.
Attributes
DEFAULT_BASE_PATH: Default base path where the dataset will be stored.
DEFAULT_FILENAME: Default filename that will be used if it is not specified by the user.
ITEM_IX: Name of the column in the DataFrame with item identifiers
TIMESTAMP_IX: Name of the column in the DataFrame that contains time of interaction in seconds since epoch.
USER_IX: Name of the column in the DataFrame with user identifiers
file_path: File path of the dataset.
name: Name of the object's class.
- DEFAULT_BASE_PATH = 'data'
Default base path where the dataset will be stored.
- DEFAULT_FILENAME = None
Default filename that will be used if it is not specified by the user.
- ITEM_IX = None
Name of the column in the DataFrame with item identifiers
- TIMESTAMP_IX = None
Name of the column in the DataFrame that contains time of interaction in seconds since epoch.
- USER_IX = None
Name of the column in the DataFrame with user identifiers
- _check_safe()
Check if the directory is safe. If the directory does not exist, create it.
- _dataframe_to_matrix(df: DataFrame) InteractionMatrix
Converts a DataFrame to an InteractionMatrix.
- Parameters:
df (pd.DataFrame) – DataFrame to convert
- Returns:
InteractionMatrix object
- Return type:
InteractionMatrix
- property _default_filters: List[Filter]
The default filters for all datasets
Concrete classes can override this property to add more filters.
- Returns:
List of filters to be applied to the dataset
- Return type:
List[Filter]
- abstract _download_dataset()
Downloads the dataset.
Downloads the CSV file from the dataset URL and saves it to the file path.
- _fetch_remote(url: str, filename: str) str
Fetch data from a remote URL and save it locally.
- Parameters:
url (str) – url to fetch data from
filename (str) – Path to save file to
- Returns:
The filename where data was saved
- Return type:
str
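A helper with this contract could look like the sketch below. It uses the standard library's `urllib.request.urlretrieve` as an assumption; the source does not say which download mechanism streamsight uses. The demonstration avoids network access by pointing a `file://` URL at a temporary file.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch_remote(url: str, filename: str) -> str:
    """Download `url` to `filename` and return the path that was written."""
    # urlretrieve returns (local_filename, headers); only the filename is needed
    saved_path, _headers = urlretrieve(url, filename)
    return saved_path


# Demonstration without network access: a file:// URL stands in for the dataset URL
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.csv"
source.write_text("user,item,timestamp\n1,10,1700000000\n")
target = os.path.join(workdir, "downloaded.csv")
saved = fetch_remote(source.as_uri(), target)
```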
- abstract _load_dataframe() DataFrame
Load the raw dataset from file, and return it as a pandas DataFrame.
Warning
This does not apply any preprocessing, and returns the raw dataset.
- Returns:
Interactions with minimal columns of {user, item, timestamp}.
- Return type:
pd.DataFrame
- add_filter(filter: Filter)
Add a filter to be applied when loading the data.
Utilizes the DataFramePreprocessor class to add filters to the dataset to be loaded. The filter is applied when load() is called and the data is loaded into an InteractionMatrix object.
- Parameters:
filter (Filter) – Filter to be applied to the loaded DataFrame before it is processed into an interaction matrix.
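The deferred behaviour described here (filters are only recorded by add_filter and run later, at load time) can be sketched with plain Python objects. Everything in this block is a stand-in: `FilterQueueSketch`, `min_items_per_user`, and the list-of-dicts rows are hypothetical and do not reproduce streamsight's Filter or DataFramePreprocessor classes.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, int]
RowFilter = Callable[[List[Row]], List[Row]]


def min_items_per_user(min_count: int) -> RowFilter:
    """Hypothetical filter: keep users with at least `min_count` interactions."""
    def apply(rows: List[Row]) -> List[Row]:
        counts = Counter(row["user"] for row in rows)
        return [row for row in rows if counts[row["user"]] >= min_count]
    return apply


class FilterQueueSketch:
    def __init__(self, rows: List[Row]):
        self._rows = rows
        self._filters: List[RowFilter] = []

    def add_filter(self, row_filter: RowFilter) -> None:
        # Only record the filter; nothing is applied until load() runs
        self._filters.append(row_filter)

    def load(self, apply_filters: bool = True) -> List[Row]:
        rows = list(self._rows)
        if apply_filters:
            for row_filter in self._filters:
                rows = row_filter(rows)
        return rows


rows = [
    {"user": 1, "item": 10, "timestamp": 100},
    {"user": 1, "item": 11, "timestamp": 110},
    {"user": 2, "item": 10, "timestamp": 120},
]
dataset = FilterQueueSketch(rows)
dataset.add_filter(min_items_per_user(2))
filtered = dataset.load()                     # user 2 is dropped
unfiltered = dataset.load(apply_filters=False)  # all three rows kept
```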
- fetch_dataset(force=False) None
Check whether the dataset is present; if not, download it.
- Parameters:
force (bool, optional) – If True, dataset will be downloaded, even if the file already exists. Defaults to False.
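The effect of force can be sketched with a small stand-in class. This is an assumption about the control flow based on the description above, not the library's code; `FetchSketch` and its `download_calls` counter are hypothetical, and the download hook just writes a placeholder file.

```python
import os
import tempfile


class FetchSketch:
    def __init__(self, file_path: str):
        self.file_path = file_path
        self.download_calls = 0

    def _download_dataset(self) -> None:
        # Stand-in for the abstract download hook; writes a tiny placeholder file
        self.download_calls += 1
        with open(self.file_path, "w") as handle:
            handle.write("user,item,timestamp\n")

    def fetch_dataset(self, force: bool = False) -> None:
        # Download when forced, or when the file is not there yet
        if force or not os.path.exists(self.file_path):
            self._download_dataset()


path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
sketch = FetchSketch(path)
sketch.fetch_dataset()            # file missing: downloads
sketch.fetch_dataset()            # file present: no download
sketch.fetch_dataset(force=True)  # forced: downloads again
```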
- property file_path: str
File path of the dataset.
- load(apply_filters=True) InteractionMatrix
Loads data into an InteractionMatrix object.
Data is loaded into a DataFrame using the _load_dataframe() function. The resulting DataFrame is parsed into an InteractionMatrix object. If apply_filters is set to True, the configured filters are applied to the dataset and the user and item IDs are mapped. This is advised even if no filter is set, as it ensures that the user and item IDs increase in the order of time.
- Parameters:
apply_filters (bool, optional) – Whether to apply the configured filters and preprocessing. Defaults to True.
- Returns:
Resulting interaction matrix
- Return type:
InteractionMatrix
- property name
Name of the object’s class.