streamsight.datasets.YelpDataset

class streamsight.datasets.YelpDataset(filename: str | None = None, base_path: str | None = None, use_default_filters=False)

Bases: Dataset

Yelp dataset

The Yelp dataset contains user reviews of businesses. The main columns that will be used are:

user_id: The user identifier
business_id: The business identifier
stars: The rating given by the user to the business
date: The date of the review

The dataset can be downloaded from https://www.yelp.com/dataset/download. It is distributed as a zip file, and scripts available online can help convert the JSON file it contains to a CSV file. This class assumes that the dataset has already been converted to a CSV file named yelp_academic_dataset_review.csv.
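The conversion step can be sketched with the Python standard library alone. This sketch assumes the review file is in JSON Lines form (one JSON object per line). The function name `reviews_json_to_csv` and the choice to keep only the four columns listed above are illustrative assumptions, not part of streamsight or of Yelp's converter.

```python
import csv
import json

# Columns this class relies on (taken from the list above).
COLUMNS = ["user_id", "business_id", "stars", "date"]


def reviews_json_to_csv(json_path: str, csv_path: str) -> int:
    """Convert a JSON Lines review file to CSV; return the number of rows written."""
    rows_written = 0
    with open(json_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # extrasaction="ignore" drops fields such as the review text.
        writer = csv.DictWriter(dst, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            writer.writerow(json.loads(line))
            rows_written += 1
    return rows_written
```

Reading the file line by line keeps memory use flat, which matters because the review file is large.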

For reference, the official Yelp repository provides a converter script: https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py

You can use the following command to convert the JSON file to a CSV file:

__init__(filename: str | None = None, base_path: str | None = None, use_default_filters=False)
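A minimal usage sketch, using only the constructor arguments and the `load()` method documented on this page. The helper `load_yelp_interactions` and its `dataset_cls` parameter are hypothetical. The parameter exists only so the helper can be tried with a stand-in class when streamsight or the CSV file is not available.

```python
def load_yelp_interactions(base_path="data", filename=None, dataset_cls=None):
    """Build the dataset object and load it into an InteractionMatrix."""
    if dataset_cls is None:
        # Imported lazily so the helper can be exercised with a stand-in class.
        from streamsight.datasets import YelpDataset as dataset_cls
    dataset = dataset_cls(filename=filename, base_path=base_path)
    # apply_filters=True is the documented default; spelled out here for clarity.
    return dataset.load(apply_filters=True)
```

Called with no arguments, this would presumably look for the default filename under the `data` directory, so the converted CSV file should already be in place.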

Methods

`__init__`([filename, base_path, ...])
`add_filter`(filter)	Add a filter to be applied when loading the data.
`fetch_dataset`([force])	Check whether the dataset is present; if not, download it.
`load`([apply_filters])	Loads data into an InteractionMatrix object.

Attributes

`DATASET_URL`	URL to fetch the dataset from.
`DEFAULT_BASE_PATH`	Default base path where the dataset will be stored.
`DEFAULT_FILENAME`	Default filename that will be used if it is not specified by the user.
`ITEM_IX`	Name of the column in the DataFrame that contains item identifiers.
`RATING_IX`	Name of the column in the DataFrame that contains the rating a user gave to the item.
`TIMESTAMP_IX`	Name of the column in the DataFrame that contains time of interaction in date format.
`USER_IX`	Name of the column in the DataFrame that contains user identifiers.
`file_path`	File path of the dataset.
`name`	Name of the object's class.

DATASET_URL = 'https://www.yelp.com/dataset/download': URL to fetch the dataset from.

DEFAULT_BASE_PATH = 'data': Default base path where the dataset will be stored.

DEFAULT_FILENAME = 'yelp_academic_dataset_review.csv': Default filename that will be used if it is not specified by the user.

ITEM_IX = 'business_id': Name of the column in the DataFrame that contains item identifiers.

RATING_IX = 'stars': Name of the column in the DataFrame that contains the rating a user gave to the item.

TIMESTAMP_IX = 'date': Name of the column in the DataFrame that contains time of interaction in date format.

USER_IX = 'user_id': Name of the column in the DataFrame that contains user identifiers.
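To illustrate what these column-name attributes refer to, the sketch below reads a tiny in-memory CSV with the standard library and picks out the four fields by name. The constants are copied from the values above. The helper `read_interactions` and the sample row are made up for illustration, and the library itself works on a pandas DataFrame rather than on `csv` rows.

```python
import csv
import io

USER_IX = "user_id"
ITEM_IX = "business_id"
RATING_IX = "stars"
TIMESTAMP_IX = "date"


def read_interactions(csv_file):
    """Yield (user, item, rating, date) tuples from an open CSV file."""
    for row in csv.DictReader(csv_file):
        yield row[USER_IX], row[ITEM_IX], float(row[RATING_IX]), row[TIMESTAMP_IX]


# Made-up sample with the expected header row.
sample = io.StringIO(
    "user_id,business_id,stars,date\n"
    "u1,b1,4.0,2020-01-01 10:00:00\n"
)
interactions = list(read_interactions(sample))
```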

_check_safe(): Check if the directory is safe. If the directory does not exist, create it.
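The "create if missing" behaviour can be sketched as follows. The standalone function `check_safe` and its `base_path` argument are assumptions for illustration; the actual method takes no arguments and may perform additional checks.

```python
import os


def check_safe(base_path: str) -> None:
    """Create base_path (and any missing parents) if it does not exist yet."""
    # exist_ok=True makes the call a no-op when the directory is already there.
    os.makedirs(base_path, exist_ok=True)
```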

_dataframe_to_matrix(df: DataFrame) → InteractionMatrix

Converts a DataFrame to an InteractionMatrix.

Parameters: df (pd.DataFrame) – DataFrame to convert
Returns: InteractionMatrix object
Return type: InteractionMatrix

property _default_filters: List[Filter]

The default filters for all datasets

Concrete classes can override this property to add more filters.

Returns: List of filters to be applied to the dataset
Return type: List[Filter]

_download_dataset()

Downloads the dataset.

Downloads the csv file from the dataset URL and saves it to the file path.

_fetch_remote(url: str, filename: str) → str

Fetch data from remote url and save locally

Parameters:

url (str) – url to fetch data from
filename (str) – Path to save file to

Returns:

The filename where data was saved

Return type:

str
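A sketch of a function with the same shape (URL in, local filename out), built on `urllib.request.urlretrieve` from the standard library. This is an assumption about how such a helper could be written, not the library's implementation.

```python
from urllib.request import urlretrieve


def fetch_remote(url: str, filename: str) -> str:
    """Download url to filename and return the local filename."""
    # urlretrieve returns a (local_path, headers) pair; only the path is needed.
    saved_path, _headers = urlretrieve(url, filename)
    return saved_path
```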

_load_dataframe() → DataFrame

Load the raw dataset from file, and return it as a pandas DataFrame.

Transforms the downloaded dataset so that user and item ids are integers, which is required for representation in the interaction matrix.

Returns: The interaction data as a DataFrame with a row per interaction.
Return type: pd.DataFrame

add_filter(filter: Filter)

Add a filter to be applied when loading the data.

Uses the DataFramePreprocessor class to register a filter on the dataset. The filter is not applied immediately; it is applied when load() is called and the data is loaded into an InteractionMatrix object.

Parameters: filter (Filter) – Filter to be applied to the loaded DataFrame before it is processed into an interaction matrix.
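The register-now, apply-later pattern described here can be sketched in plain Python. Every name in this block (`Preprocessor`, `min_reviews_per_user`, the row-dictionary representation) is hypothetical and stands in for the library's own Filter and DataFramePreprocessor classes, whose interfaces are not shown on this page.

```python
from collections import Counter
from typing import Callable, Dict, List

Row = Dict[str, str]
RowFilter = Callable[[List[Row]], List[Row]]


def min_reviews_per_user(minimum: int) -> RowFilter:
    """Build a filter that keeps only users with at least `minimum` reviews."""
    def apply(rows: List[Row]) -> List[Row]:
        counts = Counter(row["user_id"] for row in rows)
        return [row for row in rows if counts[row["user_id"]] >= minimum]
    return apply


class Preprocessor:
    """Collects filters now and applies them, in insertion order, later."""

    def __init__(self) -> None:
        self._filters: List[RowFilter] = []

    def add_filter(self, row_filter: RowFilter) -> None:
        self._filters.append(row_filter)

    def process(self, rows: List[Row]) -> List[Row]:
        for row_filter in self._filters:
            rows = row_filter(rows)
        return rows
```

Because filters run in the order they were added, a filter that removes users can change what a later filter sees.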

fetch_dataset(force=False) → None

Check whether the dataset is present; if not, download it.

Parameters: force (bool, optional) – If True, the dataset will be downloaded even if the file already exists. Defaults to False.
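The check-then-download logic, including the effect of `force`, can be sketched as below. The standalone function, its `file_path` and `download` parameters, and the boolean return value are assumptions added for illustration; the documented method returns None.

```python
import os
from typing import Callable


def fetch_dataset(file_path: str, download: Callable[[], None], force: bool = False) -> bool:
    """Run download() if the file is missing or force is True; return whether it ran."""
    if force or not os.path.exists(file_path):
        download()
        return True
    return False
```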

property file_path: str: File path of the dataset.

load(apply_filters=True) → InteractionMatrix

Loads data into an InteractionMatrix object.

Data is loaded into a DataFrame using the _load_dataframe() function. The resulting DataFrame is parsed into an InteractionMatrix object. If apply_filters is set to True, the registered filters are applied to the dataset and user and item ids are remapped. This is advised even if no filter is set, as it ensures that the user and item ids increase in order of time.

Parameters: apply_filters (bool, optional) – Whether to apply the registered filters and preprocessing. Defaults to True.
Returns: Resulting interaction matrix
Return type: InteractionMatrix
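What "ids increasing in order of time" means can be shown with a small plain-Python sketch: interactions are sorted by date, and each new user or item receives the next unused integer on first appearance. The function `map_ids_by_time` and the tuple layout are hypothetical, and the sketch assumes dates are strings of the form "YYYY-MM-DD HH:MM:SS". The library's own mapping operates on a DataFrame and may differ in detail.

```python
from typing import Dict, List, Tuple

Interaction = Tuple[str, str, str]  # (user_id, business_id, date)


def map_ids_by_time(interactions: List[Interaction]) -> List[Tuple[int, int, str]]:
    """Replace raw ids with integers assigned in order of first appearance over time."""
    user_ids: Dict[str, int] = {}
    item_ids: Dict[str, int] = {}
    mapped = []
    # Dates in "YYYY-MM-DD HH:MM:SS" form sort chronologically as plain text.
    for user, item, date in sorted(interactions, key=lambda row: row[2]):
        uid = user_ids.setdefault(user, len(user_ids))
        iid = item_ids.setdefault(item, len(item_ids))
        mapped.append((uid, iid, date))
    return mapped
```

With this scheme, a user with a smaller id never has a first review later than a user with a larger id.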

property name: Name of the object’s class.