datasets
Dataset module for public datasets in streaming experiments.
This module provides easy access to publicly available datasets for use in streaming experiments. Dataset classes are built on top of the Dataset base class, allowing for easy extension and customization.
Dataset Overview
Multiple public datasets are available from various sources. Additionally, a lightweight test dataset is provided for testing algorithm functionality.
Data Chunking Note
The MovieLens 100K dataset is available, but its data is not chunked into "blocks". Splitting it along a global timeline can therefore cause a chunk of data to be lost, so one of the other public datasets is recommended instead.
Available Datasets
AmazonBookDataset: Amazon Books reviews
AmazonMovieDataset: Amazon Movies reviews
AmazonMusicDataset: Amazon Music reviews
AmazonSubscriptionBoxesDataset: Amazon Subscription Boxes reviews
LastFMDataset: Last.FM music listening history
MovieLens100K: MovieLens 100K rating dataset
YelpDataset: Yelp business reviews
TestDataset: Lightweight dataset for testing algorithms
Loading Datasets
Basic loading:
from recnexteval.datasets import AmazonMusicDataset
dataset = AmazonMusicDataset()
data = dataset.load()
If the dataset file does not exist locally, it is downloaded and written to disk. Subsequent loads read the file from disk without downloading it again.
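The download-once behavior described above follows a common caching pattern. The sketch below illustrates that pattern with the standard library only. The helper `load_cached` and its `fetch` callable are hypothetical names chosen for illustration; they are not part of recnexteval, and the library's own implementation may differ.

```python
from pathlib import Path
from typing import Callable


def load_cached(path: Path, fetch: Callable[[], bytes]) -> bytes:
    """Return the contents of `path`, calling `fetch` only if the file is missing.

    Hypothetical helper: illustrates the pattern, not recnexteval's own code.
    """
    if not path.exists():
        # First load: obtain the data and write it to disk.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch())
    # Every load, including the first, reads the file from disk.
    return path.read_bytes()
```

Calling `load_cached` twice with the same path invokes `fetch` only once, because the second call finds the file already on disk.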
Using Default Filters
from recnexteval.datasets import AmazonMusicDataset
dataset = AmazonMusicDataset(use_default_filters=True)
data = dataset.load(apply_filters=True)
Each dataset can be loaded with its default filters applied. The default filters ensure that user and item IDs are assigned incrementally in chronological order. This is the recommended way to load a dataset.
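To make the ID ordering concrete, here is a minimal pure-Python sketch that assigns integer IDs by first appearance in timestamp order. The function `assign_ids_by_time` and the `(user, item, timestamp)` tuple layout are assumptions made for illustration; this is not how recnexteval implements its default filters.

```python
from __future__ import annotations

from typing import Hashable, Iterable


def assign_ids_by_time(
    interactions: Iterable[tuple[Hashable, Hashable, float]],
) -> tuple[dict[Hashable, int], dict[Hashable, int]]:
    """Map raw user and item keys to integer IDs in order of first appearance in time."""
    user_ids: dict[Hashable, int] = {}
    item_ids: dict[Hashable, int] = {}
    # Sort by timestamp so that earlier interactions receive smaller IDs.
    for user, item, _ts in sorted(interactions, key=lambda row: row[2]):
        # setdefault keeps the existing ID if the key was already seen.
        user_ids.setdefault(user, len(user_ids))
        item_ids.setdefault(item, len(item_ids))
    return user_ids, item_ids


# Illustrative data only: "bob" comes first in the input but not in time.
rows = [("bob", "x", 30.0), ("alice", "y", 10.0), ("bob", "y", 20.0)]
users, items = assign_ids_by_time(rows)
```

Here `alice` has the earliest interaction and so receives user ID 0, even though `bob` appears first in the input.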
Extending the Framework
To add custom datasets, inherit from Dataset and implement all abstract methods. Refer to the base class documentation for implementation details.
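The general shape of such an extension can be sketched with Python's `abc` module. Everything named below is a placeholder: `BaseDataset` stands in for the library's `Dataset` class, and `_fetch_rows` is a hypothetical abstract method, since the real abstract methods are listed only in the base class documentation.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Stand-in for the library's Dataset base class (hypothetical)."""

    @abstractmethod
    def _fetch_rows(self) -> list[tuple[str, str, float]]:
        """Return raw (user, item, timestamp) rows. Hypothetical method name."""

    def load(self) -> list[tuple[str, str, float]]:
        # Shared logic lives in the base class; subclasses only supply the rows.
        return sorted(self._fetch_rows(), key=lambda row: row[2])


class MyDataset(BaseDataset):
    """Custom dataset that implements every abstract method."""

    def _fetch_rows(self) -> list[tuple[str, str, float]]:
        return [("u2", "i1", 5.0), ("u1", "i1", 1.0)]


class IncompleteDataset(BaseDataset):
    """Omits _fetch_rows, so Python refuses to instantiate it."""


data = MyDataset().load()
```

Instantiating `IncompleteDataset()` raises `TypeError`, which is why every abstract method of the base class must be implemented.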
Related Modules
recnexteval.preprocessing: Data preprocessing and filtering utilities