streamsight.datasets

Dataset

The dataset module lets users easily use publicly available datasets in their experiments. The dataset classes are built on top of the Dataset class, allowing for easy extension and customization. This module provides a few datasets that are available from public sources. The programmer is free to add more datasets by implementing the abstract methods that the base class requires.
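The extension pattern described above can be sketched without importing the library. This is a minimal illustration under stated assumptions: SketchDataset, its _read_rows and _keep methods, and InMemoryDataset are hypothetical names invented here, not streamsight's actual Dataset API. Consult the Dataset class for the abstract methods it really requires.

```python
from abc import ABC, abstractmethod


class SketchDataset(ABC):
    """Hypothetical stand-in for a dataset base class (not streamsight's Dataset)."""

    def load(self):
        # The base class drives loading; subclasses only supply the raw rows.
        rows = self._read_rows()
        return [row for row in rows if self._keep(row)]

    @abstractmethod
    def _read_rows(self):
        """Return an iterable of (user_id, item_id, timestamp) tuples."""

    def _keep(self, row):
        # Default behaviour keeps every interaction; subclasses may override.
        return True


class InMemoryDataset(SketchDataset):
    """Hypothetical subclass that serves a few hard-coded interactions."""

    def _read_rows(self):
        return [("u1", "i1", 10), ("u2", "i1", 20), ("u1", "i2", 30)]


data = InMemoryDataset().load()
```

Because _read_rows is abstract, instantiating SketchDataset directly raises a TypeError, so a subclass cannot forget to implement it.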

In addition to the publicly available datasets, we provide a test dataset for testing purposes. It is a simple dataset that can be used to test the functionality of the algorithms.

While the MovieLens100K dataset is available in the module, we recommend that the programmer use the other publicly available datasets, as their data are not chunked into “blocks”. With chunked data, setting a global timeline to split the data could cause a chunk of data to be lost.

Dataset([filename, base_path, ...])

Represents a collaborative filtering dataset.

TestDataset([filename, base_path, ...])

AmazonBookDataset([filename, base_path, ...])

Handles Amazon Book dataset.

AmazonComputerDataset([filename, base_path, ...])

Handles Amazon Computer dataset.

AmazonMovieDataset([filename, base_path, ...])

Handles Amazon Movie dataset.

AmazonMusicDataset([filename, base_path, ...])

Handles Amazon Music dataset.

YelpDataset([filename, base_path, ...])

Yelp dataset

MovieLens100K([filename, base_path, ...])

Example

If the specified file does not exist, the dataset is downloaded and written to that file. Subsequent loads do not download the dataset again; the data is read from the file in the directory.

from streamsight.datasets import AmazonMusicDataset

dataset = AmazonMusicDataset()
data = dataset.load()
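The download-once behaviour described above can be sketched as follows. This is a generic illustration, not streamsight's implementation: load_with_cache and fake_fetch are hypothetical helpers, and fake_fetch stands in for a real download so that the example runs offline.

```python
import tempfile
from pathlib import Path


def load_with_cache(path, fetch):
    """Return the file's text, calling fetch() only when the file is missing."""
    path = Path(path)
    if not path.exists():
        # First load: create the directory and write the fetched content.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch())
    # Every load, including the first, reads from the file on disk.
    return path.read_text()


calls = []


def fake_fetch():
    # Hypothetical stand-in for a real download; records each invocation.
    calls.append(1)
    return "user,item,timestamp\n1,10,100\n"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "sample.csv"
    first = load_with_cache(target, fake_fetch)
    second = load_with_cache(target, fake_fetch)
```

The second call finds the file already present, so fake_fetch runs only once and both calls return the same content.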

Each dataset can be loaded with default filters applied. To use the default filters, set the use_default_filters parameter to True. The dataset can also be loaded without filters and without ID preprocessing by calling the load() method with the apply_filters parameter set to False. Loading with filters applied is recommended, as it ensures that the user and item IDs are incrementing in the order of time.

from streamsight.datasets import AmazonMusicDataset

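# Construct the dataset with its default filters enabled
dataset = AmazonMusicDataset(use_default_filters=True)
# Load the raw data, skipping the filters and the ID preprocessing
data = dataset.load(apply_filters=False)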
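To make "IDs incrementing in the order of time" concrete, the following is a minimal sketch of one way such a relabelling could work. It is an assumption for illustration only: remap_ids_by_time is a hypothetical helper, and the actual filters in streamsight.preprocessing may behave differently.

```python
def remap_ids_by_time(interactions):
    """Relabel user and item ids as 0, 1, 2, ... by first appearance in time."""
    user_map, item_map = {}, {}
    remapped = []
    # Walk the interactions chronologically so earlier ids get smaller labels.
    for user, item, ts in sorted(interactions, key=lambda row: row[2]):
        uid = user_map.setdefault(user, len(user_map))
        iid = item_map.setdefault(item, len(item_map))
        remapped.append((uid, iid, ts))
    return remapped


# (user, item, timestamp) tuples, deliberately out of chronological order
raw = [("bob", "x", 30), ("amy", "y", 10), ("bob", "y", 20), ("cat", "x", 40)]
remapped = remap_ids_by_time(raw)
# remapped == [(0, 0, 10), (1, 0, 20), (1, 1, 30), (2, 1, 40)]
```

With this labelling, any user or item seen up to a given timestamp has an ID smaller than one first seen later.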

For an overview of the available filters, see streamsight.preprocessing.