splitters

Splitter

Bases: ABC

Abstract base class for dataset splitters.

Implementations should split an InteractionMatrix into two parts according to a splitting condition (for example, by timestamp).

Source code in src/recnexteval/settings/splitters/base.py
class Splitter(ABC):
    """Abstract base class for dataset splitters.

    Implementations should split an :class:`InteractionMatrix` into two
    parts according to a splitting condition (for example, by timestamp).
    """

    def __init__(self) -> None:
        pass

    @property
    def name(self) -> str:
        """Return the class name of the splitter.

        Returns:
            The splitter class name.
        """
        return self.__class__.__name__

    @property
    def identifier(self) -> str:
        """Return a string identifier including the splitter's parameters.

        The identifier includes the class name and a comma-separated list of
        attribute name/value pairs from `self.__dict__`.

        Returns:
            Identifier string like `Name(k1=v1,k2=v2)`.
        """

        paramstring = ",".join((f"{k}={v}" for k, v in self.__dict__.items()))
        return self.name + f"({paramstring})"

    @abstractmethod
    def split(self, data: InteractionMatrix) -> tuple[InteractionMatrix, InteractionMatrix]:
        """Split an interaction matrix into two parts.

        Args:
            data: The interaction dataset to split.

        Returns:
            A pair of `InteractionMatrix` objects representing the two parts.

        Raises:
            NotImplementedError: If the concrete splitter does not implement this method.
        """

        raise NotImplementedError(f"{self.name} must implement the split method.")

name property

Return the class name of the splitter.

Returns:

Type Description
str

The splitter class name.

identifier property

Return a string identifier including the splitter's parameters.

The identifier includes the class name and a comma-separated list of attribute name/value pairs from self.__dict__.

Returns:

Type Description
str

Identifier string like Name(k1=v1,k2=v2).
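
To make the identifier format concrete, here is a minimal sketch that copies only the `name` and `identifier` logic from the source listing into a stand-in class. `SketchSplitter` and `DemoSplitter` are hypothetical names used for illustration and are not part of the library.

```python
class SketchSplitter:
    """Stand-in that mirrors only the `name` and `identifier` properties."""

    @property
    def name(self) -> str:
        return self.__class__.__name__

    @property
    def identifier(self) -> str:
        # Instance attributes are joined in the order they were assigned.
        paramstring = ",".join(f"{k}={v}" for k, v in self.__dict__.items())
        return self.name + f"({paramstring})"


class DemoSplitter(SketchSplitter):
    """Hypothetical subclass with two parameters."""

    def __init__(self, n: int, n_seq_data: int = 1) -> None:
        self.n = n
        self.n_seq_data = n_seq_data


demo_identifier = DemoSplitter(n=2).identifier
print(demo_identifier)  # DemoSplitter(n=2,n_seq_data=1)
```

Because the string is built from `self.__dict__`, any attribute assigned on the instance later also shows up in the identifier.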

split(data) abstractmethod

Split an interaction matrix into two parts.

Parameters:

Name Type Description Default
data InteractionMatrix

The interaction dataset to split.

required

Returns:

Type Description
tuple[InteractionMatrix, InteractionMatrix]

A pair of InteractionMatrix objects representing the two parts.

Raises:

Type Description
NotImplementedError

If the concrete splitter does not implement this method.

Source code in src/recnexteval/settings/splitters/base.py
@abstractmethod
def split(self, data: InteractionMatrix) -> tuple[InteractionMatrix, InteractionMatrix]:
    """Split an interaction matrix into two parts.

    Args:
        data: The interaction dataset to split.

    Returns:
        A pair of `InteractionMatrix` objects representing the two parts.

    Raises:
        NotImplementedError: If the concrete splitter does not implement this method.
    """

    raise NotImplementedError(f"{self.name} must implement the split method.")
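
A note on when this error can actually surface. Because `split` is decorated with `@abstractmethod`, standard Python `abc` behaviour is to refuse instantiation of a subclass that does not override it, raising `TypeError`. The `NotImplementedError` is therefore reached only when a subclass overrides `split` and delegates to the base implementation. The sketch below shows both cases with hypothetical stand-in classes, not the library's own.

```python
from abc import ABC, abstractmethod


class BaseSketch(ABC):
    """Stand-in for an abstract splitter base class."""

    @abstractmethod
    def split(self, data):
        raise NotImplementedError("subclasses must implement split")


class Incomplete(BaseSketch):
    """Does not override split, so it cannot be instantiated."""


class Delegating(BaseSketch):
    """Overrides split but only calls the abstract base implementation."""

    def split(self, data):
        return super().split(data)


try:
    Incomplete()
    instantiation_error = None
except TypeError as exc:
    instantiation_error = exc

try:
    Delegating().split([])
    delegation_error = None
except NotImplementedError as exc:
    delegation_error = exc

print(type(instantiation_error).__name__)
print(type(delegation_error).__name__)
```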

NLastInteractionSplitter

Bases: Splitter

Splits each user's n most recent interactions into the second return value. From the earlier interactions, the last n_seq_data per user go into the first return value.

Parameters:

  • n (int, required): Number of most recent interactions to assign to the second return value.
  • n_seq_data (int, default 1): Number of last interactions to provide as unlabeled data for the model to make predictions.

Raises:

Type Description
ValueError

If n is less than 1, as this would cause the ground truth data to be empty.
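
The splitting rule can be illustrated without the library. The sketch below works on plain `(user_id, item_id, timestamp)` tuples instead of an `InteractionMatrix`; the helper names `last_n_per_user` and `n_last_split` are hypothetical, and treating `data - future_interaction` as "all rows not in the future part" is an assumption about the matrix subtraction.

```python
from collections import defaultdict


def last_n_per_user(rows, n):
    """Keep each user's n most recent rows, ordered by timestamp."""
    by_user = defaultdict(list)
    for row in sorted(rows, key=lambda r: r[2]):
        by_user[row[0]].append(row)
    return [row for user_rows in by_user.values() for row in user_rows[-n:]]


def n_last_split(rows, n, n_seq_data=1):
    """Return (past, future) following the n-last rule described above."""
    if n < 1:
        raise ValueError(f"n must be greater than 0, got {n}.")
    future = last_n_per_user(rows, n)
    # Assumed meaning of `data - future_interaction`: drop the future rows.
    remaining = [row for row in rows if row not in future]
    past = last_n_per_user(remaining, n_seq_data)
    return past, future


rows = [
    ("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
    ("u2", "x", 5), ("u2", "y", 6),
]
past, future = n_last_split(rows, n=1, n_seq_data=1)
print(sorted(past))    # [('u1', 'b', 2), ('u2', 'x', 5)]
print(sorted(future))  # [('u1', 'c', 3), ('u2', 'y', 6)]
```

Note that `("u1", "a", 1)` appears in neither part: with `n_seq_data=1` only the single most recent earlier interaction per user is kept.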

Source code in src/recnexteval/settings/splitters/n_last.py
class NLastInteractionSplitter(Splitter):
    """Splits the n most recent interactions of a user into the second return value,
    and earlier interactions into the first.

    Args:
        n (int): Number of most recent actions to assign to the second return value.
        n_seq_data (int, optional): Number of last interactions to provide as unlabeled data
            for model to make prediction. Defaults to 1.

    Raises:
        ValueError: If n is less than 1, as this would cause the ground truth data to be empty.
    """

    def __init__(self, n: int, n_seq_data: int = 1) -> None:
        super().__init__()
        if n < 1:
            raise ValueError(
                f"n must be greater than 0, got {n}. "
                f"Values for n < 1 will cause the ground truth data to be empty."
            )
        self.n = n
        self.n_seq_data = n_seq_data

    def split(self, data: InteractionMatrix) -> tuple[InteractionMatrix, InteractionMatrix]:
        future_interaction = data.get_users_n_last_interaction(self.n)
        past_interaction = data - future_interaction
        past_interaction = past_interaction.get_users_n_last_interaction(self.n_seq_data)
        logger.debug(f"{self.identifier} has completed the split")

        return past_interaction, future_interaction

n = n instance-attribute

n_seq_data = n_seq_data instance-attribute

split(data)

Source code in src/recnexteval/settings/splitters/n_last.py
def split(self, data: InteractionMatrix) -> tuple[InteractionMatrix, InteractionMatrix]:
    future_interaction = data.get_users_n_last_interaction(self.n)
    past_interaction = data - future_interaction
    past_interaction = past_interaction.get_users_n_last_interaction(self.n_seq_data)
    logger.debug(f"{self.identifier} has completed the split")

    return past_interaction, future_interaction

NLastInteractionTimestampSplitter

Bases: TimestampSplitter

Splits with n last interactions based on a timestamp.

Splits the data into unlabeled and ground truth data based on a timestamp. By default, the unlabeled (past) data contains each user's last n_seq_data interactions before the timestamp t, and the ground truth (future) data contains the interactions at or after t.

Attributes:

  • past_interaction: Unlabeled data. Interval is [0, t).
  • future_interaction: Ground truth data. Interval is [t, t + t_upper) or [t, inf).
  • n_seq_data: Number of last interactions to provide as data for the model to make predictions. These are past interactions from before the timestamp t.

Parameters:

  • t (int, required): Timestamp to split on, in seconds since the epoch.
  • t_upper (None | int, default None): Seconds past t; upper bound on the timestamp of interactions. None means no upper bound.
  • n_seq_data (int, default 1): Number of last interactions to provide as data for the model to make predictions.
  • include_all_past_data (bool, default False): If True, include all past data in past_interaction.

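
The interplay of these parameters can be sketched on plain `(user_id, item_id, timestamp)` tuples. This is an illustration, not the library's implementation: the function name `n_last_timestamp_split` is hypothetical, and it assumes that the `user_in` argument seen in the source restricts the past data to users who appear in the future window.

```python
def n_last_timestamp_split(rows, t, t_upper=None, n_seq_data=1,
                           include_all_past_data=False):
    """Return (past, future) for a timestamp split with an n-last history."""
    upper = float("inf") if t_upper is None else t + t_upper
    # Future window is half-open: t <= timestamp < t + t_upper.
    future = [r for r in rows if t <= r[2] < upper]
    before_t = [r for r in rows if r[2] < t]
    if include_all_past_data:
        return before_t, future

    # Assumption: only users present in the future window get a history.
    future_users = {r[0] for r in future}
    past = []
    for user in sorted(future_users):
        history = sorted((r for r in before_t if r[0] == user),
                         key=lambda r: r[2])
        past.extend(history[-n_seq_data:])
    return past, future


rows = [
    ("u1", "a", 10), ("u1", "b", 20), ("u1", "c", 35),
    ("u2", "x", 15), ("u2", "y", 25),
    ("u3", "z", 40),
]
past, future = n_last_timestamp_split(rows, t=30)
all_past, _ = n_last_timestamp_split(rows, t=30, include_all_past_data=True)
_, bounded_future = n_last_timestamp_split(rows, t=30, t_upper=8)
print(past)            # [('u1', 'b', 20)]
print(future)          # [('u1', 'c', 35), ('u3', 'z', 40)]
print(bounded_future)  # [('u1', 'c', 35)]
```

In this sketch `u2` has no interaction at or after `t`, so it contributes no history unless `include_all_past_data` is set.
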
Source code in src/recnexteval/settings/splitters/n_last_timestamp.py
class NLastInteractionTimestampSplitter(TimestampSplitter):
    """Splits with n last interactions based on a timestamp.

    Splits the data into unlabeled and ground truth data based on a timestamp.
    Historical data contains last `n_seq_data` interactions before the timestamp `t`
    and the future interaction contains interactions after the timestamp `t`.


    Attributes:
        past_interaction: List of unlabeled data. Interval is `[0, t)`.
        future_interaction: Ground truth data.
            Interval is `[t, t + t_upper)` or `[t, inf)`.
        n_seq_data: Number of last interactions to provide as data for model to make prediction.
            These interactions are past interactions from before the timestamp `t`.

    Args:
        t: Timestamp to split on in seconds since epoch.
        t_upper: Seconds past t. Upper bound on the timestamp
            of interactions. Defaults to None (infinity).
        n_seq_data: Number of last interactions to provide as data
            for model to make prediction. Defaults to 1.
        include_all_past_data: If True, include all past data in the past_interaction.
            Defaults to False.
    """

    def __init__(
        self,
        t: int,
        t_upper: None | int = None,
        n_seq_data: int = 1,
        include_all_past_data: bool = False,
    ) -> None:
        super().__init__(t=t, t_lower=None, t_upper=t_upper)
        self.n_seq_data = n_seq_data
        self.include_all_past_data = include_all_past_data

    def update_split_point(self, t: int) -> None:
        logger.debug(f"{self.identifier} - Updating split point to t={t}")
        self.t = t

    def split(self, data: InteractionMatrix) -> tuple[InteractionMatrix, InteractionMatrix]:
        """Splits data such that the following definition holds:

        - past_interaction: List of unlabeled data. Interval is `[0, t)`.
        - future_interaction: Ground truth data.
            Interval is `[t, t + t_upper)` or `[t, inf)`.

        Args:
            data: Interaction matrix to be split. Must contain timestamps.

        Returns:
            A 2-tuple containing the `past_interaction` and `future_interaction` matrices.
        """
        if self.t_upper is None:
            future_interaction = data.timestamps_gte(timestamp=self.t)
        else:
            future_interaction = data.timestamps_lt(timestamp=self.t + self.t_upper).timestamps_gte(timestamp=self.t)

        if self.include_all_past_data:
            past_interaction = data.timestamps_lt(timestamp=self.t)
        else:
            past_interaction = data.get_users_n_last_interaction(
                n_seq_data=self.n_seq_data, t_upper=self.t, user_in=future_interaction.user_ids
            )

        logger.debug(f"{self.identifier} has completed the split")
        return past_interaction, future_interaction

t = t instance-attribute

t_lower = t_lower instance-attribute

t_upper = t_upper instance-attribute

n_seq_data = n_seq_data instance-attribute

include_all_past_data = include_all_past_data instance-attribute

update_split_point(t)

Source code in src/recnexteval/settings/splitters/n_last_timestamp.py
def update_split_point(self, t: int) -> None:
    logger.debug(f"{self.identifier} - Updating split point to t={t}")
    self.t = t
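
`update_split_point` only reassigns `t`. One plausible use, which is an assumption here rather than something the source states, is to advance the split point across consecutive evaluation windows. The sketch uses a hypothetical stand-in class, `WindowSketch`, that carries just `t`, `t_upper`, and the update method.

```python
class WindowSketch:
    """Hypothetical stand-in exposing only t, t_upper and update_split_point."""

    def __init__(self, t: int, t_upper: int) -> None:
        self.t = t
        self.t_upper = t_upper

    def update_split_point(self, t: int) -> None:
        self.t = t

    def future_window(self) -> tuple[int, int]:
        # Half-open window [t, t + t_upper) used for the future part.
        return (self.t, self.t + self.t_upper)


splitter = WindowSketch(t=100, t_upper=50)
windows = []
for _ in range(3):
    windows.append(splitter.future_window())
    # Move the split point to the end of the window just evaluated.
    splitter.update_split_point(splitter.t + splitter.t_upper)

print(windows)  # [(100, 150), (150, 200), (200, 250)]
```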

split(data)

Splits data such that the following definition holds:

  • past_interaction: List of unlabeled data. Interval is [0, t).
  • future_interaction: Ground truth data. Interval is [t, t + t_upper) or [t, inf).

Parameters:

Name Type Description Default
data InteractionMatrix

Interaction matrix to be split. Must contain timestamps.

required

Returns:

Type Description
tuple[InteractionMatrix, InteractionMatrix]

A 2-tuple containing the past_interaction and future_interaction matrices.

Source code in src/recnexteval/settings/splitters/n_last_timestamp.py
def split(self, data: InteractionMatrix) -> tuple[InteractionMatrix, InteractionMatrix]:
    """Splits data such that the following definition holds:

    - past_interaction: List of unlabeled data. Interval is `[0, t)`.
    - future_interaction: Ground truth data.
        Interval is `[t, t + t_upper)` or `[t, inf)`.

    Args:
        data: Interaction matrix to be split. Must contain timestamps.

    Returns:
        A 2-tuple containing the `past_interaction` and `future_interaction` matrices.
    """
    if self.t_upper is None:
        future_interaction = data.timestamps_gte(timestamp=self.t)
    else:
        future_interaction = data.timestamps_lt(timestamp=self.t + self.t_upper).timestamps_gte(timestamp=self.t)

    if self.include_all_past_data:
        past_interaction = data.timestamps_lt(timestamp=self.t)
    else:
        past_interaction = data.get_users_n_last_interaction(
            n_seq_data=self.n_seq_data, t_upper=self.t, user_in=future_interaction.user_ids
        )

    logger.debug(f"{self.identifier} has completed the split")
    return past_interaction, future_interaction

TimestampSplitter

Bases: Splitter

Split an interaction dataset by timestamp.

The splitter divides the data into two parts:

  1. Interactions with timestamps in the interval [t - t_lower, t), representing past interactions.
  2. Interactions with timestamps in the interval [t, t + t_upper), representing future interactions.

If t_lower or t_upper are not provided, they default to infinity, meaning the corresponding interval is unbounded on that side.

Note that a user can appear in both the past and future interaction sets.
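
The interval logic can be sketched on plain `(user_id, item_id, timestamp)` tuples. The bounds follow the comparisons in the `split` source (`timestamps_gte` and `timestamps_lt`), so both windows are half-open. The function name `timestamp_split` is hypothetical and stands in for the method on an `InteractionMatrix`.

```python
def timestamp_split(rows, t, t_lower=None, t_upper=None):
    """Return (past, future) using half-open timestamp windows around t."""
    lower = float("-inf") if t_lower is None else t - t_lower
    upper = float("inf") if t_upper is None else t + t_upper
    past = [r for r in rows if lower <= r[2] < t]     # t - t_lower <= ts < t
    future = [r for r in rows if t <= r[2] < upper]   # t <= ts < t + t_upper
    return past, future


rows = [
    ("u1", "a", 5), ("u1", "b", 20), ("u2", "x", 29),
    ("u1", "c", 30), ("u2", "y", 39), ("u2", "z", 40),
]
past, future = timestamp_split(rows, t=30, t_lower=10, t_upper=10)
print(past)    # [('u1', 'b', 20), ('u2', 'x', 29)]
print(future)  # [('u1', 'c', 30), ('u2', 'y', 39)]
```

The row at timestamp 40 is excluded because the upper bound `t + t_upper` is exclusive, and both `u1` and `u2` appear in the past and future parts.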

Attributes:

  • past_interaction (InteractionMatrix): Interactions in the interval [t - t_lower, t), or [0, t) when t_lower is None, representing unlabeled data for prediction.
  • future_interaction (InteractionMatrix): Interactions in the interval [t, t + t_upper) or [t, inf), used for training the model.

Parameters:

  • t (int, required): Timestamp to split on, in seconds since the Unix epoch.
  • t_lower (None | int, default None): Seconds before t to include in the past interactions. If None, the interval is unbounded.
  • t_upper (None | int, default None): Seconds after t to include in the future interactions. If None, the interval is unbounded.

Source code in src/recnexteval/settings/splitters/timestamp.py
class TimestampSplitter(Splitter):
    """Split an interaction dataset by timestamp.

    The splitter divides the data into two parts:

    1. Interactions with timestamps in the interval `[t - t_lower, t)`,
       representing past interactions.
    2. Interactions with timestamps in the interval `[t, t + t_upper)`,
       representing future interactions.

    If `t_lower` or `t_upper` are not provided, they default to infinity,
    meaning the corresponding interval is unbounded on that side.

    Note that a user can appear in both the past and future interaction sets.

    Attributes:
        past_interaction (InteractionMatrix): Interactions in the interval
            `[t - t_lower, t)`, or `[0, t)` when `t_lower` is None,
            representing unlabeled data for prediction.
        future_interaction (InteractionMatrix): Interactions in the interval
            `[t, t + t_upper)` or `[t, inf)`, used for training the model.

    Args:
        t: Timestamp to split on, in seconds since the Unix epoch.
        t_lower: Seconds before `t` to include in
            the past interactions. If None, the interval is unbounded.
            Defaults to None.
        t_upper: Seconds after `t` to include in
            the future interactions. If None, the interval is unbounded.
            Defaults to None.
    """

    def __init__(
        self,
        t: int,
        t_lower: None | int = None,
        t_upper: None | int = None,
    ) -> None:
        super().__init__()
        self.t = t
        self.t_lower = t_lower
        self.t_upper = t_upper

    def split(self, data: InteractionMatrix) -> tuple[InteractionMatrix, InteractionMatrix]:
        """Split the interaction data by timestamp.

        The past and future subsets of the input data are returned;
        the method does not store them on the splitter instance.

        Args:
            data: The interaction dataset to split.
                Must include timestamp information.

        Returns:
            A pair containing the past interactions and future interactions.
        """

        if self.t_lower is None:
            # timestamp < t
            past_interaction = data.timestamps_lt(self.t)
        else:
            # t - t_lower <= timestamp < t
            past_interaction = data.timestamps_lt(self.t).timestamps_gte(self.t - self.t_lower)

        if self.t_upper is None:
            # timestamp >= t
            future_interaction = data.timestamps_gte(self.t)
        else:
            # t <= timestamp < t + t_upper
            future_interaction = data.timestamps_gte(self.t).timestamps_lt(self.t + self.t_upper)

        logger.debug(f"{self.identifier} has completed the split")

        return past_interaction, future_interaction

t = t instance-attribute

t_lower = t_lower instance-attribute

t_upper = t_upper instance-attribute

split(data)

Split the interaction data by timestamp.

The past and future subsets of the input data are returned; the method does not store them on the splitter instance.

Parameters:

Name Type Description Default
data InteractionMatrix

The interaction dataset to split. Must include timestamp information.

required

Returns:

Type Description
tuple[InteractionMatrix, InteractionMatrix]

A pair containing the past interactions and future interactions.

Source code in src/recnexteval/settings/splitters/timestamp.py
def split(self, data: InteractionMatrix) -> tuple[InteractionMatrix, InteractionMatrix]:
    """Split the interaction data by timestamp.

    The past and future subsets of the input data are returned;
    the method does not store them on the splitter instance.

    Args:
        data: The interaction dataset to split.
            Must include timestamp information.

    Returns:
        A pair containing the past interactions and future interactions.
    """

    if self.t_lower is None:
        # timestamp < t
        past_interaction = data.timestamps_lt(self.t)
    else:
        # t - t_lower <= timestamp < t
        past_interaction = data.timestamps_lt(self.t).timestamps_gte(self.t - self.t_lower)

    if self.t_upper is None:
        # timestamp >= t
        future_interaction = data.timestamps_gte(self.t)
    else:
        # t <= timestamp < t + t_upper
        future_interaction = data.timestamps_gte(self.t).timestamps_lt(self.t + self.t_upper)

    logger.debug(f"{self.identifier} has completed the split")

    return past_interaction, future_interaction