Data Utilities

These are general-purpose data processing utilities.

Building Ratings Matrices

…(ratings, scipy=False, *, torch=False, users=None, items=None)

Convert a rating table to a sparse matrix of ratings.

  • ratings (DataFrame) – a data table of (user, item, rating) triples.

  • scipy (bool | Literal['csr', 'coo']) – if True or 'csr', return a SciPy CSR matrix instead of a CSR; if 'coo', return a SciPy COO matrix.

  • torch (bool) – if True, return a PyTorch sparse tensor instead of a CSR.

  • users – an index of user IDs.

  • items – an index of item IDs.


a named tuple containing the sparse matrix, user index, and item index.
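
As a rough illustration of the conversion described above, the sketch below builds a SciPy CSR matrix from a small rating table using only pandas and SciPy. It does not call this library's function: the helper name `to_sparse` and the column names `user`, `item`, and `rating` are assumptions made for the example.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def to_sparse(ratings, users=None, items=None):
    """Hypothetical helper: convert (user, item, rating) rows to a CSR matrix."""
    # Build the ID indices from the data unless the caller supplies them.
    if users is None:
        users = pd.Index(ratings["user"].unique(), name="user").sort_values()
    if items is None:
        items = pd.Index(ratings["item"].unique(), name="item").sort_values()

    # Translate IDs into row and column positions (-1 marks an unknown ID).
    rows = users.get_indexer(ratings["user"])
    cols = items.get_indexer(ratings["item"])
    if (rows < 0).any() or (cols < 0).any():
        raise ValueError("ratings contain IDs missing from the supplied index")

    matrix = csr_matrix(
        (ratings["rating"].to_numpy(dtype="float64"), (rows, cols)),
        shape=(len(users), len(items)),
    )
    return matrix, users, items


ratings = pd.DataFrame(
    {
        "user": [10, 10, 20, 30],
        "item": [7, 9, 7, 8],
        "rating": [4.0, 3.5, 5.0, 2.0],
    }
)
matrix, users, items = to_sparse(ratings)
```

Passing pre-built `users` and `items` indices keeps row and column positions stable across several matrices, for example a training and a test split.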

Return type:


class …(matrix, users, items)

Bases: NamedTuple, Generic[M]

A rating matrix with associated indices.


  • matrix – the rating matrix, with users on rows and items on columns.

  • users – mapping from user IDs to row numbers.

  • items – mapping from item IDs to column numbers.



matrix: M

Alias for field number 0

users: Index

Alias for field number 1

items: Index

Alias for field number 2
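
To show how the three fields fit together, here is a small stand-in named tuple with the same field names. It is not this library's class: `DemoMatrix` and the sample IDs are made up for illustration. The two indices translate between external IDs and matrix positions in both directions.

```python
from typing import Any, NamedTuple

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


class DemoMatrix(NamedTuple):
    """Hypothetical stand-in with the same three fields."""

    matrix: Any       # users on rows, items on columns
    users: pd.Index   # user IDs, in row order
    items: pd.Index   # item IDs, in column order


users = pd.Index(["alice", "bob"], name="user")
items = pd.Index(["i1", "i2", "i3"], name="item")
dense = np.array([[5.0, 0.0, 3.0], [0.0, 4.0, 0.0]])
rm = DemoMatrix(csr_matrix(dense), users, items)

# Named-tuple fields can be unpacked positionally or read by name.
matrix, user_index, item_index = rm

# ID -> position: find bob's row and the column of item "i2".
row = rm.users.get_loc("bob")
col = rm.items.get_loc("i2")
bob_rating = rm.matrix[row, col]

# Position -> ID: the stored column indices in bob's row, mapped back to item IDs.
bob_items = rm.items[rm.matrix[row].indices]
```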

Sampling Utilities

This module provides support functions for data sampling procedures used in model training.

neg_sample(…, uv, sample)

Sample negative examples from a user-item matrix. For each user in uv, it samples one item that the user has not rated, using rejection sampling.

While this is embarrassingly parallel, we do not parallelize it because it is often called from code that is already running in parallel.

This returns both the items and the sample counts for debugging:

neg_items, counts = neg_sample(matrix, users, sample_unweighted)

Two arrays:

  1. The sampled negative item IDs.

  2. An array of sample counts: the number of draws needed to obtain each item. This is useful for diagnosing sampling inefficiency.

Return type:

numpy.ndarray, numpy.ndarray
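
The rejection-sampling loop described above can be sketched in plain NumPy and SciPy as follows. This is an illustration of the technique, not this library's implementation: the function names `neg_sample_sketch` and `uniform_candidate`, the explicit `rng` argument, and the use of a SciPy CSR matrix are all assumptions. The sketch also assumes every user has at least one unrated item, since the loop would otherwise never terminate.

```python
import numpy as np
from scipy.sparse import csr_matrix


def uniform_candidate(matrix, rng):
    """Hypothetical candidate sampler: any column, uniformly at random."""
    return int(rng.integers(0, matrix.shape[1]))


def neg_sample_sketch(matrix, users, sample, rng):
    """For each user row in `users`, draw one item the user has not rated."""
    neg_items = np.empty(len(users), dtype=np.int64)
    counts = np.zeros(len(users), dtype=np.int64)
    for pos, u in enumerate(users):
        # Column indices of the items this user has rated.
        start, end = matrix.indptr[u], matrix.indptr[u + 1]
        rated = set(matrix.indices[start:end].tolist())
        while True:
            candidate = sample(matrix, rng)
            counts[pos] += 1
            if candidate not in rated:  # accept; otherwise reject and redraw
                neg_items[pos] = candidate
                break
    return neg_items, counts


# 3 users x 4 items; nonzero entries are observed ratings.
matrix = csr_matrix(np.array([
    [5.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [1.0, 2.0, 3.0, 0.0],
]))
rng = np.random.default_rng(42)
users = np.array([0, 1, 2, 2])
neg_items, counts = neg_sample_sketch(matrix, users, uniform_candidate, rng)
```

A count well above 1 for a user means most candidates were rejected, which is the inefficiency the counts are meant to expose. In this toy matrix the last user has rated three of four items, so that user's draws tend to need more attempts.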

Candidate sampling function for use with neg_sample(). It samples items uniformly at random.

Candidate sampling function for use with neg_sample(). It samples items proportionally to their popularity.
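
One common way to draw items in proportion to their popularity from a CSR matrix is to pick a stored entry uniformly at random and return its column, since an item with k ratings owns k of the stored entries. The sketch below contrasts that with uniform sampling. The function names, the `rng` argument, and the SciPy CSR input are assumptions for illustration, not this library's signatures.

```python
import numpy as np
from scipy.sparse import csr_matrix


def uniform_item(matrix, rng):
    """Hypothetical sampler: every item column is equally likely."""
    return int(rng.integers(0, matrix.shape[1]))


def popular_item(matrix, rng):
    """Hypothetical sampler: probability proportional to rating count."""
    entry = rng.integers(0, matrix.nnz)   # a random stored rating
    return int(matrix.indices[entry])     # the item that rating belongs to


# Item 0 has 4 ratings, item 1 has 1, item 2 has none.
matrix = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]))
rng = np.random.default_rng(0)
n_draws = 5000

# Empirical frequency of each item under the two samplers.
uniform_freq = np.bincount(
    [uniform_item(matrix, rng) for _ in range(n_draws)], minlength=3
) / n_draws
popular_freq = np.bincount(
    [popular_item(matrix, rng) for _ in range(n_draws)], minlength=3
) / n_draws
```

With this approach an item that has no ratings is never drawn, whereas the uniform sampler returns it as often as any other item.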