k-NN Collaborative Filtering
LKPY provides user- and item-based classical k-NN collaborative filtering implementations. These lightly-configurable implementations are intended to capture the behavior of the Java-based LensKit implementations to provide a good upgrade path and enable basic experiments out of the box.
There are two primary modes in which you can use these algorithms. When using explicit feedback (rating values), you usually want the defaults of weighted-average aggregation and mean-centering normalization. This is the default mode, and can be selected explicitly by passing feedback='explicit' to the class constructor.
With implicit feedback (unary data such as clicks and purchases, typically represented with rating values of 1 for positive items), the usual design is sum aggregation and no centering. This can be selected with feedback='implicit', which also configures the algorithm to ignore rating values (when present) and treat every rating as 1:
implicit_knn = ItemItem(20, feedback='implicit')
Attempting to center data in which every value is the same (all 1, for example) will typically produce invalid results. The item-based implementation has diagnostics to warn you about this.
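To see why centering fails on unary data, here is a minimal sketch in plain Python. The helpers mean_center and cosine are hypothetical names for illustration and are not part of LensKit. Mean-centering a vector whose entries are all 1 leaves only zeros, so a cosine similarity computed from it has a zero norm in its denominator and is undefined.

```python
import math

def mean_center(values):
    """Subtract the vector's mean from each entry."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cosine(a, b):
    """Cosine similarity of two aligned vectors; None when a norm is zero."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return None  # undefined: a zero vector carries no direction
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

unary = [1.0, 1.0, 1.0]      # implicit feedback: every observed value is 1
explicit = [5.0, 3.0, 4.0]   # explicit ratings with real variation

centered_unary = mean_center(unary)        # every entry becomes 0.0
centered_explicit = mean_center(explicit)  # variation survives centering
sim = cosine(centered_unary, centered_explicit)  # undefined, reported as None
```

This is the situation that feedback='implicit' avoids by turning centering off.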
The feedback option only sets defaults; the algorithm can be further configured (e.g. to re-enable rating values) with additional parameters to the constructor.
New in version 0.14: The feedback option and the ability to ignore rating values were added in LensKit 0.14. In previous versions, each option had to be configured individually.
Item-based k-NN
This is LensKit’s item-based k-NN model, based on the description by Deshpande and Karypis [DK04].
- class lenskit.algorithms.item_knn.ItemItem(nnbrs, min_nbrs=1, min_sim=1e-06, save_nbrs=None, feedback='explicit', **kwargs)
Bases: Predictor
Item-item nearest-neighbor collaborative filtering with ratings. This item-item implementation is not terribly configurable; it hard-codes design decisions found to work well in the previous Java-based LensKit code [ELKR11]. This implementation is based on the description of item-based CF by Deshpande and Karypis [DK04], and produces results equivalent to Java LensKit.
The k-NN predictor supports several aggregate functions:
weighted-average
The weighted average of the user’s rating values, using item-item similarities as weights.
sum
The sum of the similarities between the target item and the user’s rated items, regardless of the rating the user gave the items.
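The two aggregates can be sketched in a few lines of plain Python. This is an illustration only, not LensKit's implementation: the function names are hypothetical, the similarities are assumed to be positive, and mean-centering is left out for brevity.

```python
def weighted_average(sims, ratings):
    """Average of the user's ratings, weighted by item-item similarity."""
    total_weight = sum(sims)
    if total_weight == 0:
        return None  # no usable neighbors
    return sum(s * r for s, r in zip(sims, ratings)) / total_weight

def similarity_sum(sims):
    """Sum of similarities; the user's rating values play no role."""
    return sum(sims)

# Similarities between the target item and three items the user has rated,
# together with the ratings the user gave those three items.
sims = [0.5, 0.25, 0.25]
ratings = [4.0, 2.0, 5.0]

explicit_score = weighted_average(sims, ratings)  # on the rating scale
implicit_score = similarity_sum(sims)             # a ranking score, not a rating
```

The weighted average stays on the rating scale, while the sum is only meaningful for ranking items against each other.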
- Parameters:
nnbrs (int) – the maximum number of neighbors for scoring each item (None for unlimited)
min_nbrs (int) – the minimum number of neighbors for scoring each item
min_sim (float) – minimum similarity threshold for considering a neighbor
save_nbrs (float) – the number of neighbors to save per item in the trained model (None for unlimited)
feedback (str) –
Control how feedback should be interpreted. Specifies defaults for the other settings, which can be overridden individually; can be one of the following values:
explicit
Configure for explicit-feedback mode: use rating values, center ratings, and use the weighted-average aggregate method for prediction. This is the default setting.
implicit
Configure for implicit-feedback mode: ignore rating values, do not center ratings, and use the sum aggregate method for prediction.
center (bool) – whether to normalize (mean-center) rating vectors prior to computing similarities and aggregating user rating values. Defaults to True; turn this off when working with unary data and other data types that don’t respond well to centering.
aggregate (str) – the type of aggregation to do. Can be weighted-average (the default) or sum.
use_ratings (bool) – whether or not to use the rating values. If False, it ignores rating values and considers an implicit feedback signal of 1 for every (user, item) pair present.
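As a rough illustration of how nnbrs, min_nbrs, and min_sim interact when scoring one item, consider the sketch below. The function select_neighbors is a hypothetical helper, not a LensKit API, and treating the min_sim threshold as inclusive is an assumption.

```python
def select_neighbors(candidates, nnbrs, min_nbrs=1, min_sim=1e-6):
    """Pick the neighbors used to score one target item.

    candidates: (item_id, similarity) pairs for items the user has rated.
    Returns at most nnbrs pairs, most similar first, or None when fewer
    than min_nbrs candidates pass the similarity threshold.
    """
    usable = [(item, sim) for item, sim in candidates if sim >= min_sim]
    if len(usable) < min_nbrs:
        return None  # too few neighbors: the item cannot be scored
    usable.sort(key=lambda pair: pair[1], reverse=True)
    return usable if nnbrs is None else usable[:nnbrs]

candidates = [("a", 0.9), ("b", 0.0), ("c", 0.4), ("d", 0.7)]
top_two = select_neighbors(candidates, nnbrs=2)
unlimited = select_neighbors(candidates, nnbrs=None)
too_strict = select_neighbors(candidates, nnbrs=20, min_nbrs=4)
```

Here item "b" is dropped by min_sim, so only three candidates remain and a min_nbrs of 4 cannot be satisfied.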
- item_index_
the index of item IDs.
- Type:
- item_means_
the mean rating for each known item.
- Type:
- item_counts_
the number of saved neighbors for each item.
- Type:
- sim_matrix_
the similarity matrix.
- Type:
matrix.CSR
- user_index_
the index of known user IDs for the rating matrix.
- Type:
- rating_matrix_
the user-item rating matrix for looking up users’ ratings.
- Type:
matrix.CSR
- IGNORED_PARAMS = ['feedback']
Names of parameters to ignore in get_params().
- EXTRA_PARAMS = ['center', 'aggregate', 'use_ratings']
Names of extra parameters to include in get_params(). Useful when the constructor takes **kwargs.
- fit(ratings, **kwargs)
Train a model.
The model-training process depends on save_nbrs and min_sim, but not on other algorithm parameters.
- Parameters:
ratings (pandas.DataFrame) – (user,item,rating) data for computing item similarities.
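The role of min_sim and save_nbrs during training can be pictured with the following plain-Python sketch. It is not LensKit's code: train_item_neighbors is a hypothetical name, the use of cosine similarity over item-mean-centered vectors is an assumption, and the quadratic loop is for readability rather than speed.

```python
import math
from collections import defaultdict

def train_item_neighbors(ratings, min_sim=1e-6, save_nbrs=None):
    """Build {item: [(other_item, similarity), ...]} from (user, item, rating) triples."""
    # One sparse vector per item: {user: rating}
    vectors = defaultdict(dict)
    for user, item, rating in ratings:
        vectors[item][user] = rating

    # Mean-center each item vector, then compute its Euclidean norm
    for vec in vectors.values():
        mean = sum(vec.values()) / len(vec)
        for user in vec:
            vec[user] -= mean
    norms = {i: math.sqrt(sum(v * v for v in vec.values()))
             for i, vec in vectors.items()}

    neighbors = {}
    for i, vec_i in vectors.items():
        sims = []
        for j, vec_j in vectors.items():
            if i == j or norms[i] == 0 or norms[j] == 0:
                continue
            dot = sum(v * vec_j[u] for u, v in vec_i.items() if u in vec_j)
            sim = dot / (norms[i] * norms[j])
            if sim >= min_sim:          # drop weak or negative similarities
                sims.append((j, sim))
        sims.sort(key=lambda pair: pair[1], reverse=True)
        # Keep only the save_nbrs most similar items, if a limit is set
        neighbors[i] = sims if save_nbrs is None else sims[:save_nbrs]
    return neighbors

ratings = [
    ("u1", "A", 5.0), ("u2", "A", 1.0),
    ("u1", "B", 4.0), ("u2", "B", 2.0),
    ("u1", "C", 1.0), ("u2", "C", 5.0),
]
model = train_item_neighbors(ratings)
```

In this toy data, items A and B are rated in the same direction and become neighbors, while C is rated in the opposite direction and ends up with no saved neighbors.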
- predict_for_user(user, items, ratings=None)
Compute predictions for a user and items.
- Parameters:
user – the user ID
items (array-like) – the items to predict
ratings (pandas.Series) – the user’s ratings (indexed by item id); if provided, they may be used to override or augment the model’s notion of a user’s preferences.
- Returns:
scores for the items, indexed by item id.
- Return type:
pandas.Series
User-based k-NN
- class lenskit.algorithms.user_knn.UserUser(nnbrs, min_nbrs=1, min_sim=0, feedback='explicit', **kwargs)
Bases: Predictor
User-user nearest-neighbor collaborative filtering with ratings. This user-user implementation is not terribly configurable; it hard-codes design decisions found to work well in the previous Java-based LensKit code.
- Parameters:
nnbrs (int) – the maximum number of neighbors for scoring each item (None for unlimited)
min_nbrs (int) – the minimum number of neighbors for scoring each item
min_sim (float) – minimum similarity threshold for considering a neighbor
feedback (str) –
Control how feedback should be interpreted. Specifies defaults for the other settings, which can be overridden individually; can be one of the following values:
explicit
Configure for explicit-feedback mode: use rating values, center ratings, and use the weighted-average aggregate method for prediction. This is the default setting.
implicit
Configure for implicit-feedback mode: ignore rating values, do not center ratings, and use the sum aggregate method for prediction.
center (bool) – whether to normalize (mean-center) rating vectors. Turn this off when working with unary data and other data types that don’t respond well to centering.
aggregate (str) – the type of aggregation to do. Can be weighted-average or sum.
use_ratings (bool) – whether or not to use rating values; default is True. If False, it ignores rating values and treats every present rating as 1.
- user_index_
User index.
- Type:
- item_index_
Item index.
- Type:
- user_means_
User mean ratings.
- Type:
- rating_matrix_
Normalized user-item rating matrix.
- Type:
matrix.CSR
- transpose_matrix_
Transposed un-normalized rating matrix.
- Type:
matrix.CSR
- IGNORED_PARAMS = ['feedback']
Names of parameters to ignore in get_params().
- EXTRA_PARAMS = ['center', 'aggregate', 'use_ratings']
Names of extra parameters to include in get_params(). Useful when the constructor takes **kwargs.
- fit(ratings, **kwargs)
“Train” a user-user CF model. This memorizes the rating data in a format that is usable for future computations.
- Parameters:
ratings (pandas.DataFrame) – (user, item, rating) data for collaborative filtering.
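What "memorizing" the rating data could look like is sketched below in plain Python. The function memorize_ratings and its dictionary-based return values are hypothetical stand-ins for the user means, the normalized rating matrix, and the transposed un-normalized matrix; LensKit itself stores these in array and CSR structures.

```python
from collections import defaultdict

def memorize_ratings(ratings, center=True):
    """Store (user, item, rating) triples in lookup-friendly form.

    Returns (user_means, normalized, by_item):
      user_means: {user: mean rating} (0.0 for every user when center is False)
      normalized: {user: {item: rating - user mean}}
      by_item:    {item: {user: raw rating}}, the transposed raw data
    """
    by_user = defaultdict(dict)
    by_item = defaultdict(dict)
    for user, item, rating in ratings:
        by_user[user][item] = rating
        by_item[item][user] = rating

    user_means = {}
    normalized = {}
    for user, row in by_user.items():
        mean = sum(row.values()) / len(row) if center else 0.0
        user_means[user] = mean
        normalized[user] = {item: r - mean for item, r in row.items()}
    return user_means, normalized, dict(by_item)

ratings = [("u1", "a", 4.0), ("u1", "b", 2.0), ("u2", "a", 5.0)]
user_means, normalized, by_item = memorize_ratings(ratings)
```

No similarities are computed at this stage in the sketch; neighbors are found later, when a user is scored.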
- predict_for_user(user, items, ratings=None)
Compute predictions for a user and items.
- Parameters:
user – the user ID
items (array-like) – the items to predict
ratings (pandas.Series) – the user’s ratings (indexed by item id); if provided, will be used to recompute the user’s bias at prediction time.
- Returns:
scores for the items, indexed by item id.
- Return type:
pandas.Series
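The effect of supplying ratings at prediction time can be illustrated with the standard mean-centered user-user formula. The sketch below is a hedged illustration and not LensKit's exact computation; score_item and its argument layout are hypothetical.

```python
def score_item(target_ratings, neighbors, item):
    """Predict the target user's rating for one item.

    target_ratings: {item: rating} for the target user, e.g. ratings
        supplied at prediction time.
    neighbors: (similarity, neighbor_mean, neighbor_ratings) triples.
    """
    # The user's bias (mean rating) is recomputed from the supplied ratings
    user_mean = sum(target_ratings.values()) / len(target_ratings)

    numerator = 0.0
    denominator = 0.0
    for sim, nbr_mean, nbr_ratings in neighbors:
        if item in nbr_ratings:
            # Each neighbor contributes its mean-centered rating for the item
            numerator += sim * (nbr_ratings[item] - nbr_mean)
            denominator += abs(sim)
    if denominator == 0:
        return None  # no neighbor has rated the item
    return user_mean + numerator / denominator

target_ratings = {"a": 4.0, "b": 2.0}
neighbors = [
    (0.5, 3.0, {"c": 5.0}),
    (0.5, 4.0, {"c": 3.0}),
    (0.9, 2.0, {"a": 1.0}),   # has not rated "c", so it is skipped
]
prediction = score_item(target_ratings, neighbors, "c")
missing = score_item(target_ratings, neighbors, "z")
```

Because the neighbor offsets are added back to the target user's own mean, changing the supplied ratings shifts every prediction for that user.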