Top-N Evaluation
LensKit’s support for top-N evaluation is in two parts, because there are some subtle complexities that make it more dfficult to get the right data in the right place for computing metrics correctly.
Top-N Analysis
The lenskit.topn
module contains the utilities for carrying out top-N
analysis, in conjucntion with lenskit.batch.recommend()
.
The entry point to this is RecListAnalysis
. This class encapsulates
an analysis with one or more metrics, and can apply it to data frames of recommendations.
An analysis requires two data frames: the recommendation frame contains the recommendations
themselves, and the truth frame contains the ground truth data for the users. The
analysis is flexible with regards to the columns that identify individual recommendation
lists; usually these will consist of a user ID, data set identifier, and algorithm
identifier(s), but the analysis is configurable and its defaults make minimal assumptions.
The recommendation frame does need an item
column with the recommended item IDs,
and it should be in order within a single recommendation list.
The truth frame should contain (a subset of) the columns identifying recommendation
lists, along with item
and, if available, rating
(if no rating is provided,
the metrics that need a rating value will assume a rating of 1 for every item present).
It can contain other items that custom metrics may find useful as well.
For example, a recommendation frame may contain:
DataSet
Partition
Algorithm
user
item
rank
score
And the truth frame:
DataSet
user
item
rating
The analysis will use this truth as the relevant item data for measuring the accuracy of the
roecommendation lists. Recommendations will be matched to test ratings by data set, user,
and item, using RecListAnalysis
defaults.
Identifying columns will be used to create two synthetic identifiers, LKRecID (the recommendation list identifier) and LKTruthID (the truth list identifier), that are used in the internal data frames. Custom metric classes will see these on the data frames instead of other identifying columns.
- class lenskit.topn.RecListAnalysis(group_cols=None, n_jobs=None)
Bases:
object
Compute one or more top-N metrics over recommendation lists.
This method groups the recommendations by the specified columns, and computes the metric over each group. The default set of grouping columns is all columns except the following:
item
rank
score
rating
The truth frame,
truth
, is expected to match over (a subset of) the grouping columns, and contain at least anitem
column. If it also contains arating
column, that is used as the users’ rating for metrics that require it; otherwise, a rating value of 1 is assumed.- Parameters
group_cols (list) – The columns to group by, or
None
to use the default.
- add_metric(metric, *, name=None, **kwargs)
Add a metric to the analysis.
A metric is a function of two arguments: the a single group of the recommendation frame, and the corresponding truth frame. The truth frame will be indexed by item ID. The recommendation frame will be in the order in the data. Many metrics are defined in
lenskit.metrics.topn
; they are re-exported fromlenskit.topn
for convenience.- Parameters
metric – The metric to compute.
name – The name to assign the metric. If not provided, the function name is used.
**kwargs – Additional arguments to pass to the metric.
- compute(recs, truth, *, include_missing=False)
Run the analysis. Neither data frame should be meaningfully indexed.
- Parameters
recs (pandas.DataFrame) – A data frame of recommendations.
truth (pandas.DataFrame) – A data frame of ground truth (test) data.
include_missing (bool) –
True
to include users from truth missing from recs. Matches are done via group columns that appear in bothrecs
andtruth
.
- Returns
The results of the analysis.
- Return type
Metrics
The lenskit.metrics.topn
module contains metrics for evaluating top-N
recommendation lists.
Each of the top-N metrics supports an optional keyword argument k
that specifies the expected list
length. Recommendation lists are truncated to this length prior to measurment (so you can measure a
a metric at multiple values of k
in a single analysis), and for recall-oriented metrics like
recall()
and ndcg()
, it normalizes the best-case possible items to k
(because if
there are 10 relevant items, Recall@5 should be 1 when the list returns any 5 relevant items).
To use this, pass extra arguments to RecListAnalysis.add_metric()
:
rla.add_metric(ndcg, k=5)
rla.add_metric(ndcg, name='ndcg_10', k=10)
The default is to allow unbounded lists. When using large recommendation lists, if users never have
more test ratings than there are recommended items, the default makes sense. For short recommendation
lists you will usually need to specify k
. Unfortunately, there isn’t a practical way to guess k
,
because shorter lists may mean that the recommender could not produce a full-length list.
Classification Metrics
These metrics treat the recommendation list as a classification of relevant items.
- lenskit.metrics.topn.precision(recs, truth, k=None)
Compute recommendation precision. This is computed as:
\[\frac{|L \cap I_u^{\mathrm{test}}|}{|L|}\]In the uncommon case that
k
is specified andlen(recs) < k
, this metric useslen(recs)
as the denominator.This metric has a bulk implementation.
- lenskit.metrics.topn.recall(recs, truth, k=None)
Compute recommendation recall. This is computed as:
\[\frac{|L \cap I_u^{\mathrm{test}}|}{\operatorname{max}\{|I_u^{\mathrm{test}}|, k\}}\]This metric has a bulk implementation.
- lenskit.metrics.topn.hit(recs, truth, k=None)
Compute whether or not a list is a hit; any list with at least one relevant item in the first \(k\) positions (\(L_{\le k} \cap I_u^{\mathrm{test}} \ne \emptyset\)) is scored as 1, and lists with no relevant items as 0. When averaged over the recommendation lists, this computes the hit rate [DK04].
\[\frac{|L \cap I_u^{\mathrm{test}}|}{\operatorname{max}\{|I_u^{\mathrm{test}}|, k\}}\]This metric has a bulk implementation.
Ranked List Metrics
These metrics treat the recommendation list as a ranked list of items that may or may not be relevant.
- lenskit.metrics.topn.recip_rank(recs, truth, k=None)
Compute the reciprocal rank [KV97] of the first relevant item in a list of recommendations. Let \(\kappa\) denote the 1-based rank of the first relevant item in \(L\), with \(\kappa=\infty\) if none of the first \(k\) items in \(L\) are relevant; then the reciprocal rank is \(1 / \kappa\). If no elements are relevant, the reciprocal rank is therefore 0. Deshpande and Karypis [DK04] call this the “reciprocal hit rate”.
This metric has a bulk equivalent.
Utility Metrics
The NDCG function estimates a utility score for a ranked list of recommendations.
- lenskit.metrics.topn.ndcg(recs, truth, discount=<ufunc 'log2'>, k=None)
Compute the normalized discounted cumulative gain [JarvelinKekalainen02].
Discounted cumultative gain is computed as:
\[\begin{align*} \mathrm{DCG}(L,u) & = \sum_{i=1}^{|L|} \frac{r_{ui}}{d(i)} \end{align*}\]Unrated items are assumed to have a utility of 0; if no rating values are provided in the truth frame, item ratings are assumed to be 1.
This is then normalized as follows:
\[\begin{align*} \mathrm{nDCG}(L, u) & = \frac{\mathrm{DCG}(L,u)}{\mathrm{DCG}(L_{\mathrm{ideal}}, u)} \end{align*}\]This metric has a bulk implementation.
- Parameters
recs – The recommendation list.
truth – The user’s test data.
discount (numpy.ufunc) – The rank discount function. Each item’s score will be divided the discount of its rank, if the discount is greater than 1.
k (int) – The maximum list length.
- lenskit.metrics.topn.dcg(recs, truth, discount=<ufunc 'log2'>)
Compute the unnormalized discounted cumulative gain [JarvelinKekalainen02].
Discounted cumultative gain is computed as:
\[\begin{align*} \mathrm{DCG}(L,u) & = \sum_{i=1}^{|L|} \frac{r_{ui}}{d(i)} \end{align*}\]Unrated items are assumed to have a utility of 0; if no rating values are provided in the truth frame, item ratings are assumed to be 1.
- Parameters
recs – The recommendation list.
truth – The user’s test data.
discount (ufunc) – The rank discount function. Each item’s score will be divided the discount of its rank, if the discount is greater than 1.
We also expose the internal DCG computation directly.
- lenskit.metrics.topn._dcg(scores, discount=<ufunc 'log2'>)
Compute the Discounted Cumulative Gain of a series of recommended items with rating scores. These should be relevance scores; they can be \({0,1}\) for binary relevance data.
This is not a true top-N metric, but is a utility function for other metrics.
- Parameters
scores (array-like) – The utility scores of a list of recommendations, in recommendation order.
discount (ufunc) – the rank discount function. Each item’s score will be divided the discount of its rank, if the discount is greater than 1.
- Returns
the DCG of the scored items.
- Return type
double
Writing a Metric
A metric is a function that takes two positional parameters:
recs
, a data frame of recommendations for a single recommendation list.truth
, a data frame of ground-truth data (usually ratings) for the user for whom the list was generated.
It can take additional keyword arguments that are passed through from RecListAnalysis.add_metric()
.
A metric then returns a single floating-point value; NaN is allowed.
Metrics can be further optimized with the bulk interface. A bulk metric function takes recs
and truth
frames for the entire set of recommendations, with transformation (they have
LKRecID
and LKTruthID
columns instead of other identifying columns), and returns a series
whose index is LKRecID
and values are the metric values for each list. Further, the recs
passed to a bulk implementation includes a 1-based rank for each recommendation.
The bulk_impl()
function registers a bulk implementation of a metric:
def metric(recs, truth):
# normal metric implementation
pass
@bulk_impl(metric)
def _bulk_metric(recs, truth):
# bulk metric implementation
If a bulk implementation of a metric is available, and it is possible to use it, it will be used automatically
when the corresponding metric is passed to add_metric()
.
- lenskit.metrics.topn.bulk_impl(metric)
Decorator to register a bulk implementation for a metric.