The functions in lenskit.batch enable you to generate many recommendations or predictions at the same time, which is useful for evaluations and experiments.
The batch functions can parallelize over users with the optional n_jobs parameter or the LK_NUM_PROCS environment variable.
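As a rough illustration of how an explicit n_jobs argument and the LK_NUM_PROCS variable might interact, the sketch below resolves a worker count. The helper name resolve_n_jobs, the precedence order (argument, then environment, then CPU count), and the fallback are assumptions made for illustration; they are not LensKit's actual logic.

```python
import os


def resolve_n_jobs(n_jobs=None):
    """Hypothetical helper: pick a worker count (not part of LensKit)."""
    # Assumption: an explicit argument takes precedence over the environment.
    if n_jobs is not None:
        return max(1, int(n_jobs))
    env = os.environ.get("LK_NUM_PROCS")
    if env:
        return max(1, int(env))
    # Assumed fallback: use every CPU the interpreter can see.
    return os.cpu_count() or 1


os.environ["LK_NUM_PROCS"] = "4"
print(resolve_n_jobs())   # the environment value is used
print(resolve_n_jobs(2))  # the explicit argument is used instead
```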
Scripts calling the batch recommendation or prediction facilities must be protected; that is, they should not directly perform their work when run, but should define functions and call a main function when run as a script, with a block like this at the end of the file:

    def main():
        # do the actual work
        pass

    if __name__ == '__main__':
        main()

If you are using the batch functions from a Jupyter notebook, you should be fine: the Jupyter programs are appropriately protected.
recommend(algo, users, n, candidates=None, *, n_jobs=None, **kwargs)
Batch-recommend for multiple users. The provided algorithm should be a recommender.
algo – the algorithm
users (array-like) – the users to recommend for
n (int) – the number of recommendations to generate (None for unlimited)
candidates – the users’ candidate sets. This can be a function, in which case it will be passed each user ID; it can also be a dictionary, in which case user IDs will be looked up in it. Pass None to use the recommender’s built-in candidate selector (usually recommended).
A frame with at least the columns item; possibly also score, and any other columns returned by the recommender.
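The two accepted forms of the candidates argument (a function or a dictionary) can be sketched as follows. The helper candidates_for and the sample data are hypothetical and only illustrate the lookup rules described above; they are not LensKit's implementation.

```python
def candidates_for(candidates, user):
    """Hypothetical helper: look up one user's candidate items."""
    if candidates is None:
        # Signal that the recommender's own candidate selector should be used.
        return None
    if callable(candidates):
        # Function form: it is passed the user ID.
        return candidates(user)
    # Otherwise assume a dict-like mapping of user ID -> candidate items.
    return candidates[user]


# Made-up data: items each user has already interacted with.
seen = {1: {10, 11}, 2: {12}}
all_items = {10, 11, 12, 13}


def unseen(user):
    """Function form: every item the user has not interacted with yet."""
    return sorted(all_items - seen.get(user, set()))


# Dictionary form: explicit per-user candidate lists.
fixed = {1: [12, 13], 2: [10]}

print(candidates_for(unseen, 1))  # [12, 13]
print(candidates_for(fixed, 2))   # [10]
print(candidates_for(None, 1))    # None
```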
predict(algo, pairs, *, n_jobs=None, **kwargs)
Generate predictions for user-item pairs. The provided algorithm should be an algorithms.Predictor or a function of two arguments: the user ID and a list of item IDs. It should return a dictionary or a pandas.Series mapping item IDs to predictions.
To use this function, provide a pre-fit algorithm:

    >>> from lenskit.batch import predict
    >>> from lenskit.algorithms.bias import Bias
    >>> from lenskit.metrics.predict import rmse
    >>> from lenskit import datasets
    >>> ratings = datasets.MovieLens('data/ml-latest-small').ratings
    >>> bias = Bias()
    >>> bias.fit(ratings[:-1000])
    <lenskit.algorithms.bias.Bias object at ...>
    >>> preds = predict(bias, ratings[-1000:])
    >>> preds.head()
           user  item  rating   timestamp  prediction
    99004   664  8361     3.0  1393891425    3.288286
    99005   664  8528     3.5  1393891047    3.559119
    99006   664  8529     4.0  1393891173    3.573008
    99007   664  8636     4.0  1393891175    3.846268
    99008   664  8641     4.5  1393890852    3.710635
    >>> rmse(preds['prediction'], preds['rating'])
    0.8326992222...

algo (lenskit.algorithms.Predictor) – A rating predictor function or algorithm.
pairs (pandas.DataFrame) – A data frame of (user, item) pairs to predict for. If this frame also contains a rating column, it will be included in the result.
A frame with columns user, item, and prediction containing the prediction results. If pairs contains a rating column, this result will also contain a rating column.
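The function form of a predictor (two arguments, returning a mapping from item IDs to predictions) can be sketched as below. The names offset_predictor and predict_pairs and all numbers are made up for illustration; predict_pairs is a simplified stand-in for a batch loop, not LensKit's predict().

```python
from collections import defaultdict

GLOBAL_MEAN = 3.5
USER_OFFSETS = {664: 0.25}  # made-up per-user offsets


def offset_predictor(user, items):
    """Function-style predictor: user ID + item IDs -> {item: prediction}."""
    score = GLOBAL_MEAN + USER_OFFSETS.get(user, 0.0)
    return {item: score for item in items}


def predict_pairs(predictor, pairs):
    """Hypothetical stand-in for a batch loop over (user, item) pairs."""
    # Group the requested items by user so the predictor runs once per user.
    by_user = defaultdict(list)
    for user, item in pairs:
        by_user[user].append(item)
    rows = []
    for user, items in by_user.items():
        scores = predictor(user, items)
        rows.extend((user, item, scores[item]) for item in items)
    return rows


pairs = [(664, 8361), (664, 8528), (7, 8361)]
for row in predict_pairs(offset_predictor, pairs):
    print(row)
```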
This function isn’t a batch function per se, as it doesn’t perform multiple operations, but it is primarily useful with batch operations. The train_isolated() function trains an algorithm in a subprocess, so all temporary resources are released by virtue of the training process exiting. It returns a shared memory serialization of the trained model, which can be passed directly to predict() in lieu of an algorithm object, to reduce the total memory consumption.
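
    from lenskit import batch
    from lenskit.algorithms import Recommender
    from lenskit.algorithms.als import BiasedMF

    # train_ratings and test_ratings are rating data frames prepared elsewhere
    algo = BiasedMF(50)
    algo = Recommender.adapt(algo)
    algo = batch.train_isolated(algo, train_ratings)
    preds = batch.predict(algo, test_ratings)
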
train_isolated(algo, ratings, *, file=None, **kwargs)
Train an algorithm in a subprocess to isolate the training process. This function spawns a subprocess (in the same way that LensKit’s multiprocessing support does), calls lenskit.algorithms.Algorithm.fit() on it, and serializes the result for shared-memory use.
Training the algorithm in a single-purpose subprocess makes sure that any training resources, such as TensorFlow sessions, are cleaned up by virtue of the process terminating when model training is completed. It can also reduce memory use, because the original trained model and the shared memory version are not in memory at the same time. While the batch functions use shared memory to reduce memory overhead for parallel processing, naive use of these functions will still have 2 copies of the model in memory, the shared one and the original, because the sharing process does not tear down the original model. Training in a subprocess solves this problem elegantly.
algo (lenskit.algorithms.Algorithm) – The algorithm to train.
ratings (pandas.DataFrame) – The rating data.
kwargs (dict) – Additional named parameters to pass to fit().
The saved model object. This is the owner, so it needs to be closed when finished to free resources.
The MultiEval class is useful for building scripts that evaluate multiple algorithms or algorithm variants, simultaneously, across multiple data sets. It can extract parameters from algorithms and include them in the output, which is useful for hyperparameter search.

    from lenskit.batch import MultiEval
    from lenskit.crossfold import partition_users, SampleN
    from lenskit.algorithms import basic, als
    from lenskit.datasets import MovieLens
    from lenskit import topn
    import pandas as pd

    ml = MovieLens('ml-latest-small')

    eval = MultiEval('my-eval', recommend=20)
    eval.add_datasets(partition_users(ml.ratings, 5, SampleN(5)), name='ML-Small')
    eval.add_algorithms(basic.Popular(), name='Pop')
    eval.add_algorithms([als.BiasedMF(f) for f in [20, 30, 40, 50]],
                        attrs=['features'], name='ALS')
    eval.run()

The my-eval/runs.csv file will then contain the results of running these
algorithms on this data set. A more complete example is available in the
MultiEval(path, *, predict=True, recommend=100, candidates=None, save_models=False, eval_n_jobs=None, combine=True, **kwargs)
A runner for carrying out multiple evaluations, such as parameter sweeps.
path (str or pathlib.Path) – the working directory for this evaluation. It will be created if it does not exist.
predict (bool) – whether to generate rating predictions.
recommend (int) – the number of recommendations to generate per user. Any false-y value (such as 0) will disable top-n. The literal value True will generate recommendation lists of unlimited size.
candidates (function) – the default candidate set generator for recommendations. It should take the training data and return a candidate generator, itself a function mapping user IDs to candidate sets. Pass None to use the default candidate set configured for each algorithm (recommended).
combine (bool) – whether to combine output; if False, output will be left in separate files; if True, it will be in a single set of files (runs, recommendations, and predictions).
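The rules for the recommend parameter (false-y values disable top-n, the literal True means unlimited) could be interpreted along these lines. The helper list_length is hypothetical and is not part of MultiEval.

```python
def list_length(recommend):
    """Hypothetical helper: return (enabled, n); n=None means unlimited."""
    # Check for the literal True first, because True == 1 in Python.
    if recommend is True:
        return True, None
    # Any false-y value (0, None, False) disables top-n recommendation.
    if not recommend:
        return False, None
    return True, int(recommend)


for value in (100, 0, None, True):
    print(repr(value), "->", list_length(value))
```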
add_algorithms(algos, attrs=, **kwargs)
Add one or more algorithms to the run.
algos (algorithm or list) – the algorithm(s) to add.
attrs (list of str) – a list of attributes to extract from the algorithm objects and include in the run descriptions.
kwargs – additional attributes to include in the run descriptions.
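To make the role of attrs and the extra keyword arguments concrete, here is a minimal sketch of how a run description might be assembled. FakeMF, describe_run, and the AlgoClass key are assumptions for illustration, not MultiEval's actual code or output schema.

```python
class FakeMF:
    """Stand-in for an algorithm object; not a LensKit class."""

    def __init__(self, features):
        self.features = features


def describe_run(algo, attrs=(), **extra):
    """Hypothetical helper: build one run-description record."""
    record = {"AlgoClass": type(algo).__name__}
    # Copy the requested attributes off the algorithm object.
    for attr in attrs:
        record[attr] = getattr(algo, attr)
    # Extra keyword arguments are stored alongside them.
    record.update(extra)
    return record


runs = [describe_run(FakeMF(f), attrs=["features"], name="ALS") for f in [20, 30]]
for run in runs:
    print(run)
```

Recording the swept attribute in each run description is what lets the results be grouped by parameter value afterwards.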
add_datasets(data, name=None, candidates=None, **kwargs)
Add one or more datasets to the run.
data – The input data set(s) to run. Can be one of the following:
- A tuple of (train, test) data.
- An iterable of (train, test) pairs, in which case the iterable is not consumed until it is needed.
- A function yielding either of the above, to defer data load until it is needed.
Data can be either data frames or paths; paths are loaded after detection using
kwargs – additional attributes pertaining to these data sets.
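The three accepted shapes of the data argument can be normalized lazily, roughly as sketched below. The helper iter_datasets is hypothetical, and the strings stand in for data frames or paths; this is not how MultiEval is actually implemented.

```python
def iter_datasets(data):
    """Hypothetical helper: yield (train, test) pairs from any accepted form."""
    if callable(data):
        # Deferred load: the function is called only when iteration starts.
        data = data()
    if isinstance(data, tuple) and len(data) == 2:
        # Assumption: a 2-tuple is always a single (train, test) pair.
        yield data
    else:
        # An iterable of pairs; items are pulled one at a time.
        yield from data


def load_folds():
    print("loading folds")
    return [("train-1", "test-1"), ("train-2", "test-2")]


single = list(iter_datasets(("train", "test")))
folds = list(iter_datasets(load_folds))
print(single)
print(folds)
```

Because iter_datasets is a generator, passing load_folds does not trigger the load until the first pair is requested.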
Persist the data for an experiment, replacing in-memory data sets with file names. Once this has been called, the sweep can be pickled.
Get the number of runs in this evaluation.
run(runs=None, *, progress=None)
Run the evaluation.
runs (int or set-like) – If provided, a specific set of runs to run. Useful for splitting an experiment into individual runs. This is a set of 1-based run IDs, not 0-based indexes.
progress – A tqdm.tqdm()-compatible progress function.
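Since runs takes 1-based run IDs, splitting an experiment across several workers needs a little care with indexing. The sketch below shows one way to do it; runs_for_worker and the round-robin assignment are assumptions for illustration.

```python
def runs_for_worker(n_runs, worker, n_workers):
    """Hypothetical helper: 1-based run IDs for one of n_workers workers."""
    # range(1, n_runs + 1) enumerates run IDs starting from 1, not 0;
    # worker is a 0-based worker index, and runs are assigned round-robin.
    return {run for run in range(1, n_runs + 1) if (run - 1) % n_workers == worker}


print(sorted(runs_for_worker(10, 0, 3)))  # [1, 4, 7, 10]
print(sorted(runs_for_worker(10, 2, 3)))  # [3, 6, 9]
```

Each worker would then pass its own set as the runs argument of run().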
Collect the results from non-combined runs into combined output files.