LensKit strives to provide pretty good performance (in terms of computation speed), but sometimes it needs a little nudging.
If you are implementing an algorithm, see the implementation tips for information on good performance.
We generally find the best performance using MKL with TBB throughout the stack. If both LensKit's Numba-accelerated code and MKL are using TBB, they will coordinate their thread pools to keep the overall thread count in check. To set this up (a configuration sketch follows the list):

- Use Conda-based Python, with the `MKL_THREADING_LAYER` environment variable set to `tbb`, so both MKL and LensKit will use TBB and can coordinate their thread pools.
- Set `LK_NUM_PROCS` if you want to control LensKit's batch prediction and recommendation parallelism, and `NUMBA_NUM_THREADS` to control its model training parallelism.
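One way to apply these settings is to export them in your shell before launching Python, or to set them at the very top of a script before any numerical libraries load. The values below are arbitrary examples, not recommendations:

```python
# Sketch of a run script; environment variables must be set before Numba,
# MKL, or LensKit are imported, or they may not take effect.
import os

os.environ["MKL_THREADING_LAYER"] = "tbb"  # MKL and Numba share TBB thread pools
os.environ["LK_NUM_PROCS"] = "4"           # batch prediction/recommendation processes
os.environ["NUMBA_NUM_THREADS"] = "8"      # model-training thread count

import lenskit  # import only after the environment is configured
```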
LensKit has two forms of parallelism. Algorithm training processes can be parallelized through a number of mechanisms:

- Our own parallel code uses Numba, which in turn uses TBB (preferred) or OpenMP. The thread count is controlled by `NUMBA_NUM_THREADS` (see the sketch after this list).
- The BLAS library may parallelize underlying operations using its threading library. This is usually OpenMP; MKL also supports TBB, but unlike Numba, it defaults to OpenMP even if TBB is available.
- Underlying libraries such as TensorFlow and scikit-learn may provide their own parallelism.
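To see which threading layer Numba actually selected, you can run a small parallel function and then inspect it. This is a generic Numba sketch, not LensKit code:

```python
import numba
import numpy as np

@numba.njit(parallel=True)
def psum(x):
    total = 0.0
    for i in numba.prange(x.shape[0]):  # parallel loop with a scalar reduction
        total += x[i]
    return total

psum(np.ones(1_000_000))        # a parallel region must run before the layer is known
print(numba.threading_layer())  # e.g. 'tbb' or 'omp'
print(numba.get_num_threads())  # reflects NUMBA_NUM_THREADS
```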
The LensKit batch functions use Python `multiprocessing`, and their concurrency level is controlled by the `LK_NUM_PROCS` environment variable. The default number of processes is one-half the number of cores as reported by `multiprocessing.cpu_count()`.
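For example, a batch recommendation run with an explicit process count might look like the following sketch. The toy data frame is a stand-in for real ratings, the import paths reflect the 0.x API, and `batch.recommend`'s exact signature may differ between LensKit versions:

```python
import os
os.environ["LK_NUM_PROCS"] = "4"   # four worker processes; set before LensKit runs

import pandas as pd
from lenskit import batch
from lenskit.algorithms import Recommender
from lenskit.algorithms.basic import Bias

# toy ratings; a real run would load MovieLens or similar
ratings = pd.DataFrame({
    "user":   [1, 1, 2, 2, 3, 3],
    "item":   [10, 20, 10, 30, 20, 30],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

algo = Recommender.adapt(Bias())   # wrap the predictor so it can recommend
algo.fit(ratings)
recs = batch.recommend(algo, [1, 2, 3], 2)  # top-2 recommendations per user
```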
The batch functions also set the thread count for some libraries within the worker processes, to prevent over-subscribing the CPU. Right now, the worker will configure Numba and MKL. In the rest of this section, this will be referred to as the 'inner thread count'.
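Capping those two libraries from Python looks roughly like this; it is an illustrative sketch of the idea, not LensKit's worker code, and the `mkl` module assumes the mkl-service package is installed:

```python
import numba
import mkl  # provided by the mkl-service package

# limit each library to 2 threads inside a worker process
numba.set_num_threads(2)
mkl.set_num_threads(2)
```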
The thread count logic is controlled by `LK_NUM_PROCS` and works as follows (a sketch of this logic appears after the list):
- If `LK_NUM_PROCS` is an integer, the batch functions will use the specified number of processes, with 1 inner thread.
- If `LK_NUM_PROCS` is a comma-separated pair of integers (e.g. `8,4`), the batch functions will use the first number as the process count and the second as the inner thread count. This overrides `NUMBA_NUM_THREADS`, unless it is larger than `NUMBA_NUM_THREADS`.
- If `LK_NUM_PROCS` is not set, the batch functions use half the number of cores as the process count and 2 as the inner thread count (unless `NUMBA_NUM_THREADS` is set to 1 in the environment).
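Put in code, the decision procedure described above is roughly the following; this is an illustrative sketch of the documented rules, not LensKit's actual implementation:

```python
import multiprocessing
import os

def batch_parallel_config(env=os.environ):
    """Sketch of the documented LK_NUM_PROCS logic."""
    spec = env.get("LK_NUM_PROCS")
    if spec is None:
        # default: half the cores, 2 inner threads (1 if Numba is pinned to 1)
        nprocs = multiprocessing.cpu_count() // 2
        threads = 1 if env.get("NUMBA_NUM_THREADS") == "1" else 2
    elif "," in spec:
        # 'procs,threads' pair, e.g. '8,4'
        p, t = spec.split(",", 1)
        nprocs, threads = int(p), int(t)
    else:
        # bare integer: that many processes, 1 inner thread
        nprocs, threads = int(spec), 1
    return nprocs, threads
```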
Batch parallelism disables TensorFlow GPUs in the worker processes. This is fine, because GPUs are most useful for model training; having multiple worker processes compete for the GPU causes problems.