.. _performance-howto:

=========================
How to optimize for speed
=========================

The following gives some practical guidelines to help you write efficient
code for the scikit-learn project.

.. note::

  While it is always useful to profile your code so as to **check
  performance assumptions**, it is also highly recommended to **review the
  literature** to ensure that the implemented algorithm is the state of the
  art for the task before investing in costly implementation optimization.

  Time and again, hours of effort invested in optimizing complicated
  implementation details have been rendered irrelevant by the subsequent
  discovery of simple **algorithmic tricks**, or by using another algorithm
  altogether that is better suited to the problem.

  The section :ref:`warm-restarts` gives an example of such a trick.

Python, Cython or C/C++?
========================

.. currentmodule:: sklearn

In general, the scikit-learn project emphasizes the **readability** of the
source code. This makes it easy for users to dive into the source code to
understand how an algorithm behaves on their data, and it also eases
maintainability for the developers.

When implementing a new algorithm, it is thus recommended to **start
implementing it in Python using Numpy and Scipy**, taking care to avoid
looping code by using the vectorized idioms of those libraries. In practice
this means trying to **replace any nested for loops by calls to equivalent
Numpy array methods**. The goal is to avoid the CPU wasting time in the
Python interpreter rather than crunching numbers to fit your statistical
model. It is generally a good idea to consider the NumPy and SciPy
performance tips:
https://scipy.github.io/old-wiki/pages/PerformanceTips

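For instance, here is a minimal sketch (not code taken from scikit-learn
itself; the function names are made up for illustration) contrasting a
nested-loop implementation of pairwise squared Euclidean distances with an
equivalent vectorized NumPy version::

  import numpy as np

  def pairwise_sq_dists_loop(X, Y):
      """Naive version: nested Python loops, slow for large arrays."""
      out = np.empty((X.shape[0], Y.shape[0]))
      for i in range(X.shape[0]):
          for j in range(Y.shape[0]):
              out[i, j] = np.sum((X[i] - Y[j]) ** 2)
      return out

  def pairwise_sq_dists_vectorized(X, Y):
      """Same computation expressed with NumPy array operations only."""
      # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>, computed for all pairs at once
      XX = np.sum(X ** 2, axis=1)[:, np.newaxis]
      YY = np.sum(Y ** 2, axis=1)[np.newaxis, :]
      return XX + YY - 2 * X @ Y.T

  rng = np.random.RandomState(0)
  X, Y = rng.rand(300, 20), rng.rand(200, 20)
  assert np.allclose(pairwise_sq_dists_loop(X, Y),
                     pairwise_sq_dists_vectorized(X, Y))
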
Sometimes however an algorithm cannot be expressed efficiently in simple
vectorized Numpy code. In this case, the recommended strategy is the
following:

1. **Profile** the Python implementation to find the main bottleneck and
   isolate it in a **dedicated module level function**. This function
   will be reimplemented as a compiled extension module.

2. If there exists a well maintained BSD or MIT **C/C++** implementation
   of the same algorithm that is not too big, you can write a
   **Cython wrapper** for it and include a copy of the source code
   of the library in the scikit-learn source tree: this strategy is
   used for the classes :class:`svm.LinearSVC`, :class:`svm.SVC` and
   :class:`linear_model.LogisticRegression` (wrappers for liblinear
   and libsvm).

3. Otherwise, write an optimized version of your Python function using
   **Cython** directly. This strategy is used
   for the :class:`linear_model.ElasticNet` and
   :class:`linear_model.SGDClassifier` classes for instance.

4. **Move the Python version of the function into the tests** and use
   it to check that the results of the compiled extension are consistent
   with the gold standard, easy to debug Python version (a minimal sketch
   of such a consistency test is given after this list).

5. Once the code is optimized (no simple bottleneck left that is spottable
   by profiling), check whether it is possible to have **coarse grained
   parallelism** that is amenable to **multi-processing** by using the
   ``joblib.Parallel`` class.

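Here is a minimal sketch of the consistency test mentioned in step 4. The
compiled helper (``mypackage._fast_norms.fast_row_norms``) is hypothetical
and only used for illustration::

  import numpy as np
  from numpy.testing import assert_allclose

  def reference_row_norms(X):
      """Slow but obviously correct pure NumPy/Python reference."""
      return np.array([np.sqrt(np.sum(row ** 2)) for row in X])

  def test_fast_row_norms_matches_reference():
      # fast_row_norms stands for the compiled (Cython) implementation;
      # both the module and the function name are hypothetical.
      from mypackage._fast_norms import fast_row_norms

      X = np.random.RandomState(0).rand(50, 10)
      assert_allclose(fast_row_norms(X), reference_row_norms(X))
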
When using Cython, use either

.. prompt:: bash $

  python setup.py build_ext -i
  python setup.py install

to generate C files. You are responsible for adding .c/.cpp extensions along
with build parameters in each submodule ``setup.py``.

C/C++ generated files are embedded in distributed stable packages. The goal is
to make it possible to install the scikit-learn stable version
on any machine with Python, Numpy, Scipy and a C/C++ compiler.

.. _profiling-python-code:

Profiling Python code
=====================

In order to profile Python code we recommend writing a script that
loads and prepares your data and then using the IPython integrated profiler
to interactively explore the relevant part of the code.

Suppose we want to profile the Non Negative Matrix Factorization module
of scikit-learn. Let us set up a new IPython session and load the digits
dataset as in the :ref:`sphx_glr_auto_examples_classification_plot_digits_classification.py` example::

  In [1]: from sklearn.decomposition import NMF

  In [2]: from sklearn.datasets import load_digits

  In [3]: X, _ = load_digits(return_X_y=True)

Before starting the profiling session and engaging in tentative
optimization iterations, it is important to measure the total execution
time of the function we want to optimize without any kind of profiler
overhead and save it somewhere for later reference::

  In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
  1 loops, best of 3: 1.7 s per loop

To have a look at the overall performance profile, use the ``%prun``
magic command::

  In [5]: %prun -l nmf.py NMF(n_components=16, tol=1e-2).fit(X)
           14496 function calls in 1.682 CPU seconds

           Ordered by: internal time
           List reduced from 90 to 9 due to restriction <'nmf.py'>

           ncalls  tottime  percall  cumtime  percall filename:lineno(function)
               36    0.609    0.017    1.499    0.042 nmf.py:151(_nls_subproblem)
             1263    0.157    0.000    0.157    0.000 nmf.py:18(_pos)
                1    0.053    0.053    1.681    1.681 nmf.py:352(fit_transform)
              673    0.008    0.000    0.057    0.000 nmf.py:28(norm)
                1    0.006    0.006    0.047    0.047 nmf.py:42(_initialize_nmf)
               36    0.001    0.000    0.010    0.000 nmf.py:36(_sparseness)
               30    0.001    0.000    0.001    0.000 nmf.py:23(_neg)
                1    0.000    0.000    0.000    0.000 nmf.py:337(__init__)
                1    0.000    0.000    1.681    1.681 nmf.py:461(fit)

The ``tottime`` column is the most interesting: it gives the total time spent
executing the code of a given function ignoring the time spent in executing the
sub-functions. The real total time (local code + sub-function calls) is given by
the ``cumtime`` column.

Note the use of ``-l nmf.py``, which restricts the output to lines that
contain the "nmf.py" string. This is useful to have a quick look at the hotspots
of the nmf Python module itself, ignoring anything else.

Here is the beginning of the output of the same command without the ``-l nmf.py``
filter::

  In [5]: %prun NMF(n_components=16, tol=1e-2).fit(X)
           16159 function calls in 1.840 CPU seconds

           Ordered by: internal time

           ncalls  tottime  percall  cumtime  percall filename:lineno(function)
             2833    0.653    0.000    0.653    0.000 {numpy.core._dotblas.dot}
               46    0.651    0.014    1.636    0.036 nmf.py:151(_nls_subproblem)
             1397    0.171    0.000    0.171    0.000 nmf.py:18(_pos)
             2780    0.167    0.000    0.167    0.000 {method 'sum' of 'numpy.ndarray' objects}
                1    0.064    0.064    1.840    1.840 nmf.py:352(fit_transform)
             1542    0.043    0.000    0.043    0.000 {method 'flatten' of 'numpy.ndarray' objects}
              337    0.019    0.000    0.019    0.000 {method 'all' of 'numpy.ndarray' objects}
             2734    0.011    0.000    0.181    0.000 fromnumeric.py:1185(sum)
                2    0.010    0.005    0.010    0.005 {numpy.linalg.lapack_lite.dgesdd}
              748    0.009    0.000    0.065    0.000 nmf.py:28(norm)
           ...

The above results show that the execution is largely dominated by
dot product operations (delegated to BLAS). Hence there is probably
no huge gain to expect by rewriting this code in Cython or C/C++: in
this case out of the 1.7s total execution time, almost 0.7s are spent
in compiled code that we can consider optimal. By rewriting the rest of the
Python code and assuming we could achieve a 1000% boost on this portion
(which is highly unlikely given the shallowness of the Python loops),
we would not gain more than a 2.4x speed-up globally.

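This bound can be checked with a quick back-of-the-envelope computation, here
using the figures from the unfiltered profile above and assuming that a
"1000% boost" means the Python part runs 10x faster::

  total = 1.840        # total runtime in seconds
  compiled = 0.653     # time already spent in the BLAS dot products
  python_part = total - compiled

  best_case = compiled + python_part / 10
  print(total / best_case)   # ~2.4, the maximum global speed-up
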
Hence major improvements can only be achieved by **algorithmic
improvements** in this particular example (e.g. trying to find operations
that are both costly and useless so as to avoid computing them, rather than
trying to optimize their implementation).

It is however still interesting to check what's happening inside the
``_nls_subproblem`` function, which is the hotspot if we only consider
Python code: it takes around 100% of the accumulated time of the module. In
order to better understand the profile of this specific function, let
us install ``line_profiler`` and wire it to IPython:

.. prompt:: bash $

  pip install line_profiler

**Under IPython 0.13+**, first create a configuration profile:

.. prompt:: bash $

  ipython profile create

Then register the line_profiler extension in
``~/.ipython/profile_default/ipython_config.py``::

  c.TerminalIPythonApp.extensions.append('line_profiler')
  c.InteractiveShellApp.extensions.append('line_profiler')

This will register the ``%lprun`` magic command in the IPython terminal
application and the other frontends such as qtconsole and notebook.

Now restart IPython and let us use this new toy::

  In [1]: from sklearn.datasets import load_digits

  In [2]: from sklearn.decomposition import NMF
     ...: from sklearn.decomposition._nmf import _nls_subproblem

  In [3]: X, _ = load_digits(return_X_y=True)

  In [4]: %lprun -f _nls_subproblem NMF(n_components=16, tol=1e-2).fit(X)
  Timer unit: 1e-06 s

  File: sklearn/decomposition/nmf.py
  Function: _nls_subproblem at line 137
  Total time: 1.73153 s

  Line #      Hits         Time  Per Hit   % Time  Line Contents
  ==============================================================
     137                                           def _nls_subproblem(V, W, H_init, tol, max_iter):
     138                                               """Non-negative least square solver
     ...
     170                                               """
     171        48         5863    122.1      0.3      if (H_init < 0).any():
     172                                                   raise ValueError("Negative values in H_init passed to NLS solver.")
     173
     174        48          139      2.9      0.0      H = H_init
     175        48       112141   2336.3      5.8      WtV = np.dot(W.T, V)
     176        48        16144    336.3      0.8      WtW = np.dot(W.T, W)
     177
     178                                               # values justified in the paper
     179        48          144      3.0      0.0      alpha = 1
     180        48          113      2.4      0.0      beta = 0.1
     181       638         1880      2.9      0.1      for n_iter in range(1, max_iter + 1):
     182       638       195133    305.9     10.2          grad = np.dot(WtW, H) - WtV
     183       638       495761    777.1     25.9          proj_gradient = norm(grad[np.logical_or(grad < 0, H > 0)])
     184       638         2449      3.8      0.1          if proj_gradient < tol:
     185        48          130      2.7      0.0              break
     186
     187      1474         4474      3.0      0.2          for inner_iter in range(1, 20):
     188      1474        83833     56.9      4.4              Hn = H - alpha * grad
     189                                                       # Hn = np.where(Hn > 0, Hn, 0)
     190      1474       194239    131.8     10.1              Hn = _pos(Hn)
     191      1474        48858     33.1      2.5              d = Hn - H
     192      1474       150407    102.0      7.8              gradd = np.sum(grad * d)
     193      1474       515390    349.7     26.9              dQd = np.sum(np.dot(WtW, d) * d)
     ...

By looking at the top values of the ``% Time`` column it is really easy to
pin-point the most expensive expressions that would deserve additional care.

Memory usage profiling
======================

You can analyze in detail the memory usage of any Python code with the help of
`memory_profiler <https://pypi.org/project/memory_profiler/>`_. First,
install the latest version:

.. prompt:: bash $

  pip install -U memory_profiler

Then, set up the magics in a manner similar to ``line_profiler``.

**Under IPython 0.11+**, first create a configuration profile:

.. prompt:: bash $

  ipython profile create

Then register the extension in
``~/.ipython/profile_default/ipython_config.py``
alongside the line profiler::

  c.TerminalIPythonApp.extensions.append('memory_profiler')
  c.InteractiveShellApp.extensions.append('memory_profiler')

This will register the ``%memit`` and ``%mprun`` magic commands in the
IPython terminal application and the other frontends such as qtconsole
and notebook.

``%mprun`` is useful to examine, line-by-line, the memory usage of key
functions in your program. It is very similar to ``%lprun``, discussed in the
previous section. For example, from the ``memory_profiler`` ``examples``
directory::

  In [1] from example import my_func

  In [2] %mprun -f my_func my_func()
  Filename: example.py

  Line #    Mem usage   Increment   Line Contents
  ===============================================
       3                            @profile
       4      5.97 MB     0.00 MB   def my_func():
       5     13.61 MB     7.64 MB       a = [1] * (10 ** 6)
       6    166.20 MB   152.59 MB       b = [2] * (2 * 10 ** 7)
       7     13.61 MB  -152.59 MB       del b
       8     13.61 MB     0.00 MB       return a

Another useful magic that ``memory_profiler`` defines is ``%memit``, which is
analogous to ``%timeit``. It can be used as follows::

  In [1]: import numpy as np

  In [2]: %memit np.zeros(1e7)
  maximum of 3: 76.402344 MB per loop

For more details, see the docstrings of the magics, using ``%memit?`` and
``%mprun?``.

Using Cython
============

If profiling of the Python code reveals that the Python interpreter
overhead is larger by one order of magnitude or more than the cost of the
actual numerical computation (e.g. ``for`` loops over vector components,
nested evaluation of conditional expressions, scalar arithmetic...), it
is probably adequate to extract the hotspot portion of the code as a
standalone function in a ``.pyx`` file, add static type declarations and
then use Cython to generate a C program suitable to be compiled as a
Python extension module.

The `Cython documentation <http://docs.cython.org/>`_ contains a tutorial and
reference guide for developing such a module.
For more information about developing in Cython for scikit-learn, see :ref:`cython`.

.. _profiling-compiled-extension:

Profiling compiled extensions
=============================

When working with compiled extensions (written in C/C++ with a wrapper or
directly as a Cython extension), the default Python profiler is useless:
we need a dedicated tool to introspect what's happening inside the
compiled extension itself.

Using yep and gperftools
------------------------

For easy profiling without special compilation options, use yep:

- https://pypi.org/project/yep/
- https://fa.bianp.net/blog/2011/a-profiler-for-python-extensions

Using a debugger, gdb
---------------------

* It is helpful to use ``gdb`` to debug. In order to do so, one must use
  a Python interpreter built with debug support (debug symbols and proper
  optimization). To create a new conda environment (which you might need
  to deactivate and reactivate after building/installing) with a source-built
  CPython interpreter:

  .. code-block:: bash

    git clone https://github.com/python/cpython.git
    conda create -n debug-scikit-dev
    conda activate debug-scikit-dev
    cd cpython
    mkdir debug
    cd debug
    ../configure --prefix=$CONDA_PREFIX --with-pydebug
    make EXTRA_CFLAGS='-DPy_DEBUG' -j<num_cores>
    make install

Using gprof
-----------

In order to profile compiled Python extensions one could use ``gprof``
after having recompiled the project with ``gcc -pg`` and using the
``python-dbg`` variant of the interpreter on debian / ubuntu: however
this approach also requires having ``numpy`` and ``scipy`` recompiled
with ``-pg``, which is rather complicated to get working.

Fortunately there exist two alternative profilers that don't require you to
recompile everything.

Using valgrind / callgrind / kcachegrind
----------------------------------------

kcachegrind
~~~~~~~~~~~

``yep`` can be used to create a profiling report.
``kcachegrind`` provides a graphical environment to visualize this report:

.. prompt:: bash $

  # Run yep to profile some python script
  python -m yep -c my_file.py

.. prompt:: bash $

  # open my_file.py.prof with kcachegrind
  kcachegrind my_file.py.prof

.. note::

  ``yep`` can be executed with the argument ``--lines`` or ``-l`` to compile
  a profiling report 'line by line'.

Multi-core parallelism using ``joblib.Parallel``
================================================

See the `joblib documentation <https://joblib.readthedocs.io>`_.

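As a minimal sketch of such coarse grained parallelism (the per-task function
and the data splits below are made up for illustration), each independent task
can be dispatched to a separate worker process::

  import numpy as np
  from joblib import Parallel, delayed

  def evaluate_one_split(X, y, train_idx, test_idx):
      """Hypothetical per-split work: fit on one part, score on the other."""
      # ... fit an estimator on X[train_idx], y[train_idx] here ...
      return float(np.mean(y[test_idx]))  # placeholder score

  rng = np.random.RandomState(0)
  X, y = rng.rand(1000, 10), rng.randint(0, 2, size=1000)
  splits = [(np.arange(0, 800), np.arange(800, 1000)),
            (np.arange(200, 1000), np.arange(0, 200))]

  # Each split is an independent, coarse grained task.
  scores = Parallel(n_jobs=2)(
      delayed(evaluate_one_split)(X, y, train, test) for train, test in splits)
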
.. _warm-restarts:

A simple algorithmic trick: warm restarts
=========================================

See the glossary entry for :term:`warm_start`.

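As an illustration of the idea (a minimal sketch, not the trick as used in any
particular solver), setting ``warm_start=True`` lets an estimator reuse the
result of the previous ``fit`` call when a parameter is increased
incrementally, here the number of trees of a random forest::

  from sklearn.datasets import load_digits
  from sklearn.ensemble import RandomForestClassifier

  X, y = load_digits(return_X_y=True)

  # With warm_start=True each fit call keeps the trees already built and
  # only grows the additional ones instead of restarting from scratch.
  clf = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
  for n_estimators in (10, 50, 100):
      clf.set_params(n_estimators=n_estimators)
      clf.fit(X, y)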