I often struggle to find bottlenecks in my cython code. How can I profile cython functions line-by-line?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Robert Bradshaw helped me to get Robert Kern’s line_profiler tool working for cdef functions and I thought I’d share the results on stackoverflow.
In short, set up a regular .pyx file and build script and add the following before your call to cythonize.
# Thanks to @tryptofame for proposing an updated snippet from Cython.Compiler.Options import get_directive_defaults directive_defaults = get_directive_defaults() directive_defaults['linetrace'] = True directive_defaults['binding'] = True
Furthermore, you need to define the C macro CYTHON_TRACE=1 by modifying your extensions setup such that
extensions = [
Extension("test", ["test.pyx"], define_macros=[('CYTHON_TRACE', '1')])
]
A working example using the %%cython magic in the iPython notebook is here:
http://nbviewer.ipython.org/gist/tillahoffmann/296501acea231cbdf5e7
Method 2
While @Till’s answer shows the way for profiling Cython-code using setup.py-approach, this answer is about ad-hoc profiling in IPython/Jupiter notebook and is more or less “translation” of Cython-documentation to IPython/Jupiter.
%prun-magic:
If %prun-magic should be used, then it is enough to set the Cython’s compiler directive profile to True (here with example from Cython-documentation):
%%cython
# cython: profile=True
def recip_square(i):
return 1. / i ** 3
def approx_pi(n=10000000):
val = 0.
for k in range(1, n + 1):
val += recip_square(k)
return (6 * val) ** .5
Using global directive (i.e. # cython: profile=True) is a better way, than modifying the global Cython-state, because changing it will lead to extension being recompiled (which is not the case if the global Cython state is changed – the old cached version compiled with old global state will be reloaded/reused).
And now
%prun -s cumulative approx_pi(1000000)
yields:
1000005 function calls in 1.860 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.860 1.860 {built-in method builtins.exec}
1 0.000 0.000 1.860 1.860 <string>:1(<module>)
1 0.000 0.000 1.860 1.860 {_cython_magic_404d18ea6452e5ffa4c993f6a6e15b22.approx_pi}
1 0.612 0.612 1.860 1.860 _cython_magic_404d18ea6452e5ffa4c993f6a6e15b22.pyx:7(approx_pi)
1000000 1.248 0.000 1.248 0.000 _cython_magic_404d18ea6452e5ffa4c993f6a6e15b22.pyx:4(recip_square)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
%lprun-magic
If the line profiler (i.e. %lprun-magic) should be used, then the Cython module should be compiled with different directives:
%%cython # cython: linetrace=True # cython: binding=True # distutils: define_macros=CYTHON_TRACE_NOGIL=1 ...
linetrace=True triggers creation of the tracing in the generated C-code and implies profile=True so it must not be set in addition. Without binding=True line_profiler doesn’t have the necessary code-information and CYTHON_TRACE_NOGIL=1 is needed, so the line profiling is also activated when compiled with the C-compiler (and not thrown away by the C-preprocessor). It is also possible to use CYTHON_TRACE=1 if nogil-blocks should not be profiled on per-line basis.
Now it can be used for example as follows, passing the functions, which should be line-profiled via -f option (use %lprun? to get info about possible options):
%load_ext line_profiler %lprun -f approx_pi -f recip_square approx_pi(1000000)
which yields:
Timer unit: 1e-06 s
Total time: 1.9098 s
File: /XXXX.pyx
Function: recip_square at line 5
Line # Hits Time Per Hit % Time Line Contents
==============================================================
5 def recip_square(i):
6 1000000 1909802.0 1.9 100.0 return 1. / i ** 2
Total time: 6.54676 s
File: /XXXX.pyx
Function: approx_pi at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 def approx_pi(n=10000000):
9 1 3.0 3.0 0.0 val = 0.
10 1000001 1155778.0 1.2 17.7 for k in range(1, n + 1):
11 1000000 5390972.0 5.4 82.3 val += recip_square(k)
12 1 9.0 9.0 0.0 return (6 * val) ** .5
line_profiler has however a minor hiccup with the cpdef-function: it doesn’t detect the function body correctly. In this SO-post, a possible workaround is shown.
One should be aware, that profiling (all above line profiling) changes the execution-time and its distribution compared with “normal” run. Here we see, that different times are needed depending on type of profiling, for the same function:
Method (N=10^6): Running Time: Build with: %timeit 1 second %prun 2 seconds profile=True %lprun 6.5 seconds linetrace=True,binding=True,CYTHON_TRACE_NOGIL=1
Method 3
Although I wouldn’t really call it profiling, there is another option to analyze your Cython code by running cython with -a (annotate), this creates a webpage in which the main bottlenecks are highlighted. For example, when I forget to declare some variables:
After correctly declaring them (cdef double dudz, dvdz):
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

