From: sirver Date: Sat, 7 Nov 2009 17:45:40 +0000 (+0100) Subject: Added a first version of profiling documentation X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=3038f5443e86739a20abf3ad3e1902da8ea949bf;p=cython.git Added a first version of profiling documentation --- diff --git a/conf.py b/conf.py index 8227ecea..8c59edea 100644 --- a/conf.py +++ b/conf.py @@ -29,7 +29,7 @@ import cython_highlighting # Add any Sphinx extension module names here, as strings. They can be extensions # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. -extensions = ['ipython_console_highlighting', 'cython_highlighting'] +extensions = ['ipython_console_highlighting', 'cython_highlighting', 'sphinx.ext.pngmath'] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] diff --git a/docs/profiling_tutorial.rst b/docs/profiling_tutorial.rst new file mode 100644 index 00000000..9d15dd43 --- /dev/null +++ b/docs/profiling_tutorial.rst @@ -0,0 +1,299 @@ +.. highlight:: cython + +.. _profiling_tutorial: + +****************** +Profiling Tutorial +****************** + +This part describes the profiling abilities of cython. If you are familiar of +pure Python code, you can only read the first section +(:ref:`profiling_basics`). If you are not familiar with python profiling you +should also read the tutorial TODO:reference which takes you through a complete +example step by step. + +.. _profiling_basics: + +Cython Profiling Basics +======================= + +There are essential two ways to active Profiling in Cython. The first is by a +directive to the compiler which is by convention one of the earliest lines in +the source code. The second is to enable or disable profiling per function +(TODO: also per class?) via a cython decorator. + +Enable profiling for a complete source file +------------------------------------------- + +Profiling is enable for a complete source file via a directive to the cython +compiler:: + + # cython: profile=True + +Note that profiling gives an overhead to each function call therefore making +your program a little slower (or a lot, if you call some small functions very +often). + +Once enabled, your cython code will behave just like python code when called +from the cProfile module. This means you can just profile your cython code +together with your Python code using the same tools as for Python code alone. + +Disabling profiling function wise +------------------------------------------ + +If your profiling is messed up because of the call overhead to some small +functions that you rather do not want to see in your profile - either because +you plan to inline them anyway or because you are sure that you can't make them +any faster you can use a special decorator to disable profiling for one +function only:: + + cimport cython + + @cython.profile(False) + def my_often_called_function(): + pass + + +Profiling Tutorial +================== + +This will be a complete tutorial, start to finish, of profiling python code, +turning it into cython code and keep profiling until it is fast enough. + +As a toy example, we would like to evaluate the summation of the reciprocals of +squares up to a certain integer :math:`n` for evaluating :math:`\pi`. The +relation we want to use has been proven by Euler in 1735 and is known as the +`Basel problem `_. + + +.. math:: + \pi^2 = 6 \sum_{k=1}^{\infty} \frac{1}{k^2} = + 6 \lim_{k \to \infty} \big( \frac{1}{1^2} + + \frac{1}{2^2} + \dots + \frac{1}{k^2} \big) \approx + 6 \big( \frac{1}{1^2} + \frac{1}{2^2} + \dots + \frac{1}{n^2} \big) + +A simple python code for evaluating the truncated sum looks like this:: + + #!/usr/bin/env python + # encoding: utf-8 + # filename: calc_pi.py + + def recip_square(i): + return 1./i**2 + + def approx_pi(n=10000000): + val = 0. + for k in range(1,n+1): + val += recip_square(k) + return (6 * val)**.5 + +On my box, this needs approximately 4 seconds to run the function with the +given n. The higher we choose n, the better will be the approximation for +:math:`\pi`. An experienced python programmer will already see plenty of +places to optimize this code. But remember the golden rule of optimization: +Never optimize without having profiled. Let me repeat this: **Never** optimize +without having profiled your code. Your thoughts about which part of your +code takes too much time are wrong. At least, mine are always wrong. So let's +write a short script to profile our code:: + + #!/usr/bin/env python + # encoding: utf-8 + # filename: profile.py + + import pstats, cProfile + + import calc_pi + + cProfile.runctx("calc_pi.approx_pi()", globals(), locals(), "Profile.prof") + + s = pstats.Stats("Profile.prof") + s.strip_dirs().sort_stats("time").print_stats() + +Running this on my box gives the following output:: + + TODO: how to display this not as code but verbatimly? + + Sat Nov 7 17:40:54 2009 Profile.prof + + 10000004 function calls in 6.211 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 3.243 3.243 6.211 6.211 calc_pi.py:7(approx_pi) + 10000000 2.526 0.000 2.526 0.000 calc_pi.py:4(recip_square) + 1 0.442 0.442 0.442 0.442 {range} + 1 0.000 0.000 6.211 6.211 :1() + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + +This contains the information that the code runs in 6.2 CPU seconds. Note that +the code got slower by 2 seconds because it ran inside the cProfile module. The +table contains the real valuable information. You might want to check the +python `profiling documentation `_ +for the nitty gritty details. The most important columns here are totime (total +time spend in this function **not** counting functions that were called by this +function) and cumtime (total time spend in this function **also** counting the +functions called by this function). Looking at the tottime column, we see that +approximately half the time is spend in approx_pi and the other half is spend +in recip_square. Also half a second is spend in range ... of course we should +have used xrange for such a big iteration. And in fact, just changing range to +xrange makes the code run in 5.8 seconds. + +We could optimize a lot in the pure python version, but since we are interested +in cython, let's move forward and bring this module to cython. We would do this +anyway at some time to get the loop run faster. Here is our first cython version:: + + # encoding: utf-8 + # cython: profile=True + # filename: calc_pi.pyx + + def recip_square(unsigned int i): + return 1./i**2 + + def approx_pi(unsigned int n=10000000): + cdef double val = 0. + cdef unsigned int k + for k in xrange(1,n+1): + val += recip_square(k) + return (6 * val)**.5 + +Note the second line: We have to tell cython that profiling should be enabled. +This makes the cython code slightly slower, but without this we would not get +meaningful output from the cProfile module. The rest of the code is mostly +unchanged, I only typed some variables which will likely speed things up a bit. + +We also need to modify our profiling script to import the cython module directly. +Here is the complete version adding the import of the pyximport module:: + + #!/usr/bin/env python + # encoding: utf-8 + # filename: profile.py + + import pstats, cProfile + + import pyximport + pyximport.install() + + import calc_pi + + cProfile.runctx("calc_pi.approx_pi()", globals(), locals(), "Profile.prof") + + s = pstats.Stats("Profile.prof") + s.strip_dirs().sort_stats("time").print_stats() + +We only added two lines, the rest stays completely the same. Alternatively, we could also +manually compile our code into an extension; we wouldn't need to change the +profile script then at all. The script now outputs the following:: + + Sat Nov 7 18:02:33 2009 Profile.prof + + 10000004 function calls in 4.406 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 3.305 3.305 4.406 4.406 calc_pi.pyx:7(approx_pi) + 10000000 1.101 0.000 1.101 0.000 calc_pi.pyx:4(recip_square) + 1 0.000 0.000 4.406 4.406 {calc_pi.approx_pi} + 1 0.000 0.000 4.406 4.406 :1() + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + +We gained 1.8 seconds. Not too shabby. Comparing the output to the previous, we +see that recip_square function got faster while the approx_pi function has not +changed a lot. Let's concentrate on the approx_pi function a bit more. First +note, that this function is not to be called from code outside of our module; +so it would be wise to turn it into a cdef to reduce call overhead. We should +also get rid of the power operator: it is turned into a pow(i,2) function call by +cython, but we could instead just write i*i which will be much faster. The +whole function is also a good candidate for inlining. Let's look at the +necessary changes for these ideas:: + + # encoding: utf-8 + # cython: profile=True + # filename: calc_pi.pyx + + cdef inline double recip_square(unsigned int i): + return 1./(i*i) + + def approx_pi(unsigned int n=10000000): + cdef double val = 0. + cdef unsigned int k + for k in xrange(1,n+1): + val += recip_square(k) + return (6 * val)**.5 + +Now running the profile script yields:: + + Sat Nov 7 18:10:11 2009 Profile.prof + + 10000004 function calls in 2.622 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 1.782 1.782 2.622 2.622 calc_pi.pyx:7(approx_pi) + 10000000 0.840 0.000 0.840 0.000 calc_pi.pyx:4(recip_square) + 1 0.000 0.000 2.622 2.622 {calc_pi.approx_pi} + 1 0.000 0.000 2.622 2.622 :1() + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + +That bought us another 1.8 seconds. Not the dramatic change we could have +expected. And why is recip_square still in this table; it is supposed to be +inlined, isn't it? The reason for this is that cython can't inline a function +if it is supposed to gather profiling information for it. Let's tell it to not +profile recip_square any more; we couldn't get the function to be much faster anyway:: + + # encoding: utf-8 + # cython: profile=True + # filename: calc_pi.pyx + + cimport cython + + @cython.profile(False) + cdef inline double recip_square(unsigned int i): + return 1./(i*i) + + def approx_pi(unsigned int n=10000000): + cdef double val = 0. + cdef unsigned int k + for k in xrange(1,n+1): + val += recip_square(k) + return (6 * val)**.5 + +Running this shows an interesting result:: + + Sat Nov 7 18:15:02 2009 Profile.prof + + 4 function calls in 0.089 CPU seconds + + Ordered by: internal time + + ncalls tottime percall cumtime percall filename:lineno(function) + 1 0.089 0.089 0.089 0.089 calc_pi.pyx:10(approx_pi) + 1 0.000 0.000 0.089 0.089 {calc_pi.approx_pi} + 1 0.000 0.000 0.089 0.089 :1() + 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} + +First note the tremendous speed gain: this version only takes 1/50 of the time +of our first cython version. Also note that recip_square has vanished from the +table like we wanted. But the most peculiar and import change is that +approx_pi also got much faster. This is a problem with all profiling: calling a +function in a profile run adds a certain overhead to the function call. This +overhead is **not** added to the time spend in the called function, but to the +time spend in the **calling** function. In this example, approx_pi didn't need 2.622 +seconds in the last run; but it called recip_square 10000000 times, each time taking a +little to set up profiling for it. This adds up to the massive time loss of +around 2.6 seconds. Having disable profiling for the often called function now +reveals realistic timings for approx_pi; we could continue optimizing it now if +needed. + +This concludes this profiling tutorial. There is still some room for +improvement in this code. We could try to replace the power operator in +approx_pi with a call to sqrt from the C stdlib; but this is not necessarily +faster than calling pow(x,0.5). + +Even so, the result we achieved here is quite satisfactory: we came up with a +solution that is much faster then our original python version while retaining +functionality and readability. + + diff --git a/index.rst b/index.rst index 70b8a929..e8d08446 100644 --- a/index.rst +++ b/index.rst @@ -17,6 +17,7 @@ Contents: docs/source_files_and_compilation docs/wrapping_CPlusPlus docs/numpy_tutorial + docs/profiling_tutorial docs/limitations docs/pyrex_differences docs/early_binding_for_speed