Added a first version of profiling documentation

author sirver <none@none>

Sat, 7 Nov 2009 17:45:40 +0000 (18:45 +0100)

committer sirver <none@none>

Sat, 7 Nov 2009 17:45:40 +0000 (18:45 +0100)
author sirver <none@none>
Sat, 7 Nov 2009 17:45:40 +0000 (18:45 +0100)
committer sirver <none@none>
Sat, 7 Nov 2009 17:45:40 +0000 (18:45 +0100)
diff --git a/conf.py b/conf.py

index 8227eceaccded4272f2072dced233bbbd01ddeb4..8c59edea484239c69e7c5359442c5508d576ddac 100644 (file)
--- a/conf.py
+++ b/conf.py
@@ -29,7 +29,7 @@ import cython_highlighting
  
  # Add any Sphinx extension module names here, as strings. They can be extensions
  # coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
-extensions = ['ipython_console_highlighting', 'cython_highlighting']
+extensions = ['ipython_console_highlighting', 'cython_highlighting', 'sphinx.ext.pngmath']
  
  # Add any paths that contain templates here, relative to this directory.
  templates_path = ['_templates']
diff --git a/docs/profiling_tutorial.rst b/docs/profiling_tutorial.rst

new file mode 100644 (file)

index 0000000..9d15dd4
--- /dev/null
+++ b/docs/profiling_tutorial.rst
@@ -0,0 +1,299 @@
+.. highlight:: cython
+
+.. _profiling_tutorial:
+
+******************
+Profiling Tutorial
+******************
+
+This part describes the profiling abilities of cython. If you are familiar of
+pure Python code, you can only read the first section
+(:ref:`profiling_basics`). If you are not familiar with python profiling you
+should also read the tutorial TODO:reference which takes you through a complete
+example step by step.
+
+.. _profiling_basics:
+
+Cython Profiling Basics
+=======================
+
+There are essential two ways to active Profiling in Cython. The first is by a
+directive to the compiler which is by convention one of the earliest lines in
+the source code. The second is to enable or disable profiling per function
+(TODO: also per class?) via a cython decorator.
+
+Enable profiling for a complete source file
+-------------------------------------------
+
+Profiling is enable for a complete source file via a directive to the cython
+compiler::
+   
+   # cython: profile=True
+
+Note that profiling gives an overhead to each function call therefore making
+your program a little slower (or a lot, if you call some small functions very
+often).
+
+Once enabled, your cython code will behave just like python code when called
+from the cProfile module. This means you can just profile your cython code
+together with your Python code using the same tools as for Python code alone. 
+
+Disabling profiling function wise
+------------------------------------------
+
+If your profiling is messed up because of the call overhead to some small
+functions that you rather do not want to see in your profile - either because
+you plan to inline them anyway or because you are sure that you can't make them
+any faster you can use a special decorator to disable profiling for one
+function only::
+
+   cimport cython
+
+   @cython.profile(False)
+   def my_often_called_function():
+      pass
+
+
+Profiling Tutorial
+==================
+
+This will be a complete tutorial, start to finish, of profiling python code,
+turning it into cython code and keep profiling until it is fast enough. 
+
+As a toy example, we would like to evaluate the summation of the reciprocals of
+squares up to a certain integer :math:`n` for evaluating :math:`\pi`. The
+relation we want to use has been proven by Euler in 1735 and is known as the
+`Basel problem <http://en.wikipedia.org/wiki/Basel_problem>`_.
+
+
+.. math::
+   \pi^2 = 6 \sum_{k=1}^{\infty} \frac{1}{k^2} = 
+   6 \lim_{k \to \infty} \big( \frac{1}{1^2} + 
+         \frac{1}{2^2} + \dots + \frac{1}{k^2}  \big) \approx
+   6 \big( \frac{1}{1^2} + \frac{1}{2^2} + \dots + \frac{1}{n^2}  \big)
+
+A simple python code for evaluating the truncated sum looks like this::
+
+   #!/usr/bin/env python
+   # encoding: utf-8
+   # filename: calc_pi.py
+
+   def recip_square(i):
+       return 1./i**2
+
+   def approx_pi(n=10000000):
+       val = 0.
+       for k in range(1,n+1):
+           val += recip_square(k)
+       return (6 * val)**.5
+      
+On my box, this needs approximately 4 seconds to run the function with the
+given n. The higher we choose n, the better will be the approximation for
+:math:`\pi`. An experienced python programmer will already see plenty of
+places to optimize this code. But remember the golden rule of optimization:
+Never optimize without having profiled. Let me repeat this: **Never** optimize
+without having profiled your code. Your thoughts about which part of your
+code takes too much time are wrong. At least, mine are always wrong. So let's
+write a short script to profile our code::
+
+   #!/usr/bin/env python
+   # encoding: utf-8
+   # filename: profile.py
+
+   import pstats, cProfile
+
+   import calc_pi
+
+   cProfile.runctx("calc_pi.approx_pi()", globals(), locals(), "Profile.prof")
+
+   s = pstats.Stats("Profile.prof")
+   s.strip_dirs().sort_stats("time").print_stats()
+
+Running this on my box gives the following output::
+
+   TODO: how to display this not as code but verbatimly? 
+
+   Sat Nov  7 17:40:54 2009    Profile.prof
+
+            10000004 function calls in 6.211 CPU seconds
+
+      Ordered by: internal time
+
+      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
+           1    3.243    3.243    6.211    6.211 calc_pi.py:7(approx_pi)
+    10000000    2.526    0.000    2.526    0.000 calc_pi.py:4(recip_square)
+           1    0.442    0.442    0.442    0.442 {range}
+           1    0.000    0.000    6.211    6.211 <string>:1(<module>)
+           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
+
+This contains the information that the code runs in 6.2 CPU seconds. Note that
+the code got slower by 2 seconds because it ran inside the cProfile module. The
+table contains the real valuable information.  You might want to check the
+python `profiling documentation <http://docs.python.org/library/profile.html>`_
+for the nitty gritty details. The most important columns here are totime (total
+time spend in this function **not** counting functions that were called by this
+function) and cumtime (total time spend in this function **also** counting the
+functions called by this function). Looking at the tottime column, we see that 
+approximately half the time is spend in approx_pi and the other half is spend
+in recip_square. Also half a second is spend in range ... of course we should
+have used xrange for such a big iteration. And in fact, just changing range to
+xrange makes the code run in 5.8 seconds.
+
+We could optimize a lot in the pure python version, but since we are interested
+in cython, let's move forward and bring this module to cython. We would do this
+anyway at some time to get the loop run faster. Here is our first cython version::
+
+   # encoding: utf-8
+   # cython: profile=True
+   # filename: calc_pi.pyx
+
+   def recip_square(unsigned int i):
+       return 1./i**2
+
+   def approx_pi(unsigned int n=10000000):
+       cdef double val = 0.
+       cdef unsigned int k
+       for k in xrange(1,n+1):
+           val += recip_square(k)
+       return (6 * val)**.5
+
+Note the second line: We have to tell cython that profiling should be enabled.
+This makes the cython code slightly slower, but without this we would not get
+meaningful output from the cProfile module. The rest of the code is mostly
+unchanged, I only typed some variables which will likely speed things up a bit. 
+
+We also need to modify our profiling script to import the cython module directly.
+Here is the complete version adding the import of the pyximport module::
+
+   #!/usr/bin/env python
+   # encoding: utf-8
+   # filename: profile.py
+
+   import pstats, cProfile
+
+   import pyximport
+   pyximport.install()
+
+   import calc_pi
+
+   cProfile.runctx("calc_pi.approx_pi()", globals(), locals(), "Profile.prof")
+
+   s = pstats.Stats("Profile.prof")
+   s.strip_dirs().sort_stats("time").print_stats()
+
+We only added two lines, the rest stays completely the same. Alternatively, we could also
+manually compile our code into an extension; we wouldn't need to change the
+profile script then at all. The script now outputs the following::
+
+   Sat Nov  7 18:02:33 2009    Profile.prof
+
+            10000004 function calls in 4.406 CPU seconds
+
+      Ordered by: internal time
+
+      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
+           1    3.305    3.305    4.406    4.406 calc_pi.pyx:7(approx_pi)
+    10000000    1.101    0.000    1.101    0.000 calc_pi.pyx:4(recip_square)
+           1    0.000    0.000    4.406    4.406 {calc_pi.approx_pi}
+           1    0.000    0.000    4.406    4.406 <string>:1(<module>)
+           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
+
+We gained 1.8 seconds. Not too shabby. Comparing the output to the previous, we
+see that recip_square function got faster while the approx_pi function has not
+changed a lot. Let's concentrate on the approx_pi function a bit more. First
+note, that this function is not to be called from code outside of our module;
+so it would be wise to turn it into a cdef to reduce call overhead. We should
+also get rid of the power operator: it is turned into a pow(i,2) function call by
+cython, but we could instead just write i*i which will be much faster. The
+whole function is also a good candidate for inlining.  Let's look at the
+necessary changes for these ideas::
+
+   # encoding: utf-8
+   # cython: profile=True
+   # filename: calc_pi.pyx
+
+   cdef inline double recip_square(unsigned int i):
+       return 1./(i*i)
+
+   def approx_pi(unsigned int n=10000000):
+       cdef double val = 0.
+       cdef unsigned int k
+       for k in xrange(1,n+1):
+           val += recip_square(k)
+       return (6 * val)**.5
+
+Now running the profile script yields::
+
+   Sat Nov  7 18:10:11 2009    Profile.prof
+
+            10000004 function calls in 2.622 CPU seconds
+
+      Ordered by: internal time
+
+      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
+           1    1.782    1.782    2.622    2.622 calc_pi.pyx:7(approx_pi)
+    10000000    0.840    0.000    0.840    0.000 calc_pi.pyx:4(recip_square)
+           1    0.000    0.000    2.622    2.622 {calc_pi.approx_pi}
+           1    0.000    0.000    2.622    2.622 <string>:1(<module>)
+           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
+
+That bought us another 1.8 seconds. Not the dramatic change we could have
+expected. And why is recip_square still in this table; it is supposed to be
+inlined, isn't it?  The reason for this is that cython can't inline a function
+if it is supposed to gather profiling information for it. Let's tell it to not
+profile recip_square any more; we couldn't get the function to be much faster anyway::
+
+   # encoding: utf-8
+   # cython: profile=True
+   # filename: calc_pi.pyx
+
+   cimport cython
+
+   @cython.profile(False)
+   cdef inline double recip_square(unsigned int i):
+       return 1./(i*i)
+
+   def approx_pi(unsigned int n=10000000):
+       cdef double val = 0.
+       cdef unsigned int k
+       for k in xrange(1,n+1):
+           val += recip_square(k)
+       return (6 * val)**.5
+
+Running this shows an interesting result::
+
+   Sat Nov  7 18:15:02 2009    Profile.prof
+
+            4 function calls in 0.089 CPU seconds
+
+      Ordered by: internal time
+
+      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
+           1    0.089    0.089    0.089    0.089 calc_pi.pyx:10(approx_pi)
+           1    0.000    0.000    0.089    0.089 {calc_pi.approx_pi}
+           1    0.000    0.000    0.089    0.089 <string>:1(<module>)
+           1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
+
+First note the tremendous speed gain: this version only takes 1/50 of the time
+of our first cython version. Also note that recip_square has vanished from the
+table like we wanted. But the most peculiar and import change is that
+approx_pi also got much faster. This is a problem with all profiling: calling a
+function in a profile run adds a certain overhead to the function call. This
+overhead is **not** added to the time spend in the called function, but to the
+time spend in the **calling** function. In this example, approx_pi didn't need 2.622
+seconds in the last run; but it called recip_square 10000000 times, each time taking a
+little to set up profiling for it. This adds up to the massive time loss of
+around 2.6 seconds. Having disable profiling for the often called function now
+reveals realistic timings for approx_pi; we could continue optimizing it now if
+needed.
+
+This concludes this profiling tutorial. There is still some room for
+improvement in this code. We could try to replace the power operator in
+approx_pi with a call to sqrt from the C stdlib; but this is not necessarily
+faster than calling pow(x,0.5).
+
+Even so, the result we achieved here is quite satisfactory: we came up with a
+solution that is much faster then our original python version while retaining
+functionality and readability.
+
+
diff --git a/index.rst b/index.rst

index 70b8a9295eb758b129e5b2701a66559458bbfeb7..e8d084461ece3da8c6d6baf57205308004c74ae0 100644 (file)
--- a/index.rst
+++ b/index.rst
@@ -17,6 +17,7 @@ Contents:
     docs/source_files_and_compilation
     docs/wrapping_CPlusPlus
     docs/numpy_tutorial
+   docs/profiling_tutorial
     docs/limitations
     docs/pyrex_differences
     docs/early_binding_for_speed
author	sirver <none@none>
	Sat, 7 Nov 2009 17:45:40 +0000 (18:45 +0100)
committer	sirver <none@none>
	Sat, 7 Nov 2009 17:45:40 +0000 (18:45 +0100)
conf.py		patch \| blob \| history
docs/profiling_tutorial.rst	[new file with mode: 0644]	patch \| blob
index.rst		patch \| blob \| history