From: Stefan Behnel Date: Sun, 2 May 2010 08:34:51 +0000 (+0200) Subject: added section on source code encodings to string handling chapter X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=836aa220d81ac27922bb2ec67268cceaefe46886;p=cython.git added section on source code encodings to string handling chapter --- diff --git a/src/tutorial/strings.rst b/src/tutorial/strings.rst index b20452fe..ba4f5d02 100644 --- a/src/tutorial/strings.rst +++ b/src/tutorial/strings.rst @@ -21,6 +21,18 @@ original C string. It can be safely passed around in Python code, and will be garbage collected when the last reference to it goes out of scope. +Note that the creation of the Python bytes string can fail with an +exception, e.g. due to insufficient memory. If you need to ``free()`` +the string after the conversion, you should wrap the assignment in a +try-finally construct:: + + cimport stdlib + cdef char* c_string = c_call_returning_a_c_string() + try: + py_string = c_string + finally: + stdlib.free(c_string) + To convert the byte string back into a C ``char*``, use the opposite assignment:: @@ -40,11 +52,11 @@ management is required. Decoding bytes to text ---------------------- -The above way of passing and receiving C strings is as simple that -that, as long as we only deal with binary data in the strings. When -we deal with encoded text, however, it is best practice to decode the C byte -strings to Python Unicode strings on reception, and to encode Python -Unicode strings to C byte strings on the way out. +The initially presented way of passing and receiving C strings is +sufficient if your code only deals with binary data in the strings. +When we deal with encoded text, however, it is best practice to decode +the C byte strings to Python Unicode strings on reception, and to +encode Python Unicode strings to C byte strings on the way out. With a Python byte string object, you would normally just call the ``.decode()`` method to decode it into a Unicode string:: @@ -74,7 +86,7 @@ number of bytes by slicing the C string:: ustring = c_string[:length].decode('UTF-8') The same can be used when the string contains null bytes, e.g. when it -uses an encoding like UCS-2, where each character is encoded as two +uses an encoding like UCS-4, where each character is encoded in four bytes. It is common practice to wrap string conversions (and non-trivial type @@ -117,15 +129,69 @@ a memory managed byte string:: As noted before, this takes the pointer to the byte buffer of the Python byte string. Trying to do the same without keeping a reference -to the intermediate byte string will fail with a compile error:: +to the Python byte string will fail with a compile error:: # this will not compile ! cdef char* c_string = py_unicode_string.encode('UTF-8') Here, the Cython compiler notices that the code takes a pointer to a temporary string result that will be garbage collected after the -assignment. Later access to the invalidated pointer will most likely -result in a crash. Cython will therefore refuse to compile this code. +assignment. Later access to the invalidated pointer will read invalid +memory and likely result in a segfault. Cython will therefore refuse +to compile this code. + +Source code encoding +-------------------- + +When string literals appear in the code, the source code encoding is +important. It determines the byte sequence that Cython will store in +the C code for bytes literals, and the Unicode code points that Cython +builds for unicode literals when parsing the byte encoded source file. +Following `PEP 263`_, Cython supports the explicit declaration of +source file encodings. For example, putting the following comment at +the top of an ``ISO-8859-15`` (Latin-9) encoded source file (into the +first or second line) is required to enable ``ISO-8859-15`` decoding +in the parser:: + + # -*- coding: ISO-8859-15 -*- + +When no explicit encoding declaration is provided, the source code is +parsed as UTF-8 encoded text, as specified by `PEP 3120`_. `UTF-8`_ +is a very common encoding that can represent the entire Unicode set of +characters and is compatible with plain ASCII encoded text that it +encodes efficiently. This makes it a very good choice for source code +files which usually consist mostly of ASCII characters. + +.. _`PEP 263`: http://www.python.org/dev/peps/pep-0263/ +.. _`PEP 3120`: http://www.python.org/dev/peps/pep-3120/ +.. _`UTF-8`: http://en.wikipedia.org/wiki/UTF-8 + +As an example, putting the following line into a UTF-8 encoded source +file will print ``5``, as UTF-8 encodes the letter ``'ö'`` in the two +byte sequence ``'\xc3\xb6'``:: + + print( len(b'abcö') ) + +whereas the following ``ISO-8859-15`` encoded source file will print +``4``, as the encoding uses only 1 byte for this letter:: + + # -*- coding: ISO-8859-15 -*- + print( len(b'abcö') ) + +Note that the unicode literal ``u'abcö'`` is a correctly decoded four +character Unicode string in both cases, whereas the unprefixed Python +``str`` literal ``'abcö'`` will become a byte string in Python 2 (thus +having length 4 or 5 in the examples above), and a 4 character Unicode +string in Python 3. If you are not familiar with encodings, this may +not appear obvious at first read. See `CEP 108`_ for details. + +As a rule of thumb, it is best to avoid unprefixed non-ASCII ``str`` +literals and to use unicode string literals for all text. Cython also +supports the ``__future__`` import ``unicode_literals`` that instructs +the parser to read all unprefixed ``str`` literals in a source file as +unicode string literals. + +.. _`CEP 108`: http://wiki.cython.org/enhancements/stringliterals Single bytes and characters --------------------------- @@ -136,15 +202,18 @@ code point value, i.e. a single Unicode character. Since version 0.13, Cython supports the latter natively, which is either defined as an unsigned 2-byte or 4-byte integer, or as ``wchar_t``, depending on the platform. The exact type is a compile time option in the build of -the CPython interpreter. +the CPython interpreter and extension modules inherit this definition +at C compile time. In Cython, the ``char`` and ``Py_UNICODE`` types behave differently when coercing to Python objects. Similar to the behaviour of the bytes type in Python 3, the ``char`` type coerces to a Python integer value by default, so that the following prints 65 and not ``A``:: + # -*- coding: ASCII -*- + cdef char char_val = 'A' - assert char_val == 65 # 'A' + assert char_val == 65 # ASCII encoded byte value of 'A' print( char_val ) If you want a Python bytes string instead, you have to request it @@ -154,9 +223,9 @@ explicitly, and the following will print ``A`` (or ``b'A'`` in Python print( char_val ) The explicit coercion works for any C integer type. Values outside of -the range of a ``char`` will raise an ``OverflowError``. Coercion -will also happen automatically when assigning to a typed variable, -e.g.:: +the range of a ``char`` or ``unsigned char`` will raise an +``OverflowError``. Coercion will also happen automatically when +assigning to a typed variable, e.g.:: cdef bytes py_byte_string = char_val @@ -188,6 +257,8 @@ Cython 0.13 supports efficient iteration over ``char*``, bytes and unicode strings, as long as the loop variable is appropriately typed. So the following will generate the expected C code:: + # -*- coding: ASCII -*- + cdef char* c_string = c_call_returning_a_c_string() cdef char c