Some feeds have incorrectly declared encodings (e.g. the encoding
specified by the HTTP header does not match the encoding specified in
the XML declaration). For such a feed, "r2e run" would emit an error
message similar to the following:

    processing error: document declared as us-ascii, but parsed as
    iso-8859-1: undeadly (http://undeadly.org/cgi?action=rss ->
    jlmuir@imca-cat.org)

In this particular case, the HTTP header indicated a content type of
"text/xml" with no "charset" parameter. According to the feedparser
5.1.3 documentation (section "Introduction to Character Encoding" [1]),
this results in an encoding of US-ASCII. But the served XML document
contains an encoding declaration of ISO-8859-1.
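
As an illustration of the mismatch described above, the following is a minimal sketch of how the HTTP-level encoding could be derived from a Content-Type header under the RFC 3023 rule mentioned in the feedparser documentation. The function name http_declared_encoding is hypothetical and is not part of feedparser or rss2email; the sketch covers only the "text/xml" case discussed here.

```python
from email.message import Message


def http_declared_encoding(content_type):
    """Return the encoding implied by an HTTP Content-Type header value.

    Hypothetical helper for illustration only; it handles just the
    "text/xml" case discussed in this commit message.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset()
    if charset:
        # An explicit "charset" parameter wins.
        return charset
    if msg.get_content_type() == 'text/xml':
        # RFC 3023: text/xml without a charset parameter means US-ASCII.
        return 'us-ascii'
    # Other media types are outside the scope of this sketch.
    return None
```

For the undeadly feed, the header value "text/xml" would yield us-ascii here, while the XML declaration inside the document says ISO-8859-1, which is the conflict feedparser reports.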
For this case and some others, feedparser is able to automatically
determine an encoding. When it does, we emit a warning rather than an
error, and accept the automatically determined encoding.
We detect a successfully overridden encoding by looking at the bozo
bit and the bozo_exception. If the bozo bit is set and the
bozo_exception is an instance of feedparser.CharacterEncodingOverride,
the parser has successfully overridden an incorrectly declared
encoding. Quoting from the feedparser 5.1.3 documentation, section
"Handling Incorrectly-Declared Encodings" [2]:

    Universal Feed Parser initially uses the rules specified in RFC 3023
    to determine the character encoding of the feed. If parsing succeeds,
    then that's that. If parsing fails, Universal Feed Parser sets the
    bozo bit to 1 and sets bozo_exception to
    feedparser.CharacterEncodingOverride. Then it tries to reparse the
    feed with the following character encodings:

    1. the encoding specified in the XML declaration
    2. the encoding sniffed from the first four bytes of the document (as
       per Section F)
    3. the encoding auto-detected by the Universal Encoding Detector, if
       installed
    4. utf-8
    5. windows-1252

    If the character encoding can not be determined, Universal Feed Parser
    sets the bozo bit to 1 and sets bozo_exception to
    feedparser.CharacterEncodingUnknown. In this case, parsed values will
    be strings, not Unicode strings.

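
The check described above can be sketched in isolation as follows. This is only an illustration, not the code in this patch: check_encoding and its return values are hypothetical, and the CharacterEncodingOverride class defined here is a stand-in for feedparser.CharacterEncodingOverride so that the sketch runs without feedparser installed.

```python
import logging

_LOG = logging.getLogger('encoding-example')


class CharacterEncodingOverride(Exception):
    """Hypothetical stand-in for feedparser.CharacterEncodingOverride."""


def check_encoding(bozo, exc, override_type=CharacterEncodingOverride):
    """Classify a parse result as 'ok', 'warning', or 'error'.

    bozo and exc correspond to the parser's bozo bit and bozo_exception.
    """
    if bozo and isinstance(exc, override_type):
        # The parser recovered by overriding the declared encoding,
        # so a warning is enough.
        _LOG.warning('incorrectly declared encoding: %s', exc)
        return 'warning'
    if bozo or exc:
        # Any other bozo condition is still treated as an error.
        _LOG.error('processing error: %s', exc or "can't process")
        return 'error'
    return 'ok'
```

With a real feedparser result, one would presumably pass parsed.bozo, parsed.get('bozo_exception'), and feedparser.CharacterEncodingOverride as override_type.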
References:

[1] http://pythonhosted.org/feedparser/character-encoding.html#introduction-to-character-encoding
[2] http://pythonhosted.org/feedparser/character-encoding.html#handling-incorrectly-declared-encodings

Signed-off-by: J. Lewis Muir <jlmuir@imca-cat.org>
         elif isinstance(exc, _sax.SAXParseException):
             _LOG.error('sax parsing error: {}: {}'.format(exc, self))
             warned = True
+        elif (parsed.bozo and
+              isinstance(exc, _feedparser.CharacterEncodingOverride)):
+            _LOG.warning(
+                'incorrectly declared encoding: {}: {}'.format(exc, self))
+            warned = True
         elif parsed.bozo or exc:
             if exc is None:
                 exc = "can't process"