From 05f2628563a75b0cb32cd3c48be9c0ef33ec67ea Mon Sep 17 00:00:00 2001
From: "J. Lewis Muir"
Date: Tue, 10 Sep 2013 11:48:06 -0500
Subject: [PATCH] feed: don't emit error if parser able to auto-determine encoding

Some feeds have incorrectly declared encodings (e.g. the encoding
specified by the HTTP header does not match the encoding specified in
the XML declaration).  For such a feed, "r2e run" would emit an error
message similar to the following:

  processing error: document declared as us-ascii, but parsed as
  iso-8859-1: undeadly (http://undeadly.org/cgi?action=rss ->
  jlmuir@imca-cat.org)

In this particular case, the HTTP header indicated a content type of
"text/xml" with no "charset" parameter.  According to the feedparser
5.1.3 documentation (section "Introduction to Character Encoding"
[1]), this results in an encoding of US-ASCII.  But the served XML
document contains an encoding declaration of ISO-8859-1.

For this case and some others, feedparser is able to automatically
determine an encoding.  When it does, we emit a warning rather than an
error, and accept the automatically determined encoding.

We check for a successfully overridden encoding by looking at the bozo
bit and the bozo_exception.  If the bozo bit is set and the
bozo_exception is feedparser.CharacterEncodingOverride, the parser has
successfully overridden an incorrectly declared encoding.

Quoting from the feedparser 5.1.3 documentation, section "Handling
Incorrectly-Declared Encodings" [2]:

  Universal Feed Parser initially uses the rules specified in RFC 3023
  to determine the character encoding of the feed.  If parsing
  succeeds, then that's that.  If parsing fails, Universal Feed Parser
  sets the bozo bit to 1 and sets bozo_exception to
  feedparser.CharacterEncodingOverride.  Then it tries to reparse the
  feed with the following character encodings:

    1. the encoding specified in the XML declaration
    2. the encoding sniffed from the first four bytes of the document
       (as per Section F)
    3. the encoding auto-detected by the Universal Encoding Detector,
       if installed
    4. utf-8
    5. windows-1252

  If the character encoding can not be determined, Universal Feed
  Parser sets the bozo bit to 1 and sets bozo_exception to
  feedparser.CharacterEncodingUnknown.  In this case, parsed values
  will be strings, not Unicode strings.

References:

1. http://pythonhosted.org/feedparser/character-encoding.html#introduction-to-character-encoding
2. http://pythonhosted.org/feedparser/character-encoding.html#handling-incorrectly-declared-encodings

Signed-off-by: J. Lewis Muir
---
 rss2email/feed.py | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/rss2email/feed.py b/rss2email/feed.py
index 3999b0c..3d9654f 100644
--- a/rss2email/feed.py
+++ b/rss2email/feed.py
@@ -404,6 +404,11 @@ class Feed (object):
         elif isinstance(exc, _sax.SAXParseException):
             _LOG.error('sax parsing error: {}: {}'.format(exc, self))
             warned = True
+        elif (parsed.bozo and
+              isinstance(exc, _feedparser.CharacterEncodingOverride)):
+            _LOG.warning(
+                'incorrectly declared encoding: {}: {}'.format(exc, self))
+            warned = True
         elif parsed.bozo or exc:
             if exc is None:
                 exc = "can't process"
-- 
2.26.2