feed: Catch parsing errors during html2text
author W. Trevor King <wking@tremily.us>
Wed, 20 Mar 2013 09:27:03 +0000 (05:27 -0400)
committer W. Trevor King <wking@tremily.us>
Wed, 20 Mar 2013 09:27:03 +0000 (05:27 -0400)
commit a3719f88fbd2faed3418c8391c3245465b4b850b
tree b0d272d14aa3bf90e1558c7a91b85bc0e90e347c
parent a88738f9905f306b989ff62983518214c5018c0f

This avoids crashing with:

  Traceback (most recent call last):
    ...
    File ".../rss2email/feed.py", line 732, in _process_entry_content
      lines = [_html2text.html2text(content['value'])]
    ...
    File "/usr/lib/python3.2/html/parser.py", line 149, in error
      raise HTMLParseError(message, self.getpos())
  html.parser.HTMLParseError: EOF in middle of construct, at line 1, column 262
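
As a rough illustration of the idea (not the actual change to
rss2email/feed.py, which is not shown here), the conversion can be
wrapped so that a parse error is reported instead of propagating.
The names strict_html_to_text and entry_text are hypothetical, the
converter is a toy stand-in for html2text.html2text(), and falling
back to the raw markup is an assumed policy.  HTMLParseError is
imported when the running Python still provides it (as the 3.2 in
the traceback does) and replaced by a local stand-in otherwise:

```python
import re

try:
    # Available in the Python 3.2 from the traceback above.
    from html.parser import HTMLParseError
except ImportError:
    class HTMLParseError(Exception):
        """Local stand-in so the sketch runs where html.parser lacks it."""


def strict_html_to_text(html):
    """Hypothetical strict converter standing in for html2text.html2text()."""
    # A '<' with no later '>' is a tag cut off by the end of the input.
    last_open = html.rfind('<')
    if last_open != -1 and '>' not in html[last_open:]:
        raise HTMLParseError('EOF in middle of construct')
    # Drop complete tags and keep the text between them.
    return re.sub(r'<[^>]*>', '', html)


def entry_text(value, convert=strict_html_to_text):
    """Return (text, error); error is None when the conversion succeeded."""
    try:
        return convert(value), None
    except HTMLParseError as e:
        # Assumed fallback: keep the raw markup and hand the error back.
        return value, e


text, error = entry_text('studied using solid-state <sup....')
```

With this wrapper a single malformed entry yields an error value the
caller can log or skip, rather than aborting the whole run.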

The troublesome feed was:

  $ wget -S http://www.cell.com/rssFeed/biophysj/rss.NewIssueAndArticles.xml
  --2013-03-20 05:22:08--  http://www.cell.com/rssFeed/biophysj/rss.NewIssueAndArticles.xml
  Resolving www.cell.com... 145.36.42.28
  Connecting to www.cell.com|145.36.42.28|:80... connected.
  HTTP request sent, awaiting response...
    HTTP/1.1 200 OK
    Date: Wed, 20 Mar 2013 09:23:19 GMT
    Server: IBM_HTTP_Server
    Last-Modified: Tue, 19 Mar 2013 22:00:04 GMT
    Accept-Ranges: bytes
    Content-Length: 15362
    Vary: Accept-Encoding
    Keep-Alive: timeout=10, max=100
    Connection: Keep-Alive
    Content-Type: text/xml
  Length: 15362 (15K) [text/xml]
  Saving to: ‘rss.NewIssueAndArticles.xml’

  100%[======================================>] 15,362      94.1KB/s   in 0.2s

  2013-03-20 05:22:08 (94.1 KB/s) - ‘rss.NewIssueAndArticles.xml’ saved [15362/15362]

which contained this summary, cut off in the middle of a tag:

  <item>
    <title>Synergistic Insertion of Antimicrobial Magainin-Family Peptides in Membranes Depends on the Lipid Spontaneous Curvature</title>
    <link>http://www.cell.com/biophysj/abstract/S0006-3495(13)00153-7</link>
    <description>Erik Strandberg, Jonathan Zerweck, Parvesh Wadhwani, Anne S. Ulrich. PGLa and magainin 2 (MAG2) are amphiphilic antimicrobial peptides from frog skin with known synergistic activity. The orientation of the two helices in membranes was studied using solid-state &lt;sup....</description>
    <pubDate>Tue, 19 Mar 2013 00:00:00 GMT</pubDate>
    <guid>http://www.cell.com/biophysj/abstract/S0006-3495(13)00153-7</guid>
    <dc:date>2013-03-19T00:00:00Z</dc:date>
  </item>

The truncated '<sup....' at the end of the description (an escaped
'<sup' tag that is never closed) broke the parser.
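
To make the failure concrete, the short check below parses a trimmed
copy of the item (only the description element, with its leading
sentences dropped) and looks at what remains after XML unescaping.
This is a sketch using the standard library's xml.etree.ElementTree,
not the parsing path rss2email itself uses; the variable names are
illustrative only:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <item> above: only the tail of the description.
item_xml = (
    '<item>'
    '<description>The orientation of the two helices in membranes '
    'was studied using solid-state &lt;sup....</description>'
    '</item>'
)

# XML parsing turns the escaped '&lt;' back into a literal '<'.
description = ET.fromstring(item_xml).findtext('description')

# Everything from the last '<' onward; a complete tag would contain '>'.
tail = description[description.rfind('<'):]
unterminated = '>' not in tail

print(repr(tail), unterminated)
```

The decoded text therefore ends inside an open tag, which matches the
"EOF in middle of construct" message in the traceback.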

Signed-off-by: W. Trevor King <wking@tremily.us>
rss2email/feed.py