feed: Catch parsing errors during html2text
This avoids crashing with:
  Traceback (most recent call last):
    ...
    File ".../rss2email/feed.py", line 732, in _process_entry_content
      lines = [_html2text.html2text(content['value'])]
    ...
    File "/usr/lib/python3.2/html/parser.py", line 149, in error
      raise HTMLParseError(message, self.getpos())
  html.parser.HTMLParseError: EOF in middle of construct, at line 1, column 262
The troublesome feed was:
$ wget -S http://www.cell.com/rssFeed/biophysj/rss.NewIssueAndArticles.xml
--2013-03-20 05:22:08-- http://www.cell.com/rssFeed/biophysj/rss.NewIssueAndArticles.xml
Resolving www.cell.com... 145.36.42.28
Connecting to www.cell.com|145.36.42.28|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 20 Mar 2013 09:23:19 GMT
Server: IBM_HTTP_Server
Last-Modified: Tue, 19 Mar 2013 22:00:04 GMT
Accept-Ranges: bytes
Content-Length: 15362
Vary: Accept-Encoding
Keep-Alive: timeout=10, max=100
Connection: Keep-Alive
Content-Type: text/xml
Length: 15362 (15K) [text/xml]
Saving to: ‘rss.NewIssueAndArticles.xml’
100%[======================================>] 15,362 94.1KB/s in 0.2s
2013-03-20 05:22:08 (94.1 KB/s) - ‘rss.NewIssueAndArticles.xml’ saved [15362/15362]
which contained a summary that was truncated in the middle of a tag:
  <item>
    <title>Synergistic Insertion of Antimicrobial Magainin-Family Peptides in Membranes Depends on the Lipid Spontaneous Curvature</title>
    <link>http://www.cell.com/biophysj/abstract/S0006-3495(13)00153-7</link>
    <description>Erik Strandberg, Jonathan Zerweck, Parvesh Wadhwani, Anne S. Ulrich. PGLa and magainin 2 (MAG2) are amphiphilic antimicrobial peptides from frog skin with known synergistic activity. The orientation of the two helices in membranes was studied using solid-state <sup....</description>
    <pubDate>Tue, 19 Mar 2013 00:00:00 GMT</pubDate>
    <guid>http://www.cell.com/biophysj/abstract/S0006-3495(13)00153-7</guid>
    <dc:date>2013-03-19T00:00:00Z</dc:date>
  </item>
The unterminated '<sup....' at the end of the description broke the parser.
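As a rough illustration of the catch-and-fall-back idea (not the actual change to feed.py), here is a minimal, self-contained sketch. Every name in it is hypothetical: ParseError stands in for html.parser.HTMLParseError, which is not available in current Python standard libraries; strict_html2text stands in for a converter that rejects an unterminated tag; and returning the raw content on failure is an assumed fallback, not necessarily what rss2email does.

```python
import logging

LOG = logging.getLogger('html2text-sketch')  # hypothetical logger name


class ParseError(Exception):
    """Stand-in for html.parser.HTMLParseError (assumption for this sketch)."""


def strict_html2text(html):
    """Hypothetical converter that rejects an unterminated trailing tag."""
    last_open = html.rfind('<')
    if last_open != -1 and '>' not in html[last_open:]:
        raise ParseError('EOF in middle of construct')
    return html  # a real converter would render the markup as plain text


def process_content(html, converter=strict_html2text):
    """Convert html, falling back to the raw content if parsing fails."""
    try:
        return converter(html)
    except ParseError as error:
        # Log and carry on instead of letting one bad entry abort the run.
        LOG.warning('could not convert HTML to text: %s', error)
        return html  # assumed fallback: pass the content through unchanged
```

With this sketch, process_content('solid-state <sup....') logs a warning and returns the input unchanged, while well-formed input goes through the converter as usual.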
Signed-off-by: W. Trevor King <wking@tremily.us>