From: Joey Hess
Date: Thu, 29 May 2008 06:51:40 +0000 (-0400)
Subject: web commit by http://liw.fi/: uuml html entity in feeds confuses ikiwiki when aggregating
X-Git-Tag: 2.48~8^2~1
X-Git-Url: http://git.tremily.us/?p=ikiwiki.git;a=commitdiff_plain;h=f543303bf0042158ca5bc119681019ead4140662

web commit by liw.fi/: uuml html entity in feeds confuses ikiwiki when aggregating
---

diff --git a/doc/bugs/__38__uuml__59___in_markup_makes_ikiwiki_not_un-escape_HTML_at_all.mdwn b/doc/bugs/__38__uuml__59___in_markup_makes_ikiwiki_not_un-escape_HTML_at_all.mdwn
new file mode 100644
index 000000000..7e9bf84e2
--- /dev/null
+++ b/doc/bugs/__38__uuml__59___in_markup_makes_ikiwiki_not_un-escape_HTML_at_all.mdwn
@@ -0,0 +1,35 @@
+I'm experimenting with using Ikiwiki as a feed aggregator.
+
+The Planet Ubuntu RSS 2.0 feed () as of today
+has someone whose name contains the character u-with-umlaut. In HTML 4.0, this is
+specified as the character entity uuml. Ikiwiki 2.47 running on Debian etch does
+not seem to understand that entity, and decides not to un-escape any markup in
+the feed. This makes the feed hard to read.
+
+The following is the test input:
+
+
+
+    testfeed
+    http://example.com/
+    en
+    example
+
+    &uuml;
+    http://example.com
+    http://example.com
+    foo
+    Tue, 27 May 2008 22:42:42 +0000
+
+
+
+When I feed this to ikiwiki, it complains:
+"processed ok at 2008-05-29 09:44:14 (invalid UTF-8 stripped from feed) (feed entities escaped"
+
+Note also that the test input contains only pure ASCII, no UTF-8 at all.
+
+If I remove the ampersand in the title, ikiwiki has no problem. However, the entity is
+valid HTML, so it would be good for ikiwiki to understand it. At the minimum, stripping
+the offending entity but un-escaping the rest seems like a reasonable thing to do,
+unless that has security implications.
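
Editorial illustration (not part of the commit): XML predefines only five named entities (`amp`, `lt`, `gt`, `quot`, `apos`), so a strict XML parser rejects `&uuml;` unless a DTD declares it. Whether this is what trips ikiwiki's aggregator is an assumption; the report does not confirm it. The Python sketch below shows the general behavior and one possible workaround, rewriting HTML named entities as numeric character references before parsing. The `FEED` string is a hypothetical minimal feed modeled on the report, and `html_entities_to_numeric` and `parses` are illustrative helper names, not ikiwiki functions.

```python
import html.entities
import re
import xml.etree.ElementTree as ET

# The five entities every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Hypothetical minimal feed; the element names are an assumption.
FEED = (
    '<rss version="2.0"><channel><title>testfeed</title>'
    "<item><title>&uuml;</title></item></channel></rss>"
)


def html_entities_to_numeric(text):
    """Rewrite HTML named entities (e.g. &uuml;) as numeric references."""

    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # already valid XML, leave untouched
        codepoint = html.entities.name2codepoint.get(name)
        if codepoint is None:
            return match.group(0)  # unknown name, leave for the parser
        return "&#%d;" % codepoint

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


def parses(xml_text):
    """Return True if the text is accepted by a strict XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(parses(FEED))                            # raw feed with &uuml;
    print(parses(html_entities_to_numeric(FEED)))  # after rewriting
```

Numeric references keep the input pure ASCII, as in the report, while remaining valid XML. Unknown entity names are deliberately left alone so that the parser still reports them instead of silently dropping text.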