irkerd: Handle UnicodeDecodeError in LineProtocol.data_received
author W. Trevor King <wking@tremily.us>
Fri, 30 May 2014 23:46:54 +0000 (16:46 -0700)
committer W. Trevor King <wking@tremily.us>
Fri, 30 May 2014 23:46:54 +0000 (16:46 -0700)
I just got the following in a message-of-the-day from
leguin.freenode.net:

  Welcome to leguin.freenode.net in Ume\xe5, Sweden, EU!

Here Ume\xe5 is Ume{U+00E5 LATIN SMALL LETTER A WITH RING ABOVE}, and
\xe5 is the ISO-8859-1 encoding of that character.  Important messages
from the IRC server should be in ASCII anyway [1]:

  Regardless of being an 8-bit protocol, the delimiters and keywords
  are such that protocol is mostly usable from US-ASCII terminal and a
  telnet connection.

So rather than trying some fancy charset-detection heuristics, just
log a warning and drop lines that don't decode properly.

[1]: http://tools.ietf.org/html/rfc2812#section-2.2
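
As a rough illustration of the behaviour described above, here is a
minimal standalone sketch.  It assumes the configured encoding is
UTF-8 (an assumption, since the message does not say what
self.encoding is), and decode_lines is a hypothetical helper written
for this example, not a function in irkerd.  It follows the same
try/except/else shape as the patch: lines that fail to decode are
logged and skipped, and everything else is passed on.

```python
import logging

LOG = logging.getLogger("lineprotocol-sketch")


def decode_lines(raw_lines, encoding="utf-8"):
    """Return the lines that decode cleanly; log and drop the rest.

    Hypothetical helper for illustration only.  The default encoding
    is an assumption, not something taken from irkerd.
    """
    decoded = []
    for raw in raw_lines:
        try:
            line = str(raw, encoding).strip()
        except UnicodeDecodeError as e:
            # The raw bytes are logged so the offending line can be inspected.
            LOG.warning("invalid encoding in %r (%s)", raw, e)
        else:
            decoded.append(line)
    return decoded


# The message-of-the-day line from the commit message, as raw bytes.
motd = b"Welcome to leguin.freenode.net in Ume\xe5, Sweden, EU!\r\n"
# A plain ASCII protocol line, which decodes under either encoding.
ping = b"PING :leguin.freenode.net\r\n"

# Under UTF-8 the MOTD line is dropped and only the ASCII line survives.
utf8_lines = decode_lines([motd, ping])

# Under ISO-8859-1 the same bytes decode, since every byte maps to a character.
latin1_lines = decode_lines([motd], encoding="iso-8859-1")
```

The ASCII-only line survives under either encoding, which is why
dropping the occasional undecodable line should lose little of the
protocol traffic that matters.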

irkerd

diff --git a/irkerd b/irkerd
index d6686ff62258fc8697096c07c0e3c4e4342e0c12..9b1a63cd4ddbdcc6e0038e6d93f486448b6bd326 100755
--- a/irkerd
+++ b/irkerd
@@ -484,9 +484,14 @@ class LineProtocol(asyncio.Protocol):
             else:
                 self.buffer = []
             for line in lines:
-                line = str(line, self.encoding).strip()
-                LOG.debug('{}: line received: {!r}'.format(self, line))
-                self.line_received(line=line)
+                try:
+                    line = str(line, self.encoding).strip()
+                except UnicodeDecodeError as e:
+                    LOG.warning('{}: invalid encoding in {!r} ({})'.format(
+                        self, line, e))
+                else:
+                    LOG.debug('{}: line received: {!r}'.format(self, line))
+                    self.line_received(line=line)
 
     def datagram_received(self, data, addr):
         "Decode the raw bytes and pass the line to line_received"