From: W. Trevor King <wking@tremily.us>
Date: Fri, 30 May 2014 23:46:54 +0000 (-0700)
Subject: irkerd: Handle UnicodeDecodeError in LineProtocol.data_received
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=2d1afefb1d1e773f48fbcfa142340036b9c0dec2;p=irker.git

irkerd: Handle UnicodeDecodeError in LineProtocol.data_received

I just got the following in a message-of-the-day from
leguin.freenode.net:

  Welcome to leguin.freenode.net in Ume\xe5, Sweden, EU!

Where Ume\xe5 is Ume{U+00E5 LATIN SMALL LETTER A WITH RING ABOVE}.
\xe5 is the ISO-8859-1 encoding.  Since important messages from the
IRC server should be in ASCII [1]:

  Regardless of being an 8-bit protocol, the delimiters and keywords
  are such that protocol is mostly usable from US-ASCII terminal and a
  telnet connection.

So rather than trying some fancy charset-detection heuristics, just
drop lines that don't decode properly.

[1]: http://tools.ietf.org/html/rfc2812#section-2.2
---

diff --git a/irkerd b/irkerd
index d6686ff..9b1a63c 100755
--- a/irkerd
+++ b/irkerd
@@ -484,9 +484,14 @@ class LineProtocol(asyncio.Protocol):
             else:
                 self.buffer = []
             for line in lines:
-                line = str(line, self.encoding).strip()
-                LOG.debug('{}: line received: {!r}'.format(self, line))
-                self.line_received(line=line)
+                try:
+                    line = str(line, self.encoding).strip()
+                except UnicodeDecodeError as e:
+                    LOG.warn('{}: invalid encoding in {!r} ({})'.format(
+                        self, line, e))
+                else:
+                    LOG.debug('{}: line received: {!r}'.format(self, line))
+                    self.line_received(line=line)
 
     def datagram_received(self, data, addr):
         "Decode the raw bytes and pass the line to line_received"