From: W. Trevor King
Date: Fri, 30 May 2014 23:46:54 +0000 (-0700)
Subject: irkerd: Handle UnicodeDecodeError in LineProtocol.data_received
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=2d1afefb1d1e773f48fbcfa142340036b9c0dec2;p=irker.git

irkerd: Handle UnicodeDecodeError in LineProtocol.data_received

I just got the following in a message-of-the-day from
leguin.freenode.net:

  Welcome to leguin.freenode.net in Ume\xe5, Sweden, EU!

Where Ume\xe5 is Ume{U+00E5 LATIN SMALL LETTER A WITH RING ABOVE}, and
\xe5 is that character's ISO-8859-1 encoding.  Important messages from
the IRC server should be in ASCII [1]:

  Regardless of being an 8-bit protocol, the delimiters and keywords
  are such that protocol is mostly usable from US-ASCII terminal and a
  telnet connection.

So rather than trying some fancy charset-detection heuristics, just
drop lines that don't decode properly.

[1]: http://tools.ietf.org/html/rfc2812#section-2.2
---

diff --git a/irkerd b/irkerd
index d6686ff..9b1a63c 100755
--- a/irkerd
+++ b/irkerd
@@ -484,9 +484,14 @@ class LineProtocol(asyncio.Protocol):
         else:
             self.buffer = []
         for line in lines:
-            line = str(line, self.encoding).strip()
-            LOG.debug('{}: line received: {!r}'.format(self, line))
-            self.line_received(line=line)
+            try:
+                line = str(line, self.encoding).strip()
+            except UnicodeDecodeError as e:
+                LOG.warn('{}: invalid encoding in {!r} ({})'.format(
+                    self, line, e))
+            else:
+                LOG.debug('{}: line received: {!r}'.format(self, line))
+                self.line_received(line=line)
 
     def datagram_received(self, data, addr):
         "Decode the raw bytes and pass the line to line_received"
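
The drop-on-decode-failure behaviour described in the commit message can be tried outside irkerd with a minimal sketch like the one below. The helper name `decode_lines`, the logger name, and the sample `PING` line are assumptions made for illustration and are not part of irkerd. The sketch also assumes a UTF-8 `encoding`, and calls `LOG.warning` because `Logger.warn` is a deprecated alias in current Python.

```python
import logging

# Hypothetical logger name, for illustration only.
LOG = logging.getLogger('decode-sketch')


def decode_lines(raw_lines, encoding='UTF-8'):
    """Decode each bytes line, dropping (and logging) undecodable ones."""
    decoded = []
    for raw in raw_lines:
        try:
            line = str(raw, encoding).strip()
        except UnicodeDecodeError as e:
            # Same policy as the patch: warn and skip, no charset guessing.
            LOG.warning('invalid encoding in %r (%s)', raw, e)
        else:
            decoded.append(line)
    return decoded


# The MOTD line from the commit message: \xe5 is ISO-8859-1, not valid UTF-8.
motd = b'Welcome to leguin.freenode.net in Ume\xe5, Sweden, EU!'
# A hypothetical pure-ASCII protocol line, which decodes cleanly.
ping = b'PING :leguin.freenode.net\r'

result = decode_lines([motd, ping])
```

Under these assumptions the MOTD line is logged and skipped, while the ASCII line passes through with its trailing whitespace stripped.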