I just got the following in a message-of-the-day from
leguin.freenode.net:
  Welcome to leguin.freenode.net in Ume\xe5, Sweden, EU!
Where Ume\xe5 is Ume{U+00E5 LATIN SMALL LETTER A WITH RING ABOVE}.
\xe5 is the ISO-8859-1 encoding of that character, and decoding it
with self.encoding raises a UnicodeDecodeError. Important messages
from the IRC server should be in ASCII anyway [1]:
  Regardless of being an 8-bit protocol, the delimiters and keywords
  are such that protocol is mostly usable from US-ASCII terminal and a
  telnet connection.
So rather than trying some fancy charset-detection heuristics, just
drop lines that don't decode properly.
[1]: http://tools.ietf.org/html/rfc2812#section-2.2
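
A minimal standalone sketch of the drop-on-failure approach described
above, outside the protocol class. The helper name decode_lines and the
'utf-8' default are assumptions for illustration; the real code uses
self.encoding, whose value is not shown here.

```python
import logging

LOG = logging.getLogger(__name__)


def decode_lines(raw_lines, encoding='utf-8'):
    """Decode each bytes line, logging and skipping any that fail.

    Hypothetical helper; 'utf-8' is an assumed stand-in for self.encoding.
    """
    decoded = []
    for raw in raw_lines:
        try:
            line = str(raw, encoding).strip()
        except UnicodeDecodeError as e:
            # No charset guessing: report the raw bytes and move on.
            LOG.warning('invalid encoding in %r (%s)', raw, e)
        else:
            decoded.append(line)
    return decoded


# The MOTD line carries a lone ISO-8859-1 byte (\xe5); the PING is ASCII.
motd = b'Welcome to leguin.freenode.net in Ume\xe5, Sweden, EU!\r\n'
ping = b'PING :leguin.freenode.net\r\n'

# Only the ASCII line survives when decoding as UTF-8.
print(decode_lines([motd, ping]))
```

The MOTD line is lost, but protocol-relevant ASCII lines such as PING
still reach the handler, which is the trade-off this commit accepts.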
         else:
             self.buffer = []
         for line in lines:
-            line = str(line, self.encoding).strip()
-            LOG.debug('{}: line received: {!r}'.format(self, line))
-            self.line_received(line=line)
+            try:
+                line = str(line, self.encoding).strip()
+            except UnicodeDecodeError as e:
+                LOG.warning('{}: invalid encoding in {!r} ({})'.format(
+                    self, line, e))
+            else:
+                LOG.debug('{}: line received: {!r}'.format(self, line))
+                self.line_received(line=line)
     def datagram_received(self, data, addr):
         "Decode the raw bytes and pass the line to line_received"