From: Daniel Kahn Gillmor
To: David Bremner, notmuch@notmuchmail.org
Subject: Re: encoding of message-ids
In-Reply-To: <87si0svnim.fsf@zancas.localnet>
References: <87si0svnim.fsf@zancas.localnet>
Date: Tue, 16 Feb 2016 14:02:02 -0500
Message-ID: <87ziv0iimt.fsf@alice.fifthhorseman.net>
List-Id: "Use and development of the notmuch mail system."

On Tue 2016-02-16 07:38:09 -0500, David Bremner wrote:
> I spent a little time this morning staring at the code, and it seems
> that all of the message-ids are parsed via g_mime_decode_text, which
> deals with RFC2047 encodings and makes guesses at decoding 8bit
> characters. In practice this means that in the notmuch database all
> headers are UTF-8. Since message-id's are supposed to be printable ascii
> [at least in rfc5322], this seems like not such a terrible decision, but
> I wonder if we should document this potential conversion somewhere?

i think you mean g_mime_utils_header_decode_text, not
g_mime_decode_text, right?

What do you think are the potential risks here?

 * if all incoming message-ids are standards-compliant (lower-case
   ascii, with an @ sign in the middle and surrounded by
   angle-brackets [0]), then they cannot be interpreted as RFC 2047
   text because they do not have the leading =? or the trailing ?=,
   so gmime shouldn't translate them.

 * if some incoming message-ids are not standards-compliant, then it's
   possible that they will be transformed into other,
   non-standards-compliant message-IDs. Some of them might even be
   transformed into standards-compliant message-IDs. for example,
   '=?UTF-8?q??=' will be transformed into ''.

the main risk, i suppose, is that someone could craft a message whose
literal Message-ID differs from that of an existing message but decodes
to the same value, and could thereby trigger an otherwise undetectable
message-ID collision. This seems not much worse than the existing
(detectable) message-ID collision problems notmuch already has.

That said, RFC 2047 suggests [1] that its encodings are only relevant in
places where a "text" token would be used. Message-ID (and References
and In-Reply-To) are intended to only contain dot-atom-text tokens. So
probably it would be more correct to avoid applying this decoding to
these specific fields.
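
A rough sketch of the two points above, using Python's standard
email.header module rather than gmime (so the decoding details are an
assumption and may not match g_mime_utils_header_decode_text exactly).
The address foo@example.com and the helper names decode_as_text and
header_value are made up for illustration; they are not notmuch or
gmime code.

```python
from email.header import decode_header

# Headers that should only carry message-id tokens, never RFC 2047 "text".
ID_HEADERS = {"message-id", "references", "in-reply-to"}


def decode_as_text(raw: str) -> str:
    """Decode RFC 2047 encoded-words, as one would for a "text" header."""
    parts = []
    for chunk, charset in decode_header(raw):
        # decode_header yields bytes for decoded words, str for plain input.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


def header_value(name: str, raw: str) -> str:
    """Hypothetical policy: leave message-id style headers undecoded."""
    if name.lower() in ID_HEADERS:
        return raw.strip()
    return decode_as_text(raw)


compliant = "<foo@example.com>"              # hypothetical, no =? ... ?=
crafted = "=?UTF-8?q?<foo@example.com>?="    # hypothetical, non-compliant

# Decoding everything as text makes two different literal ids collide.
print(decode_as_text(compliant) == decode_as_text(crafted))

# Skipping the decoding for id headers keeps them distinct.
print(header_value("Message-ID", compliant) == header_value("Message-ID", crafted))
```

With the decode-everything approach the two literal ids compare equal
after decoding; with the per-header policy they stay distinct, while
other headers such as Subject are still decoded.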

i dunno that it's a big deal though, given the analysis above.

       --dkg

[0] https://tools.ietf.org/html/rfc5322#section-3.6.4
[1] https://tools.ietf.org/html/rfc2047#section-5