From: Rob Browning
Date: Sat, 5 Sep 2015 19:13:00 +0000 (+1900)
Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=9a53f24c25d8d1c8bb8cf3832cce82014ca12af3;p=notmuch-archives.git

Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
---

diff --git a/0e/08a4d739c480eb8e938f5e5aab0f1902805988 b/0e/08a4d739c480eb8e938f5e5aab0f1902805988
new file mode 100644
index 000000000..38237e054
--- /dev/null
+++ b/0e/08a4d739c480eb8e938f5e5aab0f1902805988
@@ -0,0 +1,99 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by arlo.cworth.org (Postfix) with ESMTP id 9C42D6DE1AEA
+ for ; Sat, 5 Sep 2015 12:13:05 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at cworth.org
+X-Spam-Flag: NO
+X-Spam-Score: 0.393
+X-Spam-Level:
+X-Spam-Status: No, score=0.393 tagged_above=-999 required=5 tests=[AWL=0.199,
+ RP_MATCHES_RCVD=-0.55, URIBL_SBL=0.644, URIBL_SBL_A=0.1]
+ autolearn=disabled
+Received: from arlo.cworth.org ([127.0.0.1])
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id QAJhby6f67TD for ;
+ Sat, 5 Sep 2015 12:13:03 -0700 (PDT)
+Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])
+ by arlo.cworth.org (Postfix) with ESMTP id 607B76DE1AE9
+ for ; Sat, 5 Sep 2015 12:13:03 -0700 (PDT)
+Received: from trouble.defaultvalue.org (localhost [127.0.0.1])
+ (Authenticated sender: rlb@defaultvalue.org)
+ by defaultvalue.org (Postfix) with ESMTPSA id 2FB7621FD2;
+ Sat, 5 Sep 2015 14:13:01 -0500 (CDT)
+Received: by trouble.defaultvalue.org (Postfix, from userid 1000)
+ id A08B714E070; Sat, 5 Sep 2015 14:13:00 -0500 (CDT)
+From: Rob Browning
+To: David Bremner , notmuch@notmuchmail.org
+Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
+In-Reply-To: <87io7sw79j.fsf@trouble.defaultvalue.org>
+References: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>
+ <87fv2we26p.fsf@maritornes.cs.unb.ca>
+ <87io7sw79j.fsf@trouble.defaultvalue.org>
+User-Agent: Notmuch/0.20.1 (http://notmuchmail.org) Emacs/24.5.1
+ (x86_64-pc-linux-gnu)
+Date: Sat, 05 Sep 2015 14:13:00 -0500
+Message-ID: <877fo4wugz.fsf@trouble.defaultvalue.org>
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Transfer-Encoding: quoted-printable
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.18
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+
+List-Unsubscribe: ,
+
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+
+X-List-Received-Date: Sat, 05 Sep 2015 19:13:05 -0000
+
+Rob Browning writes:
+
+> David Bremner writes:
+
+>> It seems plausible to specify UTF-8 input for the library, but what
+>> about the CLI? It seems like the canonicalization operation increases
+>> the chance of mangling user input in non-UTF-8 locales.
+>
+> Yes, the key question: what does notmuch intend? i.e. given a sequence
+> of bytes, how will notmuch interpret them? I think we should decide
+> that, and document it clearly somewhere.
+>
+> The commit message describes my understanding of how things currently
+> work, and if/when I get time, I'd like to propose some related
+> documentation updates (perhaps to notmuch-search-terms or
+> notmuch-insert/new?).
+>
+> Oh, and if I do understand things correctly, notmuch may already stand a
+> chance of mangling any bytes that aren't an invalid UTF-8 byte sequence,
+> but also aren't actually in UTF-8 (excepting encodings that are a strict
+> subset of UTF-8, like ASCII).
+>
+> For example (if I did this right), [0xd1 0xa1] is valid UTF-8, producing
+> omega "=D1=A1", and also valid Latin-1, producing "=C3=91=C2=A1".
+
+So on this particular point, I'm perhaps too used to thinking about the
+general encoding problem, and wasn't thinking about our specific
+constraints.
+
+If (1) "normal" message bodies are required to be US-ASCII (which I'd
+neglected to remember might be the case), and (2) MIME handles the rest,
+then perhaps notmuch will only receive raw bytes via user input
+(i.e. query strings, etc.).
+
+In which case, we could just document that notmuch interprets user input
+as UTF-8 (and we might or might not mention the Latin-1 fallback).
+
+Later locale support could be added if desired, and none of this would
+involve the quite nasty problem of encoding detection.
+
+--=20
+Rob Browning
+rlb @defaultvalue.org and @debian.org
+GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
+GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4