From: Rob Browning <rlb@defaultvalue.org>
Date: Sat, 5 Sep 2015 19:13:00 +0000 (+1900)
Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=9a53f24c25d8d1c8bb8cf3832cce82014ca12af3;p=notmuch-archives.git

Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
---
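
The byte-level example in the quoted message below can be checked
directly. What follows is a minimal Python sketch, not notmuch code:
decode_user_input is a hypothetical helper that illustrates one possible
"UTF-8 with Latin-1 fallback" policy, and it is only an assumption about
how such a fallback might behave.

```python
import quopri

# The two bytes from the example in the message: 0xd1 0xa1.
RAW = bytes([0xD1, 0xA1])


def reencode_as_utf8(data: bytes, assumed_charset: str) -> bytes:
    """Decode data under the assumed charset, then re-encode it as UTF-8."""
    return data.decode(assumed_charset).encode("utf-8")


def decode_user_input(data: bytes) -> str:
    """Hypothetical policy: try UTF-8 first, fall back to Latin-1.

    Latin-1 maps every byte to a code point, so the fallback never fails.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")


if __name__ == "__main__":
    # Read as UTF-8, the bytes are one character: U+0461, Cyrillic omega.
    print(ascii(RAW.decode("utf-8")))
    # Read as Latin-1, they are two characters: U+00D1 and U+00A1.
    print(ascii(RAW.decode("latin-1")))
    # Quoted-printable forms, as they appear in the message body.
    print(quopri.encodestring(reencode_as_utf8(RAW, "utf-8")))
    print(quopri.encodestring(reencode_as_utf8(RAW, "latin-1")))
    # The bytes are valid UTF-8, so the fallback is never consulted ...
    print(ascii(decode_user_input(RAW)))
    # ... whereas a lone 0xd1 is invalid UTF-8 and is read as Latin-1.
    print(ascii(decode_user_input(b"\xd1")))
```

A validity check alone cannot tell the two readings of RAW apart, which
is why a documented convention for user input matters more than any
attempt at detection.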

diff --git a/0e/08a4d739c480eb8e938f5e5aab0f1902805988 b/0e/08a4d739c480eb8e938f5e5aab0f1902805988
new file mode 100644
index 000000000..38237e054
--- /dev/null
+++ b/0e/08a4d739c480eb8e938f5e5aab0f1902805988
@@ -0,0 +1,99 @@
+Return-Path: <rlb@defaultvalue.org>
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by arlo.cworth.org (Postfix) with ESMTP id 9C42D6DE1AEA
+ for <notmuch@notmuchmail.org>; Sat,  5 Sep 2015 12:13:05 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at cworth.org
+X-Spam-Flag: NO
+X-Spam-Score: 0.393
+X-Spam-Level: 
+X-Spam-Status: No, score=0.393 tagged_above=-999 required=5 tests=[AWL=0.199, 
+ RP_MATCHES_RCVD=-0.55, URIBL_SBL=0.644, URIBL_SBL_A=0.1]
+ autolearn=disabled
+Received: from arlo.cworth.org ([127.0.0.1])
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id QAJhby6f67TD for <notmuch@notmuchmail.org>;
+ Sat,  5 Sep 2015 12:13:03 -0700 (PDT)
+Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])
+ by arlo.cworth.org (Postfix) with ESMTP id 607B76DE1AE9
+ for <notmuch@notmuchmail.org>; Sat,  5 Sep 2015 12:13:03 -0700 (PDT)
+Received: from trouble.defaultvalue.org (localhost [127.0.0.1])
+ (Authenticated sender: rlb@defaultvalue.org)
+ by defaultvalue.org (Postfix) with ESMTPSA id 2FB7621FD2;
+ Sat,  5 Sep 2015 14:13:01 -0500 (CDT)
+Received: by trouble.defaultvalue.org (Postfix, from userid 1000)
+ id A08B714E070; Sat,  5 Sep 2015 14:13:00 -0500 (CDT)
+From: Rob Browning <rlb@defaultvalue.org>
+To: David Bremner <david@tethera.net>, notmuch@notmuchmail.org
+Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
+In-Reply-To: <87io7sw79j.fsf@trouble.defaultvalue.org>
+References: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>
+ <87fv2we26p.fsf@maritornes.cs.unb.ca>
+ <87io7sw79j.fsf@trouble.defaultvalue.org>
+User-Agent: Notmuch/0.20.1 (http://notmuchmail.org) Emacs/24.5.1
+ (x86_64-pc-linux-gnu)
+Date: Sat, 05 Sep 2015 14:13:00 -0500
+Message-ID: <877fo4wugz.fsf@trouble.defaultvalue.org>
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Transfer-Encoding: quoted-printable
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.18
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+ <notmuch.notmuchmail.org>
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
+List-Archive: <http://notmuchmail.org/pipermail/notmuch/>
+List-Post: <mailto:notmuch@notmuchmail.org>
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
+X-List-Received-Date: Sat, 05 Sep 2015 19:13:05 -0000
+
+Rob Browning <rlb@defaultvalue.org> writes:
+
+> David Bremner <david@tethera.net> writes:
+
+>> It seems plausible to specify UTF-8 input for the library, but what
+>> about the CLI? It seems like the canonicalization operation increases
+>> the chance of mangling user input in non-UTF-8 locales.
+>
+> Yes, the key question: what does notmuch intend?  i.e. given a sequence
+> of bytes, how will notmuch interpret them?  I think we should decide
+> that, and document it clearly somewhere.
+>
+> The commit message describes my understanding of how things currently
+> work, and if/when I get time, I'd like to propose some related
+> documentation updates (perhaps to notmuch-search-terms or
+> notmuch-insert/new?).
+>
+> Oh, and if I do understand things correctly, notmuch may already stand a
+> chance of mangling any bytes that happen to form a valid UTF-8 sequence
+> but aren't actually UTF-8 (excepting encodings that are a strict
+> subset of UTF-8, like ASCII).
+>
+> For example (if I did this right), [0xd1 0xa1] is valid UTF-8, producing
+> Cyrillic omega "=D1=A1", and also valid Latin-1, producing "=C3=91=C2=A1".
+
+So on this particular point, I'm perhaps too used to thinking about the
+general encoding problem, and wasn't thinking about our specific
+constraints.
+
+If (1) "normal" message bodies are required to be US-ASCII (which I'd
+forgotten might be the case), and (2) MIME handles the rest,
+then perhaps notmuch will only receive raw bytes via user input
+(e.g. query strings).
+
+In which case, we could just document that notmuch interprets user input
+as UTF-8 (and we might or might not mention the Latin-1 fallback).
+
+Later locale support could be added if desired, and none of this would
+involve the quite nasty problem of encoding detection.
+
+--=20
+Rob Browning
+rlb @defaultvalue.org and @debian.org
+GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
+GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4