From 351f632fbc2ee07664c5fcff3ad45453d174df4d Mon Sep 17 00:00:00 2001
From: David Bremner <david@tethera.net>
Date: Thu, 3 Sep 2015 22:12:14 +2100
Subject: [PATCH] Re: [PATCH 1/1] Store and search for canonical Unicode text
 [WIP]

---
 ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7 | 102 ++++++++++++++++++++++
 1 file changed, 102 insertions(+)
 create mode 100644 ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7

diff --git a/ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7 b/ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7
new file mode 100644
index 000000000..a7f6b645b
--- /dev/null
+++ b/ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7
@@ -0,0 +1,102 @@
+Return-Path: <david@tethera.net>
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by arlo.cworth.org (Postfix) with ESMTP id 1E9AD6DE1B56
+ for <notmuch@notmuchmail.org>; Wed,  2 Sep 2015 18:13:25 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at cworth.org
+X-Spam-Flag: NO
+X-Spam-Score: 0.114
+X-Spam-Level: 
+X-Spam-Status: No, score=0.114 tagged_above=-999 required=5 tests=[AWL=0.114]
+ autolearn=disabled
+Received: from arlo.cworth.org ([127.0.0.1])
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id hgFK-dBBJ5Bc for <notmuch@notmuchmail.org>;
+ Wed,  2 Sep 2015 18:13:23 -0700 (PDT)
+Received: from gitolite.debian.net (gitolite.debian.net [87.98.215.224])
+ by arlo.cworth.org (Postfix) with ESMTPS id D37D96DE1B51
+ for <notmuch@notmuchmail.org>; Wed,  2 Sep 2015 18:13:22 -0700 (PDT)
+Received: from remotemail by gitolite.debian.net with local (Exim 4.80)
+ (envelope-from <david@tethera.net>)
+ id 1ZXJ4s-0004iZ-U2; Thu, 03 Sep 2015 01:12:34 +0000
+Received: (nullmailer pid 17442 invoked by uid 1000); Thu, 03 Sep 2015
+ 01:12:14 -0000
+From: David Bremner <david@tethera.net>
+To: Rob Browning <rlb@defaultvalue.org>, notmuch@notmuchmail.org
+Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
+In-Reply-To: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>
+References: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>
+User-Agent: Notmuch/0.20.2+60~gcb08a2e (http://notmuchmail.org) Emacs/24.5.1
+ (x86_64-pc-linux-gnu)
+Date: Wed, 02 Sep 2015 22:12:14 -0300
+Message-ID: <87fv2we26p.fsf@maritornes.cs.unb.ca>
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Transfer-Encoding: quoted-printable
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.18
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+ <notmuch.notmuchmail.org>
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
+List-Archive: <http://notmuchmail.org/pipermail/notmuch/>
+List-Post: <mailto:notmuch@notmuchmail.org>
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
+X-List-Received-Date: Thu, 03 Sep 2015 01:13:25 -0000
+
+Rob Browning <rlb@defaultvalue.org> writes:
+
+>
+> Before this change, notmuch would index two strings that differ only
+> with respect to canonicalization, like to=CC=81ken and t=C3=B3ken, as sep=
+arate
+> terms, even though they may be visually indistinguishable, and do (for
+> most purposes) represent the same text.  After indexing, searching for
+> one would not find the other, and which one you present to notmuch
+> when you search depends on your tools.  See test/T570-normalization.sh
+> for a working example.
+
+One way to break this up into more bite-sized pieces would be to first
+create one or more tests that fail with current notmuch, and mark those
+as broken.
+
+> Up to now, notmuch has let Xapian handle converting the incoming bytes
+> to UTF-8.  Xapian treats any byte sequence as UTF-8, and interprets
+> any invalid UTF-8 bytes as Latin-1.  This patch maintains the existing
+> behavior (excepting the new canonicalization) by using Xapian's
+> Utf8Iterator to handle the initial Unicode character parsing.
+
+Can you explain why notmuch is the right place to do this, and not
+Xapian? I know we talked back and forth about this, but I never really
+got a solid sense of what the conclusion was. Is it just dependencies?
+
+> And because when the input is already UTF-8, it just blindly converts
+> from UTF-8 to Unicode code points, and then back to UTF-8 (after
+> canonicalization), during each pass.  There are certainly
+> opportunities to optimize, though it may be worth discussing the
+> detection of data encodings more broadly first.
+
+It seems plausible to specify UTF-8 input for the library, but what
+about the CLI? It seems like the canonicalization operation increases
+the chance of mangling user input in non-UTF-8 locales.
+
+> FIXME: what about existing indexed text?
+
+I suppose some upgrade code to canonicalize all the terms? That sounds
+pretty slow.
+
+> ---
+>
+>  Posted for preliminary discussion, and as a milestone (it appears to
+>  mostly work now).  Though I doubt I'm handling things correctly
+>  everywhere notmuch-wise, wrt talloc, etc.
+
+I really didn't look at the code very closely, but there were a
+surprising number of calls to talloc_free. But those kinds of details can
+wait.
+
+
-- 
2.26.2