From: Rob Browning <rlb@defaultvalue.org>
Date: Thu, 3 Sep 2015 02:45:12 +0000 (+1900)
Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=79abbb0d593649f217152aabbc70e56316c3feff;p=notmuch-archives.git

Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
---

diff --git a/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e b/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e
new file mode 100644
index 000000000..24d8d89ca
--- /dev/null
+++ b/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e
@@ -0,0 +1,135 @@
+Return-Path: <rlb@defaultvalue.org>
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by arlo.cworth.org (Postfix) with ESMTP id AE53D6DE1B59
+ for <notmuch@notmuchmail.org>; Wed,  2 Sep 2015 19:45:17 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at cworth.org
+X-Spam-Flag: NO
+X-Spam-Score: 0.38
+X-Spam-Level: 
+X-Spam-Status: No, score=0.38 tagged_above=-999 required=5 tests=[AWL=0.186,
+ RP_MATCHES_RCVD=-0.55, URIBL_SBL=0.644, URIBL_SBL_A=0.1]
+ autolearn=disabled
+Received: from arlo.cworth.org ([127.0.0.1])
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id iv21yAC3oSsb for <notmuch@notmuchmail.org>;
+ Wed,  2 Sep 2015 19:45:14 -0700 (PDT)
+Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])
+ by arlo.cworth.org (Postfix) with ESMTP id 7C3AF6DE1B58
+ for <notmuch@notmuchmail.org>; Wed,  2 Sep 2015 19:45:14 -0700 (PDT)
+Received: from trouble.defaultvalue.org (localhost [127.0.0.1])
+ (Authenticated sender: rlb@defaultvalue.org)
+ by defaultvalue.org (Postfix) with ESMTPSA id 1127B2009F;
+ Wed,  2 Sep 2015 21:45:13 -0500 (CDT)
+Received: by trouble.defaultvalue.org (Postfix, from userid 1000)
+ id 84AC514E0F9; Wed,  2 Sep 2015 21:45:12 -0500 (CDT)
+From: Rob Browning <rlb@defaultvalue.org>
+To: David Bremner <david@tethera.net>, notmuch@notmuchmail.org
+Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
+In-Reply-To: <87fv2we26p.fsf@maritornes.cs.unb.ca>
+References: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>
+ <87fv2we26p.fsf@maritornes.cs.unb.ca>
+User-Agent: Notmuch/0.20.1 (http://notmuchmail.org) Emacs/24.5.1
+ (x86_64-pc-linux-gnu)
+Date: Wed, 02 Sep 2015 21:45:12 -0500
+Message-ID: <87io7sw79j.fsf@trouble.defaultvalue.org>
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Transfer-Encoding: quoted-printable
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.18
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+ <notmuch.notmuchmail.org>
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>
+List-Archive: <http://notmuchmail.org/pipermail/notmuch/>
+List-Post: <mailto:notmuch@notmuchmail.org>
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>
+X-List-Received-Date: Thu, 03 Sep 2015 02:45:17 -0000
+
+David Bremner <david@tethera.net> writes:
+
+> One way to break this up into more bite sized pieces would be to first
+> create one or more tests that fail with current notmuch, and mark those
+> as broken.
+
+Right - for the moment I just wanted to post what I had for
+consideration.  I didn't want to spend too much more time on the
+approach if it was uninteresting/inappropriate.
+
+One simple place to start might be the included T570-normalization.sh.
+Though perhaps that should be "canonicalization"?
+
+> Can you explain why notmuch is the right place to do this, and not
+> Xapian? I know we talked back and forth about this, but I never really
+> got a solid sense of what the conclusion was. Is it just dependencies?
+
+I have no strong opinion there, but to do the work in Xapian will
+require a new release at a minimum, and likely new dependencies.
+
+And generally speaking, I suppose I have a suspicion that application
+needs with respect to encoding "detection", tokenization, stemming, stop
+words, synonyms, phrase detection, etc. may be domain specific and
+complex enough that Xapian won't want to try to accommodate the broad
+array of possibilities, at least not in its core library.
+
+Though it might try to handle some or all of that by providing suitable
+customizability (presumably via callbacks or subclassing or...).  And
+since I'm new to Xapian, I'm not completely sure what's already
+available.
+
+> It seems plausible to specify UTF-8 input for the library, but what
+> about the CLI? It seems like the canonicalization operation increases
+> the chance of mangling user input in non-UTF-8 locales.
+
+Yes, the key question: what does notmuch intend?  i.e. given a sequence
+of bytes, how will notmuch interpret them?  I think we should decide
+that, and document it clearly somewhere.
+
+The commit message describes my understanding of how things currently
+work, and if/when I get time, I'd like to propose some related
+documentation updates (perhaps to notmuch-search-terms or
+notmuch-insert/new?).
+
+Oh, and if I do understand things correctly, notmuch may already stand a
+chance of mangling any bytes that form a valid UTF-8 byte sequence but
+aren't actually encoded as UTF-8 (excepting encodings that are a strict
+subset of UTF-8, like ASCII).
+
+For example (if I did this right), [0xd1 0xa1] is valid UTF-8, producing
+Cyrillic omega "=D1=A1", and also valid Latin-1, producing "=C3=91=C2=A1".
+
+> I suppose some upgrade code to canonicalize all the terms? That sounds
+> pretty slow.
+
+Perhaps, or I suppose you could just document that older indexed data
+might not be canonicalized, and that you should reindex if that matters
+to you.  Although I suppose anyone with affected characters might well
+want to reindex if the canonical form isn't the one people normally
+receive (which seemed possible).
+
+Hmm, another question -- for terms, does notmuch store ordinal
+positions, Unicode character offsets, input byte offsets, or...?
+Canonicalization will of course change the latter.
+
+I imagine it might be possible to traverse the index terms and just
+detect and merge those affected, but no idea if that would be
+reasonable.
+
+> I really didn't look at the code very closely, but there were a
+> surprising number of calls to talloc_free. But those kind of details can
+> wait.
+
+Right, I wasn't sure what the policies were, so in most cases, I just
+tried to release the data when it was no longer needed.
+
+Thanks
+--=20
+Rob Browning
+rlb @defaultvalue.org and @debian.org
+GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
+GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4