Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]

author Rob Browning <rlb@defaultvalue.org>

Thu, 3 Sep 2015 02:45:12 +0000 (21:45 +1900)

committer W. Trevor King <wking@tremily.us>

Sat, 20 Aug 2016 21:49:30 +0000 (14:49 -0700)
diff --git a/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e b/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e

new file mode 100644 (file)

index 0000000..24d8d89
--- /dev/null
+++ b/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e
@@ -0,0 +1,135 @@
+Return-Path: <rlb@defaultvalue.org>\r
+X-Original-To: notmuch@notmuchmail.org\r
+Delivered-To: notmuch@notmuchmail.org\r
+Received: from localhost (localhost [127.0.0.1])\r
+ by arlo.cworth.org (Postfix) with ESMTP id AE53D6DE1B59\r
+ for <notmuch@notmuchmail.org>; Wed,  2 Sep 2015 19:45:17 -0700 (PDT)\r
+X-Virus-Scanned: Debian amavisd-new at cworth.org\r
+X-Spam-Flag: NO\r
+X-Spam-Score: 0.38\r
+X-Spam-Level: \r
+X-Spam-Status: No, score=0.38 tagged_above=-999 required=5 tests=[AWL=0.186,\r
+ RP_MATCHES_RCVD=-0.55, URIBL_SBL=0.644, URIBL_SBL_A=0.1]\r
+ autolearn=disabled\r
+Received: from arlo.cworth.org ([127.0.0.1])\r
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)\r
+ with ESMTP id iv21yAC3oSsb for <notmuch@notmuchmail.org>;\r
+ Wed,  2 Sep 2015 19:45:14 -0700 (PDT)\r
+Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])\r
+ by arlo.cworth.org (Postfix) with ESMTP id 7C3AF6DE1B58\r
+ for <notmuch@notmuchmail.org>; Wed,  2 Sep 2015 19:45:14 -0700 (PDT)\r
+Received: from trouble.defaultvalue.org (localhost [127.0.0.1])\r
+ (Authenticated sender: rlb@defaultvalue.org)\r
+ by defaultvalue.org (Postfix) with ESMTPSA id 1127B2009F;\r
+ Wed,  2 Sep 2015 21:45:13 -0500 (CDT)\r
+Received: by trouble.defaultvalue.org (Postfix, from userid 1000)\r
+ id 84AC514E0F9; Wed,  2 Sep 2015 21:45:12 -0500 (CDT)\r
+From: Rob Browning <rlb@defaultvalue.org>\r
+To: David Bremner <david@tethera.net>, notmuch@notmuchmail.org\r
+Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]\r
+In-Reply-To: <87fv2we26p.fsf@maritornes.cs.unb.ca>\r
+References: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org>\r
+ <87fv2we26p.fsf@maritornes.cs.unb.ca>\r
+User-Agent: Notmuch/0.20.1 (http://notmuchmail.org) Emacs/24.5.1\r
+ (x86_64-pc-linux-gnu)\r
+Date: Wed, 02 Sep 2015 21:45:12 -0500\r
+Message-ID: <87io7sw79j.fsf@trouble.defaultvalue.org>\r
+MIME-Version: 1.0\r
+Content-Type: text/plain; charset=utf-8\r
+Content-Transfer-Encoding: quoted-printable\r
+X-BeenThere: notmuch@notmuchmail.org\r
+X-Mailman-Version: 2.1.18\r
+Precedence: list\r
+List-Id: "Use and development of the notmuch mail system."\r
+ <notmuch.notmuchmail.org>\r
+List-Unsubscribe: <http://notmuchmail.org/mailman/options/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=unsubscribe>\r
+List-Archive: <http://notmuchmail.org/pipermail/notmuch/>\r
+List-Post: <mailto:notmuch@notmuchmail.org>\r
+List-Help: <mailto:notmuch-request@notmuchmail.org?subject=help>\r
+List-Subscribe: <http://notmuchmail.org/mailman/listinfo/notmuch>,\r
+ <mailto:notmuch-request@notmuchmail.org?subject=subscribe>\r
+X-List-Received-Date: Thu, 03 Sep 2015 02:45:17 -0000\r
+\r
+David Bremner <david@tethera.net> writes:\r
+\r
+> One way to break this up into more bite sized pieces would be to first\r
+> create one or more tests that fail with current notmuch, and mark those\r
+> as broken.\r
+\r
+Right - for the moment I just wanted to post what I had for\r
+consideration.  I didn't want to spend too much more time on the\r
+approach if it was uninteresting/inappropriate.\r
+\r
+One simple place to start might be the included T570-normalization.sh.\r
+Though perhaps that should be "canonicalization"?\r
+\r
+> Can you explain why notmuch is the right place to do this, and not\r
+> Xapian? I know we talked back and forth about this, but I never really\r
+> got a solid sense of what the conclusion was. Is it just dependencies?\r
+\r
+I have no strong opinion there, but to do the work in Xapian will\r
+require a new release at a minimum, and likely new dependencies.\r
+\r
+And generally speaking, I suppose I have a suspicion that application\r
+needs with respect to encoding "detection", tokenization, stemming, stop\r
+words, synonyms, phrase detection, etc. may be domain specific and\r
+complex enough that Xapian won't want to try to accommodate the broad\r
+array of possibilities, at least not in its core library.\r
+\r
+Though it might try to handle some or all of that by providing suitable\r
+customizability (presumably via callbacks or subclassing or...).  And\r
+since I'm new to Xapian, I'm not completely sure what's already\r
+available.\r
+\r
+> It seems plausible to specify UTF-8 input for the library, but what\r
+> about the CLI? It seems like the canonicalization operation increases\r
+> the chance of mangling user input in non-UTF-8 locales.\r
+\r
+Yes, the key question: what does notmuch intend?  i.e. given a sequence\r
+of bytes, how will notmuch interpret them?  I think we should decide\r
+that, and document it clearly somewhere.\r
+\r
+The commit message describes my understanding of how things currently\r
+work, and if/when I get time, I'd like to propose some related\r
+documentation updates (perhaps to notmuch-search-terms or\r
+notmuch-insert/new?).\r
+\r
+Oh, and if I do understand things correctly, notmuch may already stand a\r
+chance of mangling any bytes that form a valid UTF-8 byte sequence but\r
+aren't actually UTF-8 text (excepting encodings that are a strict\r
+subset of UTF-8, like ASCII).\r
+\r
+For example (if I did this right), [0xd1 0xa1] is valid UTF-8, producing\r
+Cyrillic omega "=D1=A1", and also valid Latin-1, producing "=C3=91=C2=A1".\r
+\r
+> I suppose some upgrade code to canonicalize all the terms? That sounds\r
+> pretty slow.\r
+\r
+Perhaps, or I suppose you could just document that older indexed data\r
+might not be canonicalized, and that you should reindex if that matters\r
+to you.  Although I suppose anyone with affected characters might well\r
+want to reindex if the canonical form isn't the one people normally\r
+receive (which seemed possible).\r
+\r
+Hmm, another question -- for terms, does notmuch store ordinal\r
+positions, Unicode character offsets, input byte offsets, or...?\r
+Canonicalization will of course change the latter.\r
+\r
+I imagine it might be possible to traverse the index terms and just\r
+detect and merge those affected, but no idea if that would be\r
+reasonable.\r
+\r
+> I really didn't look at the code very closely, but there were a\r
+> surprising number of calls to talloc_free. But those kind of details can\r
+> wait.\r
+\r
+Right, I wasn't sure what the policies were, so in most cases, I just\r
+tried to release the data when it was no longer needed.\r
+\r
+Thanks\r
+--=20\r
+Rob Browning\r
+rlb @defaultvalue.org and @debian.org\r
+GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A\r
+GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4\r
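
The [0xd1 0xa1] example in the message above can be checked with a short Python sketch. This is only an illustration of the byte-level ambiguity, not anything taken from the patch or from notmuch itself; the helper name `describe` is made up for this example. The two quoted-printable strings are the ones that appear in the message body.

```python
import quopri
import unicodedata

# The two bytes discussed in the message.
raw = bytes([0xD1, 0xA1])

# The same bytes decode cleanly under both encodings, to different text.
as_utf8 = raw.decode("utf-8")      # one code point
as_latin1 = raw.decode("latin-1")  # two code points


def describe(text):
    """Hypothetical helper: list code points with their Unicode names."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(as_utf8))
# ['U+0461 CYRILLIC SMALL LETTER OMEGA']
print(describe(as_latin1))
# ['U+00D1 LATIN CAPITAL LETTER N WITH TILDE', 'U+00A1 INVERTED EXCLAMATION MARK']

# The quoted-printable strings in the message are the UTF-8 encodings
# of those two interpretations.
qp_utf8 = quopri.decodestring(b"=D1=A1")
qp_latin1 = quopri.decodestring(b"=C3=91=C2=A1")
print(qp_utf8 == as_utf8.encode("utf-8"))      # True
print(qp_latin1 == as_latin1.encode("utf-8"))  # True
```

Since both decodes succeed, the bytes alone cannot say which encoding was intended, which is why the message asks for the interpretation to be decided and documented.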