From: Rob Browning
Date: Thu, 3 Sep 2015 02:45:12 +0000 (+1900)
Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=79abbb0d593649f217152aabbc70e56316c3feff;p=notmuch-archives.git

Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
---

diff --git a/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e b/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e
new file mode 100644
index 000000000..24d8d89ca
--- /dev/null
+++ b/20/cc95a3b725c79eeca5b634cbbb7c912eb29e2e
@@ -0,0 +1,135 @@
+Return-Path:
+X-Original-To: notmuch@notmuchmail.org
+Delivered-To: notmuch@notmuchmail.org
+Received: from localhost (localhost [127.0.0.1])
+ by arlo.cworth.org (Postfix) with ESMTP id AE53D6DE1B59
+ for ; Wed, 2 Sep 2015 19:45:17 -0700 (PDT)
+X-Virus-Scanned: Debian amavisd-new at cworth.org
+X-Spam-Flag: NO
+X-Spam-Score: 0.38
+X-Spam-Level:
+X-Spam-Status: No, score=0.38 tagged_above=-999 required=5 tests=[AWL=0.186,
+ RP_MATCHES_RCVD=-0.55, URIBL_SBL=0.644, URIBL_SBL_A=0.1]
+ autolearn=disabled
+Received: from arlo.cworth.org ([127.0.0.1])
+ by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024)
+ with ESMTP id iv21yAC3oSsb for ;
+ Wed, 2 Sep 2015 19:45:14 -0700 (PDT)
+Received: from defaultvalue.org (defaultvalue.org [70.85.129.156])
+ by arlo.cworth.org (Postfix) with ESMTP id 7C3AF6DE1B58
+ for ; Wed, 2 Sep 2015 19:45:14 -0700 (PDT)
+Received: from trouble.defaultvalue.org (localhost [127.0.0.1])
+ (Authenticated sender: rlb@defaultvalue.org)
+ by defaultvalue.org (Postfix) with ESMTPSA id 1127B2009F;
+ Wed, 2 Sep 2015 21:45:13 -0500 (CDT)
+Received: by trouble.defaultvalue.org (Postfix, from userid 1000)
+ id 84AC514E0F9; Wed, 2 Sep 2015 21:45:12 -0500 (CDT)
+From: Rob Browning
+To: David Bremner , notmuch@notmuchmail.org
+Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP]
+In-Reply-To: <87fv2we26p.fsf@maritornes.cs.unb.ca>
+References:
<1440951676-17286-1-git-send-email-rlb@defaultvalue.org>
+ <87fv2we26p.fsf@maritornes.cs.unb.ca>
+User-Agent: Notmuch/0.20.1 (http://notmuchmail.org) Emacs/24.5.1
+ (x86_64-pc-linux-gnu)
+Date: Wed, 02 Sep 2015 21:45:12 -0500
+Message-ID: <87io7sw79j.fsf@trouble.defaultvalue.org>
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Transfer-Encoding: quoted-printable
+X-BeenThere: notmuch@notmuchmail.org
+X-Mailman-Version: 2.1.18
+Precedence: list
+List-Id: "Use and development of the notmuch mail system."
+
+List-Unsubscribe: ,
+
+List-Archive:
+List-Post:
+List-Help:
+List-Subscribe: ,
+
+X-List-Received-Date: Thu, 03 Sep 2015 02:45:17 -0000
+
+David Bremner writes:
+
+> One way to break this up into more bite sized pieces would be to first
+> create one or more tests that fail with current notmuch, and mark those
+> as broken.
+
+Right - for the moment I just wanted to post what I had for
+consideration. I didn't want to spend too much more time on the
+approach if it was uninteresting/inappropriate.
+
+One simple place to start might be the included T570-normalization.sh.
+Though perhaps that should be "canonicalization"?
+
+> Can you explain why notmuch is the right place to do this, and not
+> Xapian? I know we talked back and forth about this, but I never really
+> got a solid sense of what the conclusion was. Is it just dependencies?
+
+I have no strong opinion there, but to do the work in Xapian will
+require a new release at a minimum, and likely new dependencies.
+
+And generally speaking, I suppose I have a suspicion that application
+needs with respect to encoding "detection", tokenization, stemming, stop
+words, synonyms, phrase detection, etc. may be domain specific and
+complex enough that Xapian won't want to try to accommodate the broad
+array of possibilities, at least not in its core library.
+
+Though it might try to handle some or all of that by providing suitable
+customizability (presumably via callbacks or subclassing or...).
And
+since I'm new to Xapian, I'm not completely sure what's already
+available.
+
+> It seems plausible to specify UTF-8 input for the library, but what
+> about the CLI? It seems like the canonicalization operation increases
+> the chance of mangling user input in non-UTF-8 locales.
+
+Yes, the key question: what does notmuch intend? i.e. given a sequence
+of bytes, how will notmuch interpret them? I think we should decide
+that, and document it clearly somewhere.
+
+The commit message describes my understanding of how things currently
+work, and if/when I get time, I'd like to propose some related
+documentation updates (perhaps to notmuch-search-terms or
+notmuch-insert/new?).
+
+Oh, and if I do understand things correctly, notmuch may already stand a
+chance of mangling any bytes that aren't an invalid UTF-8 byte sequence,
+but also aren't actually in UTF-8 (excepting encodings that are a strict
+subset of UTF-8, like ASCII).
+
+For example (if I did this right), [0xd1 0xa1] is valid UTF-8, producing
+omega "=D1=A1", and also valid Latin-1, producing "=C3=91=C2=A1".
+
+> I suppose some upgrade code to canonicalize all the terms? That sounds
+> pretty slow.
+
+Perhaps, or I suppose you could just document that older indexed data
+might not be canonicalized, and that you should reindex if that matters
+to you. Although I suppose anyone with affected characters might well
+want to reindex if the canonical form isn't the one people normally
+receive (which seemed possible).
+
+Hmm, another question -- for terms, does notmuch store ordinal
+positions, Unicode character offsets, input byte offsets, or...?
+Canonicalization will of course change the latter.
+
+I imagine it might be possible to traverse the index terms and just
+detect and merge those affected, but no idea if that would be
+reasonable.
+
+> I really didn't look at the code very closely, but there were a
+> surprising number of calls to talloc_free. But those kind of details can
+> wait.
+
+Right, I wasn't sure what the policies were, so in most cases, I just
+tried to release the data when it was no longer needed.
+
+Thanks
+--=20
+Rob Browning
+rlb @defaultvalue.org and @debian.org
+GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
+GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4