From 351f632fbc2ee07664c5fcff3ad45453d174df4d Mon Sep 17 00:00:00 2001 From: David Bremner Date: Thu, 3 Sep 2015 22:12:14 +2100 Subject: [PATCH] Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP] --- ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7 | 102 ++++++++++++++++++++++ 1 file changed, 102 insertions(+) create mode 100644 ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7 diff --git a/ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7 b/ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7 new file mode 100644 index 000000000..a7f6b645b --- /dev/null +++ b/ca/43af5fe1a7b652d56f6aff17c168b1d504b3c7 @@ -0,0 +1,102 @@ +Return-Path: +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by arlo.cworth.org (Postfix) with ESMTP id 1E9AD6DE1B56 + for ; Wed, 2 Sep 2015 18:13:25 -0700 (PDT) +X-Virus-Scanned: Debian amavisd-new at cworth.org +X-Spam-Flag: NO +X-Spam-Score: 0.114 +X-Spam-Level: +X-Spam-Status: No, score=0.114 tagged_above=-999 required=5 tests=[AWL=0.114] + autolearn=disabled +Received: from arlo.cworth.org ([127.0.0.1]) + by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id hgFK-dBBJ5Bc for ; + Wed, 2 Sep 2015 18:13:23 -0700 (PDT) +Received: from gitolite.debian.net (gitolite.debian.net [87.98.215.224]) + by arlo.cworth.org (Postfix) with ESMTPS id D37D96DE1B51 + for ; Wed, 2 Sep 2015 18:13:22 -0700 (PDT) +Received: from remotemail by gitolite.debian.net with local (Exim 4.80) + (envelope-from ) + id 1ZXJ4s-0004iZ-U2; Thu, 03 Sep 2015 01:12:34 +0000 +Received: (nullmailer pid 17442 invoked by uid 1000); Thu, 03 Sep 2015 + 01:12:14 -0000 +From: David Bremner +To: Rob Browning , notmuch@notmuchmail.org +Subject: Re: [PATCH 1/1] Store and search for canonical Unicode text [WIP] +In-Reply-To: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org> +References: <1440951676-17286-1-git-send-email-rlb@defaultvalue.org> +User-Agent: Notmuch/0.20.2+60~gcb08a2e 
(http://notmuchmail.org) Emacs/24.5.1 + (x86_64-pc-linux-gnu) +Date: Wed, 02 Sep 2015 22:12:14 -0300 +Message-ID: <87fv2we26p.fsf@maritornes.cs.unb.ca> +MIME-Version: 1.0 +Content-Type: text/plain; charset=utf-8 +Content-Transfer-Encoding: quoted-printable +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.18 +Precedence: list +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Thu, 03 Sep 2015 01:13:25 -0000 + +Rob Browning writes: + +> +> Before this change, notmuch would index two strings that differ only +> with respect to canonicalization, like to=CC=81ken and t=C3=B3ken, as sep= +arate +> terms, even though they may be visually indistinguishable, and do (for +> most purposes) represent the same text. After indexing, searching for +> one would not find the other, and which one you present to notmuch +> when you search depends on your tools. See test/T570-normalization.sh +> for a working example. + +One way to break this up into more bite sized pieces would be to first +create one or more tests that fail with current notmuch, and mark those +as broken. + +> Up to now, notmuch has let Xapian handle converting the incoming bytes +> to UTF-8. Xapian treats any byte sequence as UTF-8, and interprets +> any invalid UTF-8 bytes as Latin-1. This patch maintains the existing +> behavior (excepting the new canonicalization) by using Xapian's +> Utf8Iterator to handle the initial Unicode character parsing. + +Can you explain why notmuch is the right place to do this, and not +Xapian? I know we talked back and forth about this, but I never really +got a solid sense of what the conclusion was. Is it just dependencies? + +> And because when the input is already UTF-8, it just blindly converts +> from UTF-8 to Unicode code points, and then back to UTF-8 (after +> canonicalization), during each pass. 
There are certainly +> opportunities to optimize, though it may be worth discussing the +> detection of data encodings more broadly first. + +It seems plausible to specify UTF-8 input for the library, but what +about the CLI? It seems like the canonicalization operation increases +the chance of mangling user input in non-UTF-8 locales. + +> FIXME: what about existing indexed text? + +I suppose some upgrade code to canonicalize all the terms? That sounds +pretty slow. + +> --- +> +> Posted for preliminary discussion, and as a milestone (it appears to +> mostly work now). Though I doubt I'm handling things correctly +> everywhere notmuch-wise, wrt talloc, etc. + +I really didn't look at the code very closely, but there were a +surprising number of calls to talloc_free. But those kinds of details can +wait. + + -- 2.26.2