From: Austin Clements Date: Mon, 6 Jun 2016 19:20:19 +0000 (+2000) Subject: Re: searching: '*analysis' vs 'reanalysis' X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=0aad881db63eaa81d2c776ff42967cb172853ead;p=notmuch-archives.git Re: searching: '*analysis' vs 'reanalysis' --- diff --git a/54/b15c0503b31624ab97d77ac7f2d9f32632717f b/54/b15c0503b31624ab97d77ac7f2d9f32632717f new file mode 100644 index 000000000..2d23b946c --- /dev/null +++ b/54/b15c0503b31624ab97d77ac7f2d9f32632717f @@ -0,0 +1,133 @@ +Return-Path: +X-Original-To: notmuch@notmuchmail.org +Delivered-To: notmuch@notmuchmail.org +Received: from localhost (localhost [127.0.0.1]) + by arlo.cworth.org (Postfix) with ESMTP id 6EB466DE01F7 + for ; Mon, 6 Jun 2016 13:09:26 -0700 (PDT) +X-Virus-Scanned: Debian amavisd-new at cworth.org +X-Spam-Flag: NO +X-Spam-Score: -0.823 +X-Spam-Level: +X-Spam-Status: No, score=-0.823 tagged_above=-999 required=5 + tests=[AWL=-0.813, HTML_MESSAGE=0.001, SPF_PASS=-0.001, + T_RP_MATCHES_RCVD=-0.01] autolearn=disabled +Received: from arlo.cworth.org ([127.0.0.1]) + by localhost (arlo.cworth.org [127.0.0.1]) (amavisd-new, port 10024) + with ESMTP id Ti-kiAZDuIZc for ; + Mon, 6 Jun 2016 13:09:17 -0700 (PDT) +X-Greylist: delayed 2913 seconds by postgrey-1.35 at arlo; + Mon, 06 Jun 2016 13:08:57 PDT +Received: from outgoing-tmp.csail.mit.edu (outgoing-tmp.csail.mit.edu + [128.30.2.206]) + by arlo.cworth.org (Postfix) with ESMTP id 063316DE0217 + for ; Mon, 6 Jun 2016 13:08:56 -0700 (PDT) +Received: from mail-yw0-f173.google.com ([209.85.161.173]) + by outgoing-tmp.csail.mit.edu with esmtpsa (TLS1.2:RSA_AES_128_CBC_SHA1:128) + (Exim 4.82) (envelope-from ) + id 1bA04T-0001og-80 + for notmuch@notmuchmail.org; Mon, 06 Jun 2016 15:20:21 -0400 +Received: by mail-yw0-f173.google.com with SMTP id c127so149455948ywb.1 + for ; Mon, 06 Jun 2016 12:20:21 -0700 (PDT) +X-Gm-Message-State: ALyK8tJzsZjhatT/stPriuVDauiazZsgnRt3IN/TeSV3HzCqPCyvNkrLFljgQTSX0n/Vnw05iAGS63zbY6Cfqg== 
+X-Received: by 10.129.45.196 with SMTP id t187mr13435296ywt.153.1465240820424; + Mon, 06 Jun 2016 12:20:20 -0700 (PDT) +MIME-Version: 1.0 +Received: by 10.37.200.7 with HTTP; Mon, 6 Jun 2016 12:20:19 -0700 (PDT) +In-Reply-To: <878tyins3j.fsf@tesseract.cs.unb.ca> +References: <1465196150-astroid-3-33kf2otxir-16915@strange> + <87lh2ijxor.fsf@tesseract.cs.unb.ca> + <1465217156-astroid-4-8l08w9cils-2318@strange> + <877fe2tiy8.fsf@uwaterloo.ca> <878tyins3j.fsf@tesseract.cs.unb.ca> +From: Austin Clements +Date: Mon, 6 Jun 2016 15:20:19 -0400 +X-Gmail-Original-Message-ID: + +Message-ID: + +Subject: Re: searching: '*analysis' vs 'reanalysis' +To: David Bremner +Cc: sfischme@uwaterloo.ca, Gaute Hope , + notmuch +Content-Type: multipart/alternative; boundary=001a1141df549cccb00534a0f6b3 +X-BeenThere: notmuch@notmuchmail.org +X-Mailman-Version: 2.1.20 +Precedence: list +List-Id: "Use and development of the notmuch mail system." + +List-Unsubscribe: , + +List-Archive: +List-Post: +List-Help: +List-Subscribe: , + +X-List-Received-Date: Mon, 06 Jun 2016 20:09:26 -0000 + +--001a1141df549cccb00534a0f6b3 +Content-Type: text/plain; charset=UTF-8 + +On Mon, Jun 6, 2016 at 1:29 PM, David Bremner wrote: + +> Sebastian Fischmeister writes: +> +> > +> > I ran into this problem before as well. Storage is cheap. Notmuch could +> > index all emails with reversed text to get around some of this +> > problem. It doesn't solve the problem of *analysis*, but it's still an +> > improvement. +> +> It would probably be more useful to have brute force regexp searches on +> headers. Austin did some experiments that sounded promising, where you +> basically postprocess the result of a xapian query with a regexp. OTOH, +> I don't know what kept him from proposing this for mainline. 
If it was +> just parser issues, those are probably more or less solved now, at least +> for people using xapian 1.3+ +> + +The experiment was specifically for regexp matching subject, but it should +work for any header we store a literal copy of in the database. The code is +here, though in its current form it builds on my custom query parser: +https://github.com/aclements/notmuch/commit/ce41b29aba4d9b84e2f1eb6ed8df67065196c960. +Based on my understanding of Xapian 1.3+ field processors, these days it +should be quite easy to hook the PostingSource in that commit into the +Xapian QueryParser. + +--001a1141df549cccb00534a0f6b3 +Content-Type: text/html; charset=UTF-8 +Content-Transfer-Encoding: quoted-printable + +
+ +--001a1141df549cccb00534a0f6b3--