Return-Path: X-Original-To: notmuch@notmuchmail.org Delivered-To: notmuch@notmuchmail.org Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id 0C1C2429E25 for ; Wed, 3 Aug 2011 13:53:02 -0700 (PDT) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: -0.699 X-Spam-Level: X-Spam-Status: No, score=-0.699 tagged_above=-999 required=5 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id VwNaNHfuk7kU for ; Wed, 3 Aug 2011 13:53:00 -0700 (PDT) Received: from mail-qw0-f45.google.com (mail-qw0-f45.google.com [209.85.216.45]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (No client certificate requested) by olra.theworths.org (Postfix) with ESMTPS id A83CD431FB6 for ; Wed, 3 Aug 2011 13:53:00 -0700 (PDT) Received: by qwj8 with SMTP id 8so1069659qwj.18 for ; Wed, 03 Aug 2011 13:47:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:sender:in-reply-to:references:date :x-google-sender-auth:message-id:subject:from:to:cc:content-type :content-transfer-encoding; bh=icYt7i0wlGUSUPIDwWuz0F/REmKhA7zdyNf6tOahzt8=; b=dltdYKoB4kF1rxq8j/tqHixhF713zIDR70A4+57B6dHPeG7DOGv3HwfnsMpczNlAqb Mo97IibxW4RB9US5dZ5YXCRqIAvFDF70IlZymyyZ/UqKksuPoHU8aB+0hiWwWp1yjvpV 4Vmt3qtZntcL3IaPDgPapjNnvbSYVxA0UQJko= MIME-Version: 1.0 Received: by 10.229.127.1 with SMTP id e1mr397546qcs.257.1312404452606; Wed, 03 Aug 2011 13:47:32 -0700 (PDT) Sender: amdragon@gmail.com Received: by 10.229.15.136 with HTTP; Wed, 3 Aug 2011 13:47:32 -0700 (PDT) In-Reply-To: <20110705214234.GA15360@mit.edu> References: <20110703171743.GL15901@mit.edu> <1309762318-4530-1-git-send-email-pieter@praet.org> <87sjqldgr7.fsf@praet.org> <87iprg7dm0.fsf@praet.org> <20110705214234.GA15360@mit.edu> Date: Wed, 3 Aug 2011 16:47:32 -0400 X-Google-Sender-Auth: lLBeE0pV7g_AdOaqM1DSiQSUG8Q Message-ID: Subject: Re: [PROTO] possible solution for "Race condition for '*' command" From: Austin Clements To: Pieter Praet , Carl Worth Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: Olly Betts , Notmuch Mail X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 03 Aug 2011 20:53:02 -0000 On Tue, Jul 5, 2011 at 5:42 PM, Austin Clements wrote: > Quoth Pieter Praet on Jul 05 at =A09:04 pm: >> On Mon, 04 Jul 2011 20:48:12 +0200, Pieter Praet wrot= e: >> > On Mon, 04 Jul 2011 13:56:26 -0400, Austin Clements = wrote: >> > > I should probably emit two lists per thread: one of matched IDs and >> > > one of unmatched IDs. Tagging a region can then operate on the >> > > concatenation of these, while * can operate only on the matched >> > > lists. This should be easy to do. I'll send an updated patch when I'= m >> > > back at a computer. >> > >> > The matched MsgIds will be sufficient, as we'll want to operate on >> > either the matched messages or the entire thread (for which the >> > `thread-id' property is already present). >> > >> > Can't think of a use case for non-matched messages right now, >> > but if required, we'll just use `set-exclusive-or'. >> >> Wasn't thinking clearly: >> >> You're right, we *will* be needing both a list of matched as well as one >> of unmatched Message-Id's per result. Otherwise there would still be a >> potential race condition when tagging with +/-. > > Yes, exactly. =A0(I had originally thought this race was a strict > superset of the '*' race; I now realize that's not the case, but > they're related enough that they might as well be addressed together.) > > Below is an updated patch that separates the matched and unmatched > ID's (it's ugly, but no point in cleaning it up since it's a > prototype). =A0Now the tag list on each search line is followed by > something that looks like > > =A0(id:x or id:y) or id:z > > or just > > =A0(id:x or id:y) > > where the parenthesized part of the query is for the matched messages > and the (possibly empty) unparenthesized part is for the non-matched > messages. =A0This is designed to be easy for emacs to parse: grab just > the parenthesized part for the queries used by '*' and the whole thing > for region tagging queries. =A0This should also eliminate corner cases > for pasting together multiple queries with "or". > > *snip* The patch I posted above includes message ID's in search results as a proxy for the match set (which can then be used in a tagging operation to tag exactly the results you saw). However, from an efficiency standpoint, it makes more sense to capture the match set directly as document ID's. I've had an implementation of this for a while, but finally got around to benchmarking the difference between tagging using message ID's versus using document ID's. It looks like tagging spends about 2/3rds of its time performing queries, and only about 1/3rd actually tagging, so tagging using document ID's is 3x-4x faster. The downside to using document ID's is that we need API's to expose them. My prototype exposes these as opaque "object ID"s, which acts a lot like message IDs, but has no intrinsic meaning outside of the library. This needs two library functions: one to retrieve a message's object ID and another to retrieve a message by object ID. 3x-4x isn't enough to make me jump on this added complexity, but it's enough to make me consider it. Carl, I'd love to hear your thoughts. The benchmark results are below. All results are for a corpus of 10K messages taken from the Enron email data set and in all cases the database is in the buffer cache. "P4" is my old Pentium 4 and "C2" is my newer Core 2 Duo. Message IDs Document IDs Diff HDD/ext3, P4 235.8s 76.3s 3.1x SSD/btrfs, P4 239.4s 69.0s 3.5x HDD/reiser, C2 72.4s 23.7s 3.1x I also ran with a patch to Xapian from ojwb that optimizes adding/removing terms that don't have position information [1], which reduces the baseline tagging cost. HDD/reiser, C2 71.6s 19.4s 3.7x [1] http://oligarchy.co.uk/xapian/patches/xapian-chert-optimise-adding-term= -without-positions.patch