MIME-Version: 1.0
Sender: amdragon@gmail.com
In-Reply-To: <20110705214234.GA15360@mit.edu>
References: <20110703171743.GL15901@mit.edu>
	<1309762318-4530-1-git-send-email-pieter@praet.org>
	<e8c5fbf4-4dfa-461a-8f5c-6c696291a270@email.android.com>
	<87sjqldgr7.fsf@praet.org> <87iprg7dm0.fsf@praet.org>
	<20110705214234.GA15360@mit.edu>
Date: Wed, 3 Aug 2011 16:47:32 -0400
Message-ID:
 <CAH-f9WsPj=1Eu=g3sOePJgCTBFs6HrLdLq18xMEnJ8aZ00yCEg@mail.gmail.com>
Subject: Re: [PROTO] possible solution for "Race condition for '*' command"
From: Austin Clements <amdragon@mit.edu>
To: Pieter Praet <pieter@praet.org>, Carl Worth <cworth@cworth.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: Olly Betts <olly@survex.com>, Notmuch Mail <notmuch@notmuchmail.org>
Precedence: list

On Tue, Jul 5, 2011 at 5:42 PM, Austin Clements <amdragon@mit.edu> wrote:
> Quoth Pieter Praet on Jul 05 at =A09:04 pm:
>> On Mon, 04 Jul 2011 20:48:12 +0200, Pieter Praet <pieter@praet.org> wrot=
e:
>> > On Mon, 04 Jul 2011 13:56:26 -0400, Austin Clements <amdragon@MIT.EDU>=
 wrote:
>> > > I should probably emit two lists per thread: one of matched IDs and
>> > > one of unmatched IDs. Tagging a region can then operate on the
>> > > concatenation of these, while * can operate only on the matched
>> > > lists. This should be easy to do. I'll send an updated patch when I'=
m
>> > > back at a computer.
>> >
>> > The matched MsgIds will be sufficient, as we'll want to operate on
>> > either the matched messages or the entire thread (for which the
>> > `thread-id' property is already present).
>> >
>> > Can't think of a use case for non-matched messages right now,
>> > but if required, we'll just use `set-exclusive-or'.
>>
>> Wasn't thinking clearly:
>>
>> You're right, we *will* be needing both a list of matched as well as one
>> of unmatched Message-Id's per result. Otherwise there would still be a
>> potential race condition when tagging with +/-.
>
> Yes, exactly. =A0(I had originally thought this race was a strict
> superset of the '*' race; I now realize that's not the case, but
> they're related enough that they might as well be addressed together.)
>
> Below is an updated patch that separates the matched and unmatched
> ID's (it's ugly, but no point in cleaning it up since it's a
> prototype). =A0Now the tag list on each search line is followed by
> something that looks like
>
> =A0(id:x or id:y) or id:z
>
> or just
>
> =A0(id:x or id:y)
>
> where the parenthesized part of the query is for the matched messages
> and the (possibly empty) unparenthesized part is for the non-matched
> messages. =A0This is designed to be easy for emacs to parse: grab just
> the parenthesized part for the queries used by '*' and the whole thing
> for region tagging queries. =A0This should also eliminate corner cases
> for pasting together multiple queries with "or".
>
> *snip*

The patch I posted above includes message ID's in search results as a
proxy for the match set (which can then be used in a tagging operation
to tag exactly the results you saw).  However, from an efficiency
standpoint, it makes more sense to capture the match set directly as
document ID's.

I've had an implementation of this for a while, but finally got around
to benchmarking the difference between tagging using message ID's
versus using document ID's.  It looks like tagging spends about 2/3rds
of its time performing queries, and only about 1/3rd actually tagging,
so tagging using document ID's is 3x-4x faster.

The downside to using document ID's is that we need API's to expose
them.  My prototype exposes these as opaque "object ID"s, which acts a
lot like message IDs, but has no intrinsic meaning outside of the
library.  This needs two library functions: one to retrieve a
message's object ID and another to retrieve a message by object ID.

3x-4x isn't enough to make me jump on this added complexity, but it's
enough to make me consider it.  Carl, I'd love to hear your thoughts.

The benchmark results are below.  All results are for a corpus of 10K
messages taken from the Enron email data set and in all cases the
database is in the buffer cache.  "P4" is my old Pentium 4 and "C2" is
my newer Core 2 Duo.

                Message IDs   Document IDs   Diff
HDD/ext3,   P4     235.8s         76.3s      3.1x
SSD/btrfs,  P4     239.4s         69.0s      3.5x
HDD/reiser, C2      72.4s         23.7s      3.1x

I also ran with a patch to Xapian from ojwb that optimizes
adding/removing terms that don't have position information [1], which
reduces the baseline tagging cost.

HDD/reiser, C2      71.6s         19.4s      3.7x

[1] http://oligarchy.co.uk/xapian/patches/xapian-chert-optimise-adding-term=
-without-positions.patch