DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
	h=mime-version:sender:in-reply-to:references:date
	:x-google-sender-auth:message-id:subject:from:to:cc:content-type
	:content-transfer-encoding;
	b=wiUBF8qNYDpnV6NoZu+bMzmkd6H4jVaUairzaUlnm1w5/hO+ZMX3N1bRh6eBV4ekFY
	zLCVfk5/JYnl+HOCqFeBvViEe/kEMAe061kPIGmcKbNHvscGa/ur6H3y0I2rE1SbI4Nx
	yzhRKRjXik4R5yFs3WtQAcCwUXCGu/P4f5Ij4=
MIME-Version: 1.0
Sender: amdragon@gmail.com
In-Reply-To: <8762rq8byr.fsf@yoom.home.cworth.org>
References: <87d3nhe3g9.fsf@steelpick.2x.cz>
	<AANLkTinW_n+zMtLC-fy=naUGsAiFDwdd-mAqSWEDvF=W@mail.gmail.com>
	<AANLkTinPph9Lj8h3UztQ74qMaaBVKkXB0rbiLeTX2GmW@mail.gmail.com>
	<87lj0m8ki5.fsf@yoom.home.cworth.org>
	<20110311024730.GA31011@mit.edu>
	<8762rq8byr.fsf@yoom.home.cworth.org>
Date: Mon, 21 Mar 2011 03:41:28 -0400
Message-ID: <AANLkTin_g8Y9SDN7Fm8ZejTpmuKcwDq5PX1yB+j9xpEV@mail.gmail.com>
Subject: Re: Xapian locking errors with custom query parser
From: Austin Clements <amdragon@mit.edu>
To: Carl Worth <cworth@cworth.org>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Cc: notmuch@notmuchmail.org
Precedence: list

I haven't made any changes to the query parser yet, but I wanted to
reply to your questions.  (I might not get a chance to change things
for a while; I spent this weekend catching my breath and dealing with
all of the things I punted over the past few weeks and tomorrow it's
back to a different grind.  Whoever said being a grad student was
hitting the snooze button on life was a liar.)

Responses inline.

On Fri, Mar 11, 2011 at 12:26 AM, Carl Worth <cworth@cworth.org> wrote:
> On Thu, 10 Mar 2011 21:47:30 -0500, Austin Clements <amdragon@MIT.EDU> wrote:
>> Yes, qparser-3 is ready for you, and has this fix folded in to it (see
>> id:20110202050336.GB28537@mit.edu).
>
> Thanks.
>
> I've finally had a chance to start looking at this.
>
> The first thing that caught my eye was this question:
>
>> +/* XXX notmuch currently registers "tag" as an exclusive boolean
>> + * prefix, which means queries like "tag:x tag:y" will return messages
>> + * with tag x OR tag y.  Is this intentional? */
>
> This isn't "intentional" in the sense that it is desired, no.
>
> Our documentation for the search syntax says:
>
>    In addition to individual terms, multiple terms can be combined with
>    Boolean operators (and, or, not, etc.). Each term in the query will
>    be implicitly connected by a logical AND if no explicit operator is
>    provided, (except that terms with a common prefix will be implicitly
>    combined with OR until we get Xapian defect #402 fixed).
>
> So, when I originally wrote this code, the add_boolean_prefix function
> didn't have the "exclusive" parameter that it has now. So that's
> something to fix.

Okay.  I suppose that this applies to all boolean prefixes, then, and
not just tag.  That actually simplifies parse_prob, since it can treat
boolean prefixed terms just like other terms and won't have to match
up identical prefixes.

I suppose a separate patch should first change the boolean prefixes to
non-exclusive to make sure the custom parser is a drop-in replacement.
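
To make the difference concrete, here is a minimal Python sketch of how
a parser might combine boolean filters depending on whether their prefix
is registered as exclusive.  This is not notmuch or Xapian code: the
function combine_filters and the tuple-based query tree are hypothetical
and only model the behavior described in the XXX comment above (filters
sharing an exclusive prefix are ORed, everything else is ANDed).

```python
from collections import OrderedDict


def combine_filters(filters, exclusive_prefixes):
    """Combine (prefix, value) filters into a nested-tuple query tree.

    Hypothetical model only: filters that share an exclusive prefix are
    grouped under OR; all remaining clauses are joined with AND.
    """
    # Group filters by prefix, preserving the order of first appearance.
    groups = OrderedDict()
    for prefix, value in filters:
        groups.setdefault(prefix, []).append((prefix, value))

    clauses = []
    for prefix, terms in groups.items():
        if prefix in exclusive_prefixes and len(terms) > 1:
            # Exclusive prefix: a document is assumed to carry at most
            # one such term, so several filters only make sense as OR.
            clauses.append(("OR",) + tuple(terms))
        else:
            # Non-exclusive prefix: each filter is an ordinary AND clause.
            clauses.extend(terms)

    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return ("AND",) + tuple(clauses)


query = [("tag", "x"), ("tag", "y")]
print(combine_filters(query, exclusive_prefixes={"tag"}))
print(combine_filters(query, exclusive_prefixes=set()))
```

In the non-exclusive case no grouping by prefix is needed at all, which
is the simplification to parse_prob mentioned above.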

> The next thing I notice is quite a lot of concern in the testing for
> whether things were precisely Xapian compatible or not. I have two
> different opinions about this:
>
> 1. For "new" search features (ADJ,NEAR,etc.) I do not have a strong
>   interest in compatibility with Xapian.
>
>   I was very careful when I wrote the documentation for the notmuch
>   search syntax to only document features that I had used and tested,
>   and that I was sure I wanted. (I was already thinking forward to
>   perhaps writing a custom query parser at some point.)
>
>   So you should really use our existing documentation as the
>   guide. Please implement and test what it says.
>
>   Beyond that, if you want to add additional features not mentioned in
>   our documentation, then feel free to, and there's no good reason not
>   to be Xapian compatible. But I also don't think there's a strong
>   reason that we have to be compatible.
>
>   Of course, for any new features here I would also like to see the
>   documentation be updated.

I guess I didn't know what "etc" meant in the list of supported
boolean operators, so I took it to mean "whatever Xapian does".  I
leaned hard on Xapian compatibility because I didn't want to break
anybody's setup, but I'm happy to strip out compatibility stuff
(especially NEAR and ADJ; those add a lot of complexity!)

Besides NEAR and ADJ, the only features I can think of that aren't in
the documentation but that I implemented are + and -.  But I think a
lot of people use these and they're really handy, so perhaps they
should be documented instead of being stripped.

> 2. For term splitting I do have a strong interest in Xapian compatibility.
>
>   The difference here is that we aren't doing our own indexing, but
>   instead relying on Xapian to do that for us, and we have also never
>   carefully documented how the term splitting happens.
>
>   What I want to happen here is that if a user grabs a chunk of text
>   from an email, (say, "x#y"), and searches for it, that notmuch will
>   find emails that actually contain that text. So if the indexer and
>   the query parser disagree about something like this, then notmuch can
>   break badly.
>
>   I don't know how well notmuch currently meets that requirement, but
>   I've been trusting in consistent term-splitting in the indexer and
>   query-parser to help with this. So the frequent comments about
>   incompatibility along these lines in your patches make me nervous.
>
>   Can you enlighten me more about the compatibility differences in this
>   area, and how things might break here?

In some sense, the term splitting the custom query parser does is more
Xapian-compatible than the Xapian parser's term splitting.  The custom
query parser uses the exact same TermGenerator that notmuch uses to
split documents in the first place, so the query term splitting will
be identical.  In fact, your x#y example *won't* do what you want with
the Xapian parser (it will be equivalent to x AND y), but it will with
the custom parser (it will be like "x y", which is as close as it can
get since Xapian doesn't index the #).

The real difference between the Xapian parser and the custom parser
lies in how they split *implicit phrases*.  In the Xapian parser,
characters that split terms are treated in one of two different ways:
they can either just split a term, but keep the resulting terms
together in a phrase; or they separate phrases.  '#' does the latter,
which is why x#y becomes x AND y.  In the custom parser, only
whitespace, '(', ')' and '"' separate phrases, and each phrase is then
split into terms using the TermGenerator.  Hence, the terms you get
are Xapian-compatible, but the custom parser treats more things as
phrases (using much more understandable rules).
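
As a rough illustration of the two-stage splitting described above, here
is a small Python sketch.  It is not the actual parser: split_terms is a
crude stand-in for Xapian's TermGenerator (it just lowercases runs of
ASCII letters and digits), split_phrases is a hypothetical name, and
quotes are treated only as separators rather than as phrase delimiters.
The separators argument is there so the '#' behavior attributed to the
Xapian parser can be imitated for comparison.

```python
import re

# Characters that separate phrases in the custom parser, per the text:
# whitespace, '(', ')' and '"'.
PHRASE_SEPARATORS = ' \t\n()"'


def split_terms(text):
    """Stand-in for the TermGenerator: lowercase alphanumeric runs."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def split_phrases(query, separators=PHRASE_SEPARATORS):
    """Split a query into phrases, then split each phrase into terms.

    Returns a list of phrases, each phrase being a list of terms.
    """
    pattern = "[" + re.escape(separators) + "]+"
    phrases = []
    for chunk in re.split(pattern, query):
        terms = split_terms(chunk)
        if terms:  # skip empty chunks and chunks with no indexable text
            phrases.append(terms)
    return phrases


# Custom-parser style: '#' splits terms but keeps them in one phrase.
print(split_phrases("x#y"))
# Imitating a parser where '#' separates phrases (two one-term phrases,
# which would then be combined like x AND y).
print(split_phrases("x#y", separators=PHRASE_SEPARATORS + "#"))
```

The point of the sketch is only that the terms come out the same either
way; what changes is whether they stay together as a phrase.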

I hope that addresses your concern with the term splitting.  If not,
please let me know.