From: Austin Clements
To: Carl Worth
Cc: notmuch@notmuchmail.org
Date: Mon, 21 Mar 2011 03:41:28 -0400
Subject: Re: Xapian locking errors with custom query parser
In-Reply-To: <8762rq8byr.fsf@yoom.home.cworth.org>
References: <87d3nhe3g9.fsf@steelpick.2x.cz>
 <87lj0m8ki5.fsf@yoom.home.cworth.org> <20110311024730.GA31011@mit.edu>
 <8762rq8byr.fsf@yoom.home.cworth.org>

I haven't made any changes to the query parser yet, but I wanted to
reply to your questions. (I might not get a chance to change things for
a while; I spent this weekend catching my breath and dealing with all
of the things I punted over the past few weeks and tomorrow it's back
to a different grind. Whoever said being a grad student was hitting the
snooze button on life was a liar.) Responses inline.

On Fri, Mar 11, 2011 at 12:26 AM, Carl Worth wrote:
> On Thu, 10 Mar 2011 21:47:30 -0500, Austin Clements wrote:
>> Yes, qparser-3 is ready for you, and has this fix folded in to it (see
>> id:20110202050336.GB28537@mit.edu).
>
> Thanks.
>
> I've finally had a chance to start looking at this.
>
> The first thing that caught my eye was this question:
>
>> +/* XXX notmuch currently registers "tag" as an exclusive boolean
>> + * prefix, which means queries like "tag:x tag:y" will return messages
>> + * with tag x OR tag y. Is this intentional? */
>
> This isn't "intentional" in the sense that it is desired, no.
>
> Our documentation for the search syntax says:
>
>    In addition to individual terms, multiple terms can be combined with
>    Boolean operators (and, or, not, etc.). Each term in the query will
>    be implicitly connected by a logical AND if no explicit operator is
>    provided, (except that terms with a common prefix will be implicitly
>    combined with OR until we get Xapian defect #402 fixed).
>
> So, when I originally wrote this code, the add_boolean_prefix function
> didn't have the "exclusive" parameter that it has now. So that's
> something to fix.

Okay. I suppose that this applies to all boolean prefixes, then, and
not just tag. That actually simplifies parse_prob, since it can treat
boolean prefixed terms just like other terms and won't have to match up
identical prefixes. I suppose a separate patch should first change the
boolean prefixes to non-exclusive to make sure the custom parser is a
drop-in replacement.

> The next thing I notice is quite a lot of concern in the testing for
> whether things were precisely Xapian compatible or not. I have two
> different opinions about this:
>
> 1. For "new" search features (ADJ, NEAR, etc.) I do not have a strong
>   interest in compatibility with Xapian.
>
>   I was very careful when I wrote the documentation for the notmuch
>   search syntax to only document features that I had used and tested,
>   and that I was sure I wanted. (I was already thinking forward to
>   perhaps writing a custom query parser at some point.)
>
>   So you should really use our existing documentation as the
>   guide. Please implement and test what it says.
>
>   Beyond that, if you want to add additional features not mentioned in
>   our documentation, then feel free to, and there's no good reason not
>   to be Xapian compatible. But I also don't think there's a strong
>   reason that we have to be compatible.
>
>   Of course, for any new features here I would also like to see the
>   documentation be updated.

I guess I didn't know what "etc" meant in the list of supported boolean
operators, so I took it to mean "whatever Xapian does". I leaned hard
on Xapian compatibility because I didn't want to break anybody's setup,
but I'm happy to strip out compatibility stuff (especially NEAR and
ADJ; those add a lot of complexity!).

Besides NEAR and ADJ, the only features I can think of that aren't in
the documentation but that I implemented are + and -. But I think a lot
of people use these and they're really handy, so perhaps they should be
documented instead of being stripped.

> 2. For term splitting I do have a strong interest in Xapian compatibility.
>
>   The difference here is that we aren't doing our own indexing, but
>   instead relying on Xapian to do that for us, and we have also never
>   carefully documented how the term splitting happens.
>
>   What I want to happen here is that if a user grabs a chunk of text
>   from an email, (say, "x#y"), and searches for it, that notmuch will
>   find emails that actually contain that text. So if the indexer and
>   the query parser disagree about something like this, then notmuch can
>   break badly.
>
>   I don't know how well notmuch currently meets that requirement, but
>   I've been trusting in consistent term-splitting in the indexer and
>   query-parser to help with this. So the frequent comments about
>   incompatibility along these lines in your patches make me nervous.
>
>   Can you enlighten me more about the compatibility differences in this
>   area, and how things might break here?

In some sense, the term splitting the custom query parser does is more
Xapian-compatible than the Xapian parser's term splitting. The custom
query parser uses the exact same TermGenerator that notmuch uses to
split documents in the first place, so the query term splitting will be
identical.
In fact, your x#y example *won't* do what you want with the Xapian
parser (it will be equivalent to x AND y), but it will with the custom
parser (it will be like "x y", which is as close as it can get since
Xapian doesn't index the #).

The real difference between the Xapian parser and the custom parser
lies in how they split *implicit phrases*. In the Xapian parser,
characters that split terms are treated in one of two different ways:
they can either just split a term, but keep the resulting terms
together in a phrase; or they can separate phrases. '#' does the
latter, which is why x#y becomes x AND y. In the custom parser, only
whitespace, '(', ')' and '"' separate phrases, and each phrase is then
split into terms using the TermGenerator. Hence, the terms you get are
Xapian-compatible, but the custom parser treats more things as phrases
(using much more understandable rules).

I hope that addresses your concern with the term splitting. If not,
please let me know.
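
To make the implicit-phrase difference concrete, here is a rough Python
model of the two splitting strategies. It does not call Xapian and is
not the code from the patch series. split_terms is a hypothetical
stand-in for the TermGenerator (it only extracts lowercase alphanumeric
runs), PHRASE_KEEPERS is an assumed, illustrative set of characters
that split a term but keep the phrase together, and quoted phrases and
boolean operators are not modeled at all.

```python
import re

# Custom-parser rule from the text: only whitespace, '(', ')' and '"'
# separate implicit phrases.
PHRASE_SEPARATORS = re.compile(r'[\s()"]+')

# ASSUMPTION: illustrative set of characters that split terms but keep
# them in one phrase under the Xapian-like rules. Not taken from the email.
PHRASE_KEEPERS = '.-/:\\@'


def split_terms(text):
    """Hypothetical stand-in for the indexer's term splitting."""
    return [t.lower() for t in re.findall(r'[A-Za-z0-9]+', text)]


def custom_parse(query):
    """Split into phrases on a few separators, then into terms."""
    phrases = []
    for chunk in PHRASE_SEPARATORS.split(query):
        terms = split_terms(chunk)
        if terms:
            phrases.append(terms)
    return phrases


def xapian_like_parse(query):
    """Any non-term character outside PHRASE_KEEPERS ends the phrase."""
    breaker = re.compile(r'[^A-Za-z0-9' + re.escape(PHRASE_KEEPERS) + r']+')
    phrases = []
    for chunk in query.split():
        for piece in breaker.split(chunk):
            terms = split_terms(piece)
            if terms:
                phrases.append(terms)
    return phrases


def render(phrases):
    """Show multi-term phrases in quotes, joined by an implicit AND."""
    return ' AND '.join(
        '"%s"' % ' '.join(p) if len(p) > 1 else p[0] for p in phrases
    )


if __name__ == '__main__':
    for q in ['x#y', 'foo.bar (baz)']:
        print(q, '| custom:', render(custom_parse(q)),
              '| xapian-like:', render(xapian_like_parse(q)))
```

In this model, x#y comes out as the phrase "x y" under the custom rules
and as x AND y under the Xapian-like rules, which is the difference
described above. Both models produce the same individual terms; only
the grouping into phrases differs.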