Return-Path: X-Original-To: notmuch@notmuchmail.org Delivered-To: notmuch@notmuchmail.org Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id D644341732C for ; Wed, 22 Dec 2010 22:00:46 -0800 (PST) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: 0 X-Spam-Level: X-Spam-Status: No, score=0 tagged_above=-999 required=5 tests=[none] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id eX9eLqJQTrK4 for ; Wed, 22 Dec 2010 22:00:41 -0800 (PST) Received: from dmz-mailsec-scanner-8.mit.edu (DMZ-MAILSEC-SCANNER-8.MIT.EDU [18.7.68.37]) by olra.theworths.org (Postfix) with ESMTP id 53BF2431FB6 for ; Wed, 22 Dec 2010 22:00:39 -0800 (PST) X-AuditID: 12074425-b7c98ae000000a04-9b-4d12e586e4d1 Received: from mailhub-auth-1.mit.edu ( [18.9.21.35]) by dmz-mailsec-scanner-8.mit.edu (Symantec Brightmail Gateway) with SMTP id D9.05.02564.685E21D4; Thu, 23 Dec 2010 01:00:38 -0500 (EST) Received: from outgoing.mit.edu (OUTGOING-AUTH.MIT.EDU [18.7.22.103]) by mailhub-auth-1.mit.edu (8.13.8/8.9.2) with ESMTP id oBN60bMt020117; Thu, 23 Dec 2010 01:00:37 -0500 Received: from awakening.csail.mit.edu (awakening.csail.mit.edu [18.26.4.91]) (authenticated bits=0) (User authenticated as amdragon@ATHENA.MIT.EDU) by outgoing.mit.edu (8.13.6/8.12.4) with ESMTP id oBN60Z1i004772 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NOT); Thu, 23 Dec 2010 01:00:37 -0500 (EST) Received: from amthrax by awakening.csail.mit.edu with local (Exim 4.72) (envelope-from ) id 1PVeE7-00024C-MC; Thu, 23 Dec 2010 01:00:35 -0500 From: Austin Clements To: notmuch@notmuchmail.org Subject: [RFC PATCH 0/2] Custom query parser Date: Thu, 23 Dec 2010 01:00:22 -0500 Message-Id: <1293084024-7893-1-git-send-email-amdragon@mit.edu> X-Mailer: git-send-email 1.7.2.3 X-Brightmail-Tracker: AAAAAA== Cc: amdragon@mit.edu X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 23 Dec 2010 06:00:48 -0000 Following is a patch series that implements a custom query parser. This code is by no means ready for serious review or inclusion in the tree, but I wanted to spark some discussion and get feedback while it's still molten. The query parser is basically compatible with Xapian's, but is designed to be flexible enough to support (at least) date searches, saved queries, folder searches (without a schema change), "from:me", and implicit filters (like defaulting to "-tag:spam"). Such features are implemented as transformations over an intermediate representation, keeping the parser itself as simple as possible. The code already uses transformers for the usual prefixes and the type:mail filter, and I have most of the implementation of folder searches in another branch. If you'd like to play with it, it's on the qparser branch at http://awakening.csail.mit.edu/git/notmuch.git. It's not complete, but it's close (notably, stemming is missing). The grammar is approximately compatible with Xapian's query parser. To my knowledge, it differs in the following ways: 1) The way this parser separates terms is much simpler and, I believe, more predictable. In Xapian, many things can separate terms, and exactly what is context-dependent. For example, "x/y" is parsed as two terms in a phrase, "x#y" is parsed as two separate terms "x" and "y", and "x# y" is parsed as two separate terms "x#" and "y". This parser instead splits the query into "term phrases", which are separated only by whitespace, quote, left paren, or right paren. These term phrases are then broken into terms using the same Xapian::TermGenerator that tokenizes documents. If this results in multiple terms, they're tied into a phrase in the final query. Thus, "x/y" and "x# y" are handled as in Xapian, but "x#y" is parsed as a phrase. 2) This parser handles errors differently. Because queries are roughly natural language, I feel they should never fail to parse (has Google ever told you that your query contains a syntax error?), so this parser tries to do reasonable things in all cases. Xapian's parser has a strange approach. For syntax errors involving boolean operators (for example "x AND"), it returns a parse error. For other syntax errors (for example, "x) OR y"), Xapian will *retry* the parse from scratch with no parser flags set (no operators, parens, or quotes). 3) The handling of NEAR and ADJ is quite different and possibly wrong. I didn't realize how subtle these operators were until I'd already implemented something completely different from Xapian's logic. The code is much simpler than Xapian's query parser because it elides several features that notmuch doesn't use (such as multi-term synonyms), it reuses Xapian's TermGenerator to parse terms (instead of copy-pasting and tweaking the code), and because I placed the boundary between the lexer and the parser differently. This parser is under 1000 lines, while Xapain's is 2000 lines plus a 5000 line parser generator.