Return-Path: X-Original-To: notmuch@notmuchmail.org Delivered-To: notmuch@notmuchmail.org Received: from localhost (localhost [127.0.0.1]) by olra.theworths.org (Postfix) with ESMTP id 9D5AD431FAF for ; Tue, 25 Dec 2012 17:23:07 -0800 (PST) X-Virus-Scanned: Debian amavisd-new at olra.theworths.org X-Spam-Flag: NO X-Spam-Score: -0.7 X-Spam-Level: X-Spam-Status: No, score=-0.7 tagged_above=-999 required=5 tests=[RCVD_IN_DNSWL_LOW=-0.7] autolearn=disabled Received: from olra.theworths.org ([127.0.0.1]) by localhost (olra.theworths.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id m+ueMU2URmHt for ; Tue, 25 Dec 2012 17:23:07 -0800 (PST) Received: from dmz-mailsec-scanner-3.mit.edu (DMZ-MAILSEC-SCANNER-3.MIT.EDU [18.9.25.14]) by olra.theworths.org (Postfix) with ESMTP id DC466431FAE for ; Tue, 25 Dec 2012 17:23:06 -0800 (PST) X-AuditID: 1209190e-b7fa16d000001402-81-50da51791370 Received: from mailhub-auth-3.mit.edu ( [18.9.21.43]) by dmz-mailsec-scanner-3.mit.edu (Symantec Messaging Gateway) with SMTP id 5E.0D.05122.9715AD05; Tue, 25 Dec 2012 20:23:05 -0500 (EST) Received: from outgoing.mit.edu (OUTGOING-AUTH.MIT.EDU [18.7.22.103]) by mailhub-auth-3.mit.edu (8.13.8/8.9.2) with ESMTP id qBQ1N43n016953; Tue, 25 Dec 2012 20:23:04 -0500 Received: from awakening.csail.mit.edu (awakening.csail.mit.edu [18.26.4.91]) (authenticated bits=0) (User authenticated as amdragon@ATHENA.MIT.EDU) by outgoing.mit.edu (8.13.6/8.12.4) with ESMTP id qBQ1N1HD002404 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NOT); Tue, 25 Dec 2012 20:23:02 -0500 (EST) Received: from amthrax by awakening.csail.mit.edu with local (Exim 4.80) (envelope-from ) id 1Tnfi0-0008Ut-Qk; Tue, 25 Dec 2012 20:23:00 -0500 Date: Tue, 25 Dec 2012 20:23:00 -0500 From: Austin Clements To: David Bremner Subject: Re: [PATCH 2/5] util: Function to parse boolean term queries Message-ID: <20121226012300.GW6187@mit.edu> References: <1356415076-5692-1-git-send-email-amdragon@mit.edu> <1356415076-5692-3-git-send-email-amdragon@mit.edu> <87obhidxkt.fsf@zancas.localnet> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87obhidxkt.fsf@zancas.localnet> User-Agent: Mutt/1.5.21 (2010-09-15) X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFprOKsWRmVeSWpSXmKPExsUixCmqrVsZeCvAYMNMFYsbrd2MFk3TnS1W z+WxuH5zJrMDi8fOWXfZPW7df83u8WzVLWaPLYfeMwewRHHZpKTmZJalFunbJXBldHxbwFbQ KVzx/vZa1gbGDr4uRk4OCQETibZfTUwQtpjEhXvr2boYuTiEBPYxSnzpnMgO4WxglLi1qAPK ucgk0bPpOzOEs4RRYtKdh2D9LAKqEjdnvWcEsdkENCS27V8OZosAxa9um8wGYjMLuEusn3gG rF5YwFXifmMvWA2vgLbE5+tTGSGGzmCUePRxFTtEQlDi5MwnLBDNWhI3/r0EauYAsqUllv/j AAlzCuhKNP4+BzZTVEBFYsrJbWwTGIVmIemehaR7FkL3AkbmVYyyKblVurmJmTnFqcm6xcmJ eXmpRbrGermZJXqpKaWbGMHhL8m3g/HrQaVDjAIcjEo8vBu/3wwQYk0sK67MPcQoycGkJMp7 3v9WgBBfUn5KZUZicUZ8UWlOavEhRgkOZiURXuePQOW8KYmVValF+TApaQ4WJXHeKyk3/YUE 0hNLUrNTUwtSi2CyMhwcShK8cwOAhgoWpaanVqRl5pQgpJk4OEGG8wANnwpSw1tckJhbnJkO kT/FqCglzpsLkhAASWSU5sH1wtLTK0ZxoFeEeZeBVPEAUxtc9yugwUxAg2P5boAMLklESEk1 ME73DVxwvsr5by/vJsFPWXd/yM+ePuHeuYKlrpPnqZTZSd9dZKuRz5JzMe/5JtEQv+X8eZzb lCYb6+4yrtjJpp9wrpXhjWzMxbK38VtX3zrCkvtisl3prO8TXqfpBN3eKlpt6lFXOpNr6tag d/H+a39N/N9wfQHLUXOjC2d9a+3cbQ6pWM0WdVNiKc5INNRiLipOBAAJELYQKgMAAA== Cc: notmuch@notmuchmail.org X-BeenThere: notmuch@notmuchmail.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: "Use and development of the notmuch mail system." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Dec 2012 01:23:07 -0000 Quoth David Bremner on Dec 25 at 10:18 am: > Austin Clements writes: > > > + if (consume_double_quote (&pos)) { > > + char *out = talloc_strdup (ctx, pos); > > + pos = *term_out = out; > > + while (1) { > > Overall the control flow here is a bit tricky to follow. I'm not sure if > a real loop condition would help or make it worse. > > > + if (! *pos) { > > + /* Premature end of string */ > > + goto FAIL; > > + } else if (*pos == '"') { > > + if (*++pos != '"') > > + break; > > + } else if (consume_double_quote (&pos)) { > > + break; > > + } > > I'm confused by the asymmetry here. Quoted strings can start with > unicode quotes, but internally can only have ascii '"'? Is this > documented somewhere? Only in the source, to my knowledge. Here's how Xapian parses these things (where 'it' is a UTF8 string iterator): if (it != end && is_double_quote(*it)) { // Quoted boolean term (can contain any character). ++it; while (it != end) { if (*it == '"') { // Interpret "" as an escaped ". if (++it == end || *it != '"') break; } else if (is_double_quote(*it)) { ++it; break; } Unicode::append_utf8(name, *it++); } } else { // Can't boolean filter prefix a subexpression, so // just use anything following the prefix until the // next space or ')' as part of the boolean filter // term. while (it != end && *it > ' ' && *it != ')') Unicode::append_utf8(name, *it++); } > > + } else { > > + while (*pos > ' ' && *pos != ')') > > + ++pos; > > + if (*pos) > > + goto FAIL; > > + } > > So if there is no quote, we skip the part after the ':'? I guess I > probably missed something because that doesn't sound like the intended > behaviour. This isn't skipping it; it's checking its well-formedness. In this case, *term_out already points to a correct string that can be used literally; we just have to check that there's no trailing garbage after the boolean query. This is certainly worth commenting. For the record, I also tried passing the query straight to the library, without parsing it in the CLI (and simply checking that the query returned exactly one result), and it was noticeably slower (the restore performance test took between 3.82 and 5.25 seconds for the code in this series and ~7.2 seconds using a general query.)